Anthropic looks to fund a new, more comprehensive generation of AI benchmarks

Kyle Wiggers

Updated July 1, 2024 at 9:42 PM

Anthropic is launching a program to fund the development of new types of benchmarks capable of evaluating the performance and impact of AI models, including generative models like its own Claude.

Unveiled on Monday, Anthropic's program will dole out payments to third-party organizations that can, as the company puts it in a blog post, "effectively measure advanced capabilities in AI models." Those interested can submit applications to be evaluated on a rolling basis.

"Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem," Anthropic wrote on its official blog. "Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply."

As we've highlighted before, AI has a benchmarking problem. The most commonly cited benchmarks for AI today do a poor job of capturing how the average person actually uses the systems being tested. There are also questions as to whether some benchmarks, particularly those released before the dawn of modern generative AI, even measure what they purport to measure, given their age.

The very-high-level, harder-than-it-sounds solution Anthropic is proposing is creating challenging benchmarks with a focus on AI security and societal implications via new tools, infrastructure and methods.

The company calls specifically for tests that assess a model's ability to carry out cyberattacks, "enhance" weapons of mass destruction (e.g. nuclear weapons) and manipulate or deceive people (e.g. through deepfakes or misinformation). For AI risks pertaining to national security and defense, Anthropic says it's committed to developing an "early warning system" of sorts for identifying and assessing risks, although it doesn't reveal in the blog post what such a system might entail.

Anthropic also says it intends its new program to support research into benchmarks and "end-to-end" tasks that probe AI's potential for aiding in scientific study, conversing in multiple languages and mitigating ingrained biases, as well as self-censoring toxicity.

To achieve all this, Anthropic envisions new platforms that allow subject-matter experts to develop their own evaluations and large-scale trials of models involving "thousands" of users. The company says it's hired a full-time coordinator for the program and that it might purchase or expand projects it believes have the potential to scale.

"We offer a range of funding options tailored to the needs and stage of each project," Anthropic writes in the post, though an Anthropic spokesperson declined to provide any further details about those options. "Teams will have the opportunity to interact directly with Anthropic's domain experts from the frontier red team, fine-tuning, trust and safety and other relevant teams."

Anthropic's effort to support new AI benchmarks is a laudable one — assuming, of course, there's sufficient cash and manpower behind it. But given the company's commercial ambitions in the AI race, it might be a tough one to completely trust.

In the blog post, Anthropic is rather transparent about the fact that it wants certain evaluations it funds to align with the AI safety classifications it developed (with some input from third parties like the nonprofit AI research org METR). That's well within the company's prerogative. But it may also force applicants to the program into accepting definitions of "safe" or "risky" AI that they might not agree with.

A portion of the AI community is also likely to take issue with Anthropic's references to "catastrophic" and "deceptive" AI risks, like nuclear weapons risks. Many experts say there's little evidence to suggest AI as we know it will gain world-ending, human-outsmarting capabilities anytime soon, if ever. Claims of imminent "superintelligence" serve only to draw attention away from the pressing AI regulatory issues of the day, like AI's hallucinatory tendencies, these experts add.

In its post, Anthropic writes that it hopes its program will serve as "a catalyst for progress towards a future where comprehensive AI evaluation is an industry standard." That's a mission the many open, corporate-unaffiliated efforts to create better AI benchmarks can identify with. But it remains to be seen whether those efforts are willing to join forces with an AI vendor whose loyalty ultimately lies with shareholders.

https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little

