Anthropic's Secret Trick for Measuring Claude
By Abram Brown
Mar 1, 2025, 7:00am PST
Welcome, Weekenders!
On a bit of a whim last year, an Anthropic staffer, David Hershey, sought out a different method for tracking the development of the startup's Claude chatbot. (After all, even the nerdiest AI nerds get tired of staring at the same old benchmark material.) He wanted something that would let him see how Claude performed on a long-term solo project, so he decided to have it play the original Pokémon game released by Nintendo in 1996.
When Hershey set up the older 3.0 Sonnet version of Claude to play, the bot couldn't even get the game going. The next iteration of Claude, 3.5 Sonnet, did a little better. "There were glimmers of hope," Hershey said. And now 3.7 Sonnet, the hybrid reasoning model version of Claude released this week, which can think more about what it's doing, can go much further into the game. Discussion of Hershey's impromptu Pokémon tests with Claude became a viral pastime within Anthropic, and to let more people in on the fun, the startup has decided to set up a public Twitch stream of 3.7 Sonnet playing Pokémon. (When I last checked, Claude was at Mount Moon preparing to square off against a Team Rocket henchman just as a wild Level 10 Paras appeared, forcing Claude to fend it off with its Level 14 Spearow.)
"This is just a much more visceral way of seeing general improvement on intelligence," Hershey said. Monitoring Claude's Pokémon games has allowed Hershey and other researchers at Anthropic to hone their thinking around the development of the startup's agentic technology, a buzzy artificial intelligence subcategory focused on developing AI that can complete tasks by itself. Sure, Anthropic could—and does—keep measuring Claude with traditional tests, but they're a matter of routine. The researchers hope more unusual assignments, like having the bot play Pokémon, can spark insights about improving the model that might not have dawned on them from the standard assessments alone.
"In the last few years, benchmarks and evaluations don't really tell the full story of the quality of these models just like you don't know how smart someone is just by giving them a SAT test," said Dianne Na Penn, who leads the research efforts of Anthropic's product team. "Knowing how well a model can do on a goal-oriented agentic task is not something you can know from multiple-choice questions."
Hershey suspects 3.7 Sonnet still won't be able to finish the game. On the Twitch stream, it had spent at least a day in labyrinthine Mount Moon. (I feel for Claude: Exactly the same fate befell me when I first played the game at age 8.) And Hershey thinks even if the model makes it out of Mount Moon, it will really struggle to get through Fuchsia City's Safari Zone, which it must traverse in a set number of steps. "If we've learned anything from Mount Moon, we know that optimal pathing is not what we're good at yet," Hershey acknowledged.
Would Claude's ability to finish the game be a sign that artificial general intelligence is finally here? "We'd be a lot closer," Na Penn said with a laugh. "Yeah, it would definitely make me question some things," Hershey said.