Are Enterprises Actually Using Reasoning Models?
**Are Enterprises Actually Using Reasoning Models?**
By Stephanie Palazzolo
Mar 13, 2025, 8:34am PDT
The excitement around reasoning models like OpenAI's 01 and DeepSeek's R1 got me thinking: How much are businesses actually using them? The answer might be: not as much as you'd think.
When I ask business executives at startups and large firms about their companies' usage of reasoning models to power their products, a common refrain I've heard is that these models, which spend extra time to process (or "think") about the problem at hand, simply are too slow or costly, especially with customers who expect an answer right away.
For instance, customers of Braintrust, which helps companies like Instacart evaluate artificial intelligence models, only use reasoning models less than 15% of the time. To be fair, many of Braintrust's customers are young startups that tend to be cost-sensitive. But another reason for the low uptake is how difficult it is for customers to predict how many words (or tokens) a reasoning model will produce as it thinks through a question, said Waseem Alshikh, CTO at Writer, which develops AI tools for enterprises to build chatbots and analyze documents. And that affects the cost of using the models.
Answers requiring several thousand tokens—like a complicated data analysis question—can be expensive and overkill for the typical business, he said. This also includes the hidden "thinking" tokens that users typically don't see, in addition to the final response itself. (An answer that requires 50,000 words to reason through would cost $4 with 01, for instance, based on publicly available pricing from OpenAI.)
Reasoning models can be useful for high-value research and development of fusion or other scientific projects, as we covered here. And some companies might be more willing to pay a few extra bucks for a reasoning model that can solve complicated coding problems versus shelling out $200,000 to hire a new software engineer.
Reasoning is also great at handling text or other information that can be interpreted in different ways. Take Robin AI, a startup that uses AI to help legal teams review contracts. In some cases, lawyers must deal with seemingly contradicting information, like overlapping laws, or certain terms in contracts that use vague language, said Tramale Turner, Robin AI's chief technology officer. In these cases, reasoning models can be better at thinking through these complex problems, he said. However, non-reasoning models are often good enough for complex problems too, he added.
The most excited customers of reasoning models tend to be individual consumers who use them to solve math or science problems like creating physics simulations or solving game theory puzzles. But as OpenAI makes reasoning models cheaper and faster, businesses might take a second look. A few weeks after launching 03-mini, for instance, OpenAI COO Brad Lightcap said developer usage of reasoning models through the company's application programming interface had increased five times. And earlier this week, OpenAI released tools to help developers build agents—AI that can take multistep actions. That could drive more usage of reasoning models, which developers say are good at coming up with the sequence of steps agents should take to complete a task.
**Here's what else is going on...**
**Google's Gemma 3 Pushes Frontier of Efficiency**
Yesterday, Google announced Gemma 3, the latest update to its family of open-source foundation models. We've written before about Google's strategy to promote their models as the best “bang for your buck”—and it seems like this latest release is only further advancing that strategy.
In a blog post, Google pointed out that the largest version of Gemma 3—at 27 billion parameters—beats out DeepSeek-V3, OpenAI's 03-mini, Meta Platforms' Llama3-405B and Mistral Large on the Chatbot Arena (a popular model evaluation service), despite being a fraction of the size. (This chart is a striking depiction of its size-to-performance ratio.) As we've pointed out in the past, though, the "best" or cheapest tech doesn't always win, so we'll have to see whether developers flock to this new AI release.
**Arxiv A-List**
Businesses that adopt reasoning agents can catch them disobeying their users' intentions by reading the agents' chains of thought, or the reasoning steps that models use to make decisions. That's what OpenAI researchers found in a paper released Monday, "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation."
For example, when the researchers asked an AI agent to write some challenging code, the agent thought to itself "let's hack" and found a loophole instead. Luckily, the researchers caught this hacking example and several like it by using OpenAI's GPT-40 model to automatically monitor the agent's thinking. But when they tried to train the model to not have deceptive thoughts like these, the chains of thought stopped mentioning finding loopholes, but the cheating continued!
Effectively the model learned to conceal its ill intentions from the researchers. (That's the risk of "promoting obfuscation" in the title.) So the authors conclude with a plea for other AI companies: "chain-of-thought monitoring may be one of the few effective methods we have for supervising superhuman models. At this stage, we strongly recommend that AI developers training frontier reasoning models refrain from applying strong supervision directly to CoTs," they write. Instead, developers should keep a watchful eye on models' chains of thought once customers are using them.