Claude just beat GPT-5, Gemini, and Grok in real-world job tasks, according to OpenAI’s own study

OpenAI has released GDPval, a new evaluation system to test how AI performs at work-related tasks
Claude Opus 4.1 comes out in the lead, with ‘ChatGPT-5 high’ in second place
Tasks include things like emailing a response to a dissatisfied customer

We’re all familiar with AI benchmarks, which measure performance at certain tasks, but often these tasks don’t reflect the real world and how people actually use AI, especially at work.

To combat this problem, OpenAI, the maker of ChatGPT, is introducing GDPval, a new way of measuring AI model performance that compares a model's output on real-world work tasks with that of a human expert across 44 occupations, from software developers and lawyers to registered nurses and mechanical engineers.

Surprisingly, the OpenAI study shows that the best performing model was Anthropic’s Claude Opus 4.1, which outpaced not only OpenAI’s GPT-5 but also Gemini and Grok.

GDPval win rate

This graph shows the overall GDPval win rate (how often the AI did better than an industry expert). Claude Opus 4.1 is out in the lead with a win rate of 47.6%, with ‘ChatGPT-5 high’ coming second at 38.8% and ‘ChatGPT o3 high’ third at 34.1%. ChatGPT-4o scores the lowest, with a win rate of 12.4%, which is significantly behind both Grok 4 and Gemini 2.5 Pro.
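To make the metric concrete, here is a minimal sketch of how a win rate like this could be computed from a set of graded head-to-head comparisons. The `win_rate` function, the "win"/"loss"/"tie" labels, and the sample grades are all hypothetical and are not taken from OpenAI's paper. The sketch counts only outright wins, which is an assumption about how ties are treated.

```python
from typing import Iterable


def win_rate(outcomes: Iterable[str]) -> float:
    """Return the percentage of comparisons the model won outright.

    Each outcome is a hypothetical grader label: "win", "loss", or "tie".
    Ties and losses both count against the model here (an assumption).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one graded comparison")
    wins = sum(1 for outcome in outcomes if outcome == "win")
    return 100.0 * wins / len(outcomes)


# Made-up grades for five tasks, for illustration only.
grades = ["win", "loss", "win", "tie", "loss"]
print(f"{win_rate(grades):.1f}%")  # 2 wins out of 5 -> 40.0%
```

Read this way, a win rate of 47.6% means the model's work was judged better than the expert's in just under half of the graded comparisons.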

The study found that Claude was the highest-performing across eight of the nine industry sectors it tested, including government, health care, and social assistance. The results clearly show that Claude Opus 4.1 leads across a diverse range of work-related tasks.

Claude win rates by sector

(Image credit: OpenAI)

Example tasks include emailing a response to a dissatisfied customer requesting a return, optimizing a table layout for a spring vendor fair, and auditing price inconsistencies in purchase orders.

What’s in a name?

The name used by OpenAI, GDPval, comes from the concept of Gross Domestic Product (GDP) as a key economic indicator. OpenAI wants GDPval to be widely adopted to help ground conversations about future AI improvements in evidence rather than guesswork.

Releasing results that show a competitor out in front appears to be an exercise in radical transparency by OpenAI, and one that fits with the company’s stated philosophy. “Our mission is to ensure that artificial general intelligence benefits all of humanity. As part of our mission, we want to transparently communicate progress on how AI models can help people in the real world,” reads a statement from OpenAI.

The paper, which is available to read in its entirety online, comes a week after OpenAI released a more consumer-focused paper that showed that the majority of ChatGPT users (70%) were actually using it at home, rather than at work.

That usage study was conducted by OpenAI’s Economic Research team and Harvard economist David Deming for the National Bureau of Economic Research (NBER). Its results surprised a lot of people, because new ChatGPT releases have previously focused heavily on work-related tasks like coding, making presentations, and serving as a research tool.

The news that Claude Opus 4.1 is better at actual work-related tasks, not just benchmarks, than even ‘ChatGPT-5 high’ could mean a renewed focus by OpenAI towards its changing user base.