
OpenAI GPT-5.5 Beats Claude Mythos on Terminal-Bench 2.0

OpenAI on Monday released GPT-5.5, its latest large language model, which narrowly defeats Anthropic’s Claude Mythos Preview on Terminal-Bench 2.0 — marking the first time in months that OpenAI has reclaimed the top spot on a major AI benchmark. According to OpenAI’s announcement, the model represents “a fundamental redesign of how intelligence interacts with a computer’s operating system and professional software stacks.”

The release ends speculation about OpenAI’s internally codenamed “Spud” model, which reports suggested had been in development for months. With the launch, GPT-5.5 now leads all generally available LLMs on the benchmark, surpassing recent offerings from Anthropic and Google.

Benchmark Performance Across Multiple Domains

GPT-5.5’s performance extends beyond the Terminal-Bench victory. The model shows significant improvements in coding tasks, with OpenAI VP of Research Amelia Glaese telling journalists that “it’s definitely our strongest model yet on coding, both measured by benchmarks and based on the feedback that we’ve gotten from trusted partners.”

Meanwhile, specialized benchmarks continue revealing model strengths and weaknesses. ThermoQA, a new thermodynamics reasoning benchmark with 293 problems across three difficulty tiers, shows Claude Opus 4.6 leading at 94.1%, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%.

The ThermoQA results demonstrate that property memorization doesn’t guarantee thermodynamic reasoning ability. Performance degradation across difficulty tiers ranges from just 2.8 percentage points for Claude Opus to 32.5 points for MiniMax, highlighting substantial reasoning capability gaps between models.
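The degradation figure is just the gap, in percentage points, between a model’s accuracy on the easiest and hardest tiers. A minimal sketch — the per-tier scores below are illustrative placeholders chosen only to reproduce the two spans quoted above, not ThermoQA’s published per-tier numbers:

```python
def tier_degradation(tier_scores):
    """Degradation in percentage points: easiest-tier accuracy minus hardest-tier accuracy."""
    return round(tier_scores[0] - tier_scores[-1], 1)

# Hypothetical per-tier accuracies (%), easiest to hardest — illustrative only.
scores = {
    "Claude Opus 4.6": [95.5, 94.1, 92.7],  # span: 2.8 points
    "MiniMax":         [72.0, 55.0, 39.5],  # span: 32.5 points
}
for model, tiers in scores.items():
    print(f"{model}: {tier_degradation(tiers)} pp")
```

A small span means accuracy holds up as problems shift from property lookups toward multi-step reasoning; a large span suggests the model is leaning on memorization.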

DeepSeek-V4 Disrupts Cost-Performance Balance

Chinese AI startup DeepSeek simultaneously released DeepSeek-V4, a 1.6-trillion-parameter model that achieves near state-of-the-art performance at approximately one-sixth the API cost of premium models like GPT-5.5 and Claude Opus 4.7. The open-source model, released under MIT License, is being called the “second DeepSeek moment” after the company’s R1 model disrupted the industry in January 2025.

https://x.com/deepseek_ai/status/2047516922263285776

DeepSeek AI researcher Deli Chen described the release as a “labor of love” 484 days after V3’s launch, emphasizing that “AGI belongs to everyone.” The model is available through Hugging Face and DeepSeek’s API.
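DeepSeek’s API has historically followed the OpenAI-compatible chat-completions shape, so calling the new model should look roughly like the sketch below. The model identifier "deepseek-v4" and the endpoint path are assumptions — check DeepSeek’s documentation for the exact values:

```python
import json
import urllib.request

API_BASE = "https://api.deepseek.com"  # DeepSeek's OpenAI-compatible base URL

def build_chat_request(prompt, model="deepseek-v4"):
    """Assemble the JSON body for an OpenAI-style /chat/completions call.
    The model name is an assumption, not a confirmed identifier."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt, api_key):
    """Send the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the interface mirrors OpenAI’s, existing client code can typically switch providers by changing only the base URL, key, and model name — one reason the cost comparison is so direct.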

Google Advances Autonomous Research Capabilities

Google on Monday launched Deep Research and Deep Research Max agents, marking the most significant upgrade to autonomous research capabilities since the product’s debut. According to Google’s blog post, the new agents can fuse open web data with proprietary enterprise information through a single API call.

The agents, built on Gemini 3.1 Pro, represent Google’s bid to dominate enterprise research workflows in finance, life sciences, and market intelligence. Google CEO Sundar Pichai announced on X that the agents support Model Context Protocol (MCP) for third-party data sources and can generate native charts and infographics.

Key capabilities include:

  • Multi-source data fusion through single API calls
  • Native chart and infographic generation
  • MCP support for arbitrary third-party connections
  • Integration with proprietary enterprise data
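Google has not published the API schema, so the following is a purely illustrative sketch of what a single-call request fusing those sources might look like — every field name, the endpoint shape, and the example source URIs are hypothetical:

```python
def build_research_request(query, web=True, enterprise_sources=(), mcp_servers=()):
    """Assemble one request that fuses open-web and proprietary data.
    All field names here are hypothetical, not Google's actual schema."""
    return {
        "query": query,
        "sources": {
            "open_web": web,
            "enterprise": list(enterprise_sources),               # proprietary data stores
            "mcp": [{"server_url": u} for u in mcp_servers],      # third-party via MCP
        },
        "outputs": ["report", "charts", "infographics"],          # native visual outputs
    }

req = build_research_request(
    "Competitive landscape for GLP-1 therapeutics",
    enterprise_sources=["bigquery://sales_2026"],                 # hypothetical URI
    mcp_servers=["https://mcp.example.com"],                      # hypothetical server
)
```

The design point the announcement emphasizes is that fusion happens server-side: the caller declares sources once rather than orchestrating separate retrieval pipelines.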

Enterprise AI Adoption Accelerates

Google’s internal data reveals the scope of AI adoption across enterprises. The company documented 1,302 real-world generative AI use cases from leading organizations — a massive expansion from the 101 cases catalogued two years ago at Next ’24.

The growth demonstrates what Google calls “the era of the agentic enterprise,” with production AI and agentic systems now deployed across virtually every organization attending Next ’26 in Las Vegas. The majority of use cases showcase agentic AI applications built with tools like Gemini Enterprise, Gemini CLI, and Security Command Center.

Google enlisted AI assistance to analyze the complete dataset, identifying key trends across the 1,302 implementations. The analysis reveals that organizations are moving beyond simple automation to complex, reasoning-capable systems that handle multi-step workflows.

What This Means

The convergence of breakthrough model releases signals a new phase in AI competition where performance gains, cost efficiency, and specialized capabilities are advancing simultaneously. OpenAI’s GPT-5.5 reclaiming benchmark leadership, DeepSeek’s dramatic cost reduction, and Google’s enterprise research automation represent three distinct but complementary approaches to AI advancement.

The benchmark wars are intensifying as models approach human-level performance on increasingly complex tasks. ThermoQA’s multi-tier structure exemplifies how evaluation must evolve beyond simple accuracy metrics to assess genuine reasoning capabilities across difficulty levels.

DeepSeek-V4’s cost-performance breakthrough particularly threatens established pricing models. At one-sixth the cost of premium alternatives while maintaining competitive performance, it forces incumbent providers to justify premium pricing through specialized capabilities rather than raw performance alone.

FAQ

How does GPT-5.5 compare to previous OpenAI models?
GPT-5.5 significantly outperforms GPT-5.4 in coding tasks and computer-use applications. OpenAI positions it as more intuitive, requiring less guidance to interpret ambiguous problems and determine next steps autonomously.

What makes DeepSeek-V4 different from other open-source models?
DeepSeek-V4 combines 1.6 trillion total parameters with a Mixture-of-Experts architecture, achieving near state-of-the-art performance at dramatically lower cost. It’s released under the MIT License, making it commercially viable for enterprise use without licensing restrictions.
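Mixture-of-Experts is why a 1.6-trillion-parameter model can be cheap to serve: each token is routed to only a few experts, so the active parameter count per token is a small fraction of the total. A minimal top-k routing sketch — expert counts and dimensions below are illustrative, not DeepSeek-V4’s actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
E, k, d = 64, 4, 16                 # experts, experts activated per token, hidden size
router = rng.normal(size=(d, E))    # routing projection
experts = rng.normal(size=(E, d, d))  # one weight matrix per expert

def moe_layer(x):
    """Route a batch of token vectors (n, d) through their top-k experts."""
    logits = x @ router                        # (n, E) routing scores
    topk = np.argsort(logits, axis=1)[:, -k:]  # indices of the k best experts per token
    out = np.zeros_like(x)
    for i, idxs in enumerate(topk):
        w = np.exp(logits[i, idxs])
        w /= w.sum()                           # softmax over the selected experts only
        for j, e in enumerate(idxs):
            out[i] += w[j] * (x[i] @ experts[e])
    return out

y = moe_layer(rng.normal(size=(8, d)))
print(y.shape)  # (8, 16) — same shape out, but only k/E of expert weights touched
```

With k = 4 of 64 experts active, each token exercises roughly 1/16 of the expert parameters, which is the mechanism behind MoE models’ favorable cost-per-token economics.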

How accurate are current AI benchmarks for real-world performance?
Benchmarks like ThermoQA reveal significant gaps between memorization and reasoning. Models that excel at property lookups often struggle with complex analysis, suggesting current benchmarks capture only part of practical AI capabilities.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.