DeepSeek released its V4 model on Monday, achieving near state-of-the-art performance across multiple AI benchmarks while undercutting OpenAI’s GPT-5.5 and Anthropic’s Claude Opus 4.7 by approximately 83% on API pricing. The 1.6-trillion-parameter Mixture-of-Experts model, available under the MIT License, marks what researchers are calling the “second DeepSeek moment” following the company’s breakthrough R1 release in January 2025.
According to VentureBeat, DeepSeek-V4 matches or exceeds performance of leading closed-source models on several benchmarks, continuing the Chinese startup’s pattern of delivering frontier-class capabilities at dramatically reduced costs.
https://x.com/deepseek_ai/status/2047516922263285776
OpenAI Launches GPT-5.5 Amid Intensifying Competition
OpenAI unveiled GPT-5.5 earlier this week, positioning it as a “fundamental redesign” of how AI interacts with operating systems and professional software. The model narrowly beats Anthropic’s Claude Mythos Preview on Terminal-Bench 2.0, though VentureBeat reports the margin represents “essentially a statistical tie.”
“What is really special about this model is how much more it can do with less guidance,” OpenAI co-founder Greg Brockman told journalists during a pre-launch briefing. “It’s way more intuitive to use. It can look at an unclear problem and figure out what needs to happen next.”
Amelia “Mia” Glaese, VP of Research at OpenAI, emphasized GPT-5.5’s coding capabilities: “It’s definitely our strongest model yet on coding, both measured by benchmarks and based on the feedback that we’ve gotten from trusted partners, as well as our own experience.”
GPT-5.4 remains available alongside the new model, and OpenAI continues to offer it at half the API cost of its successor.
New Benchmark Reveals Performance Gaps in Specialized Domains
Researchers introduced ThermoQA, a three-tier benchmark evaluating thermodynamic reasoning across 293 engineering problems. The arXiv paper reveals significant performance variations among frontier models, with Claude Opus 4.6 leading at 94.1%, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%.
The benchmark exposes critical weaknesses in current AI systems. Cross-tier performance degradation ranges from 2.8 percentage points for Claude Opus to 32.5 percentage points for MiniMax, confirming that “property memorization does not imply thermodynamic reasoning,” according to the researchers.
Supercritical water analysis, R-134a refrigerant calculations, and combined-cycle gas turbine problems serve as particularly challenging discriminators, creating 40-60 percentage point performance spreads between models. Multi-run consistency varies dramatically, with standard deviations ranging from ±0.1% to ±2.5% across different models.
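The paper’s exact scoring pipeline isn’t reproduced here, but the two metrics cited above — cross-tier degradation and multi-run standard deviation — reduce to simple arithmetic over per-tier accuracy scores. A minimal sketch follows; the tier names and numbers are illustrative assumptions, not figures from the ThermoQA paper.

```python
# Illustrative sketch only: tier names and scores are assumptions,
# not results from the ThermoQA paper.
from statistics import mean, pstdev

# Hypothetical per-tier accuracies (%) for one model across three runs.
runs = [
    {"tier1_properties": 95.0, "tier2_processes": 90.5, "tier3_cycles": 88.0},
    {"tier1_properties": 94.2, "tier2_processes": 91.1, "tier3_cycles": 87.1},
    {"tier1_properties": 95.6, "tier2_processes": 90.0, "tier3_cycles": 88.9},
]

# Average accuracy per tier across runs.
avg = {tier: mean(run[tier] for run in runs) for tier in runs[0]}

# Cross-tier degradation: drop from the easiest tier to the hardest,
# in percentage points (the 2.8-32.5 pp range cited above).
degradation = avg["tier1_properties"] - avg["tier3_cycles"]

# Multi-run consistency: standard deviation of overall accuracy across runs
# (the ±0.1% to ±2.5% figures cited above).
overall_per_run = [mean(run.values()) for run in runs]
consistency = pstdev(overall_per_run)

print(f"cross-tier degradation: {degradation:.1f} pp")
print(f"multi-run std dev: ±{consistency:.2f}%")
```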
Google Expands Research Agent Capabilities
Google launched Deep Research and Deep Research Max agents on Monday, marking the company’s most significant upgrade to autonomous research capabilities since the product’s debut. According to VentureBeat, the new agents can fuse open web data with proprietary enterprise information through a single API call.
Built on Google’s Gemini 3.1 Pro model, the agents introduce native chart and infographics generation within research reports and support connections to third-party data sources through the Model Context Protocol (MCP). “We are launching two powerful updates to Deep Research in the Gemini API, now with better quality, MCP support, and native chart/infographics generation,” Google CEO Sundar Pichai announced on X.
The release targets enterprise research workflows in finance, life sciences, and market intelligence — sectors where information accuracy carries high stakes. Google positions this as infrastructure for replacing human analyst workflows that traditionally consume hours or days.
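VentureBeat describes the agents as fusing web and proprietary data “through a single API call,” but the source does not document the actual request shape. The sketch below is a hypothetical illustration of that workflow; the endpoint URL, payload fields, and agent identifier are assumptions, not Google’s published Deep Research API.

```python
# Hypothetical illustration of the "single API call" workflow described above.
# The endpoint path, payload fields, and agent identifier are assumptions;
# consult Google's Gemini API documentation for the real Deep Research surface.
import os
import requests

API_KEY = os.environ["GEMINI_API_KEY"]

payload = {
    "agent": "deep-research-max",          # assumed agent identifier
    "prompt": "Summarize Q3 market trends in industrial refrigerants.",
    "output": {"include_charts": True},    # native chart/infographic generation
    "mcp_servers": [                       # proprietary data via MCP connectors
        {"url": "https://mcp.example-corp.internal/finance", "auth": "oauth"},
    ],
}

resp = requests.post(
    "https://example-endpoint.googleapis.com/v1/deep-research:run",  # placeholder URL
    headers={"x-goog-api-key": API_KEY},
    json=payload,
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```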
Enterprise AI Adoption Accelerates Across Industries
Google documented 1,302 real-world generative AI use cases from leading organizations, expanding from the original 101 cases published at Next ’24. The Google Cloud blog indicates the growth reflects “enthusiastic commitment to AI” from enterprise customers.
The vast majority showcase agentic AI applications built with tools like Gemini Enterprise, Gemini CLI, Security Command Center, and Google’s AI Hypercomputer infrastructure. Matt Renner, President of Global Revenue at Google Cloud, described this as evidence that organizations have entered “the era of the agentic enterprise.”
According to the company, production AI and agentic systems are now meaningfully deployed across virtually every organization attending Google’s Next ’26 conference in Las Vegas. Google enlisted AI assistance to analyze the expanded dataset, using Gemini Enterprise running the latest Gemini Pro models to identify key trends and insights.
What This Means
The benchmark wars have intensified dramatically, with three major releases in a single week reshaping competitive dynamics. DeepSeek-V4’s combination of frontier performance and radical cost reduction poses an existential challenge to the pricing models of Western AI leaders. At one-sixth the cost of GPT-5.5 and Claude Opus 4.7, DeepSeek forces a fundamental recalculation of AI economics.
Specialized benchmarks like ThermoQA reveal that headline performance metrics often mask critical domain-specific weaknesses. The 40-60 percentage point spreads in thermodynamic reasoning suggest that enterprise buyers need granular evaluation frameworks rather than relying on general-purpose scores.
Google’s research agent expansion signals the next competitive front: autonomous workflows that replace human analysts. The ability to fuse proprietary data with web research through single API calls represents a significant moat-building opportunity, particularly for enterprises reluctant to expose sensitive information to external AI providers.
If anything, the compressed timeline of this week’s announcements suggests the AI arms race is accelerating, with no sign of slowing.
FAQ
How much cheaper is DeepSeek-V4 compared to other frontier models?
DeepSeek-V4 costs approximately one-sixth the price of GPT-5.5 and Claude Opus 4.7 through API access, while delivering comparable or superior performance on many benchmarks. This represents an 83% cost reduction compared to leading Western alternatives.
What makes ThermoQA different from other AI benchmarks?
ThermoQA evaluates domain-specific reasoning in engineering thermodynamics rather than general language capabilities. It reveals that models can memorize properties without understanding thermodynamic principles, creating performance gaps of 40-60 percentage points on complex problems like supercritical water analysis.
Can Google’s Deep Research agents access private company data?
Yes, Deep Research and Deep Research Max can fuse open web data with proprietary enterprise information through a single API call. They support connections to third-party data sources via Model Context Protocol (MCP) and can generate native charts and infographics within research reports.
Sources
- OpenAI’s GPT-5.5 is here, and it’s no potato: narrowly beats Anthropic’s Claude Mythos Preview on Terminal-Bench 2.0 – VentureBeat
- ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models – arXiv AI
- DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5 – VentureBeat