OpenAI on Monday officially launched ChatGPT Images 2.0, a major upgrade to its image generation capabilities that can create multilingual text, full infographics, presentation slides, maps, and manga-style illustrations with unprecedented accuracy. According to OpenAI’s announcement, the new model had been available for testing on LMArena under the codename “duct tape” for several weeks before its public release.
The update encompasses the new `gpt-image-2` model for API users and introduces “Thinking” features for ChatGPT subscribers across all subscription tiers. VentureBeat reported that early users have been impressed by the model’s ability to generate long text blocks and realistic user interface mockups, and by its ability to perform web research and incorporate the results directly into generated images.
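For developers, the announcement positions `gpt-image-2` as a new model name for the API. The sketch below assumes it slots into the same Images API surface that `gpt-image-1` uses today; the model name comes from the announcement, while the request and response shapes are assumptions based on the current OpenAI Python SDK, not confirmed details of the new model.

```python
# Minimal sketch: generating an infographic-style image with gpt-image-2.
# Assumes the new model reuses the existing Images API surface (as
# gpt-image-1 does today); the model name comes from OpenAI's announcement,
# everything else is an assumption based on the current SDK.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="gpt-image-2",  # announced model name; availability unconfirmed
    prompt=(
        "A one-page infographic summarizing global coffee consumption, "
        "with accurate English and Japanese labels on every chart"
    ),
    size="1024x1536",  # portrait size supported by gpt-image-1 today
)

# gpt-image-1 returns images base64-encoded; we assume the same here.
image_bytes = base64.b64decode(response.data[0].b64_json)
with open("infographic.png", "wb") as f:
    f.write(image_bytes)
```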
Advanced Multimodal Capabilities
ChatGPT Images 2.0 represents a fundamental shift in visual media generation, with capabilities extending far beyond traditional image creation. The model can produce floor plans, image grids containing multiple smaller images, and character models from various angles. According to OpenAI’s release notes, “Images are a language” — reflecting the company’s broader vision of visual content as a communication medium.
The system demonstrates particular strength in text-heavy visual content, generating accurate multilingual text within images and creating complex infographics that combine textual and visual elements. Early testing revealed the model’s ability to reproduce real-life figures, including OpenAI CEO Sam Altman, and generate realistic screenshots from popular websites and platforms.
Users can also upload their own imagery and apply the new generation features to existing visual content, expanding the model’s utility beyond pure creation to enhancement and modification of existing materials.
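Uploaded imagery would presumably flow through the edits endpoint. Again a minimal sketch, assuming `gpt-image-2` is accepted wherever `gpt-image-1` is today; the file names and prompt are illustrative.

```python
# Sketch: applying the new generation features to an uploaded image via the
# existing edits endpoint. Model availability on this endpoint is an
# assumption; file names and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

with open("floor_plan_draft.png", "rb") as image_file:
    response = client.images.edit(
        model="gpt-image-2",  # assumption: announced model, existing endpoint
        image=image_file,
        prompt="Relabel every room in Japanese and add room dimensions",
    )

image_bytes = base64.b64decode(response.data[0].b64_json)
with open("floor_plan_edited.png", "wb") as f:
    f.write(image_bytes)
```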
Google Counters with Deep Research Agents
Google responded to OpenAI’s visual AI advances by launching two new autonomous research agents — Deep Research and Deep Research Max — that combine web data with proprietary enterprise information through a single API call. Google CEO Sundar Pichai announced the release on X, highlighting the agents’ improved quality, Model Context Protocol (MCP) support, and native chart generation capabilities.
Built on Google’s Gemini 3.1 Pro model, the new agents can produce charts and infographics directly within research reports and connect to third-party data sources. VentureBeat noted this marks an inflection point in the race to build AI systems capable of autonomous, multi-source research that traditionally required hours or days of human analyst time.
The agents target enterprise research workflows in finance, life sciences, and market intelligence — industries where information accuracy is critical. However, the new capabilities are currently available only through the API, not in Google’s consumer Gemini app, drawing criticism from some users.
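Because Google has not yet documented a consumer-facing interface, the shape of that “single API call” is best shown schematically. The sketch below is hypothetical throughout: the endpoint URL, payload fields, and response structure are illustrative stand-ins for the capabilities described in the announcement (open web plus MCP-connected sources, native chart output), not a published Google API.

```python
# Hypothetical sketch of a "single API call" to a deep-research agent.
# The endpoint, field names, and response shape are illustrative stand-ins
# for the announced capabilities; they are NOT Google's documented API.
import os
import requests

ENDPOINT = "https://example.googleapis.com/v1/agents/deep-research:run"  # placeholder

payload = {
    "query": "Competitive landscape for GLP-1 manufacturing capacity through 2027",
    # Per the announcement, agents can mix open web data with proprietary
    # sources exposed over the Model Context Protocol (MCP).
    "sources": {
        "web": True,
        "mcp_servers": ["https://internal.example.com/mcp"],  # placeholder
    },
    "output": {"format": "report", "include_charts": True},  # native charts
}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    timeout=600,  # research runs are long-lived compared to chat requests
)
resp.raise_for_status()
report = resp.json()
print(report.get("summary", ""))
```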
Real-World AI Adoption Accelerates
The multimodal AI advances come as enterprise adoption reaches unprecedented levels. Google documented 1,302 real-world generative AI use cases from leading organizations, representing massive growth from the original 101 cases published two years ago at Google Next ’24.
According to Google’s analysis, production AI and agentic systems are now deployed across virtually every organization attending Google Next ’26 in Las Vegas. The company describes this as “the fastest technological transformation we’ve seen,” with customers driving the adoption of agentic enterprise systems built using tools like Gemini Enterprise, Gemini CLI, and Security Command Center.
The expansion demonstrates how multimodal AI capabilities are becoming integral to business operations rather than experimental technologies. Organizations are implementing these systems for everything from automated research to visual content generation, marking a shift toward AI-native workflows.
Specialized Applications Emerge
Beyond consumer and enterprise applications, multimodal AI is finding specialized uses in critical domains. Researchers at MIT developed automated systems for detecting dosing errors in clinical trial narratives using multimodal feature engineering. A recent arXiv paper described a system combining 3,451 features spanning traditional NLP, semantic embeddings, and transformer-based scores to achieve a test ROC-AUC of 0.8725 on clinical trial error detection.
The research demonstrates how multimodal AI can address safety-critical applications where traditional single-modality approaches fall short. The system processes nine complementary text fields with a median of 5,400 characters per sample across 42,112 clinical trial narratives, showcasing the scale and complexity of real-world multimodal applications.
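The paper’s exact pipeline is not reproduced here, but the underlying pattern, concatenating sparse lexical features with dense numeric scores before a single classifier, can be sketched briefly. The toy data, feature families, and classifier below are illustrative assumptions, not the authors’ configuration.

```python
# Sketch of multimodal feature engineering for narrative error detection:
# sparse lexical features + numeric features feeding one classifier.
# Toy data and feature choices are illustrative, not the paper's setup.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for clinical trial narratives (label 1 = dosing error).
texts = [
    "Subject received 10 mg instead of the protocol-specified 1 mg dose.",
    "Dosing proceeded per protocol; no deviations were recorded.",
    "Administered dose exceeded the maximum allowed by the titration schedule.",
    "All study drug administrations matched the randomization assignment.",
] * 50
labels = np.array([1, 0, 1, 0] * 50)

# Feature family 1: traditional NLP (TF-IDF over word n-grams).
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_text = tfidf.fit_transform(texts)

# Feature family 2: cheap numeric features, standing in for the paper's
# embedding- and transformer-derived scores.
X_num = csr_matrix(
    np.array([[len(t), t.lower().count("dose")] for t in texts], dtype=float)
)

# Concatenate families into one design matrix, then fit a single classifier.
X = hstack([X_text, X_num], format="csr")
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```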
At MIT, AI adoption has become so pervasive that researchers are drifting into artificial intelligence work almost by accident. MIT Technology Review reported how mechanical engineering professor Sili Deng pivoted to machine learning during COVID-19 lab shutdowns, eventually developing “digital twin” models for energy and flow devices.
What This Means
The simultaneous launches of OpenAI’s ChatGPT Images 2.0 and Google’s Deep Research agents signal a new phase in the AI competition, where multimodal capabilities are becoming the primary differentiator rather than text-only performance. Both companies are betting that the future of AI lies in systems that can seamlessly work across visual, textual, and data modalities.
For enterprises, these developments represent a shift from experimental AI pilots to production-ready multimodal workflows. The ability to generate complex visual content with embedded research, create multilingual materials, and process diverse data types through single interfaces could fundamentally change how organizations approach content creation and analysis.
The rapid adoption documented by Google — from 101 to 1,302 use cases in two years — suggests we’re entering what the company calls “the era of the agentic enterprise,” where AI systems don’t just assist but autonomously execute complex, multi-step workflows across different media types.
FAQ
What makes ChatGPT Images 2.0 different from previous image generation models?
ChatGPT Images 2.0 can generate accurate multilingual text within images, create complex infographics and presentations, and apply its capabilities to user-uploaded content. It also includes web research functionality that incorporates findings directly into generated visuals.
How do Google’s Deep Research agents work with enterprise data?
The agents combine open web data with proprietary enterprise information through a single API call, generate native charts and infographics within research reports, and connect to third-party data sources via the Model Context Protocol (MCP).
What industries are seeing the most multimodal AI adoption?
According to Google’s data, finance, life sciences, market intelligence, and healthcare are leading adoption, particularly for applications requiring high accuracy like clinical trial monitoring and automated research workflows.