Multimodal AI systems gained significant ground this week, with advances in real-time interaction models and lower-cost video analysis. Thinking Machines announced a research preview of “interaction models” that process multiple input types simultaneously, while Perceptron launched its Mk1 video analysis model at 80-90% lower cost than competing services from Anthropic, OpenAI, and Google.
Thinking Machines Debuts Real-Time Multimodal Interaction
Thinking Machines, the AI startup founded by former OpenAI CTO Mira Murati and OpenAI co-founder John Schulman, announced a research preview of what it calls “interaction models” — systems designed to handle simultaneous multimodal inputs rather than traditional turn-based exchanges. According to the company’s blog post, these models treat interactivity as “a first-class citizen of model architecture rather than an external software harness.”
The technology represents a departure from current AI interaction patterns where users input a query, wait for processing, and receive a response. Instead, interaction models can process and respond to ongoing human inputs across text, audio, and video simultaneously. The company reported “impressive gains on third-party benchmarks and reduced latency” compared to traditional approaches.
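For illustration, the shift can be thought of as moving from a request-response loop to an event loop that reacts to inputs as they arrive. The toy Python sketch below follows that framing; it is not Thinking Machines’ architecture or API, and the event types and stubbed model reaction are hypothetical.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class InputEvent:
    modality: str  # "text", "audio", or "video" (illustrative only)
    payload: str

async def input_stream():
    # Simulated interleaved multimodal events arriving over time,
    # rather than one complete "turn" submitted all at once.
    events = [
        InputEvent("audio", "user starts speaking"),
        InputEvent("video", "user holds an object up to the camera"),
        InputEvent("text", "what is this part called?"),
    ]
    for event in events:
        await asyncio.sleep(0.1)
        yield event

async def interaction_loop():
    # A turn-based system would wait for the full query before responding;
    # an interaction model can react to each event as it streams in.
    async for event in input_stream():
        print(f"model reacts to {event.modality}: {event.payload}")

asyncio.run(interaction_loop())
```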
Thinking Machines plans to open a limited research preview “in the coming months” to collect feedback before wider release. The startup, which focuses on multimodality and human-AI collaboration, has not yet announced pricing or general availability dates.
Perceptron Launches Cost-Effective Video Analysis Model
Perceptron Inc. released its flagship Mk1 video analysis model with pricing at $0.15 per million input tokens and $1.50 per million output tokens — representing cost savings of 80-90% compared to Claude Sonnet 4.5, GPT-5, and Gemini 3.1 Pro. According to Perceptron’s announcement, the model targets enterprise applications including security monitoring, marketing video optimization, and behavioral analysis.
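As a rough back-of-the-envelope illustration of what that pricing means per request (the per-request token counts below are assumptions, since video tokenization varies by provider):

```python
# Mk1's published prices; per-request token counts are assumed for illustration.
MK1_INPUT_PER_M = 0.15   # USD per million input tokens
MK1_OUTPUT_PER_M = 1.50  # USD per million output tokens

def request_cost(input_tokens, output_tokens, in_price, out_price):
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Hypothetical job: a clip tokenized to ~200k input tokens, with a 2k-token analysis.
mk1 = request_cost(200_000, 2_000, MK1_INPUT_PER_M, MK1_OUTPUT_PER_M)
print(f"Mk1 cost per request: ${mk1:.3f}")  # ~$0.033

# An 80-90% saving implies competitors charge roughly 5-10x as much
# for the same token volumes.
print(f"Implied competitor range: ${mk1 * 5:.3f} - ${mk1 * 10:.3f}")
```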
Co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, said the company spent 16 months developing a “multi-modal recipe” to address real-world video understanding challenges. The model demonstrates capabilities in understanding cause-and-effect relationships, object dynamics, and physics principles within video content.
Perceptron offers a public demo for users to test the model’s video analysis capabilities. The company positions Mk1 as addressing the gap in mainstream video AI functionality, particularly for live feed analysis and complex video reasoning tasks.
Advances in Multimodal Training and Evaluation
Researchers introduced Auto-Rubric as Reward (ARR), a framework that improves multimodal model training by converting implicit preferences into explicit, criteria-based evaluations. According to research published on arXiv, ARR addresses limitations in current reinforcement learning from human feedback (RLHF) approaches that reduce complex human judgments to scalar values.
The ARR framework externalizes vision-language model preference knowledge as “prompt-specific rubrics,” breaking down holistic assessments into verifiable quality dimensions. This approach demonstrated improvements over traditional pairwise reward models in text-to-image generation and image editing benchmarks.
The research introduces Rubric Policy Optimization (RPO), which distills structured multi-dimensional evaluations into binary rewards for more stable policy gradient training. The authors report that ARR enables both zero-shot deployment and few-shot learning with minimal supervision.
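The paper’s exact prompts, criteria, and aggregation are not reproduced here, but the core idea — scoring an output against prompt-specific, verifiable criteria and collapsing the result into a binary reward — can be sketched roughly as follows. The rubric items, the judge() stub, and the pass threshold are assumptions for illustration only, not the authors’ code.

```python
# Minimal sketch of rubric-based scoring in the spirit of ARR/RPO (illustrative only).

def judge(criterion: str, prompt: str, output: str) -> bool:
    """Stand-in for a VLM judge that checks one verifiable criterion."""
    # In practice this would be a model call; here it is stubbed with a substring check.
    return criterion.lower() in output.lower()

def rubric_reward(prompt: str, output: str, rubric: list[str], threshold: float = 0.75) -> int:
    """Score each criterion independently, then collapse the result into a binary reward."""
    verdicts = [judge(c, prompt, output) for c in rubric]
    pass_rate = sum(verdicts) / len(verdicts)
    return 1 if pass_rate >= threshold else 0  # binary signal for policy-gradient training

# Hypothetical prompt-specific rubric for an image-editing instruction.
rubric = ["sky replaced", "subject unchanged", "no visible artifacts"]
print(rubric_reward(
    "replace the sky with a sunset",
    "sky replaced; subject unchanged; no visible artifacts",
    rubric,
))  # -> 1
```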
Data Supply Chain Developments
Wirestock raised $23 million in Series A funding to expand its multimodal data supply business serving AI labs. The round was led by Nava Ventures, with participation from SBVP and Formula VC. The company pivoted from stock photography distribution to AI training data provision in 2023.
According to TechCrunch, Wirestock now supplies datasets of images, videos, design assets, and 3D content to six major foundation model makers. The platform connects over 700,000 artists and designers who complete data collection tasks, generating $40 million in annual run-rate revenue and paying out $15 million to contributors.
Co-founder and CEO Mikayel Khachatryan said the majority of the platform’s original 100,000 photographers opted into the data supply business after the company’s transparent pivot announcement. The funding will support expansion of custom content creation and data annotation services.
Enterprise Deployment Focus
Parloa, featured in an OpenAI case study, demonstrates practical multimodal AI deployment in customer service. The Berlin-based company’s AI Agent Management Platform (AMP) uses GPT-5.4 to handle voice-driven customer interactions, moving beyond rule-based systems to natural language behavior definition.
The platform enables business users to design and deploy customer service agents without coding, handling everything from simple routing to complex multi-step requests. Parloa emphasizes production reliability, continuously testing models against real customer scenarios before deployment.
Engineering Manager Ciaran O’Reilly Ibañez noted the company’s focus on making models “fast and reliable enough for real-time conversations,” highlighting the practical challenges of deploying multimodal AI in enterprise environments.
What This Means
These developments signal a maturation of multimodal AI from research concepts to production-ready systems. The emergence of cost-effective video analysis models like Perceptron’s Mk1 could accelerate enterprise adoption, while Thinking Machines’ interaction models point toward more natural human-AI collaboration patterns.
The focus on structured evaluation frameworks like ARR addresses critical training challenges, potentially improving model reliability and reducing deployment risks. Meanwhile, the growth of specialized data supply chains through companies like Wirestock indicates a professionalizing ecosystem around multimodal AI development.
The convergence of real-time processing, cost efficiency, and enterprise deployment capabilities suggests multimodal AI is transitioning from experimental technology to practical business tools. However, the limited availability of these advanced systems — with most still in preview or research phases — indicates the technology remains in early commercialization stages.
FAQ
What are interaction models and how do they differ from current AI?
Interaction models process multiple input types simultaneously rather than following the traditional turn-based approach, where users input a query and wait for a response. Thinking Machines’ preview system can handle ongoing text, audio, and video inputs in real time, creating more natural conversation flows.
How much cheaper is Perceptron’s video analysis compared to major providers?
Perceptron’s Mk1 model costs $0.15 per million input tokens and $1.50 per million output tokens, representing 80-90% cost savings compared to video analysis capabilities from Anthropic’s Claude, OpenAI’s GPT-5, and Google’s Gemini. This pricing could make video AI more accessible to smaller enterprises.
What role does training data quality play in multimodal AI development?
High-quality, diverse training data is crucial for multimodal AI performance. Companies like Wirestock are building specialized supply chains, with more than 700,000 creators providing annotated images, videos, and 3D content to major AI labs. Wirestock’s $40 million in annual run-rate revenue highlights the economic value of quality multimodal datasets.