Multimodal AI Shifts from Chat to Real-Time Interaction

Multimodal AI is moving well beyond text-and-image pairs this week, with a cluster of announcements spanning real-time video conversation, low-cost video reasoning, structured reward modeling, and a $23 million funding round for creative training data. Taken together, the developments mark a period of rapid diversification in how AI systems perceive, process, and respond to the physical world.

Thinking Machines Previews Native Multimodal Interaction

Thinking Machines Lab, the startup co-founded by former OpenAI CTO Mira Murati and former OpenAI researcher John Schulman, announced a research preview of what it calls “interaction models” — a class of AI systems designed to treat interactivity as a core architectural property rather than a software add-on bolted onto a language model.

According to Thinking Machines’ announcement blog, the models are trained to communicate through a camera and microphone simultaneously, processing continuous human input — including pauses, interruptions, and tonal shifts — rather than waiting for a discrete turn to end before generating a response. The company posted demonstration videos showing near-real-time voice and video conversation, though the models are not yet publicly available.
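To make the contrast between turn-based and continuous input handling concrete, here is a toy sketch in Python. It is not Thinking Machines' architecture, which is a trained model rather than a rule-based loop; the `Chunk` format, the event names, and both functions are hypothetical and exist only to show the difference in control flow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """One slice of incoming user signal (hypothetical format)."""
    t_ms: int        # timestamp in milliseconds
    kind: str        # "speech", "pause", or "interrupt"
    text: str = ""


def turn_based(chunks: List[Chunk]) -> List[str]:
    """Buffer the whole turn, then respond once at the end."""
    buffered = [c.text for c in chunks if c.kind == "speech"]
    return ["reply to: " + " ".join(buffered)]


def continuous(chunks: List[Chunk]) -> List[str]:
    """React to every chunk as it arrives, including pauses and interruptions."""
    events: List[str] = []
    speaking = False  # whether the system is currently producing output
    for c in chunks:
        if c.kind == "interrupt" and speaking:
            events.append(f"{c.t_ms}ms: stop speaking, listen")
            speaking = False
        elif c.kind == "pause":
            events.append(f"{c.t_ms}ms: start responding")
            speaking = True
        elif c.kind == "speech":
            events.append(f"{c.t_ms}ms: heard '{c.text}'")
    return events


# Assumed example stream: the user starts a request, then cuts in to change it.
STREAM = [
    Chunk(0, "speech", "book a flight"),
    Chunk(800, "pause"),
    Chunk(1200, "interrupt"),
    Chunk(1300, "speech", "actually, a train"),
    Chunk(2100, "pause"),
]

for line in continuous(STREAM):
    print(line)
```

In this toy, the turn-based path produces a single reply after everything has been said, while the continuous path records the moment the user cuts in and stops its own output at that point.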

Wired reported that Murati described the system as natively understanding “continuous, messy, human communication,” which she contrasted with existing voice interfaces that transcribe speech and pipe it into a standard chatbot pipeline. According to the company, the models scored gains on third-party benchmarks and showed lower latency than harness-based multimodal systems, though Thinking Machines has not published the specific benchmark figures.

The company said it will open a limited research preview in the coming months to collect feedback before a wider release. No pricing has been disclosed.

Murati’s Human-in-the-Loop Philosophy

The interaction model announcement is inseparable from Murati’s stated design philosophy. In an interview with Wired, she said: “At some point we will have super-intelligent machines. But we think that the best way to actually have many possible futures — good futures — is to keep humans in the loop.”

Her approach positions Thinking Machines against the dominant trajectory at OpenAI, Anthropic, and Google, which are building large models that execute complex tasks — including writing full software applications — with minimal human involvement. Murati argues that allowing users to customize frontier models and collaborate with them, rather than simply delegate to them, produces better outcomes.

This is not purely philosophical positioning. The interaction model architecture reflects it: a system that adapts mid-conversation when a user clarifies a point or changes subject is, by design, keeping the human as an active participant rather than a prompt-issuer.

Perceptron Mk1 Prices Video Reasoning at 80–90% Below Rivals

While Thinking Machines is building toward fluid human-AI conversation, a two-year-old startup called Perceptron Inc. is targeting the enterprise video analysis market with a different wedge: cost.

According to VentureBeat’s coverage and Perceptron’s own announcement, the company released Mk1, a proprietary video analysis reasoning model priced at $0.15 per million input tokens and $1.50 per million output tokens via API. The company claims this is 80–90% cheaper than comparable offerings from Anthropic (Claude Sonnet 4.5), OpenAI (GPT-5), and Google (Gemini 3.1 Pro).
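As a rough worked example of what those list prices imply, the Python sketch below computes the cost of a single job and the price a rival would have to charge for the claimed 80–90% gap to hold. The token counts are a hypothetical workload, and the rival figures are derived from Perceptron's claim, not taken from any vendor's published price list.

```python
# Mk1 list prices as reported (USD per million tokens).
INPUT_USD_PER_M = 0.15
OUTPUT_USD_PER_M = 1.50


def job_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job at the reported Mk1 prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def implied_rival_cost(cost: float, discount: float) -> float:
    """Cost a rival would charge if `cost` sits `discount` (0..1) below it."""
    return cost / (1.0 - discount)


# Hypothetical workload: 2M input tokens of video, 100k output tokens.
mk1 = job_cost(2_000_000, 100_000)
print(f"Mk1 cost: ${mk1:.2f}")
print(f"Implied rival cost at 80% cheaper: ${implied_rival_cost(mk1, 0.80):.2f}")
print(f"Implied rival cost at 90% cheaper: ${implied_rival_cost(mk1, 0.90):.2f}")
```

For this assumed workload the job costs about $0.45 on Mk1, which corresponds to a rival charging between roughly $2.25 and $4.50 if the company's claim holds.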

Co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, said the team spent 16 months building a “multi-modal recipe” from scratch. The model is designed to understand cause-and-effect relationships, object dynamics, and physical constraints in video — capabilities that go beyond frame-level image description.

Target use cases include:

Security monitoring of live facility feeds
Marketing video editing — automatically clipping high-engagement segments for social distribution
Quality control — flagging inconsistencies or errors in video content
Behavioral analysis — identifying body language and actions in controlled studies

Perceptron has published benchmark results focused on spatial and grounded video understanding, and a public demo is available. The model’s performance on industry-standard video benchmarks was not independently verified at the time of publication.

Auto-Rubric Reward Modeling Targets Multimodal Alignment

On the research side, a paper published to arXiv (arXiv:2605.08354) introduces Auto-Rubric as Reward (ARR), a framework that addresses a structural problem in how multimodal generative models are trained to match human preferences.

Current reinforcement learning from human feedback (RLHF) approaches compress complex, multi-dimensional human judgments — such as evaluating a generated image for composition, accuracy, and aesthetic quality simultaneously — into a single scalar or pairwise label. The paper argues this collapses nuanced preference structure and creates vulnerabilities to reward hacking, where a model learns to game the metric rather than improve quality.
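A small illustration of that collapse, as a sketch only: the scores and the 1–10 scale below are invented for the example, and averaging is just one assumed way of reducing several dimensions to a scalar.

```python
from statistics import mean

# Hypothetical per-dimension scores (1-10) for two generated images.
image_a = {"composition": 9, "accuracy": 3, "aesthetics": 9}
image_b = {"composition": 7, "accuracy": 7, "aesthetics": 7}


def scalar_reward(scores: dict) -> float:
    """Collapse all dimensions into a single number by averaging."""
    return mean(scores.values())


def pairwise_label(a: dict, b: dict) -> str:
    """Pairwise preference derived from the scalar alone."""
    ra, rb = scalar_reward(a), scalar_reward(b)
    if ra == rb:
        return "tie"
    return "a" if ra > rb else "b"


def weakest_dimension(scores: dict) -> str:
    """The dimension the scalar hides: where the image scores lowest."""
    return min(scores, key=scores.get)


print(scalar_reward(image_a), scalar_reward(image_b))
print(pairwise_label(image_a, image_b))
print(weakest_dimension(image_a))
```

Both images receive the same scalar and the pairwise label is a tie, even though the first image is clearly weak on accuracy. That lost distinction is the kind of preference structure the paper argues is being thrown away.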

ARR instead externalizes a vision-language model’s internalized preference knowledge as prompt-specific rubrics before any pairwise comparison occurs. These rubrics decompose holistic intent into independently verifiable quality dimensions, making the reward signal inspectable rather than opaque.

The authors pair ARR with Rubric Policy Optimization (RPO), which distills the structured multi-dimensional evaluation into a binary reward for training stability. On text-to-image generation and image editing benchmarks, ARR-RPO outperformed both pairwise reward models and VLM-as-judge approaches. The paper also reports substantially reduced positional bias compared to scalar reward baselines.
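The general shape of a rubric-to-binary reward can be sketched in Python as follows. This is an assumption-laden illustration, not the paper's algorithm: the rubric items are hand-written here rather than generated by a vision-language model, the pass/fail verdicts are hard-coded rather than produced by a judge, and the rule "the candidate passing more rubric items gets reward 1, ties get 0" is a hypothetical stand-in for however RPO actually distills the evaluation.

```python
from typing import Dict, List, Tuple

Rubric = List[str]          # prompt-specific quality criteria
Verdicts = Dict[str, bool]  # criterion -> did the candidate satisfy it?


def rubric_score(rubric: Rubric, verdicts: Verdicts) -> int:
    """Count how many rubric items a candidate satisfies."""
    return sum(1 for item in rubric if verdicts.get(item, False))


def binary_rewards(rubric: Rubric, a: Verdicts, b: Verdicts) -> Tuple[int, int]:
    """Assumed distillation rule: the better candidate gets 1, the other 0."""
    score_a, score_b = rubric_score(rubric, a), rubric_score(rubric, b)
    if score_a == score_b:
        return (0, 0)  # assumption: ties carry no training signal
    return (1, 0) if score_a > score_b else (0, 1)


# Hypothetical rubric for the prompt "a red bicycle leaning on a brick wall".
rubric = [
    "bicycle is red",
    "bicycle leans on a wall",
    "wall is brick",
    "no extra objects",
]
candidate_a = {"bicycle is red": True, "bicycle leans on a wall": True,
               "wall is brick": False, "no extra objects": True}
candidate_b = {"bicycle is red": True, "bicycle leans on a wall": False,
               "wall is brick": False, "no extra objects": True}

print(binary_rewards(rubric, candidate_a, candidate_b))
```

The per-item verdicts stay available for inspection, so it is possible to see why a candidate won, while the policy update itself only consumes the 0/1 signal.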

The framework supports zero-shot deployment and few-shot conditioning on minimal supervision — a practical advantage for labs that lack large preference datasets.

Wirestock Raises $23M to Supply Multimodal Training Data

Underpinning all of these model advances is demand for high-quality training data. Wirestock, which pivoted from a stock photography distribution platform to an AI data supplier in 2023, announced a $23 million Series A led by Nava Ventures, with participation from SBVP (co-founded by Sheryl Sandberg), Formula VC, and I2BF Ventures.

According to TechCrunch, Wirestock now supplies datasets of images, videos, design assets, and gaming and 3D content to six of the largest foundation model makers, which the company declined to name. Co-founder and CEO Mikayel Khachatryan said the company has annual run-rate revenue of $40 million and has paid out $15 million to its contributor base of more than 700,000 artists and designers.

The platform shifted from selling existing library content to fulfilling custom data requests — a change Khachatryan said created new income opportunities for creators and accelerated platform growth. The company was transparent about its 2023 pivot and allowed artists to opt out; Khachatryan said “the majority” of the original 100,000+ photographers chose to remain as data contributors.

The Series A capital will fund expansion of the data supply business, including annotation and labeling infrastructure.

What This Means

The multimodal AI stack is maturing at multiple layers simultaneously, and this week’s announcements illustrate where the pressure points are.

At the interface layer, Thinking Machines is making a credible architectural argument that real-time, continuous multimodal interaction requires rethinking model design — not just adding voice and vision as post-hoc capabilities. If interaction models perform as previewed, the gap between AI chat tools and natural conversation will narrow considerably.

At the inference cost layer, Perceptron’s 80–90% price reduction for video reasoning — if it holds up under independent benchmarking — would remove a significant barrier to enterprise video AI adoption. Cost has been the primary reason video understanding remains a niche capability despite obvious demand.

At the training layer, the ARR paper addresses a real and underappreciated problem: scalar reward signals are a poor fit for the multi-dimensional nature of human judgment about generated media. Structured rubric-based rewards are harder to engineer but more robust, and the zero-shot applicability makes them practically relevant beyond large labs.

And at the data layer, Wirestock’s $40M run-rate and $23M raise confirm that the market for licensed, human-generated multimodal training data is real and growing. As synthetic data debates continue, platforms with opt-in creator networks and annotation infrastructure are positioned as a durable supply source.

The common thread is that multimodal AI is no longer primarily a research capability — it is becoming an engineering and product discipline, with cost, latency, alignment reliability, and data provenance all emerging as competitive variables.

FAQ

What are interaction models from Thinking Machines?

Interaction models are AI systems designed to process continuous human input from a camera and microphone, including pauses, interruptions, and tonal shifts, in near real time rather than waiting for a discrete conversational turn to complete. Thinking Machines has announced a research preview, though the models are not yet publicly available.

How does Perceptron Mk1 compare in price to GPT-5 and Claude Sonnet 4.5?

Perceptron prices Mk1 at $0.15 per million input tokens and $1.50 per million output tokens, which it says is 80–90% below Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro for comparable video reasoning tasks. Independent verification of both the price gap and performance parity had not been published at the time of writing.

What is Auto-Rubric as Reward (ARR) in multimodal AI training?

ARR is a reward modeling framework that converts a vision-language model’s internalized preference knowledge into explicit, prompt-specific rubrics before any pairwise comparison occurs, replacing opaque scalar reward signals with independently verifiable quality dimensions. The approach is designed to reduce reward hacking and evaluation biases such as positional bias, and the authors report that, paired with Rubric Policy Optimization, it outperforms pairwise reward models and VLM-as-judge approaches on text-to-image generation and image editing benchmarks.