
Multimodal AI Advances on Four Fronts in May 2025

Synthesized from 5 sources

Four developments in May 2025 illustrate how quickly multimodal AI is moving beyond single-input chatbots: a new reward-modeling framework for vision-language alignment, Thinking Machines Lab’s real-time interaction models, Perceptron’s video reasoning model priced 80–90% below major rivals, and a $23 million Series A for multimodal training-data supplier Wirestock.

Thinking Machines Previews Real-Time Interaction Models

Thinking Machines Lab — the startup founded by former OpenAI CTO Mira Murati and former OpenAI researcher John Schulman — announced a research preview of what it calls “interaction models” this week. According to Thinking Machines’ announcement blog post, these are native multimodal systems that treat interactivity as a first-class architectural property rather than an add-on software layer.

The practical difference matters. Most current voice and video AI systems capture input, transcribe it, then feed the transcript into a language model — a sequential pipeline that introduces latency and loses parallelism. Thinking Machines’ models are designed to process incoming audio and video continuously, allowing the system to begin responding while still receiving the next human input.
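
To make the distinction concrete, the sketch below contrasts the two designs in Python. The helper functions and the chunked input are illustrative placeholders, not Thinking Machines’ implementation; the point is simply that a streaming loop can emit and revise a reply while input is still arriving, whereas a turn-based pipeline must wait for the full utterance.

```python
# Illustrative sketch only; not Thinking Machines' architecture.
# Contrasts a turn-based pipeline (capture -> transcribe -> respond)
# with a streaming loop that starts responding before input ends.
from typing import Iterator


def transcribe(audio_chunk: str) -> str:
    """Stand-in for a speech-to-text call (hypothetical)."""
    return audio_chunk


def generate_reply(text: str) -> str:
    """Stand-in for a language-model call (hypothetical)."""
    return f"[reply to: {text!r}]"


def turn_based(audio_chunks: list[str]) -> str:
    # Latency = full utterance + transcription + one generation pass.
    transcript = " ".join(transcribe(c) for c in audio_chunks)
    return generate_reply(transcript)


def streaming(audio_chunks: Iterator[str]) -> Iterator[str]:
    # Emits an interim reply after every chunk; a later chunk
    # (e.g. an interruption) can change the response mid-stream.
    context = ""
    for chunk in audio_chunks:
        context += transcribe(chunk) + " "
        yield generate_reply(context)


if __name__ == "__main__":
    chunks = ["book a table", "for two", "actually make it three"]
    print(turn_based(chunks))
    for partial in streaming(iter(chunks)):
        print(partial)
```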

Wired reported that the models natively understand pauses, interruptions, and shifts in tone — not just words — enabling them to adapt mid-response when a speaker changes direction. Murati told Wired: “At some point we will have super-intelligent machines. But we think that the best way to actually have many possible futures — good futures — is to keep humans in the loop.”

The models are not yet publicly available. Thinking Machines said it will open a limited research preview in coming months to collect feedback before a wider release.

Perceptron Mk1 Targets Enterprise Video Analysis at Discount Pricing

Two-year-old startup Perceptron Inc. released its flagship video analysis reasoning model, Mk1, this week at API pricing of $0.15 per million input tokens and $1.50 per million output tokens. According to Perceptron’s launch announcement, that positions Mk1 roughly 80–90% cheaper than Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro on comparable tasks.
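
For a sense of what those rates mean in practice, the back-of-the-envelope calculation below applies the quoted Mk1 prices to a hypothetical workload. The per-clip token counts are assumptions chosen for illustration, and the implied rival range simply inverts the 80–90% discount claim rather than quoting any competitor’s published price list.

```python
# Cost sketch using the Mk1 API prices quoted in the launch announcement.
# Workload sizes below are illustrative assumptions, not measured figures.
MK1_INPUT_PER_M = 0.15   # USD per million input tokens
MK1_OUTPUT_PER_M = 1.50  # USD per million output tokens


def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float, out_rate: float) -> float:
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate


# Hypothetical batch: 1,000 clips, ~200k input tokens and ~2k output tokens each.
clips, in_tok, out_tok = 1_000, 200_000, 2_000
mk1_total = clips * cost_usd(in_tok, out_tok, MK1_INPUT_PER_M, MK1_OUTPUT_PER_M)
print(f"Mk1 estimate: ${mk1_total:,.2f}")  # about $33 for the whole batch

# "80-90% cheaper" implies rivals would charge roughly 5-10x as much.
print(f"Implied rival range: ${mk1_total / 0.20:,.2f} - ${mk1_total / 0.10:,.2f}")
```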

Co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, said the company spent 16 months developing what it describes as a “multi-modal recipe” built from the ground up to handle physical-world complexity — including cause-and-effect reasoning, object dynamics, and physics-grounded understanding.

Perceptron’s target use cases span enterprise security monitoring, marketing video editing, inconsistency detection in recorded content, and behavioral analysis in controlled studies. The company benchmarked Mk1 against spatial and video-grounded understanding tasks, though independent third-party validation of those results has not yet been published.

A public demo is available for prospective users and enterprise customers.

ARR Framework Tackles Multimodal Alignment with Explicit Rubrics

On the research side, a team published a paper on arXiv introducing Auto-Rubric as Reward (ARR), a framework that reframes reward modeling for multimodal generative systems. According to the paper, standard reinforcement learning from human feedback (RLHF) collapses complex, multi-dimensional human preferences into scalar or pairwise labels — a compression that both loses nuance and creates openings for reward hacking.

ARR takes a different approach: before any pairwise comparison, the framework externalizes a vision-language model’s internalized preference knowledge as explicit, prompt-specific rubrics. Each rubric breaks holistic intent into independently verifiable quality dimensions.

The paper also introduces Rubric Policy Optimization (RPO), which distills ARR’s structured evaluation into a binary reward signal, replacing opaque scalar regression with rubric-conditioned preference decisions. The authors report that ARR-RPO outperforms both pairwise reward models and VLM judges on text-to-image generation and image editing benchmarks, and that the framework supports zero-shot deployment as well as few-shot conditioning on minimal labeled data.
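
The sketch below shows, in schematic form, how a rubric-conditioned binary preference might be computed. The rubric items, the per-item check, and the simple count-based aggregation are assumptions made for illustration; the paper defines its own rubric generation procedure and RPO reward, which this toy example does not reproduce.

```python
# Toy illustration of rubric-conditioned preference, not the ARR/RPO method itself.
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str  # one independently verifiable quality dimension


def check(item: RubricItem, candidate: str) -> bool:
    """Stand-in for a VLM judging a single rubric item (hypothetical)."""
    return item.description.lower() in candidate.lower()


def rubric_preference(rubric: list[RubricItem], cand_a: str, cand_b: str) -> int:
    """Binary reward signal: 1 if candidate A satisfies more rubric items than B."""
    score_a = sum(check(item, cand_a) for item in rubric)
    score_b = sum(check(item, cand_b) for item in rubric)
    return 1 if score_a > score_b else 0


if __name__ == "__main__":
    rubric = [RubricItem("red car"), RubricItem("rainy street"), RubricItem("night")]
    a = "A red car parked on a rainy street at night"
    b = "A blue car on a sunny street"
    print(rubric_preference(rubric, a, b))  # 1 -> prefer candidate A
```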

The key claim: the bottleneck in multimodal alignment is not a knowledge deficit in current models but the absence of a factorized interface for externalizing that knowledge.

Wirestock Raises $23M to Supply Multimodal Training Data

TechCrunch reported that Wirestock closed a $23 million Series A led by Nava Ventures, with participation from SBVP (co-founded by Sheryl Sandberg), Formula VC, and I2BF Ventures. The round funds Wirestock’s pivot from a stock photography distribution platform to a multimodal training-data supplier for AI labs.

The company now provides datasets covering images, videos, design assets, and gaming and 3D content to what co-founder and CEO Mikayel Khachatryan described as six of the largest foundation model makers — though he declined to name them. Wirestock’s platform has signed up more than 700,000 artists and designers who complete data collection tasks, and the company currently reports an annual run-rate revenue of $40 million, with $15 million paid out to contributors to date.

Khachatryan told TechCrunch that early deals involved licensing existing library content, but demand shifted toward custom data requests — a change he says created new income opportunities for creators on the platform. Wirestock said it was transparent about its 2023 pivot and allowed artists to opt out of the data supply business.

The round signals continued demand from frontier model developers for high-quality, rights-cleared multimodal datasets, particularly as models are expected to handle video, 3D, and design content alongside text and static images.

What This Means

Taken together, these four developments point to a maturing multimodal stack — one where the bottlenecks are shifting from “can the model handle multiple modalities” to more granular questions about latency, cost, alignment quality, and training data provenance.

Thinking Machines’ interaction models represent the most architecturally ambitious claim: that real-time, continuous multimodal interaction requires rethinking model architecture rather than patching pipelines. If the research preview bears out the company’s claims about handling pauses, interruptions, and mid-response shifts, it could pressure larger labs to revisit how they handle streaming input.

Perceptron’s pricing strategy is a direct challenge to the incumbents. An 80–90% cost reduction on video analysis — if benchmark performance holds under independent scrutiny — would make video AI viable for mid-market enterprises that currently find GPT-5 or Gemini 3.1 Pro cost-prohibitive for high-volume video workloads.

The ARR paper addresses a less visible but structurally important problem: how to align multimodal generative models without losing the dimensional richness of human preferences. As text-to-image and video generation models become more capable, alignment methods that can handle compositional, multi-attribute outputs will matter more — and the ARR approach’s zero-shot capability is notable for deployment efficiency.

Wirestock’s raise reflects the upstream constraint: all of these models need training data, and the supply of rights-cleared, labeled multimodal content is finite. A platform paying out $15 million to 700,000 contributors while running $40 million in annual revenue suggests the data supply market is real, not theoretical.

FAQ

What are interaction models from Thinking Machines Lab?

Interaction models are a class of native multimodal AI systems that process continuous audio and video input rather than capturing, transcribing, and then processing it in sequential steps. According to Thinking Machines Lab’s announcement, they are designed to understand pauses, interruptions, and tone shifts in real time, allowing the model to respond while still receiving new input.

How does Perceptron Mk1’s pricing compare to GPT-5 and Claude Sonnet 4.5?

Perceptron priced Mk1 at $0.15 per million input tokens and $1.50 per million output tokens via API. The company states this is 80–90% below comparable pricing from Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro, though independent cost-per-task comparisons have not yet been published.

What is Auto-Rubric as Reward (ARR) in multimodal AI alignment?

ARR is a reward-modeling framework described in a May 2025 arXiv paper that converts a vision-language model’s implicit preference knowledge into explicit, prompt-specific rubrics before any pairwise comparison is made. The paper argues this approach reduces evaluation biases and supports more reliable, data-efficient alignment than standard scalar or pairwise RLHF methods.

Sources

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.