Multimodal AI Accelerates: Five Models Redefine Vision and Video

Five distinct multimodal AI developments landed this week, spanning a 3-billion-parameter open-source model from ByteDance Research, a near-realtime voice-and-video system from Mira Murati’s startup, and a video reasoning model priced 80–90% below Claude Sonnet 4.5 and GPT-5. Taken together, they mark a measurable shift in what vision-language models can do at low cost and low latency.

ByteDance’s Lance Packs Image, Video, and Editing into 3B Parameters

ByteDance Research published Lance, an open-source unified multimodal model, on Hugging Face. It handles image understanding, image generation, image editing, and video generation within a single framework, all with 3 billion active parameters.

Most multimodal systems split these tasks across separate specialized models or require significantly larger parameter counts to achieve competitive benchmark scores. Lance’s architecture consolidates them, which reduces deployment overhead for developers who need more than one capability.

The release surfaced in an r/singularity post that links directly to the Hugging Face repository. ByteDance Research has not yet published a full technical paper with benchmark breakdowns, but the repository describes strong performance across image generation, image editing, and video generation evaluations. The open-source release makes Lance immediately accessible to researchers and developers without API costs or access restrictions.

For the open-source community, a 3B-parameter model that spans generation and understanding is notable because it can run on consumer-grade hardware, lowering the barrier for experimentation with multimodal pipelines.
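To make the consumer-hardware point concrete, here is a back-of-the-envelope estimate of the memory needed just to hold 3 billion parameters at common precisions. It is a rough sketch under stated assumptions: it counts weights only (no activations, caches, or framework overhead), and it treats the quoted 3 billion figure as the full parameter count, which may not hold if "active" parameters are a subset of a larger total.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB (2**30 bytes)."""
    return num_params * bits_per_param / 8 / 2**30


# 3 billion parameters, as quoted for Lance; the precisions are illustrative.
NUM_PARAMS = 3e9

if __name__ == "__main__":
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

At 16-bit precision this comes to roughly 5.6 GiB of weights, which is within the memory of many consumer GPUs; real usage will be higher once activations and generation buffers are included.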

Thinking Machines Previews Realtime Multimodal Interaction

Thinking Machines, the startup founded by former OpenAI CTO Mira Murati and OpenAI co-founder John Schulman, announced a research preview of what it calls “interaction models.” Its announcement blog post describes them as a new class of native multimodal systems that treat interactivity as a core architectural feature rather than a software layer added on top.

According to VentureBeat, the system targets near-realtime voice and video conversation — processing new human inputs while still generating a response to the previous one, rather than waiting for a full turn to complete. The company reported gains on third-party benchmarks and reduced latency compared to turn-based alternatives.
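The difference from a turn-based loop can be illustrated with a toy concurrency sketch. The code below is not Thinking Machines' architecture, which has not been published; it only shows the general idea of a responder that keeps absorbing new input between output chunks. All names (`respond`, `listen`, `full_duplex_demo`) are hypothetical.

```python
import asyncio


async def respond(prompt, inbox, transcript):
    """Emit a reply chunk by chunk, absorbing new input between chunks."""
    for i in range(3):
        # Non-blocking check: did anything arrive while we were responding?
        while not inbox.empty():
            transcript.append(("heard", inbox.get_nowait()))
        transcript.append(("said", f"{prompt}:chunk{i}"))
        await asyncio.sleep(0)  # yield control so the listener can run


async def listen(inputs, inbox):
    """Stand-in for a live audio/video stream delivering inputs over time."""
    for item in inputs:
        await asyncio.sleep(0)  # simulate input arriving a little later
        await inbox.put(item)


async def full_duplex_demo():
    inbox = asyncio.Queue()
    transcript = []
    # Both coroutines run concurrently instead of taking strict turns.
    await asyncio.gather(
        respond("hello", inbox, transcript),
        listen(["wait, one more thing"], inbox),
    )
    return transcript


if __name__ == "__main__":
    for kind, text in asyncio.run(full_duplex_demo()):
        print(kind, text)
```

In the resulting transcript the "heard" entry lands before the final "said" chunk, meaning the new input was taken in mid-response. A turn-based loop would only see it after the reply had finished.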

The models are not yet publicly available. Thinking Machines said it will open a limited research preview in the coming months to collect feedback before a wider release. The framing positions the work against the standard prompt-response loop that defines most current AI interfaces, including those handling audio and video.

Thinking Machines is well funded and has moved quickly since its public launch in early 2025. This preview is its most concrete technical disclosure to date.

Perceptron Mk1 Offers Video Reasoning at a Fraction of Rival Costs

Two-year-old startup Perceptron Inc. released Mk1, a proprietary video analysis reasoning model, at $0.15 per million tokens input and $1.50 per million tokens output through its API — pricing that VentureBeat reported comes in 80–90% below Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro.
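A short calculation shows what those prices mean in practice. The sketch below uses only the two Mk1 prices quoted above. The workload size is a made-up example, and the "implied reference price" is simply what a rival would have to charge for Mk1 to sit a given percentage below it, not a quoted price from any vendor.

```python
# Mk1 API prices as quoted in the article, in USD per million tokens.
MK1_INPUT_PER_M = 0.15
MK1_OUTPUT_PER_M = 1.50


def job_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one job at the quoted Mk1 prices."""
    total = input_tokens * MK1_INPUT_PER_M + output_tokens * MK1_OUTPUT_PER_M
    return total / 1_000_000


def implied_reference_price(price: float, discount: float) -> float:
    """Price a rival would charge if `price` sits `discount` (0-1) below it."""
    return price / (1.0 - discount)


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens of video, 100k output tokens.
    print(f"job cost: ${job_cost(2_000_000, 100_000):.2f}")
    for discount in (0.8, 0.9):
        ref_in = implied_reference_price(MK1_INPUT_PER_M, discount)
        ref_out = implied_reference_price(MK1_OUTPUT_PER_M, discount)
        print(f"{discount:.0%} below implies ${ref_in:.2f} in / ${ref_out:.2f} out")
```

Under the reported 80–90% range, the implied rival prices work out to roughly $0.75 to $1.50 per million input tokens and $7.50 to $15.00 per million output tokens. Actual list prices should be checked against each vendor's pricing page.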

Co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, said the team spent 16 months building a “multi-modal recipe” from scratch to handle the physical world’s complexity — including cause-and-effect relationships, object dynamics, and physics-based reasoning.

Benchmark results focus on grounded spatial and video understanding. A public demo is available for developers and enterprise evaluators. Target use cases include live security monitoring, marketing video clipping, content inconsistency flagging, and behavioral analysis in controlled studies.

The pricing gap relative to frontier models is the headline claim. If Mk1’s benchmark performance holds up under independent evaluation, the cost differential alone makes it worth testing for video-heavy enterprise workflows.

TTE-Flash Cuts Reasoning Costs in Multimodal Embeddings

Researchers published TTE-Flash (arXiv:2605.16638), a method for improving Universal Multimodal Embedding (UME) without the computational overhead that typically comes with Chain-of-Thought reasoning.

The core problem: recent work showed that generating explicit reasoning traces before producing a multimodal embedding significantly improves representation quality. But generating those traces is expensive at inference time, making the approach impractical for production systems.

TTE-Flash replaces explicit CoT traces with latent think tokens — compact representations trained to approximate what an explicit reasoning trace would produce. The model optimizes think tokens using CoT generation loss and embedding tokens using contrastive loss, keeping inference cost constant regardless of reasoning depth.
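The two training signals described above can be sketched in plain Python. This is an illustration, not the paper's implementation: it pairs a standard next-token cross-entropy term (standing in for the CoT generation loss) with an in-batch InfoNCE term (a common form of contrastive loss). The function names, the temperature, and the weighting factor are all assumptions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def generation_loss(token_logits, target_ids):
    """Mean cross-entropy over reasoning-trace positions (think tokens)."""
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def contrastive_loss(queries, targets, temperature=0.05):
    """InfoNCE with in-batch negatives: query i should match target i."""
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, tj) / temperature for tj in t]
        total -= log_softmax(logits)[i]
    return total / len(q)


def joint_loss(gen, con, weight=1.0):
    """Assumed combination: generation term plus weighted contrastive term."""
    return gen + weight * con
```

For example, `contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is close to zero because each query already matches its own target, while swapping the two targets produces a large loss.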

The paper introduces TTE-Flash-2B, which the authors report outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark. In zero-shot evaluation across 15 video datasets, performance scaled with the number of think tokens, which motivated a pilot study of adaptive think-budget allocation based on task difficulty. The latent tokens are also interpretable, both textually and visually, which adds a transparency dimension often absent in embedding research.

Auto-Rubric Targets Reward Hacking in Multimodal RLHF

A separate arXiv paper, Auto-Rubric as Reward (ARR) (arXiv:2605.08354), addresses a persistent problem in aligning multimodal generative models: standard RLHF methods compress human preference into scalar or pairwise labels, which are easy to game and hard to interpret.

ARR reframes reward modeling by having a vision-language model externalize its internalized preference knowledge as explicit, prompt-specific rubrics before any pairwise comparison occurs. Each rubric decomposes holistic quality into independently verifiable dimensions, which the authors argue suppresses evaluation biases — including positional bias — and enables zero-shot deployment with minimal supervision.

To carry these gains into training, the paper introduces Rubric Policy Optimization (RPO), which distills ARR’s structured evaluation into a binary reward signal. On text-to-image generation and image editing benchmarks, ARR-RPO outperformed pairwise reward models and VLM judges. The authors conclude that the bottleneck in multimodal alignment is the absence of a factorized evaluation interface, not a lack of underlying model knowledge — a framing that reorients where alignment research should focus.
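The general shape of a rubric-based reward can be sketched as follows. This is a hypothetical illustration, not the paper's method: in ARR the rubric is written by a vision-language model and applied to images, whereas here each criterion is a hand-written check over a text description standing in for an image. The names `Criterion`, `rubric_score`, and `binary_reward` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    name: str
    # Hypothetical stand-in for a VLM's yes/no judgment on one dimension.
    check: Callable[[str], bool]


def rubric_score(output: str, rubric: List[Criterion]) -> float:
    """Fraction of rubric criteria the output satisfies."""
    return sum(c.check(output) for c in rubric) / len(rubric)


def binary_reward(candidate: str, reference: str, rubric: List[Criterion]) -> int:
    """1 if the candidate satisfies strictly more criteria than the reference."""
    return int(rubric_score(candidate, rubric) > rubric_score(reference, rubric))


# Example rubric for the prompt "a red cube on a wooden table".
rubric = [
    Criterion("subject is a cube", lambda d: "cube" in d),
    Criterion("cube is red", lambda d: "red cube" in d),
    Criterion("surface is a wooden table", lambda d: "wooden table" in d),
]
good = "a red cube on a wooden table"
bad = "a blue cube on a wooden table"

if __name__ == "__main__":
    print(binary_reward(good, bad, rubric), binary_reward(bad, good, rubric))
```

Because each output is scored against the rubric on its own, the result does not depend on which candidate is presented first, which is one simple way a factorized evaluation can sidestep positional bias.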

What This Means

Together, the five developments ease three constraints that have limited multimodal AI: cost, latency, and alignment reliability.

Perceptron’s 80–90% pricing gap and ByteDance’s 3B open-source model both push capable multimodal AI toward wider deployment. If Mk1’s benchmarks survive independent scrutiny, video understanding becomes economically viable for mid-market enterprises that couldn’t justify frontier model pricing. Lance’s parameter efficiency does the same for developers who need generation and editing without cloud API dependency.

Thinking Machines’ interaction model preview addresses a different constraint: the turn-based interface that makes current AI feel mechanical in voice and video contexts. Running input processing and response generation in parallel rather than in sequence is architecturally significant, though the company has not yet released the models for external evaluation.

TTE-Flash and ARR tackle the research infrastructure layer: how to build multimodal embeddings that reason efficiently, and how to align generative models without reward hacking. Both are incremental but directly applicable — TTE-Flash-2B’s constant inference cost makes reasoning-aware embeddings deployable, and ARR’s rubric framework gives alignment researchers a more auditable alternative to scalar reward models.

The open question across all five is benchmark-to-production transfer. Scores on MMEB-v2 or spatial video evaluations don’t automatically translate to real-world robustness. Independent replication and enterprise pilots will determine which of these advances holds up outside controlled conditions.

FAQ

What is a multimodal AI model?

A multimodal AI model processes and generates more than one type of data — for example, combining text, images, video, and audio within a single system. Unlike models that handle only text or only images, multimodal models can take a video clip as input and produce a text description, or take a text prompt and generate an image.

How does Perceptron Mk1’s pricing compare to GPT-5 and Claude Sonnet 4.5?

According to VentureBeat, Mk1 is priced at $0.15 per million input tokens and $1.50 per million output tokens — approximately 80–90% less than Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5, and Google’s Gemini 3.1 Pro. The comparison is based on API pricing at launch and may change as rival models update their pricing tiers.

What makes Thinking Machines’ interaction models different from standard AI chat?

Standard AI interfaces operate in a turn-based loop: the user sends input, the model finishes processing, then responds. According to Thinking Machines’ blog post, interaction models are designed to process new inputs while still generating a response to the previous one, enabling more fluid voice and video conversation. The models are not yet publicly available; the company says a limited research preview will open in the coming months.

Sources

New open source multimodal model does it all…with only 3b parameters – Reddit Singularity
TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens – arXiv AI
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria – arXiv AI
Thinking Machines shows off preview of near-realtime AI voice and video conversation with new ‘interaction models’ – VentureBeat
Perceptron Mk1 shocks with highly performant video analysis AI model 80-90% cheaper than Anthropic, OpenAI & Google – VentureBeat