Key takeaways
- AI video generators extend image diffusion to the time dimension, denoising a sequence of frames jointly so objects persist coherently across them.
- The current frontier — OpenAI’s Sora, Google’s Veo, Runway Gen-3, Kuaishou’s Kling, and Pika — produces short clips (typically 5-20 seconds) with usable quality for many scenarios.
- Most leading systems use diffusion transformers (DiTs) operating on compressed video latents, an architecture introduced in Peebles and Xie's 2022 paper "Scalable Diffusion Models with Transformers".
- Core open problems: temporal consistency over long durations, physical plausibility, controllable character identity, and the compute cost of each generation.
- Deepfake misuse is already a measurable harm; provenance standards (C2PA) and platform-level detection lag behind generation quality.
What video generation actually is
A text-to-video model takes a prompt — and optionally a seed image, a reference style, or a short driving clip — and returns a sequence of frames. Internally, most current systems treat the full clip as a single high-dimensional tensor (time x height x width x channels), compress it through a 3D autoencoder into a smaller latent, and run diffusion on that latent. A separate decoder maps the final latent back to pixels.
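The shape arithmetic in that pipeline can be sketched in a few lines. The downsampling factors and latent channel count below are illustrative assumptions, not any specific model's numbers:

```python
# Toy shape arithmetic for the compress-then-diffuse pipeline.
# Assumed factors: 4x temporal, 8x spatial downsampling, 16 latent channels.

def latent_shape(frames, height, width,
                 t_down=4, s_down=8, latent_channels=16):
    """Shape of the compressed latent a 3D autoencoder might produce."""
    return (frames // t_down, height // s_down, width // s_down,
            latent_channels)

def numel(shape):
    n = 1
    for d in shape:
        n *= d
    return n

# A 5-second 720p clip at 24 fps:
video = (120, 720, 1280, 3)           # (time, height, width, channels)
latent = latent_shape(*video[:3])     # -> (30, 90, 160, 16)

print(numel(video))                   # 331776000 pixel values
print(numel(latent))                  # 6912000 latent values
print(f"{numel(video) / numel(latent):.0f}x compression")  # 48x
```

Diffusion then runs on the small latent tensor, and the decoder expands the result back to the full pixel grid.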

The time dimension is what makes video hard. In a still image the model has to produce something plausible once. In video it has to produce something plausible that stays plausible as a physical world unfolds — the same person has to keep the same face, a ball dropped in frame three has to still obey gravity in frame sixty. For the underlying generative family, see the diffusion models primer.
Diffusion transformers, not U-Nets
Early image diffusion models (Stable Diffusion 1.x, 2.x) used convolutional U-Net backbones, and Stability's Stable Video Diffusion extended that design with temporal layers. Most 2024-2026 video systems instead use diffusion transformers: the DiT architecture replaces the U-Net with a transformer operating on tokenized latents. Transformers scale more predictably with compute and handle long sequences better, which matters when a 10-second 720p clip expands to tens of thousands of latent tokens. OpenAI's technical report on Sora points explicitly in this direction.
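Where that token count comes from can be sketched with rough arithmetic. The compression factors and patch size here are illustrative assumptions:

```python
# Rough token count for a DiT over compressed video latents.
# Assumed: 4x temporal / 8x spatial autoencoder downsampling,
# then 2x2x2 spacetime patches, one token per patch.

def dit_tokens(seconds, fps, height, width,
               t_down=4, s_down=8, patch=2):
    """Count spacetime patches (transformer tokens) for a clip."""
    t = (seconds * fps) // t_down           # latent frames
    h, w = height // s_down, width // s_down
    return (t // patch) * (h // patch) * (w // patch)

# A 10-second 720p clip at 24 fps:
print(dit_tokens(10, 24, 720, 1280))        # 108000 tokens, order 10^5
```

Exact counts vary with each model's autoencoder and patch size, but sequences of this length are why attention cost and memory dominate video inference.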
The frontier models
Sora (OpenAI)
Announced in February 2024 and released to ChatGPT Plus and Pro users in December 2024, Sora was demoed with clips of up to around a minute; the shipped product generates shorter clips (up to 20 seconds at launch), with resolution options up to 1080p. OpenAI describes it as a diffusion transformer operating on spacetime patches. The December 2024 consumer release exposed the tradeoff between advertised demos and daily-use quality — users reported strong hero shots but routine physics and anatomy failures.
Runway Gen-3 Alpha
Runway released Gen-3 Alpha in mid-2024, focused on filmmaker workflows. Its differentiator is control: camera moves, image-to-video with a start frame, video-to-video restyling, and a timeline editor. Clips are typically 5-10 seconds. Runway has partnerships with major studios and positions itself closer to post-production than to consumer novelty.
Veo (Google DeepMind)
Veo, announced at Google I/O 2024 and expanded through 2025, targets 1080p and beyond with clips reported at over a minute. Veo is integrated into Google’s Vertex AI and YouTube Shorts tooling. It benefits from Google’s in-house training infrastructure and the video corpus its ecosystem touches.
Kling (Kuaishou)
Kling, released by Chinese short-video platform Kuaishou in mid-2024, was the first widely available model to produce 2-minute clips and was noted for relatively strong physical motion. Its global availability and English-prompt support arrived shortly after launch. Kling illustrated that Chinese labs were at or near the frontier of video generation, not trailing it.
Pika
Pika Labs targets consumer creators with a web app and Discord-based workflow. Its feature set emphasizes short, remixable clips, lip sync, and “Pikaffects” — prompt-driven visual transformations of an input clip. It competes on iteration speed and price point rather than raw fidelity.
Temporal consistency: the defining challenge
Text-to-image models have to produce a single coherent frame. Text-to-video has to produce dozens to hundreds of frames that are each coherent and consistent with each other. Several failure modes recur across every current system.
Identity drift
A character’s face, hair, or clothing shifts subtly between frames. Features morph over several seconds. This is why most production workflows use short clips (under 10 seconds) and stitch them — the model is asked to stay consistent only over a window short enough to avoid visible drift.
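The stitching workflow can be sketched as a loop that seeds each segment with the previous segment's last frame, so the model only needs consistency inside a short window. `generate_clip` below is a stand-in stub for any image-conditioned video API, not a real library call:

```python
# Sketch of clip stitching: chain short generations, seeding each
# with the last frame of the previous clip. Frames are faked as
# integers here; a real pipeline would pass image tensors.

def generate_clip(prompt, seed_frame=None, n_frames=48):
    # Stub: a real call would hit a hosted or local video model.
    start = seed_frame if seed_frame is not None else 0
    return list(range(start, start + n_frames))

def stitch(prompt, segments=4, n_frames=48):
    frames = []
    seed = None
    for _ in range(segments):
        clip = generate_clip(prompt, seed_frame=seed, n_frames=n_frames)
        if frames:
            clip = clip[1:]        # drop the duplicated seed frame at the join
        frames.extend(clip)
        seed = frames[-1]
    return frames

video = stitch("a cat walking through a garden")
print(len(video))                  # 48 + 3*47 = 189 frames
```

Seeding on a single frame carries forward composition and pose but not everything the drift problem involves; identity can still shift across joins, which is why production tools add reference-image conditioning on top of this.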
Physics violations
Objects pass through each other, hands grasp nothing, liquid pours upward, a cat walks through a door that has not opened. The model learned the surface statistics of motion from training video but did not acquire a physics simulator. OpenAI's Sora technical report acknowledges this explicitly, framing the model as a step toward general-purpose world simulators while listing failures of physical interaction and cause-and-effect among its limitations.
Long-range coherence
Beyond roughly 20 seconds, most current systems start losing the thread: scenes re-compose, subjects multiply, lighting drifts. Architectures that explicitly model longer context (hierarchical diffusion, autoregressive chaining of short clips) are active research, not solved products.
What it costs
Video generation is one of the most compute-intensive tasks a consumer can trigger against a hosted API. A single 10-second 1080p generation from a frontier model runs dozens of denoising steps, each a full pass over a 3D latent of tens to hundreds of thousands of tokens. Reported public pricing puts typical costs in the range of several cents to several dollars per clip, depending on resolution, length, and tier.
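A back-of-envelope cost estimate under hypothetical per-second rates looks like this; the price table is a placeholder for illustration, not any vendor's actual pricing:

```python
# Illustrative per-clip cost arithmetic. The rates are assumptions
# chosen to land in the publicly reported cents-to-dollars range.

PRICE_PER_SECOND = {              # USD per generated second (assumed)
    "draft_480p": 0.02,
    "standard_720p": 0.10,
    "premium_1080p": 0.50,
}

def clip_cost(seconds, tier):
    return round(seconds * PRICE_PER_SECOND[tier], 2)

for tier in PRICE_PER_SECOND:
    print(tier, clip_cost(10, tier))   # 0.2 / 1.0 / 5.0 USD per 10s clip
```

The spread matters in practice: iterating ten times on a premium-tier clip costs two orders of magnitude more than the same iteration loop at draft quality.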
Self-hosting is possible for open-weights video models (CogVideoX, Mochi-1, HunyuanVideo, Wan 2.1) but requires high-end GPUs — 24 GB VRAM minimum for low resolutions, and much more for 720p+ output at reasonable speeds. For individuals, cloud inference is usually the only practical option; for production studios, dedicated inference clusters are emerging. The broader economics of hosted generative services are covered in the generative ai primer.
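Those VRAM figures can be ballparked by a crude rule of thumb: fp16 weights take two bytes per parameter, plus a margin for activations and KV caches. These are planning figures under stated assumptions, not measurements of any specific model:

```python
# Crude VRAM estimate for fp16 inference: parameter memory times a
# rough activation/cache margin. The 1.5x overhead is an assumption;
# real usage varies with resolution, clip length, and attention impl.

def vram_gb(params_billion, activation_overhead=1.5):
    weights_gb = params_billion * 2          # fp16 = 2 bytes/param
    return weights_gb * activation_overhead

for p in (2, 5, 13):
    print(f"{p}B params -> ~{vram_gb(p):.0f} GB VRAM")
```

By this arithmetic a 2B-parameter model fits comfortably on a 24 GB card, while a 13B-parameter model already wants a 40-80 GB professional GPU, consistent with the requirements above.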
Where it works today
Short-form social and marketing
Product B-roll, mood clips, stylized transitions, animated explainer moments. Lengths of 3-10 seconds, moderate quality tolerance, high volume. AI video is already economical here compared to stock footage or custom shoots.
Previsualization and concept work
Directors, advertisers, and game designers use generated clips to sketch scenes before committing to shoots or full animation. The model does not need to produce final footage — it needs to communicate an intended shot quickly.
Image-to-video animation
A still photo or concept illustration animated into a brief clip. This works better than pure text-to-video because the first frame is already grounded — the model only has to extrapolate motion.
Video-to-video restyling
Runway Gen-3 and several open-source pipelines allow restyling an input clip — preserving motion while changing appearance. This hybrid workflow inherits the real motion from the source footage, sidestepping the physics problem.
Deepfakes and the provenance problem
The same capability that animates a product photo can animate a real person. Deepfake harms are already measurable: non-consensual sexual imagery, political disinformation, and voice-plus-video scams targeting executives and elderly relatives. Leading commercial models block obvious name prompts and identity injections, but open-weights checkpoints can be fine-tuned with a handful of images of a target.
Response mechanisms are partial. C2PA content provenance standards let publishers sign media at source, and watermarking schemes (SynthID, invisible diffusion watermarks) can flag generated output — but both degrade under re-encoding, cropping, or screen-recording. Detection classifiers lag generation by design: every new model shifts the distribution detectors are trained against. Platform policy, rather than pure technical defense, is likely to remain the primary brake. Broader coverage in the ai industry section tracks how major labs and regulators are responding.
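The sign-at-source idea, and why re-encoding defeats byte-level provenance, can be illustrated with a toy signature over the media bytes. HMAC here is a stand-in for pedagogy; real C2PA uses signed manifests with certificate chains, not a shared secret:

```python
# Toy provenance check: bind a signature to the exact media bytes.
# Any transformation of those bytes (re-encode, crop, screen-record)
# produces a different byte stream, so verification fails even when
# the content is visually identical.

import hashlib
import hmac

SECRET = b"publisher-signing-key"   # stand-in for a real signing key

def sign(media: bytes) -> str:
    return hmac.new(SECRET, media, hashlib.sha256).hexdigest()

def verify(media: bytes, sig: str) -> bool:
    return hmac.compare_digest(sign(media), sig)

original = b"\x00\x01fake-video-bytes"
sig = sign(original)
print(verify(original, sig))                   # True
print(verify(original + b"re-encoded", sig))   # False: any byte change voids it
```

Robust watermarks try to survive such transformations by embedding the signal in the content itself rather than the container, which is exactly the property byte-level signatures lack.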
Frequently asked questions
Can AI video replace a film crew?
Not for narrative feature work, and not soon. Current models produce clips of limited duration, with identity drift that breaks continuity over a multi-shot scene, and with no shot-to-shot director control beyond prompt engineering. What they can replace or supplement today is stock footage, B-roll, quick previsualization, simple explainer animations, and specific stylized shots. A practical 2026 production workflow blends AI-generated inserts with live-action and traditional VFX rather than substituting for them.
What hardware do I need to run a video model locally?
Open-weights video models such as CogVideoX, Mochi-1, HunyuanVideo, and Wan 2.1 are runnable on consumer GPUs but with meaningful tradeoffs. A 24 GB GPU (RTX 3090, 4090, or similar) can generate low-resolution short clips in minutes per clip. 720p output at tolerable speeds typically requires a 40-80 GB professional card. CPU-only inference is impractical. Most hobbyists run these models via cloud GPU rentals or accept the quality-per-dollar tradeoff of hosted commercial APIs.
How long until generated video is indistinguishable from real footage?
For a curated short clip under tightly controlled prompts, frontier models can already pass casual inspection. For arbitrary scenes with humans, complex physical interaction, and long duration, a confident timeline is not honest. Public demos are cherry-picked; daily-use quality from the same models is visibly lower. The gap between best-case and average-case output is the useful signal. Provenance infrastructure and platform-level detection will likely matter more than a specific “indistinguishable” milestone, because indistinguishability already exists for narrow slices of content.