AI Speech Recognition: How ASR Converts Voice to Text

Key takeaways

  • Automatic speech recognition (ASR) converts spoken audio into written text — a long-standing AI goal now widely deployed.
  • Modern ASR is neural, dominated by end-to-end models that replaced earlier hybrid acoustic-model + language-model pipelines.
  • OpenAI’s Whisper (2022) democratized high-quality multilingual ASR — trained on 680,000 hours of weakly supervised audio, available as open-source.
  • Word error rates on clean English are now commonly below 5%; accented speech, noisy environments, and low-resource languages still show much higher error rates.
  • Real-time streaming ASR, on-device models, and specialized medical/legal dictation have become major deployment areas.

Why ASR was hard

Speech carries many overlapping sources of variability. The same words sound different depending on accent, speaker, speaking rate, emotion, microphone, room acoustics, and background noise. The same acoustic pattern can come from different words (“ice cream” and “I scream”). Meaning depends heavily on context — “rain/reign”, “write/right”, “flour/flower”.

Smart speaker, a common deployment of speech recognition
Photo by William Bradshaw on Pexels

For decades, ASR used statistical hybrid systems: acoustic models (hidden Markov models with Gaussian mixtures) combined with language models (n-gram statistics) and pronunciation dictionaries. Performance improved steadily but hit a ceiling in the early 2010s. The transition to deep learning broke through that ceiling.

The neural ASR transition

Hybrid DNN-HMM (early 2010s)

Deep neural networks replaced Gaussian mixtures in the acoustic-model stage, and error rates dropped meaningfully. These were the first big deep-learning ASR wins in commercial deployments, with Microsoft leading in 2011-12 and Google following.

End-to-end ASR (mid 2010s)

Approaches like connectionist temporal classification (CTC), Listen-Attend-Spell, and the RNN-transducer unified the pipeline: a single neural network maps raw audio (or spectrograms) directly to text. These systems are simpler than hybrid pipelines and can match or exceed them when trained on large datasets.
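To see how a CTC-style model's frame-level outputs become text, consider greedy CTC decoding: merge consecutive repeated labels, then drop the blank token. A minimal sketch (the frame labels below are invented for illustration):

```python
def ctc_collapse(frame_labels, blank="_"):
    """Greedy CTC decoding: merge runs of the same label, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)

# Hypothetical per-frame predictions for the word "cat":
frames = ["c", "c", "_", "a", "a", "_", "_", "t"]
print(ctc_collapse(frames))  # -> "cat"
```

The blank token is what lets the model emit genuinely doubled letters: "hel_lo" collapses to "hello", whereas "hello" without a blank would collapse the two l's into one.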

Transformer ASR (late 2010s-present)

Attention-based transformers (see our transformers primer) took over, first in hybrid roles and then as end-to-end encoders. Conformer (a convolution-augmented transformer) became a dominant ASR architecture around 2020.

Large-scale weakly-supervised models

OpenAI’s Whisper trained a single multilingual model on 680,000 hours of audio scraped from the internet with imperfect transcripts. The result: robust speech recognition, translation, and language identification in a single model. Whisper set a new standard for accessible high-quality ASR and kicked off a wave of similar models (Meta’s MMS, NVIDIA NeMo’s Canary, Distil-Whisper, Parakeet).

Accuracy reality

On clean English speech, modern ASR achieves word error rates (WER) below 5% in many benchmarks — approaching or matching human transcription accuracy. This is a dramatic improvement over the 20%+ WER that was standard a decade ago.
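WER itself is simple to state: the word-level edit distance (substitutions + insertions + deletions) between reference and hypothesis, divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") in a six-word reference:
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # -> 1/6
```

Note that WER can exceed 100% when the hypothesis inserts many spurious words, which is why it is an error rate rather than an accuracy percentage.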

Performance degrades on:

  • Accented speech. Accents underrepresented in training data see elevated error rates, sometimes substantially.
  • Low-resource languages. Major European languages have good coverage; many smaller languages much less so.
  • Domain vocabulary. Medical, legal, and technical terms are transcribed with high error rates unless the model gets domain-specific fine-tuning.
  • Background noise. Street noise, music, multiple speakers, distant microphones all hurt accuracy.
  • Overlapping speakers. Multi-speaker conversation with interruptions remains one of the hardest regimes.

The Stanford AI Index and similar reports have documented the overall progress. Google, AWS, Azure, and Speechmatics publish competitive WER numbers on commercial benchmarks, though like all benchmarks these need to be evaluated against your specific use case.

Streaming vs batch ASR

Batch ASR

Process a complete audio file offline. Best accuracy — the model sees the whole utterance. Used for transcription, audio captioning, archival search, and subtitling. Whisper and similar models are typical batch ASR tools.

Streaming ASR

Produce text as audio arrives, with minimal latency. Required for voice assistants, live captioning, real-time conversation. More constrained than batch — the model has to commit to early predictions without seeing future audio. Specialized streaming architectures (RNN-T, streaming Conformer) trade a small accuracy cost for real-time capability.
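One common way streaming systems balance latency against revisions is to emit only the prefix that has been stable across the last few partial hypotheses. A toy sketch of that heuristic (the function name, threshold, and example partials are illustrative, not from any particular product):

```python
def stable_prefix(partials, k=2):
    """Return the longest word prefix shared by the last k partial hypotheses.

    A streaming recognizer can commit (display) this prefix, since later
    audio is unlikely to revise words that k successive hypotheses agree on."""
    recent = [p.split() for p in partials[-k:]]
    if len(recent) < k:
        return []  # not enough evidence yet to commit anything
    committed = []
    for words in zip(*recent):
        if all(w == words[0] for w in words):
            committed.append(words[0])
        else:
            break
    return committed

# Partial hypotheses arriving as audio streams in (hypothetical):
partials = ["turn", "turn off", "turn off the", "turn of the lights",
            "turn off the lights"]
print(stable_prefix(partials, k=2))  # -> ['turn']
```

The example shows the core streaming tension: "of" versus "off" is still flickering between hypotheses, so only "turn" can be committed without risking a visible correction.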

Deployment dimensions

Cloud ASR

Google Cloud Speech-to-Text, Azure Speech, AWS Transcribe, AssemblyAI, Deepgram, Speechmatics. High accuracy, scale, many languages. Highest cost per hour of audio. Privacy concerns for sensitive domains.

On-device ASR

Apple and Google have shipped capable on-device ASR for years. Whisper-based on-device models (Distil-Whisper, Core ML variants) now run well on laptops and high-end phones. The privacy advantages are real, and accuracy is close to cloud quality for major languages.

Domain-tuned ASR

Medical dictation (Nuance DAX, Abridge), legal dictation, aviation communication, radiology — specialized models fine-tuned on domain data outperform generic ASR substantially for terminology-heavy speech.

Major applications

Voice assistants

Siri, Alexa, Google Assistant, and the newer conversational AI voice modes (OpenAI Advanced Voice Mode, ElevenLabs Conversational, Google Gemini Live) all depend on ASR as the entry point. Real-time performance and multilingual handling are the current frontiers.

Medical dictation

AI scribing has become one of the fastest-growing healthcare AI categories, saving clinicians 30-60 minutes per day on documentation. Products from Abridge, Nuance (DAX), Ambience, and Suki run at scale across US hospitals.

Meeting and call transcription

Fireflies, Otter, Fathom, Zoom AI Companion, Microsoft Teams Copilot. Summarization layered on top of ASR provides post-meeting notes and action items.

Accessibility

Real-time captioning for deaf and hard-of-hearing users. YouTube, TikTok, and live-broadcast services auto-caption. Apple, Google, and Microsoft ship live caption features on devices.

Contact centers

Call transcription for compliance, agent coaching, and analytics. Verint, Observe.AI, CallMiner, Gong are representative vendors.

Ongoing challenges

Fairness across accents and dialects remains an active research area. Speaker diarization (who said what in a multi-speaker conversation) is separate from core ASR and still error-prone. Punctuation, capitalization, and speaker disfluencies (ums, uhs) need post-processing that varies by use case. Code-switching (mixing languages mid-sentence) challenges most models. See our natural language processing primer for the broader text side.
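As a small example of the post-processing mentioned above, a transcript cleanup step might strip filler disfluencies before summarization. A minimal sketch; the filler list is illustrative and use-case dependent (a verbatim legal transcript would keep every "um"):

```python
import re

# Fillers to remove, plus any commas that set them off. Illustrative list only.
FILLER = re.compile(r",?\s*\b(?:um+|uh+|erm+|you know)\b,?", re.IGNORECASE)

def strip_disfluencies(transcript: str) -> str:
    """Remove filler words and surrounding commas, then tidy spacing."""
    cleaned = FILLER.sub("", transcript)
    cleaned = re.sub(r"\s{2,}", " ", cleaned)  # collapse doubled spaces
    return cleaned.strip()

print(strip_disfluencies("So, um, the uh quarterly numbers are, you know, up."))
# -> "So the quarterly numbers are up."
```

The word boundaries (`\b`) matter: without them the pattern would eat the "um" inside words like "umbrella".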

Whisper’s impact

OpenAI’s decision to open-source Whisper had outsized impact on the ASR landscape. Prior to Whisper, high-quality ASR was a commercial product — expensive, opaque, and often restricted by API licensing. Whisper put competitive ASR in the hands of anyone who could download a file, spawning a wave of applications — podcast transcription, video-captioning tools, assistive technologies — that might not have existed otherwise. It also drove commercial ASR prices down industry-wide. For the broader deep-learning context, see our deep learning primer.

Frequently asked questions

Is ASR accurate enough for legal or medical use?
Depending on domain and workflow, increasingly yes — but always paired with human review for high-stakes outputs. Medical dictation platforms show clinicians a draft that they edit before signing. Legal transcription uses AI-assisted workflows where humans verify. Pure autonomous ASR without review is risky in these domains because error rates on technical vocabulary and unusual names are higher than general speech.

Why does ASR struggle with my accent?
Because training data under-represents it. Major providers have expanded accent coverage substantially — Google, Microsoft, and Amazon all now advertise coverage of many English accents and dozens of languages — but non-native English, non-standard dialects, and minority languages still see higher error rates. Fine-tuning a model on your specific accent or dialect, or using providers known for multilingual strength (Speechmatics, Deepgram Nova), can help.

Can ASR handle multiple speakers?
With caveats. Transcription of multi-speaker audio works well when speakers take clear turns. Overlapping speech, cross-talk, and close-mic recordings of different speakers all challenge current systems. Speaker diarization (labelling who said what) is a separate step and typically has higher error rates than the transcription itself. For meetings and interviews, modern tools handle this reasonably; for courtrooms, policy briefings, or debate-style audio, professional transcribers still often outperform AI.

Digital Mind News

Digital Mind News is an AI-operated newsroom. Every article here is synthesized from multiple trusted external sources by our automated pipeline, then checked before publication. We disclose our AI authorship openly because transparency is part of the product.