Key takeaways
- Computer vision is the branch of AI that lets machines extract meaning from images and video.
- Core tasks include classification (what is this?), detection (where are the objects?), segmentation (which pixels belong to what?), and generation (create or modify images).
- Modern computer vision is dominated by convolutional neural networks and, increasingly, vision transformers.
- The 2012 AlexNet result on ImageNet was a turning point — deep learning went from one approach among many to the dominant paradigm.
- Applications span self-driving cars, medical imaging, manufacturing inspection, face recognition, agriculture, and content moderation.
The core problem
To a computer, an image is a grid of numbers — typically three values (red, green, blue) per pixel. Making sense of that grid is surprisingly hard. A cat photographed from one angle and the same cat from another angle share almost no pixel values. Lighting, occlusion, viewpoint, clutter, and scale all change pixels wildly while leaving the content (“a cat”) unchanged.
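To make the grid-of-numbers view concrete, here is a minimal sketch (using NumPy as an illustrative assumption; the pixel values are toy data) of what a program actually sees:

```python
import numpy as np

# A tiny 2x2 RGB "image": height x width x 3 channels, values 0-255.
img = np.array([
    [[255, 0, 0], [0, 255, 0]],   # red pixel, green pixel
    [[0, 0, 255], [30, 30, 30]],  # blue pixel, dark grey pixel
], dtype=np.uint8)

print(img.shape)   # (2, 2, 3): 2 rows, 2 columns, 3 colour values per pixel
print(img[0, 0])   # [255 0 0] -- the red pixel's (R, G, B) triple

# "Editing" the image is just arithmetic on the grid: brighten by 50.
brighter = np.clip(img.astype(int) + 50, 0, 255).astype(np.uint8)
print(brighter[1, 1])   # [80 80 80]
```

Everything a vision model does — detection, segmentation, generation — is ultimately a computation over arrays like this one.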

Early computer vision tried to solve this with hand-engineered features — edge detectors, corner detectors, texture descriptors. These worked for narrow problems but collapsed at scale. The breakthrough came when neural networks learned their own features directly from pixels. For the underlying machinery, see our neural networks primer.
The ImageNet moment
In 2012, a deep convolutional network called AlexNet, trained on GPUs, won the ImageNet Large Scale Visual Recognition Challenge by a 10-percentage-point margin — a huge gap in a competition usually decided by fractions of a percent. That result catalyzed the deep-learning revolution. Within three years, every top computer-vision system was a neural network. For the broader deep-learning story, see our deep learning coverage.
How convolutional neural networks work
Convolutional neural networks (CNNs) are specifically designed for grid-like data. Their key operation — the convolution — slides a small learned filter across the image, producing a feature map that lights up wherever the filter's pattern appears. Early layers learn simple features: edges, corners, colour blobs. Middle layers combine these into textures and parts (eyes, wheels, leaves). Deep layers combine parts into concepts (face, car, tree).
Crucially, the same filter is applied everywhere in the image (weight sharing), which makes the convolution translation-equivariant: shift the input, and the feature map shifts with it. A cat detector therefore recognizes a cat regardless of where in the frame it appears. This built-in assumption is what makes CNNs dramatically more efficient for vision than fully connected networks.
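The sliding-filter idea, and the equivariance it buys, can be sketched in a few lines of plain NumPy. The filter here is a hand-written toy vertical-edge detector, not a learned one, and the loop implements cross-correlation (what deep-learning libraries actually compute under the name "convolution"):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge filter: responds where brightness jumps left-to-right.
edge = np.array([[-1.0, 1.0],
                 [-1.0, 1.0]])

img = np.zeros((5, 6))
img[:, 3:] = 1.0           # dark left half, bright right half

fmap = conv2d(img, edge)   # strong response only at the brightness edge
print(fmap[0])             # [0. 0. 2. 0. 0.]

# Translation equivariance: shift the input and the response shifts too.
shifted = np.roll(img, 1, axis=1)
assert np.allclose(conv2d(shifted, edge)[:, 1:], fmap[:, :-1])
```

A real CNN stacks many such filters per layer and learns their values from data instead of hand-writing them.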
From CNNs to vision transformers
Since 2020, vision transformers (ViTs) have challenged CNN dominance. A ViT chops an image into patches, treats each patch like a token, and runs the same transformer architecture used for language. With enough data, ViTs match or beat CNNs on many benchmarks. Hybrid designs that combine convolutional priors with transformer flexibility are now state-of-the-art for many vision tasks.
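The patch-splitting step is essentially a reshape. A minimal sketch in NumPy, using the sizes from the original ViT paper (224×224 RGB input, 16×16 patches) as assumed parameters:

```python
import numpy as np

img = np.random.rand(224, 224, 3)   # H x W x C image
P = 16                              # patch size

# Cut the image into non-overlapping 16x16 patches, then flatten each
# patch into a vector -- one "token" per patch, as a transformer expects.
H, W, C = img.shape
patches = img.reshape(H // P, P, W // P, P, C)   # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)       # (14, 14, 16, 16, 3)
tokens = patches.reshape(-1, P * P * C)          # (196, 768)

print(tokens.shape)   # 196 tokens, each a 768-dimensional vector
```

In a real ViT, each flattened patch is then linearly projected and given a position embedding before entering the transformer layers.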
The main task families
Classification
Given an image, predict one or more labels. “Is this a cat or a dog?” “Which of 1,000 ImageNet categories does this image belong to?” Classification is the simplest and most benchmarked vision task.
Object detection
Given an image, find all objects and draw bounding boxes around them, each with a class label. Detection is what powers autonomous-driving perception, security cameras, and inventory counting. Popular architectures include YOLO, Faster R-CNN, and DETR.
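Detections are scored against ground truth (and duplicate boxes are pruned) using intersection-over-union (IoU). A minimal sketch in plain Python, assuming boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlap rectangle; width/height clamp to 0 when boxes don't intersect.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # 0.333... (half-overlapping)
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 (disjoint boxes)
```

In standard benchmarks such as COCO, a predicted box typically counts as correct when its IoU with a ground-truth box exceeds a threshold like 0.5.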
Segmentation
Label every pixel with which object it belongs to. Semantic segmentation groups by class (“all pixels that are ‘road’”); instance segmentation separates individual object instances (“pixel is part of car 1, not car 2”). Medical imaging, video editing, and autonomous perception rely on segmentation. Meta’s Segment Anything Model (SAM), released in 2023, pushed generic segmentation forward dramatically.
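The difference between the two flavours shows up directly in the mask arrays. A toy sketch (NumPy; a made-up 4×4 "image" containing road and two cars):

```python
import numpy as np

# Semantic segmentation: one class ID per pixel (0 = road, 1 = car).
semantic = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
])

# Instance segmentation: each car gets its own ID (0 = background).
instance = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 0],
])

# Semantic view: the two cars merge into one undifferentiated "car" region.
print((semantic == 1).sum())                          # 6 car pixels total
# Instance view: car 1 and car 2 are separate, countable objects.
print((instance == 1).sum(), (instance == 2).sum())   # 4 and 2 pixels
```

Instance masks are what let a system count objects or track one of them, which a purely semantic mask cannot do.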
Image generation
Given a prompt or condition, create an image. Modern diffusion models like Stable Diffusion, DALL-E, and Midjourney are computer-vision systems in reverse — they start from noise and iteratively refine toward an image. See our diffusion-models explainer for details.
Video understanding
Extending vision to video adds the time dimension. Tasks include action recognition (what is the person doing?), tracking (follow a specific object across frames), and video generation. Video is data-expensive to train on, but models like Sora, Runway Gen-3, and others have made rapid progress.
Where you see computer vision
Computer vision is embedded in products most people use without thinking. Face unlock on your phone is vision. Barcode scanning is vision. Auto-correct on a document photo, auto-framing in video calls, the lane-keep assist in a modern car, the quality-control inspection on a factory line, the radiology report assistant at a hospital, the license-plate reader on a toll road — all vision. For more on industry adoption, see our AI industry coverage.
Known limits and open problems
Modern vision systems are accurate on in-distribution data but fragile under distribution shift. A pedestrian-detection model trained on sunny California footage may fail at night in rain. Adversarial examples — images modified in ways invisible to humans — can fool high-accuracy models. Bias in training data leads to biased performance across demographics. And “understanding” a scene at a human level — reading context, intent, subtle social cues — remains well beyond current vision systems.
Frequently asked questions
Is computer vision the same as image recognition?
Image recognition is one subtask of computer vision — specifically, classifying what is in an image. Computer vision is the broader field, covering detection, segmentation, tracking, generation, 3D reconstruction, optical character recognition, and more. When news articles say “AI can now recognize images better than humans”, they usually mean one specific benchmark like ImageNet, which is a narrow classification task. Human visual understanding is still far richer than any current vision model.
Can computer vision see what I’m pointing at?
A growing category of models combines vision with language to answer questions about images — “what is this object?”, “what is the cat doing?”, “read the label in this photo”. These are called vision-language models (VLMs) or multimodal models. GPT-4V, Claude 3’s vision, Gemini, and open-source alternatives like LLaVA can answer natural-language questions about what a camera captures. Quality is high but not perfect — they can miss subtle details and occasionally hallucinate objects that are not there.
How much data does a computer-vision model need to train?
It depends on the task. A supervised classifier from scratch typically wants tens of thousands of labelled images per class to reach good accuracy. But a common workflow today is to start from a model pre-trained on a massive dataset (ImageNet, LAION, proprietary sets with billions of images) and fine-tune on a small task-specific dataset — sometimes just a few hundred images. This transfer-learning approach is how most practical vision deployments get built.