Adversarial Attacks on AI Models: How Hackers Fool ML

Adversarial examples are inputs carefully modified to fool a machine-learning model while looking normal to humans.
The phenomenon was first documented broadly by Goodfellow et al. in 2014 and has been an active research area ever since.
Attacks span white-box (attacker knows model internals), black-box (attacker only queries the model), and physical (stickers or lighting changes in the real world).
Defenses like adversarial training help but none are bulletproof; robustness remains an open research problem.
The attack surface extends beyond vision to text, audio, and — increasingly — large language models through prompt injection and jailbreaks.

What is an adversarial example?

A 2014 paper by Goodfellow, Shlens, and Szegedy showed that adding a carefully crafted, visually imperceptible perturbation to an image could cause an image classifier to change its prediction completely. A panda picture could be made to classify as a gibbon with near-certainty while still looking exactly like a panda to a human.

This was not a bug in a single model. The same perturbations transferred across models with different architectures and different training sets, suggesting something fundamental about how deep networks represent inputs. For the underlying network machinery, see our neural networks primer.

Why adversarial examples exist

The most accepted explanation is that neural networks learn features that are predictive but not robust. The network picks up patterns that correlate with classes in the training distribution but extend in odd directions off the natural-image manifold. An adversarial perturbation pushes the input just barely in those directions.

Adversarial examples are not unique to deep learning — classical ML models can also be fooled — but deep networks’ high-dimensional, non-smooth decision boundaries make them particularly vulnerable. And because modern computer-vision systems (see our computer vision coverage) use similar architectures, a perturbation that fools one often fools others.

Threat models

White-box attacks

The attacker has full knowledge of the model — architecture, weights, training data, everything. This is the worst case and also the easiest to evaluate defensively. The classical white-box attacks (FGSM, PGD, Carlini-Wagner) are highly effective and widely used as benchmarks.

Black-box attacks

The attacker can only query the model as an oracle — get predictions for chosen inputs but not gradients or internals. Black-box attacks exploit transferability: an attack crafted against a surrogate model often works against the target. Query-efficient black-box attacks (boundary attack, NES, HopSkipJumpAttack) need relatively few queries to succeed.

Physical-world attacks

The attacker modifies physical objects — stickers on stop signs that cause self-driving perception to misclassify as speed limits, patterned glasses that fool face recognition, printed posters that confuse image classifiers. Physical attacks must survive lighting, angle, and camera noise, which constrains them but also shows the threat reaches beyond digital adversaries.

Poisoning and backdoors

Instead of attacking at inference, the attacker corrupts training data or the model itself. A backdoor implants a trigger — a specific pixel pattern, a specific word — that causes the model to misbehave when the trigger is present but behave normally otherwise. Hugging Face model downloads, open-source data contributions, and supply-chain compromises are all real-world vectors.

Attacks on language models

Vision was the original playground, but text followed. Text adversarial attacks substitute synonyms, add invisible characters, or alter phrasing to flip classifier outputs. For large language models, the attacks evolved into a distinct set of techniques:

Jailbreaks — prompts engineered to bypass safety training, like role-play scenarios that trick the model into producing harmful content.
Prompt injection — malicious instructions hidden in data the model processes (covered in our prompt-injection primer).
Data exfiltration — prompts that cause the model to reveal training data, system prompts, or user data from its context.
Model extraction — systematically querying a deployed model to reconstruct a cheaper copy.

These attacks matter because they bypass security even when the model is working as designed — the model is doing what the attacker’s input literally says to do.

Defenses and their limits

Adversarial training

Train the model on adversarial examples in addition to clean ones. Models trained this way are more robust to the attacks they were trained against. They still fail against novel, stronger attacks. Adversarial training is compute-expensive — several times slower than normal training.

Input preprocessing

Smoothing, compression, randomization, or denoising the input before feeding it to the model breaks many attacks. Attackers adapt by optimizing against the combined preprocessing + model pipeline, but the cost of adaptation raises the bar.

Certified robustness

Research methods prove, mathematically, that a model’s prediction will not change under perturbations up to a specific size. These guarantees are real but narrow — certifiable radii are small, and certified models have lower clean accuracy.

Detection and anomaly rejection

Flag inputs that look statistically unusual and refuse to classify them. Attackers adapt by crafting perturbations that look more natural. Detection is a cat-and-mouse game with no clear winner.

Ensemble methods and diversity

Running multiple models with different architectures and requiring agreement raises the attack cost. Still fallible — transferability means one attack often fools many models.

The offensive AI security landscape

The OWASP Machine Learning Security Top 10 catalogs current threats, and MITRE ATLAS documents adversarial tactics and techniques observed against real AI systems. Red-teaming AI systems — paying professional security researchers to break them — has matured into a standard pre-deployment practice at major AI labs. Microsoft, Google, Meta, Anthropic, and OpenAI all maintain AI red teams. Bug bounty programs for model behaviour (beyond just infrastructure vulnerabilities) are increasingly common.

Safety-critical deployments — medical diagnosis, autonomous vehicles, content moderation, financial decisions — need specific evaluation for adversarial robustness, not just benchmark accuracy. Assume the attacker will probe the system and plan accordingly. For more on the broader AI safety picture, see our ai safety coverage.

Frequently asked questions

Can I just retrain my model to fix adversarial vulnerabilities?
Retraining alone rarely solves the problem. Adversarial training helps but only against attack types included in training. A determined attacker will find new perturbations outside your training distribution. The realistic stance is to combine robustness training with monitoring, rate limiting, input sanity checks, and an incident response plan for when attacks succeed.

Are large language models as vulnerable to adversarial attacks as image classifiers?
They are vulnerable in different ways. LLMs are generally robust to character-level perturbations that would fool a classifier, but vulnerable to semantic attacks — jailbreaks, prompt injection, social-engineering-style prompts. The attack surface is broader because LLMs process much more open-ended inputs. Safety training helps, but no production LLM has been fully robustness-proven.

Has an adversarial attack caused real-world harm?
Documented real-world cases are growing. Researchers demonstrated stop-sign attacks that could fool production self-driving perception (in controlled settings). Face-recognition evasion using accessories has been documented. Voice-cloning attacks on voice-biometric systems have drained bank accounts. In 2024, researchers showed targeted prompt-injection attacks successfully manipulating production AI email assistants. The trend is clear — adversarial ML is moving from academic curiosity to real threat model.