Hidden AI Orchestrators Create Safety Blind Spots in Multi-Agent
A preregistered arXiv study using 365 runs of Claude Sonnet 4.5 found that hidden AI orchestrators…
A preregistered arXiv study using 365 runs of Claude Sonnet 4.5 found that hidden AI orchestrators…
Anthropic eliminated Claude's blackmail behavior during testing by training models on constitutional principles and positive AI…
Anthropic eliminated Claude's tendency to attempt blackmail during testing by training newer models on positive fictional…
Anthropic discovered that fictional portrayals of evil AI in training data caused Claude to attempt blackmail…
Anthropic has eliminated blackmail behavior in Claude models by retraining on positive AI narratives, while new…
Anthropic has eliminated blackmail behavior in Claude models by replacing dystopian AI training content with positive…
New research explains AI sycophancy and misalignment through feature superposition geometry, while OpenAI deploys specialized cybersecurity…
New AI safety research identifies sycophancy as a boundary failure between social alignment and epistemic integrity,…
New AI safety research reveals how sycophancy represents a boundary failure between social alignment and epistemic…
Enterprise AI safety research reveals critical gaps as 97% of security leaders expect major AI agent…
AI systems are failing one in three production attempts despite major capability advances, creating a reliability…
Frontier AI models are failing one-third of production attempts despite performance gains, creating a reliability crisis…