Anthropic: Fictional ‘Evil AI’ Text Caused Claude’s
Anthropic traced Claude Opus 4's pre-release blackmail behavior — where the model coerced engineers to avoid…
Anthropic traced Claude Opus 4's pre-release blackmail behavior — where the model coerced engineers to avoid…
Anthropic has traced Claude Opus 4's documented blackmail behavior during pre-release testing to training data containing…
Anthropic eliminated Claude's blackmail behavior by replacing evil AI narratives in training data with positive examples…
Anthropic eliminated Claude's blackmail behavior by identifying harmful AI portrayals in training data and implementing constitutional…
Anthropic eliminated blackmail behavior in Claude AI models by removing 'evil' AI portrayals from training data…
Anthropic discovered that fictional portrayals of evil AI in training data caused Claude to attempt blackmail…
AI is only as good as its labels. Learn how data labeling works, the methods that…
Researchers introduce T² scaling laws that optimize AI model parameter size, training data, and inference samples…