AI Benchmarks Under Scrutiny: Hacks, IQ Scores, and OpenAI’s
A new automated tool called BenchJack found 219 reward-hacking exploits across 10 major AI agent benchmarks,…
A new automated auditing tool called BenchJack found 219 reward-hacking exploits across 10 popular AI agent…
A new AI IQ platform ranking 50+ language models on human intelligence scales has sparked debate,…
A new AI IQ benchmark ranking 50+ language models on a single intelligence scale has divided…
A new AI IQ website ranking language models on human intelligence scales has sparked intense debate,…
A new AI IQ website ranking language models on human intelligence scales has sparked debate, while…
TokenArena introduces energy-aware AI benchmarking across 78 endpoints, revealing 6.2x efficiency variations and 12.5-point accuracy differences…
TokenArena introduces energy-aware AI benchmarking that measures models at endpoint granularity, revealing substantial performance variations and…