The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as quickly as the models improve. On Monday, Artificial Analysis, an independent AI benchmarking organization whose rankings are closely watched by developers and enterprise buyers, released a major overhaul to its Intelligence Index that fundamentally changes how the industry measures AI progress.The new Intelligence Index v4.0 incorporates 10 evaluations spanning agents, coding, scientific reasoning, and general knowledge. But the changes go far deeper than shuffling test names. The organization removed three staple benchmarks — MMLU-Pro, AIME 2025, and LiveCodeBench — that have long been cited by AI companies in their marketing materials. In their place, the n [...]
Watch out, DeepSeek and Qwen! There's a new king of open source large language models (LLMs), especially when it comes to something enterprises are increasingly valuing: agentic tool use — that [...]
AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks. That gap between capability and reliability is the defin [...]
Meta has been one of the most interesting companies of the generative AI era — initially gaining a loyal and huge following of users for the release of its mostly open source Llama family of large l [...]
There is no shortage of AI benchmarks in the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others.AI agents excel at solving abstract ma [...]
Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking re [...]
Patronus AI, the artificial intelligence evaluation startup backed by $20 million from investors including Lightspeed Venture Partners and Datadog, unveiled a new training architecture Tuesday that it [...]
ThinkLabs AI, a startup building artificial intelligence models that simulate the behavior of the electric grid, announced today that it has closed a $28 million Series A financing round led by Energy [...]
Voice AI is moving faster than the tools we use to measure it. Every major AI lab — OpenAI, Google DeepMind, Anthropic, xAI — is racing to ship voice models capable of natural, real-time conversat [...]