The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as quickly as the models improve. On Monday, Artificial Analysis, an independent AI benchmarking organization whose rankings are closely watched by developers and enterprise buyers, released a major overhaul to its Intelligence Index that fundamentally changes how the industry measures AI progress.The new Intelligence Index v4.0 incorporates 10 evaluations spanning agents, coding, scientific reasoning, and general knowledge. But the changes go far deeper than shuffling test names. The organization removed three staple benchmarks — MMLU-Pro, AIME 2025, and LiveCodeBench — that have long been cited by AI companies in their marketing materials. In their place, the n [...]
For decades, the IQ test has been one of the most familiar — and most contested — yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial [...]
On Sunday, a team of nine researchers at Sina Weibo — the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence — quietly posted a 14 [...]
Watch out, DeepSeek and Qwen! There's a new king of open source large language models (LLMs), especially when it comes to something enterprises are increasingly valuing: agentic tool use — that [...]
AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks. That gap between capability and reliability is the defin [...]
Meta has been one of the most interesting companies of the generative AI era — initially gaining a loyal and huge following of users for the release of its mostly open source Llama family of large l [...]
There is no shortage of AI benchmarks in the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others.AI agents excel at solving abstract ma [...]
For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude [...]
Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking re [...]
Patronus AI, the artificial intelligence evaluation startup backed by $20 million from investors including Lightspeed Venture Partners and Datadog, unveiled a new training architecture Tuesday that it [...]