For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet even as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, explain, and code, one critical category of interaction remains largely unsolved: reliably completing tasks for people outside of chat. Even the best AI models score only in the 30th percentile on Terminal-Bench Hard, a third-party benchmark designed to evaluate how well AI agents complete a variety of terminal-based tasks, far below the reliability demanded by most enterprises and users. And task-specific benchmarks like TAU-Bench airline, which measures how reliably AI agents find and book flights on behalf of a user, don't show much higher pass rates either, w [...]
The buzzed-about but still stealthy New York City startup Augmented Intelligence Inc (AUI), which seeks to go beyond the popular "transformer" architecture used by most of today's LLMs [...]
A rogue AI agent at Meta passed every identity check and still exposed sensitive data to unauthorized employees in March. Two weeks later, Mercor, a $10 billion AI startup, confirmed a supply-chain br [...]
New VB Pulse data shows Microsoft and OpenAI leading enterprise agent orchestration, but Anthropic’s first measurable foothold points to a larger fight over who controls the infrastructure where AI [...]
Perplexity, the AI-powered search company valued at $20 billion, announced on Wednesday at its inaugural Ask 2026 developer conference that its multi-model AI agent, Computer, is now available to ente [...]
Microsoft last week took Agent 365, its management platform for AI agents, out of preview and into general availability — a move that signals the software giant believes the governance challenge aro [...]
For the past two years, the technology industry has raced to make AI agents more capable — teaching them to write code, navigate software interfaces, manage files, and orchestrate multi-step workflo [...]