Destination
Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

Based on a new benchmark, Google DeepMind found Gemini 2.0 Flash to be the most factual LLM, with a score of 83.6%. [...]

Rating

Innovation

Pricing

Technology

Usability

We have discovered similar tools to what you are looking for. Check out our suggestions for similar AI tools.

venturebeat
The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

There's no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model on completing various helpful enterprise tasks — from coding to instruction fol [...]

Match Score: 170.74

venturebeat
98% of market researchers use AI daily, but 4 in 10 say it makes errors — revealing a major trust problem

Market researchers have embraced artificial intelligence at a staggering pace, with 98% of professionals now incorporating AI tools into their work and 72% using them daily or more frequently, accordi [...]

Match Score: 87.73

Destination
Google DeepMind's Genie 3 can dynamically alter the state of its simulated worlds

At start of December, Google DeepMind released Genie 2. The Genie family of AI systems are what are known as world models. They're capable of generating images as the user — either a human or, [...]

Match Score: 78.85

venturebeat
Frontier models are failing one in three production attempts — and getting harder to audit

AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks. That gap between capability and reliability is the defin [...]

Match Score: 71.77

venturebeat
Meta researchers introduce 'hyperagents' to unlock self-improving AI for non-coding tasks

Creating self-improving AI systems is an important step toward deploying agents in dynamic environments, especially in enterprise production environments, where tasks are not always predictable, nor c [...]

Match Score: 68.36

venturebeat
Lean4: How the theorem prover works and why it's the new competitive edge in AI

Large language models (LLMs) have astounded the world with their capabilities, yet they remain plagued by unpredictability and hallucinations – confidently outputting incorrect information. In high- [...]

Match Score: 67.31

venturebeat
Why observable AI is the missing SRE layer enterprises need for reliable LLMs

As AI systems enter production, reliability and governance can’t depend on wishful thinking. Here’s how observability turns large language models (LLMs) into auditable, trustworthy enterprise syst [...]

Match Score: 65.01

venturebeat
A weekend ‘vibe code’ hack by Andrej Karpathy quietly sketches the missing layer of enterprise AI orchestration

This weekend, Andrej Karpathy, the former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he did not want to read it alone. He wanted to read it accompan [...]

Match Score: 63.95

venturebeat
Red teaming LLMs exposes a harsh truth about the AI security arms race

Unrelenting, persistent attacks on frontier models make them fail, with the patterns of failure varying by model and developer. Red teaming shows that it’s not the sophisticated, complex attacks tha [...]

Match Score: 62.31