venturebeat
Nvidia says it can shrink LLM memory 20x without changing model weights

Nvidia researchers have introduced a new technique that dramatically reduces how much memory large language models need to track conversation history — by as much as 20x — without modifying the model itself. The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats like JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x.For enterprise AI applications that rely on agents and long contexts, this translates to reduced GPU memory costs, better prompt reuse, and up to an 8x reduction in latency by avoiding the need to recompute dropped KV cache values.Serving large language models at scale requires managing a massive amount of data, especially for multi-turn conv [...]

Rating

Innovation

Pricing

Technology

Usability

We have discovered similar tools to what you are looking for. Check out our suggestions for similar AI tools.

venturebeat
MeMo's memory model lets teams upgrade their LLM without retraining it — and performance jumps 26%

Enabling LLMs to acquire new knowledge after training remains a major hurdle for enterprise AI — current solutions are either too expensive, too slow, or constrained by context window limits.MeMo, a [...]

Match Score: 130.52

venturebeat
MIT's MeMo lets teams swap in a better LLM without retraining — and performance jumps 26%

Enabling LLMs to acquire new knowledge after training remains a major hurdle for enterprise AI — current solutions are either too expensive, too slow, or constrained by context window limits.MeMo, a [...]

Match Score: 125.48

venturebeat
Nvidia launches enterprise AI agent platform with Adobe, Salesforce, SAP among 17 adopters at GTC 2026

Jensen Huang walked onto the GTC stage Monday wearing his trademark leather jacket and carrying, as it turned out, the blueprints for a new kind of monopoly.The Nvidia CEO unveiled the Agent Toolkit, [...]

Match Score: 124.01

venturebeat
Nvidia introduces Vera Rubin, a seven-chip AI platform with OpenAI, Anthropic and Meta on board

Nvidia on Monday took the wraps off Vera Rubin, a sweeping new computing platform built from seven chips now in full production — and backed by an extraordinary lineup of customers that includes Ant [...]

Match Score: 123.00

venturebeat
A 0.12% parameter add-on gives AI agents the working memory RAG can't

AI agents forget. Every time a coding assistant loses track of a debugging thread, or a data analysis agent re-ingests the same context it already processed, the team pays in latency, token costs, and [...]

Match Score: 109.31

venturebeat
How DeepSeek’s radical architecture is shattering Silicon Valley's token moat

DeepSeek’s announcement over the weekend that it has made its 75% price cut permanent on its flagship V4 Pro model is a disruptive assault on the capital-heavy business models of Silicon Valley’s [...]

Match Score: 98.71

venturebeat
Nvidia's DGX Station is a desktop supercomputer that runs trillion-parameter AI models without the cloud

Nvidia on Monday unveiled a deskside supercomputer powerful enough to run AI models with up to one trillion parameters — roughly the scale of GPT-4 — without touching the cloud. The machine, calle [...]

Match Score: 91.76

venturebeat
Under the hood of AI agents: A technical guide to the next frontier of gen AI

Agents are the trendiest topic in AI today — and with good reason. Taking gen AI out of the protected sandbox of the chat interface and allowing it to act directly on the world represents a leap for [...]

Match Score: 89.94

venturebeat
MiniMax-M3 debuts, eclipsing GPT-5.5 and Gemini 3.1 Pro on key benchmark performance for just 5-10% of the cost

Big news in enterprise AI broke over the weekend as Chinese AI startup MiniMax released its highly anticipated M3 large language model on Sunday evening Eastern time, pairing frontier-tier coding and [...]

Match Score: 87.50