As agentic AI moves from experiments to real production workloads, a quiet but serious infrastructure problem is coming into focus: memory. Not compute. Not models. Memory.Under the hood, today’s GPUs simply don’t have enough space to hold the Key-Value (KV) caches that modern, long-running AI agents depend on to maintain context. The result is a lot of invisible waste — GPUs redoing work they’ve already done, cloud costs climbing, and performance taking a hit. It’s a problem that’s already showing up in production environments, even if most people haven’t named it yet.At a recent stop on the VentureBeat AI Impact Series, WEKA CTO Shimon Ben-David joined VentureBeat CEO Matt Marshall to unpack the industry’s emerging “memory wall,” and why it’s becoming one of the big [...]
Nvidia researchers have introduced a new technique that dramatically reduces how much memory large language models need to track conversation history — by as much as 20x — without modifying the mo [...]
DeepSeek’s announcement over the weekend that it has made its 75% price cut permanent on its flagship V4 Pro model is a disruptive assault on the capital-heavy business models of Silicon Valley’s [...]
For the last 24 months, one narrative justified every over-provisioned data center and bloated IT budget: the GPU scramble. Silicon was the new oil, and H100s traded like contraband. Reserve capacity [...]
AI agents forget. Every time a coding assistant loses track of a debugging thread, or a data analysis agent re-ingests the same context it already processed, the team pays in latency, token costs, and [...]
When an enterprise LLM retrieves a product name, technical specification, or standard contract clause, it's using expensive GPU computation designed for complex reasoning — just to access stati [...]
Enterprise AI applications that handle large documents or long-horizon tasks face a severe memory bottleneck. As the context grows longer, so does the KV cache, the area where the model’s working me [...]
As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Labs, Columbia University and TogetherAI has found a [...]
Google senior AI product manager Shubham Saboo has turned one of the thorniest problems in agent design into an open-source engineering exercise: persistent memory.This week, he published an open-sour [...]
The attacker who hit the most financial services organizations over the past 12 months never phished a password. They called an IT support line, convinced an employee to reset their MFA, and registere [...]