Your chatbot forgets the user's name after 8 messages. Your RAG pipeline hallucinates 20% of the time. Your agent re-asks "what language do you prefer?" every single session. These are all the same problem: LLMs have no memory. The industry has spent four years trying to bolt it on, and the approach has changed every year.
Each step fixed one thing and broke another. Here's the engineering story.
2022: Context windows were tiny
GPT-3 davinci: 2,048 tokens. ChatGPT (November 2022): 4,096 tokens. System prompt + conversation history + user query all fighting for the same 4K window. After ~8 turns, the system prompt gets truncated and the bot forgets its own instructions. I spent a week debugging exactly this on a document Q&A app before realizing the model wasn't broken - it was just out of room.
LangChain shipped ConversationSummaryMemory: compress history by asking the LLM to summarize it. You were burning extra API calls just to remember what the user said five minutes ago. Playing telephone with yourself. Researchers found the "lost in the middle" problem on top of that: models forgot information placed in the center of their context.
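The summarize-and-keep-recent pattern is easy to sketch. Here's a minimal version in plain Python, with a stub standing in for the LLM summarization call - in the real thing, that stub is the extra API request you're burning:

```python
def summarize(turns):
    # stand-in for an LLM call; real systems spend an extra
    # API request here to compress the transcript
    return f"summary of {len(turns)} earlier turns"

class SummaryMemory:
    """Keep the last few turns verbatim, compress everything older."""
    def __init__(self, keep_last=4):
        self.keep_last = keep_last
        self.turns = []

    def add(self, turn):
        self.turns.append(turn)

    def render(self):
        # what actually goes into the prompt on each request
        if len(self.turns) <= self.keep_last:
            return list(self.turns)
        older = self.turns[:-self.keep_last]
        recent = self.turns[-self.keep_last:]
        return [summarize(older)] + recent

mem = SummaryMemory(keep_last=2)
for turn in ["hi, I'm Ana", "I use Python", "ok", "what's my name?"]:
    mem.add(turn)
print(mem.render())
```

The class and function names are mine, not LangChain's API; the point is the shape of the tradeoff - every call to `summarize` is both a cost and a chance to lose detail.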
The takeaway: The context window wasn't going to scale. Memory had to live outside the model.
2023: Vector databases and RAG
The industry converged on RAG. The architecture:
indexing:  documents -> embedding model (ada-002, 1536d) -> vector DB
query:     user query -> embed query -> nearest-neighbor search over vector DB -> top-k chunks -> prompt -> LLM
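Stripped of the infrastructure, the query half of that pipeline is just nearest-neighbor search. A toy sketch with 3-dimensional vectors standing in for ada-002's 1536 (function names and data are mine):

```python
import math

def cosine(a, b):
    # cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    # index: list of (chunk_text, embedding) pairs
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]

# toy 3-d "embeddings"; a real index holds millions of these
index = [
    ("postgres config notes", [0.9, 0.1, 0.0]),
    ("holiday calendar",      [0.0, 0.2, 0.9]),
    ("db migration guide",    [0.8, 0.3, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], index, k=2))
```

A vector database earns its keep by doing this approximately, at scale, with updates - the math itself is one sorted list.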
In some benchmarks, RAG with GPT-4 beat full-context-stuffing at a fraction of the cost. But production RAG was a different beast.
The chunking problem alone consumed weeks. Fixed-size chunks at 512 tokens split sentences in half. Semantic chunking preserved meaning but needed another model for split decisions. I tuned chunk sizes for a data pipeline and found different sizes worked best for different question types. There was no universal optimum.
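To make that tradeoff concrete, here's a sketch of both strategies. The greedy sentence-packing heuristic and function names are mine, not any particular library's:

```python
def fixed_chunks(text, size):
    # naive fixed-size split: ignores sentence boundaries entirely
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text, size):
    # greedy sentence-aware split: pack whole sentences up to `size` chars
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > size:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks

text = ("Postgres is our primary store. Redis caches sessions. "
        "Backups run nightly.")
print(fixed_chunks(text, 40))     # first chunk ends mid-word
print(sentence_chunks(text, 40))  # every chunk is whole sentences
```

Note what sentence-aware splitting costs even in this toy: a sentence tokenizer (here a crude `split(".")`) that itself fails on abbreviations, decimals, and code - hence the extra model in production systems.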
The infra overhead was real. Self-hosting Milvus or Qdrant meant distributed systems expertise. Managed services meant per-query fees. A three-person team suddenly needed a fourth just for retrieval.
Even with all this, a 2024 Stanford HAI study ("AI on Trial") benchmarked legal AI tools from LexisNexis and Thomson Reuters and found hallucination rates of 17-33%. Retrieval helped but didn't stop the model from confidently generating wrong answers.
That said, RAG was and still is the right tool for a lot of problems. If you're searching across millions of documents, or your knowledge base updates frequently, nothing else comes close. The pain was real, but so was the value.
The takeaway: RAG solved "what does the model know about this document" but ignored "what does the model remember about me."
2024: Memory extraction replaces document retrieval
The shift: instead of retrieving documents, extract useful facts and throw away the rest.
ChatGPT got memory (April), Claude (August), Gemini (November). These systems extracted atomic facts from conversations: user prefers Python, user is building a data pipeline, user's team uses PostgreSQL. This mirrored how human memory works. You remember takeaways, not transcripts.
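The extraction step is conceptually simple: prompt the model for atomic facts, parse JSON back. A sketch with a stubbed model so it runs offline - the prompt wording, helper names, and stub output are all mine:

```python
import json

EXTRACTION_PROMPT = """Extract atomic, user-specific facts from this message.
Return a JSON list of short strings. Ignore greetings and one-off chatter.

Message: {message}"""

def extract_facts(message, llm):
    # llm: callable prompt -> str; real systems plug an API call in here
    raw = llm(EXTRACTION_PROMPT.format(message=message))
    return json.loads(raw)

def fake_llm(prompt):
    # stub so the sketch runs without an API key
    return '["user prefers Python", "user team uses PostgreSQL"]'

facts = extract_facts("We're a Python shop and our team runs PostgreSQL.",
                      fake_llm)
print(facts)
```

Everything interesting lives in the prompt and in what you do with the parsed facts afterward (dedup, conflict resolution, scoping per user).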
Three categories of open-source projects emerged:
Hybrid memory stacks like Mem0 combined vector stores + graph DB + key-value stores into one system. On the LOCOMO benchmark - published by Snap Research, which tests whether a system can recall specific facts from 300+ turn conversations spanning up to 35 sessions - Mem0 reported 66.9% accuracy vs OpenAI Memory's 52.9%, with 91% lower latency and 90% fewer tokens.
Self-managed context systems like Letta (formerly MemGPT) treated the LLM like an OS. The model managed its own context window: what stays in RAM (active context) vs paged to disk (external storage). It used function calls to save and retrieve its own memories.
Temporal knowledge stores like Zep tracked when facts were established and how they changed. User says "I prefer dark mode" in March, "switched to light mode" in October? Zep resolved contradictions by timestamp.
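Timestamp-based resolution can be sketched as a latest-wins store that keeps superseded values around rather than deleting them. This is a simplification of what Zep does; the class and method names are mine:

```python
from datetime import date

class TemporalFacts:
    """Latest-timestamp-wins store; superseded facts are kept, not deleted."""
    def __init__(self):
        self.history = {}  # key -> sorted list of (valid_from, value)

    def assert_fact(self, key, value, valid_from):
        self.history.setdefault(key, []).append((valid_from, value))
        self.history[key].sort()  # tuples sort by date first

    def current(self, key):
        # the most recently established value wins
        return self.history[key][-1][1]

facts = TemporalFacts()
facts.assert_fact("ui.theme", "dark mode", date(2024, 3, 1))
facts.assert_fact("ui.theme", "light mode", date(2024, 10, 5))
print(facts.current("ui.theme"))
```

Keeping the full history matters: "what did the user prefer in March" is a legitimate query, and an append-only log answers it for free.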
The problems were real too. The "Echoleak" incident showed a prompt hidden in an email could cause an agent to leak private information from prior conversations. Systems hallucinated memories, fabricating facts about users that then influenced future responses.
Also in 2024: Microsoft released GraphRAG, applying graph-based reasoning to document retrieval - a different angle from the entity-graph memory that Mem0 and Zep were building, but pointing in the same direction. Memory compression appeared in research (KVzip: 3-4x compression, Dynamic Memory Compression: 7x throughput improvement). Prompt caching landed at Anthropic (90% cost reduction on repeated prefixes, per their docs) and OpenAI (50% off cached tokens, automatic). If your system prompt is stable, cache it. That's step zero.
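For Anthropic, caching is opt-in per content block: you mark the stable prefix with `cache_control`. A request-body sketch based on their docs - the model name and prompt text are illustrative:

```python
# request-body sketch for Anthropic prompt caching (per their docs;
# model name illustrative). The stable system prompt is marked cacheable,
# so repeated requests pay full price only for the changing suffix.
request = {
    "model": "claude-sonnet-4",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a code-review assistant. <long stable instructions>",
            "cache_control": {"type": "ephemeral"},  # cache up to here
        }
    ],
    "messages": [{"role": "user", "content": "Review this diff: ..."}],
}
```

OpenAI's variant needs no markup at all - caching of repeated prefixes is automatic - which is why stabilizing your prompt prefix is step zero either way.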
The takeaway: Extraction-based memory was cheaper and faster than RAG for personalization. But every memory system introduced a new attack surface.
2025: Just use files
Every memory solution up to this point required infrastructure. Then 2025 happened.
Claude Code introduced CLAUDE.md: a markdown file loaded into the system prompt at session start. Write your project conventions and architecture decisions in it. According to Anthropic's reports, an 80-line CLAUDE.md cut manual corrections 40% on a 50K-line TypeScript codebase. The pattern spread:
.
├── AGENTS.md # open standard (Aug 2025)
├── CLAUDE.md # Claude Code
├── .cursor/rules/*.mdc # Cursor (per-project, per-subdirectory)
├── .github/copilot-instructions.md # GitHub Copilot
└── memory-bank/
├── projectbrief.md # Cline
├── activeContext.md
└── progress.md
AGENTS.md shipped as an open standard from OpenAI, Google, Cursor, Factory, and Sourcegraph. According to the Agentic AI Foundation, adoption reached 60,000+ repos within months. The Linux Foundation took stewardship by December.
Why files won for developer workflows: $0.02/GB/month vs $50-200/GB/month for vector databases. Git-versionable. Human-readable. When the model goes wrong, you open the file and fix it. Try debugging a vector database that way.
Letta's MemFS scored 74% on LOCOMO, beating specialized memory libraries on that benchmark. Worth noting: the comparison with Mem0's 66.9% isn't apples to apples - MemFS was benchmarked as a retrieval mechanism (can the filesystem find the right fact?), while Mem0's score covers the full memory pipeline, including extraction and consolidation. They're solving adjacent problems. I moved my own project's memory from a Pinecone-backed setup to a CLAUDE.md file, and the debugging experience alone was worth the switch.
But I want to be honest about where files fall over, because it's not just an edge case. If you're building anything with multiple concurrent agents writing to the same memory, files corrupt. If your memory is unstructured user-generated content (support tickets, chat logs, meeting notes) rather than curated developer knowledge, grep can't find what you need. If you're retrieving across millions of entities, you need a real database - files aren't even in the conversation. The success of file-based memory is partly a function of the use cases it was designed for: single developers working on single codebases.
The takeaway: For single-user, single-project work, a text file gets you surprisingly far. But the moment you leave that comfort zone, you need more.
2026: Files on top, databases underneath
Most teams building agents in production have landed on a hybrid: the simplicity of file-based interfaces backed by real persistence.
I'll be upfront: I've prototyped with DeepAgents but haven't shipped it to production yet. The frameworks in this section are newer, and my experience with them is shallower than the earlier sections. I'm drawing on what I've tested, what colleagues are reporting, and what the benchmarks show.
LangChain's DeepAgents (March 2026) is the clearest example of the pattern. It ships a virtual filesystem with pluggable backends:
# the agent reads/writes markdown and JSON
# the backend is swappable
backends = {
    "prototype": "local_disk",
    "staging": "langgraph_store",          # cross-thread persistence
    "production": "postgresql | mongodb",  # multi-user, durable
    "composite": "disk + langgraph",       # combine multiple
}
The middleware compresses conversation history, offloads large tool outputs, isolates context between subagents, and applies prompt caching. The agent thinks in files. The system decides where they live.
Here's what an end-to-end memory cycle looks like in a system like this:
1. User says: "I switched our DB from Postgres to CockroachDB last week"
2. Extraction: middleware identifies this as a memory-worthy fact
3. File write: updates project_context.md → "database: cockroachdb (changed from postgres, march 2026)"
4. Backend sync: file change persists to langgraph_store / postgresql
5. Old memory: "database: postgres" marked as superseded, not deleted
6. Next session: agent loads project_context.md from backend into prompt
7. User asks: "write a migration script"
8. Agent: generates CockroachDB-compatible SQL, not Postgres
The fact never re-enters the context window as raw conversation history. It lives as a curated, timestamped, version-controlled fact that the agent can retrieve and act on.
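Steps 3 and 5 - write the new fact, mark the old one superseded - reduce to line rewriting if project_context.md keeps a flat `key: value` layout. A sketch under that assumption (helper name and comment format are mine, not DeepAgents'):

```python
from datetime import date

def supersede(lines, key, new_value, when):
    # rewrite project_context.md-style lines: mark the old fact
    # superseded (step 5), append the new one with a date (step 3)
    out = []
    for line in lines:
        if line.startswith(key + ":") and "superseded" not in line:
            line += f"  <!-- superseded {when.isoformat()} -->"
        out.append(line)
    out.append(f"{key}: {new_value} (as of {when.isoformat()})")
    return out

before = ["project: billing-service", "database: postgres"]
after = supersede(before, "database", "cockroachdb", date(2026, 3, 10))
print("\n".join(after))
```

Because nothing is deleted, the file doubles as an audit log, and git history gives you a second one for free.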
Other frameworks are converging on similar ideas, roughly splitting into three categories:
Fact extraction + entity graphs: Hindsight (early 2026) extracts facts from conversations and builds entity graphs with four parallel recall strategies. Good for applications that need to track relationships between people, projects, and decisions.
Autonomous memory management: A-Mem lets agents decide what to store, retrieve, update, and summarize without explicit instructions. The agent manages its own memory lifecycle.
Graph-native memory: Mem0 added directed labeled graphs in early 2026. Zep's Graphiti engine does temporal knowledge graphs. When entity relationships matter more than text similarity - "who reported to whom during Q3" rather than "find documents about Q3" - graphs are the right structure.
On the compression side, Dynamic Memory Sparsification (Edinburgh + NVIDIA) achieves 8x smaller memory with better benchmark scores on reasoning tasks. That's real savings if you're paying per token at scale.
The unsolved problems are hard ones. Multi-agent shared memory - how to give a team of agents a shared brain without a "noisy commons" where irrelevant information drowns out the useful stuff - is the biggest open question. The MemTrack benchmark (published October 2025, arXiv:2510.01353), which tests memory across complex multi-platform organizational scenarios, shows even GPT-5 only hits 60% correctness. We're not close to solved here.
One thing worth watching: GDPR Article 17 gives users the right to erasure. Deleting facts from a file or vector DB is trivial. Deleting them from model weights requires retraining or experimental unlearning. This regulatory gap will keep pushing production toward external memory architectures.
The takeaway: Write markdown, persist anywhere. The file-based interface is the right abstraction for now. The storage backend is an implementation detail.
So what should you actually use?
I've gone through five generations of memory approaches. Here's my honest take on what to reach for, based on what I've seen work and what I've seen waste months of engineering time.
First question: do you even need memory?
Not every LLM application does. If your agent handles independent, one-shot tasks - summarize this document, translate this text, answer this question from a knowledge base - stateless is fine. Reasoning models like o3 actually perform worse with excessive context; they want shorter, cleaner prompts. Adding memory to a stateless workflow adds complexity, cost, and new failure modes (stale state, memory hallucination, race conditions) for zero benefit.
You need memory when the current interaction depends on a previous one. Personalized assistants, multi-session workflows, agents that learn user preferences, anything where "forgetting" visibly degrades the experience.
Second question: how many users and agents?
This is the decision that matters most. Here's how I'd break it down:
┌─────────────────────────────────┐
│ Do you need cross-session │
│ persistence? │
│ │
│ NO → stateless. stop here. │
│ YES ↓ │
├─────────────────────────────────┤
│ Single user, single agent? │
│ │
│ YES → file-based memory │
│ (CLAUDE.md, AGENTS.md) │
│ NO ↓ │
├─────────────────────────────────┤
│ < 1000 users, single agent │
│ per user? │
│ │
│ YES → Mem0 or LangMem │
│ (managed extraction + │
│ per-user scoping) │
│ NO ↓ │
├─────────────────────────────────┤
│ Multi-agent, shared context, │
│ or 10K+ users? │
│ │
│ → hybrid: file interface + │
│ DB backend (DeepAgents-style) │
│ + graph layer if you need │
│ entity relationships │
└─────────────────────────────────┘
Third question: which framework?
From what I've seen in production and what developers are reporting in practice:
File-based memory (CLAUDE.md, AGENTS.md, .cursor/rules) is where I'd start for any developer tool or single-user workflow. Five minutes to set up, zero cost, version control for free. This is my default and I think it should be yours. The ceiling: multi-user, concurrent writes, or semantic search.
Mem0 is the most common choice I see for B2C products that need to remember user preferences across sessions. Its hybrid store (vector + graph + key-value) handles most patterns out of the box. The tradeoff: graph features need the $249/mo Pro tier, and you're taking an API dependency.
LangMem is worth serious consideration if you're already in LangChain/LangGraph. Free, open-source, plugs directly into LangGraph's storage layer, and the prompt optimization features are unique. The tradeoff: tight framework coupling. If you leave LangChain later, your memory layer doesn't come with you cleanly.
Zep is underrated and purpose-built for temporal reasoning. Customer support where ticket history matters, healthcare where patient context evolves, any domain where "when did this fact change" matters as much as "what is the fact." Zep's Graphiti engine handles this better than the alternatives I've tested. The tradeoff: steeper setup, smaller community.
Letta is the most architecturally interesting option for long-running agents. If your agent runs for hours or days and needs to decide what to keep and what to archive, Letta's OS-inspired memory management is genuinely different. The tradeoff: steepest learning curve of the group, and less mature ecosystem.
Roll your own with Redis + vector search if you already have Redis and your needs are straightforward. You'll build the extraction logic yourself, but you avoid all framework lock-in. I'd only go here if you have a strong opinion about how memory should behave in your specific application.
The mistakes I keep seeing:
- Storing everything. The biggest one. Developers dump entire conversation histories into memory and wonder why retrieval quality degrades. Memory should be selective - extract the facts, discard the noise. If you're storing more than you're retrieving, something's wrong.
- Not handling contradictions. User says "I use Postgres" in January, "we migrated to CockroachDB" in March. If your memory system stores both as separate facts, the agent gets confused. You need an update-or-supersede strategy, not just append.
- Skipping the stateless option. I've watched teams spend months building memory systems for applications that didn't need them. If your users aren't coming back for multi-session interactions, or if each request is independent, save yourself the complexity.
- Over-extracting. If you extract every fact from every conversation, your memory store fills with noise ("user said hello," "user asked about the weather"). Over-extraction kills retrieval precision. Under-extraction is better - you can always extract more later.
- Ignoring memory in testing. Memory introduces a new class of bugs: stale facts, hallucinated memories, race conditions between concurrent agents. If your test suite doesn't cover memory state, you'll find these bugs in production.
My current stack for a new project:
If I'm starting an LLM application today that needs memory, I start with a CLAUDE.md or AGENTS.md file for project-level knowledge. For user-specific memory, I evaluate Mem0 or LangMem depending on whether I'm in the LangChain ecosystem. If I need temporal reasoning, I look at Zep before building anything custom. I don't add graph memory until I have a concrete query that vector search can't answer.
The principle is always the same: add complexity only when the simpler option is provably failing. Every layer you add is a layer you have to debug, monitor, and maintain.
The pattern
Each step traded one kind of complexity for another:
context stuffing -> RAG (fixed context limits, added infra)
RAG -> fact extraction (cut tokens, added privacy risk)
fact extraction -> files (killed infra, limited scale)
files -> files + DB (brought scale back, brought complexity back)
What's next: the line between RAG, memory, and agent state is already blurring. The likely endpoint is a single memory layer that handles semantic memory (facts), episodic memory (past interactions), procedural memory (learned patterns), and graph memory (entity relationships) through one interface. Mem0 and DeepAgents are heading there.
The hard unsolved problem is multi-agent shared memory. Single-agent memory is largely figured out. A team of agents sharing context without stepping on each other? Nobody's cracked that yet. That's the frontier.
I'll probably need to update this in six months. That's the point.