Agent Memory — Short-term and Long-term

Episode 3 · 18 min

Introduction: The Goldfish with a 3-Second Memory

There is an old joke that a goldfish has a 3-second memory. Now imagine an Agent with the same memory. Every time you talk to it, everything starts from scratch! It does not remember your name, your preferences, or even what you told it 5 minutes ago.

The reality is that LLMs are inherently memoryless. Every time you call the API, it starts from zero. The “memory” you see in ChatGPT or Claude is built by the systems surrounding the LLM — not by the LLM itself.

In this episode, we will explore how Agent memory systems work and how to build good memory for an Agent.

Why Does Memory Matter?

Without memory:

  • The user has to introduce themselves every time
  • The Agent cannot learn from its previous mistakes
  • Every conversation is independent and unrelated
  • User experience becomes terrible

With memory:

  • The Agent knows who you are and what you like
  • It uses past experiences
  • Conversations become interconnected
  • Response quality improves over time

Types of Memory

Like the human brain, Agent memory comes in different types. Let us examine each one:

1. Short-term Memory (Working Memory)

This is the current conversation history. When you are chatting with the Agent, previous messages are placed in the context window.

class ShortTermMemory:
    def __init__(self, max_messages: int = 50):
        self.messages = []
        self.max_messages = max_messages

    def add(self, role: str, content: str):
        self.messages.append({
            "role": role,
            "content": content
        })
        # If limit exceeded, remove old messages
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list:
        return self.messages.copy()

    def clear(self):
        self.messages = []

Important limitation: Every LLM has a limited context window. For example, GPT-4o has about 128K tokens and Claude about 200K tokens. When conversations get long, you cannot fit everything into the context.
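Counting tokens exactly requires the model's tokenizer (e.g. tiktoken for OpenAI models), but a rough characters-per-token heuristic is enough to sketch the idea of fitting a history into a budget. The 1,000-token budget and the ~4 characters-per-token ratio below are illustrative assumptions, not exact figures:

```python
# Sketch: trim the oldest messages until the history fits a token budget.
# The 4-chars-per-token ratio is a common rule of thumb for English text,
# not an exact count -- use the model's tokenizer (e.g. tiktoken) in practice.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_to_budget(messages: list, budget_tokens: int = 1000) -> list:
    """Keep the most recent messages whose estimated total fits the budget."""
    kept = []
    total = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if total + cost > budget_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order

history = [
    {"role": "user", "content": "x" * 2000},       # ~500 tokens
    {"role": "assistant", "content": "y" * 2000},  # ~500 tokens
    {"role": "user", "content": "z" * 2000},       # ~500 tokens
]
trimmed = trim_to_budget(history, budget_tokens=1000)
print(len(trimmed))  # 2 -- the oldest message no longer fits
```

The same idea generalizes to keeping a system prompt and summary inside the budget first, then filling the remainder with recent messages.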

2. Long-term Memory

Information that persists between sessions. Things like the user’s name, preferences, and order history. This information is typically stored in a database.

import json
from datetime import datetime

class LongTermMemory:
    def __init__(self, storage_path: str = "memory.json"):
        self.storage_path = storage_path
        self.data = self._load()

    def _load(self) -> dict:
        try:
            with open(self.storage_path, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {"facts": [], "preferences": {}, "history": []}

    def _save(self):
        with open(self.storage_path, 'w') as f:
            json.dump(self.data, f, ensure_ascii=False, indent=2)

    def store_fact(self, fact: str, source: str = "conversation"):
        self.data["facts"].append({
            "fact": fact,
            "source": source,
            "timestamp": datetime.now().isoformat()
        })
        self._save()

    def store_preference(self, key: str, value: str):
        self.data["preferences"][key] = value
        self._save()

    def recall_facts(self, query: str = None) -> list:
        if query is None:
            return self.data["facts"]
        # Simple search
        return [f for f in self.data["facts"]
                if query.lower() in f["fact"].lower()]

    def get_preference(self, key: str) -> str:
        return self.data["preferences"].get(key)

3. Episodic Memory

Episodic memory is like personal memories. The Agent remembers what events happened and learns from them.

Example: “Last time the user asked about Python, I gave a simple example and they liked it. So this time I should also give a simple example.”

class EpisodicMemory:
    def __init__(self):
        self.episodes = []

    def record_episode(self, situation: str, action: str,
                       outcome: str, feedback: str = None):
        self.episodes.append({
            "situation": situation,
            "action": action,
            "outcome": outcome,
            "feedback": feedback,
            "timestamp": datetime.now().isoformat()
        })

    def recall_similar(self, current_situation: str) -> list:
        """Find similar past experiences"""
        relevant = []
        for ep in self.episodes:
            # In practice, use similarity search here
            if self._is_similar(current_situation, ep["situation"]):
                relevant.append(ep)
        return relevant

    def _is_similar(self, a: str, b: str) -> bool:
        """Simple comparison — in practice use embeddings"""
        common_words = set(a.lower().split()) & set(b.lower().split())
        return len(common_words) > 3

4. Semantic Memory

Semantic memory includes general, structured knowledge. Like a personal encyclopedia. The difference from episodic memory is that it is not tied to a specific time or place.

Example: “Python is a high-level programming language” (semantic memory) versus “I taught the user Python yesterday” (episodic memory).
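The distinction can be sketched as two different stores: semantic facts are timeless and keyed by subject, while episodic entries are stamped with when they happened. This is a minimal illustrative sketch, not a standard API:

```python
from datetime import datetime

# Minimal sketch of the semantic/episodic split.
# Semantic: timeless facts keyed by subject. Episodic: timestamped events.

class SemanticMemory:
    def __init__(self):
        self.facts = {}  # subject -> set of facts

    def store(self, subject: str, fact: str):
        self.facts.setdefault(subject, set()).add(fact)

    def recall(self, subject: str) -> set:
        return self.facts.get(subject, set())

semantic = SemanticMemory()
semantic.store("Python", "high-level programming language")

# The same topic as an episodic entry carries a timestamp and a context:
episode = {
    "event": "taught the user Python basics",
    "timestamp": datetime.now().isoformat(),
}

print(semantic.recall("Python"))  # {'high-level programming language'}
```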

Vector Database: The Heart of Long-term Memory

Effective long-term memory is typically built on a vector database. The idea is simple:

  1. Convert text into an embedding (numerical vector)
  2. Store the vector in the database
  3. When you want to find something, embed your query too and find the most similar vectors

from openai import OpenAI
import numpy as np

client = OpenAI()

class VectorMemory:
    def __init__(self):
        self.memories = []  # List of (text, embedding)

    def _get_embedding(self, text: str) -> list:
        """Convert text to vector"""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def store(self, text: str, metadata: dict = None):
        """Store a memory"""
        embedding = self._get_embedding(text)
        self.memories.append({
            "text": text,
            "embedding": embedding,
            "metadata": metadata or {}
        })

    def recall(self, query: str, top_k: int = 5) -> list:
        """Recall the most relevant memories"""
        query_embedding = self._get_embedding(query)

        # Calculate cosine similarity
        scored = []
        for mem in self.memories:
            similarity = self._cosine_similarity(
                query_embedding, mem["embedding"]
            )
            scored.append((similarity, mem["text"]))

        # Sort by similarity
        scored.sort(reverse=True)
        return [text for _, text in scored[:top_k]]

    def _cosine_similarity(self, a: list, b: list) -> float:
        a = np.array(a)
        b = np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

In practice, instead of manual implementation, tools like ChromaDB, Pinecone, Weaviate, or Qdrant are used. They are more optimized and have more features.

Example with ChromaDB

import chromadb

# Create database
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="agent_memory")

# Store memories
collection.add(
    documents=[
        "The user's name is Ali and they are a Python developer",
        "The user prefers explanations with practical examples",
        "Last time they asked about FastAPI",
        "The user is not a beginner, they are intermediate level",
    ],
    ids=["fact1", "fact2", "fact3", "fact4"]
)

# Recall
results = collection.query(
    query_texts=["What level is the user?"],
    n_results=2
)
print(results["documents"])
# [['The user is not a beginner, they are intermediate level',
#   'The user prefers explanations with practical examples']]

Conversation Summarization

When a conversation gets long and exceeds the context window, a good solution is summarization. Instead of deleting old messages, you summarize them:

class ConversationSummarizer:
    def __init__(self, llm_client, max_messages: int = 20):
        self.client = llm_client
        self.max_messages = max_messages
        self.messages = []
        self.summary = ""

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

        if len(self.messages) > self.max_messages:
            self._summarize_old_messages()

    def _summarize_old_messages(self):
        """Summarize old messages"""
        # Summarize the first half
        to_summarize = self.messages[:len(self.messages)//2]
        to_keep = self.messages[len(self.messages)//2:]

        summary_prompt = f"""Previous conversation summary:
{self.summary}

New messages to summarize:
{self._format_messages(to_summarize)}

Please provide a comprehensive summary of the entire conversation so far.
Preserve important points, decisions, and key information."""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": summary_prompt}]
        )

        self.summary = response.choices[0].message.content
        self.messages = to_keep

    def get_context(self) -> list:
        """Get context for LLM"""
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{self.summary}"
            })
        context.extend(self.messages)
        return context

    def _format_messages(self, messages: list) -> str:
        return "\n".join(
            f"{m['role']}: {m['content']}" for m in messages
        )

Practical Memory Patterns

Pattern 1: Three-Layer Memory

A common architecture for Agent memory:

class ThreeLayerMemory:
    def __init__(self):
        # Layer 1: Current context (working memory)
        self.working_memory = ShortTermMemory(max_messages=20)

        # Layer 2: Conversation summaries (mid-term memory)
        self.conversation_summary = ""

        # Layer 3: Persistent knowledge (long-term memory)
        self.long_term = VectorMemory()

    def build_context(self, user_message: str) -> list:
        """Build complete context for LLM"""

        # Retrieve the most relevant info from long-term memory
        relevant_memories = self.long_term.recall(
            user_message, top_k=3
        )

        context = []

        # Previous conversation summary
        if self.conversation_summary:
            context.append({
                "role": "system",
                "content": f"Summary of previous interactions:\n"
                          f"{self.conversation_summary}"
            })

        # Relevant info from long-term memory
        if relevant_memories:
            memories_text = "\n".join(
                f"- {m}" for m in relevant_memories
            )
            context.append({
                "role": "system",
                "content": f"Relevant information:\n{memories_text}"
            })

        # Current conversation
        context.extend(self.working_memory.get_context())

        return context

Pattern 2: Automatic Information Extraction

The Agent can automatically extract important information from conversations and store it:

# A method of ThreeLayerMemory; assumes self.client (an OpenAI client),
# self.long_term (VectorMemory), and a _format_messages helper exist.
def extract_and_store(self, conversation: list):
    """Automatically extract important info from conversation"""

    extraction_prompt = """Extract important information from the conversation below.
Only write factual, important information, not opinions.
Format: one piece of information per line.

Conversation:
{conversation}

Extracted information:"""

    response = self.client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": extraction_prompt.format(
                conversation=self._format_messages(conversation)
            )
        }]
    )

    facts = response.choices[0].message.content.strip().split("\n")

    for fact in facts:
        fact = fact.strip("- ").strip()
        if fact:
            self.long_term.store(fact)

Problems and Challenges

Problem 1: Contradictory Memories

A user might say contradictory things in different conversations. For example, they might say “I am a Java developer” and later “I only know Python.” You need a system that prioritizes newer information.
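One simple policy is "last write wins": key each fact by its topic and keep only the newest entry. The topic-string keying below is a toy assumption; in practice you would likely need an LLM or embedding similarity to detect that two facts are about the same thing:

```python
from datetime import datetime

# Sketch of a "last write wins" policy for contradictory facts.
# Facts are keyed by a topic string; storing a new fact for the same
# topic replaces the old one. Real systems need smarter conflict
# detection (e.g. an LLM judging whether two facts contradict).

class ConflictAwareMemory:
    def __init__(self):
        self.facts = {}  # topic -> (fact, timestamp)

    def store(self, topic: str, fact: str):
        self.facts[topic] = (fact, datetime.now().isoformat())

    def recall(self, topic: str):
        entry = self.facts.get(topic)
        return entry[0] if entry else None

memory = ConflictAwareMemory()
memory.store("primary_language", "I am a Java developer")
memory.store("primary_language", "I only know Python")  # newer wins
print(memory.recall("primary_language"))  # I only know Python
```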

Problem 2: Privacy

Long-term memory means storing user data. This has privacy implications. You must let users view and delete their memory.
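Concretely, that means exposing inspection and deletion operations on the memory store. A minimal sketch follows; the method names are assumptions, not a standard interface:

```python
# Sketch of user-facing memory controls: view everything stored about
# a user and delete it on request. Method names are illustrative.

class UserMemoryStore:
    def __init__(self):
        self.records = {}  # user_id -> list of stored facts

    def store(self, user_id: str, fact: str):
        self.records.setdefault(user_id, []).append(fact)

    def view(self, user_id: str) -> list:
        """Let the user see everything stored about them."""
        return list(self.records.get(user_id, []))

    def forget(self, user_id: str):
        """Delete all memory for this user ('right to be forgotten')."""
        self.records.pop(user_id, None)

store = UserMemoryStore()
store.store("ali", "prefers practical examples")
print(store.view("ali"))   # ['prefers practical examples']
store.forget("ali")
print(store.view("ali"))   # []
```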

Problem 3: Scalability

As memory grows, search becomes slower. Vector databases help here, but you need the right architecture.

Problem 4: Noise

Not everything in a conversation is important. The Agent must be able to distinguish what is worth storing.
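A first line of defense is a cheap rule-based filter that discards obvious noise before anything reaches long-term storage; an LLM-based importance score can sit behind it for the harder cases. The length threshold and greeting list below are arbitrary assumptions for illustration:

```python
# Sketch: cheap pre-filter that discards obvious noise before storage.
# The threshold and greeting list are arbitrary choices for illustration.

GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "ok", "bye"}

def worth_storing(text: str) -> bool:
    stripped = text.strip().lower()
    if len(stripped) < 15:            # too short to be a useful fact
        return False
    if stripped.rstrip("!.") in GREETINGS:
        return False
    return True

print(worth_storing("hi"))                                # False
print(worth_storing("The user prefers concise answers"))  # True
```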

Combining Memory with an Agent

Let us see how a complete Agent with memory works:

class MemoryAgent:
    def __init__(self):
        self.memory = ThreeLayerMemory()
        self.client = OpenAI()

    def chat(self, user_message: str) -> str:
        # Store user message
        self.memory.working_memory.add("user", user_message)

        # Build context with memory
        context = self.memory.build_context(user_message)

        # Send to LLM
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "You are a smart assistant with memory."},
                *context,
            ],
        )

        reply = response.choices[0].message.content

        # Store response
        self.memory.working_memory.add("assistant", reply)

        # Extract important info for long-term memory
        self._maybe_extract_info(user_message, reply)

        return reply

    def _maybe_extract_info(self, user_msg: str, assistant_msg: str):
        """Detect and store important information"""
        # Every 10 messages, extract important info from the last 10
        msgs = self.memory.working_memory.messages
        if len(msgs) % 10 == 0:
            self.memory.extract_and_store(msgs[-10:])

# Usage
agent = MemoryAgent()
agent.chat("Hi, my name is Alex")
agent.chat("I am a frontend developer")
# ... a few days later ...
agent.chat("Hello")
# Agent: "Hey Alex! How are you? Got more frontend questions?"

Summary

  • Short-term memory: Current conversation history — simple but limited by context window
  • Long-term memory: Stored in Vector DB — persistent and searchable
  • Episodic memory: Memories and experiences — for learning from the past
  • Semantic memory: Structured knowledge — like a personal encyclopedia
  • Summarization: The solution for limited context windows

The next episode is about Planning — when the Agent thinks before it acts and plans ahead. One of the most fascinating parts of building Agents!