AI Architecture · 8 min read · By Ravi Shankar

Quick Answer

How AI agent memory systems work — covering in-context memory, vector-based long-term memory, episodic memory, and implementation patterns for production agents.

AI Agent Memory: Short-Term, Long-Term, and Episodic

Memory is what separates a one-shot AI interaction from a persistent AI agent. Without memory, every conversation starts from zero — the agent has no knowledge of past interactions, learned preferences, or accumulated context. With well-designed memory systems, agents become progressively more useful over time.


The Four Types of AI Agent Memory

1. In-Context Memory (Working Memory)

The most immediate form of memory: the content of the current context window.

What it holds: The current conversation, retrieved documents, tool results, and the agent's reasoning steps within the current session.

Limitations: Context windows are finite (typically 8K-1M tokens). Long sessions exceed limits. Nothing persists across sessions by default.

Implementation: Managed by including prior turns in the message history sent to the LLM on each call.
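
A minimal sketch of this, assuming an OpenAI-style chat completions client (model name and variable names are illustrative): the working memory is simply the message list that gets re-sent on every call.

# Minimal sketch of in-context (working) memory: prior turns are re-sent
# as part of the message list on every call. Assumes an OpenAI-style chat
# client; the model name is just an example.
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,  # the entire history is the agent's working memory
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply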


2. External Storage (Episodic Memory)

Information stored outside the model and retrieved when relevant. This is how agents remember across sessions.

Types:

  • Conversation history: Past interactions with a specific user or about a specific topic
  • Fact records: Specific facts learned ("User prefers formal tone", "Company X is a customer")
  • Action history: What actions the agent has taken in the past

Implementation: Store summaries or full transcripts in a database. Retrieve relevant past interactions at the start of new sessions.
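
As one hedged sketch, with SQLite standing in for whatever database you run in production (table and function names are assumptions):

# Illustrative episodic store backed by SQLite; the schema and helper
# names are assumptions for this sketch, not a specific library's API.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("agent_memory.db")
conn.execute("""CREATE TABLE IF NOT EXISTS episodes (
    user_id TEXT, summary TEXT, created_at TEXT)""")

def store_episode(user_id: str, summary: str) -> None:
    conn.execute(
        "INSERT INTO episodes VALUES (?, ?, ?)",
        (user_id, summary, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

def recent_episodes(user_id: str, limit: int = 5) -> list[str]:
    # Pull the most recent summaries to seed the next session's context
    rows = conn.execute(
        "SELECT summary FROM episodes WHERE user_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [r[0] for r in rows]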


3. Vector Memory (Semantic Memory)

Embedding-based storage that enables semantic retrieval — finding relevant memories based on meaning, not just keyword matching.

How it works:

  1. Memory items are embedded as vectors (using models like text-embedding-3-large)
  2. Stored in a vector database (Pinecone, Qdrant, Weaviate, pgvector)
  3. At query time, the current context is embedded and nearest neighbors retrieved
  4. Relevant memories are inserted into the context window

Use cases: Knowledge bases, document retrieval, past conversation retrieval where exact keyword matching would miss semantically relevant content.
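
The four steps above can be sketched with an in-memory store and brute-force cosine similarity; a production system would delegate storage and nearest-neighbor search to one of the vector databases listed. The helper names here are illustrative.

# Hedged sketch of the retrieve-by-meaning loop using an in-memory store;
# a real deployment would use Pinecone, Qdrant, Weaviate, or pgvector.
import numpy as np
from openai import OpenAI

client = OpenAI()
memory_vectors: list[tuple[str, np.ndarray]] = []  # (text, embedding)

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(resp.data[0].embedding)

def remember(text: str) -> None:
    memory_vectors.append((text, embed(text)))          # steps 1-2: embed and store

def recall(query: str, top_k: int = 3) -> list[str]:
    q = embed(query)                                     # step 3: embed the query
    scored = sorted(
        memory_vectors,
        key=lambda item: float(np.dot(item[1], q)
                               / (np.linalg.norm(item[1]) * np.linalg.norm(q))),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]          # step 4: insert into context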


4. Procedural Memory

Stored knowledge about how to do things — not facts but skills and processes.

In AI agents, this manifests as:

  • System prompt instructions
  • Few-shot examples demonstrating desired behavior
  • Learned workflows stored as structured procedures

Implementation: Store successful workflow patterns and surface them when the agent encounters similar situations.
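
A hedged sketch of that idea: successful workflows stored as structured records and turned into prompt text when a similar task type comes up again (the record shape is an assumption):

# Illustrative procedural memory: successful workflows stored as structured
# records and injected into the prompt when a similar task appears.
from dataclasses import dataclass, field

@dataclass
class Procedure:
    task_type: str                                  # e.g. "refund_request"
    steps: list[str] = field(default_factory=list)  # ordered actions that worked

procedures: dict[str, Procedure] = {}

def record_procedure(task_type: str, steps: list[str]) -> None:
    procedures[task_type] = Procedure(task_type, steps)

def procedure_prompt(task_type: str) -> str:
    proc = procedures.get(task_type)
    if proc is None:
        return ""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(proc.steps))
    return f"A workflow that worked for similar tasks:\n{numbered}"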


Memory Architecture Patterns

Pattern 1: Session-Scoped Memory

Memory exists only for the duration of a session. When the session ends, memory is cleared.

Appropriate for: Customer service interactions, single-use task completion, privacy-sensitive applications where persistent memory creates risk.

Implementation: In-context only. Conversation history accumulates during the session.


Pattern 2: User-Scoped Memory

Memory persists for a specific user across sessions.

Appropriate for: Personal assistants, customer-facing agents where continuity improves experience.

Implementation:

from datetime import datetime

class UserMemoryManager:
    # `storage` is any client exposing upsert/query (e.g. a vector DB wrapper)
    # and `embed` is an embedding function; both are assumed to be provided.
    def __init__(self, user_id: str, storage):
        self.user_id = user_id
        self.storage = storage

    def store_fact(self, fact: str, metadata: dict):
        # Persist the fact with its embedding so it can be found semantically later
        self.storage.upsert({
            "user_id": self.user_id,
            "fact": fact,
            "embedding": embed(fact),
            "metadata": metadata,
            "timestamp": datetime.now()
        })

    def retrieve_relevant(self, query: str, top_k: int = 5) -> list:
        # Embed the query and pull the nearest stored facts for this user
        query_embedding = embed(query)
        return self.storage.query(
            user_id=self.user_id,
            query_embedding=query_embedding,
            top_k=top_k
        )
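
Usage is then a matter of writing facts as they are learned and reading them back before answering; vector_store below is a hypothetical stand-in for whatever client backs the manager.

# Hypothetical usage of the manager above.
user_memory = UserMemoryManager(user_id="user-123", storage=vector_store)
user_memory.store_fact("Prefers concise, formal replies", metadata={"source": "chat"})
relevant = user_memory.retrieve_relevant("How should I phrase this email?")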

Pattern 3: Task-Scoped Memory

Memory is scoped to a specific task or project that may span multiple sessions.

Appropriate for: Project management agents, document drafting agents, research agents working on long-form tasks.

Implementation: A task-specific memory store is created when a task begins and archived when it completes.
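
A minimal sketch of that lifecycle, with plain dictionaries standing in for real stores (names are illustrative):

# Illustrative task-scoped memory: a store keyed by task_id that is
# created when the task starts and archived when it completes.
task_stores: dict[str, list[str]] = {}
archived_stores: dict[str, list[str]] = {}

def start_task(task_id: str) -> None:
    task_stores[task_id] = []

def remember_for_task(task_id: str, note: str) -> None:
    task_stores[task_id].append(note)

def complete_task(task_id: str) -> None:
    # Move the task's memory out of the active set but keep it for auditing
    archived_stores[task_id] = task_stores.pop(task_id)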


Memory Management Challenges

The Context Window Bottleneck

Retrieving 50 relevant memories into a context window that's already half-full from conversation history creates problems. Solutions:

Summarization: Compress older conversation history into summaries that retain key facts with fewer tokens.

Memory tiering: Keep the most recent interactions in full detail; summarize older interactions; archive very old interactions.

Selective retrieval: Don't retrieve everything that might be relevant — retrieve only the most relevant based on the current query.
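
A hedged sketch of the tiering idea, where summarize is any callable that asks an LLM for a compressed summary; the threshold and prompt wording are assumptions.

# Hedged sketch of memory tiering: keep the last N turns verbatim and
# compress everything older into a rolling summary.
def tier_history(turns: list[str], summarize, keep_last: int = 10) -> list[str]:
    if len(turns) <= keep_last:
        return turns
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarize(
        "Summarize the key facts and decisions from this conversation:\n"
        + "\n".join(older)
    )
    return [f"Summary of earlier conversation: {summary}"] + recent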


Memory Conflicts

Over time, stored facts may become outdated or conflict:

  • "User is at Company X" → user changes jobs → "User is at Company Y"
  • Earlier conversation says budget is $100K; later conversation says $150K

Handling conflicts:

  • Always retrieve with timestamps and surface recency to the LLM
  • Design memory storage to prefer newer information for factual records
  • For structured facts, implement upsert semantics rather than append
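
These rules can be sketched together in a few lines; the fact keys and store shape below are assumptions:

# Illustrative upsert semantics for structured facts: the newest value for
# a given key replaces the old one, and timestamps are kept so recency can
# be surfaced to the model at retrieval time.
from datetime import datetime, timezone

facts: dict[str, dict] = {}

def upsert_fact(key: str, value: str) -> None:
    facts[key] = {"value": value, "updated_at": datetime.now(timezone.utc)}

upsert_fact("employer", "Company X")
upsert_fact("employer", "Company Y")   # job change overwrites the stale fact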

Privacy and Compliance

Persistent memory creates GDPR and privacy obligations:

  • Users must be able to request deletion of their memory
  • Memory data is personal data requiring appropriate protection
  • Retention periods should be defined and enforced

Implement memory with a delete_user_memory(user_id) function from the start.
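
What that hook might look like against the stores sketched earlier; the delete call on the vector side is an assumption, so substitute whatever your database exposes.

# Illustrative deletion hook covering every store that holds personal data.
# `conn` is the SQLite connection and `vector_store` the vector client from
# the sketches above; the vector delete API shown is an assumption.
def delete_user_memory(user_id: str) -> None:
    conn.execute("DELETE FROM episodes WHERE user_id = ?", (user_id,))
    conn.commit()
    vector_store.delete(filter={"user_id": user_id})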


Practical Implementation with LangChain

from langchain.memory import ConversationSummaryBufferMemory
from langchain_community.vectorstores import Qdrant
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")  # any chat model works here

# In-context memory with automatic summarization
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,  # summarize once history exceeds this
    return_messages=True
)

# Long-term vector memory (in-memory Qdrant for this sketch;
# point at your cluster in production)
embeddings = OpenAIEmbeddings()
vectorstore = Qdrant.from_texts(
    texts=["memory store initialized"],  # needs at least one text to size vectors
    embedding=embeddings,
    location=":memory:",
    collection_name="user_memories"
)

def build_context_with_memory(user_id: str, current_query: str) -> str:
    # Retrieve recent conversation (summarized buffer)
    recent_context = memory.load_memory_variables({}).get("history", [])

    # Retrieve relevant long-term memories for this user
    relevant_memories = vectorstore.similarity_search(
        query=current_query,
        k=5,
        filter={"user_id": user_id}
    )

    # Combine into a single context string
    memory_context = "\n".join([m.page_content for m in relevant_memories])

    return f"Relevant history:\n{memory_context}\n\nRecent conversation:\n{recent_context}"

Conclusion

Memory transforms AI agents from stateless question-answering systems into persistent assistants that learn and improve with use. The right memory architecture depends on use case, privacy requirements, and performance constraints — but some form of memory beyond the context window is essential for production agents that create lasting value.

