## Why Three Tiers?
Without memory management, long conversations become expensive: every past message stays in the prompt, so the tokens sent per turn grow with the length of the conversation. Tiered memory keeps prompts small by sending only recent messages plus compact summaries.

Result: 25x more efficient conversations, same context quality.

## The Three Tiers
### Tier 1: Working Memory (Recent Context)
- **Storage:** Redis (in-memory cache)
- **Lifetime:** Hours to days
- **Scope:** Last N messages (configurable, default 20)
- **Access:** Every inference

What it contains:

- Most recent user messages
- Agent responses
- Tool call results
- Current context and state
Why it must be fast:

- Working memory is included in every LLM prompt
- If latency exceeds 100 ms, the user perceives slowness
- Redis provides sub-millisecond access
- An in-process cache provides a fallback
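As a minimal sketch of the fixed-size window described above (using an in-process `collections.deque` in place of Redis; the class and method names are illustrative, not the platform's API):

```python
from collections import deque

class WorkingMemory:
    """Fixed-size window over the most recent messages (Tier 1).

    Illustrative sketch: a real deployment would back this with Redis
    so the window survives restarts and is shared across workers.
    """

    def __init__(self, max_messages=20):
        # deque with maxlen evicts the oldest entry automatically
        self._messages = deque(maxlen=max_messages)

    def add(self, role, content):
        self._messages.append({"role": role, "content": content})

    def to_prompt(self):
        """Return the window in the shape sent with every LLM call."""
        return list(self._messages)

# Usage: only the last N messages are ever included in the prompt.
wm = WorkingMemory(max_messages=3)
for i in range(5):
    wm.add("user", f"message {i}")
print(len(wm.to_prompt()))  # 3 — the two oldest messages were evicted
```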
### Tier 2: Summary Memory (Compressed History)
- **Storage:** Object storage + Redis cache
- **Lifetime:** Days to months
- **Scope:** Summarized sessions and key insights
- **Access:** As needed (cached for 1 hour)

What happens: when working memory reaches capacity (20 messages), the oldest messages are summarized and moved into this tier.

### Tier 3: Archive (Long-term Storage)
- **Storage:** Object storage (immutable)
- **Lifetime:** Forever (compliance)
- **Scope:** All history beyond summaries
- **Access:** Rarely (audit, compliance)

What it contains:

- Message journal (all original messages)
- Summaries older than 1 month
- Complete conversation transcript
- Legal hold data (if applicable)
Why it exists:

- Compliance: audit trail of all agent activity
- Legal discovery: retrieve conversations from a specific date range
- Debugging: understand what happened in past conversations
- ML training: fine-tune models on real conversations (with consent)
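A hedged sketch of the message journal and a date-range retrieval (the actual object-storage layout is not specified here; an append-only JSON Lines file stands in for it):

```python
import json
import tempfile
import time
from pathlib import Path

def append_to_journal(path, role, content):
    """Append one immutable journal record (Tier 3). Records are never
    rewritten, which is what makes the archive audit-friendly."""
    record = {"ts": time.time(), "role": role, "content": content}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def read_range(path, start_ts, end_ts):
    """Retrieve records from a time range, e.g. for legal discovery."""
    out = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if start_ts <= rec["ts"] <= end_ts:
                out.append(rec)
    return out

# Usage: write two records, then pull the full range back.
journal = Path(tempfile.mkstemp(suffix=".jsonl")[1])
append_to_journal(journal, "user", "hello")
append_to_journal(journal, "agent", "hi there")
records = read_range(journal, 0, time.time() + 1)
print(len(records))  # 2
```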
## Memory Flow Over Time
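In outline, a message moves down the tiers as it ages. A sketch of that mapping, using the lifetimes listed above (the exact thresholds are illustrative, not platform-defined):

```python
def tier_for_message(age_days):
    """Map a message's age to the tier that serves it, following the
    tier lifetimes on this page (illustrative cutoffs)."""
    if age_days <= 1:      # hours to days: working memory (Redis)
        return "working"
    if age_days <= 30:     # days to months: summary memory
        return "summary"
    return "archive"       # everything older: immutable archive

print(tier_for_message(0.1))  # working
print(tier_for_message(7))    # summary
print(tier_for_message(400))  # archive
```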
## Semantic Search in Memory
Memory summaries are embedded as vectors for semantic search. How it works:

1. A summary is created
2. The LLM extracts key topics
3. The topics are embedded as vectors
4. The vectors are stored with the summary
5. Semantic queries find relevant summaries
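The steps above can be sketched end to end. This toy version uses a bag-of-words vector and cosine similarity in place of a real embedding model and vector store:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector. A real system would
    call an embedding model and store dense vectors instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-4: summaries are created and stored with their vectors.
summaries = [
    "user asked about refund policy for annual plans",
    "user debugged a failing webhook integration",
]
index = [(s, embed(s)) for s in summaries]

# Step 5: a semantic query finds the most relevant summary.
query = embed("what is the refund policy")
best = max(index, key=lambda item: cosine(query, item[1]))
print(best[0])  # the refund-policy summary
```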
## Configuration
Fine-tune memory behavior per agent:

| Setting | Cheap | Expensive |
|---|---|---|
| `working_memory_size` | 5 messages | 100 messages |
| `enable_summarization` | false | true |
| `summarization_threshold` | 10 (frequent) | 100 (rare) |
| `enable_semantic_search` | false | true |
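For example, a cost-optimized agent would use the left-hand column values. The setting names come from the table; the surrounding structure is an assumption, not the platform's config schema:

```python
# Hypothetical cost-optimized memory settings (structure is illustrative;
# only the field names are taken from the table above).
cheap_memory_config = {
    "working_memory_size": 5,         # keep only 5 recent messages
    "enable_summarization": False,    # skip summarization LLM calls
    "summarization_threshold": 10,
    "enable_semantic_search": False,  # no embedding or vector storage
}
```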
## Example: Long Conversation
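To see why tiering matters for long conversations, compare the prompt tokens sent per turn with and without a memory window. The message counts and token sizes below are assumed for the sake of the arithmetic:

```python
# Assumed sizes: 100 tokens per message, 200-token running summary.
TOKENS_PER_MESSAGE = 100
SUMMARY_TOKENS = 200
WINDOW = 20  # working-memory size from this page

def tokens_sent(total_messages, tiered):
    """Prompt tokens for one turn at a given conversation length."""
    if not tiered:
        # No memory management: the full transcript goes in every prompt.
        return total_messages * TOKENS_PER_MESSAGE
    # Tiered: the last 20 messages plus a compact summary of the rest.
    recent = min(total_messages, WINDOW)
    summary = SUMMARY_TOKENS if total_messages > WINDOW else 0
    return recent * TOKENS_PER_MESSAGE + summary

# At message 500 of a long conversation:
print(tokens_sent(500, tiered=False))  # 50000
print(tokens_sent(500, tiered=True))   # 2200
```

With these assumed sizes the per-turn cost is roughly 23x lower, in line with the efficiency figure quoted at the top of this page.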
## Memory Best Practices
### Enable Summarization

Enable summarization for conversations longer than one hour; it reduces cost by roughly 10x.
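A minimal sketch of the summarization trigger, assuming the threshold-based behavior described under Tier 2 (the string-building step stands in for an LLM summarization call):

```python
def maybe_summarize(messages, threshold=20):
    """When working memory reaches the threshold, compress the older
    half into a summary and keep only the recent half verbatim."""
    if len(messages) < threshold:
        return messages, None
    old, recent = messages[: threshold // 2], messages[threshold // 2 :]
    # Stand-in for an LLM call that summarizes `old`:
    summary = f"[summary of {len(old)} earlier messages]"
    return recent, summary

msgs = [f"m{i}" for i in range(20)]
recent, summary = maybe_summarize(msgs)
print(len(recent), summary)  # 10 [summary of 10 earlier messages]
```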
### Set Appropriate TTL

Set the working-memory TTL to about 1 day for customer-support agents and 1 hour for real-time agents.
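In Redis this would be a native key TTL (`EXPIRE`); the sketch below models the same expiry behavior with explicit timestamps so it is visible in plain Python (class and method names are illustrative):

```python
import time

class TTLStore:
    """Dict-like store whose entries expire after ttl_seconds, mimicking
    a Redis TTL on working-memory keys (illustrative, not the real API)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (stored_at, value)

    def set(self, key, value, now=None):
        self._data[key] = (now if now is not None else time.time(), value)

    def get(self, key, now=None):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        current = now if now is not None else time.time()
        if current - stored_at > self.ttl:
            del self._data[key]  # expired: evict lazily on read
            return None
        return value

# 1-day TTL for a customer-support agent's working memory:
store = TTLStore(ttl_seconds=86_400)
store.set("session:42", "recent context", now=0.0)
print(store.get("session:42", now=3_600.0))    # recent context
print(store.get("session:42", now=100_000.0))  # None — expired
```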
### Use Semantic Search

Enable semantic search for agents that need historical context; it surfaces relevant summaries from past sessions without scanning full transcripts.
### Archive for Compliance

Always retain archives to meet regulatory requirements; 7 years is a typical retention period.
## Memory Costs

For current pricing, see Pricing.

Next: Learn about Knowledge and RAG for semantic search.