Noorle agents use a three-tier memory system that balances efficiency, context retention, and cost. This design prevents token waste and ensures agents remember important information across long conversations.

Why Three Tiers?

Without memory management, long conversations become expensive: every inference resends the entire message history, so token usage grows with every turn. The three-tier design keeps only recent messages in the prompt, compresses older history into summaries, and archives the rest. Result: conversations up to 25x more efficient, same context quality.

The Three Tiers

Tier 1: Working Memory (Recent Context)

Storage: Redis (in-memory cache)
Lifetime: Hours to days
Scope: Last N messages (configurable, default 20)
Access: Every inference

What it contains:
  • Most recent user messages
  • Agent responses
  • Tool call results
  • Current context and state
Example:
Working Memory (last 5 messages)
├─ User: "Search for AI trends"
├─ Agent: "I'll search the web..."
├─ Tool Call: web_search(query="AI trends")
├─ Tool Result: ["Article 1: ...", "Article 2: ..."]
└─ Agent: "Based on the search, here are trends..."
Why fast access matters:
  • Included in every LLM prompt
  • If latency > 100ms, user perceives slowness
  • Redis provides sub-millisecond access
  • In-process cache provides fallback
Configuration:
{
  "memory_config": {
    "working_memory_size": 20,        // messages to keep
    "working_memory_ttl": 86400       // seconds (1 day)
  }
}
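A minimal in-process sketch of a bounded working memory honoring `working_memory_size` and `working_memory_ttl`. The production tier uses Redis; the `WorkingMemory` class and its methods here are illustrative, not the Noorle API:

```python
import time
from collections import deque

class WorkingMemory:
    """Illustrative sketch: keeps only the last N messages, like working_memory_size."""

    def __init__(self, size=20, ttl=86400):
        self.messages = deque(maxlen=size)  # oldest entries drop automatically
        self.ttl = ttl                      # seconds, like working_memory_ttl

    def append(self, role, content):
        self.messages.append({"role": role, "content": content, "ts": time.time()})

    def context(self):
        """Messages still within the TTL, oldest first, ready for the next prompt."""
        cutoff = time.time() - self.ttl
        return [m for m in self.messages if m["ts"] >= cutoff]

memory = WorkingMemory(size=5)
for i in range(8):
    memory.append("user", f"message {i}")
print(len(memory.context()))  # 5 -- only the most recent messages survive
```

The `deque(maxlen=...)` mirrors Redis `LPUSH` + `LTRIM`, and the TTL filter mirrors key expiry; both evictions happen without any explicit cleanup code.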

Tier 2: Summary Memory (Compressed History)

Storage: Object storage + Redis cache
Lifetime: Days to months
Scope: Summarized sessions and key insights
Access: As needed (cached for 1 hour)

What happens: When working memory reaches capacity (20 messages), the oldest messages are summarized:
Situation:
  Working memory is full (20 messages)
  New message arrives
  Need to make room

Process:
  1. Compress oldest 10 messages into summary
     "User asked about AI trends. Agent researched
      and identified 3 key developments: 1) Multimodal
      models improving, 2) Cost decreasing, 3) Enterprise
      adoption accelerating."

  2. Store summary in object storage with metadata
     {
       "id": "summary-123",
       "time_range": "2024-03-01 to 2024-03-05",
       "summary": "...",
       "key_facts": ["...", "..."],
       "embedding": [0.12, 0.34, ...]
     }

  3. Cache summary in Redis

  4. Discard original messages from working memory

  5. Working memory now has space for new message
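The five steps above can be sketched as follows. The `summarize` function stands in for an LLM call, and plain dicts stand in for object storage and the Redis cache; none of these names are the actual Noorle API:

```python
def summarize(messages):
    # Placeholder for an LLM summarization request.
    return f"Summary of {len(messages)} messages, starting with: {messages[0]['content']}"

def compact(working_memory, summary_cache, object_store, window=10):
    """Compress the oldest `window` messages into one summary (steps 1-5)."""
    oldest = working_memory[:window]                    # step 1: pick the oldest messages
    summary = {"id": f"summary-{len(object_store)}",
               "summary": summarize(oldest)}
    object_store[summary["id"]] = summary               # step 2: persist in object storage
    summary_cache.append(summary)                       # step 3: cache in Redis
    del working_memory[:window]                         # step 4: discard the originals
    return working_memory                               # step 5: room for new messages

wm = [{"role": "user", "content": f"m{i}"} for i in range(20)]
store, cache = {}, []
compact(wm, cache, store)
print(len(wm), len(cache))  # 10 1
```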
Token efficiency:
10 original messages = 5,000 tokens
1 LLM-generated summary = 500 tokens

Compression ratio: 10x reduction
Retrieval: The agent can fetch relevant summaries on demand, checking the Redis cache first and falling back to object storage.
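A cache-first retrieval sketch, again with dicts standing in for Redis and object storage (`get_summary` is illustrative, not the real API):

```python
def get_summary(summary_id, cache, object_store):
    """Cache-first lookup: fast Redis-style cache, then the durable object store."""
    summary = cache.get(summary_id)
    if summary is not None:
        return summary                   # cache hit: sub-millisecond in production
    summary = object_store[summary_id]   # cache miss: slower, durable tier
    cache[summary_id] = summary          # warm the cache (~1 hour TTL in production)
    return summary

store = {"summary-123": {"summary": "User asked about AI trends..."}}
cache = {}
first = get_summary("summary-123", cache, store)   # miss: reads object storage
second = get_summary("summary-123", cache, store)  # hit: served from cache
print(first is second)  # True
```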

Tier 3: Archive (Long-term Storage)

Storage: Object storage (immutable)
Lifetime: Forever (compliance)
Scope: All history beyond summaries
Access: Rarely (audit, compliance)

What it contains:
  • Message journal (all original messages)
  • Summaries older than 1 month
  • Complete conversation transcript
  • Legal hold data (if applicable)
Never deleted (except by explicit request). Used for:
  • Compliance: Audit trail of all agent activity
  • Legal discovery: Retrieve conversations from specific date range
  • Debugging: Understand what happened in past conversations
  • ML training: Fine-tune models on real conversations (with consent)
Example access:
# Retrieve all messages from agent during March 2024
GET /api/agents/{agent_id}/archive?start=2024-03-01&end=2024-03-31

Response:
{
  "messages": [
    {
      "timestamp": "2024-03-05T10:30:00Z",
      "role": "user",
      "content": "What are market trends?"
    },
    {
      "timestamp": "2024-03-05T10:31:00Z",
      "role": "assistant",
      "content": "..."
    }
  ]
}
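A client-side sketch of the archive query above, using only the standard library to construct the request URL. The endpoint path comes from the example; the base URL and agent ID are placeholders, and authentication is omitted:

```python
from urllib.parse import urlencode

def archive_url(base, agent_id, start, end):
    """Builds the GET URL for retrieving archived messages in a date range."""
    query = urlencode({"start": start, "end": end})
    return f"{base}/api/agents/{agent_id}/archive?{query}"

url = archive_url("https://api.example.com", "agent-42", "2024-03-01", "2024-03-31")
print(url)
# https://api.example.com/api/agents/agent-42/archive?start=2024-03-01&end=2024-03-31
```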

Memory Flow Over Time

New message ─▶ Working Memory (Redis, last 20 messages)
                    │ at capacity: oldest 10 messages summarized
                    ▼
               Summary Memory (object storage + Redis cache)
                    │ summaries older than 1 month
                    ▼
               Archive (immutable object storage, retained ~7 years)

Semantic Search in Memory

Memory summaries are embedded as vectors for semantic search.

How it works:
  1. Summary is created
  2. LLM extracts key topics
  3. Topics are embedded as vectors
  4. Vectors stored with summary
  5. Semantic queries find relevant summaries
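Step 5 can be sketched as a cosine-similarity ranking over stored summary embeddings. The 3-dimensional vectors below are toy values for illustration; real embeddings come from a model and have hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, summaries, top_k=1):
    """Rank stored summaries by similarity to the query embedding."""
    ranked = sorted(summaries, key=lambda s: cosine(query_vec, s["embedding"]),
                    reverse=True)
    return ranked[:top_k]

summaries = [
    {"summary": "Discussed pricing models", "embedding": [0.9, 0.1, 0.0]},
    {"summary": "Researched AI trends",     "embedding": [0.1, 0.9, 0.2]},
]
hits = search([0.8, 0.2, 0.1], summaries)  # query vector close to "pricing"
print(hits[0]["summary"])  # Discussed pricing models
```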

Configuration

Fine-tune memory behavior per agent:
{
  "memory_config": {
    "working_memory_size": 20,          // messages
    "working_memory_ttl": 86400,        // 1 day (seconds)
    "enable_summarization": true,
    "summarization_threshold": 20,      // summarize when this many messages
    "summary_window_size": 10,          // compress 10 messages at a time
    "enable_semantic_search": true,     // enable vector search
    "archive_retention_days": 2555      // keep archive for 7 years
  }
}
Trade-offs:
| Setting | Cheap | Expensive |
| --- | --- | --- |
| working_memory_size | 5 messages | 100 messages |
| enable_summarization | false | true |
| summarization_threshold | 10 (frequent) | 100 (rare) |
| enable_semantic_search | false | true |
Increase for more context, decrease for cost savings.
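A back-of-the-envelope estimator for this trade-off, using the 10x compression figure from Tier 2. The per-message token count and the function itself are illustrative assumptions, not measured values:

```python
def prompt_tokens(history_msgs, working_size=20, summarize=True,
                  tokens_per_msg=500, compression=10):
    """Rough tokens sent per inference for a conversation with history_msgs messages."""
    recent = min(history_msgs, working_size)
    older = history_msgs - recent
    if summarize:
        # Older messages travel as summaries at ~1/compression of their original size.
        return recent * tokens_per_msg + (older * tokens_per_msg) // compression
    return history_msgs * tokens_per_msg  # everything resent verbatim

without = prompt_tokens(1000, summarize=False)  # 500,000 tokens per inference
with_sum = prompt_tokens(1000, summarize=True)  # 59,000 tokens per inference
print(without // with_sum)  # ~8x fewer tokens for a 1000-message history
```

The savings grow with conversation length, since the verbatim path scales linearly while the summarized path scales at roughly a tenth of that rate.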

Example: Long Conversation

User conversation over 1 month: 1000+ messages

Day 1-5:  Working memory captures 20 messages
          Discussions about AI trends, pricing, implementation

Day 6-10: Messages M1-M10 summarized
          Summaries stored, original messages removed
          New messages M21-M30 added to working memory

Day 20:   User asks: "What was the conclusion about pricing?"
          Agent searches summaries (not working memory)
          Finds: "User concerned about costs. Recommended
                  tiered pricing model."
          Agent answers: "Based on our discussion, we
                        concluded that tiered pricing
                        balances cost and features."

Month 2:  All summaries and archives available
          but working memory only contains recent chat
          Token usage: minimal, cost: low

Year 2:   Audit request: "Show all conversations from 2024"
          Retrieve from archive: 365 days of history
          Complete transcript with all messages
          Used for compliance verification

Memory Best Practices

Enable Summarization

Enable summarization for conversations longer than about an hour; it reduces token costs roughly 10x.

Set Appropriate TTL

Match the working-memory TTL to the use case: 1 day for customer support agents, 1 hour for real-time agents.

Use Semantic Search

Enable semantic search for agents that need historical context; it helps them find relevant past summaries.

Archive for Compliance

Always keep archives to meet regulatory requirements; 7 years of retention is typical.

Memory Costs

For current pricing, see Pricing.
Next: Learn about Knowledge and RAG for semantic search.