# Architecture: Auto Context Management for LangGraph Agents
Date: 2025-11-05
## The Solution
The `BaseLangGraphAgent` now includes automatic message trimming that:

- Counts tokens before each request using the `tiktoken` library
- Detects overflow when the count exceeds the configured threshold
- Trims old messages while preserving:
  - System messages (always kept)
  - The N most recent messages (configurable)
- Updates the checkpointer to remove trimmed messages

This happens transparently: no code changes are required in agent implementations.
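For example, an agent subclass never has to reference trimming at all. A minimal, hypothetical sketch (the class name, import path, and `get_system_prompt()` hook are assumptions about the surrounding codebase, not the real API):

```python
# Hypothetical agent: inherits auto-trimming from BaseLangGraphAgent with no extra code.
from agents.base import BaseLangGraphAgent  # adjust to your project's module path


class GithubP2PAgent(BaseLangGraphAgent):
    def get_system_prompt(self) -> str:
        return "You are a GitHub automation agent."
```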
## Configuration

### Provider-Specific Defaults
The system automatically sets appropriate context limits based on your LLM provider:
| Provider | Context Window | Default Limit | Safety Margin |
|---|---|---|---|
| azure-openai | 128K | 100K | 28K (22%) for tools + response |
| openai | 128K-200K | 100K | 28K+ (22%+) for tools + response |
| aws-bedrock (Claude) | 200K | 150K | 50K (25%) for tools + response |
| anthropic-claude | 200K | 150K | 50K (25%) for tools + response |
| google-gemini | 1M-2M | 800K | 200K+ (20%+) for tools + response |
| gcp-vertexai | Varies | 150K | Conservative default |
**How it works:** The agent detects your `LLM_PROVIDER` environment variable and automatically sets an appropriate `MAX_CONTEXT_TOKENS` limit, with a safety margin for tool definitions and response generation.
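A minimal sketch of how such a provider-to-default mapping could look (the dictionary and function names are illustrative, not the actual internals; the values mirror the table above):

```python
import os

# Illustrative provider -> default token-limit mapping (values mirror the table above).
PROVIDER_DEFAULT_MAX_TOKENS = {
    "azure-openai": 100_000,
    "openai": 100_000,
    "aws-bedrock": 150_000,
    "anthropic-claude": 150_000,
    "google-gemini": 800_000,
    "gcp-vertexai": 150_000,
}


def resolve_max_context_tokens() -> int:
    """An explicit MAX_CONTEXT_TOKENS override wins; otherwise use the provider default."""
    override = os.getenv("MAX_CONTEXT_TOKENS")
    if override:
        return int(override)
    provider = os.getenv("LLM_PROVIDER", "azure-openai")
    # Unknown providers fall back to a conservative default.
    return PROVIDER_DEFAULT_MAX_TOKENS.get(provider, 100_000)
```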
### Environment Variables
Control the behavior via environment variables:
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `azure-openai` | LLM provider (determines the default context limit) |
| `ENABLE_AUTO_COMPRESSION` | `true` | Enable/disable auto-trimming |
| `MAX_CONTEXT_TOKENS` | provider-specific | Maximum tokens before trimming (auto-set per provider) |
| `MIN_MESSAGES_TO_KEEP` | `10` | Minimum number of recent messages to always keep |
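Putting the table together, the agent's startup configuration could be read like this (a sketch building on the `resolve_max_context_tokens()` helper above; the settings dictionary is illustrative):

```python
import os


def load_context_settings() -> dict:
    """Read the knobs from the table above, using the documented defaults."""
    return {
        "auto_compression": os.getenv("ENABLE_AUTO_COMPRESSION", "true").lower() == "true",
        "max_context_tokens": resolve_max_context_tokens(),  # sketched above
        "min_messages_to_keep": int(os.getenv("MIN_MESSAGES_TO_KEEP", "10")),
    }
```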
### Example Configurations

#### Default (Auto-Detected from Provider)

```yaml
# docker-compose.yml
services:
  agent-github-p2p:
    environment:
      - LLM_PROVIDER=azure-openai       # Automatically uses 100K token limit
      - ENABLE_AUTO_COMPRESSION=true
      # MAX_CONTEXT_TOKENS not set - uses provider default (100K)
```
#### Using AWS Bedrock (Claude)

```yaml
services:
  agent-github-p2p:
    environment:
      - LLM_PROVIDER=aws-bedrock        # Automatically uses 150K token limit
      - ENABLE_AUTO_COMPRESSION=true
      # MAX_CONTEXT_TOKENS not set - uses provider default (150K)
```
#### Using Google Gemini (Large Context)

```yaml
services:
  agent-github-p2p:
    environment:
      - LLM_PROVIDER=google-gemini      # Automatically uses 800K token limit
      - ENABLE_AUTO_COMPRESSION=true
      # MAX_CONTEXT_TOKENS not set - uses provider default (800K)
```
#### Custom Override (Any Provider)

```yaml
services:
  agent-github-p2p:
    environment:
      - LLM_PROVIDER=azure-openai
      - ENABLE_AUTO_COMPRESSION=true
      - MAX_CONTEXT_TOKENS=80000        # Override default: more aggressive trimming
      - MIN_MESSAGES_TO_KEEP=5          # Keep fewer messages
```
## How It Works
### 1. Token Counting
The system uses the `tiktoken` library to accurately count tokens:
```python
def _count_message_tokens(self, message: Any) -> int:
    """Count tokens in a message including content and tool calls."""
    content = str(message.content)
    # Add tokens for tool calls
    if hasattr(message, "tool_calls"):
        for tool_call in message.tool_calls:
            content += str(tool_call)
    return len(self.tokenizer.encode(content))
```
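The per-message counts are summed over the full message history before each request. A sketch of the aggregate check as a standalone function (the `cl100k_base` encoding and the function name are assumptions, not the agent's actual internals):

```python
import tiktoken

# Assumes a GPT-4-family encoding; other providers are approximated with the same tokenizer.
_tokenizer = tiktoken.get_encoding("cl100k_base")


def count_total_tokens(messages) -> int:
    """Sum token counts across all messages, mirroring _count_message_tokens() above."""
    total = 0
    for message in messages:
        content = str(message.content)
        # Include serialized tool calls, which can dominate the token budget.
        if getattr(message, "tool_calls", None):
            for tool_call in message.tool_calls:
                content += str(tool_call)
        total += len(_tokenizer.encode(content))
    return total
```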
### 2. Message Trimming

When the token count exceeds `MAX_CONTEXT_TOKENS`, old messages are removed from the checkpointer:

```python
async def _trim_messages_if_needed(self, config: RunnableConfig) -> None:
    """Trim old messages from checkpointer if context too large."""
    # Get current state
    state = await self.graph.aget_state(config)
    messages = state.values["messages"]

    # Count tokens
    total_tokens = self._count_total_tokens(messages)

    if total_tokens > self.max_context_tokens:
        # Separate system messages (keep) from conversation (trim)
        system_messages = [m for m in messages if isinstance(m, SystemMessage)]
        conversation_messages = [m for m in messages if not isinstance(m, SystemMessage)]

        # Keep the most recent N messages
        messages_to_keep = conversation_messages[-self.min_messages_to_keep:]
        messages_to_remove = conversation_messages[:-self.min_messages_to_keep]

        # Remove old messages from the checkpointer
        remove_commands = [RemoveMessage(id=msg.id) for msg in messages_to_remove]
        await self.graph.aupdate_state(config, {"messages": remove_commands})
```
### 3. Integration

Trimming happens automatically in the `stream()` method:

```python
async def stream(self, query: str, sessionId: str, trace_id: str = None):
    config = self.tracing.create_config(sessionId)

    # Ensure graph is initialized
    await self._ensure_graph_initialized(config)

    # Auto-trim old messages to prevent context overflow
    await self._trim_messages_if_needed(config)  # ← Automatic!

    # Continue streaming...
    async for state in self.graph.astream(inputs, config):
        yield state
```
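From the caller's perspective nothing changes; trimming stays invisible. A hypothetical consumer (the agent class name and session id are placeholders):

```python
import asyncio


async def main():
    agent = GithubP2PAgent()  # hypothetical subclass of BaseLangGraphAgent
    # Trimming runs inside stream() before any tokens are sent to the model.
    async for state in agent.stream("List my open pull requests", sessionId="session-123"):
        print(state)


asyncio.run(main())
```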
## Logging

### Initialization

At agent startup, you'll see provider-specific configuration:

```text
INFO: Context management initialized for provider=azure-openai: max_tokens=100000, min_messages=10, auto_compression=true
INFO: Context management initialized for provider=aws-bedrock: max_tokens=150000, min_messages=15, auto_compression=true
INFO: Context management initialized for provider=google-gemini: max_tokens=800000, min_messages=20, auto_compression=true
```
### Trimming Activity

When trimming occurs:

```text
WARNING: github: Context too large (186014 tokens > 100000). Trimming old messages...
INFO: github: ✂️ Trimmed 150 messages (86014 tokens). Kept 10 messages (100000 tokens)
```
### Normal Operation

Debug logging shows the checks even when no trimming is needed:

```text
DEBUG: github: Context size OK (45230 tokens)
```
## Disabling Auto-Compression

If you need to disable auto-compression (e.g., for testing):

```bash
export ENABLE_AUTO_COMPRESSION=false
```

Or in Docker Compose:

```yaml
environment:
  - ENABLE_AUTO_COMPRESSION=false
```

**Warning:** Disabling compression may cause context overflow errors on long conversations!
## Architecture

### Component Flow

```text
User Query
    ↓
BaseLangGraphAgent.stream()
    ↓
1. _ensure_graph_initialized()   ← Setup MCP + graph
    ↓
2. _trim_messages_if_needed()    ← Auto-compression ✂️
   ├─ aget_state()               ← Load current messages
   ├─ _count_total_tokens()      ← Check size
   ├─ RemoveMessage()            ← Delete old messages
   └─ aupdate_state()            ← Update checkpointer
    ↓
3. graph.astream()               ← Stream with trimmed context
    ↓
Response Stream
```
### What Gets Trimmed

**Always kept:**
- System messages (agent instructions)
- The most recent N messages (`MIN_MESSAGES_TO_KEEP`)

**Trimmed:**
- Old user queries
- Old agent responses
- Old tool calls/results
- Messages beyond the recent window
### Trimming Strategy

The algorithm works as follows:

1. **Check threshold:** Count all tokens in the checkpointer.
2. **If over limit:** Separate system messages from conversation messages.
3. **Keep recent N:** Preserve the last `MIN_MESSAGES_TO_KEEP` messages.
4. **Remove old:** Delete everything else.
5. **Aggressive trimming:** If still over the limit, trim further, keeping at least 2 messages (see the sketch below).
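A minimal sketch of the step-5 fallback (the function name and signature are illustrative; `count_tokens` stands in for the agent's token counter):

```python
def pick_messages_to_keep(conversation, count_tokens, max_tokens, min_keep):
    """Shrink the recent-message window until it fits, but never below 2 messages."""
    keep = min_keep
    while keep > 2 and count_tokens(conversation[-keep:]) > max_tokens:
        keep -= 1
    return conversation[-keep:]
```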
## Recommendations

### Azure OpenAI (GPT-4o)

**Development:**

```bash
export LLM_PROVIDER=azure-openai
export MAX_CONTEXT_TOKENS=80000   # More aggressive for testing
export MIN_MESSAGES_TO_KEEP=5
```

**Production:**

```bash
export LLM_PROVIDER=azure-openai
# Use default (100K) - no MAX_CONTEXT_TOKENS override needed
export MIN_MESSAGES_TO_KEEP=10
```
### AWS Bedrock / Anthropic Claude

**Production (Recommended):**

```bash
export LLM_PROVIDER=aws-bedrock
# Use default (150K) - leverages Claude's larger context
export MIN_MESSAGES_TO_KEEP=15   # Keep more history with the larger window
```

**High-Traffic:**

```bash
export LLM_PROVIDER=aws-bedrock
export MAX_CONTEXT_TOKENS=120000   # More aggressive to save costs
export MIN_MESSAGES_TO_KEEP=10
```
### Google Gemini

**Production (Leverage Large Context):**

```bash
export LLM_PROVIDER=google-gemini
# Use default (800K) - Gemini excels with large context
export MIN_MESSAGES_TO_KEEP=20   # Keep extensive history
```

**Cost-Optimized:**

```bash
export LLM_PROVIDER=google-gemini
export MAX_CONTEXT_TOKENS=400000   # Still ~3x GPT-4's 128K window
export MIN_MESSAGES_TO_KEEP=15
```
### General Guidelines
| Use Case | MAX_CONTEXT_TOKENS | MIN_MESSAGES_TO_KEEP | Notes |
|---|---|---|---|
| Development/Testing | 50-60% of provider limit | 5-8 | Aggressive trimming for faster iteration |
| Production | 70-80% of provider limit | 10-15 | Balanced approach |
| Long Conversations | 80-90% of provider limit | 15-20 | Preserve more context |
| Cost-Sensitive | 60-70% of provider limit | 8-10 | More frequent trimming to reduce tokens |
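If you prefer to derive `MAX_CONTEXT_TOKENS` from these fractions instead of hard-coding it, a small hypothetical helper (window sizes mirror the provider table above):

```python
# Hypothetical helper: derive MAX_CONTEXT_TOKENS from a provider's context window
# and a target fraction taken from the guidelines table.
PROVIDER_CONTEXT_WINDOW = {
    "azure-openai": 128_000,
    "aws-bedrock": 200_000,
    "google-gemini": 1_000_000,
}


def suggested_limit(provider: str, fraction: float = 0.75) -> int:
    """E.g. suggested_limit("aws-bedrock", 0.75) -> 150000, the production default."""
    return int(PROVIDER_CONTEXT_WINDOW[provider] * fraction)
```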
## Troubleshooting

### Issue: Context still overflowing

**Cause:** Tool definitions consuming too many tokens

**Solution:**
- Reduce `MAX_CONTEXT_TOKENS` to trigger earlier trimming
- Reduce `MIN_MESSAGES_TO_KEEP` to trim more aggressively
- Review tool schemas and simplify descriptions/parameters
### Issue: Agent "forgetting" context

**Cause:** `MIN_MESSAGES_TO_KEEP` too low

**Solution:** Increase `MIN_MESSAGES_TO_KEEP` to preserve more history
### Issue: Frequent trimming

**Cause:** `MAX_CONTEXT_TOKENS` set too low

**Solution:** Increase `MAX_CONTEXT_TOKENS` (but stay below the model's limit)
## Future Enhancements

Potential improvements:

- **Smart summarization:** Instead of deleting old messages, summarize them
- **Message importance:** Keep important messages (e.g., those containing decisions)
- **Tool result compression:** Compress large tool outputs
- **Per-agent tuning:** Different limits for different agent types
- **Metrics:** Track trimming frequency and token usage
## Example: Before/After

### Before (Context Overflow)

```text
Messages:     200
Total tokens: 186,014
Status:       ❌ Error - context_length_exceeded
```

### After (Auto-Compression)

```text
Messages:     10 (kept) + 190 (trimmed)
Total tokens: 98,450
Status:       ✅ Success - auto-compressed
Trimmed:      87,564 tokens
```
## Migration
No migration needed! All agents using BaseLangGraphAgent automatically get this feature.
Agents already deployed will start auto-compressing on their next request.
## Related
- Spec: spec.md