Context Management and Error Resilience Architecture

Status: 🟢 In-use
Category: Architecture
Date: December 13, 2025

Overview

This change adds context management and error recovery mechanisms that prevent agent crashes, context window overflow, and A2A stream failures. The four-layer architecture keeps agents responsive and helpful when they encounter errors or resource constraints.

Problem Statement

Prior to this change, agents experienced multiple critical issues:

1. Context Window Overflow

Symptom: ValidationException: Input is too long for requested model
Impact: Agent crashes, conversation lost, supervisor stops responding
Root Cause: No pre-flight checking before sending messages to LLM
Frequency: Common with large tool outputs (e.g., list_pull_requests returning 50+ PRs)

2. Orphaned Tool Calls

Symptom: Found AIMessages with tool_calls that do not have a corresponding ToolMessage
Impact: LangGraph validation error, conversation breaks
Root Cause: Tool call made but no ToolMessage returned (call interrupted, failed, or timed out)
Frequency: Moderate, especially with RAG agent calls

3. A2A Queue Closure Spam

Symptom: "Queue is closed. Event will not be enqueued." logged 35 times
Impact: Log noise that obscures what is happening and makes debugging difficult
Root Cause: Queue state is not tracked, so every failed enqueue attempt is logged

4. Loss of Conversation Context

Symptom: Context trimming deletes messages without preserving information
Impact: Agent forgets recent context, asks repeated questions
Root Cause: Simple message deletion instead of intelligent summarization

Solution Architecture

Layer 1: Pre-flight Context Check (BaseLangGraphAgent)

Location: ai_platform_engineering/utils/a2a_common/base_langgraph_agent.py

Functionality:

async def _preflight_context_check(config, query):
    # Estimate tokens: system + history + query + tool schemas
    total_estimated = system_tokens + history_tokens + query_tokens + tool_schema_tokens
    
    # Trigger at 80% of max (leave room for response)
    if total_estimated > (max_context_tokens * 0.8):
        # Use LangMem to summarize old messages
        summary = await summarize_messages(old_messages)
        # Replace old messages with summary SystemMessage
        # ✅ Context preserved, tokens reduced

Benefits:

✅ Proactive prevention (before LLM call)
✅ Preserves context via LangMem summarization
✅ Configurable threshold (default 80%)
✅ Falls back to deletion if LangMem unavailable
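
The check above can be sketched end to end without any LLM dependency. This is a minimal illustration under stated assumptions, not the project's implementation: `Msg`, `estimate_tokens`, `summarize`, and `preflight` are hypothetical names, the four-characters-per-token estimate is a rough heuristic, and `summarize` is a placeholder for the LangMem-backed summarizer.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Msg:
    """Hypothetical stand-in for a LangChain message."""
    role: str
    content: str


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


async def summarize(messages: list) -> str:
    # Placeholder for an LLM-backed summarizer: keeps the first
    # 40 characters of each message so the sketch runs offline.
    return " | ".join(m.content[:40] for m in messages)


async def preflight(history, query, max_context_tokens, *,
                    system_tokens=0, tool_schema_tokens=0,
                    threshold=0.8, keep_recent=4):
    """Return history unchanged if under budget, else summary + recent messages."""
    history_tokens = sum(estimate_tokens(m.content) for m in history)
    total = (system_tokens + history_tokens
             + estimate_tokens(query) + tool_schema_tokens)
    if total <= max_context_tokens * threshold:
        return history  # enough headroom left for the response

    # Split into older messages (to summarize) and recent ones (kept verbatim).
    split = max(0, len(history) - keep_recent)
    old, recent = history[:split], history[split:]
    if not old:
        return history  # nothing old enough to summarize

    summary = await summarize(old)
    return [Msg("system", f"Summary of earlier conversation: {summary}")] + recent
```

Keeping the most recent messages verbatim is a design choice of this sketch: the summary carries older context, while the latest turns stay exact so the agent does not re-ask questions it was just answered.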

Layer 2: Supervisor Exception Recovery (Platform Engineer)

Location: ai_platform_engineering/multi_agents/platform_engineer/protocol_bindings/a2a/agent.py

Functionality:

except ValueError as ve:
    if "tool_calls that do not have a corresponding ToolMessage" in str(ve):
        # Add synthetic ToolMessages for orphaned calls (IDs tracked in Layer 4)
        synthetic_msgs = [
            ToolMessage(content="Tool interrupted", tool_call_id=call_id)
            for call_id in pending_tool_calls
        ]
        await graph.aupdate_state(config, {"messages": synthetic_msgs})
        # ✅ Conversation recovered, LangGraph happy
    
    elif "Input is too long" in str(ve):
        # Summarize conversation with LangMem
        summary = await summarize_messages(all_messages)
        await graph.aupdate_state(config, {"messages": [SystemMessage(summary)]})
        # ✅ Context preserved, overflow resolved

Benefits:

✅ Graceful recovery from validation errors
✅ Supervisor stays responsive
✅ Context preserved via summarization
✅ Clear error messages to users
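
To make the orphaned-call recovery concrete, the sketch below scans a message history for tool calls that never received a reply and builds one synthetic reply per orphan. `AIMsg`, `ToolMsg`, and `synthesize_missing_tool_messages` are hypothetical stand-ins written for this example; a real supervisor would work with LangChain message objects and pass the result to `graph.aupdate_state` as in the snippet above.

```python
from dataclasses import dataclass, field


@dataclass
class AIMsg:
    """Hypothetical stand-in for an AI message that may request tool calls."""
    content: str
    tool_calls: list = field(default_factory=list)  # each: {"id": ..., "name": ...}


@dataclass
class ToolMsg:
    """Hypothetical stand-in for a tool result message."""
    content: str
    tool_call_id: str


def synthesize_missing_tool_messages(messages):
    """Return one synthetic ToolMsg per tool call that has no reply in history."""
    answered = {m.tool_call_id for m in messages if isinstance(m, ToolMsg)}
    synthetic = []
    for m in messages:
        if not isinstance(m, AIMsg):
            continue
        for call in m.tool_calls:
            if call["id"] not in answered:
                synthetic.append(ToolMsg(
                    content=f"Tool '{call['name']}' was interrupted before returning a result.",
                    tool_call_id=call["id"],
                ))
    return synthetic
```

Running the function again on the history plus its own output returns an empty list, which is the property the recovery path relies on: once every call has a reply, validation passes.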

Layer 3: A2A Queue Lifecycle Management (SupervisorExecutor)

Location: ai_platform_engineering/multi_agents/platform_engineer/protocol_bindings/a2a/agent_executor.py

Functionality:

async def _safe_enqueue_event(self, event_queue, event):
    try:
        await event_queue.enqueue_event(event)
        # Reset closure flag if queue reopens
        if self._queue_closed_logged:
            logger.info("Queue reopened")
            self._queue_closed_logged = False
    except Exception as e:
        if "Queue is closed" in str(e):
            # Log ONCE when first closed
            if not self._queue_closed_logged:
                logger.warning("Event queue closed")
                self._queue_closed_logged = True
            # Then SILENT (no spam)

Benefits:

✅ Eliminates log spam (35+ → 1 message)
✅ Detects queue reopening
✅ Cleaner logs for debugging
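
The log-once behavior can be exercised in isolation with a fake queue. This is a sketch under assumptions: `ClosedQueue` and `SafeEnqueuer` are hypothetical classes written for the example, and re-raising unrelated exceptions is a choice made here, not something the source specifies.

```python
import asyncio
import logging

logger = logging.getLogger("queue_sketch")


class ClosedQueue:
    """Hypothetical queue that rejects every event, as after a client disconnect."""

    async def enqueue_event(self, event):
        raise RuntimeError("Queue is closed. Event will not be enqueued.")


class SafeEnqueuer:
    def __init__(self):
        self._queue_closed_logged = False
        self.dropped = 0  # count of events silently dropped

    async def safe_enqueue(self, event_queue, event) -> bool:
        """Return True if the event was enqueued, False if it was dropped."""
        try:
            await event_queue.enqueue_event(event)
        except Exception as exc:
            if "Queue is closed" not in str(exc):
                raise  # unrelated failures should still surface
            self.dropped += 1
            if not self._queue_closed_logged:
                # Warn once on the first failure, then stay silent.
                logger.warning("Event queue closed; dropping subsequent events")
                self._queue_closed_logged = True
            return False
        if self._queue_closed_logged:
            # A successful enqueue after a closure means the queue is usable again.
            logger.info("Queue reopened")
            self._queue_closed_logged = False
        return True
```

Tracking a `dropped` counter alongside the flag keeps the information that repeated log lines used to carry, so the total can be reported once at the end of a stream if needed.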

Layer 4: Tool Call Tracking (Supervisor)

Location: ai_platform_engineering/multi_agents/platform_engineer/protocol_bindings/a2a/agent.py

Functionality:

# Track pending tool calls
pending_tool_calls = {}  # {tool_call_id: tool_name}

# When AIMessage with tool_call:
pending_tool_calls[tool_call_id] = tool_name

# When ToolMessage received:
pending_tool_calls.pop(tool_call_id, None)  # Mark as resolved; ignore unknown IDs

# On error, check for orphans:
if pending_tool_calls:
    # Add synthetic ToolMessages

Benefits:

✅ Ensures every tool call gets ToolMessage
✅ Prevents LangGraph validation errors
✅ Enables error recovery
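
The bookkeeping above fits in a small helper class. The sketch below is illustrative only; `ToolCallTracker` and its method names are hypothetical and do not come from the codebase.

```python
class ToolCallTracker:
    """Tracks tool calls that have been issued but not yet answered."""

    def __init__(self):
        self._pending = {}  # {tool_call_id: tool_name}

    def started(self, tool_call_id: str, tool_name: str) -> None:
        # Record a tool call when the AI message requesting it is seen.
        self._pending[tool_call_id] = tool_name

    def finished(self, tool_call_id: str) -> None:
        # Mark a call resolved; unknown or duplicate IDs are ignored.
        self._pending.pop(tool_call_id, None)

    def orphans(self) -> dict:
        # Return a copy so callers cannot mutate the internal state.
        return dict(self._pending)


tracker = ToolCallTracker()
tracker.started("call-1", "list_pull_requests")
tracker.started("call-2", "rag_search")
tracker.finished("call-1")
print(tracker.orphans())  # {'call-2': 'rag_search'}
```

On an error, each entry returned by `orphans()` identifies a tool call that still needs a synthetic ToolMessage.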

LangMem Integration

What is LangMem?

LangMem is a LangChain library for intelligent conversation memory management. Instead of deleting messages, it:

Extracts key information
Summarizes conversations
Preserves context in compressed form

Dependencies Added

# ai_platform_engineering/utils/pyproject.toml
dependencies = [
    "langmem>=0.0.30",
    ...
]

# ai_platform_engineering/multi_agents/pyproject.toml
dependencies = [
    "langmem>=0.0.30",
    ...
]

Usage Pattern

from langmem import summarize_messages

# Instead of: messages = messages[-10:]  # Delete old, lose context
# Do this:
summary = await summarize_messages(old_messages)
messages = [SystemMessage(summary)] + recent_messages  # Preserve context

Configuration

Environment Variables

# Tool output truncation (safety net); set one value
MAX_TOOL_OUTPUT_SIZE=10000    # Default: 10KB (safe for smaller models)
# MAX_TOOL_OUTPUT_SIZE=50000  # Alternative for larger context models

# Context management
# (Uses provider-specific limits from context_config.py)
# AWS Bedrock Claude: 100K tokens
# OpenAI GPT-4: 128K tokens
# Azure OpenAI: Configurable
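
As an illustration of how the truncation safety net might consume `MAX_TOOL_OUTPUT_SIZE`, here is a minimal sketch. The function name `truncate_tool_output`, the character-based size measure, and the wording of the truncation marker are assumptions of this example rather than the project's actual behavior.

```python
import os

DEFAULT_MAX_TOOL_OUTPUT_SIZE = 10_000  # matches the documented default


def truncate_tool_output(output: str, max_size=None) -> str:
    """Cut oversized tool output and append a marker noting how much was dropped."""
    if max_size is None:
        # Read the limit at call time so environment changes take effect.
        max_size = int(os.getenv("MAX_TOOL_OUTPUT_SIZE", DEFAULT_MAX_TOOL_OUTPUT_SIZE))
    if len(output) <= max_size:
        return output
    omitted = len(output) - max_size
    return output[:max_size] + f"\n[truncated {omitted} characters]"
```

Appending an explicit marker lets the model see that data is missing, so it can ask for a narrower query instead of treating the truncated output as complete.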

Auto-Configuration

The system automatically:

Detects LLM provider from LLM_PROVIDER env var
Sets appropriate context limits via get_context_limit_for_provider()
Triggers pre-flight check at 80% of limit
Summarizes to 50% of limit when triggered
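
The thresholds above reduce to simple arithmetic on the provider limit. The sketch below shows one way to derive them; `lookup_context_limit` is a hypothetical stand-in for `get_context_limit_for_provider()`, and the provider keys and default limit are assumptions (only the 100K and 128K figures come from this document).

```python
import os

# Limits as listed in this document; the dictionary keys are hypothetical.
CONTEXT_LIMITS = {
    "aws-bedrock": 100_000,
    "openai": 128_000,
}
DEFAULT_LIMIT = 100_000  # assumed fallback for unknown providers


def lookup_context_limit(provider=None) -> int:
    """Resolve the token limit from an explicit provider or LLM_PROVIDER."""
    provider = (provider or os.getenv("LLM_PROVIDER", "")).lower()
    return CONTEXT_LIMITS.get(provider, DEFAULT_LIMIT)


def context_budget(provider=None) -> dict:
    """Compute the trigger (80%) and summarization target (50%) in tokens."""
    limit = lookup_context_limit(provider)
    return {
        "limit": limit,
        "trigger_at": limit * 80 // 100,    # pre-flight check fires above this
        "summarize_to": limit * 50 // 100,  # target size after summarization
    }


print(context_budget("openai"))
# {'limit': 128000, 'trigger_at': 102400, 'summarize_to': 64000}
```

Integer arithmetic is used for the percentages to avoid floating-point rounding in the token counts.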

Testing Strategy

Test Scenarios

1. Context Overflow with Large Tool Output
- Query: "show all PRs in ai-platform-engineering"
- Expected: Pre-flight check triggers, messages summarized, request succeeds

2. Orphaned Tool Call Recovery
- Scenario: RAG tool called but fails to return
- Expected: Synthetic ToolMessage added, conversation continues

3. Queue Closure Handling
- Scenario: Client disconnects mid-stream
- Expected: Single "Queue closed" log, subsequent events dropped silently

Manual Testing

# Test context overflow
docker logs caipe-supervisor | grep "Pre-flight check detected"

# Test LangMem summarization
docker logs caipe-supervisor | grep "Summarizing.*messages with LangMem"

# Verify queue closure (should see 1 message, not 35+)
docker logs caipe-supervisor | grep "Queue is closed" | wc -l

# Test orphaned tool call recovery
docker logs caipe-supervisor | grep "synthetic ToolMessages"

Migration Path

For Existing Deployments

Update dependencies: uv sync in utils and multi_agents directories
No configuration changes required - works with existing settings
Gradual rollout: LangMem gracefully falls back if import fails
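
The graceful fallback can be pictured as a capability check followed by a strategy choice. This is a sketch, assuming the decision is made by probing for the package; `langmem_available`, `choose_strategy`, and `fallback_trim` are hypothetical names, and the real code may detect the import failure differently.

```python
import importlib.util


def langmem_available() -> bool:
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec("langmem") is not None


def fallback_trim(messages, keep_last=10):
    """Deletion fallback: keep only the most recent messages."""
    if keep_last <= 0:
        return []
    return list(messages[-keep_last:])


def choose_strategy() -> str:
    """Pick summarization when LangMem is installed, deletion otherwise."""
    return "summarize" if langmem_available() else "delete"
```

Because the check happens at runtime, a deployment that has not yet run `uv sync` keeps working with the deletion strategy and picks up summarization once the dependency is installed.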

For New Agents

All new agents using BaseLangGraphAgent automatically get:

✅ Pre-flight context checking
✅ LangMem summarization
✅ Tool output truncation
✅ Error handling

Performance Impact

Overhead

Pre-flight check: ~10ms (token counting only when approaching limit)
LangMem summarization: ~2-5s (calls LLM once to summarize)
Tool output truncation: <1ms (string operations)
Queue tracking: <1ms (boolean flag check)

Benefits

Prevents crashes: No more "Input is too long" errors
Preserves context: Users don't lose conversation history
Reduces retries: Fewer failed requests = better UX
Cleaner logs: 97% reduction in "Queue is closed" spam

Alternatives Considered

1. Simple Message Deletion (Behavior Before This Change)

❌ Loses context
❌ Reactive (after error)
✅ Fast
Decision: Keep as fallback when LangMem unavailable

2. Fixed-Size Sliding Window

❌ Loses context
❌ Doesn't adapt to message sizes
✅ Predictable
Decision: Rejected, LangMem is better

3. Increase Context Limits

❌ Not all models support larger contexts
❌ Higher costs
❌ Slower responses
Decision: Rejected, manage efficiently instead

Future Enhancements

Short-term (Next Sprint)

Background Memory Manager: Use LangMem's background processing to continuously extract key facts
User Profiles: Store user preferences and context across sessions
Semantic Search: Query conversation summaries for relevant past context

Long-term (Q1 2026)

Multi-session Memory: Persist summaries in LangGraph Store
Smart Summarization Triggers: Summarize based on topic shifts, not just token count
Memory Tools: Let agents explicitly manage their own memories

Related Changes

MCP Tool Error Handling (commit 46f42d35): Prevents tool failures from closing A2A streams
Tool Output Truncation (commit 25682e66): Safety net for oversized tool outputs
gh CLI Integration (commit 30eb7fb7): Adds GitHub Actions debugging capabilities

References

Author

Sri Aradhyula <sraradhy@cisco.com>

Reviewers

Approval

Code reviewed
Tested in development
Tested in staging
Ready for production
