Context Management and Error Resilience Architecture

Status: 🟢 In-use
Category: Architecture
Date: December 13, 2025

Overview

Implemented comprehensive context management and error recovery mechanisms to prevent agent crashes, context window overflow, and A2A stream failures. This four-layer architecture ensures agents remain responsive and helpful even when encountering errors or resource constraints.

Motivation

Prior to this change, agents experienced multiple critical issues:

1. Context Window Overflow

Symptom: ValidationException: Input is too long for requested model
Impact: Agent crashes, conversation lost, supervisor stops responding
Root Cause: No pre-flight checking before sending messages to LLM
Frequency: Common with large tool outputs (e.g., list_pull_requests returning 50+ PRs)

2. Orphaned Tool Calls

Symptom: Found AIMessages with tool_calls that do not have a corresponding ToolMessage
Impact: LangGraph validation error, conversation breaks
Root Cause: Tool calls made but ToolMessage not returned (interrupted, failed, or timeout)
Frequency: Moderate, especially with RAG agent calls

3. A2A Queue Closure Spam

Symptom: "Queue is closed. Event will not be enqueued." × 35 messages
Impact: Log noise, unclear what's happening, difficult debugging
Root Cause: No tracking of queue state, logs every failed enqueue attempt

4. Loss of Conversation Context

Symptom: Context trimming deletes messages without preserving information
Impact: Agent forgets recent context, asks repeated questions
Root Cause: Simple message deletion instead of intelligent summarization

Testing Strategy

Test Scenarios

Context Overflow with Large Tool Output
- Query: "show all PRs in ai-platform-engineering"
- Expected: Pre-flight check triggers, messages summarized, request succeeds
Orphaned Tool Call Recovery
- Scenario: RAG tool called but fails to return
- Expected: Synthetic ToolMessage added, conversation continues
Queue Closure Handling
- Scenario: Client disconnects mid-stream
- Expected: Single "Queue closed" log, subsequent events dropped silently

Manual Testing

# Test context overflow
docker logs caipe-supervisor | grep "Pre-flight check detected"

# Test LangMem summarization
docker logs caipe-supervisor | grep "Summarizing.*messages with LangMem"

# Verify queue closure (should see 1 message, not 35+)
docker logs caipe-supervisor | grep "Queue is closed" | wc -l

# Test orphaned tool call recovery
docker logs caipe-supervisor | grep "synthetic ToolMessages"

MCP Tool Error Handling (commit 46f42d35): Prevents tool failures from closing A2A streams
Tool Output Truncation (commit 25682e66): Safety net for oversized tool outputs
gh CLI Integration (commit 30eb7fb7): Adds GitHub Actions debugging capabilities

Overview​

Motivation​

1. Context Window Overflow​

2. Orphaned Tool Calls​

3. A2A Queue Closure Spam​

4. Loss of Conversation Context​

Testing Strategy​

Test Scenarios​

Manual Testing​

Related Changes​

Related​