
ADR: ArgoCD Agent OOM Analysis & Resolution

Date: 2025-11-05

Root Cause Analysis

Investigation Results

  1. Memory Behavior

    • Native Process: Peaked at 630-692 MB when processing all 819 apps
    • Docker Container: Hit OOM and was killed (exit code 137) with 2GB limit
    • Memory spike occurs during large response generation, not data fetching
  2. Actual Root Cause: GPT-4o Output Token Limit

    • GPT-4o Max Output Tokens: ~16,384 tokens (16K)
    • Required for 819 apps: ~82,000 tokens (819 apps × ~100 tokens each in markdown table)
    • Result: the LLM attempts to generate the full response, hits the output limit, and the stream disconnects (see the arithmetic sketch after this list)
  3. Why Docker OOM Occurs

    • Agent loads all 819 apps into memory (255 KB of JSON → ~630 MB of Python objects)
    • The LLM tries to generate a massive response
    • Memory accumulates while the LLM keeps processing output it cannot deliver
    • Docker's stricter memory accounting triggers the OOM kill before a graceful failure
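
The arithmetic in item 2 is worth making explicit. A back-of-the-envelope check in Python, using the per-row estimate from the analysis above (~100 tokens per app in a markdown table):

```python
# Back-of-the-envelope output-token estimate using the figures above.
TOKENS_PER_APP_ROW = 100          # rough cost of one app as a markdown table row
GPT4O_MAX_OUTPUT_TOKENS = 16_384  # GPT-4o's output ceiling

def fits_in_one_response(app_count: int) -> bool:
    """True if a full table of app_count apps fits in a single completion."""
    return app_count * TOKENS_PER_APP_ROW <= GPT4O_MAX_OUTPUT_TOKENS

print(fits_in_one_response(10))   # True  (~1,000 tokens)
print(fits_in_one_response(819))  # False (~81,900 tokens, roughly 5x the limit)
```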

Evidence

  • ✅ Small queries (10 apps): Work perfectly, full streaming, ~245 MB memory
  • ❌ Large queries (819 apps): Stream disconnects after tool completion, before data output
  • ✅ Native agent survives with 630 MB peak
  • ❌ Docker kills at 2 GB (insufficient for overhead + peak)

Solution Implemented

1. System Prompt Update

Added pagination rules to the ArgoCD agent's system prompt (the string-list entries added in agent.py):

"**CRITICAL - Response Size Limits**: When listing applications, you MUST paginate responses due to output token limits:",
" - If the tool returns >50 applications, show ONLY a summary with key statistics",
" - Then show the FIRST 20 applications in a table format",
" - Inform the user they can ask for 'next 20' or filter by project/namespace",
" - NEVER attempt to list all 819 applications in a single response",

2. Docker Memory Limit Increase

Updated docker-compose.dev.yaml:

```yaml
mem_limit: 4g
mem_reservation: 2g
```

This provides headroom for:

  • 630 MB peak application data
  • LLM processing overhead
  • Docker container overhead
  • Python garbage collection delays
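
One way to verify these numbers empirically, assuming psutil is available (a diagnostic aid only, not part of the shipped agent):

```python
# Hedged sketch: sample the agent process's RSS while it handles a large query,
# to confirm the ~630 MB peak reported above. Assumes psutil is installed.
import time
import psutil

def watch_rss(pid: int, interval_s: float = 0.5, duration_s: float = 60.0) -> int:
    """Poll the resident set size of `pid` and return the peak in bytes."""
    proc = psutil.Process(pid)
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval_s)
    return peak

# Example: run watch_rss(agent_pid) while replaying the 819-app query,
# then compare the peak against the 4 GB mem_limit.
```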

Azure OpenAI + LangChain Considerations

Known Issues:

  • Timeouts with inputs >15K tokens
  • Performance degradation with large streaming responses
  • Memory consumption spikes during large response generation

Recommendations:

  • Use latest API versions for better streaming
  • Implement load balancing/fallbacks
  • Monitor and adjust max_tokens parameter
  • Implement proper error handling for timeouts (see the configuration sketch after this list)
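
A minimal configuration sketch of these recommendations using langchain_openai's AzureChatOpenAI. The deployment name and API version are placeholders, and credentials are assumed to come from the standard AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY environment variables:

```python
# Hedged sketch: configure AzureChatOpenAI per the recommendations above.
# Deployment name and API version are placeholders.
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",         # placeholder deployment name
    api_version="2024-08-01-preview",  # prefer a recent API version for streaming
    max_tokens=4096,                   # cap output well below the ~16K hard limit
    timeout=60,                        # fail fast on large inputs instead of hanging
    max_retries=2,                     # basic retry on transient errors
    streaming=True,
)

# LangChain's runnable fallbacks can route to a secondary deployment on failure:
# llm_with_fallback = llm.with_fallbacks([backup_llm])
```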

Files Modified

  1. ai_platform_engineering/agents/argocd/agent_argocd/protocol_bindings/a2a_server/agent.py

    • Added pagination guidelines to system prompt
  2. docker-compose.dev.yaml

    • Increased agent-argocd-p2p memory limit to 4GB
  3. ai_platform_engineering/utils/a2a_common/base_langgraph_agent.py

    • Added (but disabled) chunking infrastructure for future use (sketched after this list)
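
The disabled chunking helper is not reproduced in this ADR; a minimal sketch of the idea, with hypothetical names, mirroring the 20-per-page rule in the prompt:

```python
# Hypothetical sketch of the (currently disabled) chunking idea: split a large
# tool result into fixed-size pages so each response stays under the output
# token budget. page_size=20 mirrors the pagination rule in the system prompt.
from typing import Iterator, Sequence

def paginate(items: Sequence[dict], page_size: int = 20) -> Iterator[list[dict]]:
    """Yield successive pages of page_size items."""
    for start in range(0, len(items), page_size):
        yield list(items[start:start + page_size])

# Example: render only the first page, plus a hint that more pages exist.
# pages = paginate(apps); first_page = next(pages)
```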

Conclusion

The issue was NOT a traditional OOM from memory leaks, but rather:

  1. LLM hitting output token limits when trying to generate massive responses
  2. Memory accumulating during failed response generation
  3. Docker's stricter limits catching this before graceful failure

The fix is primarily prompt engineering to enforce pagination, with increased Docker memory as a safety buffer.