Orphaned Tool Call Repair for Bedrock Multi-Turn Conversations

Status: Implemented
Category: Bug Fix / Resilience
Date: February 24, 2026
PRs: #842 (supervisor fixes), cnoe-agent-utils#31 (OTel fix)

Overview

This change set improves supervisor resilience during multi-turn conversations that delegate to sub-agents when AWS Bedrock is the LLM provider. It addresses two problems: orphaned tool calls that permanently break a conversation, and a response_format incompatibility with Bedrock's Converse API.

Problem Statement

1. Orphaned Tool Calls Break Multi-Turn Conversations

Symptom: After 2-3 turns involving sub-agent delegation, users see:

✅ I've recovered from an interrupted tool call. Let me continue processing your request...
❌ Recovery retry failed. Please ask your question again.

Root Cause: When a sub-agent call (e.g., AWS_Agent, GitHub_Agent) times out or the client disconnects mid-stream, LangGraph records an AIMessage with tool_calls but no corresponding ToolMessage. On the next turn, Bedrock's Converse API rejects the conversation with:

ValidationException: Expected toolResult blocks at messages.0.content
for the following Ids: tooluse_y6Ma8ihoB4Lqbmm4bumT7p

Impact: Conversation becomes permanently broken for that context. Users must start a new session.

Frequency: Common in multi-turn conversations with sub-agent delegations, especially when responses are large (ArgoCD listing 800+ apps, GitHub listing many PRs).

2. Bedrock `response_format` Causes Prefill ValidationException

Symptom: Sub-agents using aws-bedrock provider fail with:

ValidationException: This model does not support assistant message prefill.
The conversation must end with a user message.

Root Cause: LangGraph's create_react_agent with response_format appends a hidden AIMessage prefill. Bedrock's Converse API does not support assistant message prefill, causing every structured response attempt to fail.

Impact: Sub-agents fall back to error handling and produce orphaned ResponseFormat tool calls that cascade into the supervisor.

Solution Architecture

Fix 1: Enhanced Orphaned Tool Call Repair

Location: ai_platform_engineering/multi_agents/platform_engineer/protocol_bindings/a2a/agent.py

The existing _repair_orphaned_tool_calls was enhanced to detect tool call IDs across all Bedrock-specific message formats:

def _extract_tool_call_ids(msg: BaseMessage) -> set:
    """Extract tool call IDs from all possible locations in an AIMessage.

    Bedrock stores tool_use IDs in three places:
    1. msg.tool_calls[*]["id"]           - standard LangChain format
    2. msg.additional_kwargs["tool_use"] - Bedrock additional_kwargs
    3. msg.content[*] blocks with        - Bedrock content block format
       "type": "tool_use" and "id" key
    """

Pre-fallback repair: Before entering fallback streaming mode, the supervisor now attempts orphan repair:

⚠️ Supervisor: Found 1 orphaned tool calls. IDs: ['tooluse_y6Ma...']
🔧 Will remove AIMessage with orphaned tool_call
✅ Supervisor: Removed 1 AIMessage(s) with orphaned tool calls
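The repair step logged above can be sketched roughly as follows. This is a minimal illustration, not the code in agent.py: the `Msg` dataclass is a hypothetical stand-in for LangChain's message classes, only `tool_calls` is inspected, and dropping any already-answered ToolMessages that belong to a removed AIMessage is an assumption about how partial orphans should be handled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Msg:
    """Hypothetical stand-in for LangChain message classes (illustration only)."""
    type: str                                   # "human", "ai", or "tool"
    content: str = ""
    tool_calls: list = field(default_factory=list)   # used by "ai" messages
    tool_call_id: Optional[str] = None               # used by "tool" messages


def remove_orphaned_tool_calls(messages):
    """Return (repaired_messages, orphan_ids) without mutating the input list."""
    # IDs that already have a ToolMessage response somewhere in the history.
    answered = {m.tool_call_id for m in messages if m.type == "tool"}

    orphan_ids = []
    dropped_indexes = set()
    dropped_call_ids = set()
    for idx, m in enumerate(messages):
        if m.type != "ai" or not m.tool_calls:
            continue
        call_ids = [call["id"] for call in m.tool_calls]
        missing = [cid for cid in call_ids if cid not in answered]
        if missing:
            orphan_ids.extend(missing)
            dropped_indexes.add(idx)
            # Assumption: siblings of an orphan are dropped too, so no
            # ToolMessage is left without its AIMessage.
            dropped_call_ids.update(call_ids)

    repaired = [
        m for idx, m in enumerate(messages)
        if idx not in dropped_indexes
        and not (m.type == "tool" and m.tool_call_id in dropped_call_ids)
    ]
    return repaired, orphan_ids


history = [
    Msg("human", "List open PRs"),
    Msg("ai", tool_calls=[{"id": "tooluse_a", "name": "GitHub_Agent"}]),
    Msg("tool", "2 PRs found", tool_call_id="tooluse_a"),
    Msg("human", "Now check AWS"),
    Msg("ai", tool_calls=[{"id": "tooluse_b", "name": "AWS_Agent"}]),  # never answered
]
repaired, orphans = remove_orphaned_tool_calls(history)
```

In this sketch the earlier, complete GitHub_Agent exchange stays in the history and only the unanswered AWS_Agent call is removed, which mirrors the "Earlier conversation history preserved" behavior in the logs.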

Force-repair: For persistent Bedrock errors, the supervisor extracts the tool_use IDs directly from the error message with a regex and removes the matching AIMessages from state.
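As a rough sketch of the force-repair idea, the snippet below pulls `tooluse_...` IDs out of an error string and filters out the matching messages. The regex, the function names, and the dict-based message shape are all assumptions made for illustration; the pattern actually used in the supervisor may differ.

```python
import re

# Assumed ID shape: "tooluse_" followed by letters, digits, "_" or "-".
TOOL_USE_ID_PATTERN = re.compile(r"tooluse_[A-Za-z0-9_-]+")


def extract_ids_from_error(error_text: str) -> list:
    """Return the unique tool_use IDs mentioned in an error message, in order."""
    seen = []
    for tool_id in TOOL_USE_ID_PATTERN.findall(error_text):
        if tool_id not in seen:
            seen.append(tool_id)
    return seen


def drop_messages_with_ids(messages: list, ids: list) -> list:
    """Remove AI messages (plain dicts here) whose tool_calls reference any given ID."""
    targets = set(ids)
    return [
        m for m in messages
        if not (
            m.get("type") == "ai"
            and any(call.get("id") in targets for call in m.get("tool_calls", []))
        )
    ]


error = (
    "ValidationException: Expected toolResult blocks at messages.0.content\n"
    "for the following Ids: tooluse_y6Ma8ihoB4Lqbmm4bumT7p"
)
ids = extract_ids_from_error(error)
state = [
    {"type": "human", "content": "Check AWS"},
    {"type": "ai", "tool_calls": [{"id": "tooluse_y6Ma8ihoB4Lqbmm4bumT7p"}]},
]
cleaned = drop_messages_with_ids(state, ids)
```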

Fix 2: Bedrock response_format Bypass

Location: ai_platform_engineering/utils/a2a_common/base_langgraph_agent.py, ai_platform_engineering/multi_agents/platform_engineer/deep_agent.py

When LLM_PROVIDER=aws-bedrock, the response_format parameter is omitted from create_react_agent and the format instructions are embedded directly in the system prompt instead. This prevents the prefill ValidationException at its source.
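The provider switch described above might look something like the sketch below. The helper name `build_agent_kwargs` and its parameters are hypothetical; the returned keys are meant to correspond to the `prompt` and `response_format` keyword arguments of create_react_agent, which should be checked against the installed LangGraph version.

```python
import os


def build_agent_kwargs(system_prompt, response_format, format_instructions, provider=None):
    """Build keyword arguments for the agent constructor (hypothetical helper).

    For aws-bedrock, response_format is left out and the format instructions
    are appended to the system prompt instead.
    """
    provider = provider or os.getenv("LLM_PROVIDER", "")
    if provider == "aws-bedrock":
        # No response_format key at all, so no prefill is ever requested.
        return {"prompt": f"{system_prompt}\n\n{format_instructions}"}
    return {"prompt": system_prompt, "response_format": response_format}


bedrock_kwargs = build_agent_kwargs(
    system_prompt="You are a platform engineering assistant.",
    response_format=dict,  # placeholder for a real schema class
    format_instructions="Reply with a JSON object containing 'status' and 'message'.",
    provider="aws-bedrock",
)
```

The resulting dictionary would then be unpacked into the constructor call, so the Bedrock path never passes response_format in the first place.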

Fix 3: Safe Summarization Boundary

Location: ai_platform_engineering/utils/a2a_common/langmem_utils.py

_find_safe_summarization_boundary was enhanced to prevent splitting tool_use / toolResult pairs during context compression. If a ToolMessage in the "keep" zone references a tool_call in the "summarize" zone, the boundary shifts to include the corresponding AIMessage.
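A minimal sketch of the boundary shift, assuming messages are plain dicts and that `boundary` is the index where the "keep" zone starts (everything before it is summarized). The function name and message shape are illustrative only and do not reproduce the code in langmem_utils.py.

```python
def find_safe_boundary(messages: list, boundary: int) -> int:
    """Move the boundary earlier until no kept tool result loses its tool call."""
    # Map each tool call ID to the index of the AI message that issued it.
    issuer = {}
    for idx, m in enumerate(messages):
        if m.get("type") == "ai":
            for call in m.get("tool_calls", []):
                issuer[call["id"]] = idx

    changed = True
    while changed:
        changed = False
        for m in messages[boundary:]:
            if m.get("type") != "tool":
                continue
            source = issuer.get(m.get("tool_call_id"))
            if source is not None and source < boundary:
                # The matching AI message would be summarized away: keep it too.
                boundary = source
                changed = True
                break
    return boundary


conversation = [
    {"type": "human", "content": "List PRs"},                       # 0
    {"type": "ai", "tool_calls": [{"id": "tooluse_a"}]},            # 1
    {"type": "tool", "tool_call_id": "tooluse_a", "content": "…"},  # 2
    {"type": "human", "content": "List ArgoCD apps"},               # 3
    {"type": "ai", "tool_calls": [{"id": "tooluse_b"}]},            # 4
    {"type": "tool", "tool_call_id": "tooluse_b", "content": "…"},  # 5
]
safe = find_safe_boundary(conversation, 5)
```

Here a proposed boundary of 5 would keep the tool result for tooluse_b while summarizing the AI message that requested it, so the sketch moves the boundary back to index 4.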

Fix 4: OpenTelemetry Context Detach Noise Suppression

Location: cnoe-agent-utils/cnoe_agent_utils/tracing/decorators.py (separate repo)
PR: cnoe-agent-utils#31

A _quiet_span_exit() helper was added. It temporarily raises the opentelemetry.context logger level to CRITICAL during span exit, which keeps noisy ValueError: <Token var=<ContextVar...> was created in a different Context errors out of the logs.
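One way to implement this kind of temporary suppression is a context manager built on the standard logging module. The sketch below is an assumption about the general shape, not the helper from cnoe-agent-utils; only the logger name comes from the description above.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_span_exit(logger_name: str = "opentelemetry.context"):
    """Raise the logger's level to CRITICAL for the duration of the block."""
    logger = logging.getLogger(logger_name)
    previous_level = logger.level
    logger.setLevel(logging.CRITICAL)
    try:
        yield
    finally:
        # Always restore the original level, even if span exit raises.
        logger.setLevel(previous_level)


otel_logger = logging.getLogger("opentelemetry.context")
otel_logger.setLevel(logging.WARNING)
with quiet_span_exit():
    # ERROR-level "failed to detach" records are filtered out in here.
    suppressed = not otel_logger.isEnabledFor(logging.ERROR)
restored_level = otel_logger.level
```

Restoring the level in a `finally` block keeps the suppression scoped to span exit, so genuine errors logged elsewhere by the same logger remain visible.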

Reproduction and Verification

Multi-Turn Reproduction Test

The orphaned tool call issue is reproduced by sending 5+ turns to the supervisor using the same contextId, with queries that trigger sub-agent delegations:

CONTEXT_ID=$(python3 -c 'import uuid;print(uuid.uuid4())')

# Turn 1: GitHub sub-agent delegation
curl -sN -X POST http://localhost:8000 \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"jsonrpc":"2.0","id":"t1","method":"message/stream","params":{"message":{"role":"user","parts":[{"kind":"text","text":"List 2 recent open PRs for cnoe-io/ai-platform-engineering"}],"messageId":"m1","contextId":"'$CONTEXT_ID'"}}}'

# Turn 2: ArgoCD sub-agent (same context, builds history)
# Turn 3: Cross-reference (triggers summarization pressure)
# Turn 4: Context window check
# Turn 5: Another delegation to push context further
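For the later turns, the same JSON-RPC body can be generated programmatically. The sketch below mirrors the fields of the curl payload above; the helper name and the query strings are placeholders, not the exact prompts used in the verified runs.

```python
import json
import uuid


def build_turn_payload(turn: int, text: str, context_id: str) -> dict:
    """Build a message/stream request body matching the curl example's shape."""
    return {
        "jsonrpc": "2.0",
        "id": f"t{turn}",
        "method": "message/stream",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": text}],
                "messageId": f"m{turn}",
                "contextId": context_id,  # reused so every turn shares history
            }
        },
    }


context_id = str(uuid.uuid4())
queries = [  # placeholder wording for turns 1 and 2
    "List 2 recent open PRs for cnoe-io/ai-platform-engineering",
    "List ArgoCD apps in caipe-preview",
]
payloads = [build_turn_payload(i, q, context_id) for i, q in enumerate(queries, start=1)]
bodies = [json.dumps(p) for p in payloads]  # ready to pass to curl -d or an HTTP client
```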

An automated integration test is available at integration/test_orphan_repair_multiturn.py:

PYTHONPATH=. uv run python integration/test_orphan_repair_multiturn.py
PYTHONPATH=. uv run python integration/test_orphan_repair_multiturn.py --turns 3

Verified Results (Feb 24, 2026)

Run 1: 5-turn test

Turn	Query	Events	Status	Text
1	List 2 PRs (GitHub)	104	completed	2,926 chars
2	ArgoCD apps in caipe-preview	430	completed	6,541 chars
3	Summarize PRs + ArgoCD	1,178	completed	27,068 chars
4	Context window usage	120	completed	2,223 chars
5	Failing ArgoCD apps	118	completed	2,041 chars

Orphan repair activated during Run 1 (confirmed conversation continued after repair):

⚠️ Supervisor: Found 1 orphaned tool calls. IDs: ['tooluse_y6Ma8ihoB4Lqbmm4bumT7p'], Names: ['AWS_Agent']
🔧 Will remove AIMessage with orphaned tool_call: msg_id=lc_run--019c919e...
✅ Supervisor: Removed 1 AIMessage(s) with orphaned tool calls. Earlier conversation history preserved.

Run 2: 10-turn stress test (GitHub + ArgoCD + Jira)

Turn	Query	Events	Time	Status	Text
1	List 5 recent open PRs (GitHub)	126	20.4s	PASS	3,889 chars
2	ArgoCD apps in caipe-preview	188	21.3s	PASS	5,988 chars
3	5 most recent Jira tickets	24	5.3s	TIMEOUT	598 chars
4	Cross-reference PRs, ArgoCD, Jira	517	19.4s	PASS	7,725 chars
5	All open PRs across 2 repos	890	45.2s	PASS	42,259 chars
6	Combined status report	1,307	59.4s	PASS	25,814 chars
7	Failing/degraded ArgoCD apps (all namespaces)	406	76.9s	PASS	24,042 chars
8	Jira sprint tickets	1,758	122.7s	PASS	60,116 chars
9	Context window usage	323	20.3s	PASS	6,529 chars
10	Top 3 action items (cross-reference)	1,428	54.3s	PASS	20,887 chars

10-turn summary: 9 completed, 0 failed, 1 timeout (Jira cold-start), 0 recovery failures, 0 fallback triggers.

No orphan repair was needed in Run 2, confirming that the upstream prevention fixes (Bedrock response_format bypass, safe summarization boundary) are effective at eliminating the root causes.

Error counts across both runs: 0 Recovery retry failed, 0 fallback.

Unit Tests

50 unit tests in tests/test_supervisor_streaming_json_and_orphaned_tools.py:

Test Class	Tests	Coverage
`TestExtractToolCallIds`	5	Standard, additional_kwargs, content blocks, dedup
`TestExtractToolCallIdsEdgeCases`	7	camelCase, toolUseId variant, single dict, mixed, malformed
`TestRepairOrphanedToolCalls`	4	No orphans, orphan in tool_calls/kwargs/content
`TestRepairOrphanedToolCallsEdgeCases`	6	None state, empty messages, multiple orphans, partial
`TestSafeSummarizationBoundary`	4	Standard, kwargs, content block pairs, complete pairs
`TestSummarizationBoundaryEdgeCases`	6	Min keep, equal, no tools, multiple pending, cross-ref
`TestForceRepairRegex`	6	Bedrock format, LangGraph format, multiple IDs, hyphens
`TestPreflightContextCheckNullQuery`	4	None query, empty string, normal query
`TestPreflightContextCheckEdgeCases`	6	None state/values, no messages, threshold, exception
`TestJsonScopingFix`	2	No local json import, module-level callable

PYTHONPATH=. uv run pytest tests/test_supervisor_streaming_json_and_orphaned_tools.py -v
# 50 passed in 3.68s

Files Changed

File	Change
`ai_platform_engineering/multi_agents/platform_engineer/protocol_bindings/a2a/agent.py`	Enhanced orphan repair, pre-fallback repair, force-repair
`ai_platform_engineering/utils/a2a_common/langmem_utils.py`	`_extract_tool_call_ids`, safe summarization boundary, `query=None` support
`ai_platform_engineering/utils/a2a_common/base_langgraph_agent.py`	Bedrock response_format bypass, corporate CA bundle support for MCP HTTP transport
`ai_platform_engineering/multi_agents/platform_engineer/deep_agent.py`	Bedrock response_format bypass for supervisor graph
`tests/test_supervisor_streaming_json_and_orphaned_tools.py`	50 unit tests
`integration/test_orphan_repair_multiturn.py`	10-turn multi-turn integration test (GitHub, ArgoCD, Jira)

Decision Rationale

Why repair at the supervisor level?

The orphaned tool call problem is inherent to LangGraph's checkpoint system with Bedrock. When a stream is cancelled, the checkpoint records the AIMessage with tool_calls but the ToolMessage response is never written. Repairing at the supervisor level (before the next LLM call) is the only place where we can access the checkpoint state and fix it before Bedrock rejects it.

Why embed response_format in system prompt for Bedrock?

Bedrock's Converse API does not support assistant message prefill, and LangGraph's create_react_agent uses prefill internally when response_format is set. Rather than patching LangGraph, we bypass the issue by embedding the format instructions in the system prompt, which achieves the same structured output behavior without triggering the prefill.

Why extract tool_call IDs from three locations?

Bedrock's Converse API stores tool_use information inconsistently across LangChain message formats. During normal operation, IDs appear in tool_calls. After checkpoint recovery, they may only exist in additional_kwargs or content blocks. Checking all three locations ensures no orphaned tool call is missed regardless of how the message was serialized.
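The three-location check could be sketched as below. `FakeAIMessage` is a hypothetical stand-in for LangChain's AIMessage, and the handling of a single-dict `tool_use` value and a `toolUseId` key is an assumption based on the edge cases listed in the unit test table, not a copy of `_extract_tool_call_ids`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for a LangChain AIMessage (illustration only)."""
    content: Any = ""
    tool_calls: list = field(default_factory=list)
    additional_kwargs: dict = field(default_factory=dict)


def extract_tool_call_ids(msg: Any) -> set:
    """Collect tool call IDs from tool_calls, additional_kwargs, and content blocks."""
    ids = set()

    # 1. Standard LangChain format: msg.tool_calls[*]["id"]
    for call in getattr(msg, "tool_calls", None) or []:
        if isinstance(call, dict) and call.get("id"):
            ids.add(call["id"])

    # 2. additional_kwargs["tool_use"]: assumed to be one dict or a list of dicts
    kwargs = getattr(msg, "additional_kwargs", None) or {}
    tool_use = kwargs.get("tool_use") or []
    if isinstance(tool_use, dict):
        tool_use = [tool_use]
    elif not isinstance(tool_use, list):
        tool_use = []  # malformed value: ignore rather than raise
    for entry in tool_use:
        if isinstance(entry, dict):
            tool_id = entry.get("id") or entry.get("toolUseId")  # assumed key variants
            if tool_id:
                ids.add(tool_id)

    # 3. Content blocks with "type": "tool_use" and an "id" key
    content = getattr(msg, "content", None)
    if isinstance(content, list):
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use" and block.get("id"):
                ids.add(block["id"])

    return ids


message = FakeAIMessage(
    content=[{"type": "text", "text": "Calling agent"}, {"type": "tool_use", "id": "tooluse_c"}],
    tool_calls=[{"id": "tooluse_a", "name": "AWS_Agent", "args": {}}],
    additional_kwargs={"tool_use": {"toolUseId": "tooluse_b"}},
)
found = extract_tool_call_ids(message)
```

Returning a set means an ID that appears in more than one location is reported once.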
