Streaming Architecture Knowledge Base: v0.2.41 vs v0.3.0
Date: 2026-04-13 to 2026-04-14 Authors: Sri Aradhyula, Claude (Opus 4.6) Purpose: Preserve institutional knowledge from the streaming regression investigation and remediation between CAIPE v0.2.41 ("golden") and v0.3.0.
See also: Streaming Comparison for the side-by-side event timeline and raw artifact dumps.
Table of Contentsβ
- Executive Summary
- Architecture: Two Response Modes
- Root Cause Analysis: Why 0.3.0 Felt Worse
- Measured Streaming Timelines
- What We Fixed (0.3.0 Streaming Restoration)
- Streaming Pipeline: Component-by-Component
- Metadata Flags: is_narration vs is_final_answer
- StreamBuffer: Flush Strategy Deep Dive
- Middleware Toggle System
- Client Handling: Slack vs UI
- Comparison Data and Tooling
- Lessons Learned
- Decision Log
- File Reference
1. Executive Summaryβ
CAIPE v0.2.41 used marker mode ([FINAL ANSWER]) for streaming β the LLM wrote a marker token, and everything after it was streamed token-by-token to clients. v0.3.0 switched to ResponseFormat structured mode (a separate LLM call producing JSON with PlatformEngineerResponse schema). This caused a UX regression: users saw 15-18 seconds of dead silence followed by a burst of text, instead of the gradual word-by-word build-up they experienced in 0.2.41.
Resolution: We kept both response modes but defaulted back to marker mode (USE_STRUCTURED_RESPONSE=false). We also fixed several streaming bugs in the marker path that had accumulated, added is_narration metadata propagation for proper client handling, fixed a StreamBuffer regression, and added middleware toggles to reduce latency.
Key metrics (simple query "what can you do?"):
| Metric | v0.2.41 | v0.3.0 (structured) | v0.3.0 (fixed marker) |
|---|---|---|---|
| Time to first visible content | 2.0s | 17.8s | ~2.0s (restored) |
| Streaming duration (gradual) | 10.5s | 0.4s (burst) | ~10s (restored) |
| Total response time | 12.5s | 18.2s | ~12s (restored) |
2. Architecture: Two Response Modesβ
Marker Mode (USE_STRUCTURED_RESPONSE=false) β Defaultβ
User query
β LangGraph supervisor (single LLM call)
β LLM writes thinking text: "Let me search..."
β LLM calls tools (RAG search, fetch_document, etc.)
β LLM writes: "[FINAL ANSWER]\n# Here's the answer..."
β marker gate opens
β Everything after marker streams token-by-token to client
β Pre-marker text is tagged is_narration (typing status)
Key code: agent.py lines 1290-1351 (marker gate logic)
The _pre_marker_buffer accumulates all LLM output. A tail holdback of _MARKER_MAX_LEN (16 chars) prevents partial marker text like [FINAL from leaking. When [FINAL ANSWER] (or [FINAL_ANSWER]) is found:
- Everything before the marker is discarded (already yielded as
is_narration) - Everything after is yielded with
is_final_answer: True _strip_post_marker_newlinesstrips leading\n\rfrom the first post-marker chunk
Structured Response Mode (USE_STRUCTURED_RESPONSE=true)β
User query
β LangGraph supervisor (first LLM call β planning/tools)
β LLM calls tools, writes thinking text (not streamed)
β generate_structured_response node (second LLM call)
β LLM writes JSON: {"is_task_complete": true, "content": "..."}
β Incremental JSON parser extracts content field
β Word-boundary buffer (_rf_word_buffer) aligns on _BOUNDARY_CHARS
β Parsed tokens yield with is_final_answer: True
Key code: agent.py lines 1161-1186 (ResponseFormat capture), word-boundary logic for incremental JSON parsing
Why structured mode exists: It provides machine-readable fields (is_task_complete, require_user_input, input_fields) needed for HITL (Human-in-the-Loop) forms in the web UI. Marker mode cannot provide these structured fields.
3. Root Cause Analysis: Why 0.3.0 Felt Worseβ
Primary cause: Two-phase LLM architectureβ
The generate_structured_response node is a separate LLM call after the supervisor finishes. It must write the full JSON schema preamble ({"is_task_complete": true, "require_user_input": false, "content": ") before any content character appears. This adds ~15s of dead time.
Secondary causesβ
-
No narration in structured mode. The first LLM call (supervisor) goes directly to tool calls without writing visible text. In marker mode, the same LLM writes thinking text that becomes
is_narrationevents. -
Network buffering collapses streaming. Even with incremental JSON parsing, tokens arrive in HTTP-chunk-aligned bursts, not one-by-one like direct LLM streaming. The effective streaming window shrinks from ~10s to ~0.4s.
-
RAG adds 60-100s of tool time before the final answer. During this time, only typing status indicators are visible.
-
StreamBuffer regression. 0.3.0 changed the flush trigger from
\n(line) to\n\n(paragraph), which meant tokens accumulated longer before flushing, making the burst effect worse. -
_last_flushinitialization bug.StreamBuffer.__init__set_last_flush = time.monotonic(), so the first interval flush calculation included LLM think time (~15s). By the time tokens arrived, the interval had long expired and everything flushed at once.
4. Measured Streaming Timelinesβ
Simple Query: "what can you do?"β
v0.2.41 (marker mode, 12.5s total):
0.0s HTTP POST message/stream β connection opened
2.0s streaming_result: "I'm an" (6 chars) β StreamBuffer opens, first content visible
2.0-12.3s ~216 token chunks, gradual streaming β user sees text building word-by-word
12.5s final_result (2098 chars) β stream stops, complete
v0.3.0 (structured mode, 18.2s total):
0.0s HTTP POST β connection opened
0.0-17.8s SILENCE (LLM building JSON preamble) β typing indicator only
17.8s "Composing answer..." notification β typing status update
17.8s streaming_result: "\n# " (3 chars) β stream opens
17.8-18.2s ~290 chunks in 0.4s BURST β entire answer appears at once
18.2s final_result β done
RAG Query: "what is caipe?" (0.3.0 only, ~115s)β
0.0s HTTP POST
17.9s tool_notification_start: "Calling [RAG] Search..."
17.9s streaming_result: "Workflow: Calling [RAG]..." (is_narration)
19.1-27.6s Tool calls: search, list_datasources, fetch_document
63.3-65.0s More fetch_document calls
65.0-104.9s Narration: "Synthesize findings..." (is_narration β typing status)
114.6s "Composing answer..." notification
114.6s streaming_result: "# What is CAIPE?" (is_final_answer) β stream opens
114.6-114.9s ~300 chunks in 0.3s burst
114.9s final_result β done
A2A Artifact Comparison: Same Query ("show caipe setup options")β
| Metric | v0.3.0 | v0.2.41 |
|---|---|---|
| Total artifacts | 18 | 4 |
| Unique artifact types | 5 | 4 |
| Tool notifications (start/end) | 7/7 | 1/1 |
| streaming_result count | 1 | 1 |
| Total streaming parts | 288 | 268 |
| Total streaming text | 3225 chars | 11583 chars |
has is_final_answer | True | False |
has is_narration | False | False |
has execution_plan | True | False |
| Final artifact name | final_result | partial_result |
Key finding: 0.2.41 leaked LLM narration ("Excellent! I now have comprehensive information...") directly into the streaming_result. This is visible to users. In 0.3.0, pre-answer narration is properly suppressed (or tagged is_narration).
5. What We Fixed (0.3.0 Streaming Restoration)β
Fix 1: Marker gate buffer flush at tool-call boundariesβ
File: agent.py lines 1126-1141
Problem: When the LLM wrote narration text ("I'll search the knowledge base...") followed by a tool call, the narration got trapped in _pre_marker_buffer because the tool-call continue statement skipped the marker gate logic. The buffer wouldn't flush until the next content token, which might not arrive for 30-60s during RAG tool execution.
Fix: Flush _pre_marker_buffer at every tool-call boundary, before the per-tool continue guards. The flushed content is tagged is_narration: True.
# agent.py line 1132
if not USE_STRUCTURED_RESPONSE and _pre_marker_buffer and not _final_answer_seen:
yield {
"content": _pre_marker_buffer,
"is_narration": True,
}
_pre_marker_buffer = ""
Fix 2: Narration extraction from tool-call AIMessageChunksβ
File: agent.py lines 1202-1230
Problem: The LLM often writes narration text in the same message as a tool call (e.g., "I'll search the knowledge base" + search() tool_call). The old code had a USE_STRUCTURED_RESPONSE guard that prevented narration extraction in marker mode. This meant narration was lost when using marker mode.
Fix: Removed the USE_STRUCTURED_RESPONSE guard. Narration from tool-call chunks now always yields with is_narration: True, regardless of response mode.
# agent.py line 1212 β no mode guard
if hasattr(message, "content") and message.content:
_narration = message.content
# ... extract text, filter out tool_use blocks ...
if isinstance(_narration, str) and _narration.strip():
yield {"content": _narration, "is_narration": True}
Fix 3: is_narration metadata propagation through A2Aβ
File: agent_executor.py lines 840-842
Problem: agent.py yields is_narration: True on events, but agent_executor.py (which wraps events into A2A artifacts) wasn't copying this flag to artifact metadata. Clients couldn't distinguish narration from final answer content.
Fix: 3-line addition that propagates is_narration to artifact metadata.
# agent_executor.py line 840
if event.get('is_narration'):
artifact.metadata = artifact.metadata or {}
artifact.metadata['is_narration'] = True
Fix 4: Slack bot β narration as typing status, not stream contentβ
File: ai.py lines 370-376
Problem: Without is_narration handling, narration text would be streamed to the user as if it were the answer, or open the stream prematurely.
Fix: When is_narration is detected, show a static typing status message instead of streaming the text.
# ai.py line 374
if artifact_meta.get("is_narration"):
_set_typing_status("is responding...")
continue # skip streaming this to the user
Fix 5: StreamBuffer reverted to 0.2.41 line-flush logicβ
File: ai.py lines 33-102
Problem: 0.3.0 changed the flush trigger from \n to \n\n (paragraph boundaries). This was a regression β tokens accumulated much longer before flushing, making streaming feel like a dump rather than gradual build-up.
Fix: Reverted to \n (newline boundary) flush. Verified by diffing against git show 0.2.41:ai_platform_engineering/integrations/slack_bot/utils/ai.py β the restored code is identical to production 0.2.41.
Also fixed: _last_flush initialization. Changed from time.monotonic() (set at init, stale from LLM think time) to None (set on first append() call). This prevents the interval safety net from triggering a premature mega-flush.
# ai.py line 49
self._last_flush = None # set on first append, not init
# ai.py line 61-62 (in append())
if self._last_flush is None:
self._last_flush = now # start interval from first token
Fix 6: Thread-safe debug counterβ
File: ai.py (module level)
Problem: _token_seq used global _token_seq; _token_seq += 1 which is not thread-safe under concurrent Slack event threads. Python's GIL doesn't protect += 1 (it's LOAD + ADD + STORE, which can interleave).
Fix: Replaced with itertools.count(1) and next(_token_seq), which is atomic at the Python level.
# Old (thread-unsafe):
_token_seq = 0
# ... inside function:
global _token_seq
_token_seq += 1
# New (thread-safe):
_token_seq = __import__("itertools").count(1)
# ... inside function:
seq = next(_token_seq)
Fix 7: Middleware toggles for latency reductionβ
File: deep_agent.py lines 113-121
Problem: Optional middleware (DeterministicTaskMiddleware, SelfServiceMiddleware, PolicyMiddleware, SkillsMiddleware, FileArgMiddleware) added latency for every request. No way to disable them without code changes.
Fix: Added ENABLE_MIDDLEWARE master switch + 5 individual toggles. All default to true for backwards compatibility.
ENABLE_MIDDLEWARE = os.getenv("ENABLE_MIDDLEWARE", "true").lower() == "true"
ENABLE_DETERMINISTIC_MIDDLEWARE = ENABLE_MIDDLEWARE and os.getenv("ENABLE_DETERMINISTIC_MIDDLEWARE", "true").lower() == "true"
# ... etc for SELF_SERVICE, POLICY, SKILLS, FILE_ARG
6. Streaming Pipeline: Component-by-Componentβ
Full pipeline (marker mode, A2A β Slack):β
LangGraph supervisor LLM
β
βΌ (AIMessageChunk with content or tool_calls)
agent.py stream() generator
β
ββ Tool call detected?
β ββ Flush _pre_marker_buffer (is_narration: True)
β ββ Extract narration from message.content (is_narration: True)
β ββ Yield tool_notification_start event
β
ββ Content token (no tool call)?
β ββ _pre_marker_buffer += content
β ββ Check buffer for [FINAL ANSWER] or [FINAL_ANSWER]
β β ββ Found β _final_answer_seen = True, yield post-marker with is_final_answer: True
β β ββ Not found β yield safe prefix (buffer - _MARKER_MAX_LEN) with is_narration: True
β ββ Post-marker tokens β yield with is_final_answer: True
β
ββ ToolMessage (tool result)?
ββ Yield tool_notification_end event
β
βΌ (yield dicts with content, is_narration, is_final_answer, tool_call)
agent_executor.py _handle_streaming_chunk()
β
ββ Wraps yield dict into A2A Artifact
ββ Copies is_final_answer β artifact.metadata.is_final_answer
ββ Copies is_narration β artifact.metadata.is_narration
ββ Copies plan_step_id β artifact.metadata.plan_step_id
ββ Sends artifact via SSE event_queue
β
βΌ (A2A SSE events: data: {result: {artifact: {...}}})
Slack bot stream_a2a_response()
β
ββ STREAMING_RESULT:
β ββ is_narration? β _set_typing_status("is responding..."), skip
β ββ is_final_answer? β latch streaming_final_answer = True
β ββ No stream_ts yet? β _set_typing_status("is responding..."), skip
β ββ Stream open? β StreamBuffer.append(text)
β ββ \n found? β flush up to last \n
β ββ 1s interval? β flush all
β ββ else β keep buffering
β
ββ FINAL_RESULT:
β ββ streaming_final_answer? β SKIP (already streamed)
β ββ Not streamed? β StreamBuffer.append + flush
β
ββ TOOL_NOTIFICATION_START:
β ββ _start_stream_if_needed(), show tool indicator
β
ββ EXECUTION_PLAN_STATUS_UPDATE:
ββ Update plan_steps dict, track current_step_id
Full pipeline (marker mode, A2A β Web UI):β
A2A SSE events
β
βΌ
ChatPanel.tsx (lines 605-609)
β
ββ is_final_answer? β appendMessage(text) β main chat bubble
ββ else β pushThinking(text) β collapsible "thinking" section
β
βΌ
timeline-manager.ts pushThinking()
β AgentTimeline.tsx renders with variant="thinking"
β Collapsible machinery section (expand to see narration)
7. Metadata Flags: is_narration vs is_final_answerβ
Flag semanticsβ
| Flag | Set when | Meaning | Slack behavior | UI behavior |
|---|---|---|---|---|
is_narration: True | Pre-marker thinking text, buffer flush at tool boundary | LLM is thinking/planning, not answering | _set_typing_status("is responding...") | pushThinking() β collapsible section |
is_final_answer: True | Post-[FINAL ANSWER] marker content | This is the real answer | Latch streaming_final_answer, open stream, StreamBuffer.append() | appendMessage() β chat bubble |
| Neither flag | Tool notifications, plan updates | System events | Handled by specific event type handlers | Handled by specific event type handlers |
Mutual exclusivityβ
Pre-marker content is always is_narration: True, never is_final_answer: True.
Post-marker content is always is_final_answer: True, never is_narration: True.
These paths are mutually exclusive in agent.py β the _final_answer_seen flag gates the transition.
Edge case: both flagsβ
If a coding error or future change somehow sets both flags, clients should treat is_final_answer as dominant (the content is the real answer, even if tagged as narration). The current code makes this impossible, but the principle is documented for safety.
8. StreamBuffer: Flush Strategy Deep Diveβ
Flush triggers (in priority order)β
-
Newline boundary (
\nin buffer): Flush up to and including the last\n, keep the remainder. This prevents split markdown (e.g.,**boldin one flush andtext**in the next). -
Interval safety net (1.0s since last flush): If no newline has appeared for 1 second, flush everything. Prevents content from stalling.
Why \n and not \n\nβ
The 0.3.0 codebase briefly used \n\n (paragraph boundary) flushes. This was a regression:
- Markdown tables use
\nbetween rows. With\n\nflush, an entire table accumulates before any row is visible. - Bullet lists use single
\nbetween items. With\n\n, the entire list buffers. - The 0.2.41 production code used
\nand ran without issues for months. We verified by diffing:git show 0.2.41:ai_platform_engineering/integrations/slack_bot/utils/ai.py.
Slack rate limit analysisβ
Concern: more frequent flushing from \n vs \n\n could hit Slack rate limits on chat_appendStream.
Findings:
- 0.2.41 used
\nflush in production for months without rate limit issues chat_appendStreamis a low-tier endpoint (not the rate-limitedchat.postMessage)- Typical answer has 20-40
\ncharacters β 20-40 flushes over 10-12s β ~3 flushes/second - Slack's streaming API is designed for this exact use case
_last_flush = None fixβ
The bug: StreamBuffer.__init__ set _last_flush = time.monotonic(). If the LLM took 15 seconds to think before producing tokens, the first append() call would see elapsed = 15.0s which is >= flush_interval (1.0s), triggering an immediate interval flush of whatever partial content was in the buffer. This could flush a single token like "#" before enough context accumulated for markdown to render correctly.
The fix: Set _last_flush = None in __init__. On the first append(), if _last_flush is None, set it to now. This starts the interval clock from the first real token, not from object creation.
9. Middleware Toggle Systemβ
Architectureβ
# deep_agent.py lines 116-121
ENABLE_MIDDLEWARE = os.getenv("ENABLE_MIDDLEWARE", "true").lower() == "true" # master switch
ENABLE_DETERMINISTIC_MIDDLEWARE = ENABLE_MIDDLEWARE and os.getenv("...", "true").lower() == "true"
ENABLE_SELF_SERVICE_MIDDLEWARE = ENABLE_MIDDLEWARE and os.getenv("...", "true").lower() == "true"
ENABLE_POLICY_MIDDLEWARE = ENABLE_MIDDLEWARE and os.getenv("...", "true").lower() == "true"
ENABLE_SKILLS_MIDDLEWARE = ENABLE_MIDDLEWARE and os.getenv("...", "true").lower() == "true"
ENABLE_FILE_ARG_MIDDLEWARE = ENABLE_MIDDLEWARE and os.getenv("...", "true").lower() == "true"
Behaviorβ
| Env var | Effect |
|---|---|
ENABLE_MIDDLEWARE=false | Disables ALL 5 optional middleware. Only ModelRetryMiddleware remains. |
ENABLE_MIDDLEWARE=true (default) + individual=false | Disables only that specific middleware |
ENABLE_MIDDLEWARE=false + individual=true | Individual stays disabled (master switch wins) |
Why this existsβ
Middleware (especially SelfServiceMiddleware and DeterministicTaskMiddleware) adds latency by injecting extra tool calls (write_todos, invoke_self_service_task). For simple queries, this overhead is wasted. The toggle lets deployments disable middleware globally for faster responses, or selectively disable specific middleware.
10. Client Handling: Slack vs UIβ
Slack bot (ai.py)β
is_narration=True β _set_typing_status("is responding...") β NOT streamed
is_final_answer=True β latch streaming_final_answer β StreamBuffer.append()
Neither β depends on context (plan step, tool notification, etc.)
Static typing message ("is responding...") rather than dynamic narration text. Reason: dynamic text ("I'll search the knowledge base...") changes too fast and creates flickering in the typing indicator.
Web UI (ChatPanel.tsx)β
is_final_answer=True β appendMessage(text) β main chat bubble (visible)
Everything else β pushThinking(text) β collapsible "thinking" section
The UI doesn't check is_narration explicitly β it checks is_final_answer and routes everything else to thinking. This means narration naturally goes to the collapsible section without special handling.
11. Comparison Data and Toolingβ
Scripts created for this investigationβ
| Script | Purpose |
|---|---|
scripts/capture_a2a_events.py | Captures all A2A SSE events from a supervisor into JSON. Supports both SSE and JSON-RPC response modes. |
scripts/trace_a2a_streaming.py | Traces streaming timeline with timing, chunk sizes, and summary stats. Uses raw http.client for minimal dependencies. |
docker-compose.compare-041.yaml | Runs 0.2.41 Slack bot pointing to a 0.2.41 supervisor for side-by-side comparison in real Slack channels. |
Comparison reportsβ
| File | Content |
|---|---|
docs/streaming-comparison-030-vs-041.md | Side-by-side artifact dump: 18 artifacts (0.3.0) vs 4 artifacts (0.2.41) for "show caipe setup options" |
docs/docs/STREAMING_COMPARISON_041_vs_030.md | Detailed event-flow timelines, token streaming analysis, Slack bot processing pipeline, root cause analysis |
How to reproduce the comparisonβ
# Terminal 1: Run 0.2.41 supervisor on port 8041
docker compose -f docker-compose.compare-041.yaml up -d
# Terminal 2: Capture 0.2.41 events
python scripts/capture_a2a_events.py http://localhost:8041/ "what can you do?" /tmp/events-041.json
# Terminal 3: Capture 0.3.0 events
python scripts/capture_a2a_events.py http://localhost:8000/ "what can you do?" /tmp/events-030.json
# Terminal 4: Trace timeline
python scripts/trace_a2a_streaming.py 8041 "what can you do?"
python scripts/trace_a2a_streaming.py 8000 "what can you do?"
12. Lessons Learnedβ
1. Streaming UX is not about throughput, it's about time-to-first-tokenβ
0.3.0's structured mode delivered the same total content as 0.2.41 marker mode, but 15.8 seconds later with no gradual progression. Users perceive waiting 18 seconds then seeing everything at once as much worse than waiting 2 seconds then watching text build up over 10 seconds β even though the total wait is similar.
2. Two-phase LLM architectures kill streamingβ
Any architecture where a second LLM call generates the final response (e.g., generate_structured_response) introduces a dead gap between the first call ending and the second call's content appearing. The JSON preamble in structured responses makes this worse.
3. Separate response format concerns from streaming concernsβ
The decision to use PlatformEngineerResponse schema (for is_task_complete, require_user_input, HITL forms) is orthogonal to how tokens are streamed. A hybrid approach β marker-mode streaming with post-hoc structured wrapping β gives both the streaming UX and the structured metadata.
4. Test against production, not against mocksβ
The StreamBuffer \n\n regression was only caught by comparing against the actual 0.2.41 code (git show 0.2.41:...). Mock-based tests wouldn't have caught this because the mock itself encoded the wrong assumption.
5. Metadata flags need end-to-end testingβ
is_narration was set correctly in agent.py, but wasn't propagated through agent_executor.py to A2A artifacts. Unit testing each component in isolation wouldn't catch this β you need integration tests that verify the full pipeline from agent yield to client consumption.
6. The _last_flush initialization bug is a class of problemβ
Any time you initialize a timer/counter at object creation but the object isn't used until much later, you risk stale calculations. The pattern _last_X = None; if _last_X is None: _last_X = now is the standard fix.
7. Thread safety in Python requires explicit attentionβ
Even with the GIL, += 1 is not atomic (it's LOAD_FAST + BINARY_ADD + STORE_FAST, which can interleave between threads at bytecode boundaries). Use itertools.count() for lock-free incrementing or threading.Lock for general mutual exclusion.
8. Middleware latency adds upβ
Five middleware layers, each potentially adding tool calls (write_todos, invoke_self_service_task, etc.), can add seconds to every request. Toggles let deployments trade features for speed.
13. Decision Logβ
| Date | Decision | Reasoning |
|---|---|---|
| 2026-04-13 | Default to marker mode (USE_STRUCTURED_RESPONSE=false) | Structured mode's streaming UX is fundamentally limited by two-phase architecture |
| 2026-04-13 | Keep structured mode code in place | Web UI HITL features need PlatformEngineerResponse fields; can be re-enabled per-deployment |
| 2026-04-13 | Use static "is responding..." for narration typing status | Dynamic narration text causes flickering; static message is calmer UX |
| 2026-04-13 | Revert StreamBuffer to \n flush (matching 0.2.41) | \n\n was a regression; 0.2.41 production proved \n is safe for Slack rate limits |
| 2026-04-14 | Add middleware toggles as env vars | Lets deployments reduce latency without code changes; defaults preserve backwards compatibility |
| 2026-04-14 | Tag pre-marker content as is_narration not is_final_answer | Narration is "thinking" text, not the answer; clients need to distinguish for proper UX |
14. File Referenceβ
Modified files (in this PR)β
| File | What changed |
|---|---|
protocol_bindings/a2a/agent.py | Marker gate buffer flush, narration extraction, stale comment fix |
protocol_bindings/a2a/agent_executor.py | is_narration metadata propagation (3 lines) |
integrations/slack_bot/utils/ai.py | Narration handler, StreamBuffer revert, thread-safe counter, debug logging |
multi_agents/platform_engineer/deep_agent.py | Middleware toggle env vars |
Test filesβ
| File | Tests | Coverage |
|---|---|---|
tests/test_marker_gate_buffer_flush.py | 45 | Marker gate logic, buffer flush, tail holdback, ResponseFormat word buffer, JSON extraction, narration extraction |
tests/test_narration_metadata_propagation.py | 12 | is_narration/is_final_answer metadata in A2A artifacts |
tests/test_slack_narration_typing_status.py | 37 | Narrationβtyping, StreamBuffer, final answer latch, safety filters |
tests/test_middleware_toggles.py | 23 | Master switch, individual toggles, case sensitivity, RAG toggle |
tests/test_streaming_e2e.py | 12 | Full A2A SSE streaming (integration, requires running supervisor) |
Comparison artifactsβ
| File | Purpose |
|---|---|
docs/docs/specs/008-platform-engineer-streaming-architecture/streaming-comparison.md | Side-by-side artifact timeline, token streaming analysis, and raw comparison data |
docs/docs/specs/008-platform-engineer-streaming-architecture/streaming-architecture-knowledge-base.md | This document |
scripts/capture_a2a_events.py | A2A event capture tool |
scripts/trace_a2a_streaming.py | Streaming timeline tracer |
docker-compose.compare-041.yaml | Side-by-side 0.2.41 comparison stack |