
Slack Streaming Conformance Benchmark

Spec: 099-slack-streaming-conformance
Benchmark Definition: tests/STREAMING_CONFORMANCE.md
Runner: tests/simulate_slack_stream.py --suite --report

Overview

The Slack Streaming Conformance suite validates that every query type — simple chat, off-topic, and RAG-heavy — delivers the full answer via live word-by-word streaming without artifact ID resets, swallowed responses, or premature graph termination.

The suite runs 4 scenarios with 22 checks against a live supervisor, exercising the complete SSE → event parser → StreamBuffer → Slack API pipeline.
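To make the pipeline stages concrete, here is a minimal sketch of the buffer-and-append flow the suite exercises. All class and field names below (`StreamBuffer.opened`, `deliver`, the event dict shape) are invented for illustration and do not mirror the real implementation in `ai.py`:

```python
class StreamBuffer:
    """Illustrative buffer: accumulates streamed text, tracks open state."""

    def __init__(self):
        self.opened = False
        self.appended = []

    def start(self):
        # maps conceptually to startStream in the real pipeline
        self.opened = True

    def append(self, text):
        if not self.opened:
            # INV-4: the stream opens even without prior tool notifications
            self.start()
        self.appended.append(text)


def deliver(events, buf):
    """Feed parsed streaming events into the buffer word by word."""
    for ev in events:
        if ev["type"] == "chunk":
            buf.append(ev["text"])
    return "".join(buf.appended)
```

The real event parser and Slack API calls are more involved; this only shows the shape of the data flow the checks below assert against.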


Scenarios

| # | Scenario | Query | What It Validates |
|---|----------|-------|-------------------|
| 1 | simple-chat | "tell me a joke" | No-tool query streams via appendStream; stream opens without tool notifications |
| 2 | off-topic | "how is the weather in San Francisco?" | Out-of-scope response is still delivered, not swallowed |
| 3 | rag-simple | "what is agntcy" | RAG tools fire; answer streams word-by-word in multiple chunks; no duplicates |
| 4 | rag-complex | "explain how agntcy agents communicate with each other" | Multi-search RAG with plan steps; per-tool cap isolation; substantial content |

Conformance Checks (22 total)

simple-chat (6 checks)

| Check | Description |
|-------|-------------|
| content_delivered | Response must have content (>20 chars) |
| stream_opened | Stream must be opened (startStream called) |
| live_streamed | Answer delivered via appendStream (not just stopStream) |
| no_duplicate | No duplicate content in both appendStream and stopStream |
| final_answer_latched | streaming_final_answer must be True |
| no_tools | No tools should fire for a casual chat query |

off-topic (4 checks)

| Check | Description |
|-------|-------------|
| content_delivered | Response must have content (>20 chars) |
| stream_opened | Stream must be opened |
| live_streamed | Answer delivered via appendStream |
| final_answer_latched | streaming_final_answer must be True |

rag-simple (7 checks)

| Check | Description |
|-------|-------------|
| content_delivered | Response must have substantial content (>200 chars) |
| stream_opened | Stream must be opened |
| live_streamed | Answer delivered via appendStream (word-by-word) |
| tools_used | RAG tools (search/fetch_document) must be called |
| no_duplicate | No duplicate content in both streams |
| final_answer_latched | streaming_final_answer must be True |
| multi_chunk | Answer should arrive in multiple streaming chunks (>1) |

rag-complex (5 checks)

| Check | Description |
|-------|-------------|
| content_delivered | Response must have substantial content (>300 chars) |
| stream_opened | Stream must be opened |
| live_streamed | Answer delivered via appendStream |
| tools_used | RAG tools must be called |
| final_answer_latched | streaming_final_answer must be True |
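The check tables above translate naturally into (name, predicate) pairs evaluated against a per-query result. The sketch below is hedged: the result-dict field names (`content`, `append_texts`, `stop_chunks`, and so on) are assumptions for illustration, not the actual schema used inside `tests/simulate_slack_stream.py`:

```python
# Hypothetical per-scenario checks as (name, predicate) pairs.
# Field names in the result dict are invented for this sketch.
SIMPLE_CHAT_CHECKS = [
    ("content_delivered", lambda r: len(r["content"]) > 20),
    ("stream_opened", lambda r: r["stream_opened"]),
    ("live_streamed", lambda r: len(r["append_texts"]) > 0),
    ("no_duplicate", lambda r: not (set(r["append_texts"]) & set(r["stop_chunks"]))),
    ("final_answer_latched", lambda r: r["streaming_final_answer"] is True),
    ("no_tools", lambda r: not r["tools"]),
]


def run_checks(checks, result):
    """Evaluate each conformance check; return {check_name: passed}."""
    return {name: bool(pred(result)) for name, pred in checks}
```

A lambda-per-check layout like this keeps each row of the tables above directly traceable to one predicate in the suite.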

Streaming Invariants

These are the fundamental rules that must never be violated:

| # | Invariant | Description |
|---|-----------|-------------|
| INV-1 | Every query gets a response | No query may result in an empty Slack message |
| INV-2 | No duplicate content | The same text must not appear in both appendStream and stopStream.chunks |
| INV-3 | streaming_final_answer gates FINAL_RESULT | When True, the FINAL_RESULT event is skipped; when False, its content is delivered |
| INV-4 | Stream opens for is_final_answer chunks | The stream must open even without prior tool notifications |
| INV-5 | Per-tool RAG cap isolation | Capping one RAG tool must not prevent other uncapped RAG tools from executing |
| INV-6 | Live streaming preferred | The final answer goes out via appendStream (word-by-word), not deferred to stopStream.chunks |
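INV-2 is the invariant most amenable to a mechanical check. A minimal sketch, assuming the runner records appended texts and stop chunks as plain string lists (invented parameter names, not the real API):

```python
def violates_inv2(append_texts, stop_chunks):
    """INV-2 sketch: flag duplicate delivery when any stopStream chunk
    already appeared in the live-streamed (appendStream) text."""
    streamed = "".join(append_texts)
    return any(chunk and chunk in streamed for chunk in stop_chunks)
```

A violation means the user would see the same sentence twice in Slack: once streamed live and once again when the stream is finalized.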

Per-Query Metrics (template)

The --report flag generates a detailed markdown report with per-query streaming metrics.

| Metric | Description |
|--------|-------------|
| Total Chars | Total characters delivered to the user (streamed + stopped) |
| Streamed (append) | Characters delivered via appendStream calls (live streaming) |
| Stopped | Characters delivered via stopStream.chunks (deferred) |
| Append Calls | Number of appendStream text calls made |
| Final Chunks | Number of streaming chunks tagged as final answer |
| Tools | Tools called during the query (with call counts) |
| Delivery | live stream, stopStream only, split, or empty |
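The Delivery column can be derived from the two character counters above. This is a sketch of one plausible classification; the labels come from the table, but the exact logic in the runner is an assumption:

```python
def classify_delivery(streamed_chars, stopped_chars):
    """Map per-query character counts to a Delivery label.

    streamed_chars: characters sent via appendStream (live)
    stopped_chars:  characters sent via stopStream.chunks (deferred)
    """
    if streamed_chars and stopped_chars:
        return "split"
    if streamed_chars:
        return "live stream"
    if stopped_chars:
        return "stopStream only"
    return "empty"
```

Under INV-1 and INV-6, "live stream" is the expected outcome and "empty" always indicates a failure.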

Pipeline Scope

The following files are covered by this benchmark:

| File | Role |
|------|------|
| ai_platform_engineering/integrations/slack_bot/utils/ai.py | Slack streaming event loop, StreamBuffer, finalization |
| ai_platform_engineering/multi_agents/platform_engineer/protocol_bindings/a2a/agent.py | A2A binding: yields streaming events from LangGraph |
| ai_platform_engineering/multi_agents/platform_engineer/protocol_bindings/a2a/agent_executor.py | Executor: artifact construction, deterministic chunker |
| ai_platform_engineering/multi_agents/platform_engineer/rag_tools.py | RAG cap wrappers, per-tool cap tracking |
| ai_platform_engineering/utils/deepagents_custom/middleware.py | DeterministicTaskMiddleware: RAG loop detection |
| ai_platform_engineering/integrations/slack_bot/utils/event_parser.py | SSE event parser (EventType enum) |
| ai_platform_engineering/integrations/slack_bot/a2a_client.py | A2A SSE client |

Running the Suite

Prerequisites

  • Supervisor running at http://localhost:8000 (or specify --url)
  • RAG knowledge base loaded (for rag-simple and rag-complex scenarios)

Commands

# Run the conformance suite (terminal output)
PYTHONPATH=. uv run python tests/simulate_slack_stream.py --suite

# Run with verbose output (shows every tool and appendStream call)
PYTHONPATH=. uv run python tests/simulate_slack_stream.py --suite -v

# Generate a markdown report (auto-timestamped)
PYTHONPATH=. uv run python tests/simulate_slack_stream.py --suite --report

# Generate a report to a specific path
PYTHONPATH=. uv run python tests/simulate_slack_stream.py --suite --report results.md

Report Output

The --report flag generates a tabulated markdown report with:

  1. Scenario Results — per-scenario pass/fail with duration
  2. Per-Query Streaming Metrics — chars, calls, chunks, tools, delivery method
  3. Conformance Check Details — per-check pass/fail with detail
  4. State Flags — streaming_final_answer, stream_opened, plan_steps per scenario
  5. Event Counts — SSE event type counts per scenario

Reports are saved to tests/reports/ (gitignored) by default.


Enforcement

A Cursor IDE rule (.cursor/rules/streaming-conformance.mdc) automatically triggers when any of the 7 pipeline files are edited. The rule instructs developers to run the conformance suite before committing.


Adding New Scenarios

  1. Add a new entry to the SCENARIOS list in tests/simulate_slack_stream.py
  2. Define the query, description, and conformance checks (lambdas)
  3. Update tests/STREAMING_CONFORMANCE.md with the new scenario
  4. Run --suite to verify all checks pass
  5. Update this doc with the new scenario details
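The steps above can be sketched as a single new entry. Everything here is hypothetical: the scenario name, query, and dict keys are invented for illustration, and the real SCENARIOS structure in tests/simulate_slack_stream.py may differ:

```python
# Hypothetical shape of a new SCENARIOS entry (keys are assumptions).
NEW_SCENARIO = {
    "name": "rag-followup",                          # invented scenario name
    "query": "what protocols do agntcy agents use?", # invented query
    "description": "Follow-up RAG query streams without artifact resets",
    "checks": [
        ("content_delivered", lambda r: len(r["content"]) > 200),
        ("stream_opened", lambda r: r["stream_opened"]),
        ("tools_used", lambda r: bool(r["tools"])),
        ("final_answer_latched", lambda r: r["streaming_final_answer"] is True),
    ],
}
```

Keeping each check as a named lambda means the --report output can list per-check pass/fail without any extra wiring.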