Skip to main content
Version: main 🚧

Tasks: MCP Authorization Resilience

Input: Design documents from docs/docs/specs/2026-06-02-mcp-authz-resilience/ Prerequisites: plan.md, spec.md, research.md, data-model.md, contracts/, quickstart.md

Tests: INCLUDED. The spec's success criteria (SC-002/004/005) demand verifiable transient/permanent/denial behavior, so behavioral unit tests are part of the work (TDD: write failing tests before impl).

Organization: By user story (US1 timeout default, US2 retry/reconcile, US3 messaging) so each is independently deliverable.

Format: [ID] [P?] [Story] Description​

  • [P]: Can run in parallel (different files, no dependency on incomplete tasks)
  • [Story]: US1/US2/US3
  • Exact file paths included.

Path Conventions​

  • Chart: charts/ai-platform-engineering/
  • Runtime: ai_platform_engineering/dynamic_agents/src/dynamic_agents/services/
  • Tests: ai_platform_engineering/dynamic_agents/tests/
  • Dev gateway: deploy/agentgateway/
  • Docs: docs/docs/security/rbac/

Phase 1: Setup (Shared Infrastructure)​

  • T001 Ensure dynamic-agents dev env for lint/tests: cd ai_platform_engineering/dynamic_agents && uv venv --python python3.13 .venv && uv sync (per CLAUDE.md worktree/venv rule)

Phase 2: Foundational (Blocking Prerequisites)​

⚠️ CRITICAL: The shared error classifier blocks US2 (retry decisions) and US3 (messaging). US1 does NOT depend on this phase and may proceed in parallel.

  • T002 Add MCPServerLoadOutcome status typing (available|transient|permanent|denied) and a single classify_load_error(error_msg: str, status_code: int | None = None) -> str helper in ai_platform_engineering/dynamic_agents/src/dynamic_agents/services/mcp_client.py, reusing _extract_error_message/_diagnose_endpoint_failure. Conservative bias: ambiguous 403 β‡’ denied. Type hints + docstring + named constants per constitution.

Checkpoint: classifier available β€” US2 and US3 can build on it.


Phase 3: User Story 1 - Default install has working MCP tools (Priority: P1) 🎯 MVP​

Goal: Ship the configurable ext_authz timeout default (10s) so a default install no longer reports healthy/authorized MCP servers as unavailable.

Independent Test: helm template shows timeout: "10s" under extAuthz and honors --set global.agentgateway.extAuth.timeout=5s; dev compose reloads config with 10s.

Implementation for User Story 1​

  • T003 [P] [US1] Add timeout: "10s" (with why-comment about the 200ms race) under global.agentgateway.extAuth in charts/ai-platform-engineering/values.yaml
  • T004 [US1] Render timeout: {{ $extAuth.timeout | default "10s" | quote }} inside the extAuthz block of charts/ai-platform-engineering/templates/agentgateway-static-config.yaml (depends on T003)
  • T005 [P] [US1] Dev-parity: extAuthz.timeout: 10s in deploy/agentgateway/config.yaml and DEFAULT_MCP_ROUTE_POLICIES in deploy/agentgateway/config_bridge.py β€” already applied as the live dev hotfix (cross-ref; verify still present)
  • T006 [P] [US1] Document the new ext_authz timeout knob + default in docs/docs/security/rbac/architecture.md (RBAC living-documentation rule)
  • T007 [P] [US1] Document CRD/Gateway-API path guidance (no policy timeout field; tune via authz-bridge backend requestTimeout) near the extAuth/routingMode comments in charts/ai-platform-engineering/values.yaml and the agentgateway routing doc
  • T008 [US1] Verify: helm template render shows 10s default and honors override (quickstart US1) (depends on T003, T004)

Checkpoint: US1 independently shippable β€” the reported defect's primary root cause is fixed.


Phase 4: User Story 2 - Cold-start slowness self-heals (Priority: P2)​

Goal: Bounded transient-retry so cold-start auth timeouts become available without manual action.

Independent Test: unit tests for retry-then-success / permanent-fail-fast / no-retry-on-success / denial-not-retried.

Tests for User Story 2 (write first, ensure they FAIL) βš οΈβ€‹

  • T009 [P] [US2] Unit test: transient first attempt then success β‡’ server available, attempts>1, not in failed list β€” ai_platform_engineering/dynamic_agents/tests/test_mcp_resilience_retry.py
  • T010 [P] [US2] Unit test: permanent error β‡’ attempts==1 (fail fast), in permanent-failed list β€” same test file
  • T011 [P] [US2] Unit test: success on first attempt β‡’ attempts==1 (zero retries, no added latency) β€” same test file
  • T012 [P] [US2] Unit test: clean policy 403 (denial) β‡’ attempts==1, not retried β€” same test file

Implementation for User Story 2​

  • T013 [US2] Add bounded retry (max_attempts=3, jittered exponential backoff base_backoff_s=0.25) gated by classify_load_error in get_tools_with_resilience (mcp_client.py): retry only transient; permanent/denied return immediately; per-server concurrency preserved (depends on T002)
  • T014 [US2] Expose a per-server status map from get_tools_with_resilience while preserving the existing (all_tools, failed_servers, failed_errors) return for current callers (mcp_client.py) (depends on T013)
  • T015 [US2] Run US2 tests green (depends on T009–T014)

Checkpoint: US2 self-heals transient cold-start failures; US1+US2 both work.


Phase 5: User Story 3 - Honest not-ready vs failed messaging (Priority: P3)​

Goal: Distinct transient ("starting up, will retry") vs permanent ("needs attention") messaging; denials unchanged.

Independent Test: unit tests for classification mapping and message wording per class.

Note: Builds on the status map from US2 (T014) and the classifier (T002).

Tests for User Story 3 (write first, ensure they FAIL) βš οΈβ€‹

  • T016 [P] [US3] Unit test: classify_load_error mapping (timeout/5xx/authz-timeout-403β‡’transient; unknown-host/refused/404β‡’permanent; clean 401/403β‡’denied) β€” ai_platform_engineering/dynamic_agents/tests/test_mcp_load_classification.py
  • T017 [P] [US3] Unit test: transient warning conveys "starting up/retry" (NOT "will not work"); permanent warning keeps "Tools from this server will not work."; denial message unchanged β€” ai_platform_engineering/dynamic_agents/tests/test_agent_runtime_warnings.py

Implementation for User Story 3​

  • T018 [US3] Replace _failed_servers/_failed_servers_error with classification-aware state (_failed_servers_transient, _failed_servers_permanent + messages) populated from the US2 status map in agent_runtime.py (~lines 263, 363–375) (depends on T014)
  • T019 [US3] Emit distinct system-prompt warning lines for transient vs permanent in agent_runtime.py (~lines 587–589) per contract C3 (depends on T018)
  • T020 [US3] Emit distinct streamed on_warning messages for transient vs permanent; keep denial messaging unchanged in agent_runtime.py (~lines 1061–1063) (depends on T018)
  • T021 [P] [US3] Apply the same transient/permanent split to the subagent load path logging in agent_runtime.py (~lines 893–896) (depends on T014)
  • T022 [US3] Run US3 tests green (depends on T016–T021)

Checkpoint: All three stories independently functional.


Phase 6: Polish & Cross-Cutting​

  • T023 [P] Lint: cd ai_platform_engineering/dynamic_agents && uv run ruff check src
  • T024 [P] Full tests: cd ai_platform_engineering/dynamic_agents && PYTHONPATH=src uv run pytest tests -q
  • T025 Run quickstart.md validation (helm render static+override; optional compose smoke: enumerate tools, confirm no healthy server marked unavailable)
  • T026 [P] Note the new get_tools_with_resilience retry/classification behavior in the dynamic-agents README/ARCHITECTURE.md if config/behavior docs exist there

Dependencies & Execution Order​

Phase Dependencies​

  • Setup (P1): none.
  • Foundational (P2 / T002): blocks US2 + US3 (not US1).
  • US1 (P3): independent β€” can run alongside Foundational/US2/US3.
  • US2 (P4): needs T002.
  • US3 (P5): needs T002 + US2's T014 (status map).
  • Polish (P6): after the stories you intend to ship.

Within stories​

  • Tests written and failing before implementation (US2, US3).
  • US2 retry (T013) before status exposure (T014) before US3 consumes it.

Parallel Opportunities​

  • T003, T006, T007 (US1 docs/values, different files) run in parallel; T004 after T003.
  • US2 test tasks T009–T012 in parallel.
  • US3 test tasks T016–T017 in parallel; T021 parallel to T019/T020 (different region/file concern).
  • US1 (config/docs) can be done by one person while another does Foundational+US2 (code).

Implementation Strategy​

MVP first (US1 only)​

  1. T001 setup β†’ 2. T003/T004/T008 (timeout default) + T005 verify hotfix + T006/T007 docs β†’ ship the permanent fix for the reported defect.

Incremental delivery​

  1. US1 (MVP) β†’ render-verify β†’ ship.
    • Foundational (T002) + US2 (retry) β†’ unit-test β†’ ship self-healing.
    • US3 (messaging) β†’ unit-test β†’ ship honest status.

Notes​

  • [P] = different files, no incomplete-task dependency.
  • T005 is pre-completed (live dev hotfix); all other boxes unchecked.
  • Security: retry never flips ext_authz to fail-open; denials never retried/relabeled (FR-004, FR-009).
  • Commit per logical group with Conventional Commits + DCO.