Tasks: MCP Authorization Resilience
Input: Design documents from docs/docs/specs/2026-06-02-mcp-authz-resilience/
Prerequisites: plan.md, spec.md, research.md, data-model.md, contracts/, quickstart.md
Tests: INCLUDED. The spec's success criteria (SC-002/004/005) demand verifiable transient/permanent/denial behavior, so behavioral unit tests are part of the work (TDD: write failing tests before impl).
Organization: By user story (US1 timeout default, US2 retry/reconcile, US3 messaging) so each is independently deliverable.
Format: [ID] [P?] [Story] Description
- [P]: Can run in parallel (different files, no dependency on incomplete tasks)
- [Story]: US1/US2/US3
- Exact file paths included.
Path Conventions
- Chart: charts/ai-platform-engineering/
- Runtime: ai_platform_engineering/dynamic_agents/src/dynamic_agents/services/
- Tests: ai_platform_engineering/dynamic_agents/tests/
- Dev gateway: deploy/agentgateway/
- Docs: docs/docs/security/rbac/
Phase 1: Setup (Shared Infrastructure)
- T001 Ensure dynamic-agents dev env for lint/tests: cd ai_platform_engineering/dynamic_agents && uv venv --python python3.13 .venv && uv sync (per CLAUDE.md worktree/venv rule)
Phase 2: Foundational (Blocking Prerequisites)
⚠️ CRITICAL: The shared error classifier blocks US2 (retry decisions) and US3 (messaging). US1 does NOT depend on this phase and may proceed in parallel.
- T002 Add MCPServerLoadOutcome status typing (available|transient|permanent|denied) and a single classify_load_error(error_msg: str, status_code: int | None = None) -> str helper in ai_platform_engineering/dynamic_agents/src/dynamic_agents/services/mcp_client.py, reusing _extract_error_message/_diagnose_endpoint_failure. Conservative bias: ambiguous 403 → denied. Type hints + docstring + named constants per constitution.
Checkpoint: classifier available; US2 and US3 can build on it.
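A minimal sketch of what the T002 classifier could look like, assuming the mapping that T016 later tests (timeout/5xx/authz-timeout-403 → transient; unknown-host/refused/404 → permanent; clean 401/403 → denied). The marker substrings, the constant names, and the fallback for unrecognized errors are illustrative assumptions, not the contents of mcp_client.py. The real helper is expected to reuse _extract_error_message/_diagnose_endpoint_failure.

```python
from __future__ import annotations

# Hypothetical marker substrings; the real error text may differ.
TRANSIENT_MARKERS = ("timeout", "timed out")
PERMANENT_MARKERS = ("name or service not known", "connection refused", "not found")

HTTP_UNAUTHORIZED = 401
HTTP_FORBIDDEN = 403
HTTP_NOT_FOUND = 404


def classify_load_error(error_msg: str, status_code: int | None = None) -> str:
    """Map a failed MCP server load to 'transient', 'permanent', or 'denied'.

    'available' is never returned here because this helper only sees failures.
    """
    text = error_msg.lower()
    looks_like_timeout = any(marker in text for marker in TRANSIENT_MARKERS)

    if status_code == HTTP_UNAUTHORIZED:
        return "denied"
    if status_code == HTTP_FORBIDDEN:
        # Conservative bias: a 403 is a denial unless it clearly reports
        # an authorization timeout.
        return "transient" if looks_like_timeout else "denied"
    if status_code == HTTP_NOT_FOUND:
        return "permanent"
    if status_code is not None and 500 <= status_code <= 599:
        return "transient"
    if looks_like_timeout:
        return "transient"
    if any(marker in text for marker in PERMANENT_MARKERS):
        return "permanent"
    # Assumption: unrecognized errors fail fast instead of being retried.
    return "permanent"
```

Keeping the 403 branch ahead of the generic timeout check means a clean policy denial can never be relabeled as transient by an unrelated word in the message.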
Phase 3: User Story 1 - Default install has working MCP tools (Priority: P1) 🎯 MVP
Goal: Ship the configurable ext_authz timeout default (10s) so a default install no longer reports healthy/authorized MCP servers as unavailable.
Independent Test: helm template shows timeout: "10s" under extAuthz and honors --set global.agentgateway.extAuth.timeout=5s; dev compose reloads config with 10s.
Implementation for User Story 1
- T003 [P] [US1] Add timeout: "10s" (with why-comment about the 200ms race) under global.agentgateway.extAuth in charts/ai-platform-engineering/values.yaml
- T004 [US1] Render timeout: {{ $extAuth.timeout | default "10s" | quote }} inside the extAuthz block of charts/ai-platform-engineering/templates/agentgateway-static-config.yaml (depends on T003)
- T005 [P] [US1] Dev-parity: extAuthz.timeout: 10s in deploy/agentgateway/config.yaml and DEFAULT_MCP_ROUTE_POLICIES in deploy/agentgateway/config_bridge.py; already applied as the live dev hotfix (cross-ref; verify still present)
- T006 [P] [US1] Document the new ext_authz timeout knob + default in docs/docs/security/rbac/architecture.md (RBAC living-documentation rule)
- T007 [P] [US1] Document CRD/Gateway-API path guidance (no policy timeout field; tune via authz-bridge backend requestTimeout) near the extAuth/routingMode comments in charts/ai-platform-engineering/values.yaml and the agentgateway routing doc
- T008 [US1] Verify: helm template render shows 10s default and honors override (quickstart US1) (depends on T003, T004)
Checkpoint: US1 independently shippable; the reported defect's primary root cause is fixed.
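One way to automate the T008 check is to scan the rendered manifest text for timeout values nested under an extAuthz key. This is a rough sketch: the function name, the indentation-based scan, and the sample layout in the usage note are assumptions, and the real rendered config may be shaped differently.

```python
import re


def ext_authz_timeouts(rendered: str) -> list[str]:
    """Return every timeout value found inside an extAuthz block."""
    timeouts: list[str] = []
    block_indent = None  # indentation of the extAuthz key while inside its block
    for line in rendered.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        if block_indent is not None and indent <= block_indent:
            block_indent = None  # dedent: the extAuthz block has ended
        if block_indent is None:
            if re.match(r"(- )?extAuthz:\s*$", stripped):
                block_indent = indent
            continue
        match = re.match(r'timeout:\s*"?([^"\s]+)"?\s*$', stripped)
        if match:
            timeouts.append(match.group(1))
    return timeouts
```

Fed the output of helm template, the function should yield 10s for a default render and 5s when rendered with --set global.agentgateway.extAuth.timeout=5s. A timeout key outside the extAuthz block is ignored.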
Phase 4: User Story 2 - Cold-start slowness self-heals (Priority: P2)
Goal: Bounded transient-retry so cold-start auth timeouts become available without manual action.
Independent Test: unit tests for retry-then-success / permanent-fail-fast / no-retry-on-success / denial-not-retried.
Tests for User Story 2 (write first, ensure they FAIL) ⚠️
- T009 [P] [US2] Unit test: transient first attempt then success → server available, attempts>1, not in failed list (ai_platform_engineering/dynamic_agents/tests/test_mcp_resilience_retry.py)
- T010 [P] [US2] Unit test: permanent error → attempts==1 (fail fast), in permanent-failed list (same test file)
- T011 [P] [US2] Unit test: success on first attempt → attempts==1 (zero retries, no added latency) (same test file)
- T012 [P] [US2] Unit test: clean policy 403 (denial) → attempts==1, not retried (same test file)
Implementation for User Story 2
- T013 [US2] Add bounded retry (max_attempts=3, jittered exponential backoff base_backoff_s=0.25) gated by classify_load_error in get_tools_with_resilience (mcp_client.py): retry only transient; permanent/denied return immediately; per-server concurrency preserved (depends on T002)
- T014 [US2] Expose a per-server status map from get_tools_with_resilience while preserving the existing (all_tools, failed_servers, failed_errors) return for current callers (mcp_client.py) (depends on T013)
- T015 [US2] Run US2 tests green (depends on T009–T014)
Checkpoint: US2 self-heals transient cold-start failures; US1+US2 both work.
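The retry gate in T013 can be sketched as below. This is an illustration under assumptions, not the code in mcp_client.py: load_with_retry, backoff_delay, the outcome dictionary, and the injectable sleep are hypothetical names, and "full jitter" is one possible reading of "jittered exponential backoff" since the task does not name a jitter strategy.

```python
import asyncio
import random
from typing import Awaitable, Callable

MAX_ATTEMPTS = 3
BASE_BACKOFF_S = 0.25


def backoff_delay(attempt: int, base_s: float = BASE_BACKOFF_S) -> float:
    """Full-jitter delay after failed attempt number `attempt` (1-based)."""
    return random.uniform(0.0, base_s * 2 ** (attempt - 1))


async def load_with_retry(
    load_once: Callable[[], Awaitable[list[str]]],
    classify: Callable[[Exception], str],
    max_attempts: int = MAX_ATTEMPTS,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> dict[str, object]:
    """Load one server's tools, retrying only failures classified as transient."""
    for attempt in range(1, max_attempts + 1):
        try:
            tools = await load_once()
        except Exception as exc:
            status = classify(exc)
            # permanent/denied return immediately; transient stops at the cap.
            if status != "transient" or attempt == max_attempts:
                return {"status": status, "attempts": attempt, "tools": [], "error": str(exc)}
            await sleep(backoff_delay(attempt))
        else:
            return {"status": "available", "attempts": attempt, "tools": tools, "error": None}
    raise ValueError("max_attempts must be >= 1")
```

Because each server gets its own coroutine, several servers can still be awaited together (for example with asyncio.gather), so a slow retry on one server does not serialize the others. A success on the first attempt never sleeps, which matches the zero-added-latency expectation in T011.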
Phase 5: User Story 3 - Honest not-ready vs failed messaging (Priority: P3)
Goal: Distinct transient ("starting up, will retry") vs permanent ("needs attention") messaging; denials unchanged.
Independent Test: unit tests for classification mapping and message wording per class.
Note: Builds on the status map from US2 (T014) and the classifier (T002).
Tests for User Story 3 (write first, ensure they FAIL) ⚠️
- T016 [P] [US3] Unit test: classify_load_error mapping (timeout/5xx/authz-timeout-403 → transient; unknown-host/refused/404 → permanent; clean 401/403 → denied) (ai_platform_engineering/dynamic_agents/tests/test_mcp_load_classification.py)
- T017 [P] [US3] Unit test: transient warning conveys "starting up/retry" (NOT "will not work"); permanent warning keeps "Tools from this server will not work."; denial message unchanged (ai_platform_engineering/dynamic_agents/tests/test_agent_runtime_warnings.py)
Implementation for User Story 3
- T018 [US3] Replace _failed_servers/_failed_servers_error with classification-aware state (_failed_servers_transient, _failed_servers_permanent + messages) populated from the US2 status map in agent_runtime.py (~lines 263, 363–375) (depends on T014)
- T019 [US3] Emit distinct system-prompt warning lines for transient vs permanent in agent_runtime.py (~lines 587–589) per contract C3 (depends on T018)
- T020 [US3] Emit distinct streamed on_warning messages for transient vs permanent; keep denial messaging unchanged in agent_runtime.py (~lines 1061–1063) (depends on T018)
- T021 [P] [US3] Apply the same transient/permanent split to the subagent load path logging in agent_runtime.py (~lines 893–896) (depends on T014)
- T022 [US3] Run US3 tests green (depends on T016–T021)
Checkpoint: All three stories independently functional.
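The message split in T019/T020 could be expressed as a small pure function, sketched here under assumptions. Only the sentence "Tools from this server will not work." comes from the tasks above; build_load_warning, both templates, and the transient wording are hypothetical. Returning None for denials stands in for "leave the existing denial message untouched".

```python
from typing import Optional

# Wording that T017 requires the permanent warning to keep.
PERMANENT_SUFFIX = "Tools from this server will not work."

# Hypothetical wording; the tasks only require a "starting up / retry" meaning.
TRANSIENT_TEMPLATE = (
    "MCP server '{server}' is still starting up; loading its tools will be retried."
)
PERMANENT_TEMPLATE = "MCP server '{server}' failed to load ({error}). " + PERMANENT_SUFFIX


def build_load_warning(server: str, status: str, error: str) -> Optional[str]:
    """Return the warning for a failed load, or None when no new warning applies."""
    if status == "transient":
        return TRANSIENT_TEMPLATE.format(server=server)
    if status == "permanent":
        return PERMANENT_TEMPLATE.format(server=server, error=error)
    # 'denied' keeps its existing message; 'available' needs no warning.
    return None
```

Keeping the wording in one function lets the system-prompt path and the streamed on_warning path share the same text, so the two cannot drift apart.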
Phase 6: Polish & Cross-Cutting
- T023 [P] Lint: cd ai_platform_engineering/dynamic_agents && uv run ruff check src
- T024 [P] Full tests: cd ai_platform_engineering/dynamic_agents && PYTHONPATH=src uv run pytest tests -q
- T025 Run quickstart.md validation (helm render static+override; optional compose smoke: enumerate tools, confirm no healthy server marked unavailable)
- T026 [P] Note the new get_tools_with_resilience retry/classification behavior in the dynamic-agents README/ARCHITECTURE.md if config/behavior docs exist there
Dependencies & Execution Order
Phase Dependencies
- Setup (Phase 1): none.
- Foundational (Phase 2 / T002): blocks US2 + US3 (not US1).
- US1 (Phase 3): independent; can run alongside Foundational/US2/US3.
- US2 (Phase 4): needs T002.
- US3 (Phase 5): needs T002 + US2's T014 (status map).
- Polish (Phase 6): after the stories you intend to ship.
Within stories
- Tests written and failing before implementation (US2, US3).
- US2 retry (T013) before status exposure (T014) before US3 consumes it.
Parallel Opportunities
- T003, T006, T007 (US1 docs/values, different files) run in parallel; T004 after T003.
- US2 test tasks T009–T012 in parallel.
- US3 test tasks T016–T017 in parallel; T021 parallel to T019/T020 (different region/file concern).
- US1 (config/docs) can be done by one person while another does Foundational+US2 (code).
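The explicit "depends on" notes can be turned into execution waves with the standard-library graphlib module. The sketch below is illustrative: DEPENDS_ON copies only the single-task dependencies stated in the task list (the range dependencies of T015 and T022 are left out for brevity), and execution_waves is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Task -> tasks it depends on, copied from the "depends on" notes above.
DEPENDS_ON = {
    "T004": {"T003"},
    "T008": {"T003", "T004"},
    "T013": {"T002"},
    "T014": {"T013"},
    "T018": {"T014"},
    "T019": {"T018"},
    "T020": {"T018"},
    "T021": {"T014"},
}


def execution_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave has all dependencies done."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the mapping above this gives five waves: T002 and T003 first, then T004 and T013, then T008 and T014, then T018 and T021, and finally T019 and T020. Waves reflect dependencies only; tasks that edit the same file may still need to be serialized by whoever picks them up.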
Implementation Strategy
MVP first (US1 only)
- T001 setup → T003/T004/T008 (timeout default) + T005 verify hotfix + T006/T007 docs → ship the permanent fix for the reported defect.
Incremental delivery
- US1 (MVP) → render-verify → ship.
- Foundational (T002) + US2 (retry) → unit-test → ship self-healing.
- US3 (messaging) → unit-test → ship honest status.
Notes
- [P] = different files, no incomplete-task dependency.
- T005 is pre-completed (live dev hotfix); all other boxes unchecked.
- Security: retry never flips ext_authz to fail-open; denials never retried/relabeled (FR-004, FR-009).
- Commit per logical group with Conventional Commits + DCO.