Feature Specification: Comprehensive RBAC Tests + Completion of 098
Feature Branch: prebuild/feat/comprehensive-rbac (existing)
Created: 2026-04-22
Status: Draft, awaiting user review
Input: User description: "We need a super comprehensive rules based unit and e2e tests for RBAC for each area and update how RBAC works diagrams based on the updates we made"
Companion docs:
- `call-sequences.md`: code-level sequence diagrams (real `file:function` references, before/after migration) for every flow this spec touches.
Why this spec exists
Spec 098-enterprise-rbac-slack-ui defined the target end state: every CAIPE surface (UI BFF, supervisor, agents, MCP, RAG, Slack, dynamic agents, A2A) gates on Keycloak Authorization Services with default-deny, OBO token forwarding, and CEL where appropriate.
Today (2026-04-22), the implementation is partial:
| Surface | 098 target | Reality on prebuild/feat/comprehensive-rbac |
|---|---|---|
| UI BFF: admin pages | requireRbacPermission(session, 'admin_ui', 'view') | /admin/users and /admin/users/stats migrated; ~15 other admin/management routes still on legacy requireAdmin(session) |
| Supervisor | JwtUserContextMiddleware + JWKS + OBO + httpx_client_factory | Implemented after PR #1253 + #1145 merges; untested as a unit |
| Agent MCP | All MCP servers honour JWT and gate on requireRbacPermission | Mix of shared key, OAuth2, dual-auth; no Keycloak gate in any MCP server |
| RAG (KB ingest/retrieve) | rag#ingest / rag#retrieve Keycloak gates + Mongo KB ACL (per FR-026, FR-027) | Group-claim checks only; no Keycloak resource defined |
| Slack bot | slack#use scope + per-user OBO via slack_user_id linkage | Channel-allowlist + group claims only |
| Dynamic / custom agents | dynamic_agent#view / #invoke / #manage + JWT validation + per-request token forwarding to MCP (per FR-028, FR-030) | DA backend trusts forgeable X-User-Context header; chat endpoint has no per-agent authz; MCP tool calls carry no user bearer |
| A2A | OBO across agent hops (per FR-018) | forward_jwt_to_mcp flag exists; no integration tests prove the chain |
Without comprehensive tests this gap is invisible to reviewers and easy to regress. Without the missing migrations, the tests have nothing to assert against. This spec closes both at once.
Clarifications
Session 2026-04-22
- Q: How should the e2e test stack be brought up: a new `docker-compose/docker-compose.e2e.yaml`, or reuse `docker-compose.dev.yaml` with `COMPOSE_PROFILES`? → A: Reuse `docker-compose.dev.yaml` with a curated `COMPOSE_PROFILES` selection (no second compose file). Test-only port remaps and env-var overrides go in a tiny `docker-compose/docker-compose.e2e.override.yaml` overlay only if strictly required (e.g., to avoid host-port collisions with a running dev stack). The `make test-rbac` target sets `COMPOSE_PROFILES="rbac,caipe-ui,caipe-supervisor,caipe-mongodb,dynamic-agents,rag,all-agents,slack-bot"` and runs `docker compose -f docker-compose.dev.yaml [-f docker-compose/docker-compose.e2e.override.yaml] up -d --wait`.
In scope
- Migrate every authorization decision point to Keycloak Authorization Services, replacing legacy gates (`requireAdmin`, `canViewAdmin`, raw group-claim checks, channel-allowlists, `X-User-Context` header trust).
- Define and seed the missing Keycloak resources, scopes, and policies in `deploy/keycloak/realm-config.json`.
- Add a uniform RBAC middleware to every Python service (supervisor, dynamic agents backend, RAG server, agent MCP servers, Slack bot) that validates the bearer against JWKS and calls Keycloak's PDP with `urn:ietf:params:oauth:grant-type:uma-ticket`.
- Wire per-request user-token forwarding into every MCP client used by an agent (the supervisor's `httpx_client_factory` exists; the dynamic-agents `MultiServerMCPClient` has no equivalent yet).
- Add a comprehensive automated test matrix:
  - Jest unit tests for every BFF route × every persona × allow/deny permutation.
  - pytest unit tests for every Python middleware × every persona × allow/deny permutation.
  - Playwright end-to-end tests against a real Keycloak (docker-compose) covering the canonical user journeys.
- Add audit logging at every new gate using the existing `logAuthzDecision` (TS) / equivalent Python helper.
- Update the canonical RBAC reference under `docs/docs/security/rbac/` (`architecture.md`, `workflows.md`, `usage.md`, `file-map.md`) so its diagrams, file map, and component sections reflect the post-migration reality. (The original single-file doc at `docs/docs/specs/098-enterprise-rbac-slack-ui/how-rbac-works.md` is now a redirect stub.)
Out of scope (explicitly)
- Replacing NextAuth with another auth library.
- Replacing MongoDB as the team / KB-ownership store.
- Replacing the existing CEL evaluator implementations (`ui/src/lib/rbac/cel-evaluator.ts`, `ai_platform_engineering/dynamic_agents/src/dynamic_agents/cel_evaluator.py`).
- Designing a new Admin UI layout. Admin UI changes are limited to wiring forms to the new Keycloak APIs and documenting the migration under `docs/docs/security/rbac/`.
- Performance benchmarking. The PDP cache TTL (`RBAC_CACHE_TTL_SECONDS`, default 60s) is taken as given.
- Multi-realm or multi-Keycloak federation. Single realm only.
Personas (used throughout the user stories below)
| Persona | Keycloak realm roles | Team membership (Mongo) | Slack link |
|---|---|---|---|
| `alice_admin` | admin | caipe-admins | linked |
| `bob_chat_user` | chat_user, team_member | team-a | linked |
| `carol_kb_ingestor` | chat_user, kb_ingestor (per-KB role: kb_ingestor:team-a-docs) | team-a | linked |
| `dave_no_role` | (none) | (none) | unlinked |
| `eve_dynamic_agent_user` | chat_user, agent_user:my-team-agent | team-a | linked |
| `frank_service_account` | client-credentials, service_account realm role | n/a | n/a |
These personas are defined as kcadm create-script fragments in the test fixture and reused across Jest, pytest, and Playwright.
User Scenarios & Testing (mandatory)
User Story 1: Admin UI is fully Keycloak-gated (Priority: P1)
alice_admin can reach every page under /admin/* and every BFF route under /api/admin/*. Anyone else gets a 403.
Why this priority: This is the most-used RBAC surface, the area the existing test suite exercises best, and the lowest-risk migration (the pattern is already proven by /api/admin/users/stats).
Independent Test: Boot Keycloak + UI, log in as each persona, hit every /api/admin/* route, assert 200 for alice_admin and 403 for everyone else. Uses real Keycloak Authorization Services PDP.
Acceptance Scenarios:
- Given `alice_admin` is logged in, When she GETs `/api/admin/users`, Then she receives 200 and the response is sourced from Keycloak Admin API.
- Given `bob_chat_user` is logged in, When he GETs `/api/admin/users`, Then he receives 403 with `reason=DENY_NO_CAPABILITY`.
- Given Keycloak's PDP is unreachable AND `admin_ui` has a realm-role fallback configured to `admin`, When `alice_admin` (who has the `admin` realm role) GETs `/api/admin/users`, Then she receives 200 via the role-fallback. Configuration lives in `deploy/keycloak/realm-config-extras.json`.
- Given Keycloak's PDP is unreachable AND `admin_ui` has a realm-role fallback configured, When `bob_chat_user` GETs `/api/admin/users`, Then he receives 403 (he lacks the fallback role).
- Given Keycloak's PDP is unreachable AND a resource has NO fallback configured (default), When any persona GETs the gated route, Then the response is 503 with `reason=DENY_PDP_UNAVAILABLE` (deny-all).
- Given any caller, When the gate fires, Then an entry appears in the `authz_decisions` Mongo collection with `{userId, resource: 'admin_ui', scope: 'view', allowed, reason, timestamp}`.
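The PDP-reachable and PDP-unreachable scenarios above amount to a small decision table. The following is a minimal Python sketch of that table, not the project's implementation. The names `decide` and `Decision` and the reason strings `ALLOW_PDP` and `ALLOW_ROLE_FALLBACK` are hypothetical; only `DENY_NO_CAPABILITY` and `DENY_PDP_UNAVAILABLE` appear in this spec, and the reason used when the fallback role is missing is an assumption.

```python
from typing import NamedTuple, Optional, Set


class Decision(NamedTuple):
    status: int  # HTTP status the gate would return
    reason: str  # value recorded in the audit entry


def decide(pdp_allowed: Optional[bool],
           user_roles: Set[str],
           fallback_role: Optional[str] = None) -> Decision:
    """pdp_allowed is True/False when the PDP answered, None when it is unreachable."""
    if pdp_allowed is True:
        return Decision(200, "ALLOW_PDP")  # hypothetical reason string
    if pdp_allowed is False:
        return Decision(403, "DENY_NO_CAPABILITY")
    # From here on the PDP is unreachable.
    if fallback_role is None:
        return Decision(503, "DENY_PDP_UNAVAILABLE")  # default: deny-all
    if fallback_role in user_roles:
        return Decision(200, "ALLOW_ROLE_FALLBACK")  # hypothetical reason string
    # Assumption: the spec requires 403 here but does not name the reason.
    return Decision(403, "DENY_NO_CAPABILITY")
```

Modelling "PDP unreachable" as `None` keeps the three outcomes (allow, deny, unknown) distinct, so a missing fallback can never be confused with an explicit deny.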
User Story 2: Supervisor enforces Keycloak before delegating to agents (Priority: P1)
When the supervisor receives an A2A request, it validates the bearer against Keycloak's JWKS, extracts the user context, and passes the user's OBO token (not the bot's service account) to every downstream agent and MCP call.
Why this priority: The supervisor is the trust boundary for every backend interaction. If it doesn't enforce, every downstream check is moot.
Independent Test: Send an A2A request with (a) a valid bob_chat_user token, (b) an expired token, (c) a token signed by a different issuer, (d) no token. Assert 200, 401, 401, 401 respectively. With (a), assert that the OBO token landed in the downstream MCP Authorization header (verified via a stub MCP server that records headers).
Acceptance Scenarios:
- Given a valid bearer for `bob_chat_user`, When the supervisor receives an A2A `tasks/send` for `argocd_agent.list_apps`, Then the request is authorized, an OBO token is minted (`urn:ietf:params:oauth:grant-type:token-exchange`), and the downstream MCP call sees `Authorization: Bearer <obo_token>` whose `act.sub` claim is the supervisor service account.
- Given an expired bearer, When the supervisor receives any A2A request, Then it responds `401 invalid_token` and never opens a graph stream.
- Given a bearer signed by an issuer other than the configured Keycloak realm, When the supervisor receives any A2A request, Then it responds `401 invalid_token` and never queries JWKS for it twice in one minute.
- Given a chain of two agents (supervisor → agent A → agent B), When `bob_chat_user` invokes the chain, Then every hop sees a token whose `sub` resolves to `bob_chat_user`'s `keycloak_sub` and whose `act` chain reflects the calling service.
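A test for the multi-hop scenario needs to read the `act` chain out of a decoded token. The sketch below assumes the tokens carry nested `act` claims in the RFC 8693 style (each `act` object has a `sub` and may contain an earlier `act`); whether the deployed Keycloak emits them this way is an assumption. The helper name `actor_chain` and all claim values are hypothetical, and signature verification is presumed to have happened elsewhere.

```python
from typing import Any, Dict, List


def actor_chain(claims: Dict[str, Any]) -> List[str]:
    """Return actor subjects, most recent first, by walking nested `act` claims."""
    chain: List[str] = []
    act = claims.get("act")
    while isinstance(act, dict):
        sub = act.get("sub")
        if isinstance(sub, str):
            chain.append(sub)
        act = act.get("act")  # step to the previous actor, if any
    return chain


# Example payload as agent B might see it; every value is made up.
claims = {
    "sub": "bob-keycloak-sub",
    "act": {"sub": "agent-a-sa", "act": {"sub": "supervisor-sa"}},
}
```

With a helper like this, a stub MCP server can assert both that `sub` stays the end user on every hop and that the first actor in the chain is the immediate caller.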
User Story 3: Every agent MCP server is Keycloak-gated (Priority: P1)
Every MCP server (argocd, aws, jira, github, pagerduty, splunk, confluence, webex, slack, komodor, aigateway, backstage) accepts only requests with a valid Keycloak-issued bearer and only invokes a tool if the caller has the matching <agent>_mcp#<scope> permission.
Why this priority: MCP is where agent capability ultimately materializes. Today most MCPs trust shared keys or have no auth, which is the largest live attack surface.
Independent Test: For each MCP server, run a parameterized pytest that POSTs tools/list and a representative tools/call with each persona's token. Assert the matrix: chat_user can list+read, team_member can list+read+write within team scope, admin can do everything, dave_no_role gets 401.
Acceptance Scenarios:
- Given `bob_chat_user` with `argocd_mcp:read`, When he calls `argocd.list_apps`, Then the MCP returns 200.
- Given `bob_chat_user` without `argocd_mcp:write`, When he calls `argocd.delete_app`, Then the MCP returns 403 from `requireRbacPermission(...)` and never reaches the tool implementation.
- Given any MCP server with no Authorization header, When any tool call is made, Then the MCP returns 401.
- Given the legacy `SHARED_KEY` env var is still set, When a request arrives with that key but no bearer, Then the MCP returns 401 and logs a deprecation warning. (Shared-key auth is removed in this spec.)
- Given any tool call, When the MCP forwards to the upstream system (ArgoCD API, Jira API, etc.), Then the upstream sees the user's identity in audit logs (header forwarding, where supported).
User Story 4: RAG enforces hybrid Keycloak + Mongo KB ACL (Priority: P1)
Per spec 098 FR-026 and FR-027, the RAG server gates /v1/ingest and /v1/query on Keycloak (rag#ingest, rag#retrieve), then filters per-KB based on Mongo TeamKbOwnership and per-KB Keycloak roles (kb_reader:<id>, kb_ingestor:<id>).
Why this priority: KB content frequently contains sensitive operational data (tickets, runbooks, incident postmortems). Today RAG retrieval is gated only on group claims with no Keycloak PDP involvement, a gap explicitly called out in 098.
Independent Test: Seed two KBs (team-a-docs, team-b-docs) with distinct sentinel documents. Run /v1/query as each persona. Assert: alice_admin sees both, carol_kb_ingestor (team-a) sees only team-a-docs, bob_chat_user sees neither in a strict deployment, both in a permissive deployment (depending on rag_default_visibility).
Acceptance Scenarios:
- Given `carol_kb_ingestor` with `kb_ingestor:team-a-docs`, When she POSTs to `/v1/ingest` with a document tagged `kb_id=team-a-docs`, Then the document is ingested.
- Given the same persona, When she POSTs to `/v1/ingest` with `kb_id=team-b-docs`, Then the RAG server returns 403.
- Given `bob_chat_user` (member of `team-a` in Mongo), When he POSTs to `/v1/query`, Then results include only documents whose `kb_id` is owned by `team-a` per `TeamKbOwnership`.
- Given `dave_no_role`, When he POSTs to `/v1/query`, Then he receives 403.
- Given any successful query, When the result set is built, Then filtering happens server-side in the RAG service, not at the BFF.
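The per-KB filtering step (after the coarse `rag#retrieve` gate has passed) can be sketched as a set union. This is an illustration under stated assumptions, not the RAG server's code: `readable_kbs` and `filter_hits` are hypothetical names, `kb_owner_team` stands in for a lookup against `TeamKbOwnership`, and treating `kb_ingestor:<id>` as implying read access is an assumption. Admin handling and `rag_default_visibility` are left out.

```python
from typing import Any, Dict, Iterable, List, Set


def readable_kbs(user_teams: Set[str],
                 realm_roles: Iterable[str],
                 kb_owner_team: Dict[str, str]) -> Set[str]:
    """Union of team-owned KBs and KBs granted through per-KB realm roles."""
    by_team = {kb for kb, team in kb_owner_team.items() if team in user_teams}
    by_role: Set[str] = set()
    for role in realm_roles:
        prefix, _, kb_id = role.partition(":")
        # Assumption: an ingestor may also read the KB it ingests into.
        if prefix in ("kb_reader", "kb_ingestor") and kb_id:
            by_role.add(kb_id)
    return by_team | by_role


def filter_hits(hits: List[Dict[str, Any]], allowed: Set[str]) -> List[Dict[str, Any]]:
    """Drop retrieved documents whose kb_id is not in the allowed set."""
    return [hit for hit in hits if hit.get("kb_id") in allowed]
```

Running the filter inside the RAG service, on the retrieved hits, matches the last scenario: the BFF never sees documents the caller may not read.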
User Story 5: Slack commands run with the user's identity, not the bot's (Priority: P2)
When a linked Slack user issues a command, the bot exchanges its app token + the user's slack_user_id linkage for a Keycloak OBO token, calls the supervisor with that token, and the entire downstream chain (agents, MCPs, RAG) sees the user's identity.
Why this priority: Without this, every Slack action looks like the bot to backend audit logs, defeating both attribution and per-user authorization.
Independent Test: Send a slash command from a linked user via the Slack Events test harness, capture the supervisor's incoming Authorization header, decode the JWT, assert sub == bob_chat_user.keycloak_sub and act.sub == slack-bot's service account.
Acceptance Scenarios:
- Given a linked Slack user (`bob_chat_user`), When he runs `/caipe list argocd apps`, Then the supervisor sees a token with `sub=bob_chat_user.keycloak_sub`.
- Given an unlinked Slack user, When he runs any command, Then the bot replies with the linking instructions (FR-025) and does not call the supervisor.
- Given a linked user lacking `argocd_mcp:read`, When he runs `/caipe list argocd apps`, Then the supervisor delegates to ArgoCD MCP, ArgoCD MCP returns 403, the bot surfaces a user-friendly message, and the denial is written to the audit log.
- Given a channel mapped to `team-b` and a user lacking `team_member` for `team-b`, When he runs any command, Then the bot denies per FR-031 with a clear message.
- Given the bot's own startup, When it registers with the supervisor, Then it uses its service account token (not a user OBO), and the supervisor allows only the narrow scope `slack#register`.
User Story 6: Custom (dynamic) agents are bound to Keycloak (Priority: P1)
Every dynamic agent becomes a Keycloak resource with view, invoke, manage scopes (per 098 FR-028). The DA backend validates JWT bearers (no more X-User-Context trust). The chat endpoint enforces requireRbacPermission(session, 'dynamic_agent:<agent_id>', 'invoke'). MCP tool calls from a DA runtime carry the user's per-request OBO bearer (not a runtime-cached one).
Why this priority: The earlier audit found this is the single largest gap: anyone authenticated can chat with any custom agent today, MCP tools called by custom agents go out anonymously, and a forged X-User-Context header gives admin-level access if DA is reachable directly. This story closes all five layers in one go.
Independent Test: Three agents seeded: private-eve, team-a-shared, global-public. Each persona attempts view, invoke, manage on each. Result matrix asserted via Playwright (BFF) and pytest (DA backend talked to directly with an injected forged header, which proves the header is no longer trusted).
Acceptance Scenarios:
- Given `eve_dynamic_agent_user` with `agent_user:my-team-agent`, When she POSTs to `/api/v1/chat/stream/start` with `agent_id=my-team-agent`, Then the BFF passes the gate and DA streams a response.
- Given `bob_chat_user` without per-agent role, When he POSTs the same with `agent_id=eve-private`, Then the BFF returns 403 and never opens a stream to DA.
- Given the DA backend is reached directly (bypassing the BFF) with a forged `X-User-Context` header claiming `is_admin=true`, When any endpoint is hit, Then the response is 401 because the header is no longer trusted; only `Authorization: Bearer <jwt>` is honoured.
- Given a custom agent's runtime calls an MCP tool, When the MCP receives the request, Then the `Authorization` header carries a fresh per-request OBO token whose `sub` is the chatting user, not a stale token from an earlier conversation turn.
- Given any DA authorization decision (allow or deny), When it occurs, Then an entry appears in the `authz_decisions` Mongo collection.
User Story 7: Comprehensive automated test matrix exists and runs in CI (Priority: P1)
There is one pass/fail signal per RBAC area, runnable locally and in CI. Adding a new endpoint without a corresponding test entry causes the suite to fail.
Why this priority: Tests are the only mechanism that prevents the gaps we just closed from re-opening.
Independent Test: make test-rbac runs all three layers (Jest, pytest, Playwright) and exits non-zero if any persona-permutation fails. New routes without entries in tests/rbac-matrix.yaml fail the linter.
Acceptance Scenarios:
- Given the post-migration codebase, When `make test-rbac` runs, Then Jest BFF, pytest backend, and Playwright E2E suites all pass with zero skipped RBAC tests.
- Given a developer adds a new BFF route under `/api/admin/*` without a `requireRbacPermission` call, When they run `make test-rbac`, Then the suite fails with a specific message identifying the unprotected route.
- Given a developer adds a new MCP tool without a corresponding scope in `deploy/keycloak/realm-config.json`, When they run `make test-rbac`, Then the suite fails with a specific message identifying the missing scope.
- Given the test fixtures, When the Playwright suite starts, Then it brings up Keycloak, Mongo, UI, supervisor, DA, RAG, and at least one agent MCP by running `docker compose -f docker-compose.dev.yaml [-f docker-compose/docker-compose.e2e.override.yaml] up -d --wait` with `COMPOSE_PROFILES="rbac,caipe-ui,caipe-supervisor,caipe-mongodb,dynamic-agents,rag,all-agents,slack-bot"` (no separate `docker-compose.e2e.yaml`).
- Given a green CI run, When the audit logs are inspected, Then every persona-route pair from `tests/rbac-matrix.yaml` produced a corresponding `authz_decisions` entry.
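The "new route without a matrix entry fails the suite" check reduces to a set difference. The sketch below is one possible shape for it, not the actual `scripts/validate-rbac-matrix.py`. The protected prefixes are taken from FR-001; the function name `unprotected_routes` and the `route` key on each matrix entry are assumptions, and route discovery and YAML loading are left out.

```python
from typing import Dict, Iterable, List

# Prefixes listed in FR-001 of this spec.
PROTECTED_PREFIXES = (
    "/api/admin/",
    "/api/dynamic-agents/",
    "/api/mcp-servers/",
    "/api/teams/",
    "/api/agents/",
)


def unprotected_routes(discovered: Iterable[str],
                       matrix: Iterable[Dict[str, str]]) -> List[str]:
    """Routes under a protected prefix that have no entry in the matrix."""
    covered = {entry["route"] for entry in matrix}  # assumed key name
    return sorted(
        route for route in discovered
        if route.startswith(PROTECTED_PREFIXES) and route not in covered
    )
```

A CI wrapper could print each returned route and exit non-zero when the list is non-empty, which gives the specific, actionable message the scenario asks for.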
User Story 8: docs/docs/security/rbac/ is the canonical, accurate reference (Priority: P2)
After the migration, the RBAC reference under docs/docs/security/rbac/ (index.md, architecture.md, workflows.md, usage.md, file-map.md) accurately describes every component, every gate, every flow, and every file involved. A junior engineer who reads it can locate the code that enforces any given decision in under 5 minutes.
Why this priority: The earlier session-summary already noted the docs are out-of-sync; this is the user-facing manifestation of the migration.
Independent Test: A reviewer reads the four RBAC docs end-to-end and answers a 10-question quiz (e.g., "Where is the dynamic-agent invoke gate enforced?", "Which env var controls the PDP cache TTL?", "What does RESOURCE_ROLE_FALLBACK do when Keycloak is unreachable?"). 9/10 correct = pass.
Acceptance Scenarios:
- Given the post-migration code, When `docs/docs/security/rbac/file-map.md` is checked, Then every authz-relevant file is listed with its current path.
- Given the new components (DA backend JWT middleware, RAG hybrid gate, etc.), When `docs/docs/security/rbac/architecture.md` is read, Then each one has a dedicated section with: purpose, env vars, error responses, file paths.
- Given the post-migration sequence diagram in `docs/docs/security/rbac/workflows.md`, When read end-to-end, Then it shows: browser → BFF (PDP check) → supervisor (JWT validation + OBO mint) → agent (PDP check) → MCP (PDP check) → upstream system, with the audit log written at each gate.
- Given the migration changes the meaning of any env var or removes any legacy gate, When the doc is read, Then the change is called out in a "Migrated from 098 partial implementation" callout.
Edge Cases
- PDP unavailable: gate behaviour depends on per-resource configuration (see Open Question 1). Default for unconfigured resources is deny-all. `admin_ui` is configured to fall back to the `admin` realm role today; that configuration is preserved. Per-resource fallback rules live in `deploy/keycloak/realm-config-extras.json` (a sibling file consumed by both TS and Py middlewares).
- Token expiring mid-request: the supervisor's `httpx_client_factory` re-mints OBO if expiry is within 30s of the call.
- Slack user link revoked mid-session: next command returns the linking instructions; in-flight A2A streams complete (no mid-stream revocation).
- Keycloak resource missing: `requireRbacPermission` returns 503 (not 403) with a clear server log line; tests assert this distinction.
- Per-user OIDC group claim larger than 16 KB: bearer is rejected at the JWKS validation layer (header size check); tests cover this.
- DA runtime cache holding a stale OBO token: the new per-request `httpx_client_factory` for DA's `MultiServerMCPClient` resolves the bearer per-request via `ContextVar`; cached runtimes never carry tokens.
- Two services calling Keycloak's PDP for the same `(token, resource, scope)` simultaneously: both share the same `permissionDecisionCache` row keyed by `sha256(token):resource#scope`; both succeed or both fail; no PDP storm.
- Audit log Mongo write fails: the gate decision still proceeds (don't deny on audit-log failure); a structured warning is emitted with `{decision, error}`.
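The cache-key layout named in the edge cases, `sha256(token):resource#scope`, can be sketched together with a TTL lookup. This is an in-memory illustration only: the class `DecisionCache`, the hex encoding of the digest, and the injectable clock are assumptions, and whatever store actually lets two services share a row is outside the sketch. The 60-second default mirrors `RBAC_CACHE_TTL_SECONDS`.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


def decision_cache_key(token: str, resource: str, scope: str) -> str:
    """Key layout from the spec: sha256(token):resource#scope (hex digest assumed)."""
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return f"{digest}:{resource}#{scope}"


class DecisionCache:
    """Tiny TTL cache of allow/deny decisions; hypothetical, in-memory only."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows: Dict[str, Tuple[bool, float]] = {}

    def put(self, key: str, allowed: bool) -> None:
        self._rows[key] = (allowed, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[bool]:
        row = self._rows.get(key)
        if row is None:
            return None  # miss: caller must ask the PDP
        allowed, expires_at = row
        if self._clock() >= expires_at:
            del self._rows[key]  # expired: treat as a miss
            return None
        return allowed
```

Hashing the token keeps raw bearers out of the cache keys while still making the key unique per caller, resource, and scope.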
Requirements (mandatory)
Functional Requirements
- FR-001: Every BFF route under `/api/admin/*`, `/api/dynamic-agents/*`, `/api/mcp-servers/*`, `/api/teams/*`, `/api/agents/*` MUST gate on `requireRbacPermission(session, '<resource>', '<scope>')`. The full route → `(resource, scope)` mapping MUST live in a single source-of-truth file (`tests/rbac-matrix.yaml`).
- FR-002: Every Python service (supervisor, dynamic_agents backend, RAG server, every agent MCP server, slack bot) MUST validate the bearer against Keycloak's JWKS endpoint with caching (TTL ≥ 5 min) and reject expired or wrong-issuer tokens with HTTP 401.
- FR-003: Every Python service MUST expose a `requireRbacPermission(token, resource, scope)` helper that calls Keycloak's PDP via `urn:ietf:params:oauth:grant-type:uma-ticket` with `response_mode=decision`, with the same caching semantics as the TS implementation.
- FR-004: The `X-User-Context` header consumed by `dynamic_agents/auth/auth.py` MUST be removed in favour of `Authorization: Bearer <jwt>`.
- FR-005: `MultiServerMCPClient` calls from the dynamic-agents runtime MUST source their `Authorization` header from a per-request `ContextVar`-backed factory (parallel to the supervisor's `httpx_client_factory`), never from a runtime-instance attribute.
- FR-006: `deploy/keycloak/realm-config.json` MUST seed the following resources and scopes; the file MUST be CI-validated against the post-migration code (every `requireRbacPermission` call's `(resource, scope)` MUST exist in the realm config):
  - `admin_ui`: `view`, `manage`
  - `dynamic_agent`: `view`, `invoke`, `manage`
  - `mcp_server`: `read`, `manage`
  - `team`: `view`, `manage`
  - `rag`: `ingest`, `retrieve`, `manage`
  - One resource per agent MCP (e.g. `argocd_mcp`, `aws_mcp`, `jira_mcp`, `github_mcp`, `pagerduty_mcp`, `splunk_mcp`, `confluence_mcp`, `webex_mcp`, `slack_mcp`, `komodor_mcp`, `aigateway_mcp`, `backstage_mcp`), each with scopes `read`, `write`
  - `slack`: `use`, `register`
- FR-007: Every gate (TS or Py) MUST emit an audit-log entry to Mongo collection `authz_decisions` with schema `{userId, resource, scope, allowed, reason, source, timestamp}`. Audit-log write failure MUST NOT cause the gate to fail open.
- FR-008: A reusable test fixture MUST stand up a real Keycloak (using `deploy/keycloak/docker-compose.yml`), seed the personas listed in this spec, and expose a TypeScript helper (`tests/fixtures/keycloak.ts`) and a pytest fixture (`tests/conftest.py::keycloak`) that returns a bearer for any persona by name.
- FR-009: A new `make test-rbac` target MUST exist that runs Jest, pytest, and Playwright RBAC suites in sequence and exits non-zero if any sub-suite fails.
- FR-010: A new `tests/rbac-matrix.yaml` MUST list every (route, resource, scope, persona, expected_status) tuple. A linter (`scripts/validate-rbac-matrix.py`) MUST verify that every BFF route under the protected prefixes appears in the matrix.
- FR-011: Slack bot MUST mint per-command OBO tokens via Keycloak token-exchange (`urn:ietf:params:oauth:grant-type:token-exchange`) using the linked `slack_user_id → keycloak_sub` mapping from FR-025 of 098, and use them as the `Authorization` header for every supervisor call.
- FR-012: Every MCP server MUST replace `SharedKeyMiddleware`-based auth with `JwtUserContextMiddleware` + `requireRbacPermission`. The shared-key path MUST be removed (not deprecated-with-warning) in this PR.
- FR-013: The RAG server MUST implement hybrid authorization: Keycloak `rag#ingest` / `rag#retrieve` as the coarse gate, then per-KB filtering via the union of `TeamKbOwnership` (Mongo) and per-KB realm roles (`kb_reader:<id>`, `kb_ingestor:<id>`).
- FR-014: The canonical RBAC reference under `docs/docs/security/rbac/` (`architecture.md`, `workflows.md`, `usage.md`, `file-map.md`) MUST be updated in the same PR to reflect every change: components, env vars, sequence diagrams, file map. The table in `docs/docs/security/rbac/file-map.md` MUST be auto-validated by `scripts/validate-rbac-doc.py` against the actual files referenced from the protected routes.
- FR-015: All migrations and tests MUST land on the existing `prebuild/feat/comprehensive-rbac` branch (PR #1257). No new branches.
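For FR-003, the decision request sent to Keycloak is a form-encoded POST to the realm's token endpoint, with the caller's bearer in the `Authorization` header. The sketch below only builds the URL and body, so it can run without a network. The function name `build_pdp_request`, the example realm URL, and the audience value `caipe-platform` are hypothetical; the form fields follow Keycloak's documented UMA-ticket decision request.

```python
from typing import Tuple
from urllib.parse import urlencode


def build_pdp_request(realm_url: str, audience: str,
                      resource: str, scope: str) -> Tuple[str, str]:
    """Return (url, form_body) for a Keycloak decision request."""
    url = f"{realm_url.rstrip('/')}/protocol/openid-connect/token"
    body = urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:uma-ticket",
        "audience": audience,                 # client that owns the resources
        "permission": f"{resource}#{scope}",  # e.g. admin_ui#view
        "response_mode": "decision",          # ask for a yes/no answer only
    })
    return url, body


# Example values are made up for illustration.
url, body = build_pdp_request(
    "https://keycloak.example/realms/caipe/", "caipe-platform", "admin_ui", "view"
)
```

Keeping request construction separate from the HTTP call makes the helper easy to unit-test and lets the TS and Python implementations be compared field by field.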
Key Entities
- `authz_decisions` (Mongo collection): append-only audit log of every authorization decision. One document per decision. Indexed on `(userId, timestamp)` and `(resource, scope, timestamp)`.
- `tests/rbac-matrix.yaml`: single source of truth for which persona may do what. Drives Jest, pytest, and Playwright tests via fixture loaders. Validated by a CI linter.
- `PersonaToken` fixture: TS + Python helpers that mint a real Keycloak access token for a named persona, used by every test.
- e2e test stack: brought up by reusing `docker-compose.dev.yaml` with `COMPOSE_PROFILES="rbac,caipe-ui,caipe-supervisor,caipe-mongodb,dynamic-agents,rag,all-agents,slack-bot"`. An optional thin overlay `docker-compose/docker-compose.e2e.override.yaml` is layered in only to remap host ports (e.g., Mongo `27017` → `27018`) and inject e2e-only env vars when avoiding collision with a running dev stack. `make test-rbac` and Playwright drive both.
- `KeycloakResourceCatalog`: generated TypeScript constant (output of `scripts/extract-rbac-resources.py`) listing every `(resource, scope)` referenced in code, used at build time to verify realm-config completeness.
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: Zero BFF routes under `/api/admin/*`, `/api/dynamic-agents/*`, `/api/mcp-servers/*`, `/api/teams/*`, `/api/agents/*` use `requireAdmin`, `canViewAdmin`, raw group-claim checks, or any non-Keycloak gate after this PR. Verified by `scripts/validate-rbac-matrix.py`.
- SC-002: Zero Python services validate identity by reading `X-User-Context`. Verified by `rg "X-User-Context" ai_platform_engineering` returning only test fixtures and audit-log lines.
- SC-003: 100% of BFF routes in `tests/rbac-matrix.yaml` have at least one Jest test asserting allow + at least one asserting deny.
- SC-004: 100% of Python services in scope have at least one pytest test asserting JWKS validation, allow, deny, and PDP-unavailable role-fallback.
- SC-005: Playwright E2E suite covers at least 8 canonical user journeys (the 8 user stories above) end-to-end against real Keycloak, and runs in under 10 minutes locally.
- SC-006: Adding a new BFF route to a protected prefix without a matrix entry causes `make test-rbac` to exit non-zero with a specific, actionable error message.
- SC-007: RBAC docs quiz (10 questions auto-generated from `docs/docs/security/rbac/file-map.md` and `architecture.md` component sections) is answerable in under 5 minutes by a reviewer who has not worked on this PR.
- SC-008: Total CI time for `make test-rbac` increases by no more than 4 minutes over the current `make test` + `make caipe-ui-tests` baseline.
Open questions (for the user, before plan)
These do not block writing the spec, but the answers will shape the plan:
- PDP-unavailable behaviour for non-admin scopes: `admin_ui#view` falls back to the `admin` realm role today. For `dynamic_agent#invoke`, `rag#retrieve`, `<agent>_mcp#read`, etc., should the PDP-unavailable fallback be (a) deny-all, (b) realm-role fallback per resource (e.g., `chat_user` for `rag#retrieve`), or (c) configurable per resource? Recommendation: (c) configurable, default deny-all, with a per-resource override list in `realm-config.json` extras.
- Where do the test fixtures live? `tests/` at repo root vs `tests/rbac/` vs colocate with each component (`ui/tests/rbac/`, `ai_platform_engineering/tests/rbac/`)? Recommendation: single repo-root `tests/rbac/` for the matrix and fixtures, with thin shims that import them from each component's existing test runner.
- Realm-config drift detection: should the CI linter be a hard gate (fail PR) or advisory (warn)? Recommendation: hard gate.
- Slack OBO token-exchange enabled by default in dev compose? Exchange requires Keycloak's token-exchange feature, which is gated behind `--features=token-exchange`. The dev compose currently doesn't enable it. Recommendation: enable in `deploy/keycloak/docker-compose.yml` so dev mirrors prod.