Feature Specification: Channel-Derived Team Binding + Personal DM Experience
Feature Branch: main (working directly on main per maintainer instruction)
Created: 2026-05-24
Status: Draft, awaiting review
Supersedes: docs/docs/specs/2026-05-24-personal-dm-experience/spec.md. Its content has been merged into this spec. The standalone DM spec should be archived once this one is approved.
Input: User description (paraphrased from session):
"Remove the Keycloak-stamped active_team JWT claim. Derive the user's team at the consumer from the same MongoDB data the producer already reads. Make DMs, Webex 1:1 spaces, and the Web UI use the same team-union authorization shape. Give DM users a real, agentic, permission-aware experience: pick their own default agent, list what they can use, and override per thread. Make the whole stack explainable in six sentences."
One-sentence summary
Stop encoding the user's team in the OBO JWT; derive it at every consumer from MongoDB (the source of truth that already exists); use the same authorization gate for Slack channels, Slack DMs, Webex spaces, Webex 1:1 spaces, and Web UI chat; and use the freed-up DM surface to let users pick their own preferred agent and steer per thread.
Problem Context
There are two interlocking problems, and they share a single fix.
Problem 1 - active_team in the JWT is the wrong layer for team binding
Today, the user's team affiliation is encoded twice for Slack/Webex bot interactions:
- In MongoDB: channel_team_mappings[channel_id] → team_slug (Slack) and space_team_mappings[space_id] → team_slug (Webex). Source of truth.
- In the Keycloak OBO JWT: a team-<slug> client scope binds an active_team protocol mapper to caipe-platform; during RFC 8693 token exchange the mapper stamps active_team=<slug> into the issued token.
The bot reads the Mongo row, then asks Keycloak to round-trip it back via a JWT claim. The RAG server reads the claim, then does a Mongo lookup keyed off team_id to find KB ownership. The actors on the path all have MongoDB access, so the JWT round-trip adds no information, only fragility.
The fragility this creates:
- Per-team Keycloak client scopes provisioned at admin time (ensureTeamClientScope).
- An audience-default reconciler (selectAgentGatewayActiveTeamScope) to keep singularity.
- KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG env var so the reconciler knows which team to pin.
- A cardinality invariant + heal button to detect/repair drift (audience.<client>.single_team_default).
- A team-personal scope and __personal__ sentinel for DMs that structurally cannot be bound as default on the same audience (the slot is taken), so DMs intermittently fail with "team session unavailable."
- A mismatch check in the bot's obo_exchange.py that rejects tokens where the returned active_team differs from what was requested.
- An advisory invariant acknowledging DMs are broken.
- ~10 Keycloak invariant rows in the Admin UI, with tooltips that have to explain RFC 8693 default-scope semantics to operators.
The Web UI surface today doesn't use active_team at all: requireAgentUsePermission only probes user:<sub> can_use agent:<id>. A user with team-mediated access but no direct grants is denied in the Web UI even though they can use the same agents from Slack channels. That's an existing inconsistency this spec resolves.
Problem 2 - DMs are an inert, deployment-default-only surface
Even if Problem 1 didn't exist, today's DM experience is weak:
- DMs bypass channel ReBAC (no channel-team mapping exists for a DM).
- Every user is routed to a single agent picked by SLACK_INTEGRATION_DM_AGENT_ID (or the Webex equivalent). A user with access to ten agents still gets the one the operator chose at deploy time.
- The user cannot personalize ("the GitHub agent is my default in DMs").
- The user cannot steer ("for this question I want the Splunk agent, not my default").
- The user cannot discover ("what can I even talk to in here?") without leaving the chat.
Why one spec instead of two
The two problems share infrastructure:
- Both need the BFF PDP to accept user_subject=user:<sub> and resolve "which agents/teams is this user allowed to use" by walking OpenFGA.
- Both make DMs go from "structurally broken" (current state) to "first-class agentic surface."
- Both touch the same code paths in the bots (obo_exchange.py, _rbac_enrich_context, the DM short-circuit).
Shipping them together means the operator and user both get a coherent jump rather than two confusing interim states ("DMs work now but you can't pick an agent" → "now you can pick an agent but you have to set KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG").
Goal
Make the team the responsibility of the data layer (MongoDB + OpenFGA), not the identity token. Keycloak issues an OBO token that proves who is acting and for whom, and nothing more. Every consumer that needs the team derives it from the request's channel context (or, where no channel exists, from the user's team memberships in OpenFGA). DMs and the Web UI become first-class surfaces with per-user agent selection, in-DM discovery, and per-thread override, all backed by the same authorization gate as channels.
User Scenarios & Testing (mandatory)
User Story 1 - Slack channel message (Priority: P1)
A user types a message in a Slack channel mapped to team platform. They are a member of platform. The team has team:platform#member can_use agent:incident-responder. They expect the bot to respond.
Why this priority: This is the dominant Slack flow and must keep working without operator intervention. It is also the easiest scenario to verify regression-free behavior.
Independent Test: Run the existing Slack integration tests against the new code path. With no KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG env var and no team-platform Keycloak scope, the bot must still allow the user to talk to the agent.
Acceptance Scenarios:
- Given a Slack channel mapped to team platform, a user who is a member of platform, and an OpenFGA tuple team:platform#member can_use agent:incident-responder, When the user posts a message routed to that agent, Then the bot calls the BFF PDP with user_subject=team:platform#member, the PDP returns allow, the bot forwards an OBO token with no active_team claim, and the agent responds.
- Given the same setup but the OpenFGA tuple is missing, When the user posts a message, Then the bot replies with a "your team does not have access to this agent" message and does not forward to Dynamic Agents.
- Given a Slack channel with no Mongo mapping, When any user posts a message, Then the bot replies "this channel isn't assigned to a CAIPE team yet" and does not forward.
User Story 2 - Slack DM: personalized, agentic, permission-aware (Priority: P1)
A user opens a DM with the bot. They are a member of multiple teams with access to multiple agents. They expect:
- The bot uses their saved DM default agent if they've set one (and they still have permission).
- If they haven't set a preference, the bot falls back to the deployment default DM agent (existing behavior).
- They can list, in-DM, what agents they can use (/list or equivalent).
- They can steer per thread to a specific agent (/use <agent> or equivalent), with the override scoped to that thread only.
- Authorization always re-validates: a saved preference does not bypass can_use checks.
Why this priority: DMs are broken today (structural Keycloak limitation) AND are deployment-default-only. This story is the load-bearing fix for both problems simultaneously.
Independent Test: Send the bot a DM. With no Keycloak team-personal scope, no __personal__ sentinel, and the user having set Agent X as their DM default (in Web UI Settings), the next DM message must route to Agent X. Sending /use AgentY and then a follow-up message in the same thread must route the follow-up to AgentY. A message in a new thread must route to Agent X again, because overrides are scoped to the thread where they were set and have no time-based expiry (FR-032).
Acceptance Scenarios (consolidating Problem 1 + Problem 2):
Authorization (Problem 1 - the gate works without active_team)
- Given a Slack DM with no entry in channel_team_mappings, a user who is a member of team:platform, and team:platform#member can_use agent:X, When the user DMs the bot, Then the bot calls the BFF PDP with user_subject=user:<sub>, the PDP resolves the user's team memberships via OpenFGA list_objects, finds team:platform, probes team:platform#member can_use agent:X (allow), and returns allow.
- Given the same setup but the user is removed from team:platform, When the user DMs the bot, Then the PDP iterates the user's teams (now empty), no direct grant exists, and returns deny. The bot replies that the user has no agents they can use.
- Given an admin has additionally granted user:<sub> can_use agent:Y directly (in addition to team grants), When the user DMs the bot to use agent Y, Then the PDP short-circuits on the direct grant path and allows without iterating team union.
Agent selection (Problem 2 - DMs are first-class)
- Given a user who has not set a DM default, When they DM the bot, Then the bot resolves the deployment-configured dm_agent_id (falling back to default_agent_id) as the target agent, runs the authorization check from scenarios 1-3 against that agent, and dispatches if allowed.
- Given a user who has saved Agent X as their DM default via Web UI Settings, When they DM the bot, Then the bot resolves X as the target agent, runs the authorization check from scenarios 1-3 against X, and dispatches if allowed.
- Given a user whose saved DM default is Agent X but whose can_use permission on X has been revoked, When they DM the bot, Then the bot falls through to the deployment default DM agent, emits one ephemeral notice explaining the fallback, and dispatches to the deployment default if allowed.
- Given a user who issues /list in a DM, When the command runs, Then the bot returns (ephemerally) the list of agents that user has can_use on (direct grants OR any team grants), with human-readable names and short descriptions, paginated if >25.
- Given a user who issues /use AgentZ in a DM, When they then send a follow-up message in the same thread, Then the follow-up is dispatched to AgentZ. Subsequent messages in the same thread continue to AgentZ until the user explicitly changes the agent (via /use <other> or /use default) or the bot restarts.
- Given a user who issues /use AgentW and does not have can_use on AgentW, When the override command runs, Then the bot refuses with a clear message ("you don't have access to AgentW"), the user's existing default is preserved, and no override is stored.
- Given a user who issues /help in a DM, When the command runs, Then the bot returns (ephemerally) help text describing the three commands and their semantics.
Failure modes
- Given the user-preference store is temporarily unavailable, When a user DMs the bot, Then the bot falls back to the deployment default DM agent and the DM still succeeds (graceful degradation). One operational log line is emitted.
- Given OpenFGA is temporarily unavailable during the PDP call, When the user DMs the bot, Then the PDP returns pdp_unavailable and the bot denies with a "try again" message. Same fail-closed behavior as today.
User Story 3 - Webex space and Webex 1:1 with personal DM experience (Priority: P1)
Webex is structurally symmetric with Slack. Group spaces are channels; 1:1 spaces are DMs. The personal-DM experience (saved default, list, override, help) applies to Webex 1:1 spaces too. Webex has no native slash commands, so commands are issued as plain text (e.g. @bot list, @bot use github).
Why this priority: Shipping Slack and Webex together prevents asymmetric architecture across the two integrations. If we ship only one, operators will perpetually be asked "when is Webex going to work the same way?"
Independent Test: Run the existing Webex integration tests against the new code path. Webex 1:1 interactions for a user with a saved Webex DM default must route to the saved agent and respect overrides exactly like Slack.
Acceptance Scenarios: Identical to Story 2 with Webex transport. Webex spaces (group) follow Story 1's pattern; Webex 1:1 spaces follow Story 2's pattern.
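Because Webex has no native slash commands, the bot has to decide for each message whether it is a command or ordinary chat. A minimal sketch of that decision, assuming a hypothetical parse_dm_command helper and a literal leading @mention token (the real bot may receive mentions in a different form):

```python
import re
from typing import NamedTuple, Optional


class DmCommand(NamedTuple):
    name: str                # "list", "use", or "help"
    argument: Optional[str]  # agent name or "default" for "use"; None otherwise


# Assumption: a mention, when present, is a single leading "@token".
# In a 1:1 space the command may be the entire message body instead.
_MENTION = re.compile(r"^@\S+\s+")


def parse_dm_command(text: str) -> Optional[DmCommand]:
    """Return the parsed command, or None when the message is ordinary chat."""
    body = _MENTION.sub("", text.strip(), count=1)
    parts = body.split()
    if not parts:
        return None
    head = parts[0].lower()
    # "list" and "help" take no arguments; anything longer is treated as chat.
    if head in ("list", "help") and len(parts) == 1:
        return DmCommand(head, None)
    # "use" takes exactly one argument: an agent name or the token "default".
    if head == "use" and len(parts) == 2:
        return DmCommand("use", parts[1])
    return None
```

Requiring an exact word count keeps a sentence such as "list all the pods in prod" from being swallowed as a command; it falls through to the agent as normal chat.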
User Story 4 - Web UI chat with team-mediated access (Priority: P1)
A user logs into the Web UI and starts a chat with an agent they have team-mediated access to (but no direct user grant).
Why this priority: Today the Web UI uses requireAgentUsePermission which only probes user-direct grants. A user with only team-mediated access is denied. This is an existing inconsistency: they can use the same agent from a Slack channel but not from the Web UI. The spec closes that gap.
Independent Test: With only team:platform#member can_use agent:X and no direct user grant, a member of team:platform should be able to start a Web UI chat with agent X. Sending a chat with an agent the user does not have access to (via any path) must still be denied.
Acceptance Scenarios:
- Given a Web UI user who is a member of team:platform, and a tuple team:platform#member can_use agent:X (no direct user grant), When they start a chat with agent X, Then requireAgentUsePermission resolves the user's team memberships, probes the team grants, finds allow, and proceeds to forward the chat.
- Given the same user is removed from all teams and has no direct grants, When they try to chat with agent X, Then the gate returns deny.
- Given an admin has granted user:<sub> can_use agent:Y directly, When the user chats with agent Y (regardless of team membership), Then the gate allows via the direct-grant path.
(Web UI does not have a "default DM agent" concept of its own; the Web UI's agent picker is the existing UI surface and is unrelated.)
User Story 5 - Admin sets DM default agent picker for users (Priority: P2)
A user opens the Web UI Settings panel. They see a "DM agent" section with a picker. The picker lists only agents the user has can_use on. The user picks one and saves; their next DM to the Slack or Webex bot routes to that agent.
Why this priority: Without this surface, users can't personalize DMs at all. Tied to FR-020 / Story 2 scenario 5.
Independent Test: Open Settings, see picker, see only authorized agents, save one, DM the bot, observe the saved agent is used. Clear the preference, DM again, observe deployment default.
Acceptance Scenarios:
- Given a user who has can_use on 3 agents, When they open the Settings DM-agent picker, Then they see exactly those 3 agents (with names and descriptions). No other agents.
- Given a user with 0 accessible agents, When they open the picker, Then they see a helpful empty-state message ("you don't have access to any agents yet; ask an admin to grant your team access").
- Given a user who has set a default, When they look at the picker, Then the current default is visibly highlighted. A "clear preference" affordance reverts to deployment default.
- Given the user changes their default, When they DM the bot in a NEW thread (or after the in-process preference cache TTL expires, default 60s), Then the new default is honored without a bot restart.
User Story 6 - Admin creates a team (Priority: P2)
An admin creates a new team via the Admin UI. They expect the team to be usable immediately for channel/space mappings and grants. They expect to touch no Keycloak configuration.
Why this priority: Dominant admin flow. Today it triggers Keycloak scope provisioning that can fail, drift, and require an env var.
Acceptance Scenarios:
- Given an admin with admin_ui:admin, When they POST /api/admin/teams { name: "Platform Engineering" }, Then the BFF inserts a teams row in Mongo, writes team object tuples to OpenFGA, and returns the team. No Keycloak Admin API calls.
- Given the team exists, When the admin maps a Slack channel to it via the Admin UI, Then the bot's next channel→team Mongo lookup returns the new mapping and members of the team can use mapped agents in that channel.
User Story 7 - Operator's Keycloak panel after migration (Priority: P2)
An operator opens Admin UI → Security & Policy → Keycloak. They expect to see only the rows actually about Keycloak.
Acceptance Scenarios:
- Given a post-migration deployment, When the operator opens the Keycloak panel, Then they see at most these invariant categories: realm-exists, bot-clients-registered, token-exchange-policy-strict-shape, bootstrap-admins-resolved. No audience.*.single_team_default. No team_personal.dm_mode_known_limitation. No "Reconcile active-team scope" button.
- Given the operator removes KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG from Helm values, When they redeploy, Then no warnings or invariants fire about that env var.
Edge Cases
- User belongs to many teams (50+), all with access to the same agent: PDP returns allow on first match; no double-counting. Performance: see Open Question on list_objects perf.
- User in zero teams, no direct grants, DMs the bot: PDP returns deny; bot responds "no agents available" with guidance.
- Channel mapping changes mid-conversation: bot looks up channel→team on every message; mid-conversation re-mapping takes effect immediately.
- Tokens in flight from before the migration (legacy active_team claim present): consumers prefer channel-derived team but accept the claim as fallback during the dual-read window.
- RAG server receives a request with neither claim nor channel_id: fail closed.
- DM thread override active, then bot restart: thread overrides do not persist across restarts (FR-032); next message reverts to saved default.
- DM thread override and user wants to revert mid-thread: user issues /use default (Slack) or use default (Webex), which clears any active thread override AND clears the saved preference, reverting to the deployment default agent. See FR-029a and FR-033.
- DM thread override and user wants a different agent: user issues /use <other-agent>, which replaces the active thread override.
- User types /use github-agent but the agent name has a typo: bot replies with a friendly correction ("did you mean github?"); it does not silently route to default and does not silently fail.
- Two users DM the bot simultaneously: each has independent preference / override state; no cross-contamination.
- Web UI user with team-mediated agent access: with this spec, they can chat with team-mediated agents. Without the spec they could not. This is a behavior broadening.
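The typo edge case above ("did you mean github?") can be handled with fuzzy matching against the agents the user is allowed to use. The sketch below uses the standard library's difflib; the suggest_agent name and the 0.6 similarity cutoff are illustrative assumptions, not part of the spec.

```python
import difflib
from typing import Optional


def suggest_agent(requested: str, accessible: list[str]) -> Optional[str]:
    """Return the closest accessible agent name, or None if nothing is similar.

    Only agents the user already has can_use on should be passed in, so the
    suggestion never reveals agents the user cannot see.
    """
    # Compare case-insensitively but return the name in its original casing.
    by_lower = {name.lower(): name for name in accessible}
    matches = difflib.get_close_matches(
        requested.lower(), list(by_lower), n=1, cutoff=0.6
    )
    return by_lower[matches[0]] if matches else None
```

Returning None for a poor match lets the bot fall back to a plain "unknown agent" refusal instead of guessing.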
Requirements (mandatory)
Functional Requirements
Authorization & gating (preserved in shape, simplified in implementation)
- FR-001: System MUST continue to enforce that a Slack channel mapped to a team only allows users who are members of that team to talk to mapped agents from that channel.
- FR-002: System MUST continue to enforce that a team must have an OpenFGA team#member can_use agent tuple for its members to use an agent.
- FR-003: System MUST continue to enforce that an unmapped Slack channel or Webex space cannot reach Dynamic Agents; the bot edge denies before forwarding.
- FR-004: AgentGateway MCP-tool gating per user can_invoke tool is unchanged.
DM, Web UI, and Webex 1:1 authorization (new shape - direct OR team union)
- FR-005: For Slack DMs (no channel_team_mappings row), the bot MUST call the BFF PDP with user_subject=user:<sub>. No __personal__ sentinel.
- FR-006: For Webex 1:1 spaces (no space_team_mappings row), the bot MUST call the BFF PDP with user_subject=user:<sub>.
- FR-007: For Web UI chat requests, requireAgentUsePermission MUST allow if EITHER user:<sub> can_use agent (direct grant) OR any team the user belongs to has team:<slug>#member can_use agent.
- FR-008: BFF PDP MUST resolve "team union" by listing user team memberships via OpenFGA list_objects(user, member, team), then probing each team:<slug>#member can_use agent:<id> until one returns allow or all are exhausted.
- FR-009: BFF PDP MUST short-circuit on direct grant (user:<sub> can_use agent:<id>) before iterating team union.
- FR-010: When the PDP returns a decision, the audit log MUST include a team_resolution_path field: direct_user_grant | team_union:<slug> | channel_grant_and_team | denied.
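The direct-grant short-circuit (FR-009) followed by the team-union walk (FR-008) can be sketched as below. This is an illustration only: check and list_teams are hypothetical callables standing in for the OpenFGA check and list_objects calls, and the in-memory tuples exist purely so the sketch runs on its own.

```python
from typing import Callable, Iterable


def resolve_agent_access(
    user_sub: str,
    agent_id: str,
    check: Callable[[str, str, str], bool],
    list_teams: Callable[[str], Iterable[str]],
) -> tuple[bool, str]:
    """Return (allowed, team_resolution_path) for a user:<sub> subject."""
    agent = f"agent:{agent_id}"
    # FR-009: a direct grant wins before any team is listed.
    if check(f"user:{user_sub}", "can_use", agent):
        return True, "direct_user_grant"
    # FR-008: probe each team the user belongs to; stop at the first allow.
    for slug in list_teams(user_sub):
        if check(f"team:{slug}#member", "can_use", agent):
            return True, f"team_union:{slug}"
    return False, "denied"


# Hypothetical in-memory stand-ins for OpenFGA, for demonstration only.
TUPLES = {
    ("team:platform#member", "can_use", "agent:X"),
    ("user:alice", "can_use", "agent:Y"),
}
MEMBERSHIPS = {"alice": ["platform"], "bob": []}


def demo_check(subject: str, relation: str, obj: str) -> bool:
    return (subject, relation, obj) in TUPLES


def demo_list_teams(user_sub: str) -> list[str]:
    return MEMBERSHIPS.get(user_sub, [])
```

The second element of the result maps onto the team_resolution_path audit field from FR-010. The mapped-channel case (user_subject=team:<slug>#member) is a single probe and is not shown here.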
Token shape and Keycloak surface
- FR-011: Slack bot, Webex bot, and Web UI MUST NOT request scope=team-<slug> or scope=team-personal on token exchange. Only scope=openid plus audience.
- FR-012: Consumers (RAG, Dynamic Agents) MUST NOT require active_team claim. Dual-read window: they MAY read it if present; after Phase 3 the readers and the issuer-side machinery are removed.
- FR-013: BFF MUST NOT call ensureTeamClientScope when creating a team in Phase 2 onward. Phase 3 deletes the function.
- FR-014: selectAgentGatewayActiveTeamScope, ensureTeamClientScope, the KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG env var, the POST /api/admin/keycloak/active-team-scope BFF route, and the heal UI button MUST be removed in Phase 3.
- FR-015: The audience.<client>.single_team_default invariant and team_personal.dm_mode_known_limitation advisory MUST be removed in Phase 3.
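For FR-011, the token-exchange request body shrinks to the standard RFC 8693 parameters with scope=openid and an audience. A hedged sketch of building that form follows; the build_obo_exchange_form name is hypothetical, client authentication and the HTTP call are omitted, and whether the bots send subject_token_type in exactly this way is an assumption.

```python
from urllib.parse import urlencode

# Grant type and token type identifiers defined by RFC 8693.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_obo_exchange_form(subject_token: str, audience: str) -> dict[str, str]:
    """Build the form fields for an OBO exchange with no team scope."""
    return {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,
        # FR-011: only openid; no team-<slug> or team-personal scope.
        "scope": "openid",
    }


# Example: the body a bot would POST to the realm's token endpoint.
form = build_obo_exchange_form("user-access-token", "caipe-platform")
encoded_body = urlencode(form)
```

Since no team scope is requested, the issued token carries no active_team claim, which is the state Phase 2 relies on.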
Channel context propagation
- FR-016: Bots MUST include the originating channel_id (and workspace_id) in the chat request envelope forwarded to Dynamic Agents. Web UI omits both.
- FR-017: RAG server MUST derive team_id from channel_id via the existing channel_team_mappings collection when channel context is present. When absent (Web UI, DM, Webex 1:1), RAG MUST evaluate team-union from the user's OpenFGA memberships for KB filtering, matching the gate policy.
- FR-018: Absence of a channel_team_mappings row is the DM signal. The string sentinel __personal__ MUST be removed.
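One possible reading of FR-017 and FR-018 on the RAG side is sketched below. The derive_team_scope name, the plain-dict stand-in for the channel_team_mappings collection, and the list_user_teams callable are all assumptions made for illustration; the dual-read fallback to a legacy active_team claim is not shown.

```python
from typing import Callable, Iterable, Mapping, Optional


def derive_team_scope(
    channel_id: Optional[str],
    channel_team_mappings: Mapping[str, str],
    user_sub: str,
    list_user_teams: Callable[[str], Iterable[str]],
) -> list[str]:
    """Return the team slugs whose knowledge bases this request may see."""
    if channel_id is not None:
        slug = channel_team_mappings.get(channel_id)
        if slug is not None:
            # Mapped channel: exactly one team, taken from the mapping row.
            return [slug]
    # No channel context, or no mapping row (the DM signal in FR-018):
    # fall back to the union of the user's own team memberships.
    return sorted(list_user_teams(user_sub))
```

An empty result means the user has no team memberships, so KB filtering leaves nothing visible, which keeps the behavior fail-closed.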
Per-user DM agent preference (new - replaces personal-DM spec)
- FR-019: System MUST persist, per user, a "DM default agent" preference (dm_default_agent_id) that survives bot restarts, Web UI restarts, and user sessions. Stored in MongoDB as part of an existing user-preferences collection (no new collection if one exists; create one named user_preferences if not).
- FR-020: Web UI Settings MUST expose a dedicated, discoverable surface for the user to view and update their DM default agent.
- FR-021: The Web UI picker MUST list only agents the user has can_use on (via direct grants or any team grants).
- FR-022: The Web UI MUST show what the deployment default is (the agent used when no preference is set) so "clear preference" is meaningful.
- FR-023: Slack bot and Webex bot MUST, on every DM message, resolve the dispatch agent in this order:
  1. The user's active thread override (Phase 2), if any and if user has can_use.
  2. The user's saved DM default (dm_default_agent_id), if any and if user has can_use.
  3. The deployment-configured dm_agent_id, if user has can_use.
  4. The deployment-configured default_agent_id, if user has can_use.
  5. Deny with "no agents available" guidance.
- FR-024: The bot MUST re-verify can_use on the resolved agent at dispatch time. A saved preference does not bypass authorization.
- FR-025: If a saved preference is bypassed because permission was revoked or the agent no longer exists, the user MUST receive a single ephemeral notice in that DM explaining the fallback. The notice MUST NOT block the request.
- FR-026: The bot's user-preference lookup MUST be cached briefly per user inside the process (default 60s TTL) for latency. A changed preference MUST take effect within one TTL period.
- FR-027: If the preference-storage backend is unavailable, the bot MUST gracefully fall back to the deployment default. One operational log line per outage window.
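The FR-023 resolution order, with the FR-024 re-check applied to every candidate, can be sketched as a single pass over an ordered candidate list. The function name and the can_use callable are hypothetical; the source labels reuse the values listed in FR-041.

```python
from typing import Callable, Optional


def resolve_dispatch_agent(
    thread_override: Optional[str],
    saved_default: Optional[str],
    dm_agent_id: Optional[str],
    default_agent_id: Optional[str],
    can_use: Callable[[str], bool],
) -> tuple[Optional[str], str, bool]:
    """Return (agent_id, source, saved_preference_bypassed).

    The third element tells the caller whether to send the single
    ephemeral fallback notice described in FR-025.
    """
    candidates = [
        ("thread_override", thread_override),
        ("saved_preference", saved_default),
        ("deployment_dm_default", dm_agent_id),
        ("deployment_default", default_agent_id),
    ]
    saved_bypassed = False
    for source, agent_id in candidates:
        if agent_id is None:
            continue
        # FR-024: authorization is re-verified for whichever agent is picked.
        if can_use(agent_id):
            return agent_id, source, saved_bypassed
        if source == "saved_preference":
            saved_bypassed = True
    return None, "denied", saved_bypassed
```

For example, with a saved default X that has been revoked and a deployment DM agent D that is still allowed, the call returns D with source deployment_dm_default and the bypass flag set.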
In-DM commands (new - from personal-DM spec)
- FR-028: Slack bot MUST expose /list (or /caipe-list) returning ephemerally the agents the user has can_use on. Names, short descriptions, paginated for >25 entries.
- FR-029: Slack bot MUST expose /use <agent> (or /caipe-use <agent>) setting an explicit thread-scoped override.
- FR-029a: Slack bot MUST accept the literal token default as the argument to /use (i.e. /use default or /caipe-use default). This single command MUST (a) clear any active thread override for the current thread, and (b) clear the user's saved DM preference (dm_default_agent_id := null in user_preferences). Subsequent messages then route via the standard chain (FR-023), which falls through to the deployment default. The bot MUST acknowledge with an ephemeral confirmation naming the deployment-default agent the user will now be talking to.
- FR-030: Slack bot MUST expose /help (or /caipe-help) showing available commands, including /use default.
- FR-031: Webex bot MUST provide equivalent functionality. Since Webex has no native slash commands, the bot MUST accept plain text commands list, use <agent>, use default, help directed to it (after the @bot mention or as the entire 1:1 message body), and MAY also support natural-language detection of the same intents.
- FR-032: Thread overrides MUST be scoped to the current DM thread, MUST NOT cross threads, MUST NOT affect saved preference (except via the explicit /use default command, FR-029a), MUST NOT persist across bot restarts. Thread overrides do NOT have a time-based expiry: an override remains in effect for the lifetime of the bot process (or until the user explicitly changes it via another /use <agent> or /use default command in the same thread).
- FR-033: If a user issues /use <agent> for an agent they don't have can_use on, the bot MUST refuse with a clear message, preserve existing default, store no override. /use default MUST always succeed (no permission check; clearing your own preference is always allowed).
- FR-034: All command output MUST be ephemeral where the platform supports it (Slack: response_type=ephemeral; Webex: direct reply to the issuer only).
- FR-035: Commands MUST be rate-limited per user (default: 5 commands per 30 seconds).
- FR-036: /list MUST resolve permissions at command time, not from a stale cache.
- FR-037: User-visible bot messages (notices, refusals, help, fallback explanations) MUST use the existing user_messages.py templating infrastructure, not ad-hoc literal strings.
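The per-user limit in FR-035 (5 commands per 30 seconds by default) fits a small in-process sliding-window limiter. The sketch below is one way to do it; the CommandRateLimiter class and its injectable clock are assumptions made so the behavior can be exercised without real waiting.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class CommandRateLimiter:
    """Sliding-window limiter: at most max_commands per window, per user."""

    def __init__(
        self,
        max_commands: int = 5,
        window_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_commands
        self._window = window_seconds
        self._clock = clock
        # One deque of accepted-command timestamps per user.
        self._events: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        events = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self._window:
            events.popleft()
        if len(events) >= self._max:
            return False
        events.append(now)
        return True
```

Rejected commands are not recorded, so a user who keeps retrying while limited does not extend their own lockout.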
Migration safety
- FR-038: Phase 1 (additive, dual-read) MUST land without breaking any existing Slack/Webex/Web UI flow.
- FR-039: Phase 2 (flip default) MUST be deployable independently of Phase 3.
- FR-040: Phase 3 (demolition) MUST be reversible up until Keycloak team-<slug> scopes are deleted. Scope deletion is the point-of-no-return.
Audit & observability
- FR-041: Every bot DM dispatch decision MUST log: user_id, resolved agent_id, source (saved_preference | thread_override | deployment_dm_default | deployment_default | denied), team_resolution_path (from FR-010), and authorization outcome.
- FR-042: Dynamic Agents MUST drop the active_team log field once Phase 2 is deployed.
Key Entities
Existing (no schema change)
- Channel/Space mapping: channel_team_mappings, space_team_mappings; source of truth for room→team.
- Team membership: OpenFGA team:<slug>#member tuples.
- Agent grants: OpenFGA tuples <subject> can_use agent:<id>.
New
- User Preference: user_preferences (or extension of existing collection if one exists). Key fields: user_id (Keycloak sub), dm_default_agent_id (nullable), updated_at. Lifecycle: created on first save, updated on user change, cleared by user via "clear preference" action. Deleted automatically if user is purged (out of scope for this spec).
- DM Thread Override: In-process map keyed by (workspace_id, channel_id, user_id, thread_ts) for Slack and (person_id, room_id) for Webex 1:1, valued by agent_id. Bounded scope; never persisted; no time-based expiry. An entry lives until the user explicitly changes the agent (/use <other> or /use default) or until the bot process restarts.
- PDP request envelope (changed): /access-check accepts user_subject as either team:<slug>#member (mapped channel/space) or user:<sub> (DM/1:1/Web UI). PDP behavior depends on subject type.
- Chat request envelope (new optional field): channel_id and workspace_id present for bot-originated requests; absent for Web UI.
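The DM Thread Override entity above is just an in-process map, which a few lines can illustrate. The ThreadOverrides class name is hypothetical; the key shapes are the tuples named in the entity description.

```python
from typing import Hashable, Optional

# Key shapes from the entity description:
#   Slack:     (workspace_id, channel_id, user_id, thread_ts)
#   Webex 1:1: (person_id, room_id)
ThreadKey = tuple[Hashable, ...]


class ThreadOverrides:
    """Process-local override map: never persisted, no time-based expiry."""

    def __init__(self) -> None:
        self._by_thread: dict[ThreadKey, str] = {}

    def set(self, key: ThreadKey, agent_id: str) -> None:
        # "/use <agent>" replaces any existing override for the same thread.
        self._by_thread[key] = agent_id

    def get(self, key: ThreadKey) -> Optional[str]:
        return self._by_thread.get(key)

    def clear(self, key: ThreadKey) -> None:
        # "/use default" removes the override; a missing key is not an error.
        self._by_thread.pop(key, None)
```

Because the user id is part of the key, two users in the same workspace never share an override, and a bot restart empties the map by construction.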
Success Criteria (mandatory)
Authorization simplification (Problem 1)
- SC-001: Slack DMs work end-to-end without KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG set and without any team-* scope bound as default on caipe-platform. Verified by tests/rbac/end_to_end/test_slack_dm.sh (new).
- SC-002: Webex 1:1 spaces work end-to-end under the same conditions. Verified by tests/rbac/end_to_end/test_webex_1to1.sh (new).
- SC-003: Web UI chat works for a user with team-mediated access only. Verified by new Jest integration test.
- SC-004: After Phase 3, adding a new team makes zero Keycloak Admin API calls.
- SC-005: After Phase 3, Keycloak panel has at most 4 invariant rows.
- SC-006: After Phase 3, KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG is removed from Helm chart and docker-compose.dev.yaml.
- SC-007: After Phase 3, rg "active_team" ai_platform_engineering ui returns no production matches.
- SC-008: All three surfaces use the same PDP code path with user_subject distinguishing the cases.
Personal DM experience (Problem 2)
- SC-009: A user can set a DM default agent in Web UI Settings and see the result in a follow-up DM within 60 seconds, no bot restart.
- SC-010: 100% of bot DM dispatches honor the user's saved preference when authorized.
- SC-011: 100% of dispatches that would have used a no-longer-authorized preference fall back gracefully with exactly one ephemeral notice per fallback occurrence (per thread, not per message).
- SC-012: /list returns results within 2 seconds at p95 for users with up to 50 accessible agents.
- SC-013: After a thread override, 100% of subsequent messages in the same thread route to the chosen agent until the user explicitly changes the agent (/use <other> or /use default) or the bot restarts.
- SC-014: 0 cases of cross-user state leakage in DM preference, override, or accessible-agents lookup.
- SC-015: When the preference-storage backend is unavailable, DM messages still succeed (via deployment default) at ≥99% over the outage window.
- SC-016: /help and /list p95 latency under 1 second in steady state.
Combined
- SC-017: End-to-end RBAC test suite runtime stays within 10% of pre-spec baseline.
- SC-018: RBAC docs in docs/docs/security/rbac/ have the active_team, team-<slug>, team-personal narrative removed. The "How RBAC works" page is one page shorter and the architecture diagram shows the simplified 5-box model.
- SC-019: A new operator can read docs/docs/security/rbac/index.md and understand the entire authorization model in under 10 minutes (subjectively measured; we will validate with one external reader).
Phasing and Independent Deliverability
Phase 1 - Additive dual-read (foundation)
Goal: Both code paths coexist. Nothing breaks. No user-visible change.
- BFF PDP requireAgentUsePermission adds a team_union fallback when no direct grant exists.
- BFF PDP /access-check already accepts user_subject=user:<sub>; extend it to do team-union if no direct grant.
- Bots include channel_id + workspace_id in the chat envelope forwarded to Dynamic Agents (additive field; safely ignored).
- RAG server: if active_team claim is missing AND channel_id is present in request context, derive team_id from channel_team_mappings and proceed.
- Add user_preferences MongoDB collection + read API used by the bot (not yet wired into dispatch).
- Add Web UI Settings panel for DM default agent (writes preference; not yet honored by bot, which is Phase 2 dispatch).
FRs: FR-001 through FR-010, FR-016 through FR-021, FR-038.
Phase 2 - Flip default + DM personalization live
Goal: New code paths are used by default. Old paths still work (for in-flight tokens). Personal DM experience is live for users.
- Bots stop requesting scope=team-<slug> / scope=team-personal on token exchange.
- Bots stop the active_team mismatch check in obo_exchange.py.
- Bots stop using PERSONAL_ACTIVE_TEAM sentinel; DMs use user_subject=user:<sub> and call the PDP.
- Bots honor the saved DM default agent preference (FR-023 dispatch chain).
- Slack slash commands /list, /use, /help go live (manifest updated).
- Webex equivalent commands go live.
- Web UI requireAgentUsePermission honors team-union grant (today restricted to direct only).
- Keycloak still has team-* scopes provisioned but no longer issues active_team claims (because nothing requests the scopes).
- Dynamic Agents drop the active_team log line.
FRs: FR-011, FR-022 through FR-037, FR-039, FR-041, FR-042.
Phase 3: Demolition
Goal: Delete the now-unused Keycloak surface and the supporting code.
- BFF: delete `ensureTeamClientScope`, `selectAgentGatewayActiveTeamScope`, the `POST /api/admin/keycloak/active-team-scope` route, the cardinality invariant, the team-personal DM advisory, the heal UI button.
- Bots: delete the `active_team` parameter, the `_apply_active_team` helper, the mismatch check (already inert from Phase 2).
- RAG: delete the `active_team` claim extractor.
- Keycloak: delete `team-<slug>` and `team-personal` client scopes via a one-time cleanup migration (or operator opt-in script; see Open Questions).
- Helm chart: remove `KEYCLOAK_RBAC_ACTIVE_TEAM_SLUG` value definition.
- `docker-compose.dev.yaml`: remove the env var passthrough.
- Documentation: rewrite `docs/docs/security/rbac/*` to drop the `active_team` narrative.
FRs: FR-012, FR-013, FR-014, FR-015, FR-040.
Independent value per phase
- Phase 1 alone is shippable. Adds the data plumbing; no behavior change.
- Phase 1 + 2 is shippable and delivers full user-visible value: DMs work, users can personalize, Web UI parity. Keycloak still has extra scopes but they're inert.
- Phase 3 is the cleanup. Critical for the explainability goal (SC-019) but not for functional correctness.
Non-Goals
- Not changing the OpenFGA authorization model. Same tuples, same probes. We change the producer of the team subject, not the schema.
- Not changing AgentGateway→MCP authorization. Per-user-per-tool, orthogonal.
- Not changing admin UI for team/channel/space CRUD. Same UX. BFF stops calling Keycloak on team create, but UI doesn't notice.
- Not adding context-scoped grants ("Agent X usable in channels but not DMs"). Current OpenFGA model supports it via different resource refs; out of scope here.
- Not multi-tenant bot instances per realm. Assumes one bot per realm.
- Not retrofitting other token-exchange consumers (e.g. dynamic-agents OBO chains for MCP). Those continue to use OBO without `active_team` for their own reasons.
- Not changing the channel→team and user-membership pre-checks at the bot edge. They remain as fast-deny paths.
- Not introducing a supervisor-style LLM auto-router as the DM default agent. A user can pick any agent they have access to; introducing a "platform supervisor" agent is a separate design decision.
- Not per-user agent permissions management. Spec consumes existing permission system; granting/revoking is out of scope.
- Not cross-platform thread continuity. A Slack DM thread and a Webex DM thread are independent.
Assumptions
- The authoritative OpenFGA system can answer `list_objects(user, member, team)` and `list_objects(user, can_use, agent)` efficiently for users with ≤50 memberships / agent grants.
- A user-preferences MongoDB collection exists or can be created without conflicting with other in-flight specs.
- Slack workspace operators are willing to update the Slack app manifest to register new slash commands as part of deploying Phase 2.
- Slack DM detection (`is_dm_channel`) and Webex 1:1 detection (`space.type == "direct"`) are reliable signals already used by the bots.
- The bots can call the BFF authenticated via OBO context (already true).
Dependencies
- Existing per-user OBO token-exchange flow.
- Existing OpenFGA-backed permission system with `list_objects` support.
- Existing user-settings storage in the BFF (or willingness to create a `user_preferences` collection if absent).
- Existing identity-link prerequisite: the bot must resolve the chatting user to a CAIPE Keycloak identity before reading preferences or checking permissions.
- Phase 3 demolition depends on Phase 2 having shipped and been observed running for at least one release window (so any in-flight tokens with the legacy claim have expired naturally).
Open Questions
- OpenFGA `list_objects` performance. The DM/Web UI path runs `list_objects(user, member, team)` for each PDP call. For users in many teams, this is N+1 OpenFGA calls. Plan: ship Phase 1+2 without optimization, with an in-process per-user 60s cache. Monitor latency. If it becomes a problem, add a server-side OpenFGA relation (`agent#can_be_used_by_any_team_of_user`) in a future spec.
- `active_team` log field deprecation window. Spec FR-042 drops it in Phase 2. If anyone uses the log field for downstream observability (Splunk dashboards), they need a heads-up before Phase 2 ships. Plan: announce in the Phase 1 release notes; deprecate the field in Phase 2.
- Keycloak `team-<slug>` scope cleanup mechanism (Phase 3). Options:
  - (a) BFF-initiated migration script that uses Keycloak Admin API to bulk-delete `team-*` scopes once Phase 2 has been live for one release.
  - (b) Operator-facing button in Admin UI that lists current `team-*` scopes and deletes them on click.
  - (c) Leave scopes in place forever; they're inert and harmless.
  Plan: default to (a), an automated migration as part of the Phase 3 release. (c) is the rollback path if something breaks.
- Webex command syntax. Slack has slash commands. Webex doesn't. Plan: text commands `list`, `use <agent>`, `help` directed at the bot (after `@bot` mention or as full message body in 1:1). Optional follow-up: natural-language detection for "what agents can I use?" etc., but not a hard requirement for shipping.
- Web UI DM default agent picker placement. New section in existing Settings panel, or new top-level "DM" subpage? Plan: new section in existing Settings panel (one less navigation surface).
- Per-user preference storage location. MongoDB collection name `user_preferences` is a natural fit. Check whether an existing collection serves user-scoped settings; if so, add the field there. Plan: scan for existing `user_settings` / `user_preferences` collections at plan time and reuse if found.
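The in-process per-user 60s cache mentioned in the first open question might look like the sketch below. `TeamCache` and its `fetch` callable are hypothetical; `fetch` stands in for whatever wraps `list_objects(user, member, team)`, and the injectable clock exists only to make the expiry testable.

```python
import time
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


class TeamCache:
    """Per-process, per-user cache of team memberships with a fixed TTL."""

    def __init__(
        self,
        fetch: Callable[[str], Iterable[str]],  # hypothetical OpenFGA wrapper
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def get(self, user_sub: str) -> FrozenSet[str]:
        now = self._clock()
        hit = self._entries.get(user_sub)
        # Serve from cache while the entry is younger than the TTL.
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        teams = frozenset(self._fetch(user_sub))
        self._entries[user_sub] = (now, teams)
        return teams

    def invalidate(self, user_sub: str) -> None:
        """Drop a user's entry, e.g. after a known membership change."""
        self._entries.pop(user_sub, None)
```

One consequence of this design is that a membership change may take up to the TTL to be observed by a given process, unless the entry is invalidated explicitly.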
What Stays The Same
For reviewers to know what they do NOT need to re-validate:
- Keycloak as identity authority. Still issues OBO. Signs `sub`, `act.sub`, `aud`, `exp`. We remove one custom claim, not the identity model.
- OpenFGA tuples and authorization model. All existing tuples remain.
- MongoDB schemas (other than new `user_preferences` field/collection).
- Bot edge denies for unmapped channels/spaces. Keep working as fast-fail paths.
- The two distinct gates (Human→Agent at bot/BFF; Agent→Tool at AgentGateway). Both stay.
- Admin UI flows for team/channel/space CRUD. No UX changes for admins managing data.
- Deployment-level `default_agent_id` and `dm_agent_id` config. Still the fallback when user has no preference.
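The fallback relationship between user choices and the deployment-level config can be illustrated with a small resolver. The ordering shown (thread override, then saved preference, then `dm_agent_id`, then `default_agent_id`) is an assumption made for this sketch; the authoritative order is whatever the FR-023 dispatch chain specifies. `resolve_dm_agent` is a hypothetical name.

```python
from typing import Optional, Tuple


def resolve_dm_agent(
    thread_override: Optional[str],
    saved_preference: Optional[str],
    dm_agent_id: Optional[str],
    default_agent_id: Optional[str],
) -> Optional[Tuple[str, str]]:
    """Return (agent_id, source) for the first configured candidate, or None."""
    # Assumed precedence: most specific user choice first, deployment config last.
    candidates = (
        ("thread_override", thread_override),
        ("saved_preference", saved_preference),
        ("dm_agent_id", dm_agent_id),
        ("default_agent_id", default_agent_id),
    )
    for source, agent_id in candidates:
        if agent_id:
            return agent_id, source
    return None
```

The sketch deliberately omits the permission check; in the design described here the selected agent would still go through the PDP before dispatch.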
Glossary (for the spec; to be migrated into docs/docs/security/rbac/)
- PEP (Policy Enforcement Point): Where a deny happens. Slack/Webex bot for human→agent; AgentGateway for agent→tool.
- PDP (Policy Decision Point): Where the allow/deny decision is made. BFF for human→agent; OpenFGA bridge for agent→tool.
- Team union: The set of agents a user can use by virtue of being a member of any team that has been granted access.
- DM personal mode: A direct message between a user and the bot. Detected by absence of a `channel_team_mappings` (or `space_team_mappings`) row for the channel/space, not by a JWT claim or sentinel string.
- Saved preference vs thread override: The user's persistent default for DMs (stored in `user_preferences`) vs a per-thread choice that lives in the bot's process memory until the user explicitly changes it or the bot restarts. Neither has a time-based expiry.
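The "DM personal mode" definition above amounts to a lookup that either yields a team slug or yields nothing. The sketch below uses plain dictionaries as stand-ins for the `channel_team_mappings` and `space_team_mappings` collections; the function names are hypothetical, and the separate bot-edge deny for unmapped non-DM channels is not modeled here.

```python
from typing import Mapping, Optional


def team_for_conversation(
    conversation_id: str,
    channel_team_mappings: Mapping[str, str],
    space_team_mappings: Mapping[str, str],
) -> Optional[str]:
    """Return the mapped team slug for a Slack channel or Webex space, else None."""
    # In a real deployment these would be MongoDB lookups keyed by channel/space id.
    slug = channel_team_mappings.get(conversation_id)
    if slug is None:
        slug = space_team_mappings.get(conversation_id)
    return slug


def is_dm_personal_mode(
    conversation_id: str,
    channel_team_mappings: Mapping[str, str],
    space_team_mappings: Mapping[str, str],
) -> bool:
    """Personal mode is the absence of a mapping row, not a claim or sentinel."""
    return (
        team_for_conversation(
            conversation_id, channel_team_mappings, space_team_mappings
        )
        is None
    )
```

Because both consumers and producers can run the same lookup against the same data, no token claim is needed to carry the result between them.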