
RBAC Architecture

Component-by-component reference. Each section describes what it owns, what it does NOT own, and the env vars / config files / extension points you'd touch to change its behavior.

Read the index first if you want the big-picture mental model and the JWT primer. Read Workflows for the request-flow sequence diagrams that tie all of this together.


Helm Runtime Packaging

The 0.5.0 umbrella chart can own the RBAC runtime stack for demo and managed environments:

  • tags.keycloak=true enables the Keycloak subchart, realm import, and IdP/token-exchange init hooks. The imported realm follows keycloak.realm.name by rewriting the packaged realm JSON's realm name, Keycloak default-role name, and realm-role container ids at render time.
  • The Keycloak subchart packages the caipe login theme by default and mounts it as a ConfigMap under /opt/keycloak/themes/caipe. Deployments can customize branding with keycloak.theme.brandName, keycloak.theme.colors.*, or full keycloak.theme.files.* overrides; keycloak.theme.existingConfigMap remains available for externally managed theme ConfigMaps.
  • openfga.enabled=true enables the OpenFGA service and the CAIPE authorization model loader hook. The loader can still write explicit emergency tuples through openfga.init.seedTuples, but production RBAC installs should bootstrap human admins through the Web UI BFF email reconciler so operators do not have to hardcode Keycloak UUIDs in Helm values.
  • openfgaAuthzBridge.enabled=true enables the gRPC ext_authz bridge that validates the bearer JWT again, extracts the verified sub, and translates AgentGateway checks into OpenFGA checks.
  • agentgateway.enabled=true enables the standalone AgentGateway proxy chart. global.agentgateway.enabled=true is still the Gateway API route-resource path for clusters using the AgentGateway controller model.
  • The local and Helm standalone AgentGateway provider MCP targets preserve the caller's Keycloak bearer for listener JWT validation and OpenFGA mcp_gateway:list authorization, then inject provider tokens such as GITHUB_PERSONAL_ACCESS_TOKEN and GITLAB_PERSONAL_ACCESS_TOKEN as backend auth only on the upstream MCP hop. Helm installs should mount those values from Secrets or ExternalSecrets through agentgateway.extraEnv/agentgateway.extraEnvFrom; this keeps provider PATs out of browser/session traffic while satisfying upstream MCP servers' Authorization requirements.

Production installs must still supply ExternalSecrets and persistent datastore settings; the chart defaults are conservative, and these components are disabled by default.


Component 1: Keycloak (HR & The Front Desk)

Badge analogy: HR issues ID badges. The front desk verifies them on entry. Every other door in the building trusts the badge's chip; they don't call HR each time. When a contractor arrives via a partner agency (Duo SSO), the front desk checks with the agency once, creates an internal record, and issues a standard building badge. From that point on, the contractor uses the same badge as everyone else.

Technically: Keycloak acts as an OIDC Authorization Server and IdP broker. It proxies login to Duo SSO via an OIDC client, mirrors external group claims into identity attributes for sync, and issues its own signed JWT, so downstream services only ever need to trust one issuer. CAIPE authorization decisions are no longer encoded as Keycloak realm roles.

Realm Roles (configured realm, default caipe)

Role                     Default?    Purpose
default-roles-<realm>    Yes         Keycloak composite default role.
offline_access           Yes         Keycloak protocol role for refresh/offline.
uma_authorization        Built-in    Keycloak protocol role; not CAIPE authz.

There are no CAIPE business/resource realm roles. A CAIPE admin is represented as user:<sub> admin organization:<org_key> in OpenFGA, optionally via team:<slug>#admin admin organization:<org_key>. BOOTSTRAP_ADMIN_EMAILS is only a break-glass fallback until those durable organization tuples exist.
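The two admin representations above can be written down as OpenFGA (user, relation, object) triples. The following is a minimal sketch only: the helper names `directOrgAdmin` and `teamMediatedOrgAdmin` and the sample ids are hypothetical and are not taken from the CAIPE codebase.

```typescript
// An OpenFGA relationship tuple is a (user, relation, object) triple.
interface TupleKey {
  user: string;
  relation: string;
  object: string;
}

// Hypothetical helper: direct admin grant for one Keycloak subject.
function directOrgAdmin(sub: string, orgKey: string): TupleKey {
  return { user: `user:${sub}`, relation: "admin", object: `organization:${orgKey}` };
}

// Hypothetical helper: every admin of a team becomes an organization admin
// through the team:<slug>#admin userset.
function teamMediatedOrgAdmin(teamSlug: string, orgKey: string): TupleKey {
  return { user: `team:${teamSlug}#admin`, relation: "admin", object: `organization:${orgKey}` };
}

// Sample values are placeholders, not real identifiers.
console.log(directOrgAdmin("example-sub", "example-org"));
console.log(teamMediatedOrgAdmin("platform", "example-org"));
```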

Resource-scoped roles (legacy)

Legacy role names such as chat_user, admin, admin_user, team_member:*, kb_reader:*, agent_user:*, and tool_user:* are cleanup targets only. New installs do not create them, and new authorization code must not check them.

Relationships are created and assigned by:

  • init-idp.sh (runs in the keycloak-init job) is the first-run bootstrap escape hatch. It uses direct Keycloak admin credentials before the Web UI backend is healthy, which avoids a bootstrap cycle where BFF startup needs Keycloak config that only the BFF can create. It should keep only baseline app-realm prerequisites, IdP broker login bootstrap, optional demo personas (KEYCLOAK_SEED_DEMO_USERS=true), and operational master-realm settings such as admin-console frontendUrl. It also ensures offline_access is present on the configured realm's default-roles-<realm> composite and enables Keycloak's realm-level users-management-permissions feature with bootstrap admin credentials so the later BFF migration does not need broad manage-realm privilege. init-token-exchange.sh uses the same bootstrap-admin path to grant both Slack and Webex bot service accounts the realm-management impersonation role before the lower-privilege BFF reconciliation runs.
  • The Web UI backend runs a startup Keycloak RBAC reconciliation migration (keycloak_rbac_mapping_reconciliation_v1) in TypeScript. MongoDB teams remain the source of truth; the migration repairs bot OBO token-exchange permissions for the CAIPE_PLATFORM_AUDIENCE target client, assigns bot service-account impersonation roles, pins the AFFIRMATIVE decision strategy on every scope-permission with bot client policies attached, resolves BOOTSTRAP_ADMIN_EMAILS to Keycloak user ids, creates passwordless verified placeholders for bootstrap emails that have not logged in yet, writes durable OpenFGA super-admin tuples, and records status in migration_manifest, schema_migrations, and data_schema_versions. When the BFF token cannot enable users-management-permissions itself, it falls back to reading the already-enabled permission created by the init hook and continues with policy repair. (Phase 3 of spec 2026-05-24-derive-team-from-channel removed the per-team and personal client-scope branches, the orphan-scope deletion step, and the audience-default selection step; team identity now flows through channel_team_mappings, not Keycloak.)
  • Slack/Webex bot onboarding can still repair OBO prerequisites on demand, but the BFF startup migration is the canonical environment-wide reconciliation path after bootstrap. Its last run, counts, warnings, and errors are exposed through Admin → Security & Policy → Keycloak via GET /api/admin/keycloak/migration-health, plus the persistent header migration status indicator. The same endpoint also performs a read-only Keycloak inspection for the tile details modal, returning actual realm values such as the OBO token-exchange permission strategy, attached OBO policies, and bot service-account impersonation roles. When the migration is behind or has failed, the Keycloak tab's Reconcile now button invokes the same typed migration apply path for keycloak_rbac_mapping_reconciliation_v1 and refreshes the persisted health result. Every Keycloak scope-permission that ends up with bot-specific client policies attached must use the AFFIRMATIVE decision strategy: the caipe-platform target-audience token-exchange perm, each bot client's own token-exchange perm (caipe-slack-bot, caipe-webex-bot), and the realm-level users.impersonate perm. With Keycloak's default UNANIMOUS strategy, adding the second bot's per-client policy makes the first bot's OBO fail with Client not allowed to exchange / Client not allowed to impersonate because the other bot's clients=[...] policy votes DENY for it. The kc_attach_policy_to_scope_permission helper in init-idp.sh and the matching attach_policy_to_scope_permission helper in init-token-exchange.sh both force AFFIRMATIVE on every attach so this regression cannot reappear when a new bot client is onboarded.
The same invariants, plus a defense-in-depth "every attached policy is type=client with a non-empty client_ids allow-list" check, are evaluated server-side by ui/src/lib/rbac/keycloak-invariants.ts#evaluateKeycloakInvariants, exposed through GET /api/admin/keycloak/migration-health as keycloak_invariants.items, and rendered as a named pass/fail/unknown list in the Admin → Security & Policy → Keycloak tile. The evaluator is a pure function over the existing read-only inspector output, so the same checks gate every realm regardless of whether it was bootstrapped by init-idp.sh or by an operator using the Keycloak Admin Console. The inspector hydrates each type=client policy by calling /authz/resource-server/policy/client/<id> and resolves the returned UUIDs to operator-meaningful clientId strings via a single batched /clients round-trip per probe. This is necessary because Keycloak's associatedPolicies summary endpoint returns config: {} on client-type policies, so the allow-list is invisible to a naive inspector. The hydration step also lets the panel surface the policy's resolved client_ids (e.g. clients=[caipe-slack-bot]) inline whenever a policy is flagged, so admins don't have to leave the panel to identify the right policy in the Keycloak Admin Console.
  • Production caipe-ui, caipe-platform (supervisor), and Slack/Webex bot OBO client secrets are Keeper-backed Kubernetes Secrets/ExternalSecrets rather than values embedded in rendered ConfigMaps. keycloak.uiClient.secretRef or keycloak.uiClient.externalSecret feeds KEYCLOAK_UI_CLIENT_SECRET to the Keycloak init/reconcile hook, which updates the existing caipe-ui client through the Admin API so NextAuth's OIDC_CLIENT_SECRET stays aligned across upgrades and rotations. keycloak.platformClient.secretRef / keycloak.platformClient.externalSecret feeds KEYCLOAK_PLATFORM_CLIENT_SECRET the same way to replace the dev placeholder shipped in realm-config.json for the caipe-platform confidential client (consumed by the supervisor's client_credentials flow and the on-behalf-of / token-exchange target audience). Bot OBO secrets use the same single-source-of-truth pattern through keycloak.tokenExchange.externalSecret and keycloak.webexTokenExchange.externalSecret. Setting keycloak.strictClientSecrets: true adds a runtime guard at the end of init-idp.sh (covering caipe-ui + caipe-platform) and init-token-exchange.sh (covering caipe-slack-bot + caipe-webex-bot) that issues a client_credentials token request for each known dev placeholder secret and fails the Helm install if Keycloak still accepts any of them, preventing "operator forgot to set the secretRef" silent regressions. See secrets-bootstrap → Production hardening for the recommended adoption order.
  • The Admin UI Team Resources panel (Admin → Teams → selected team → Resources tab, spec 104 Story 4): checking an agent or tool box calls PUT /api/admin/teams/[id]/resources, which:
    1. Writes base relationship intent to OpenFGA before Mongo persistence: team:<slug>#member user agent:<id>, team:<slug>#admin manager agent:<id>, and team:<slug>#member caller tool:<prefix|*>.
    2. Resolves current team members to Keycloak sub values and writes OpenFGA user:<sub> member team:<slug> membership tuples when possible.
    3. Persists the selection on the team document in Mongo (team.resources = { agents, agent_admins, tools, tool_wildcard }). The Resources tab covers Use+Manage per agent and per-MCP-server tool grants plus a single "All tools" wildcard checkbox. Mongo persistence happens after OpenFGA reconciliation so a PDP outage doesn't leave Mongo ahead of the enforcement store.
  • The Admin UI Team Slack Channels panel (Admin → Teams → <team> → Slack Channels tab, spec 098 US9): binds Slack channels to a team so the bot resolves the channel's effective team via channel_team_mappings. Slack runtime agent access is configured separately in the OpenFGA ReBAC Slack Channels panel, where admins grant a channel access to selected Dynamic Agents. PUT /api/admin/teams/[id]/slack-channels is an idempotent full-replace: it deactivates this team's previous mappings that aren't in the new payload (only when team_id still matches; it never touches another team's rows), upserts the active set, and denormalises a thin slack_channels array onto the team document for the team-card chip count. The UI offers a live users.conversations discovery picker (server-side SLACK_BOT_TOKEN only; lists only channels where the bot is already a member; the in-process cache TTL is admin-configurable via the Discovery cache popover next to the Find Bot-Member Slack Channels button on Admin → Integrations → Slack, default 60 minutes, range 0–1440, 0 disables caching; the same popover exposes a Refresh from Slack now button that drops the snapshot for ad-hoc bot-membership changes) plus a manual ID entry fallback for when the bot isn't in the channel yet.
  • The Admin UI Team Webex Spaces panel (Admin → Teams → <team> → Webex Spaces tab, spec 2026-05-18 Webex RBAC parity): binds Webex spaces to a team through webex_space_team_mappings. Runtime agent access is configured separately in the OpenFGA ReBAC Webex Spaces panel. PUT /api/admin/teams/[id]/webex-spaces is an idempotent full-replace, preserves mappings owned by other teams, and denormalises webex_spaces onto the team document for display.
  • Identity group sync: upstream Okta/AD group ids map to external_group:<provider>/<group_id> and then to CAIPE teams, for example external_group:okta/00g... member team:platform. Application code consumes the resulting team relationships; it does not check upstream group strings directly.
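The AFFIRMATIVE-versus-UNANIMOUS failure mode described in the bot onboarding bullet above can be reproduced with a toy model. This is a simplified sketch, not Keycloak's evaluator: each client policy is assumed to vote PERMIT only for the clients in its allow-list, and the policy names are hypothetical.

```typescript
type DecisionStrategy = "UNANIMOUS" | "AFFIRMATIVE";

// Simplified model of a Keycloak client policy: it votes PERMIT only for
// clients on its allow-list and DENY for everyone else.
interface ClientPolicy {
  name: string; // hypothetical policy name, for illustration only
  clients: string[];
}

function evaluate(
  strategy: DecisionStrategy,
  policies: ClientPolicy[],
  clientId: string,
): boolean {
  const votes = policies.map((p) => p.clients.indexOf(clientId) !== -1);
  if (votes.length === 0) return false; // simplification: nothing attached, no grant
  return strategy === "UNANIMOUS"
    ? votes.every((v) => v) // every attached policy must permit
    : votes.some((v) => v); // one permitting policy is enough
}

const attached: ClientPolicy[] = [
  { name: "slack-bot-policy", clients: ["caipe-slack-bot"] },
  { name: "webex-bot-policy", clients: ["caipe-webex-bot"] },
];

console.log(evaluate("UNANIMOUS", attached, "caipe-slack-bot"));     // false
console.log(evaluate("AFFIRMATIVE", attached, "caipe-slack-bot"));   // true
console.log(evaluate("AFFIRMATIVE", attached, "some-other-client")); // false
```

Under UNANIMOUS the Webex policy's DENY vote blocks the Slack bot as soon as both policies are attached. Under AFFIRMATIVE each bot is permitted by its own policy, and clients on neither allow-list are still rejected.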

BOOTSTRAP_ADMIN_EMAILS is an explicit break-glass/initial-admin list and the source for durable email-based bootstrap seeding. The Web UI BFF resolves each email to a Keycloak sub during keycloak_rbac_mapping_reconciliation_v1; existing SSO users are left untouched, while missing users get a passwordless verified Keycloak placeholder that the IdP broker can auto-link on first login. For each resolved subject, the BFF writes the default member baseline tuples, caller on mcp_gateway:list for AgentGateway's coarse MCP ext_authz gate, admin on organization:<org_key>, manager on system_config:platform_settings, manager on mcp_server:agentgateway, and manager tuples for the built-in admin surfaces, including baseline surfaces such as admin_surface:teams and admin_surface:credentials. Keep the list small, audit it in Admin → Security & Policy → Keycloak, and replace it with team/group-mediated admin relationships when steady-state Identity Group Sync is configured.
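As a rough illustration of the admin-specific tuples listed above, the sketch below builds them for one resolved subject. The function name `bootstrapAdminTuples` is hypothetical, the admin surface list is passed in rather than assumed, and the default member baseline tuples are deliberately left out.

```typescript
interface TupleKey {
  user: string;
  relation: string;
  object: string;
}

// Hypothetical builder for the admin-specific bootstrap tuples of one subject.
// The default member baseline tuples mentioned in the text are not included.
function bootstrapAdminTuples(
  sub: string,
  orgKey: string,
  adminSurfaces: string[],
): TupleKey[] {
  const user = `user:${sub}`;
  const fixed: TupleKey[] = [
    { user, relation: "caller", object: "mcp_gateway:list" },
    { user, relation: "admin", object: `organization:${orgKey}` },
    { user, relation: "manager", object: "system_config:platform_settings" },
    { user, relation: "manager", object: "mcp_server:agentgateway" },
  ];
  const surfaces: TupleKey[] = adminSurfaces.map((surface) => ({
    user,
    relation: "manager",
    object: `admin_surface:${surface}`,
  }));
  return fixed.concat(surfaces);
}

// Sample subject and organization key are placeholders.
for (const t of bootstrapAdminTuples("example-sub", "example-org", ["teams", "credentials"])) {
  console.log(`${t.user} ${t.relation} ${t.object}`);
}
```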

Local no-SSO development uses a dedicated dev auth provider rather than route-local bypass checks. When SSO_ENABLED=false, ALLOW_DEV_ADMIN_WHEN_SSO_DISABLED=true, and CAIPE_UNSAFE_RBAC_BYPASS=true outside production, ui/src/lib/auth/dev-auth-provider.ts supplies the stable anonymous@local / anonymous-local-dev admin principal to API middleware, admin tab gates, RAG proxy calls, and admin-surface checks. This keeps local development on the same auth-context contract as real OIDC sessions while making the insecure mode visible through logs and the UI No Auth indicator.
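The gating condition for the dev auth provider can be sketched as a pure function over the environment. Several details here are assumptions rather than facts from the source: "outside production" is modelled as NODE_ENV !== "production", anonymous@local is treated as the email and anonymous-local-dev as the subject, and `resolveDevPrincipal` is a hypothetical name, not the actual export of dev-auth-provider.ts.

```typescript
interface DevAuthEnv {
  SSO_ENABLED?: string;
  ALLOW_DEV_ADMIN_WHEN_SSO_DISABLED?: string;
  CAIPE_UNSAFE_RBAC_BYPASS?: string;
  NODE_ENV?: string; // assumption: how "outside production" is detected
}

interface DevPrincipal {
  email: string;
  subject: string;
  isAdmin: boolean;
}

// Returns the fixed dev admin principal only when all three flags are set
// and the process is not running in production; otherwise null.
function resolveDevPrincipal(env: DevAuthEnv): DevPrincipal | null {
  const enabled =
    env.SSO_ENABLED === "false" &&
    env.ALLOW_DEV_ADMIN_WHEN_SSO_DISABLED === "true" &&
    env.CAIPE_UNSAFE_RBAC_BYPASS === "true" &&
    env.NODE_ENV !== "production";
  if (!enabled) return null;
  // Assumed mapping of the two identifiers named in the text.
  return { email: "anonymous@local", subject: "anonymous-local-dev", isAdmin: true };
}

console.log(
  resolveDevPrincipal({
    SSO_ENABLED: "false",
    ALLOW_DEV_ADMIN_WHEN_SSO_DISABLED: "true",
    CAIPE_UNSAFE_RBAC_BYPASS: "true",
    NODE_ENV: "development",
  }),
);
```

Requiring all three flags together means a single stray variable cannot silently switch a deployment into the insecure mode.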

When Authorization Relationships Are Created

Keycloak realm roles are not created for CAIPE permissions. New deployments keep Keycloak focused on identity and login:

  • Organization access is user:<sub> member|admin|auditor organization:<org_key> or team-mediated variants. The release migration organization_membership_backfill_v1 writes direct member organization:<org_key> tuples for existing Mongo users with a stable Keycloak sub, restoring baseline supervisor:invoke/RAG query access after the OpenFGA cutover.
  • Login bootstrap access is repaired on each successful CAIPE login. If the user passes OIDC_REQUIRED_GROUP, the Web UI BFF reads the Mongo-backed default OpenFGA grant profile bundle from openfga_baseline_profiles (falling back to the built-in defaults) and writes the selected member profile tuples such as user:<sub> member organization:<org_key>, user:<sub> reader system_config:platform_settings, user:<sub> owner user_profile:<sub>, user:<sub> caller mcp_gateway:list, and selected read-only admin_surface grants. The mcp_gateway:list tuple is required before AgentGateway proxies any MCP probe or tool-call traffic. If the user also matches OIDC_REQUIRED_ADMIN_GROUP or BOOTSTRAP_ADMIN_EMAILS, login bootstrap adds the selected admin profile tuple set, including admin organization:<org_key>, manager system_config:platform_settings, manager mcp_server:agentgateway, and selected admin_surface manager grants for both baseline surfaces (for example teams, credentials, and skills) and privileged surfaces (for example openfga and migrations). Stored built-in profiles are normalized with newly required default grants so existing environments pick up added baseline admin-surface permissions after upgrade. Admins can update the global Org Member / Org Admin default grant profiles, create custom profiles, and assign member/admin profile overrides to teams in Admin → Security & Policy → OpenFGA → Default FGA Grants. These profiles are templates that materialize concrete OpenFGA tuples during login or all-user reconciliation. The same workspace includes OpenFGA Store: Catalog & Live Relationships, a read-only catalog of resource types, action checks, discovered resources, and paginated live OpenFGA tuples so operators can audit the full authorization store beyond the default login templates.
Tuple Inspector filter inputs are apply-only; complete tuple identifiers are sent to OpenFGA as exact read filters, while partial text stays a post-read contains filter for ad-hoc inspection. A team override replaces the global profile for matching team users for that role; if several teams provide overrides, their selected profile grants are unioned. The result is materialized as direct user OpenFGA tuples during login or all-user reconciliation so self-profile grants and existing can_* checks remain deterministic. This is an OpenFGA reconciliation step, not a runtime realm-role fallback; users who fail the OIDC admission group are never bootstrapped.
  • Team membership is user:<sub> member|admin team:<slug>.
  • Resource access is team-mediated where possible, for example team:<slug>#member user agent:<id> or team:<slug>#member reader knowledge_base:<id>.
  • Runtime checks use derived can_* permissions from those base relationships.
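The team-override rule from the login bootstrap bullet (an override replaces the global profile, and several overrides are unioned) can be sketched for a single role as follows. The `Grant` shape and the `effectiveGrants` name are hypothetical; the real profile bundle stored in openfga_baseline_profiles may be structured differently.

```typescript
// A profile grant before it is materialized for a concrete user.
interface Grant {
  relation: string;
  object: string;
}

// Hypothetical resolver for one role (member or admin):
// - no matching team override: use the global profile
// - one or more overrides: ignore the global profile and union the overrides
function effectiveGrants(
  globalProfile: Grant[],
  teamOverrides: Record<string, Grant[]>,
  userTeams: string[],
): Grant[] {
  const overrides = userTeams
    .filter((slug) => Object.prototype.hasOwnProperty.call(teamOverrides, slug))
    .map((slug) => teamOverrides[slug]);
  const sources = overrides.length > 0 ? overrides : [globalProfile];
  const byKey: Record<string, Grant> = {};
  for (const profile of sources) {
    for (const grant of profile) {
      byKey[`${grant.relation} ${grant.object}`] = grant; // de-duplicate the union
    }
  }
  return Object.keys(byKey).map((key) => byKey[key]);
}

const globalMember: Grant[] = [
  { relation: "member", object: "organization:example-org" },
  { relation: "caller", object: "mcp_gateway:list" },
];
const overridesByTeam: Record<string, Grant[]> = {
  sre: [{ relation: "member", object: "organization:example-org" }],
  data: [
    { relation: "member", object: "organization:example-org" },
    { relation: "reader", object: "system_config:platform_settings" },
  ],
};

console.log(effectiveGrants(globalMember, overridesByTeam, ["docs"]));        // global profile
console.log(effectiveGrants(globalMember, overridesByTeam, ["sre", "data"])); // union of overrides
```

The resulting grants would then be written as direct user:<sub> tuples, which is what keeps the existing can_* checks deterministic.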

Rule of thumb: Keycloak owns identity and JWT claims; OpenFGA owns who is related to which organization, team, or resource.

The user-facing Connections & Secrets surface is hidden unless credential features are enabled and the signed-in Keycloak subject has can_use_credentials organization:<org_key> in OpenFGA (granted by organization member or admin). Specific secret metadata, use, share, manage, and audit operations are still governed by secret_ref:<id> relationships. The Admin → Settings → Credentials tab is stricter: it is also feature-flagged and requires organization-admin access (can_manage organization:<org_key>), not only the read-only admin_surface:credentials baseline grant.

The Web UI backend now uses shared object-level OpenFGA checks for UI-owned resource surfaces whenever the authorization model has a concrete resource type. list and discover map to can_discover, runtime/content access maps to can_read or can_use, mutations map to can_write, sharing maps to can_share, and platform configuration maps to can_manage on system_config:<key>. Dynamic Agent create requires a stable Keycloak sub; private agents write user:<creator_sub> owner agent:<id>, and team-owned agents require OpenFGA team:<slug>#can_use before creation (Mongo team membership is not a fallback). Creation writes durable relationships before MongoDB persistence: user:<creator_sub> owner agent:<id>, organization:<org>#admin manager agent:<id>, team:<slug>#member user agent:<id>, team:<slug>#admin manager agent:<id>, and the agent-to-tool caller tuples. The Agent editor's "Share with Teams" multi-select extends the same two-tuple pair (team:<slug>#member user agent:<id> plus team:<slug>#admin manager agent:<id>) to every additional shared team; POST /api/dynamic-agents and PUT /api/dynamic-agents resolve each entry against the teams collection (legacy Mongo _id is accepted for backward compat but normalized to the canonical slug before persistence and OpenFGA writes), drop the owner-team duplicate, and feed both nextSharedTeamSlugs and previousSharedTeamSlugs into reconcileAgentRelationships so unchecking a team in the editor genuinely emits delete tuples instead of leaving a dangling grant. The agent_shared_team_grants_backfill_v1 migration replays this normalisation against every existing agent so the multi-select that pre-dated the 2026-05-27 fix retroactively writes the missing canonical tuples. Dynamic Agent update/delete paths check the concrete agent:<id> object before MongoDB writes or tuple reconciliation. 
Chat agent pickers (/api/dynamic-agents/available) and subagent pickers (/api/dynamic-agents/available-subagents) load enabled candidates and filter through agent#can_use; conversation creation also checks agent#can_use before storing a selected agent. LLM model list and edit routes use llm_model#can_read/#can_write/#can_delete; config-driven system models get organization:<org>#member reader llm_model:<id> and organization:<org>#admin manager llm_model:<id> tuples during seed and remain immutable. Skill config reads no longer prefilter by MongoDB visibility, owner_id, shared_with_teams, or legacy realm roles; they load candidates and let skill#can_discover/skill#can_read decide. Task Builder reads follow the same pattern with task#can_discover/task#can_read. Workflow configs are mapped to the existing OpenFGA task namespace until the authorization model grows a first-class workflow type. Dynamic Agent built-in tool metadata at GET /api/dynamic-agents/builtin-tools is not OpenFGA-gated: it returns a static catalog of supported built-in tool types (web_search, file_io, etc.), is read by every authenticated user who can open the Create Agent wizard, and per-tool authorization happens at MCP invocation time. The route requires only an authenticated session and forwards the caller's bearer token to dynamic-agents (which enforces DA_REQUIRE_BEARER). Earlier revisions gated this on tool:dynamic-agents-builtin#can_discover, but no seed/migration path ever wrote that tuple so every caller (including admins) was denied with 403; that pseudo-resource is now retired.
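The shared-team reconciliation described above (two tuples per shared team, owner-team duplicate dropped, explicit deletes for unchecked teams) could be sketched like this. `diffSharedTeams` is a hypothetical stand-in for the diffing step; it is not the actual reconcileAgentRelationships implementation, and it assumes slugs are already normalized.

```typescript
interface TupleKey {
  user: string;
  relation: string;
  object: string;
}

interface TupleDiff {
  writes: TupleKey[];
  deletes: TupleKey[];
}

// The two-tuple pair the text describes for every shared team.
function sharedTeamPair(slug: string, agentId: string): TupleKey[] {
  return [
    { user: `team:${slug}#member`, relation: "user", object: `agent:${agentId}` },
    { user: `team:${slug}#admin`, relation: "manager", object: `agent:${agentId}` },
  ];
}

// Hypothetical diff: compares previous and next shared team slugs and emits
// writes for newly shared teams and deletes for teams that were unchecked.
function diffSharedTeams(
  agentId: string,
  ownerSlug: string,
  previousSharedTeamSlugs: string[],
  nextSharedTeamSlugs: string[],
): TupleDiff {
  // Drop the owner-team duplicate and any repeated slugs.
  const clean = (slugs: string[]): string[] =>
    slugs.filter((s, i) => s !== ownerSlug && slugs.indexOf(s) === i);
  const prev = clean(previousSharedTeamSlugs);
  const next = clean(nextSharedTeamSlugs);
  const toTuples = (slugs: string[]): TupleKey[] =>
    slugs.reduce<TupleKey[]>((acc, s) => acc.concat(sharedTeamPair(s, agentId)), []);
  return {
    writes: toTuples(next.filter((s) => prev.indexOf(s) === -1)),
    deletes: toTuples(prev.filter((s) => next.indexOf(s) === -1)),
  };
}

const diff = diffSharedTeams("a1", "platform", ["sre", "data"], ["platform", "sre", "ml"]);
console.log(diff.writes.map((t) => `${t.user} ${t.relation} ${t.object}`));
console.log(diff.deletes.map((t) => `${t.user} ${t.relation} ${t.object}`));
```

In this example the pair for team ml is written, the pair for team data is deleted, and team sre is left untouched because it appears in both lists.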

The Admin → Security & Policy → OpenFGA policy graph is a visibility surface for these same base relationships. Team-scoped graph queries include both team:<slug>#member and team:<slug>#admin usersets, so management grants such as team:<slug>#admin manager agent:<id> and team:<slug>#admin manager admin_surface:<surface> appear alongside member grants. The default graph remains a clean team/resource workspace: team and userset nodes are always visible, and resource nodes are shown when selected from the live catalog. Operators can switch graph layers to inspect stored OpenFGA tuples, read-only Slack/Webex routing metadata, subject-scoped effective can_* access paths, or authorization-model topology derived from the universal resource/action model. These layers are user-facing alternatives, not one combined overlay. Effective access is intentionally user-centered and requires a selected user before rendering broad inherited access. Model topology shows resource-type anchors first; selecting catalog resources expands only the matching type's relation and permission stacks, not concrete live resource cards. The UI resource palette and connection defaults read from the live catalog, so newly introduced resource types such as secret_ref, policy, audit_log, or llm_model appear without adding another graph-specific resource list.

Conversations use a hybrid ownership model to avoid creating high-cardinality owner tuples for every private chat. Private ownership is implicit from MongoDB (owner_subject for normalized records, legacy owner_id email fallback for old records). Explicit OpenFGA relationships remain the enforcement store for cross-boundary sharing and admin surfaces. The Web UI backend now fetches non-deleted conversation candidates without MongoDB team-sharing prefilters, then applies the same implicit-or-explicit conversation check on chat list/detail routes, Dynamic Agent v1 stream/invoke/resume/cancel proxy routes, and conversation metadata updates. This lets Slack OBO requests write their own thread conversations and bookkeeping metadata without requiring explicit owner tuples while still allowing OpenFGA-only conversation grants to appear in the UI. The Admin → System → Migrations tab seeds a DB-managed migration_manifest from the runtime bundle, shows the active runtime migration release beside per-collection data_schema_versions, hides completed migrations by default, and runs the release migration handlers, including conversation_owner_identity_v1 for owner_subject/owner_identity_version=2, organization_membership_backfill_v1 for direct baseline organization membership, universal team-resource OpenFGA backfill, Dynamic Agent tool tuple reconciliation, Dynamic Agent organization-admin inheritance backfill, Dynamic Agent shared-team grants backfill (agent_shared_team_grants_backfill_v1, writes the missing team:<slug>#member user agent:<id> tuples for every existing agent's shared_with_teams), Slack channel and Webex space ReBAC grant backfills, messaging team mapping reconciliation, RBAC index creation, and Webex messaging ReBAC index creation. Migration runs are recorded in schema_migrations; blocking required migrations and the migration status API are admin-only surfaces.
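The implicit-or-explicit conversation check can be illustrated with a small sketch. Everything beyond the owner_subject and owner_id fields is an assumption: the `conversation:<id>` object name, the relation passed in, the exact-match email comparison, and the synchronous `ExplicitCheck` callback standing in for a real OpenFGA check (which would be asynchronous).

```typescript
interface ConversationDoc {
  _id: string;
  owner_subject?: string; // normalized records: Keycloak sub of the owner
  owner_id?: string;      // legacy records: owner email
}

interface Caller {
  sub: string;
  email?: string;
}

// Stand-in for an OpenFGA check call; injected so the sketch stays self-contained.
type ExplicitCheck = (user: string, relation: string, object: string) => boolean;

function canAccessConversation(
  conv: ConversationDoc,
  caller: Caller,
  relation: string,
  explicitCheck: ExplicitCheck,
): boolean {
  // 1. Implicit ownership from MongoDB, with no owner tuple required.
  if (conv.owner_subject !== undefined) {
    if (conv.owner_subject === caller.sub) return true;
  } else if (conv.owner_id !== undefined && conv.owner_id === caller.email) {
    // Legacy email fallback, applied only when no owner_subject is recorded.
    return true;
  }
  // 2. Otherwise fall back to an explicit OpenFGA relationship (sharing, admin).
  return explicitCheck(`user:${caller.sub}`, relation, `conversation:${conv._id}`);
}

const denyAll: ExplicitCheck = () => false;
console.log(canAccessConversation({ _id: "c1", owner_subject: "u1" }, { sub: "u1" }, "can_read", denyAll)); // true
console.log(canAccessConversation({ _id: "c1", owner_subject: "u1" }, { sub: "u2" }, "can_read", denyAll)); // false
```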

Conversation secondary views and mutations now use the same model: shared, search, and trash routes fetch candidates and filter through the implicit-or-explicit OpenFGA helper; pin, archive, restore, and share actions require the concrete conversation relationship instead of raw owner_id equality. Skill nested routes and import overwrite paths also load candidates by id and require skill#read, skill#write, or skill#admin as appropriate; legacy skill visibility fields remain metadata only. Workflow run list/start/poll/update/delete/resume/cancel operations authorize against the parent workflow config through the temporary task namespace mapping. MCP server list/probe/update/delete and team RAG tool list/read/write/delete use concrete mcp_server and tool OpenFGA resource checks without a legacy session role bypass; MCP server create requires a stable Keycloak sub, writes mcp_server owner/team tuples before Mongo persistence, and delete removes associated OpenFGA tuples before deleting the Mongo row. Credential management adds admin_surface:credentials for connector administration and global secret metadata management, plus concrete secret_ref authorization for user metadata, use, share, manage, and audit decisions. The user-facing page separates My Secrets and My Connections, while the Admin Credentials tab owns OAuth provider configuration and all-user secret metadata actions. Browser API routes may create or rotate secret material, but raw credential retrieval is restricted to bearer-authenticated service callers using the credential-service audience.

Knowledge Base UI routes are enforced at the Web UI backend before proxying to the RAG server. caipe-ui authenticates the browser session, applies the coarse rag route gate, requires admin_surface:rag_datasources#can_manage for the Data Sources admin surface, checks concrete knowledge_base:<id> operations for Knowledge Base pages and sharing, filters datasource list responses by data_source#can_read, constrains search/MCP invocations to the caller's readable datasource IDs, and then forwards the Keycloak bearer token to RAG. RAG validates the token signature, issuer, audience, and expiry against Keycloak, then repeats OpenFGA checks for direct API/MCP requests using the caller's Keycloak sub. Human Keycloak realm roles and per-KB realm roles do not grant RAG access; OpenFGA tuples such as team:<slug>#member reader knowledge_base:<id> and team:<slug>#member reader data_source:<id> are the source of truth. Settings → Knowledge Bases / RAG Team Access can grant a team any of the following: access to the Data Sources admin surface, read/ingest/admin access to Knowledge Bases, or component-level datasource read/ingest/admin access. Team owners/admins may manage KB grants for their own team without platform-admin access.

The Teams dialog Knowledge Bases tab reads team_kb_ownership through /api/admin/teams/[id]/kb-assignments. During the migration window, if no ownership row exists it treats legacy teams.resources.knowledge_bases entries as read-level assignments so older team resource grants still render instead of appearing empty.

Org-admin super-grant on KB / Search / Data Sources / Graph / MCP Tools (PR 1, 2026-05-27). Any caller that holds user:<sub> can_manage organization:<org_key> in OpenFGA is always allowed on every Knowledge Base sidebar surface. The Web UI backend implements this with an explicit bypassForOrgAdmin: true option passed to requireResourcePermission / filterResourcesByPermission for knowledge_base:<id> reads (per-KB gate, datasource list filter, readable-datasource enumerator) and a matching org-admin short-circuit in constrainSearchBody so admins are not subject to filter injection. This is policy: once you are org admin, you cannot be excluded from one specific KB while staying org admin. To restore pure per-resource checks (no super-grant), set RAG_ADMIN_BYPASS_DISABLED=true. Non-admins continue to need explicit per-KB / per-team tuples. The release migration admin_surface_rag_datasources_admin_grant_v1 backfills user:<sub> manager admin_surface:rag_datasources for every previously-bootstrapped org admin so the rag + admin short-circuit in api-middleware.ts is fail-safe and not solely inheritance-dependent.
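A minimal sketch of the org-admin super-grant and its kill switch follows. The real signatures of requireResourcePermission and filterResourcesByPermission are not shown in this document, so `canReadKnowledgeBase`, the `Check` callback, and the `bypassDisabled` flag (standing in for RAG_ADMIN_BYPASS_DISABLED=true) are all hypothetical.

```typescript
// Stand-in for an OpenFGA check call; injected to keep the sketch self-contained.
type Check = (user: string, relation: string, object: string) => boolean;

interface BypassOptions {
  bypassForOrgAdmin: boolean; // opt-in per call site, as described in the text
  bypassDisabled: boolean;    // mirrors RAG_ADMIN_BYPASS_DISABLED=true
}

function canReadKnowledgeBase(
  sub: string,
  orgKey: string,
  kbId: string,
  check: Check,
  opts: BypassOptions,
): boolean {
  const user = `user:${sub}`;
  // Super-grant: an org admin is allowed without a per-KB tuple.
  if (
    opts.bypassForOrgAdmin &&
    !opts.bypassDisabled &&
    check(user, "can_manage", `organization:${orgKey}`)
  ) {
    return true;
  }
  // Pure per-resource check for everyone else, or when the bypass is disabled.
  return check(user, "can_read", `knowledge_base:${kbId}`);
}

// Toy check: the caller is an org admin but holds no tuple on the KB.
const orgAdminOnly: Check = (_user, relation, object) =>
  relation === "can_manage" && object === "organization:example-org";

console.log(canReadKnowledgeBase("u1", "example-org", "kb1", orgAdminOnly,
  { bypassForOrgAdmin: true, bypassDisabled: false })); // true
console.log(canReadKnowledgeBase("u1", "example-org", "kb1", orgAdminOnly,
  { bypassForOrgAdmin: true, bypassDisabled: true }));  // false
```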

Graph tab gate + info banner + per-KB ontology filtering follow-up (PR 5, 2026-05-27). The Graph tab at /knowledge-bases/graph now consults useKbTabGates (the PR 2 hook). Non-admins with zero readable KBs see the NoKbAccessEmpty empty state. When the tab is rendered the new GraphInfoBanner reminds the user (including org admins under PR 1's super-grant) that the ontology graph is currently global: it is stored in Neo4j keyed only by _datasource_id and is not filtered per KB. Per-KB filtering needs new RAG-server work (a kb_ids filter on the /v1/graphrag/* endpoints plus an OpenFGA-driven membership probe in the BFF) and is tracked by docs/docs/specs/2026-05-27-per-kb-ontology-graph-filtering/spec.md.

data_source and mcp_tool OpenFGA types + reconcilers + BFF list filter (PR 4, 2026-05-27). deploy/openfga/model.fga and the Helm-packaged JSON authorization model include two RAG resource types; local Docker Compose mounts the same chart JSON model used by Helm so there is only one JSON artifact to keep current:

type data_source    # datasource component inside the Knowledge Base feature,
                    # with per-datasource read and ingest/write grants
type mcp_tool       # RAG custom MCP tools (PUT /v1/mcp/custom-tools/<id>),
                    # distinct from the existing tool:<id> used by AgentGateway

Both expose manager: [user, service_account, team#admin, organization#admin] so org admins are an explicit edge on the model, not just a runtime bypass. buildDataSourceRelationshipTupleDiff and buildMcpToolRelationshipTupleDiff (in ui/src/lib/rbac/openfga-owned-resources.ts) emit the same shared-teams diff that PR 3 introduced for knowledge_base. mcp_tool additionally emits the user relation on member tuples so team members get can_call (mirrors how mcp_server invokers are modelled).

The BFF (ui/src/app/api/rag/[...path]/route.ts) now writes mcp_tool:<tool_id> tuples on a successful PUT /v1/mcp/custom-tools/<tool_id> (sourcing the owner team slug from the request body) and filters the GET /v1/mcp/custom-tools response by mcp_tool:<id>#can_read. Org admins bypass via the PR 1 super-grant; non-admins only see tools they have a tuple on.

Two strictly-additive backfill migrations live in ui/src/lib/rbac/migrations/registry.ts:

  • data_source_grants_backfill_v1 mirrors every existing knowledge_base:<id> tuple as a parallel data_source:<id> tuple, so admins who could read a KB on day zero can still read its data source on day one. No deletes.
  • mcp_tool_grants_backfill_v1 walks Mongo team_rag_tools and writes the canonical team:<slug>#member reader mcp_tool:<id> + team:<slug>#member user mcp_tool:<id> + team:<slug>#admin manager mcp_tool:<id> tuples. Tools without a team owner fall through to the organization#admin → manager edge.

Per-KB Share-with-Teams panel + reconciler (PR 3, 2026-05-27). KB admins (anyone with knowledge_base:<id>#can_manage) and org admins can share a Knowledge Base with additional teams from the new /knowledge-bases/sharing/[id] page (KbSharingPanel + TeamMultiPicker). The page calls PUT /api/rag/kbs/[id]/sharing, which reconciles the team list through reconcileKnowledgeBaseRelationships. The reconciler diffs nextSharedTeamSlugs vs previousSharedTeamSlugs and emits explicit deletes for removed teams (mirrors how reconcileAgentRelationships reconciles shared agent teams), so unchecking a team revokes the team:<slug>#member reader, team:<slug>#member ingestor, and team:<slug>#admin manager tuples in a single OpenFGA write. The release migration knowledge_base_shared_team_grants_backfill_v1 walks the legacy team_kb_ownership Mongo collection and writes the canonical team:<slug>#member reader knowledge_base:<id> + team:<slug>#member ingestor knowledge_base:<id> + team:<slug>#admin manager knowledge_base:<id> tuples for every (team, kb) row so existing readers/managers retain access once the per-resource gates ship.
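
The diff step of the reconciler can be sketched as follows. This is an illustration under assumptions, not the actual reconcileKnowledgeBaseRelationships code: `kbTeamTuples` and `diffSharedTeams` are hypothetical helper names, and the `Tuple` shape is a simplified stand-in for an OpenFGA tuple key.

```typescript
interface Tuple {
  user: string;
  relation: string;
  object: string;
}

// The three canonical team grants named above for one (team, kb) pair.
function kbTeamTuples(teamSlug: string, kbId: string): Tuple[] {
  const object = `knowledge_base:${kbId}`;
  return [
    { user: `team:${teamSlug}#member`, relation: "reader", object },
    { user: `team:${teamSlug}#member`, relation: "ingestor", object },
    { user: `team:${teamSlug}#admin`, relation: "manager", object },
  ];
}

// Compare the previous and next shared-team lists and produce one batch of
// writes (newly shared teams) and deletes (unchecked teams).
function diffSharedTeams(
  kbId: string,
  previousSharedTeamSlugs: string[],
  nextSharedTeamSlugs: string[],
): { writes: Tuple[]; deletes: Tuple[] } {
  const writes: Tuple[] = [];
  const deletes: Tuple[] = [];
  for (const slug of nextSharedTeamSlugs) {
    if (previousSharedTeamSlugs.indexOf(slug) === -1) {
      writes.push(...kbTeamTuples(slug, kbId));
    }
  }
  for (const slug of previousSharedTeamSlugs) {
    if (nextSharedTeamSlugs.indexOf(slug) === -1) {
      deletes.push(...kbTeamTuples(slug, kbId));
    }
  }
  return { writes, deletes };
}
```

Teams present in both lists produce no tuples at all, so re-saving an unchanged sharing list is a no-op.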

Knowledge sidebar tab gates and empty states (PR 2, 2026-05-27). The Knowledge Base sidebar (KnowledgeSidebar) now consults GET /api/rbac/kb-tab-gates and renders any tab the user cannot see as a disabled-with-tooltip control. Org admins (per the PR 1 super-grant) get every tab true with kb_count=-1 and no empty-state banner. Non-admins get a tab visibility map driven by the count of knowledge_base:<id> objects on which they have can_read (resolved by listing /v1/datasources and filtering via filterResourcesByPermission with bypassForOrgAdmin: false). When has_any_kb=false the sidebar shows a "you don't have access to any knowledge bases yet" banner and the NoKbAccessEmpty component replaces the page-level body for Search / Data Sources / Graph / MCP Tools. The same RAG_ADMIN_BYPASS_DISABLED kill switch disables the org-admin short-circuit on this route, forcing every caller through the per-resource path. The hook fails closed: until the BFF responds every tab is hidden so the UI never exposes a control the BFF would 403.

Slack and Webex bot channel/space team resolution uses Mongo mappings (channel_team_mappings, webex_space_team_mappings) to find the owning CAIPE team. Membership prechecks are OpenFGA-first: the bot checks user:<sub> member team:<slug> and only falls back to legacy teams.members when the PDP is not configured or unavailable. A negative OpenFGA decision denies the bot interaction before OBO so users get the friendly "not a member" response. (Phase 3 of spec 2026-05-24-derive-team-from-channel removed the per-team OBO scope mint — the bot now mints a team-agnostic OBO token and the channel→team mapping is the sole source of team identity downstream.)
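
The OpenFGA-first precheck can be sketched like this. The bots are not necessarily written in TypeScript; the sketch only illustrates the decision order. `PdpResult`, `isTeamMember`, and the `legacyMemberId` parameter are hypothetical, and the sketch assumes the legacy teams.members list holds whatever identifier is passed as `legacyMemberId`.

```typescript
type PdpResult = "allow" | "deny" | "unavailable";

// Hypothetical PDP probe; null models "PDP not configured".
type Pdp = ((user: string, relation: string, object: string) => PdpResult) | null;

function isTeamMember(
  pdp: Pdp,
  legacyMembers: string[],
  sub: string,
  legacyMemberId: string,
  teamSlug: string,
): boolean {
  if (pdp !== null) {
    const result = pdp(`user:${sub}`, "member", `team:${teamSlug}`);
    if (result === "allow") return true;
    // A negative OpenFGA decision is final: no legacy fallback, no OBO.
    if (result === "deny") return false;
  }
  // Only reached when the PDP is not configured or unavailable.
  return legacyMembers.indexOf(legacyMemberId) !== -1;
}
```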

RAG accepts both browser user tokens and ingestor client-credentials tokens from Keycloak. For local Docker Compose, OIDC_DISCOVERY_URL and INGESTOR_OIDC_DISCOVERY_URL may be either the realm base URL (http://keycloak:7080/realms/caipe) or the full .well-known/openid-configuration URL; the server normalizes both forms before fetching metadata. Keycloak service-account tokens use preferred_username=service-account-<client>, so RAG treats that token shape as machine-to-machine and assigns RBAC_CLIENT_CREDENTIALS_ROLE; human tokens are identity-only and use OpenFGA for authorization.
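
A minimal sketch of the two rules above, assuming nothing about the RAG server's actual implementation (which may not be TypeScript). `normalizeDiscoveryUrl` and `isServiceAccountToken` are hypothetical names.

```typescript
const WELL_KNOWN_SUFFIX = "/.well-known/openid-configuration";

// Accept either the realm base URL or the full discovery URL and
// always return the full discovery URL.
function normalizeDiscoveryUrl(value: string): string {
  const trimmed = value.trim().replace(/\/+$/, "");
  return trimmed.endsWith(WELL_KNOWN_SUFFIX)
    ? trimmed
    : trimmed + WELL_KNOWN_SUFFIX;
}

// Keycloak service-account tokens carry preferred_username=service-account-<client>.
function isServiceAccountToken(claims: { preferred_username?: string }): boolean {
  const name = claims.preferred_username ?? "";
  return name.startsWith("service-account-");
}
```

A token for which `isServiceAccountToken` returns true would be treated as machine-to-machine; every other token is identity-only and is authorized through OpenFGA.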

User-facing Role Cleanup​

The Admin UI intentionally separates team/resource authorization from raw Keycloak plumbing:

  • Keycloak system roles (default-roles-caipe, offline_access, uma_authorization) are hidden from the table and role filter because they are OIDC/UMA plumbing, not product permissions.
  • Teams are the human-facing source for membership and most resource grants.
  • Legacy resource roles (agent_user:*, agent_admin:*, tool_user:*, kb_reader:*, task_user:*, skill_user:*) are stale compatibility data only; cleanup scripts can remove them from local/dev realms.

GET /api/admin/users exposes raw Keycloak protocol roles for platform-admin diagnostics. Non-admin callers must hold admin_surface:users#can_read and then receive a self-scoped response containing only their own Keycloak user row. GET /api/admin/users/[id] checks user_profile:<id>#can_read, which is granted by owner user_profile:<sub> for self reads and by organization:<org>#admin for admins. The baseline Users tab can show "my access" without leaking other users; mutation controls remain admin-only. Product authorization should be read through teams and OpenFGA relationships. Local/dev realms can remove stale legacy CAIPE roles with scripts/cleanup-local-keycloak-legacy-roles.py.

Do not delete Keycloak system roles as part of cleanup. They may be required by Keycloak or OIDC flows even though CAIPE hides them from the main admin UX.

External IdP Brokering (Duo SSO, Okta, or any OIDC provider)​

Badge analogy: The partner agency desk. Whether it's Duo SSO, Okta, or any other corporate identity provider, they all speak the same language (OIDC). Keycloak is the single translator: it talks to whichever agency is configured and converts their badges into standard building badges. The rest of the building never needs to know which agency originally issued the contractor's credentials.

Keycloak acts as a relying party to the upstream IdP (OIDC). From the user's perspective it's invisible: they see only the upstream IdP login page. From a security perspective:

Browser ──OIDC auth code flow──▶ Keycloak
                                       │
                                       ──OIDC auth code──▶ Upstream IdP (Duo SSO / Okta / any OIDC)
                                       │
                                       ◀── id_token ───────┘ (external claims: email, name, groups)
                                       │
                                       Preserves external group claims for team sync
                                       Issues new Keycloak JWT with identity claims
                                       │
Browser ◀── Keycloak JWT ──────────────┘

Supported upstream IdPs: the init-idp.sh script configures any OIDC provider generically via OIDC discovery (/.well-known/openid-configuration):

| Provider | IDP_ALIAS (in realm) | IDP_ISSUER example | Notes |
| --- | --- | --- | --- |
| Duo SSO | duo-sso | https://sso-xxx.sso.duosecurity.com/oidc/xxx | Uses firstname/lastname (non-standard); extra IdP mappers handle both given_name and firstname |
| Okta (OIDC) | okta-oidc | https://your-org.okta.com or https://your-org.okta.com/oauth2/default | Standard OIDC claims; groups come from Okta's groups claim (requires Okta app config) |
| Okta (SAML) | okta-saml | n/a | SAML 2.0; configured as a SAML IdP in Keycloak; attribute mappers needed for groups |
| Microsoft Entra ID (OIDC) | entra-oidc | https://login.microsoftonline.com/{tenant-id}/v2.0 | Standard OIDC; groups claim requires Entra app manifest groupMembershipClaims config |
| Microsoft Entra ID (SAML) | entra-saml | n/a | SAML 2.0; common in enterprise M365 environments |
| Generic OIDC | any alias | any OIDC-compliant issuer URL | Works as long as the provider exposes /.well-known/openid-configuration |

To wire up a new IdP, set these env vars and run init-idp.sh (or restart the init-idp container; it is idempotent):

IDP_ALIAS=okta                                      # short alias, used in kc_idp_hint
IDP_DISPLAY_NAME="Okta SSO"                         # shown on Keycloak login page (if visible)
IDP_ISSUER=https://your-org.okta.com                # OIDC issuer URL
IDP_CLIENT_ID=<okta-app-client-id>
IDP_CLIENT_SECRET=<okta-app-client-secret>
IDP_ACCESS_GROUP=caipe-users                        # Okta group → chat_user role (optional)
IDP_ADMIN_GROUP=caipe-admins                        # Okta group → admin role (optional)
KEYCLOAK_ADMIN_FRONTEND_URL=http://localhost:18080  # optional private master-realm admin URL
KEYCLOAK_FORCE_IDP_REDIRECT=true                    # disable local app-realm login fallback
OIDC_IDP_HINT=okta                                  # auto-redirect browser to this IdP alias

**OIDC_IDP_HINT** (set in ui/.env.local) is passed to Keycloak as kc_idp_hint on every auth request. It skips the Keycloak login page entirely and redirects straight to the named IdP. Set it to the same value as IDP_ALIAS.

**KEYCLOAK_FORCE_IDP_REDIRECT=true** makes the app realm configured-IdP only: init-idp.sh sets the browser flow's Identity Provider Redirector defaultProvider to IDP_ALIAS, marks that redirector as required, and disables the local username/password form. This prevents CAIPE users from seeing the Keycloak login screen even if a client omits kc_idp_hint. Keep the master realm admin console on its private URL for operational access.

If the upstream OIDC app requires PKCE on the Keycloak broker flow, enable keycloak.idp.pkce.enabled=true in Helm. The chart passes IDP_PKCE_ENABLED=true and IDP_PKCE_METHOD=S256 to init-idp.sh, which adds pkceEnabled=true and pkceMethod=S256 to the Keycloak OIDC identity-provider config. Leave it disabled when the upstream IdP does not require broker-side PKCE.

**KEYCLOAK_ADMIN_FRONTEND_URL** is optional and only affects the master realm admin console. Use it when public ingress intentionally exposes only /realms/caipe and /resources; the caipe realm issuer and Duo broker redirect remain on the public Keycloak hostname.

In production, the browser-facing issuer is Keycloak, not the upstream IdP. For the Grid RBAC environment the UI uses:

OIDC_ISSUER=https://idp.caipe.example.com/realms/caipe
OIDC_CLIENT_ID=caipe-ui
OIDC_IDP_HINT=duo-sso
NEXTAUTH_URL=https://caipe.example.com

Duo credentials stay on the Keycloak IdP broker only. The Duo application's redirect URI points to Keycloak's broker endpoint (https://idp.caipe.example.com/realms/caipe/broker/duo-sso/endpoint), while the Keycloak caipe-ui client allows NextAuth's callback (https://caipe.example.com/api/auth/callback/oidc). Keycloak must be started with a public hostname such as KC_HOSTNAME=https://idp.caipe.example.com and KC_PROXY_HEADERS=xforwarded so discovery metadata and JWT iss match the public issuer. A host-specific Docker Compose overlay (kept outside this repo) sets those Keycloak values alongside the UI/RAG/Dynamic Agents OIDC_ISSUER overrides; otherwise browser login links can point back at the local dev default (http://localhost:7080).

Claim mapping chain: The IdP sends email, given_name/firstname, family_name/lastname, and groups claims. Keycloak IdP mappers write identity attributes to the local user record. Group claims are input to the identity-group-to-team sync path, which writes OpenFGA team relationships; they are not translated into CAIPE realm roles.

The login sequence diagram (one-time login + the silent first-broker-login flow) lives in Workflows › Login.

Keycloak Auth Reconciliation Job​

Keycloak browser-flow and identity-provider settings are persisted inside Keycloak's database, not in Kubernetes objects. Upgrades can recreate pods and chart resources without automatically reasserting the Identity Provider Redirector, local-login disablement, first-broker-login flow, or required-action settings. The durable design is:

  • Keep an idempotent keycloak-auth-reconcile Job.
  • Make it chart-owned, not a Grid-only extraDeploy override.
  • Run it as an early ArgoCD/Helm sync hook on install and upgrade.
  • Use BeforeHookCreation,HookSucceeded cleanup.
  • Remove any temporary Grid-specific reconcile job once the chart contains the same behavior.
  • Reassert realm token/session lifetimes on upgrade: access tokens remain short-lived at 1 hour, SSO idle timeout is 8 hours, and the absolute SSO max lifespan is 24 hours unless overridden through the Keycloak chart values.

A CronJob is intentionally avoided. Periodic reconciliation would hide ownership drift and repeatedly exercise Keycloak admin credentials when nothing changed. The desired model is one job pod per install/upgrade event, with idempotent Admin API calls that restore the browser-flow and IdP invariants for every downstream install.

User Profile & Custom Attributes​

Keycloak 26+ enforces a user profile schema. Custom attributes are silently dropped unless declared or unmanagedAttributePolicy=ADMIN_EDIT is set on the user profile API. The Helm realm import JSON must not include unmanagedAttributePolicy as a top-level realm field because Keycloak 26.3 rejects that RealmRepresentation property during import. init-idp.sh patches both supported user-profile settings after the server starts:

  • Adds slack_user_id to the user profile schema with admin-only view/edit permissions
  • Sets unmanagedAttributePolicy=ADMIN_EDIT so other Admin API attribute writes succeed
  • Makes firstName and lastName optional, disables Keycloak's VERIFY_PROFILE required-action provider, and clears any assigned VERIFY_PROFILE actions from existing users so enterprise SSO users are never stopped at Keycloak's "Update Account Information" form

The Keycloak container exposes login/API traffic on 8080 and management health on 9000; Helm readiness/liveness probes target the management port.

Account Linking (Slack)​

Three onboarding paths, evaluated in order:

  • Auto-bootstrap (default, SLACK_FORCE_LINK=false): the bot looks up the Slack user's email, finds an existing Keycloak user, and writes slack_user_id silently. Zero user action required.
  • Just-In-Time user creation (default ON, SLACK_JIT_CREATE_USER=true, spec 103): when no existing Keycloak user matches, the bot creates a federated-only shell user via POST /admin/realms/{realm}/users using the same caipe-platform admin credential. Optional domain allowlist via SLACK_JIT_ALLOWED_EMAIL_DOMAINS. 409 races are resolved by re-querying.
  • Explicit link (SLACK_FORCE_LINK=true, or fallback when JIT is off / not allowed / fails): the bot sends an HMAC-signed link prompt; user clicks → SSO login → slack_user_id written via Admin API.

The full sequence (including HMAC URL shape, TTL enforcement, JIT request body, error kinds, and post-link OBO flow) is in Workflows › Slack identity linking.

Account Linking (Webex)​

Webex uses the same Keycloak identity boundary as Slack but stores the Webex person identifier in webex_user_id. The Webex link callback lives in the Web UI backend at /api/auth/webex-link and uses single-use, 10-minute nonces in webex_link_nonces; HMAC links are converted into nonce-backed completion URLs before the user reaches the OIDC session. The callback rejects attempts to bind one Webex person ID to multiple Keycloak users.
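
The nonce rules above (single use, 10-minute lifetime, one Webex person per Keycloak user binding) can be sketched with an in-memory store. This is an illustration only: `LinkNonceStore` and `canBindWebexPerson` are hypothetical names, the real callback persists nonces in webex_link_nonces rather than in a Map, and the clock is passed in explicitly to keep the sketch deterministic.

```typescript
interface NonceRecord {
  webexPersonId: string;
  expiresAt: number; // epoch milliseconds
  used: boolean;
}

class LinkNonceStore {
  private readonly records = new Map<string, NonceRecord>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 10 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  issue(nonce: string, webexPersonId: string, now: number): void {
    this.records.set(nonce, { webexPersonId, expiresAt: now + this.ttlMs, used: false });
  }

  // Returns the bound Webex person id exactly once; null when the nonce is
  // unknown, expired, or already consumed.
  consume(nonce: string, now: number): string | null {
    const rec = this.records.get(nonce);
    if (rec === undefined || rec.used || now >= rec.expiresAt) return null;
    rec.used = true;
    return rec.webexPersonId;
  }
}

// Reject binding one Webex person id to a second Keycloak user.
// `existing` maps webex person id -> Keycloak sub already linked.
function canBindWebexPerson(
  existing: Map<string, string>,
  webexPersonId: string,
  keycloakSub: string,
): boolean {
  const owner = existing.get(webexPersonId);
  return owner === undefined || owner === keycloakSub;
}
```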

For group spaces, the default Webex bootstrap path keeps signed linking URLs out of the shared room. The bot posts only a generic thread notice in the group, then sends the requesting person a 1:1 Adaptive Card with the SSO linking URL. If the 1:1 send fails, the group fallback still avoids posting the signed URL publicly. Slack-style implicit/profile linking is treated as a user-choice path, not the default: it should only be enabled when Webex org and verified-email trust checks can prove the Webex profile maps unambiguously to one Keycloak user.

After linking, the Webex bot exchanges its service-account token for a user OBO token with the selected active team scope. The Webex bot clients are caipe-webex-bot and caipe-webex-bot-admin; the caipe-ui client receives the webex-bot-admin-audience mapper so runtime admin calls can use client-credentials tokens. The full runtime sequence is in Workflows › Webex space ReBAC.


Component 2: CAIPE UI - The Reception Desk

Badge analogy: The reception desk at each department entrance. When you badge in, it reads your chip (JWT), checks your clearance level for this department, and either waves you through or says "sorry, you don't have access here." It doesn't phone HR; the badge chip already carries everything needed to make the decision.

Technically: Next.js App Router with NextAuth (Auth.js v5) for OIDC session management. Every API route handler runs requireRbacPermission() which validates the server-side session and enforces role requirements before proxying to backend services.

Authentication Flow​

1. Browser visits http://localhost:3000
2. NextAuth detects no session → 302 to Keycloak (OIDC auth code flow)
3. Keycloak β†’ Duo SSO (kc_idp_hint=duo-sso auto-redirects, user never sees KC)
4. Duo SSO login β†’ auth code returned to Keycloak
5. Keycloak issues JWT β†’ NextAuth exchanges code for tokens
6. NextAuth stores small session metadata in the encrypted httpOnly cookie
7. Large OAuth tokens (access, refresh, ID token) stay in the UI server's in-process token cache and are rehydrated server-side

Security note: The session cookie is httpOnly, Secure, SameSite=Lax, and encrypted with NEXTAUTH_SECRET. Large OAuth tokens are kept out of the browser cookie to avoid oversized request headers when Keycloak emits RBAC scopes, groups, or relationship-derived claims. If the UI process restarts and the in-process token cache is lost while a browser still has a valid slim session cookie, the session is marked AccessTokenMissing and the token-expiry guard sends the user back through login instead of allowing tokenless backend proxy calls. For multi-replica deployments, use sticky sessions or replace the in-process token cache with a shared store.

Server-Side Authorization (api-middleware.ts)​

// Every protected API route:
const { user, session } = await getAuthFromBearerOrSession(request);
await requireRbacPermission(session, "rag", "kb.query");

Three authorization paths:

  1. Primary PDP: requireRbacPermission() calls Keycloak Authorization Services with the caller's bearer/session access token and the requested resource#scope.
  2. Role-based fallback: hasRoleFallback() checks realm_access.roles from the session JWT when the PDP is unavailable or not configured.
  3. Bootstrap admin path: isBootstrapAdmin(email) still provides a temporary break-glass fallback from BOOTSTRAP_ADMIN_EMAILS, but the same email list is also reconciled by the BFF into durable OpenFGA tuples. Prefer the durable tuple state shown in Admin → Security & Policy → Keycloak, and remove the email fallback once group/team-admin relationships are configured.

Routes that have not yet been rewritten inline no longer remain session-only: the deprecated withAuth() compatibility wrapper now uses getAuthFromBearerOrSession(), resolves the route family to a least-privilege RBAC policy, and calls requireRbacPermission() before invoking the handler. The old generic supervisor umbrella is now split for basic user surfaces: profile and identity-link routes use self_profile#read/write, user search uses user_directory#read, chat/A2A/model discovery uses chat_supervisor#invoke, settings use user_settings#read/write, feedback/NPS uses feedback#submit, session files use user_files#read/write, AI assist uses ai_assist#invoke, credentials use credential_vault#use, and platform settings reads use system_config#read. Unmatched compatibility routes fall back to admin_ui#view for GET and admin_ui#manage for writes instead of a generic baseline-use capability. These user-surface capabilities map to organization-level OpenFGA relations (can_read_self, can_manage_self, can_search_directory, can_chat, can_submit_feedback, can_use_files, can_use_ai_assist, can_use_credentials) that derive from existing organization membership/admin relationships so upgrades preserve current access automatically.
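
The route-family resolution performed by the compatibility wrapper can be pictured as a lookup with an admin-only default. This is a sketch under assumptions: the family keys (`profile`, `settings`, and so on) and the `resolveCapability` helper are hypothetical, only a subset of the capabilities listed above is included, and single-scope capabilities such as chat_supervisor#invoke are assumed to apply to every HTTP method.

```typescript
interface ScopePair {
  read: string;
  write: string;
}

// Family keys are hypothetical labels; the capability strings come from the text above.
const ROUTE_FAMILY_POLICIES: { [family: string]: ScopePair } = {
  profile: { read: "self_profile#read", write: "self_profile#write" },
  settings: { read: "user_settings#read", write: "user_settings#write" },
  session_files: { read: "user_files#read", write: "user_files#write" },
  chat: { read: "chat_supervisor#invoke", write: "chat_supervisor#invoke" },
  feedback: { read: "feedback#submit", write: "feedback#submit" },
};

function resolveCapability(family: string, method: string): string {
  const isRead = method.toUpperCase() === "GET";
  const policy = Object.prototype.hasOwnProperty.call(ROUTE_FAMILY_POLICIES, family)
    ? ROUTE_FAMILY_POLICIES[family]
    : undefined;
  if (policy === undefined) {
    // Unmatched compatibility routes fall back to admin-only capabilities.
    return isRead ? "admin_ui#view" : "admin_ui#manage";
  }
  return isRead ? policy.read : policy.write;
}
```

The important property is the default: a route the table does not know about becomes admin-only rather than inheriting a generic baseline capability.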

Credential APIs additionally keep concrete secret_ref checks for payload and metadata operations. credential_vault#use only opens the credential surface; it does not authorize retrieving or using a specific secret. Slack and Webex runtime access-check APIs likewise require slack_channel:<workspace>--<channel>#can_read or webex_space:<workspace>--<space>#can_read before they evaluate the requested channel/space grant and target user grant, preventing those endpoints from becoming permission oracles for messaging resources the caller cannot inspect. Platform org admins use the standard resource-authz admin bypass because they already hold global organization:<org_key>#can_manage.

For a route-by-route breakdown of which BFF /api/* endpoints use resource-scoped PDP, which still rely on the legacy withAuth wrapper, and which have a user.role === 'admin' bypass, see the PDP Coverage Audit. The audit also documents how to read audit_event_id rows and how to add explicit route capabilities.

Dynamic Agent Execution Gate​

Dynamic Agent execution is a data-plane ReBAC decision, not a Keycloak UMA management-plane decision. The Web UI backend chat proxy routes authenticate the caller, extract the stable session or bearer-token subject, and check OpenFGA before proxying execution to Dynamic Agents:

user:<sub> can_use agent:<agent_id>

For compatibility with existing team data that was originally keyed by email, the Web UI backend and Dynamic Agents runtime check the stable subject first and then fall back to user:<email> can_use agent:<agent_id> when the token carries an email claim. New relationship writers should prefer Keycloak sub values. The UI auth middleware also persists the verified Keycloak subject into MongoDB users.keycloak_sub and users.metadata.keycloak_sub during session or bearer authentication. This gives migrations and admin tooling a durable email-to-sub mapping without depending on transient session cookies.
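
The subject-then-email order can be sketched as below. `Check` and `canUseAgent` are hypothetical names, and the synchronous callback stands in for what is a network call to OpenFGA in practice.

```typescript
// Hypothetical PDP probe: (subject, relation, object) -> allowed.
type Check = (user: string, relation: string, object: string) => boolean;

function canUseAgent(
  check: Check,
  claims: { sub: string; email?: string },
  agentId: string,
): boolean {
  const object = `agent:${agentId}`;
  // 1. Stable Keycloak subject first.
  if (check(`user:${claims.sub}`, "can_use", object)) return true;
  // 2. Legacy email-keyed tuples, only when the token carries an email claim.
  if (claims.email) return check(`user:${claims.email}`, "can_use", object);
  return false;
}
```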

For browser sessions, the Web UI backend forwards the Keycloak access token to Dynamic Agents when it is present so the runtime can bind current_user_token and pass the same bearer to AgentGateway-backed MCP calls. If the slim NextAuth cookie survives a UI restart but the server-side token cache is gone, Dynamic Agents proxy routes still forward the signed-in X-User-Context fallback instead of blocking configuration reads, AI review, or agent save flows. Token-backed AgentGateway tool calls may still require the user to sign in again before they can be probed or invoked.

POST /api/v1/chat/stream/start, POST /api/v1/chat/invoke, POST /api/v1/chat/stream/resume, and POST /api/v1/chat/stream/cancel fail closed before any backend call unless the caller can use the selected agent and can write the target conversation through implicit ownership or an explicit OpenFGA relationship. The older plain SSE proxy at POST /api/chat/stream also forwards the authenticated session access token to the supervisor backend and applies the same implicit-or-explicit conversation write check before proxying.

The Web UI backend emits a unified RBAC Audit event for every OpenFGA agent-use decision, and the Dynamic Agents runtime persists the same structured openfga_rebac event to MongoDB audit_events for direct bearer-token calls. Both use pdp=openfga; the Web UI backend stores the checked tuple in a resource reference shaped like:

user:<sub> can_use agent:<agent_id>

This gives operators a single RBAC Audit view for runtime OpenFGA allows, denies, and PDP-unavailable failures alongside admin ReBAC graph/check actions. The Admin UI's RBAC Audit type filter uses All as a literal unfiltered view over MongoDB audit_events; selecting a specific type narrows the result to auth, openfga_rebac, tool_action, or agent_delegation. The AgentGateway openfga-authz-bridge also writes each external ext_authz decision into the same audit_events collection with source=openfga_authz_bridge, so gateway-level OpenFGA allow/deny/error decisions appear without a trace backend. MongoDB is the durable audit record and the Admin UI reads it directly.

Personal DM Experience - Phase 2 (spec 2026-05-24)

Slack DMs and Webex 1:1 spaces dispatch through a personal chain. (The legacy active_team JWT claim has been removed; see Phase 3 demolition notes above and the deprecated Spec 104 section below.) The BFF owns three new routes, the Web UI broadens its agent-use check to honor team-union grants, and both bots intercept text/slash commands before route resolution.

| Surface | Endpoint | Purpose |
| --- | --- | --- |
| Bot → BFF | POST /api/user/check_agent_access | Pure PDP probe for the DM dispatch chain. Wraps evaluateAgentAccess(subject, agent_id) (direct grant → team-union fallback) and returns {allowed, reason, path, matched_team_slug}. No team scope needed on the token. |
| Bot → BFF | GET /api/user/accessible-agents | Pagination-friendly list of agents the calling user can can_use. Drives /caipe-list (Slack) and list (Webex). |
| Bot → BFF | GET/PUT /api/user/preferences | Per-user saved dm_default_agent_id. PUT {"dm_default_agent_id": null} clears the preference (FR-029a, invoked by /caipe-use default). |
| Web UI | requireAgentUsePermission | New ALLOW_TEAM_UNION audit reason code. When direct user→agent grants miss, the helper probes the caller's team slugs (listUserTeamSlugs) and accepts team:<slug>#member can_use agent:<id>. This aligns the Web UI with the bots, which already honored team-mediated grants. |

The bots' DM dispatch chain is:

  1. Thread/space override (dm_thread_overrides.OverrideStore; LRU capped at 1000 entries, no TTL, cleared on bot restart or explicit /caipe-use default).
  2. Saved preference (user_preferences.dm_default_agent_id via the BFF).
  3. Deployment dm_agent_id (SLACK_INTEGRATION_DM_AGENT_ID / WEBEX_INTEGRATION_DM_AGENT_ID).
  4. Deployment default_agent_id (fallback).

Every candidate is re-checked via POST /api/user/check_agent_access before being returned. A stale override that fails the PDP is auto-cleared with a user-visible notice. A stale saved preference emits a notice but is NOT auto-cleared (the user may be temporarily off-team). Deployment defaults fall through silently on deny; org defaults failing is an ops issue, not something to spam users about. PDP unavailability returns a clean "try again later" response.
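
The dispatch chain and its per-source deny handling can be sketched as a single pass over ordered candidates. This is an illustration, not the bots' code (which may not be TypeScript): `Candidate`, `Resolution`, `resolveDmAgent`, and the notice strings are hypothetical, `isAllowed` stands in for the POST /api/user/check_agent_access probe, and PDP unavailability is deliberately left out of the sketch.

```typescript
type Source = "override" | "preference" | "dm_agent_id" | "default_agent_id";

interface Candidate {
  source: Source;
  agentId: string | null; // null when that step has nothing configured
}

interface Resolution {
  agentId: string | null;
  notices: string[];
  clearOverride: boolean;
}

// `candidates` must already be in chain order: override, preference,
// dm_agent_id, default_agent_id.
function resolveDmAgent(
  candidates: Candidate[],
  isAllowed: (agentId: string) => boolean,
): Resolution {
  const notices: string[] = [];
  let clearOverride = false;
  for (const candidate of candidates) {
    if (candidate.agentId === null) continue;
    if (isAllowed(candidate.agentId)) {
      return { agentId: candidate.agentId, notices, clearOverride };
    }
    if (candidate.source === "override") {
      // Stale override: auto-clear and tell the user.
      clearOverride = true;
      notices.push("override_cleared");
    } else if (candidate.source === "preference") {
      // Stale preference: tell the user, but keep the saved value.
      notices.push("preference_denied");
    }
    // Deployment defaults fall through silently on deny.
  }
  return { agentId: null, notices, clearOverride };
}
```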

Slack registers /caipe-help, /caipe-list, and /caipe-use Bolt commands (see docs/integrations/slack-manifest.md). Webex parses plain-text help / list / use <agent> / use default via text_commands.parse_command_text and intercepts them in handle_webex_message BEFORE route resolution so an unmapped 1:1 space still gets a useful response. Both surfaces are rate-limited per user (default 5 commands per 30s; SLACK_COMMAND_RATE_LIMIT / WEBEX_COMMAND_RATE_LIMIT) and reply ephemerally (Slack response_type=ephemeral; Webex DMs the issuer in group spaces, replies inline in 1:1).
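
A per-user limiter with the default of 5 commands per 30 seconds could look like the sketch below. The text does not say whether the bots use a sliding or fixed window, so the sliding window here is an assumption, as is the `CommandRateLimiter` name; the clock is passed in so the behavior is deterministic.

```typescript
class CommandRateLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 5, windowMs: number = 30000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the command when the user is under the limit.
  allow(userId: string, now: number): boolean {
    const previous = this.hits.get(userId) ?? [];
    // Keep only timestamps that are still inside the sliding window.
    const recent = previous.filter((t) => now - t < this.windowMs);
    if (recent.length >= this.limit) {
      this.hits.set(userId, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}
```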

Credential Exchange Authorization​

Connections & Secrets OAuth tokens are never returned to the browser. Browser users can start or relink OAuth provider connections and can run a server-side profile check, but POST /api/credentials/connections/[connection_id]/profile refreshes the token inside the BFF and returns only redacted provider profile metadata or, for Atlassian, redacted accessible-resource metadata when /me returns 403. The same response includes a redacted diagnostics checklist for the Connections page modal so users can see which validation step passed, failed, or needs follow-up without receiving token material. The Connections page also calls POST /api/credentials/connections/[connection_id]/refresh automatically for the signed-in user's expired or expiring connected providers; that endpoint persists the refreshed token server-side and returns only non-secret refresh metadata.

Raw token exchange is reserved for service callers. POST /api/credentials/exchange rejects browser-origin/session requests, verifies the service bearer JWT through the OIDC JWKS path, requires the credential-service audience header, and can resolve credentials in two ways:

  • provider_connection_id: refreshes that specific connection, returning an access token only when the JWT subject owns it or has delegated use permission.
  • provider: lists provider connections owned by the JWT sub, selects that user's connected provider record, refreshes it, and returns only that user's provider access token.

When a caller asks for a specific connection that is not owned by the JWT subject, the route only returns an access token when the subject has:

user:<service-sub> can_use secret_ref:provider_connection:<connection_id>

This keeps Dynamic Agents and MCP runtimes on a narrow service-to-service path while preserving OpenFGA as the PDP for delegated provider-token use. Dynamic Agents uses this path behind USE_IMPERSONATION_TOKENS=true; Jira MCP receives the exchanged Atlassian token on X-CAIPE-Provider-Token, leaving the normal Authorization header reserved for Keycloak MCP authentication.
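
The ownership-or-delegation rule for a specific connection can be sketched as follows. `Connection`, `Check`, and `mayExchangeConnectionToken` are hypothetical names, and the sketch covers only the authorization decision, not JWT verification or token refresh.

```typescript
// Hypothetical PDP probe: (subject, relation, object) -> allowed.
type Check = (user: string, relation: string, object: string) => boolean;

interface Connection {
  id: string;
  ownerSub: string;
}

function mayExchangeConnectionToken(
  check: Check,
  callerSub: string,
  connection: Connection,
): boolean {
  // Owners can always exchange their own connection.
  if (connection.ownerSub === callerSub) return true;
  // Otherwise the caller needs delegated use on the secret reference.
  return check(
    `user:${callerSub}`,
    "can_use",
    `secret_ref:provider_connection:${connection.id}`,
  );
}
```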

GET /api/credentials/inject/atlassian is available as a BFF-side injector contract for future AgentGateway integrations, but AgentGateway v0.12 does not support backend-level HTTP extAuthz response-header injection. The active Jira path therefore keeps injection in the runtime connector: Dynamic Agents calls credential exchange with the user's Keycloak JWT and passes the resulting Atlassian token to Jira MCP on X-CAIPE-Provider-Token.

OpenFGA Relationship Backfill​

Existing MongoDB team/resource assignments can be reconciled into OpenFGA with scripts/backfill-universal-rebac.ts. The backfill is a production migration, not a demo seed: it reads teams, team_membership_sources, users, platform_config, and dynamic_agents, then writes idempotent OpenFGA tuples plus Mongo provenance in team_membership_sources and rebac_relationships. It records first-run status in rbac_migrations using the stable migration id openfga_relationship_backfill_v1.

For team membership subjects, the backfill prefers users.keycloak_sub, then users.metadata.keycloak_sub, and only falls back to legacy subject fields. If none of those mappings exist, it may use the member email for compatibility; operators should run the migration after users have logged in at least once so the stable subject mapping is available.
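
The subject preference order can be sketched like this. `UserRow`, `resolveMemberSubject`, and especially the `legacy_subject` field are hypothetical; the text above does not name the legacy subject fields, so a single placeholder field stands in for them.

```typescript
interface UserRow {
  email: string;
  keycloak_sub?: string;
  metadata?: { keycloak_sub?: string };
  legacy_subject?: string; // hypothetical stand-in for "legacy subject fields"
}

interface ResolvedSubject {
  subject: string;
  source: "keycloak_sub" | "metadata.keycloak_sub" | "legacy" | "email";
}

function resolveMemberSubject(user: UserRow): ResolvedSubject {
  if (user.keycloak_sub) {
    return { subject: user.keycloak_sub, source: "keycloak_sub" };
  }
  const metaSub = user.metadata ? user.metadata.keycloak_sub : undefined;
  if (metaSub) {
    return { subject: metaSub, source: "metadata.keycloak_sub" };
  }
  if (user.legacy_subject) {
    return { subject: user.legacy_subject, source: "legacy" };
  }
  // Compatibility fallback for users who have never logged in.
  return { subject: user.email, source: "email" };
}
```

Recording the source alongside the subject makes it easy to report how many tuples were written from the email fallback, which is the case operators want to minimize by running the migration after users have logged in.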

The migration materializes team grants such as:

user:<sub> member team:<slug>
user:<sub> admin team:<slug>
team:<slug>#member user agent:<agent_id>
team:<slug>#admin manager agent:<agent_id>
team:<slug>#member caller tool:<tool_prefix>
team:<slug>#member reader knowledge_base:<kb_id>
team:<slug>#member user skill:<skill_id>
team:<slug>#member user task:<task_id>

Skill Hub imports use the same skill:<id> resource model as locally-authored skills. Hub skills are projected into stable catalog ids hub-<hub_id>-<hub_skill_id>, so team grants write team:<slug>#member user skill:hub-<hub_id>-<hub_skill_id>. The skills catalog filters non-admin list responses with can_read skill:<id> and content-bearing runtime responses with can_use skill:<id>; admins keep full catalog visibility. The Skill Hubs admin card can list hub metadata for callers with admin_surface:skills#can_read, but this is operational catalog metadata only. Which hub skills a user can read or run remains enforced through OpenFGA skill:<id> relationships after the hub has been crawled. Locally-created team-visible skills and bulk .zip imports now reconcile selected teams into OpenFGA skill#user relationships as part of save/import. Skill Hubs also persist shared_with_teams; every force-refresh grants those teams access to all refreshed hub skill ids, and the skill_hub_team_grants_backfill_v1 migration does the same for hub skills that were already crawled before the hub-level team policy existed. GitHub Skill Hub crawl/import uses the hub's validated credentials_ref when configured, otherwise falls back to the server-side GITHUB_TOKEN environment variable on caipe-ui. In dev compose, caipe-ui receives GITHUB_TOKEN from .env or the shell, with GITHUB_PERSONAL_ACCESS_TOKEN as a local fallback.

To preserve the default chat path after Dynamic Agent PDP enforcement, the OpenFGA model allows a typed wildcard subject on agent.user, and the migration writes this tuple when a dynamic default agent is configured:

user:* user agent:<default_agent_id>

Default-agent resolution matches the Admin Settings feature: persisted platform_config.default_agent_id first, then DEFAULT_AGENT_ID, then the supervisor fallback. Supervisor fallback is not a Dynamic Agent and does not produce a default-agent OpenFGA tuple. The backfill is still the bulk repair path for existing environments, but the Web UI also reconciles this typed-wildcard grant when an admin saves a default Dynamic Agent, when an admitted user logs in, and before the chat-available Dynamic Agent picker filters candidates through OpenFGA. The picker now also repairs the same typed-wildcard grant for every enabled Dynamic Agent with visibility: "global" before filtering. That keeps the runtime picker OpenFGA-only without requiring an admin to manually provision default-agent or global-agent tuples.
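The resolution order described above can be sketched as follows. The function and type names are hypothetical and the inputs are passed in explicitly; the real code reads `platform_config` from storage and `DEFAULT_AGENT_ID` from the environment.

```typescript
type DefaultAgentResolution =
  | { kind: "dynamic"; agentId: string; source: "platform_config" | "env" }
  | { kind: "supervisor" };

// Persisted platform_config.default_agent_id wins, then DEFAULT_AGENT_ID,
// then the supervisor fallback.
function resolveDefaultAgent(
  persistedId: string | null | undefined,
  envId: string | undefined,
): DefaultAgentResolution {
  if (persistedId) return { kind: "dynamic", agentId: persistedId, source: "platform_config" };
  if (envId) return { kind: "dynamic", agentId: envId, source: "env" };
  return { kind: "supervisor" };
}

// Only a Dynamic Agent default yields the typed-wildcard tuple; the supervisor
// fallback produces no default-agent tuple.
function defaultAgentTuple(
  resolution: DefaultAgentResolution,
): { user: string; relation: string; object: string } | null {
  if (resolution.kind !== "dynamic") return null;
  return { user: "user:*", relation: "user", object: `agent:${resolution.agentId}` };
}
```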

Default agent is public by design

Selecting an agent in Admin → Settings → Default Agent writes the user:* user agent:<id> tuple shown above. Every signed-in user (Web UI and Slack/Webex DMs) then has can_use on that agent, regardless of team membership. To keep that contract visible and reversible:

  • The Admin Settings picker shows a persistent banner explaining the consequence and a confirmation modal on save. PATCH /api/admin/platform-config rejects requests with 400 / PUBLIC_ACCESS_NOT_ACKNOWLEDGED unless acknowledge_public_access: true is included alongside a non-null default_agent_id. Clearing the default (null) does not require the ack; it only revokes the existing wildcard.
  • Each platform-default change emits a structured audit line ([AUDIT] platform_default_agent_changed) with actor, previous, next, and at so log shippers can build an audit trail without a new collection.
  • PUT /api/dynamic-agents rejects demoting visibility: global → team on the current platform default with 409 / AGENT_IS_PLATFORM_DEFAULT, and DELETE /api/dynamic-agents rejects deleting it with the same code. Both paths surface a plain-English message pointing the admin back to Admin → Settings to change the platform default first. The per-agent edit page mirrors this by disabling the visibility selector with an inline note when an agent is the current platform default.
  • The single source of truth for the invariant is ui/src/lib/rbac/platform-default.ts (isPlatformDefaultAgent(id)), which reads platform_config.default_agent_id with the DEFAULT_AGENT_ID env var as a fallback.
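The two rejection rules in the list above can be sketched as pure checks. The status codes and error codes come from the text; the function names, the `Rejection` shape, and the request shapes are hypothetical, and the real handlers do more than this.

```typescript
interface PlatformConfigPatch {
  default_agent_id?: string | null;
  acknowledge_public_access?: boolean;
}

type Rejection = { status: number; code: string };

// Setting a non-null default requires the explicit acknowledgement;
// clearing it (null) does not.
function checkPlatformConfigPatch(patch: PlatformConfigPatch): Rejection | null {
  const settingDefault =
    typeof patch.default_agent_id === "string" && patch.default_agent_id.length > 0;
  if (settingDefault && patch.acknowledge_public_access !== true) {
    return { status: 400, code: "PUBLIC_ACCESS_NOT_ACKNOWLEDGED" };
  }
  return null;
}

type AgentChange = { kind: "delete" } | { kind: "update"; visibility: string };

// The current platform default cannot be deleted or demoted from global to team.
function checkAgentMutation(
  agentId: string,
  platformDefaultId: string | null,
  change: AgentChange,
): Rejection | null {
  if (platformDefaultId === null || agentId !== platformDefaultId) return null;
  if (change.kind === "delete" || change.visibility === "team") {
    return { status: 409, code: "AGENT_IS_PLATFORM_DEFAULT" };
  }
  return null;
}
```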

Per-agent MCP tool restrictions are reconciled separately with scripts/backfill-agent-tool-openfga.ts. That migration reads each dynamic agent's allowed_tools map and reconciles tuples shaped as:

agent:<agent_id> caller tool:<server_id>/<tool_name>
agent:<agent_id> caller tool:<server_id>/*

Run it after enabling signed agent context so existing agents have the same AgentGateway/OpenFGA enforcement as newly-created or edited agents. Apply mode also removes stale agent-tool tuples that no longer match allowed_tools.
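A reconcile of this kind can be sketched as "compute desired tuples, then diff against what exists". This is an illustration only, not the contents of scripts/backfill-agent-tool-openfga.ts: the function names are hypothetical, and the sketch assumes `allowed_tools` maps a server id to a list of tool names, with `*` as the wildcard entry.

```typescript
interface ToolTuple {
  user: string;
  relation: "caller";
  object: string;
}

// Assumed shape: { "<server_id>": ["<tool_name>", ...] }, where "*" means every tool.
function desiredToolTuples(agentId: string, allowedTools: Record<string, string[]>): ToolTuple[] {
  const tuples: ToolTuple[] = [];
  for (const serverId of Object.keys(allowedTools)) {
    for (const toolName of allowedTools[serverId]) {
      tuples.push({
        user: `agent:${agentId}`,
        relation: "caller",
        object: `tool:${serverId}/${toolName}`,
      });
    }
  }
  return tuples;
}

// Both lists are assumed to belong to the same agent, so comparing objects is enough.
function diffToolTuples(existing: ToolTuple[], desired: ToolTuple[]) {
  const existingObjects = new Set(existing.map((t) => t.object));
  const desiredObjects = new Set(desired.map((t) => t.object));
  return {
    toWrite: desired.filter((t) => !existingObjects.has(t.object)),
    // Stale tuples no longer present in allowed_tools are removed in apply mode.
    toDelete: existing.filter((t) => !desiredObjects.has(t.object)),
  };
}
```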

Schema-versioned migration agent_org_admin_inheritance_v1 backfills the organization-admin inheritance tuple for existing Dynamic Agents:

organization:<org>#admin manager agent:<agent_id>

This grants organization admins can_manage through the OpenFGA model without guessing owner teams for legacy agents. New agents get this tuple during create.

Self-service resource creation is PDP-backed. A signed-in user can create a private Dynamic Agent, MCP server, or RAG data source and receives a direct owner tuple (user:<sub> owner <resource>:<id>), which derives read/use/write and manage permissions in OpenFGA. Config-driven and AgentGateway-synced MCP servers seed organization:<org>#member read/use/invoke tuples and organization:<org>#admin manager tuples, so admitted users can discover and use system MCP servers while config-driven records remain immutable through the Web UI mutation APIs. For team-scoped resources, the Web UI backend first checks user:<sub> can_use team:<slug> before creation, then writes team-scoped tuples so team members can use/read the resource and team:<slug>#admin can manage it. MongoDB stores resource metadata such as owner_team_slug, but OpenFGA remains the authorization source of truth.

Token Refresh

NextAuth holds the refresh token and silently refreshes the access token before it expires. The bundled Keycloak realm keeps access tokens at 1 hour, sets SSO idle timeout to 8 hours, and uses a 24-hour absolute SSO max lifespan. As long as the user keeps using the app and Keycloak accepts the refresh token, the UI asks Keycloak for a new access token instead of expiring the browser session based on local access-token staleness. If Keycloak rejects the refresh (invalid_grant), the realm session is revoked, or Keycloak is unavailable, the user is redirected to login. The access token in the session is always the current live token, and it is what gets forwarded to backend services.

Identity Group Sync Hybrid Source Model

Identity Group Sync deliberately has two upstream sources:

  1. OIDC memberOf / groups claims on login: Keycloak imports the upstream IdP groups claim into the idp_groups user attribute and emits it to the caipe-ui client as a multivalued groups claim in ID token/userinfo responses. Login-claim reconciliation is enabled by default; set IDENTITY_SYNC_LOGIN_CLAIMS_ENABLED=false only when a deployment needs to disable it. auth-config.ts extracts the signed-in user's group claims and runs a best-effort reconciliation for only that user. This is additive and fast: it refreshes the user's managed team_membership_sources and OpenFGA user:<sub> member team:<slug> tuples without storing the full group list in the session cookie. Login is not failed if reconciliation cannot run.
  2. Direct Okta directory API for admin dry-runs: /api/admin/identity-group-sync/dry-run can fetch full group inventory from Okta using server-side IdP credentials when fetch_from_provider=true and provider_id is an Okta provider. This path is the authoritative source for scheduled/admin sync because it can see users who are not actively logging in, detect removals, produce drift findings, and surface users that still need identity linking before tuples can be written.

The claim path is not a replacement for direct directory querying. It improves freshness for the current user while the directory connector remains responsible for complete inventory and removals. Admins can also use GET /api/admin/identity-group-sync/claim-suggestions from the Identity Group Sync tab to read the current admin's server-side cached login claim groups, convert them through the same OIDC claim mapper, run existing rules, and review suggested teams for unmatched groups before creating anything. The endpoint intentionally does not call the OIDC userinfo endpoint on demand; if the in-process session claim cache is empty after a UI restart, the admin signs out and back in to refresh the cached claim groups. The UI lets admins filter large AD group sets, select one or more detected groups, and apply a reviewed teams_to_create plan to create those CAIPE teams without granting memberships or deleting anything.

Reviewed admin apply flows can materialize missing teams from teams_to_create when a reviewed rule has auto_create_team=true. Login-time reconciliation is intentionally narrower: it reconciles existing teams only, and never creates teams or grants access to missing teams. Later syncs may remove managed membership sources and matching OpenFGA user:<sub> member/admin team:<slug> tuples when a user's IdP claim or group membership disappears, but Identity Group Sync never deletes teams it previously created. Dry-runs include safety warnings for disruptive removals such as admin membership loss, large removal batches, and teams that would be left without active managed identity-sync memberships. Apply requests that include acknowledged removal risks require an explicit acknowledge_removal_risks=true review flag before the Web UI backend removes access. These warnings are also the operator signal to inspect orphaned or abandoned resource grants on now-empty teams.
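The acknowledgement gate on removals can be sketched as below. The names are hypothetical, the batch threshold is an invented value for illustration, and the sketch covers only two of the warning types named above (admin membership loss and large removal batches).

```typescript
interface PlannedRemoval {
  sub: string;
  teamSlug: string;
  relation: "member" | "admin";
}

// Hypothetical threshold for this sketch; the real large-batch cutoff is not stated here.
const LARGE_REMOVAL_BATCH = 25;

// Collect the safety warnings a dry-run would surface for a set of planned removals.
function removalRisks(removals: PlannedRemoval[]): string[] {
  const risks: string[] = [];
  if (removals.some((r) => r.relation === "admin")) risks.push("admin_membership_loss");
  if (removals.length >= LARGE_REMOVAL_BATCH) risks.push("large_removal_batch");
  return risks;
}

// Risky removals proceed only when the request carries acknowledge_removal_risks=true.
function mayApplyRemovals(removals: PlannedRemoval[], acknowledgeRemovalRisks: boolean): boolean {
  return removalRisks(removals).length === 0 || acknowledgeRemovalRisks;
}
```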

Identity Group Sync admin APIs use the shared getAuthFromBearerOrSession path before requireRbacPermission, so browser sessions and validated first-party bearer tokens both reach the same OpenFGA organization checks. Keycloak identity and user administration APIs follow the same pattern: list/detail/stats require organization can_audit, while self-scoped identity detail reads use user_profile:<id>#can_read. Profile updates, team membership edits, and relationship writes require organization can_manage. Admin observability APIs for skill statistics and checkpoint persistence statistics require organization can_audit before reaching MongoDB-backed metrics; the Prometheus instant/batch proxy requires admin_surface:metrics#can_read so baseline Metrics & Health viewers can load charts. Skill Hub list metadata requires admin_surface:skills#can_read, while hub registration, refresh, update, and deletion remain admin_ui#admin operations. This keeps Playwright persona tests and future service-triggered sync previews aligned with the Web UI backend authorization path.

Manual team management is also provenance-aware. Teams created through /api/admin/teams are stamped with source=manual, status=active, and creator/updater metadata. Manual membership edits create or remove non-managed team_membership_sources rows (source_type=manual, managed=false) so automated Okta/AD/OIDC sync can prune only managed sources. The Team Details members tab reads /api/admin/identity-group-sync/teams/[teamId]/membership-sources, reconstructs the visible member list from active source rows, and displays each member's manual/synced/stale/pending source labels; the embedded teams.members[] array is legacy fallback only. Team-level admins (members with role=owner or role=admin) can fully manage teams they own: rename and description edits (PATCH /api/admin/teams/[id]), team deletion (DELETE /api/admin/teams/[id]), realm role assignments (PUT /api/admin/teams/[id]/roles), agent/tool resource grants (PUT /api/admin/teams/[id]/resources), member add/remove (POST/DELETE /api/admin/teams/[id]/members), and OpenFGA reconciliation (POST /api/admin/teams/[id]/openfga/reconcile). All six routes share a single requireTeamMembershipManagementPermission(session, actorEmail, team) guard in ui/src/lib/rbac/team-admin-guards.ts that first tries requireRbacPermission(session, "admin_ui", "admin") for the platform-admin bypass and falls back to isScopedTeamAdmin(actorEmail, team) for the team-scoped path. Unrelated team edits remain denied unless the caller is a platform admin (issue #1509).
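The two-step guard can be sketched in simplified form. This is not the code in team-admin-guards.ts: the names below are hypothetical, the platform-admin check is reduced to a boolean that would come from the OpenFGA `admin_ui#admin` check, and the member list shape is an assumption.

```typescript
// Hypothetical shapes for this sketch.
interface TeamMemberSketch {
  email: string;
  role: string;
}

interface TeamSketch {
  slug: string;
  members: TeamMemberSketch[];
}

// Team-scoped path: the actor must be an owner or admin of this specific team.
function isScopedTeamAdminSketch(actorEmail: string, team: TeamSketch): boolean {
  return team.members.some(
    (m) => m.email === actorEmail && (m.role === "owner" || m.role === "admin"),
  );
}

// Platform admins bypass the team check; everyone else falls back to the scoped path.
function mayManageTeam(isPlatformAdmin: boolean, actorEmail: string, team: TeamSketch): boolean {
  if (isPlatformAdmin) return true;
  return isScopedTeamAdminSketch(actorEmail, team);
}
```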

OpenFGA ReBAC Admin UI

Admins can create and visualize OpenFGA policy/resource relationships at Admin → Security & Policy → OpenFGA ReBAC.

The older user-facing Policy tab has been removed. It edited CEL tab-visibility and legacy policy surfaces that are no longer part of the operational model. Admin tab visibility is now a deterministic Web UI backend gate (/api/rbac/admin-tab-gates) based on session role plus feature flags; resource authorization is modeled in OpenFGA relationships.

The Admin UI also includes a read-only effective-permissions simulator. Platform admins can add simulate_type=user&simulate_id=<keycloak_sub> or simulate_type=team&simulate_id=<slug>&simulate_relation=member|admin to the Admin URL through the View As Effective Permissions control. The browser stays authenticated as the real admin; the Web UI backend simply evaluates tab gates as the simulated OpenFGA subject (user:<sub> or team:<slug>#admin). Simulation is not Keycloak impersonation, never mints a token for the target principal, and disables mutation-oriented integration panels while previewing.
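Mapping the simulation query parameters to an OpenFGA subject can be sketched as follows. The function name is hypothetical, and returning `null` for a missing or unknown value is an assumption about how invalid input is treated.

```typescript
// Translate simulate_* query parameters into the OpenFGA subject used for tab-gate checks.
function simulatedSubject(params: URLSearchParams): string | null {
  const type = params.get("simulate_type");
  const id = params.get("simulate_id");
  if (!id) return null;

  if (type === "user") return `user:${id}`;

  if (type === "team") {
    const relation = params.get("simulate_relation");
    // Only the two documented relations are accepted.
    if (relation !== "member" && relation !== "admin") return null;
    return `team:${id}#${relation}`;
  }
  return null;
}
```

The result is only ever used as the subject of a check; no token is minted for the simulated principal.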

The UI is intentionally Web UI backend first:

  1. The browser loads a safe catalog from /api/admin/openfga/catalog (teams, dynamic agents, MCP tool prefixes, known KB IDs, universal resources, and OpenFGA status).
  2. The Access Manager combines relationship authoring and effective-access checks in one catalog-driven form. It searches/selects subjects, resources, and actions; previews the derived check relation such as team:platform#member can_use agent:incident-agent; and applies admin grant/revoke mutations through the staged ReBAC change-set API.
  3. The Policy Graph calls /api/admin/rebac/graph and renders tuple usersets as typed nodes and edges so relationships across the universal resource catalog are visible without reading raw tuple rows. Admins can switch between a single-team scope and an all-relationships system scope, open a full-screen graph workspace, search/select catalog resources in the palette, drag resources onto the canvas, connect valid nodes to stage grants, select existing edges to stage revokes, and save the reviewed tuple diff through /api/admin/openfga/tuples.
  4. The OpenFGA Tuples tab is the default sub-tab. It calls /api/admin/openfga/tuples for capped, filtered reads and admin-only deletes, and can be deep-linked with openfgaTab=tuples.

The OpenFGA ReBAC sub-tabs are URL-addressable with openfgaTab=<tab> so admins can share links to specific views. Supported values are tuples, graph, and access; old builder and explorer links open Access Manager, while legacy rag, slack, and webex links canonicalize to Settings or Integrations.

Raw OpenFGA HTTP endpoints stay on the Docker/private service network. The browser never talks to OpenFGA directly, and the Web UI backend only accepts writable tuple shapes that match the CAIPE base model (user:<sub> member team:<slug>, team:<slug>#member user/manager agent:<id>, team:<slug>#member caller tool:<prefix>, and KB base relations). Materialized can_* relations are derived by the OpenFGA model for checks and are rejected on tuple writes.
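A write-side shape check of this kind can be sketched as an allow-list plus a blanket rejection of derived relations. The allow-list below is partial and illustrative (it omits the KB base relations and other types), and all names are hypothetical.

```typescript
interface CandidateTuple {
  user: string;
  relation: string;
  object: string;
}

// Partial, illustrative allow-list of base tuple shapes named in the text.
const WRITABLE_SHAPES: { user: RegExp; relation: string; object: RegExp }[] = [
  { user: /^user:[^#\s]+$/, relation: "member", object: /^team:[^#\s]+$/ },
  { user: /^team:[^#\s]+#member$/, relation: "user", object: /^agent:\S+$/ },
  { user: /^team:[^#\s]+#member$/, relation: "manager", object: /^agent:\S+$/ },
  { user: /^team:[^#\s]+#member$/, relation: "caller", object: /^tool:\S+$/ },
];

function isWritableTuple(tuple: CandidateTuple): boolean {
  // Derived can_* relations are computed by the model and never written directly.
  if (tuple.relation.startsWith("can_")) return false;
  return WRITABLE_SHAPES.some(
    (shape) =>
      shape.relation === tuple.relation &&
      shape.user.test(tuple.user) &&
      shape.object.test(tuple.object),
  );
}
```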

The universal ReBAC catalog lives behind /api/admin/rebac/catalog. It returns the complete protected resource vocabulary, per-type action map, and discovered resource instances from teams, users, dynamic agents, AgentGateway's mcp_gateway:list gate, MCP servers/tools, KB ownership, Slack mappings, Webex mappings, conversations, and built-in admin/system resources. /api/admin/rebac/enforcement-status reports transition state for every resource type (not_gated, role_gated, rebac_shadowed, rebac_enforced, or deprecated) by merging defaults with rebac_enforcement_status overrides. The older OpenFGA admin endpoints use the same session-or-bearer authentication path, and /api/admin/openfga/catalog now embeds these universal resources while preserving its legacy agents, tools, and knowledge_bases picker shape.
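The defaults-plus-overrides merge for enforcement status can be sketched as below. The function name is hypothetical, and ignoring overrides for resource types that have no default is an assumption of this sketch.

```typescript
type EnforcementState =
  | "not_gated"
  | "role_gated"
  | "rebac_shadowed"
  | "rebac_enforced"
  | "deprecated";

// Overrides win over defaults for known resource types.
function mergeEnforcementStatus(
  defaults: Record<string, EnforcementState>,
  overrides: Record<string, EnforcementState>,
): Record<string, EnforcementState> {
  const merged: Record<string, EnforcementState> = { ...defaults };
  for (const resourceType of Object.keys(overrides)) {
    // Assumption: an override for an unknown resource type is ignored.
    if (resourceType in defaults) merged[resourceType] = overrides[resourceType];
  }
  return merged;
}
```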

Policy authoring is staged through policy_change_sets instead of direct browser-to-tuple writes. The Web UI backend creates a draft change set, validates every requested grant/revocation against the universal action vocabulary, delegated-scope guardrails, circular-grant checks, and last-admin risk, then applies the validated diff to OpenFGA and records provenance in rebac_relationships. The OpenFGA admin tab uses this create/validate/apply sequence for Access Manager edits, graph edits, and tuple revocations so administrators see the staged diff before the write is committed.

Graph and access explanation APIs read OpenFGA tuples and join them with rebac_relationships provenance. /api/admin/rebac/graph supports all-relationship views and scoped filters for team, subject, resource, and Slack channel, returning source metadata with each edge. /api/admin/rebac/check runs the same universal relationship check and explains allow outcomes with the recorded source path or deny outcomes with the missing OpenFGA prerequisite. Access Manager is catalog-driven: operators can search/select team, user, Slack channel, Webex space, external group, or service-account subjects and check any catalog resource type/action, including AgentGateway mcp_gateway:list and tool can_call paths. Admins can remediate denied results by creating the selected relationship, or revoke allowed results, through the same staged change-set validation/apply path used by the graph editor. The legacy /api/admin/openfga/graph endpoint delegates to the universal graph service so older UI code gets the same source-aware graph.

Slack channel ReBAC is managed through /api/admin/slack/channels and the per-channel resources/routes/access-check routes under /api/admin/slack/channels/[workspaceId]/[channelId]. The [workspaceId] value is the configured workspace alias from SLACK_WORKSPACE_ALIAS (for example, CAIPE), not Slack's opaque team_id. Channel management is team-owned: assigning a channel to a team writes team:<slug>#member user slack_channel:<workspace>--<channel> and team:<slug>#admin manager slack_channel:<workspace>--<channel>, and per-channel resource/route mutations check can_manage on that Slack channel instead of requiring global Admin UI permission. The top-level Slack channel list is resource-scoped: a non-admin caller sees only channels where OpenFGA grants can_read or can_manage, with can_manage returned for the UI. Admin tab gates also open the Integrations → Slack tab when the caller can manage at least one concrete Slack channel. The admin UI exposes the currently enforced Slack runtime path: channel-agent associations write base OpenFGA tuples such as slack_channel:CAIPE--C0123456789 user agent:<id>; runtime checks ask for derived can_use.

Team-cascade sharing model (intentional). The channel-dispatch access-check at /api/integrations/slack/channels/[workspaceId]/[channelId]/access-check sends user_subject = "team:<slug>#member" (the channel's mapped team) rather than user:<sub>. This is the documented policy: any agent associated with a channel that is mapped to a team is callable in that channel by every member of that team, including members who were never granted the agent directly via user:<sub> can_use agent:<id>. The DM-dispatch chain (POST /api/user/check_agent_access) is user-scoped and is not subject to this cascade. The Slack and Webex ReBAC admin panels surface this trade-off both in the top-of-card "Sharing model" callout and in a per-channel heads-up under the agent-association form. See Workflows → Sharing model: assigning a channel to a team transitively shares its agents for the full rationale.

OpenFGA is the source of truth for whether a Slack channel may invoke a Dynamic Agent. slack_channel_agent_routes is retained only for dependent dispatch metadata such as listen mode and priority, and a metadata row is valid only while the matching OpenFGA tuple exists. The Slack bot resolves candidate agents from OpenFGA first, joins optional Mongo route metadata for ordering/listen filters, and never lets a stale Mongo route keep a deleted OpenFGA association alive. Deleting a channel-agent association removes both the OpenFGA tuple and the saved route metadata row. Route misses fail closed; user-visible Slack notices are reserved for explicit invocations, while ambient plain channel messages stay silent even when route diagnostics are recorded. The Admin Slack Channels panel exposes runtime diagnostics for the selected channel so operators can see OpenFGA read failures, stale Mongo metadata, missing tuples, listen-mode mismatches, and the latest Slack runtime audit error without checking container logs. Fix buttons in diagnostics repair common drift by removing stale route metadata when its OpenFGA tuple is gone, or by switching a tuple-backed route to listen to both mentions and plain messages.
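The "OpenFGA first, Mongo metadata second" join can be sketched as a pure function. The names are hypothetical, and two details are assumptions of this sketch: a tuple without metadata defaults to `listen: "all"` with priority 0, and lower priority values sort first.

```typescript
interface RouteMeta {
  agentId: string;
  listen: "mention" | "message" | "all";
  priority: number;
}

// tupleAgentIds: agents with a live slack_channel -> agent OpenFGA tuple.
// mongoRoutes: rows from slack_channel_agent_routes for the same channel.
function resolveChannelRoutes(tupleAgentIds: string[], mongoRoutes: RouteMeta[]) {
  const allowed = new Set(tupleAgentIds);
  const metaByAgent = new Map<string, RouteMeta>();
  for (const route of mongoRoutes) metaByAgent.set(route.agentId, route);

  // Candidates come only from OpenFGA; Mongo merely enriches them.
  const routes = tupleAgentIds.map(
    (id): RouteMeta => metaByAgent.get(id) ?? { agentId: id, listen: "all", priority: 0 },
  );
  routes.sort((a, b) => a.priority - b.priority);

  // Metadata rows whose tuple is gone are reported as stale and never dispatched.
  const staleAgentIds = mongoRoutes
    .filter((route) => !allowed.has(route.agentId))
    .map((route) => route.agentId);

  return { routes, staleAgentIds };
}
```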

Slack bot deployments now default to SLACK_AGENT_ROUTES_MODE=db_prefer, so OpenFGA-backed UI-managed routes are preferred when present and static Slack bot config remains the fallback; config remains available for static-only environments and db_only is available for canaries that should ignore static route bindings. At runtime, the Slack bot maps any incoming Slack team_id to SLACK_WORKSPACE_ALIAS, resolves the channel's team from channel_team_mappings, mints the user's team-scoped OBO token, selects an OpenFGA-backed channel agent, and authorizes the selected agent before dispatch. The request is denied unless both the channel association and the user's team/resource relationship allow the selected agent.
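The three SLACK_AGENT_ROUTES_MODE values can be sketched as a source selector. The function names are hypothetical, and treating an unset or unrecognized value as `db_prefer` is an assumption of this sketch.

```typescript
type RoutesMode = "db_prefer" | "config" | "db_only";

function parseRoutesMode(raw: string | undefined): RoutesMode {
  if (raw === "config" || raw === "db_only" || raw === "db_prefer") return raw;
  return "db_prefer"; // documented default
}

// dbRoutes: OpenFGA-backed, UI-managed routes. staticRoutes: routes from static bot config.
function selectRoutes<T>(mode: RoutesMode, dbRoutes: T[], staticRoutes: T[]): T[] {
  switch (mode) {
    case "config":
      return staticRoutes;
    case "db_only":
      return dbRoutes;
    case "db_prefer":
    default:
      // Prefer UI-managed routes when present, otherwise fall back to static config.
      return dbRoutes.length > 0 ? dbRoutes : staticRoutes;
  }
}
```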

For hands-off channel onboarding, operators may set SLACK_AUTO_ASSIGN_UNMAPPED_CHANNELS=true with SLACK_DEFAULT_TEAM_SLUG and SLACK_DEFAULT_AGENT_ID. When a group-channel message arrives and no active channel_team_mappings row exists, the Slack bot writes the configured channel-team mapping, writes slack_channel:<workspace_alias>--<channel_id> user agent:<default_agent_id> to OpenFGA, and stores a slack_channel_agent_routes metadata row with listen: all. The feature is disabled by default in Helm and fails closed if MongoDB, OpenFGA, the default team, or either required env var is missing; existing active channel mappings are never overwritten.

For migrations, the Slack Channels panel includes Slack Channel Association Default backed by GET/POST /api/admin/slack/channels/defaults. The UI shows the currently configured default team and Dynamic Agent from SLACK_DEFAULT_TEAM_SLUG and SLACK_DEFAULT_AGENT_ID. Admins may apply those defaults to all managed channels, or use bot-member discovery to select individual channels and override the team and Dynamic Agent per selected row. The Web UI backend writes the selected channel-team mappings, ensures slack_channel:<workspace_alias>--<channel_id> user agent:<id>, ensures team:<slug>#member user agent:<id> for each selected team/agent pair, ensures the inbound team:<slug>#member user slack_channel:<workspace>--<channel> and team:<slug>#admin manager slack_channel:<workspace>--<channel> visibility tuples (so the channel actually shows as Setup completed in the listing — /api/admin/slack/channels filters each row by can_read and silently drops channels with no inbound team→channel tuples), and optionally creates matching bootstrap routes in slack_channel_agent_routes. Those bootstrap routes are stamped with source_type: "bootstrap" and users.listen: "all" so the bot responds to both @mentions and plain channel messages by default — admins who want quieter behaviour can narrow individual routes to mention or message from the Step-2a route picker. The same listen: "all" default applies to route rows the Web UI lazily materialises from an OpenFGA tuple that has no Mongo metadata yet (the "ghost route" path in /api/admin/slack/channels/{workspaceId}/{channelId}/routes); the equivalent Webex spaces endpoint mirrors this default. This is intentionally an explicit bulk write rather than an OpenFGA wildcard/default subject, so every relationship appears in the tuple store and Policy Graph. 
The shared helpers slackChannelTeamVisibilityRelationships and webexSpaceTeamVisibilityRelationships are used by the onboarding writers and the messaging_team_visibility_v1 migration so admin-PUT, onboarding-defaults, and the backfill path all converge on identical tuple shapes.

The Slack Channels panel also includes Slack Bot Runtime Sync for the running bot process. Browser requests still terminate at the Web UI backend: caipe-ui checks the signed-in user's admin_ui#admin permission, obtains a Keycloak client-credentials token for the Slack bot admin audience, and calls the Slack bot's internal admin API. The Slack bot verifies that token with Keycloak JWKS before returning route-cache status, clearing its in-memory route cache, or upserting static YAML channel-agent routes into slack_channel_agent_routes and OpenFGA. Local no-SSO development can opt into an explicit dev-token path with SLACK_BOT_ADMIN_DEV_AUTH_ENABLED=true on the Web UI and SLACK_ADMIN_DEV_AUTH_ENABLED=true on the bot, with matching dev token values; this bypasses Keycloak only for the internal Slack bot admin API and must not be enabled in shared environments. The sync operation is intentionally upsert-only: it creates missing records and updates matching channel/agent metadata, but it does not delete existing UI-managed associations that are absent from static config.

Webex space ReBAC follows the same team-ownership shape with Webex-specific types and storage: webex_space:<workspace_alias>--<space_id> user agent:<id> is the OpenFGA source of truth, while webex_space_agent_routes stores dependent dispatch metadata such as listen mode, priority, and enabled state. Team-space assignment writes team:<slug>#member user webex_space:<workspace>--<space> and team:<slug>#admin manager webex_space:<workspace>--<space>, and per-space grant/route/diagnostic APIs check the derived Webex space permissions. The top-level Webex space list is also resource-scoped, and the Integrations → Webex tab appears for non-admin users who can manage at least one concrete webex_space. The Webex bot never trusts workspace identifiers from incoming Webex events; policy namespace selection comes from WEBEX_WORKSPACE_ALIAS or WEBEX_WORKSPACE_ID. Route reads use server-side OpenFGA tuple filters for the selected webex_space subject and fail closed on PDP outages.

Threaded Webex replies are anchored with Webex parentId. After an allow decision and before Dynamic Agent dispatch, the bot may fetch bounded prior thread context from the Webex Messages API: the root message plus recent replies filtered by the same parentId and capped by WEBEX_THREAD_CONTEXT_MAX_MESSAGES / WEBEX_THREAD_CONTEXT_MAX_CHARS. The context is sent only to the already selected and authorized Dynamic Agent under the user's OBO token; fetch failures do not weaken authorization and fall back to sending only the current message. Bot replies include the selected agent_id and tell users to continue in the same Webex thread. Whether the bot processes follow-up posts still depends on route listen mode: mention, message, or all.
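Bounding thread context by message count and character budget can be sketched as follows. The exact capping rules are not specified above, so this sketch assumes the root is always kept (truncated if it alone exceeds the budget), replies arrive oldest to newest, and the newest replies win the remaining budget. The function name is hypothetical.

```typescript
// maxMessages and maxChars stand in for WEBEX_THREAD_CONTEXT_MAX_MESSAGES and
// WEBEX_THREAD_CONTEXT_MAX_CHARS.
function boundThreadContext(
  root: string,
  replies: string[],
  maxMessages: number,
  maxChars: number,
): string[] {
  if (maxMessages < 1 || maxChars < 1) return [];

  const keptRoot = root.slice(0, maxChars);
  let budget = maxChars - keptRoot.length;

  // Keep at most (maxMessages - 1) of the most recent replies.
  const recent = replies.slice(Math.max(0, replies.length - (maxMessages - 1)));

  const kept: string[] = [];
  // Walk newest to oldest so the most recent replies use the budget first.
  for (let i = recent.length - 1; i >= 0; i--) {
    if (recent[i].length > budget) break;
    kept.unshift(recent[i]);
    budget -= recent[i].length;
  }
  return [keptRoot, ...kept];
}
```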

The Webex Spaces panel includes diagnostics and runtime sync through /api/admin/webex/* BFF routes. The Web UI backend obtains a caipe-webex-bot-admin audience token, calls the internal Webex bot admin API, and the bot verifies that token with Keycloak JWKS. Runtime sync is upsert-only: it creates or updates configured webex_space_agent_routes rows and corresponding OpenFGA tuples, but it does not delete UI-managed associations absent from static config. Diagnostics compares tuple-backed agents with Mongo route metadata and offers one-click repairs for zero-agent spaces, stale metadata, and listen-mode mismatches; the zero-agent repair creates a default/selected agent association with listen: all through the same route API used by manual association saves.

For opt-in onboarding, WEBEX_AUTO_ASSIGN_UNMAPPED_SPACES=true with WEBEX_DEFAULT_TEAM_SLUG and WEBEX_DEFAULT_AGENT_ID creates an explicit space-team mapping, route metadata row, and OpenFGA tuple for a previously unmapped space. The feature is disabled by default, writes MongoDB before OpenFGA to avoid orphan grants, rolls back on failure, and never overwrites an existing active space mapping. The onboarding writer (webex-space-onboarding.ts) also emits the inbound team:<slug>#member user webex_space:<workspace>--<space> and team:<slug>#admin manager webex_space:<workspace>--<space> visibility tuples so the space surfaces in /api/admin/webex/spaces (which filters each row by can_read). Previously-onboarded spaces are backfilled by the same messaging_team_visibility_v1 migration that handles Slack channels; both surfaces share the helper builders so admin-PUT, onboarding-defaults, and the backfill emit identical tuple shapes.

Future PDP consolidation note: OpenFGA should remain the source of truth for all relationship decisions, but the OpenFGA auth bridge should not be treated as the universal application PDP until it exposes a stable, domain-neutral JSON authorization API in addition to its Envoy ext_authz adapter. Until then, keep the bridge focused on network enforcement for AgentGateway/MCP traffic and keep Slack using /api/admin/slack/channels/[workspaceId]/[channelId]/access-check for domain-aware dispatch checks. The later consolidation path is to extract shared OpenFGA decision helpers and audit/result shapes first, then optionally let Slack, Web UI backend routes, and the bridge call a common PDP service rather than duplicating tuple logic.

Legacy Keycloak realm roles may still appear in old local data, but they are not an authorization source. /api/rbac/enforcement-comparison remains available only as an engineer-facing migration aid for comparing stale role-shaped data with ReBAC decisions for a selected subject/action/resource.

Key Environment Variables

| Variable | Purpose | Security note |
| --- | --- | --- |
| OPENFGA_RECONCILE_ENABLED | Enables Team Resources → OpenFGA tuple reconciliation in the Web UI backend | Defaults to false so non-RBAC local UI runs do not require OpenFGA; enable only when the OpenFGA profile is healthy. |
| OPENFGA_HTTP | Docker-internal OpenFGA HTTP API URL used by the Web UI backend tuple writer and Slack bot route resolver | Keep this on the private service network; do not point browser clients at OpenFGA. |
| OPENFGA_STORE_NAME / OPENFGA_STORE_ID | Selects the OpenFGA store for tuple writes | Prefer OPENFGA_STORE_ID in locked-down deployments to avoid discovery ambiguity. |
| BOOTSTRAP_ADMIN_EMAILS / RBAC_BOOTSTRAP_ADMIN_EMAILS | Comma-separated initial admin emails consumed by the Web UI BFF bootstrap reconciler; RBAC_BOOTSTRAP_ADMIN_EMAILS overrides the legacy fallback env var when set | Keep the list short. The BFF resolves emails to Keycloak sub values and writes durable OpenFGA tuples; do not hardcode user UUID tuples in Helm values for normal admin bootstrap. |
| OPENFGA_SEED_TUPLES | JSON list of exact OpenFGA tuple keys consumed by the OpenFGA init hook after the authorization model is loaded | Chart-generated from openfga.init.seedTuples; reserve for non-user emergency tuples or recovery. Human bootstrap admins should use BOOTSTRAP_ADMIN_EMAILS so Keycloak UUIDs are resolved automatically. |
| AGENT_GATEWAY_ADMIN_URL | Optional Web UI backend URL for AgentGateway admin config discovery; defaults to http://agentgateway:15000/config | Keep the AgentGateway admin port on the private service network. The browser calls only the Web UI backend discovery/sync APIs, which require mcp_server:agentgateway#can_discover for discovery and mcp_server:agentgateway#can_manage for sync. |
| AGENT_GATEWAY_URL | AgentGateway data-plane base URL used when onboarding discovered MCP targets; defaults to http://agentgateway:4000 and the UI backend appends /mcp when needed | AgentGateway-discovered MCP server records should route through this URL so JWT/authz enforcement remains on the gateway path. The backend target URL from AgentGateway config is stored only as operator metadata. |
| AGENTGATEWAY_CONFIG_BRIDGE_POLL_SECONDS | Docker Compose local-dev poll interval for the AgentGateway config bridge that renders standalone MCP routes from MongoDB mcp_servers rows | Local-only control plane helper. It writes only the shared generated AgentGateway config volume; Kubernetes uses native AgentgatewayBackend and HTTPRoute resources instead. |
| CAIPE_AGENT_CONTEXT_HMAC_SECRET | Shared secret used by Dynamic Agents and the OpenFGA authz bridge to sign/verify agent_id context for per-agent MCP tool enforcement | Store only in runtime secrets. When unset, AgentGateway still enforces the coarse user mcp_gateway:list gate, but the bridge cannot enforce derived agent:<id> can_call tool:<server>/<tool> decisions. |
| CAIPE_CREDENTIALS_ENABLED / CREDENTIAL_STORE_BACKEND | Enables the Connections & Secrets surface and selects the MongoDB envelope credential backend | Defaults disabled. Browsers can create or rotate credential values, but raw retrieval is limited to server-to-server callers. |
| CREDENTIAL_KEY_PROVIDER / CREDENTIAL_KMS_CMK_ID / CREDENTIAL_KMS_REGION | Selects the credential data-key wrapper. Local development uses local-cmk; production should use aws-kms with a real CMK. | local-cmk and legacy dev-local fail closed in production. Do not put real CMK secrets in ConfigMaps; production KMS access must come from runtime identity and least-privilege key policy. |
| CREDENTIAL_BOOTSTRAP_OAUTH_CONNECTORS / GITHUB_* / CONFLUENCE_* / WEBEX_* / PAGERDUTY_* / GITLAB_* | Lets the caipe-ui TypeScript startup bootstrap idempotently seed global GitHub, Atlassian/Confluence, Webex, PagerDuty, and GitLab OAuth connector records from environment variables | Docker Compose reads these from .env; Kubernetes must source them through ESO/ExternalSecret. Provider client secrets must never be placed in ConfigMaps or logs and are immediately written through MongoDB envelope encryption. |
| CREDENTIAL_SERVICE_AUDIENCE / CREDENTIAL_API_URL | Audience and service URL used by Dynamic Agents and other internal services when retrieving secret refs or exchanging provider connections | Must match the issued service/OBO token audience. Browser-origin, session-only, and wrong-audience retrieval/exchange requests are denied before credential lookup. |
| MONGODB_URI / MONGODB_DATABASE | Enables Python OpenFGA audit writers, including Dynamic Agents and openfga-authz-bridge, to persist durable openfga_rebac rows into audit_events | Store MONGODB_URI in runtime secrets for Helm/production; dev compose uses the local MongoDB service. |
| SLACK_AGENT_ROUTES_MODE | Slack bot route source: db_prefer (default; prefer OpenFGA-backed UI-managed channel-agent routes, fall back to static config), config, or db_only | db_prefer and db_only require OpenFGA access; MongoDB is used only to enrich tuple-backed routes with listen/priority metadata. Use config only for static-only environments that should ignore UI-managed channel routes. |
| SLACK_INTEGRATION_SILENCE_ENV | Initial setup switch that makes the Slack bot ignore inbound payloads before handlers can send user-visible Slack responses | Use only during bootstrap or broken-route setup windows. Admin/runtime diagnostics remain the place to inspect OpenFGA route health while end-user channel noise is suppressed. |
| SLACK_WORKSPACE_ALIAS | Canonical Slack workspace namespace used by the Web UI backend, Slack bot, Mongo route/grant rows, and OpenFGA slack_channel:<alias>--<channel_id> subjects | Configure per deployment (for example, CAIPE or Splunk). The Slack bot maps incoming Slack team_id values to this alias before route and ReBAC lookups. |
SLACK_BOT_TOKENWeb UI backend Slack Web API token used only for admin channel discovery in the Slack Channel Setup flowSource from Vault/ExternalSecret, normally the same bot token used by slack-bot. Never place the value in ConfigMaps or logs.
DISCOVERY_CACHE_TTL_MINUTESBootstrap default for the in-process cache TTL on /api/admin/slack/available-channels and /api/admin/webex/available-spaces; defaults to 60 and is overridden at runtime by platform_config.discovery_cache_ttl_minutesAdmins set the live value via the Discovery cache popover next to the connector discovery button on Admin β†’ Integrations β†’ Slack and Admin β†’ Integrations β†’ Webex (range 0–1440; 0 disables caching). The env var only sets the bootstrap value when no DB override exists. The same popover exposes a per-provider Refresh from Slack/Webex now button that drops the snapshot immediately for ad-hoc bot-membership changes.
SLACK_AGENT_ROUTES_ENABLEDLegacy rollout alias; when true and SLACK_AGENT_ROUTES_MODE is unset, behaves as SLACK_AGENT_ROUTES_MODE=db_preferPrefer SLACK_AGENT_ROUTES_MODE for new deployments so the fallback behavior is explicit.
SLACK_AGENT_ROUTES_TTL_SECONDSSlack bot in-process cache TTL for OpenFGA-backed channel agent routes; defaults to 60Short TTLs make UI route changes visible faster at the cost of more OpenFGA reads and Mongo metadata joins.
CAIPE_PLATFORM_AUDIENCEAudience requested by Slack/Webex OBO exchanges for bot β†’ CAIPE UI BFF access checks; defaults to caipe-platformKeep this aligned with the Keycloak client accepted by the Web UI backend. Do not use agentgateway for bot pre-dispatch access checks because the next hop is the BFF.
WEBEX_THREAD_CONTEXT_ENABLEDEnables Webex bot thread-context fetch before Dynamic Agent dispatch; defaults to trueReads only messages visible to the bot in the same Webex thread and sends bounded context to the authorized agent under the user's OBO path. Set to false where message-history minimization is required.
WEBEX_THREAD_CONTEXT_MAX_MESSAGESCaps prior Webex thread replies fetched with the Webex Messages API; defaults to 10Keep this low to limit prompt size and data exposure.
WEBEX_THREAD_CONTEXT_MAX_CHARSCaps formatted Webex thread context sent to Dynamic Agents; defaults to 4000Prevents unbounded prompt growth and avoids sending entire long conversations to downstream agents.
TENANT_ID / AUDIT_SUBJECT_SALTControls tenant scoping and privacy-preserving subject hashing for Python OpenFGA audit eventsKeep the salt stable per environment so subject hashes remain correlatable without storing raw tokens.
AUTHZ_TRACING_ENABLEDEnables optional Web UI backend OpenFGA/ReBAC OTLP span exportDefaults off in dev compose. Trace spans are observational only; do not put raw tokens, request bodies, or PII in span attributes.
OTEL_EXPORTER_OTLP_TRACES_ENDPOINTOptional OTLP HTTP endpoint for Web UI backend authz spansLeave unset unless an external collector is explicitly configured. RBAC Audit uses MongoDB audit_events and does not need a trace backend.
KEYCLOAK_ADMIN_CLIENT_IDConfidential Keycloak client used by Web UI backend admin APIs for Keycloak Admin REST calls such as user listing, role assignment, client inspection, and Keycloak RBAC OBO permission repairUse a service-account client with only the required realm-management roles: user roles (view-users, query-users, manage-users), client roles (query-clients, view-clients, manage-clients), and authorization roles (view-authorization, manage-authorization). Production should not rely on the dev admin-cli password-grant fallback.
KEYCLOAK_ADMIN_CLIENT_SECRETMatching client secret for KEYCLOAK_ADMIN_CLIENT_IDStore in Vault/ExternalSecret/Kubernetes Secret only; never commit the secret value.
KEYCLOAK_ACCESS_TOKEN_LIFESPANKeycloak init/reconcile job override for realm access-token lifetime; chart default is 3600 secondsKeep access tokens short and rely on refresh tokens for active sessions.
KEYCLOAK_SSO_SESSION_IDLE_TIMEOUTKeycloak init/reconcile job override for realm SSO idle timeout; chart default is 28800 secondsThis is the user-facing idle logout window. Increasing it should be a deliberate security decision.
KEYCLOAK_SSO_SESSION_MAX_LIFESPANKeycloak init/reconcile job override for absolute realm SSO max lifespan; chart default is 86400 secondsMust be longer than the idle timeout if active users should keep refreshing throughout the workday.
OIDC_ACCEPTED_AUDIENCESAdditional bearer JWT audiences accepted by the Web UI backendThe dev compose stack defaults this to caipe-platform so RBAC persona tokens minted by the Keycloak resource-server client can exercise Web UI backend routes; production deployments should set the narrow audience list they actually issue.
IDENTITY_SYNC_LOGIN_CLAIMS_ENABLEDControls best-effort login-time reconciliation from OIDC group claimsDefaults on; set to false to disable. Login remains best-effort and must not depend on directory sync health.
IDENTITY_SYNC_OIDC_CLAIM_PROVIDER_IDProvider id used to select mapping rules for claim-derived syncDefaults to oidc-claims; keep separate from direct Okta providers so provenance stays clear.
IDENTITY_SYNC_OKTA_ORG_URL / IDENTITY_SYNC_OKTA_API_TOKENServer-side Okta Management API connector for full inventory dry-runsStore the token in runtime secrets only; never expose it to the browser or commit it.

The deploy/keycloak/init-idp.sh bootstrap keeps the IdP group importer on per-mapper syncMode=FORCE, so the idp_groups attribute is refreshed on login without resetting unrelated user attributes such as Slack links. The same idempotent init job may seed identity-only test personas before e2e runs. The caipe-ui mapper intentionally leaves access.token.claim=false to avoid sending large group arrays through every downstream bearer-token path.


Component 3: Supervisor A2A Server - The Dispatcher

Badge analogy: The dispatcher at the internal mail room. When you drop off a work order, they scan your badge, note your name and clearance on the paperwork, and attach a photocopy of your badge to every sub-order sent to other departments. Downstream departments never need to ask who initiated the original request, because it's stapled to everything.

Technically: A Starlette/FastAPI application running the LangGraph multi-agent supervisor. It has a layered middleware stack. The JWT is validated once at the outer layer, then decoded and stored in a per-request contextvar by JwtUserContextMiddleware so all downstream code can read user identity without re-parsing the header.

Middleware Stack (outermost → innermost)

CORSMiddleware
│
PrometheusMetricsMiddleware (metrics, skips /health)
│
OAuth2Middleware / SharedKeyMiddleware (validates JWT signature + expiry)
│
JwtUserContextMiddleware (decodes claims → stores in contextvar)
│
A2A request handler + LangGraph agent

JwtUserContextMiddleware is intentionally read-only. It does not re-validate the token; that's already done by the auth middleware above it. It decodes the JWT payload without verification, calls the OIDC userinfo endpoint (cached 10 min) for authoritative email/name/groups, and stores the result in a ContextVar:

# Set once per request by JwtUserContextMiddleware
_jwt_user_context_var: ContextVar[JwtUserContext | None]

# Read anywhere in the same request (agent executor, tools, sub-calls)
ctx = get_jwt_user_context()
# ctx.email, ctx.name, ctx.groups, ctx.token
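
The pattern can be sketched with the standard library alone. This is an illustration, not the supervisor's actual module: the JwtUserContext fields follow the attributes listed above, while decode_unverified_payload, set_jwt_user_context, and the ContextVar name and default are assumptions, and the userinfo lookup is left out.

```python
import base64
import json
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JwtUserContext:
    email: Optional[str]
    name: Optional[str]
    groups: List[str] = field(default_factory=list)
    token: str = ""


# Assumed name and default; the real module may declare this differently.
_jwt_user_context_var: ContextVar[Optional[JwtUserContext]] = ContextVar(
    "jwt_user_context", default=None
)


def decode_unverified_payload(token: str) -> dict:
    """Decode the JWT payload segment WITHOUT verifying the signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def set_jwt_user_context(token: str) -> None:
    """Hypothetical middleware step: store identity once per request."""
    claims = decode_unverified_payload(token)
    _jwt_user_context_var.set(
        JwtUserContext(
            email=claims.get("email"),
            name=claims.get("name"),
            groups=list(claims.get("groups", [])),
            token=token,
        )
    )


def get_jwt_user_context() -> Optional[JwtUserContext]:
    """Read the identity anywhere later in the same request context."""
    return _jwt_user_context_var.get()
```

Because the payload is decoded without signature verification, a helper like this is only safe behind the auth middleware that has already validated the token.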

JWT Forwarding to MCP Tools

When FORWARD_JWT_TO_MCP=true, the supervisor forwards the original, unmodified bearer token from the incoming request to AgentGateway. This means:

  • The token that reaches AgentGateway has sub = the real user (or OBO token with act.sub = bot)
  • AgentGateway can evaluate the user's actual roles, not the supervisor's service account
  • MCP servers that do their own JWT validation (e.g. RAG) see the real user identity

User JWT → Supervisor → (same JWT) → AgentGateway → MCP Server

Security implication: The supervisor must not modify or strip the bearer token before forwarding. If it substituted its own service account token, the entire per-user authorization chain would collapse.
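
A minimal sketch of the passthrough rule, assuming request headers arrive as a plain dict. build_mcp_headers is a hypothetical helper, and the forward_jwt argument stands in for the FORWARD_JWT_TO_MCP setting.

```python
def build_mcp_headers(incoming_headers: dict, forward_jwt: bool) -> dict:
    """Build outbound headers for the AgentGateway hop.

    The caller's Authorization header is copied verbatim; it is never
    re-minted or replaced with a service-account token.
    """
    outbound = {"Content-Type": "application/json"}
    if not forward_jwt:
        return outbound
    for name, value in incoming_headers.items():
        if name.lower() == "authorization":  # header names are case-insensitive
            outbound["Authorization"] = value
            break
    return outbound
```

With forwarding disabled, or with no incoming bearer, the outbound request simply carries no Authorization header, and AgentGateway's strict jwtAuth would be expected to reject it.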

Key Environment Variables

| Variable | Purpose | Security note |
| --- | --- | --- |
| A2A_AUTH_OAUTH2=true | Enable JWT signature validation | Off in dev; mandatory in prod |
| A2A_AUTH_SHARED_KEY | Shared-key auth alternative | Use only for service-to-service; not for user-facing flows |
| ENABLE_USER_INFO_TOOL=true | Extract identity from JWT (vs. "by user: email" prefix) | The JWT is the authoritative source; prefer this over message prefix |
| FORWARD_JWT_TO_MCP=true | Forward incoming JWT to MCP tools | Required for per-user enforcement at AgentGateway |
| ISSUER / OIDC_ISSUER | OIDC issuer for userinfo endpoint discovery | Must match iss claim in tokens |

Component 4: AgentGateway - The Security Checkpoint

Badge analogy: The armed security checkpoint at the entrance to the server room. Everyone must badge in: no exceptions, no tailgating. The checkpoint verifies the badge locally, then calls the central relationship desk (OpenFGA) to ask whether this person is allowed through.

Technically: AgentGateway is the single Policy Enforcement Point (PEP) for all MCP tool calls. It proxies HTTP/SSE requests to registered MCP backend servers, validates the Keycloak JWT, and calls OpenFGA through extAuthz for the PDP decision before allowing each request through. MCP servers still mount a shared custom middleware package for authentication defense-in-depth (JWT/shared-key validation, token passthrough context, and an optional local-dev localhost bypass). For embedded/local MCP servers that do not sit behind AgentGateway, the same package can also perform an optional Keycloak PDP scope check (for example mcp_jira#invoke) so they still have a real authz gate.

Request Flow

Supervisor POST /rag/v1/query
Authorization: Bearer <JWT>
│
▼
AgentGateway
┌──────────────────────────────────────────┐
│ 1. Extract JWT from Authorization header │
│ 2. Validate signature against JWKS       │
│ 3. ext_authz → OpenFGA Check             │
│ 4a. OpenFGA DENY → 403 Forbidden         │
│ 4b. OpenFGA ALLOW → proxy to MCP server  │
└──────────────────────────────────────────┘
│ ALLOW
▼
RAG MCP Server
(receives same JWT for its own validation)

Authorization Model

AgentGateway uses jwtAuth for authentication and extAuthz for authorization. The openfga-authz-bridge adapts Envoy's gRPC authorization check into an OpenFGA Check, so gateway authorization is maintained through ReBAC tuples rather than CEL policy authoring. For observability and compliance, the bridge also writes a best-effort openfga_rebac document to MongoDB audit_events for every terminal authorization result: missing subject, OpenFGA allow, OpenFGA deny, and OpenFGA unavailable. These writes never affect the allow/deny response returned to AgentGateway.
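
The "audit never affects the decision" rule can be sketched as follows. This is an illustration under assumptions, not the bridge's code: decide_with_audit, the check and write_audit callables, and the outcome labels are hypothetical names.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("authz_bridge_sketch")


def decide_with_audit(
    subject: Optional[str],
    check: Callable[[str], bool],
    write_audit: Callable[[dict], None],
) -> bool:
    """Return the allow/deny decision; the audit write is best-effort only."""
    if not subject:
        outcome, allowed = "missing_subject", False
    else:
        try:
            allowed = bool(check(subject))
            outcome = "allow" if allowed else "deny"
        except Exception:
            # OpenFGA unreachable: fail closed.
            outcome, allowed = "unavailable", False

    try:
        write_audit({"type": "openfga_rebac", "subject": subject, "outcome": outcome})
    except Exception:
        # A failed audit write is logged and swallowed; the decision stands.
        logger.warning("audit write failed; decision unchanged", exc_info=True)
    return allowed
```

Every terminal outcome produces one audit attempt, and a MongoDB failure can neither turn a deny into an allow nor block an allowed request.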

Data-Plane Ingress

The Helm chart can expose AgentGateway's MCP data path with agentgateway.ingress.enabled=true. That ingress always routes to the service HTTP port (service.port, default 4000). The admin listener (service.adminPort, default 15000) is not exposed by the ingress and should remain reachable only from inside the cluster.

Admin UI MCP Discovery and Migration

The Web UI backend owns AgentGateway MCP discovery and sync through /api/mcp-servers/agentgateway/discover and /api/mcp-servers/agentgateway/sync. Both routes check the singleton mcp_server:agentgateway resource directly: can_discover for discovery and can_manage for sync/onboarding. Bootstrap environments should seed that singleton grant explicitly instead of relying on a session-role bypass.

Sync is intentionally one-click for migrations: the backend reads the private AgentGateway admin config, imports every discovered target with status new as a config-driven source: "agentgateway" MCP server, leaves already-managed targets unchanged, and never overwrites conflicting legacy MCP servers. Conflicts are returned as migration warnings with the legacy endpoint and AgentGateway target endpoint so operators can remove or rename the old row manually.
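
The sync classification described above might look like the sketch below. The record shapes (id, endpoint, source keys), the plan_sync name, and treating "not yet stored" as status new are assumptions made for illustration.

```python
def plan_sync(discovered: list, existing: dict) -> dict:
    """Classify discovered AgentGateway targets against stored MCP server rows.

    discovered: [{"id": ..., "endpoint": ...}] from the admin config.
    existing:   {id: {"source": ..., "endpoint": ...}} from the database.
    """
    plan = {"imported": [], "unchanged": [], "warnings": []}
    for target in discovered:
        row = existing.get(target["id"])
        if row is None:
            # New target: import as a config-driven agentgateway row.
            plan["imported"].append(
                {"id": target["id"], "source": "agentgateway", "endpoint": target["endpoint"]}
            )
        elif row.get("source") == "agentgateway":
            # Already managed: leave untouched.
            plan["unchanged"].append(target["id"])
        else:
            # Legacy conflict: never overwrite, only warn.
            plan["warnings"].append(
                {
                    "id": target["id"],
                    "legacy_endpoint": row.get("endpoint"),
                    "agentgateway_endpoint": target["endpoint"],
                }
            )
    return plan
```

Returning a plan instead of mutating rows directly keeps the one-click sync reviewable: the warnings list carries both endpoints so an operator can decide which row to remove or rename.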

Why This Is the Right Architecture for a PEP

  • Decoupled policy from business logic: MCP servers implement domain logic, not authz. Changing a policy means writing OpenFGA tuples, not redeploying an MCP server.
  • Consistent enforcement: Every tool (RAG, GitHub, ArgoCD, Slack) goes through the same gateway with the same JWT. No tool can be accidentally left unenforced.
  • Externalized relationship decisions: OpenFGA gives us a remote PDP for relationship checks without putting that logic inside each MCP server.
  • Token passthrough: AgentGateway forwards the JWT to the MCP backend unchanged. The backend can do its own secondary validation (e.g. tenant isolation).

Local / Embedded MCP Exception Path

Most production MCP traffic should still go through AgentGateway. The repository also ships a shared custom MCP middleware for the exception cases:

  • Local dev: when an engineer runs a FastMCP server directly on localhost for mcp dev, MCP_TRUSTED_LOCALHOST=true can bypass auth for the real loopback peer only.
  • Embedded MCPs: when an MCP lives inside another Python service and therefore cannot be registered as a standalone AgentGateway backend, the same package validates the bearer token locally and can optionally call Keycloak's PDP for a per-MCP scope decision.

That package lives under ai_platform_engineering/agents/common/mcp-auth/ and is intentionally authn-focused by default. In the normal standalone path, AgentGateway remains the source of truth for RBAC.
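
For the local-dev bypass, "real loopback peer only" can be read as a check on the socket peer address rather than on any forwarded header. The sketch below assumes that reading; may_bypass_auth is a hypothetical name, and trusted_localhost stands in for MCP_TRUSTED_LOCALHOST.

```python
import ipaddress


def may_bypass_auth(peer_ip: str, trusted_localhost: bool) -> bool:
    """Allow the bypass only when the flag is on AND the TCP peer is loopback.

    peer_ip must come from the connection itself, not from client-supplied
    headers such as X-Forwarded-For, which a remote caller could spoof.
    """
    if not trusted_localhost:
        return False
    try:
        return ipaddress.ip_address(peer_ip).is_loopback
    except ValueError:
        # Hostnames or malformed values are never trusted.
        return False
```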


AgentGateway + OIDC + Keycloak - The Integrated Picture

Badge analogy: Duo SSO is the national ID office, which issues the underlying identity. Keycloak is HR, which takes that national ID, prints a CAIPE-branded employee badge with your roles stamped on it, and publishes a public fingerprint scanner (JWKS) in the lobby so anyone can verify a badge is really HR-issued. AgentGateway is the armed checkpoint at the server room door. The checkpoint verifies the badge locally, then calls the OpenFGA authorization desk through ext_authz before opening the door.

Technically: Keycloak, OpenFGA, and AgentGateway cooperate to put a verified, relationship-checked, role-carrying JWT in front of every MCP request. AG itself is the Policy Enforcement Point (PEP): it doesn't authenticate users, it doesn't store roles, and it never talks to Duo. It verifies that the JWT in the request was signed by Keycloak (using a cached copy of Keycloak's JWKS), then calls OpenFGA through extAuthz for the authorization decision.

| Layer | Role | What it owns | What it does NOT own |
| --- | --- | --- | --- |
| Upstream IdP (e.g. Duo SSO, Okta, Azure AD) | Identity provider | User authentication (password, MFA, device trust), email ownership | Application roles, per-tool access rules |
| Keycloak | OIDC AS + IdP broker | Realm roles (chat_user, admin), JWT issuance, JWKS publication, OBO token exchange (RFC 8693) | Tool-level decisions, user password (delegated to Duo) |
| OpenFGA | Remote PDP | Relationship decisions such as user:<sub> can_call mcp_gateway:list and team resource tuples (team:<slug>#member can_use agent:<id>) | JWT validation, token minting, proxying traffic |
| AgentGateway (PEP) | Policy Enforcement Point | jwtAuth, extAuthz, local JWT verification against cached JWKS | Identity store, role store, token minting, CEL policy storage |

Keycloak brokers the upstream IdP: Duo SSO doesn't issue the JWT that AG sees. Duo authenticates the user, returns an OIDC authorization code to Keycloak, and Keycloak then mints the CAIPE JWT whose identity claims feed the OpenFGA extAuthz check. From AG's perspective, Keycloak is the only issuer it trusts (iss = http://localhost:7080/realms/caipe); the existence of Duo is invisible to AG. This is the standard OIDC/OAuth 2.0 resource-server pattern applied to an MCP-aware proxy.

Identity Provenance: Duo SSO → Keycloak → JWT → AG → MCP

Read this as the badge's lifecycle:

  1. Duo SSO authenticates the human. It doesn't know about CAIPE roles. It only proves "this really is alice@example.com with working MFA" and hands an OIDC authorization code to Keycloak. Duo's issuer (IDP_ISSUER) is configured in Keycloak as IDP_ALIAS=duo-sso; this is the only direct contact between CAIPE and Duo.
  2. Keycloak brokers and rebrands the identity. It validates the Duo code, runs its IdP mappers (e.g. firstname → given_name to handle Duo's non-standard claim), and signs a fresh JWT with its own RS256 key. This is the only token CAIPE services ever see; Duo's identity token is discarded at the Keycloak boundary. Product authorization is evaluated later through OpenFGA organization, team, and resource relationships.
  3. Every CAIPE caller holds a Keycloak-signed JWT. The Slack Bot additionally does an RFC 8693 token-exchange to produce an OBO (On-Behalf-Of) JWT that pins sub=alice and act.sub=caipe-slack-bot, but it's still a Keycloak-signed JWT with iss = http://localhost:7080/realms/caipe. From AG's perspective there's no difference between a UI JWT and an OBO JWT; both pass jwtAuth as long as they're signed by a key in AG's JWKS cache.
  4. AG verifies locally, calls OpenFGA, forwards unchanged. The JWT reaches the MCP server with Alice's identity intact, so MCP-level defense-in-depth checks (e.g. the RAG server's per-tenant document ACLs) see the real user, not the supervisor's service account and not the Slack bot.

The practical consequence: to switch CAIPE from Duo SSO to Okta or Azure AD you don't touch AgentGateway at all. You change IDP_ISSUER, IDP_CLIENT_ID, IDP_CLIENT_SECRET, IDP_ALIAS, and maybe a mapper in Keycloak, and every component downstream continues to trust Keycloak-issued JWTs. This is the whole point of making Keycloak the IdP broker instead of having each service integrate directly with the upstream IdP.

How AG Is Wired to Keycloak and OpenFGA (at boot and at steady state)

Three independent channels feed the AG decision:

| # | Channel | Direction | Purpose | Cadence |
| --- | --- | --- | --- | --- |
| 1 | JWKS | AG → Keycloak | Fetch public keys to verify JWT signatures | On startup; on unknown kid; on Cache-Control TTL expiry |
| 2 | Token issuance | Client → Keycloak → Client | Users/bots obtain JWTs to present to AG; AG never mints tokens | On login / OBO exchange |
| 3 | Relationship decision | AG → openfga-authz-bridge → OpenFGA | Remote PDP check before MCP proxying | Every MCP request |

There is no direct API call from AG to Keycloak per request. JWKS fetching is a pure cache-refresh operation, not a live auth check.
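
The JWKS cadence (startup fetch, TTL expiry, one forced refresh on an unknown kid) can be illustrated with a small cache. This is a sketch, not AgentGateway's implementation: JwksCache, the injectable fetch and clock callables, and the 300-second default TTL are assumptions.

```python
import time
from typing import Callable, Optional


class JwksCache:
    """Caches JWKS keys by kid and refreshes lazily."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        ttl_seconds: float = 300.0,  # assumed; the real TTL follows Cache-Control
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: dict = {}
        self._expires_at = 0.0

    def _refresh(self) -> None:
        jwks = self._fetch()
        self._keys = {key["kid"]: key for key in jwks.get("keys", [])}
        self._expires_at = self._clock() + self._ttl

    def get_key(self, kid: str) -> Optional[dict]:
        refreshed = False
        if not self._keys or self._clock() >= self._expires_at:
            self._refresh()  # first use or TTL expiry
            refreshed = True
        if kid not in self._keys and not refreshed:
            self._refresh()  # unknown kid: exactly one forced refresh
        return self._keys.get(kid)
```

A production cache would typically also rate-limit forced refreshes so that a stream of tokens with bogus kid values cannot turn into a stream of JWKS fetches.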

The Exact jwtAuth Contract (from config.yaml)

binds:
- port: 4000
  listeners:
  - protocol: HTTP
    policies:
      jwtAuth:
        mode: strict # reject request if no valid JWT present
        issuer: https://caipe.example.com/realms/caipe
        audiences: [caipe-platform, agentgateway]
        jwks:
          url: http://keycloak:7080/realms/caipe/protocol/openid-connect/certs
    routes:
    - policies:
        extAuthz:
          host: openfga-authz-bridge:9100
          failureMode:
            denyWithStatus: 403
          protocol:
            grpc:
              metadata:
                caipe.auth: '{"sub": jwt.sub}'

What mode: strict means in practice:

  • iss must equal issuer: tokens from any other realm or IdP are rejected with 401.
  • aud must contain at least one of audiences: protects against token substitution where a token was issued to a different service client.
  • exp, nbf, iat enforced: expired or not-yet-valid tokens are rejected.
  • Signature verified against JWKS: kid in the JWT header must match a cached key.
  • Unknown kid triggers one forced JWKS refresh: handles Keycloak key rotation without manual intervention.
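
The claim checks in the list above can be sketched over an already-decoded claims dict. This illustrates the rules only; it is not AgentGateway's code. check_claims is a hypothetical name, and signature verification and iat handling are deliberately omitted.

```python
import time
from typing import List, Optional


def check_claims(
    claims: dict,
    issuer: str,
    audiences: List[str],
    now: Optional[float] = None,
) -> List[str]:
    """Return the names of claims that fail validation (empty list means pass)."""
    now = time.time() if now is None else now
    failed = []

    if claims.get("iss") != issuer:
        failed.append("iss")

    aud = claims.get("aud", [])
    aud_list = [aud] if isinstance(aud, str) else list(aud)  # aud may be str or list
    if not set(aud_list) & set(audiences):
        failed.append("aud")

    exp = claims.get("exp")
    if exp is None or now >= exp:
        failed.append("exp")

    nbf = claims.get("nbf")
    if nbf is not None and now < nbf:
        failed.append("nbf")

    return failed
```

In strict mode any non-empty result would translate to a 401 before the request is ever handed to extAuthz.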

Only after jwtAuth passes does AG call extAuthz. AG sends an Envoy CheckRequest over gRPC with caipe.auth.sub metadata derived from jwt.sub; the OpenFGA bridge maps that subject to user:<sub> and calls OpenFGA Check. The route-level bridge checks the coarse mcp_gateway:list object for MCP browse/list/init traffic, while signed Dynamic Agent tools/call requests additionally check the agent/tool relationships. If jwtAuth fails, the request never reaches policy evaluation; if OpenFGA/bridge is unavailable or denies, AG returns 403 because failureMode.denyWithStatus=403.
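
The ordering described above (jwtAuth first, then extAuthz, with failures mapped by failureMode) reduces to a small decision function. This is a sketch: gateway_status and pdp_check are hypothetical names, and 200 stands in for "proxied to the MCP server", whose real status comes from the upstream response.

```python
from typing import Callable


def gateway_status(jwt_ok: bool, pdp_check: Callable[[], bool]) -> int:
    """Map the two gateway stages to the HTTP status the caller would see."""
    if not jwt_ok:
        return 401  # request never reaches policy evaluation
    try:
        allowed = pdp_check()
    except Exception:
        return 403  # bridge/OpenFGA unavailable: failureMode.denyWithStatus
    return 200 if allowed else 403
```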

OpenFGA ReBAC Model

The dev PDP model keeps the coarse AgentGateway gate and adds admin-configured team relationships:

| Type | Relation | Tuple written by |
| --- | --- | --- |
| mcp_gateway:list | can_call: [user] | openfga-init seed / manual bootstrap for the current AGW coarse browse/list gate |
| team:<slug> | member: [user] | Team Resources save, using Keycloak sub values resolved from team member emails |
| agent:<agent_id> | base user, manager; derived can_use, can_manage | Team Resources agent Use / Manage checkboxes write base relations |
| tool:<server>_* and tool:* | base caller; derived can_call | Team Resources MCP-server prefix checkboxes and the All Tools wildcard write base relations |
| knowledge_base:<id> | base reader, ingestor, manager; derived can_read, can_ingest, can_admin | Team Knowledge Base assignments and Settings → Knowledge Bases write team:<slug>#member reader/ingestor for read and ingest, and team:<slug>#admin manager for admin, before persisting Mongo assignment metadata. KB pages, sharing, and KB-scoped routes check these relationships. |
| data_source:<id> | base reader, ingestor, manager; derived can_read, can_ingest, can_manage | Datasource component grants are reconciled alongside Knowledge Base grants when a KB-backed datasource is created or shared. Datasource lists, search filters, and ingest/reload operations check these relationships so read and write can differ per datasource. |
| skill:<id> | base reader, user, writer, manager; derived can_read, can_use, can_write, can_manage | Team Resources skill selection writes user relationships for local and Skill Hub catalog ids; /api/skills filters by can_read/can_use. |
| conversation:<id> | base owner, reader, writer, sharer, manager; derived can_read, can_write, can_share, can_delete | Chat list/read/write/share and Dynamic Agent stream/invoke/resume/cancel paths check implicit Mongo ownership first, then explicit OpenFGA conversation access. |
| mcp_server:agentgateway | base reader, writer, manager; derived can_discover, can_read, can_manage | AgentGateway discovery uses can_discover; selected-server sync/onboarding uses can_manage. |
| system_config:platform_settings | base reader, manager; derived can_read, can_manage | Platform config GET/PATCH checks the concrete system config object in addition to admin session gates. |

The Web UI backend tuple writer is idempotent: it checks tuples before writes/deletes to avoid duplicate-write failures and to tolerate missing tuples during removals. It intentionally rejects writable can_* tuples; callers must write base relationships and let OpenFGA derive the can_* permissions.
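
One way to picture the idempotent writer is as a planning step over the tuples that already exist. The sketch below assumes tuples are (user, relation, object) triples held in a set; plan_tuple_changes is a hypothetical name, and the real backend checks OpenFGA rather than an in-memory set.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (user, relation, object)


def plan_tuple_changes(
    existing: Set[Triple], writes: List[Triple], deletes: List[Triple]
) -> dict:
    """Drop no-op writes/deletes and reject derived can_* relations."""
    for _user, relation, _obj in writes + deletes:
        if relation.startswith("can_"):
            raise ValueError(
                f"{relation} is derived; write the base relation instead"
            )
    return {
        # Skip tuples that already exist (avoids duplicate-write failures).
        "writes": [t for t in writes if t not in existing],
        # Skip tuples that are already gone (tolerates missing tuples).
        "deletes": [t for t in deletes if t in existing],
    }
```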

Team membership semantic: On the team type, member is now defined as [user, external_group#member] or admin. In other words, anyone with the admin relation on a team automatically satisfies team#member checks (and, by extension, team#member userset references such as the team:<slug>#member can_use agent:<id> Slack/Webex resource paths). This means an admin no longer needs a separate member tuple to use the team's agents, and bots can ask check(user, "member", team:<slug>) as a single question. admin continues to be a directly-written relation; only member gains the derived branch. Legacy callers that list both team#member and team#admin as subject sets still work, but listing team#admin is now redundant.
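
A toy evaluator makes the derived branch concrete. It is not OpenFGA and omits the external_group#member branch; tuples are assumed to be (subject, relation, object) triples, and the agent's base relation name user is taken from the model table above.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)
MEMBER_SUFFIX = "#member"


def is_team_member(user: str, team: str, tuples: Set[Triple]) -> bool:
    """member = direct member OR admin (external groups omitted in this sketch)."""
    return (user, "member", team) in tuples or (user, "admin", team) in tuples


def can_use_agent(user: str, agent: str, tuples: Set[Triple]) -> bool:
    """can_use is derived from the agent's base user relation."""
    for subject, relation, obj in tuples:
        if relation != "user" or obj != agent:
            continue
        if subject == user:
            return True  # direct grant
        if subject.endswith(MEMBER_SUFFIX):
            team = subject[: -len(MEMBER_SUFFIX)]
            if is_team_member(user, team, tuples):
                return True  # grant through team:<slug>#member
    return False
```

With only an admin tuple on the team, the admin still passes the agent check, which is the behaviour the paragraph above describes.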

AgentGateway Policy Model

AgentGateway no longer maintains a Mongo-backed CEL policy surface for MCP authorization. The checked-in deploy/agentgateway/config.yaml is intentionally static: it authenticates with jwtAuth, delegates authorization to the OpenFGA bridge through extAuthz, and then proxies to the configured MCP targets.

The Admin UI's former "AG MCP Policies" tab, /api/rbac/ag-policies, /api/rbac/ag-sync-status, ag_mcp_policies, ag_mcp_backends, and ag_sync_state are retired. Relationship changes should be modeled as OpenFGA tuples through the ReBAC admin surfaces instead of editing AgentGateway CEL.

The Web UI backend's former CEL overlay is also retired: CEL_RBAC_EXPRESSIONS, /api/rbac/admin-tab-policies, editable admin_tab_policies, and the browser CEL editor are no longer part of the UI authorization path. Keep custom authorization logic in OpenFGA tuples and audited ReBAC change sets.

Operational Guarantees

| Guarantee | Mechanism |
| --- | --- |
| AG restart does not invalidate user sessions | User JWTs are self-contained; AG just re-fetches JWKS on startup |
| Keycloak key rotation is zero-downtime | Unknown kid triggers one forced JWKS refresh; cached keys remain valid until exp |
| Policy update is zero-downtime | OpenFGA tuple writes are independent of AG process restarts; AG keeps using extAuthz |
| Admin UI edit audit trail | ReBAC relationship/policy surfaces write openfga_rebac audit events through the Web UI backend |
| MongoDB outage doesn't take AG down | AG uses static config plus OpenFGA; it does not depend on Mongo-rendered CEL rules |
| Keycloak outage doesn't take AG down for already-issued tokens | JWKS is cached; new logins fail at Keycloak, not at AG |

The end-to-end per-request sequence diagram (and the demo walkthrough that proves all three outcomes: 200, 403, 401) lives in Workflows › Per-request authorization. Use that to demo the system live.


Component 5: Dynamic Agents - The Workshop Floor

Badge analogy: A workshop where employees build and operate their own machines. The workshop checks your badge at the door (JWT validation on every request). Once inside, each machine has its own access tag: some are personal (Private), some are shared with your team (Team), some anyone can use (Global). Your badge level determines which machines you can touch. When a machine makes a tool call, it presents your badge, not its own, so the security checkpoint still sees you, not the machine.

Technically: A FastAPI service where every route handler uses get_current_user() as a FastAPI Depends(). Unlike the supervisor (which uses a middleware contextvar), the dynamic agents service validates the JWT on every request at the route level, giving precise control per endpoint.

JWT Validation Chain

# FastAPI dependency injection: runs before every protected handler
user: UserContext = Depends(get_current_user)

Inside get_current_user():

1. Extract Bearer token from Authorization header
2. Fetch JWKS from Keycloak (cached in-process)
3. Validate:
   - signature (RS256 against JWKS public key)
   - expiry (exp)
   - issuer (iss == OIDC_ISSUER)
   - audience (aud in KEYCLOAK_AUDIENCE / OIDC_AUDIENCE, if set)
4. Call OIDC userinfo endpoint (cached 10 min by token hash)
   → authoritative email, name, groups (OIDC tokens often omit these)
5. Extract realm_access.roles from JWT claims
   (Keycloak puts roles here; also checked in userinfo)
6. Evaluate the configured required-access group (if set); 403 if missing
7. Preserve group claims as identity context only; product admin is decided by OpenFGA organization relationships
8. Return UserContext { email, name, groups, access_token, obo_jwt }
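
Step 4's "cached 10 min by token hash" could be implemented along these lines. The sketch is an assumption about shape, not the service's code: UserinfoCache, the injectable fetch and clock callables, and the choice of SHA-256 are illustrative.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class UserinfoCache:
    """Caches userinfo responses keyed by a hash of the bearer token."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float = 600.0,  # 10 minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, token: str) -> dict:
        # Hash the token so raw bearer values are never used as cache keys.
        key = hashlib.sha256(token.encode("utf-8")).hexdigest()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        info = self._fetch(token)
        self._entries[key] = (now + self._ttl, info)
        return info
```

Keying by token hash means a refreshed access token naturally misses the cache, so group changes are picked up no later than the next token or the TTL, whichever comes first.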

Agent-Level Authorization (OpenFGA Execution Gate)

After the bearer token is validated by JwtAuthMiddleware, Dynamic Agents decodes the already-validated JWT payload only to extract sub and repeats the same OpenFGA check used by the Web UI backend:

user:<sub> can_use agent:<agent_id>

The runtime check runs before agent lookup, MCP server lookup, runtime cache creation, non-streaming invocation, or stream resume work. This second layer is required because the runtime must not trust the Web UI backend as the only enforcement point. Denials return 403 / pdp_denied; OpenFGA outages return 503 / pdp_unavailable; missing or malformed bearer context returns a structured 401.
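
The three failure modes map to distinct responses, which a sketch makes explicit. gate_agent_request, PdpUnavailable, and the check callable are hypothetical; pdp_denied and pdp_unavailable come from the text above, while the missing_bearer code for the 401 case is an assumed placeholder.

```python
from typing import Callable, Optional, Tuple


class PdpUnavailable(Exception):
    """Raised by the check callable when OpenFGA cannot be reached."""


def gate_agent_request(
    sub: Optional[str],
    agent_id: str,
    check: Callable[[str, str, str], bool],
) -> Tuple[int, str]:
    """Run before agent lookup, cache creation, invoke, or resume."""
    if not sub:
        return 401, "missing_bearer"  # assumed code name
    try:
        allowed = check(f"user:{sub}", "can_use", f"agent:{agent_id}")
    except PdpUnavailable:
        return 503, "pdp_unavailable"
    if not allowed:
        return 403, "pdp_denied"
    return 200, "ok"
```

Separating 503 from 403 lets clients and dashboards distinguish "you are not allowed" from "the decision point is down", while both still fail closed.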

The older visibility-rule and CEL authorization paths are no longer the authoritative execution gate for start, invoke, and resume. Downstream tool authorization continues to be enforced by AgentGateway and OpenFGA.

Token Forwarding to MCP Tools

The UserContext.obo_jwt (set from X-OBO-JWT header) or UserContext.access_token is forwarded as the Authorization: Bearer header on all MCP tool calls made by the agent runtime. This gives the same per-user enforcement at AgentGateway as the supervisor path provides.

Dynamic Agents also forwards the validated per-request bearer when probing MCP servers for tool manifests. The MCP client connection config carries an explicit Authorization header in addition to the HTTP client factory hook, because AgentGateway denies tokenless probe traffic before any upstream MCP server can return tools.

Only MCP server IDs listed in AGENT_GATEWAY_MCP_SERVER_IDS are rewritten to AGENT_GATEWAY_URL/mcp/<server_id>. The special value all applies only to gateway-managed rows (source: agentgateway, agentgateway_discovered: true, or an endpoint already rooted at AGENT_GATEWAY_URL); manual/direct MCP rows keep their stored endpoint so runtime-added tools do not get sent to missing AgentGateway routes. Docker Compose defaults to all because agentgateway-config-bridge reconciles enabled gateway-managed mcp_servers rows into the standalone AgentGateway config. The Helm path uses AgentGateway's native Kubernetes resources: global.agentgateway.knowledgeBaseTarget and global.agentgateway.extraMcpTargets render AgentgatewayBackend and HTTPRoute objects instead of running the Mongo polling bridge in-cluster.
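
The rewrite rule for AGENT_GATEWAY_MCP_SERVER_IDS can be sketched as a pure function. resolve_endpoint and the row shape (id, source, agentgateway_discovered, endpoint keys) are assumptions based on the field names mentioned above.

```python
def resolve_endpoint(row: dict, gateway_url: str, server_ids: str) -> str:
    """Return the URL Dynamic Agents should call for one mcp_servers row."""
    base = gateway_url.rstrip("/")
    ids = {part.strip() for part in server_ids.split(",") if part.strip()}

    gateway_managed = (
        row.get("source") == "agentgateway"
        or row.get("agentgateway_discovered") is True
        or str(row.get("endpoint", "")).startswith(base)
    )

    # Explicitly listed IDs are always rewritten; "all" only covers
    # gateway-managed rows so manual/direct rows keep their stored endpoint.
    if row["id"] in ids or ("all" in ids and gateway_managed):
        return f"{base}/mcp/{row['id']}"
    return row["endpoint"]
```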

For runtime tools/call requests, Dynamic Agents can also attach a signed X-CAIPE-Agent-Context header containing the calling agent_id. The OpenFGA bridge verifies this header with CAIPE_AGENT_CONTEXT_HMAC_SECRET, then checks both relationships before allowing the call:

user:<sub> can_use agent:<agent_id>
agent:<agent_id> can_call tool:<server_id>/<tool_name>
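
A sketch of the signed-context idea using the standard hmac module. The header encoding shown here (base64url JSON payload, a dot, then a hex HMAC-SHA256) is an assumption; the source does not specify the wire format of X-CAIPE-Agent-Context, and all function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import List, Optional, Tuple


def sign_agent_context(agent_id: str, secret: bytes) -> str:
    """Produce '<payload>.<signature>' for the agent context header."""
    payload = base64.urlsafe_b64encode(
        json.dumps({"agent_id": agent_id}).encode("utf-8")
    ).decode("ascii")
    signature = hmac.new(secret, payload.encode("ascii"), hashlib.sha256).hexdigest()
    return f"{payload}.{signature}"


def verify_agent_context(header: str, secret: bytes) -> Optional[str]:
    """Return agent_id when the signature matches, otherwise None."""
    payload, _, signature = header.rpartition(".")
    expected = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature prefixes via timing.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))["agent_id"]


def required_checks(
    sub: str, agent_id: str, server_id: str, tool_name: str
) -> List[Tuple[str, str, str]]:
    """Both relationships must hold before a tools/call is allowed."""
    return [
        (f"user:{sub}", "can_use", f"agent:{agent_id}"),
        (f"agent:{agent_id}", "can_call", f"tool:{server_id}/{tool_name}"),
    ]
```

Signing the agent_id prevents a caller from simply claiming to be a more privileged agent: without the shared secret, a forged header fails verification and only the coarse user gate applies.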

The Web UI backend reconciles the second tuple family from each agent's allowed_tools whenever an agent is created, updated, or deleted. Empty per-server tool lists are represented as tool:<server_id>/* so the runtime allowlist and the enforcement graph use the same wildcard semantics.
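
The wildcard rule for empty tool lists can be shown with a small derivation. The allowed_tools shape (server id mapped to a list of tool names) and the agent_tool_tuples name are assumptions, and the base relation caller is inferred from the model table rather than confirmed by the source.

```python
from typing import Dict, List, Tuple


def agent_tool_tuples(
    agent_id: str, allowed_tools: Dict[str, List[str]]
) -> List[Tuple[str, str, str]]:
    """Derive base tuples for one agent; an empty list means every tool."""
    tuples = []
    for server_id, tools in sorted(allowed_tools.items()):
        names = sorted(tools) if tools else ["*"]  # empty list -> wildcard
        for name in names:
            tuples.append(
                (f"agent:{agent_id}", "caller", f"tool:{server_id}/{name}")
            )
    return tuples
```

Sorting keeps the output deterministic, which makes it straightforward to diff the desired tuples against the stored ones during reconciliation.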

Key Environment Variables

| Variable | Default | Security note |
| --- | --- | --- |
| AUTH_ENABLED | false | Must be true in production. false returns a hardcoded dev@localhost admin; never deploy with false. |
| OIDC_ISSUER | (none) | Validated against iss claim; tokens from other issuers are rejected |
| OIDC_CLIENT_ID | (none) | Identifies the Web UI client used by browser-facing flows. Dynamic Agents audience validation uses KEYCLOAK_AUDIENCE / OIDC_AUDIENCE. |
| KEYCLOAK_URL / KEYCLOAK_REALM | (none) | Cluster-internal Keycloak base URL and realm used to fetch JWKS. Required when OIDC_ISSUER is a public hostname that is not reachable through the pod's localhost. |
| KEYCLOAK_AUDIENCE / OIDC_AUDIENCE | caipe-platform,agentgateway | Comma-separated audiences accepted for Dynamic Agents bearer validation. Include caipe-ui when browser session tokens carry that audience. |
| OIDC_REQUIRED_GROUP | (none) | Optional deployment-specific Web UI admission gate; users missing this upstream group are denied before product authorization runs |
| OIDC_REQUIRED_ADMIN_GROUP | (none) | Deprecated for CAIPE product admin. Map enterprise admin groups to CAIPE teams through Identity Group Sync, then grant OpenFGA admin on organization:<org>. |
| DA_REQUIRE_BEARER | false | Set to true to require validated bearer identity for runtime OpenFGA enforcement |
| OPENFGA_HTTP | (none); http://openfga:8080 in Docker Compose dev | OpenFGA API base URL used for runtime can_use checks |
| OPENFGA_STORE_ID | (none) | Optional explicit OpenFGA store id; takes precedence over store-name discovery |
| OPENFGA_STORE_NAME | caipe-openfga | Store name used when discovering the OpenFGA store id; Docker Compose dev wires this into Dynamic Agents alongside the Web UI backend |
| AGENT_GATEWAY_MCP_SERVER_IDS | all | Comma-separated MCP server IDs that Dynamic Agents should reach through AGENT_GATEWAY_URL; all only includes gateway-managed rows, while manual/direct MCP servers keep their stored endpoints. |
| CAIPE_AGENT_CONTEXT_HMAC_SECRET | (none) | Optional shared secret for signing Dynamic Agents → AgentGateway agent_id context used by the OpenFGA bridge for per-agent MCP tool enforcement. Use a secret manager; do not commit values. |
| SLACK_BOT_ADMIN_URL | http://ai-platform-engineering-slack-bot:3001 | Web UI backend URL for the Slack bot internal admin API used for runtime route status, cache reload, and static-config sync. Keep cluster-internal. |
| OIDC_CLIENT_ID / OIDC_CLIENT_SECRET | caipe-ui / (none) | Web UI backend Keycloak confidential client credentials. The same caipe-ui client is used for browser OIDC login and server-side client-credentials calls to the Slack bot admin API. Store the secret in a secret manager; do not place it in ConfigMaps. |
| SLACK_ADMIN_API_ENABLED | false | Enables the Slack bot's internal admin API. It must remain internal-only and require JWKS-verified Bearer tokens. |
| SLACK_BOT_ADMIN_DEV_AUTH_ENABLED / SLACK_BOT_ADMIN_DEV_TOKEN | false / (none) | Web UI local-dev escape hatch for Slack bot admin API calls when Keycloak is intentionally not running. Sends the configured dev bearer token instead of minting a Keycloak client-credentials token. |
| SLACK_ADMIN_DEV_AUTH_ENABLED / SLACK_ADMIN_DEV_TOKEN | false / (none) | Slack bot side of the same local-dev escape hatch. The bot accepts the dev bearer token only when explicitly enabled. Never enable in shared, staging, or production environments. |
| SLACK_ADMIN_JWKS_URL | (none) | Optional Docker/cluster-internal JWKS URL for Slack bot token verification when the public issuer is not directly reachable from the bot container. |
| SLACK_ADMIN_JWT_AUDIENCE | caipe-slack-bot-admin | Expected audience for Web UI backend service tokens calling Slack bot admin endpoints. |
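
To make the comma-separated audience setting concrete, here is a hedged sketch of parsing it and matching a token's aud claim. The function names are hypothetical, and the precedence of KEYCLOAK_AUDIENCE over OIDC_AUDIENCE is an assumption; only the default value comes from the table above.

```python
def accepted_audiences(env: dict[str, str]) -> set[str]:
    """Parse the accepted audiences from an environment mapping."""
    raw = (
        env.get("KEYCLOAK_AUDIENCE")       # assumed to take precedence
        or env.get("OIDC_AUDIENCE")
        or "caipe-platform,agentgateway"   # documented default
    )
    return {a.strip() for a in raw.split(",") if a.strip()}


def audience_ok(aud_claim, accepted: set[str]) -> bool:
    """A JWT aud claim may be a single string or a list of strings."""
    if isinstance(aud_claim, str):
        claim = {aud_claim}
    else:
        claim = set(aud_claim or [])
    # One overlapping audience is enough to accept the token.
    return bool(claim & accepted)
```

Passing `os.environ` as `env` would read the live process settings; a plain dict keeps the sketch easy to exercise in isolation.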

Service-to-Service Authentication (Slack bot → caipe-ui)

The Slack bot calls caipe-ui's API as a machine client, not as a logged-in user. It uses the OAuth2 client_credentials grant against the caipe realm:

| Env var | Purpose |
| --- | --- |
| SLACK_INTEGRATION_ENABLE_AUTH=true | Enables Bearer-token path in app.py |
| SLACK_INTEGRATION_AUTH_TOKEN_URL | ${KEYCLOAK_URL}/realms/caipe/protocol/openid-connect/token |
| SLACK_INTEGRATION_AUTH_CLIENT_ID | caipe-slack-bot (pre-created in realm-config.json) |
| SLACK_INTEGRATION_AUTH_CLIENT_SECRET | Fetched from Keycloak; see "Provisioning service-client secrets" below |
| OAUTH2_CLIENT_SECRET | Helm fallback env var for the same caipe-slack-bot client secret, normally sourced from the keycloak-bot Secret |
| KEYCLOAK_BOT_CLIENT_SECRET | Same secret again for the Slack OBO helper (utils/obo_exchange.py) |
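
As an illustration of the client_credentials grant these variables feed, the sketch below builds the token request without sending it. It is not the code in app.py; the function name is hypothetical, and it assumes the client authenticates by posting client_id and client_secret in the form body.

```python
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth2 client_credentials token request."""
    body = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }
    ).encode()
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(req)` and reading `access_token` from the JSON response would yield the bearer the bot presents to caipe-ui.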

Token shape (fields that matter):

  • iss: ${KEYCLOAK_URL}/realms/caipe
  • aud: [caipe-ui, caipe-platform]. Both audiences are needed. caipe-platform is added by Keycloak's default audience resolution; caipe-ui comes from an oidc-audience-mapper protocol mapper (aud-caipe-ui) on the caipe-slack-bot client. caipe-ui's JWT validator rejects tokens whose audience doesn't include OIDC_CLIENT_ID (i.e. caipe-ui), so this mapper is required.
  • azp: caipe-slack-bot
  • sub: service account UUID (stable)
  • preferred_username: service-account-caipe-slack-bot
  • scope: groups email profile org roles

The mapper is created automatically by deploy/keycloak/init-idp.sh (idempotent).

This token represents the bot, not the user. User identity is carried separately by the OBO flow in utils/obo_exchange.py (RFC 8693 token exchange), which produces a second token with act.sub=caipe-slack-bot and the real user's sub/email.

Provisioning service-client secrets in production

In dev, secrets are embedded in deploy/keycloak/realm-config.json. In production, operators should treat them as rotating credentials:

Option A, manual (Keycloak Admin UI):

  1. Log into Keycloak Admin Console → caipe realm → Clients → caipe-slack-bot → Credentials tab.
  2. Copy the Secret value (or click Regenerate Secret for rotation).
  3. Store it in your secret manager (Vault, AWS SSM, K8s Secret) as SLACK_INTEGRATION_AUTH_CLIENT_SECRET.
  4. Redeploy / restart the Slack bot pod so it picks up the new secret.

Option B, scripted (deploy/keycloak/export-client-secrets.sh):

The script fetches secrets via the Keycloak Admin API and emits them in one of three formats:

# shell (source into current session)
eval "$(KC_URL=https://keycloak.example.com ./export-client-secrets.sh)"

# dotenv (append to a .env file)
KC_URL=https://keycloak.example.com FORMAT=dotenv \
./export-client-secrets.sh >> slack-bot.env

# kubernetes Secret (pipe to kubectl)
KC_URL=https://keycloak.example.com FORMAT=k8s \
K8S_NAMESPACE=caipe K8S_SECRET_NAME=caipe-service-secrets \
./export-client-secrets.sh | kubectl apply -f -

The Helm chart can wire this up as a post-install Job so fresh installs get the Secret populated without operator intervention. Rotation is the same call: the Secret is overwritten in place.

Slack bot → Keycloak Admin REST API (identity lookup)

This path is separate from the OBO flow above. The Slack bot also calls Keycloak's Admin REST API to find a Keycloak user by the slack_user_id attribute (and to read/write team_id). This is the call that fires when someone @mentions the bot for the first time. It uses client_credentials and a different Keycloak client than the OBO flow.

| Env var | Purpose |
| --- | --- |
| KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_ID | Confidential Keycloak client for slack-bot's Admin API calls (lookup + JIT create). Default caipe-platform; that client's service account is granted view-users + query-users + manage-users on realm-management for user lookup/create, plus the client/authz roles needed by the Web UI BFF Keycloak RBAC migration. |
| KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_SECRET | Matching client_secret. In dev, defaults to caipe-platform-dev-secret. |
| KEYCLOAK_URL, KEYCLOAK_REALM | Same values as everywhere else. |
| SLACK_RBAC_ENABLED | Enables Slack-side identity lookup, team/channel resolution, OBO exchange, and channel ReBAC checks before the bot forwards a request. |
| SLACK_JIT_CREATE_USER (spec 103) | true (default) auto-creates a federated-only Keycloak shell user on first DM when no Keycloak user with the Slack email exists. false falls through to the HMAC link URL so onboarding requires the web UI. Reuses KEYCLOAK_SLACK_BOT_ADMIN_*, so no new secret is needed. See plan R-8 for the single-credential trade-off. |
| SLACK_JIT_ALLOWED_EMAIL_DOMAINS (spec 103) | Optional comma-separated allowlist (e.g. corp.com,acme.io). Empty = any domain. Recommended for prod when the federated IdP can return non-corporate emails. |
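
The allowlist semantics of SLACK_JIT_ALLOWED_EMAIL_DOMAINS can be sketched as follows. The function name is hypothetical, and the matching rules (exact domain, case-insensitive, no subdomain matching) are assumptions; only "empty means any domain" comes from the table.

```python
def jit_email_allowed(email: str, allowed_domains: str) -> bool:
    """Decide whether JIT user creation may proceed for this email."""
    allowed = {d.strip().lower() for d in allowed_domains.split(",") if d.strip()}
    if not allowed:
        # Empty allowlist: any domain is accepted.
        return True
    _, sep, domain = email.rpartition("@")
    # Reject strings without "@" rather than guessing a domain.
    return bool(sep) and domain.lower() in allowed
```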

In Helm and GitOps installs, charts/ai-platform-engineering/charts/slack-bot/templates/deployment.yaml wires OAUTH2_CLIENT_SECRET and KEYCLOAK_BOT_CLIENT_SECRET from the Keycloak bot Secret, while the Slack tokens and KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_SECRET can come from an ExternalSecret such as Vault path projects/caipe/rbac/slackbot.

Why KEYCLOAK_SLACK_BOT_ADMIN_* and not just KEYCLOAK_ADMIN_* or KEYCLOAK_BOT_ADMIN_*? Two reasons:

  1. No collision with the Web UI backend. Pre-098 the slack-bot read the same KEYCLOAK_ADMIN_* env names as the Web UI backend. Both services share docker-compose.dev.yaml env interpolation, so a single KEYCLOAK_ADMIN_CLIENT_ID=admin-cli line in .env (intended for the UI's password-grant fallback) silently overrode the slack-bot's client_credentials path, producing HTTP 401 "Public client not allowed to retrieve service account" on every Slack mention.
  2. Room for future surfaces. The surface-specific prefix (KEYCLOAK_<surface>_BOT_ADMIN_*) means future bot integrations like KEYCLOAK_WEBEX_BOT_ADMIN_* or KEYCLOAK_TEAMS_BOT_ADMIN_* can each have their own dedicated namespace without yet another rename.

Required client config in Keycloak (any client you point this at):

  • publicClient: false
  • serviceAccountsEnabled: true
  • clientAuthenticatorType: client-secret
  • Service-account user has these realm-management client roles for Slack identity lookup/JIT: view-users, query-users, and manage-users. In the default caipe-platform wiring it also has query-clients, view-clients, manage-clients, view-authorization, and manage-authorization so the Web UI BFF can repair Keycloak OBO mappings.

The realm seeder already provisions caipe-platform with all of those, so the default values "just work" in dev.
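
When pointing KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_ID at a different client, a preflight check along these lines may help. It is a sketch only: the function name is hypothetical, and it assumes you have already fetched the client representation and the service account's realm-management role names through the Admin API.

```python
REQUIRED_FLAGS = {
    "publicClient": False,
    "serviceAccountsEnabled": True,
    "clientAuthenticatorType": "client-secret",
}
# Roles needed for Slack identity lookup and JIT create.
REQUIRED_ROLES = {"view-users", "query-users", "manage-users"}


def client_config_problems(client: dict, service_account_roles: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the client qualifies."""
    problems = [
        f"{key} must be {value!r}"
        for key, value in REQUIRED_FLAGS.items()
        if client.get(key) != value
    ]
    missing = sorted(REQUIRED_ROLES - service_account_roles)
    if missing:
        problems.append("missing realm-management roles: " + ", ".join(missing))
    return problems
```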

Spec 104: active_team JWT claim (REMOVED by Phase 3 of spec 2026-05-24-derive-team-from-channel)

Status: removed. The active_team JWT claim mechanism described here has been demolished. Team identity is now derived from the channel_team_mappings collection at request time (BFF + AgentGateway PDP). Bots no longer mint team-<slug> client scopes, the OBO audience client no longer has any team-* default scope, and Keycloak no longer participates in team-identity negotiation.

See spec 2026-05-24-derive-team-from-channel for the full demolition rationale. The active_team mechanism never shipped to production, so no realm has legacy team-* scopes to clean up; Phase 3 is a pure code/Helm/UI deletion.

Components touched (post-demolition)

  1. Keycloak: no per-team client scopes, no active_team mapper, no team-personal DM-marker scope. Only the team-agnostic OBO permission wiring (token-exchange decision strategy, bot service-account impersonation roles, realm-wide users.impersonate scope-permission) remains in scope of the reconciliation migration.
  2. Web UI backend (caipe-ui): POST /api/admin/teams writes a Mongo team row + OpenFGA tuples only. DELETE /api/admin/teams/[id] removes those rows. Slack / Webex channel onboarding writes channel_team_mappings entries (no Keycloak touch).
  3. Slack / Webex bots: obo_exchange.impersonate_user() no longer requests a team-* scope and no longer verifies an active_team claim. Channel → team resolution lives entirely in channel_team_resolver, which reads from channel_team_mappings.
  4. Dynamic agents: request-bound auth context is the user OBO JWT only. No active_team claim is read or written.
  5. AgentGateway PDP / RAG server: both consume the user JWT plus the channel → team mapping. RAG's UserContext.active_team field is gone; _kb_cel_context now exposes user.teams as a list of teams the user belongs to (OpenFGA-sourced), not the single channel team.

Failure modes (post-demolition)

  • Group channel without a team mapping → bot replies "this channel isn't assigned to a CAIPE team yet"; nothing reaches AGW.
  • User not in the mapped team → bot replies "you aren't a member of <team>".
  • DM with no dm_agent_id preference and no realm default → bot replies with the default_agent_id selection UI.
  • DA receives a request without a user JWT → middleware logs WARNING, MCP call goes out without Authorization, AGW 401s.
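
The first two failure modes above can be sketched as a small decision function. This is an illustration, not the channel_team_resolver implementation: the function name is hypothetical, channel_team_mappings is modelled as a plain dict from channel ID to team name, and the user's team memberships are passed in as a set.

```python
def resolve_channel_request(
    channel_id: str,
    user_teams: set[str],
    channel_team_mappings: dict[str, str],
) -> tuple[bool, str]:
    """Return (allowed, team_or_reply) for a message in a group channel."""
    team = channel_team_mappings.get(channel_id)
    if team is None:
        # Unmapped channel: reply in Slack, forward nothing to AgentGateway.
        return False, "this channel isn't assigned to a CAIPE team yet"
    if team not in user_teams:
        return False, f"you aren't a member of {team}"
    return True, team
```

Keeping both denials on the bot side means AgentGateway only sees requests that already have a resolved team.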