RBAC Architecture
Component-by-component reference. Each section describes what it owns, what it does NOT own, and the env vars / config files / extension points you'd touch to change its behavior.
Read the index first if you want the big-picture mental model and the JWT primer. Read Workflows for the request-flow sequence diagrams that tie all of this together.
Helm Runtime Packaging
The 0.5.0 umbrella chart can own the RBAC runtime stack for demo and managed environments:
- tags.keycloak=true enables the Keycloak subchart, realm import, and IdP/token-exchange init hooks. The imported realm follows keycloak.realm.name by rewriting the packaged realm JSON's realm name, Keycloak default-role name, and realm-role container ids at render time.
- The Keycloak subchart packages the caipe login theme by default and mounts it as a ConfigMap under /opt/keycloak/themes/caipe. Deployments can customize branding with keycloak.theme.brandName, keycloak.theme.colors.*, or full keycloak.theme.files.* overrides; keycloak.theme.existingConfigMap remains available for externally managed theme ConfigMaps.
- openfga.enabled=true enables the OpenFGA service and the CAIPE authorization model loader hook. The loader can still write explicit emergency tuples through openfga.init.seedTuples, but production RBAC installs should bootstrap human admins through the Web UI BFF email reconciler so operators do not have to hardcode Keycloak UUIDs in Helm values.
- openfgaAuthzBridge.enabled=true enables the gRPC ext_authz bridge that validates the bearer JWT again, extracts the verified sub, and translates AgentGateway checks into OpenFGA checks.
- agentgateway.enabled=true enables the standalone AgentGateway proxy chart. global.agentgateway.enabled=true is still the Gateway API route-resource path for clusters using the AgentGateway controller model.
- The local and Helm standalone AgentGateway provider MCP targets preserve the caller's Keycloak bearer for listener JWT validation and OpenFGA mcp_gateway:list authorization, then inject provider tokens such as GITHUB_PERSONAL_ACCESS_TOKEN and GITLAB_PERSONAL_ACCESS_TOKEN as backend auth only on the upstream MCP hop. Helm installs should mount those values from Secrets or ExternalSecrets through agentgateway.extraEnv / agentgateway.extraEnvFrom; this keeps provider PATs out of browser/session traffic while satisfying upstream MCP servers' Authorization requirements.
Production installs must still supply ExternalSecrets and persistent datastore settings; the chart defaults are conservative and disabled by default.
Persistent RBAC datastores. By default the Keycloak subchart uses an embedded H2 database and OpenFGA uses an in-memory store; both lose all identity, realm, and authorization state on pod restart, which makes RBAC unusable for anything beyond a throwaway demo. To persist RBAC state, point both at PostgreSQL:
- Keycloak reads non-secret connection settings from keycloak.env (KC_DB=postgres, KC_DB_URL, KC_DB_USERNAME) and the DB password from a Secret via keycloak.db.passwordSecret.name / .key (rendered as a secretKeyRef on KC_DB_PASSWORD, so the password never lands in values or release data). With KC_DB set, the per-pod persistence PVC is unnecessary.
- OpenFGA uses openfga.datastore.engine=postgres with openfga.datastore.uriSecretRef.name / .key supplying the postgres://… connection string to both the deployment and the migrate Job.
setup-caipe.sh wires this automatically: it deploys a single shared bitnami/postgresql instance (caipe-postgres) with one role+database per consumer (keycloak, openfga, optional litellm), persists generated passwords in the caipe-postgres-credentials Secret, and emits the consumer-facing caipe-keycloak-db / caipe-openfga-db Secrets. This is the default for RBAC installs; --no-shared-postgres falls back to the ephemeral H2/in-memory stores.
Component 1: Keycloak (HR & The Front Desk)
Badge analogy: HR issues ID badges. The front desk verifies them on entry. Every other door in the building trusts the badge's chip; they don't call HR each time. When a contractor arrives via a partner agency (Duo SSO), the front desk checks with the agency once, creates an internal record, and issues a standard building badge. From that point on, the contractor uses the same badge as everyone else.
Technically: Keycloak acts as an OIDC Authorization Server and IdP broker. It proxies login to Duo SSO via an OIDC client, mirrors external group claims into identity attributes for sync, and issues its own signed JWT, so downstream services only ever need to trust one issuer. CAIPE authorization decisions are no longer encoded as Keycloak realm roles.
Realm Roles (configured realm, default caipe)
| Role | Default? | Purpose |
|---|---|---|
| default-roles-<realm> | Yes | Keycloak composite default role. |
| offline_access | Yes | Keycloak protocol role for refresh/offline. |
| uma_authorization | Built-in | Keycloak protocol role; not CAIPE authz. |
There are no CAIPE business/resource realm roles. A CAIPE admin is represented as user:<sub> admin organization:<org_key> in OpenFGA, optionally via team:<slug>#admin admin organization:<org_key>. BOOTSTRAP_ADMIN_EMAILS is only a break-glass fallback until those durable organization tuples exist.
Resource-scoped roles (legacy)
Legacy role names such as chat_user, admin, admin_user, team_member:*, kb_reader:*, agent_user:*, and tool_user:* are cleanup targets only. New installs do not create them, and new authorization code must not check them.
Relationships are created and assigned by:
- init-idp.sh (runs in the keycloak-init job) is the first-run bootstrap escape hatch. It uses direct Keycloak admin credentials before the Web UI backend is healthy, which avoids a bootstrap cycle where BFF startup needs Keycloak config that only the BFF can create. It should keep only baseline app-realm prerequisites, IdP broker login bootstrap, optional demo personas (KEYCLOAK_SEED_DEMO_USERS=true), and operational master-realm settings such as admin-console frontendUrl. It also ensures offline_access is present on the configured realm's default-roles-<realm> composite and enables Keycloak's realm-level users-management-permissions feature with bootstrap admin credentials so the later BFF migration does not need broad manage-realm privilege.

  init-token-exchange.sh uses the same bootstrap-admin path to grant both Slack and Webex bot service accounts the realm-management impersonation role before the lower-privilege BFF reconciliation runs. Because init-token-exchange.sh runs in the always-on init-token-exchange job (gated on tokenExchange.enabled, default true) rather than the auth-reconcile job (gated on idp.enabled, default false), it also reconciles the OBO target: it enables management permissions on the CAIPE_PLATFORM_AUDIENCE client (caipe-platform), attaches both bot client policies (caipe-slack-bot-token-exchange-policy, caipe-webex-bot-token-exchange-policy) to that client's token-exchange scope-permission, and pins the AFFIRMATIVE decision strategy. This makes a fresh in-chart / local-Keycloak install (no upstream IdP) pass the bot OBO health invariants without depending on the IdP-gated auth-reconcile path; the equivalent logic in init-idp.sh remains for upstream-IdP installs and both paths are idempotent.
- The Web UI backend runs a startup Keycloak RBAC reconciliation migration (keycloak_rbac_mapping_reconciliation_v1) in TypeScript. MongoDB teams remain the source of truth; the migration repairs bot OBO token-exchange permissions for the CAIPE_PLATFORM_AUDIENCE target client, assigns bot service-account impersonation roles, pins the AFFIRMATIVE decision strategy on every scope-permission with bot client policies attached, resolves BOOTSTRAP_ADMIN_EMAILS to Keycloak user ids, creates passwordless verified placeholders for bootstrap emails that have not logged in yet, writes durable OpenFGA super-admin tuples, and records status in migration_manifest, schema_migrations, and data_schema_versions. When the BFF token cannot enable users-management-permissions itself, it falls back to reading the already-enabled permission created by the init hook and continues with policy repair. (Phase 3 of spec 2026-05-24-derive-team-from-channel removed the per-team and personal client-scope branches, the orphan-scope deletion step, and the audience-default selection step; team identity now flows through channel_team_mappings, not Keycloak.)
- Slack/Webex bot onboarding can still repair OBO prerequisites on-demand, but the BFF startup migration is the canonical environment-wide reconciliation path after bootstrap. Its last run, counts, warnings, and errors are exposed through Admin → Security & Policy → Keycloak via GET /api/admin/keycloak/migration-health, plus the persistent header migration status indicator. The same endpoint also performs a read-only Keycloak inspection for the tile details modal, returning actual realm values such as the OBO token-exchange permission strategy, attached OBO policies, and bot service-account impersonation roles. When the migration is behind or failed, the Keycloak tab's Reconcile now button invokes the same typed migration apply path for keycloak_rbac_mapping_reconciliation_v1 and refreshes the persisted health result.

  The AFFIRMATIVE decision strategy is required on every Keycloak scope-permission that ends up with bot-specific client policies attached: the caipe-platform target-audience token-exchange perm, each bot client's own token-exchange perm (caipe-slack-bot, caipe-webex-bot), and the realm-level users.impersonate perm. With Keycloak's default UNANIMOUS strategy, adding the second bot's per-client policy makes the first bot's OBO fail with Client not allowed to exchange / Client not allowed to impersonate because the other bot's clients=[...] policy votes DENY for it. The kc_attach_policy_to_scope_permission helper in init-idp.sh and the matching attach_policy_to_scope_permission helper in init-token-exchange.sh both force AFFIRMATIVE on every attach so this regression cannot reappear when a new bot client is onboarded.

  The same invariants, plus a defense-in-depth "every attached policy is type=client with a non-empty client_ids allow-list" check, are evaluated server-side by ui/src/lib/rbac/keycloak-invariants.ts#evaluateKeycloakInvariants, exposed through GET /api/admin/keycloak/migration-health as keycloak_invariants.items, and rendered as a named pass/fail/unknown list in the Admin → Security & Policy → Keycloak tile. The evaluator is a pure function over the existing read-only inspector output, so the same checks gate every realm regardless of whether it was bootstrapped by init-idp.sh or by an operator using the Keycloak Admin Console.

  The inspector hydrates each type=client policy by calling /authz/resource-server/policy/client/<id> and resolves the returned UUIDs to operator-meaningful clientId strings via a single batched /clients round-trip per probe. This is necessary because Keycloak's associatedPolicies summary endpoint returns config: {} on client-type policies, so the allow-list is invisible to a naive inspector. The hydration step also lets the panel surface the policy's resolved client_ids (e.g. clients=[caipe-slack-bot]) inline whenever a policy is flagged, so admins don't have to leave the panel to identify the right policy in the Keycloak Admin Console.
- Production caipe-ui, caipe-platform (supervisor), and Slack/Webex bot OBO client secrets are Keeper-backed Kubernetes Secrets/ExternalSecrets rather than values embedded in rendered ConfigMaps. keycloak.uiClient.secretRef or keycloak.uiClient.externalSecret feeds KEYCLOAK_UI_CLIENT_SECRET to the Keycloak init/reconcile hook, which updates the existing caipe-ui client through the Admin API so NextAuth's OIDC_CLIENT_SECRET stays aligned across upgrades and rotations. keycloak.platformClient.secretRef / keycloak.platformClient.externalSecret feeds KEYCLOAK_PLATFORM_CLIENT_SECRET the same way to replace the dev placeholder shipped in realm-config.json for the caipe-platform confidential client (consumed by the supervisor's client_credentials flow and the on-behalf-of / token-exchange target audience). Bot OBO secrets use the same single-source-of-truth pattern through keycloak.tokenExchange.externalSecret and keycloak.webexTokenExchange.externalSecret. Setting keycloak.strictClientSecrets: true adds a runtime guard at the end of init-idp.sh (covering caipe-ui + caipe-platform) and init-token-exchange.sh (covering caipe-slack-bot + caipe-webex-bot) that issues a client_credentials token request for each known dev placeholder secret and fails the Helm install if Keycloak still accepts any of them, preventing "operator forgot to set the secretRef" silent regressions. See secrets-bootstrap → Production hardening for the recommended adoption order.
- The Admin UI Team Resources panel (Admin → Teams → selected team → Resources tab, spec 104 Story 4): checking an agent or tool box calls PUT /api/admin/teams/[id]/resources, which:
  - Writes base relationship intent to OpenFGA before Mongo persistence: team:<slug>#member user agent:<id>, team:<slug>#admin manager agent:<id>, and team:<slug>#member caller tool:<prefix|*>.
  - Resolves current team members to Keycloak sub values and writes OpenFGA user:<sub> member team:<slug> membership tuples when possible.
  - Persists the selection on the team document in Mongo (team.resources = { agents, agent_admins, tools, tool_wildcard }). The Resources tab covers Use+Manage per agent and per-MCP-server tool grants plus a single "All tools" wildcard checkbox. Mongo persistence happens after OpenFGA reconciliation so a PDP outage doesn't leave Mongo ahead of the enforcement store.
- The Admin UI Team Slack Channels panel (Admin → Teams → <team> → Slack Channels tab, spec 098 US9): binds Slack channels to a team so the bot resolves the channel's effective team via channel_team_mappings. Slack runtime agent access is configured separately in the OpenFGA ReBAC Slack Channels panel, where admins grant a channel access to selected Dynamic Agents. PUT /api/admin/teams/[id]/slack-channels is an idempotent full-replace: it deactivates this team's previous mappings that aren't in the new payload (only when team_id still matches; it never touches another team's rows), upserts the active set, and denormalises a thin slack_channels array onto the team document for the team-card chip count. The UI offers a live users.conversations discovery picker (server-side SLACK_BOT_TOKEN only; lists only channels where the bot is already a member; the in-process cache TTL is admin-configurable via the Discovery cache popover next to the Find Bot-Member Slack Channels button on Admin → Integrations → Slack, default 60 minutes, range 0–1440, 0 disables caching; the same popover exposes a Refresh from Slack now button that drops the snapshot for ad-hoc bot-membership changes) plus a manual ID entry fallback for when the bot isn't in the channel yet.
- The Admin UI Team Webex Spaces panel (Admin → Teams → <team> → Webex Spaces tab, spec 2026-05-18 Webex RBAC parity): binds Webex spaces to a team through webex_space_team_mappings. Runtime agent access is configured separately in the OpenFGA ReBAC Webex Spaces panel. PUT /api/admin/teams/[id]/webex-spaces is an idempotent full-replace, preserves mappings owned by other teams, and denormalises webex_spaces onto the team document for display.
- Identity group sync: upstream Okta/AD group ids map to external_group:<provider>/<group_id> and then to CAIPE teams, for example external_group:okta/00g... member team:platform. Application code consumes the resulting team relationships; it does not check upstream group strings directly.
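The UNANIMOUS versus AFFIRMATIVE failure mode described above can be illustrated with a small voting model. This is a simplified sketch, not Keycloak's evaluation engine or API: the ClientPolicy shape, the isExchangeAllowed function, and the "no attached policies means deny" rule are all assumptions made for illustration. Only the policy names and client ids come from the text.

```typescript
// Simplified, hypothetical model of permission voting; not Keycloak's real API.
type DecisionStrategy = "UNANIMOUS" | "AFFIRMATIVE";

interface ClientPolicy {
  name: string;
  clientIds: string[]; // allow-list of OAuth client ids
}

// A client policy votes PERMIT only for clients on its allow-list.
function votesPermit(policy: ClientPolicy, clientId: string): boolean {
  return policy.clientIds.indexOf(clientId) !== -1;
}

function isExchangeAllowed(
  strategy: DecisionStrategy,
  policies: ClientPolicy[],
  clientId: string
): boolean {
  const votes = policies.map((p) => votesPermit(p, clientId));
  if (votes.length === 0) {
    return false; // assumption for this sketch: nothing attached means deny
  }
  return strategy === "UNANIMOUS"
    ? votes.every((v) => v) // every attached policy must permit
    : votes.some((v) => v); // a single permitting policy is enough
}

const attachedPolicies: ClientPolicy[] = [
  { name: "caipe-slack-bot-token-exchange-policy", clientIds: ["caipe-slack-bot"] },
  { name: "caipe-webex-bot-token-exchange-policy", clientIds: ["caipe-webex-bot"] },
];

// The Webex policy votes DENY for the Slack bot, so UNANIMOUS rejects it.
const slackUnderUnanimous = isExchangeAllowed("UNANIMOUS", attachedPolicies, "caipe-slack-bot");
// Under AFFIRMATIVE the Slack bot's own policy is sufficient.
const slackUnderAffirmative = isExchangeAllowed("AFFIRMATIVE", attachedPolicies, "caipe-slack-bot");
```

With only one bot policy attached, both strategies agree, which is why the regression only shows up when the second bot is onboarded.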
BOOTSTRAP_ADMIN_EMAILS is an explicit break-glass/initial-admin list and the source for durable email-based bootstrap seeding. The Web UI BFF resolves each email to a Keycloak sub during keycloak_rbac_mapping_reconciliation_v1; existing SSO users are left untouched, while missing users get a passwordless verified Keycloak placeholder that the IdP broker can auto-link on first login. For each resolved subject, the BFF writes the default member baseline tuples, caller on mcp_gateway:list for AgentGateway's coarse MCP ext_authz gate, admin on organization:<org_key>, manager on system_config:platform_settings, manager on mcp_server:agentgateway, and manager tuples for the built-in admin surfaces, including baseline surfaces such as admin_surface:teams and admin_surface:credentials. Keep the list small, audit it in Admin → Security & Policy → Keycloak, and replace it with team/group-mediated admin relationships when steady-state Identity Group Sync is configured.
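As a rough sketch of the tuple set named above for one resolved subject, the following builds only the grants the paragraph lists explicitly. The RelationshipTuple shape and the bootstrapAdminTuples function are hypothetical; the real reconciler also writes the default member baseline and further admin-surface tuples that are not reproduced here.

```typescript
// Hypothetical tuple shape for illustration; the real BFF types may differ.
interface RelationshipTuple {
  user: string;
  relation: string;
  object: string;
}

// Builds only the admin grants the text names explicitly for a resolved subject.
function bootstrapAdminTuples(sub: string, orgKey: string): RelationshipTuple[] {
  const user = "user:" + sub;
  const grants: Array<[string, string]> = [
    ["caller", "mcp_gateway:list"],
    ["admin", "organization:" + orgKey],
    ["manager", "system_config:platform_settings"],
    ["manager", "mcp_server:agentgateway"],
    ["manager", "admin_surface:teams"],
    ["manager", "admin_surface:credentials"],
  ];
  return grants.map((g) => ({ user: user, relation: g[0], object: g[1] }));
}
```

Calling bootstrapAdminTuples with a Keycloak sub and an organization key returns six tuples, all with the same user:<sub> subject.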
Local no-SSO development uses a dedicated dev auth provider rather than route-local bypass checks. When SSO_ENABLED=false, ALLOW_DEV_ADMIN_WHEN_SSO_DISABLED=true, and CAIPE_UNSAFE_RBAC_BYPASS=true outside production, ui/src/lib/auth/dev-auth-provider.ts supplies the stable anonymous@local / anonymous-local-dev admin principal to API middleware, admin tab gates, RAG proxy calls, and admin-surface checks. This keeps local development on the same auth-context contract as real OIDC sessions while making the insecure mode visible through logs and the UI No Auth indicator.
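The three-flag condition above can be expressed as one predicate. This is a minimal sketch: the DevAuthEnv shape and devAdminEnabled name are hypothetical, and using NODE_ENV to detect "outside production" is an assumption, since the text does not say how production is detected.

```typescript
// Hypothetical view of the relevant environment variables.
interface DevAuthEnv {
  SSO_ENABLED?: string;
  ALLOW_DEV_ADMIN_WHEN_SSO_DISABLED?: string;
  CAIPE_UNSAFE_RBAC_BYPASS?: string;
  NODE_ENV?: string; // assumption: how "outside production" is detected
}

// All three flags must be set explicitly, and the process must not be production.
function devAdminEnabled(env: DevAuthEnv): boolean {
  return (
    env.SSO_ENABLED === "false" &&
    env.ALLOW_DEV_ADMIN_WHEN_SSO_DISABLED === "true" &&
    env.CAIPE_UNSAFE_RBAC_BYPASS === "true" &&
    env.NODE_ENV !== "production"
  );
}
```

Requiring every flag to match exactly means an unset or mistyped variable leaves the insecure mode off.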
When Authorization Relationships Are Created
Keycloak realm roles are not created for CAIPE permissions. New deployments keep Keycloak focused on identity and login:
- Organization access is user:<sub> member|admin|auditor organization:<org_key> or team-mediated variants. The release migration organization_membership_backfill_v1 writes direct member organization:<org_key> tuples for existing Mongo users with a stable Keycloak sub, restoring baseline supervisor:invoke / RAG query access after the OpenFGA cutover.
- Login bootstrap access is repaired on each successful CAIPE login. If the user passes OIDC_REQUIRED_GROUP, the Web UI BFF reads the Mongo-backed default OpenFGA grant profile bundle from openfga_baseline_profiles (falling back to the built-in defaults) and writes the selected member profile tuples such as user:<sub> member organization:<org_key>, user:<sub> reader system_config:platform_settings, user:<sub> owner user_profile:<sub>, user:<sub> caller mcp_gateway:list, and selected read-only admin_surface grants. The mcp_gateway:list tuple is required before AgentGateway proxies any MCP probe or tool-call traffic. If the user also matches OIDC_REQUIRED_ADMIN_GROUP or BOOTSTRAP_ADMIN_EMAILS, login bootstrap adds the selected admin profile tuple set, including admin organization:<org_key>, manager system_config:platform_settings, manager mcp_server:agentgateway, and selected admin_surface manager grants for both baseline surfaces (for example teams, credentials, and skills) and privileged surfaces (for example openfga and migrations). Stored built-in profiles are normalized with newly required default grants so existing environments pick up added baseline admin-surface permissions after upgrade.

  Admins can update the global Org Member / Org Admin default grant profiles, create custom profiles, and assign member/admin profile overrides to teams in Admin → Security & Policy → OpenFGA → Default FGA Grants. These profiles are templates that materialize concrete OpenFGA tuples during login or all-user reconciliation. The same workspace includes OpenFGA Store: Catalog & Live Relationships, a read-only catalog of resource types, action checks, discovered resources, and paginated live OpenFGA tuples so operators can audit the full authorization store beyond the default login templates. Tuple Inspector filter inputs are apply-only; complete tuple identifiers are sent to OpenFGA as exact read filters, while partial text stays a post-read contains filter for ad-hoc inspection.

  A team override replaces the global profile for matching team users for that role; if several teams provide overrides, their selected profile grants are unioned. The result is materialized as direct user OpenFGA tuples during login or all-user reconciliation so self-profile grants and existing can_* checks remain deterministic. This is an OpenFGA reconciliation step, not a runtime realm-role fallback; users who fail the OIDC admission group are never bootstrapped.
- Team membership is user:<sub> member|admin team:<slug>.
- Resource access is team-mediated where possible, for example team:<slug>#member user agent:<id> or team:<slug>#member reader knowledge_base:<id>.
- Runtime checks use derived can_* permissions from those base relationships.
Rule of thumb: Keycloak owns identity and JWT claims; OpenFGA owns who is related to which organization, team, or resource.
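The team-override rule in the login bootstrap bullet (an override replaces the global profile, and several overrides are unioned) might be sketched as follows. Grant, GrantProfile, and resolveProfileGrants are hypothetical names for illustration, not the BFF's actual types, and profiles are reduced to flat relation/object pairs.

```typescript
// Hypothetical shapes; the stored profile documents are richer than this.
interface Grant {
  relation: string;
  object: string;
}
type GrantProfile = Grant[];

// With no team overrides the global profile applies; otherwise the overrides
// replace it and are unioned, with duplicate grants collapsed.
function resolveProfileGrants(globalProfile: GrantProfile, teamOverrides: GrantProfile[]): Grant[] {
  if (teamOverrides.length === 0) {
    return globalProfile.slice();
  }
  const seen: Record<string, boolean> = {};
  const result: Grant[] = [];
  teamOverrides.forEach((profile) => {
    profile.forEach((grant) => {
      const key = grant.relation + " " + grant.object;
      if (!seen[key]) {
        seen[key] = true;
        result.push(grant);
      }
    });
  });
  return result;
}
```

The resolved grants would then be materialized as direct user:<sub> tuples, which is what keeps later can_* checks independent of the profile documents.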
The user-facing Connections & Secrets surface is hidden unless credential
features are enabled and the signed-in Keycloak subject has
can_use_credentials organization:<org_key> in OpenFGA (granted by organization
member or admin). Specific secret metadata, use, share, manage, and audit
operations are still governed by secret_ref:<id> relationships. The Admin →
Settings → Credentials tab is stricter: it is also
feature-flagged and requires organization-admin access
(can_manage organization:<org_key>), not only the read-only
admin_surface:credentials baseline grant.
The Web UI backend now uses shared object-level OpenFGA checks for UI-owned resource surfaces whenever the authorization model has a concrete resource type. list and discover map to can_discover, runtime/content access maps to can_read or can_use, mutations map to can_write, sharing maps to can_share, and platform configuration maps to can_manage on system_config:<key>. Dynamic Agent create requires a stable Keycloak sub; private agents write user:<creator_sub> owner agent:<id>, and team-owned agents require OpenFGA team:<slug>#can_use before creation (Mongo team membership is not a fallback). Creation writes durable relationships before MongoDB persistence: user:<creator_sub> owner agent:<id>, organization:<org>#admin manager agent:<id>, team:<slug>#member user agent:<id>, team:<slug>#admin manager agent:<id>, and the agent-to-tool caller tuples. The Agent editor's "Share with Teams" multi-select extends the same two-tuple pair (team:<slug>#member user agent:<id> plus team:<slug>#admin manager agent:<id>) to every additional shared team; POST /api/dynamic-agents and PUT /api/dynamic-agents resolve each entry against the teams collection (legacy Mongo _id is accepted for backward compat but normalized to the canonical slug before persistence and OpenFGA writes), drop the owner-team duplicate, and feed both nextSharedTeamSlugs and previousSharedTeamSlugs into reconcileAgentRelationships so unchecking a team in the editor genuinely emits delete tuples instead of leaving a dangling grant. The agent_shared_team_grants_backfill_v1 migration replays this normalisation against every existing agent so the multi-select that pre-dated the 2026-05-27 fix retroactively writes the missing canonical tuples. Dynamic Agent update/delete paths check the concrete agent:<id> object before MongoDB writes or tuple reconciliation. 
Chat agent pickers (/api/dynamic-agents/available) and subagent pickers (/api/dynamic-agents/available-subagents) load enabled candidates and filter through agent#can_use; conversation creation also checks agent#can_use before storing a selected agent. LLM model list and edit routes use llm_model#can_read/#can_write/#can_delete; config-driven system models get organization:<org>#member reader llm_model:<id> and organization:<org>#admin manager llm_model:<id> tuples during seed and remain immutable. Skill config reads no longer prefilter by MongoDB visibility, owner_id, shared_with_teams, or legacy realm roles; they load candidates and let skill#can_discover/skill#can_read decide. Task Builder reads follow the same pattern with task#can_discover/task#can_read. Workflow configs are mapped to the existing OpenFGA task namespace until the authorization model grows a first-class workflow type. Dynamic Agent built-in tool metadata at GET /api/dynamic-agents/builtin-tools is not OpenFGA-gated: it returns a static catalog of supported built-in tool types (web_search, file_io, etc.), is read by every authenticated user who can open the Create Agent wizard, and per-tool authorization happens at MCP invocation time. The route requires only an authenticated session and forwards the caller's bearer token to dynamic-agents (which enforces DA_REQUIRE_BEARER). Earlier revisions gated this on tool:dynamic-agents-builtin#can_discover, but no seed/migration path ever wrote that tuple so every caller (including admins) was denied with 403; that pseudo-resource is now retired.
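The shared-team reconciliation described earlier in this section (comparing the previous and next shared team slugs so that unchecking a team emits deletes) can be sketched as a pure diff. This is not the real reconcileAgentRelationships; planSharedTeamChanges, sharedTeamPair, and the RelationshipTuple shape are hypothetical and cover only the two-tuple pair named in the text.

```typescript
// Hypothetical tuple shape for illustration.
interface RelationshipTuple {
  user: string;
  relation: string;
  object: string;
}

// The two-tuple pair the text describes for each team an agent is shared with.
function sharedTeamPair(slug: string, agentId: string): RelationshipTuple[] {
  const agent = "agent:" + agentId;
  return [
    { user: "team:" + slug + "#member", relation: "user", object: agent },
    { user: "team:" + slug + "#admin", relation: "manager", object: agent },
  ];
}

// Newly checked teams produce writes; unchecked teams produce deletes.
function planSharedTeamChanges(
  previousSlugs: string[],
  nextSlugs: string[],
  agentId: string
): { writes: RelationshipTuple[]; deletes: RelationshipTuple[] } {
  const added = nextSlugs.filter((s) => previousSlugs.indexOf(s) === -1);
  const removed = previousSlugs.filter((s) => nextSlugs.indexOf(s) === -1);
  const expand = (slugs: string[]): RelationshipTuple[] =>
    slugs.reduce<RelationshipTuple[]>((acc, s) => acc.concat(sharedTeamPair(s, agentId)), []);
  return { writes: expand(added), deletes: expand(removed) };
}
```

Passing only the next list would make removals invisible, which is the dangling-grant problem the text says the previous/next pair avoids.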
The Admin → Security & Policy → OpenFGA policy graph is a visibility surface for these same base relationships. Team-scoped graph queries include both team:<slug>#member and team:<slug>#admin usersets, so management grants such as team:<slug>#admin manager agent:<id> and team:<slug>#admin manager admin_surface:<surface> appear alongside member grants. The default graph remains a clean team/resource workspace: team and userset nodes are always visible, and resource nodes are shown when selected from the live catalog. Operators can switch graph layers to inspect stored OpenFGA tuples, read-only Slack/Webex routing metadata, subject-scoped effective can_* access paths, or authorization-model topology derived from the universal resource/action model. These layers are user-facing alternatives, not one combined overlay. Effective access is intentionally user-centered and requires a selected user before rendering broad inherited access. Model topology shows resource-type anchors first; selecting catalog resources expands only the matching type's relation and permission stacks, not concrete live resource cards. The UI resource palette and connection defaults read from the live catalog, so newly introduced resource types such as secret_ref, policy, audit_log, or llm_model appear without adding another graph-specific resource list.
Conversations use a hybrid ownership model to avoid creating high-cardinality owner tuples for every private chat. Private ownership is implicit from MongoDB (owner_subject for normalized records, legacy owner_id email fallback for old records). Explicit OpenFGA relationships remain the enforcement store for cross-boundary sharing and admin surfaces. The Web UI backend now fetches non-deleted conversation candidates without MongoDB team-sharing prefilters, then applies the same implicit-or-explicit conversation check on chat list/detail routes, Dynamic Agent v1 stream/invoke/resume/cancel proxy routes, and conversation metadata updates. This lets Slack OBO requests write their own thread conversations and bookkeeping metadata without requiring explicit owner tuples while still allowing OpenFGA-only conversation grants to appear in the UI.

The Admin → System → Migrations tab seeds a DB-managed migration_manifest from the runtime bundle, shows the active runtime migration release beside per-collection data_schema_versions, hides completed migrations by default, and runs the release migration handlers, including conversation_owner_identity_v1 for owner_subject/owner_identity_version=2, organization_membership_backfill_v1 for direct baseline organization membership, universal team-resource OpenFGA backfill, Dynamic Agent tool tuple reconciliation, Dynamic Agent organization-admin inheritance backfill, Dynamic Agent shared-team grants backfill (agent_shared_team_grants_backfill_v1, writes the missing team:<slug>#member user agent:<id> tuples for every existing agent's shared_with_teams), Slack channel and Webex space ReBAC grant backfills, messaging team mapping reconciliation, RBAC index creation, and Webex messaging ReBAC index creation. Migration runs are recorded in schema_migrations; blocking required migrations and the migration status API are admin-only surfaces.
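The implicit-or-explicit conversation check can be sketched as below. The names ConversationDoc, ConversationCaller, and canAccessConversation are hypothetical, the precedence of owner_subject over the legacy owner_id email is an assumption, and the explicit OpenFGA check is stubbed as a synchronous callback where a real check would be an asynchronous call.

```typescript
// Hypothetical document and caller shapes for illustration.
interface ConversationDoc {
  owner_subject?: string; // normalized records: Keycloak sub
  owner_id?: string; // legacy records: owner email
}

interface ConversationCaller {
  sub: string;
  email?: string;
}

// Implicit ownership comes from MongoDB fields, with no OpenFGA tuple involved.
// Assumption: owner_subject wins when present; owner_id is only a fallback.
function isImplicitOwner(conv: ConversationDoc, caller: ConversationCaller): boolean {
  if (conv.owner_subject !== undefined) {
    return conv.owner_subject === caller.sub;
  }
  return conv.owner_id !== undefined && caller.email !== undefined && conv.owner_id === caller.email;
}

// hasExplicitGrant stands in for an OpenFGA check on the concrete conversation.
function canAccessConversation(
  conv: ConversationDoc,
  caller: ConversationCaller,
  hasExplicitGrant: (sub: string) => boolean
): boolean {
  return isImplicitOwner(conv, caller) || hasExplicitGrant(caller.sub);
}
```

Checking the implicit path first means private chats never need an OpenFGA round-trip, while shared conversations still resolve through explicit tuples.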
Conversation secondary views and mutations now use the same model: shared, search, and trash routes fetch candidates and filter through the implicit-or-explicit OpenFGA helper; pin, archive, restore, and share actions require the concrete conversation relationship instead of raw owner_id equality. Skill nested routes and import overwrite paths also load candidates by id and require skill#read, skill#write, or skill#admin as appropriate; legacy skill visibility fields remain metadata only. Workflow run list/start/poll/update/delete/resume/cancel operations authorize against the parent workflow config through the temporary task namespace mapping. MCP server list/probe/update/delete and team RAG tool list/read/write/delete use concrete mcp_server and tool OpenFGA resource checks without a legacy session role bypass; MCP server create requires a stable Keycloak sub, writes mcp_server owner/team tuples before Mongo persistence, and delete removes associated OpenFGA tuples before deleting the Mongo row. Credential management adds admin_surface:credentials for connector administration and global secret metadata management, plus concrete secret_ref authorization for user metadata, use, share, manage, and audit decisions. The user-facing page separates My Secrets and My Connections, while the Admin Credentials tab owns OAuth provider configuration and all-user secret metadata actions. Browser API routes may create or rotate secret material, but raw credential retrieval is restricted to bearer-authenticated service callers using the credential-service audience.
Knowledge Base UI routes are enforced at the Web UI backend before proxying to the RAG server. caipe-ui authenticates the browser session, applies the coarse rag route gate, requires admin_surface:rag_datasources#can_manage for the Data Sources admin surface, checks concrete knowledge_base:<id> operations for Knowledge Base pages and sharing, filters datasource list responses by data_source#can_read, constrains search/MCP invocations to the caller's readable datasource IDs, and then forwards the Keycloak bearer token to RAG. RAG validates the token signature, issuer, audience, and expiry against Keycloak, then repeats OpenFGA checks for direct API/MCP requests using the caller's Keycloak sub. Human Keycloak realm roles and per-KB realm roles do not grant RAG access; OpenFGA tuples such as team:<slug>#member reader knowledge_base:<id> and team:<slug>#member reader data_source:<id> are the source of truth. Settings → Knowledge Bases / RAG Team Access can grant team access to the Data Sources admin surface, read/ingest/admin access to Knowledge Bases, or component-level datasource read/ingest/admin access. Team owners/admins may manage KB grants for their own team without platform-admin access.
The Teams dialog Knowledge Bases tab reads team_kb_ownership through /api/admin/teams/[id]/kb-assignments. During the migration window, if no ownership row exists it treats legacy teams.resources.knowledge_bases entries as read-level assignments so older team resource grants still render instead of appearing empty.
Org-admin super-grant on KB / Search / Data Sources / Graph / MCP Tools (PR 1, 2026-05-27). Any caller that holds user:<sub> can_manage organization:<org_key> in OpenFGA is always allowed on every Knowledge Base sidebar surface. The Web UI backend implements this with an explicit bypassForOrgAdmin: true option passed to requireResourcePermission / filterResourcesByPermission for knowledge_base:<id> reads (per-KB gate, datasource list filter, readable-datasource enumerator) and a matching org-admin short-circuit in constrainSearchBody so admins are not subject to filter injection. This is policy: once you are org admin, you cannot be excluded from one specific KB while staying org admin. To restore pure per-resource checks (no super-grant), set RAG_ADMIN_BYPASS_DISABLED=true. Non-admins continue to need explicit per-KB / per-team tuples. The release migration admin_surface_rag_datasources_admin_grant_v1 backfills user:<sub> manager admin_surface:rag_datasources for every previously-bootstrapped org admin so the rag + admin short-circuit in api-middleware.ts is fail-safe and not solely inheritance-dependent.
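The interaction between the bypassForOrgAdmin option and RAG_ADMIN_BYPASS_DISABLED can be sketched as follows. These helpers are hypothetical and do not reproduce the signatures of requireResourcePermission or filterResourcesByPermission; they only illustrate the decision order the paragraph describes.

```typescript
// Hypothetical option and environment shapes for illustration.
interface PermissionOptions {
  bypassForOrgAdmin?: boolean;
}

interface BypassEnv {
  RAG_ADMIN_BYPASS_DISABLED?: string;
}

// The super-grant applies only when the route opted in, the caller is an org
// admin, and the operator has not disabled the bypass.
function orgAdminBypassActive(
  options: PermissionOptions,
  callerIsOrgAdmin: boolean,
  env: BypassEnv
): boolean {
  if (env.RAG_ADMIN_BYPASS_DISABLED === "true") {
    return false;
  }
  return options.bypassForOrgAdmin === true && callerIsOrgAdmin;
}

// With the bypass active every candidate is kept; otherwise only the ids the
// caller can read through per-resource tuples survive.
function filterReadableKnowledgeBases(
  candidateIds: string[],
  readableIds: string[],
  bypassActive: boolean
): string[] {
  if (bypassActive) {
    return candidateIds.slice();
  }
  return candidateIds.filter((id) => readableIds.indexOf(id) !== -1);
}
```

Setting RAG_ADMIN_BYPASS_DISABLED=true in this model sends org admins down the same per-resource path as everyone else.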
Slack admin-surface backfill (issue #1513). The Slack Channels admin panel (/api/admin/slack/channels) is gated by requireAdminSurfaceManage(session, "slack"), which checks admin_surface:slack#can_manage. slack is in PRIVILEGED_ADMIN_SURFACES, so the login bootstrap (reconcileLoginOpenFgaAccess) writes user:<sub> manager admin_surface:slack for admins. To cover org admins bootstrapped before that seed who have not re-logged-in, the release migration admin_surface_slack_admin_grant_v1 (schema area admin_surfaces, v2 → v3) walks OpenFGA for existing user:<sub> admin organization:<key> admins and writes the matching admin-surface manager tuple. Idempotent and depends on admin_surface_rag_datasources_admin_grant_v1.
Graph tab gate + info banner + per-KB ontology filtering follow-up (PR 5, 2026-05-27). The Graph tab at /knowledge-bases/graph now consults useKbTabGates (the PR 2 hook). Non-admins with zero readable KBs see the NoKbAccessEmpty empty state. When the tab is rendered the new GraphInfoBanner reminds the user β including org admins under PR 1's super-grant β that the ontology graph is currently global: it is stored in Neo4j keyed only by _datasource_id and is not filtered per KB. Per-KB filtering needs new RAG-server work (a kb_ids filter on the /v1/graphrag/* endpoints plus an OpenFGA-driven membership probe in the BFF) and is tracked by docs/docs/specs/2026-05-27-per-kb-ontology-graph-filtering/spec.md.
Share/assign paths mirror data_source + user:* public datasources (2026-06-03). Two correctness fixes to the RAG access model:
- KB → data_source access (now via parent_kb inheritance). Query-time enforcement reads data_source:<id>#can_read, but share/assign surfaces write knowledge_base:<id> tuples, so a KB-only grant once made a datasource discoverable but not searchable. PR 3/PR 4 patched this with a mirror (mirrorKnowledgeBaseDiffToDataSource) that duplicated every knowledge_base grant onto the parallel data_source object. That mirror has been removed (spec 2026-06-03, release 0.5.8): data_source now inherits read/ingest/manage from its knowledge_base via the parent_kb tuple-to-userset edge; see "data_source → knowledge_base inheritance" below. Team grants are written once on knowledge_base:<id> plus one structural data_source:<id> parent_kb knowledge_base:<id> edge at creation; no per-team tuples are duplicated onto data_source. The one-time data_source_grants_backfill_v1 migration is superseded by parent_kb_inheritance_backfill_v1 (one edge per existing datasource).
- user:* public datasources. The reader relation on both knowledge_base and data_source now accepts the typed wildcard user:* (added to deploy/openfga/model.fga and the Helm-packaged JSON model). The new admin route POST /api/admin/rag/public-datasources writes user:* reader on both objects (and GET reports state from the data_source tuple). This is the supported mechanism for keeping pre-RBAC ("public") datasources broadly readable without maintaining an everyone-team roster. The route is gated by admin_surface admin (withOpenFgaAdminAuth): making a datasource world-readable is a privileged action and is not delegated to team admins. Surfaced in Settings → Knowledge Bases / RAG Team Access ("Public datasources" section), which also now lists a team's current per-datasource grants with per-row revoke.
data_source and mcp_tool OpenFGA types + reconcilers + BFF list filter (PR 4, 2026-05-27). deploy/openfga/model.fga and the Helm-packaged JSON authorization model include two RAG resource types; local Docker Compose mounts the same chart JSON model used by Helm so there is only one JSON artifact to keep current:
type data_source # datasource component inside the Knowledge Base feature,
# with per-datasource read and ingest/write grants
type mcp_tool # RAG custom MCP tools (PUT /v1/mcp/custom-tools/<id>),
# distinct from the existing tool:<id> used by AgentGateway
Both expose manager: [user, service_account, team#admin, organization#admin] so org admins are an explicit edge on the model, not just a runtime bypass. buildDataSourceRelationshipTupleDiff and buildMcpToolRelationshipTupleDiff (in ui/src/lib/rbac/openfga-owned-resources.ts) emit the same shared-teams diff that PR 3 introduced for knowledge_base. mcp_tool additionally emits the user relation on member tuples so team members get can_call (mirrors how mcp_server invokers are modelled).
The BFF (ui/src/app/api/rag/[...path]/route.ts) now writes mcp_tool:<tool_id> tuples on a successful PUT /v1/mcp/custom-tools/<tool_id> (sourcing the owner team slug from the request body) and filters the GET /v1/mcp/custom-tools response by mcp_tool:<id>#can_read. Org admins bypass via the PR 1 super-grant; non-admins only see tools they have a tuple on.
Two strictly-additive backfill migrations live in ui/src/lib/rbac/migrations/registry.ts:
- data_source_grants_backfill_v1 mirrors every existing knowledge_base:<id> tuple as a parallel data_source:<id> tuple, so admins who could read a KB on day zero can still read its data source on day one. No deletes. (Superseded by parent_kb inheritance; see "Unified shareable-resource RBAC" below. Retained for the bootstrap window.)
- mcp_tool_grants_backfill_v1 walks Mongo team_rag_tools and writes the canonical team:<slug>#member reader mcp_tool:<id> + team:<slug>#member user mcp_tool:<id> + team:<slug>#admin manager mcp_tool:<id> tuples. Tools without a team owner fall through to the organization#admin → manager edge.
Unified shareable-resource RBAC (spec 2026-06-03, release 0.5.8). A single shared module makes the agent owner-team + share-with-teams pattern canonical and brings RAG datasources and custom MCP tools to parity. Five composable pieces live behind it: the OpenFGA template, a reconciler core (buildShareableResourceTupleDiff / reconcileShareableResource + the buildTeamGrantTuples primitive in ui/src/lib/rbac/openfga-owned-resources.ts), a route helper (handleShareableResourceWrite in ui/src/lib/rbac/shareable-resource.ts), a Pydantic OwnedResourceMixin (ai_platform_engineering/knowledge_bases/rag/common/.../models/rag.py), and a <TeamOwnershipFields> React component (ui/src/components/rbac/TeamOwnershipFields.tsx). The agent and knowledge_base reconcilers are thin adapters over the core (their suites pass unchanged). Four structural changes ride along:
- Audit-only creator relation. agent, knowledge_base, data_source, and mcp_tool each gain define creator: [user]. It is written once at create (user:<sub> creator <type>:<id>), never deleted, and referenced by no can_* permission: provenance only, no authority. Authority for team-owned resources flows through team:<slug>#admin manager, not a personal owner tuple. A drift test (ui/src/lib/rbac/__tests__/shareable-type-drift.test.ts) fails the build if creator ever appears in a permission or the authored/chart models diverge.
- data_source → knowledge_base inheritance (parent_kb). data_source gains define parent_kb: [knowledge_base], and can_read / can_ingest / can_manage each gain ... or <perm> from parent_kb, the model's first tuple-to-userset. Team grants are written once on knowledge_base:<id>; the data source inherits read/ingest/manage via the 1:1 edge data_source:<id> parent_kb knowledge_base:<id>. This retires the mirrorKnowledgeBaseDiffToDataSource mirror (deleted from openfga-owned-resources.ts): the sharing PUT and the team KB-assignment route now write only the inheritance edge instead of duplicating per-team tuples onto data_source. Fixes the prior "see-but-not-search" gap without double-writing.
- can_call enforcement on custom MCP tool invocation. The BFF (ui/src/app/api/rag/[...path]/route.ts) checks Check(<principal>, can_call, mcp_tool:<tool_name>) before forwarding POST /v1/mcp/invoke for a custom tool (<principal> is user:<sub>, or agent:<id> for agent-initiated calls via X-Agent-Id). Built-in tool names (no mcp_tool object) are not gated; org admins bypass. The tool create/update path now persists owner_team_slug / shared_with_teams / creator_subject to MCPToolConfig and reconciles owner + shared + creator; DELETE removes all mcp_tool:<id> grants (deleteAllMcpToolRelationshipTuples) so no orphan tuples remain.
- Persistence (config = source of truth). DataSourceInfo and MCPToolConfig compose OwnedResourceMixin (creator_subject / owner_subject / owner_team_slug / shared_with_teams), persisted to Redis via the RAG server and reconciled into OpenFGA as the derived projection. The datasource sharing GET (/api/rag/kbs/[id]/sharing) now returns the real owner_team_slug + creator_subject from config (previously always null).
Ownership transfer (spec 2026-06-03, US3), unified across all three resource types. Owner team is immutable on a normal edit but transferable via the editor's "Transfer ownership" affordance, available on agents, custom MCP tools, and knowledge bases / datasources. All three share a single decision path: resolveShareableOwnershipWrite (ui/src/lib/rbac/shareable-resource.ts) runs creator-set-once, the transfer guard (canTransferResourceOwnership: caller must hold <type>:<id>#can_manage (owner-team admin) or be org admin), the not-a-member confirmation (confirm_not_member), first-set membership, and the shared-team + org-scope diff; it passes previousOwnerTeamSlug to the reconciler so the old owner team's grants are revoked rather than orphaned. canTransferResourceOwnership has exactly one caller (this resolver), so the transfer rules cannot drift between resource types. Each route applies the decision to its own persistence: the agent writes Mongo + reconcileAgentRelationships (layering org-admin/tool-caller tuples); the MCP tool persists config via the upstream PUT body and reconciles post-success; the KB sharing route does a read-modify-write upsert of the datasource config (owner_team_slug) and reconciles knowledge_base grants + the parent_kb edge. The creator tuple is never touched, preserving provenance across transfers. The synchronous handleShareableResourceWrite wrapper (resolve → reconcile → persist) is available for routes whose persistence isn't split across an external call.
FGA coverage guarantee (spec 2026-06-04-fga-coverage-guarantee). "Every current and new resource type is FGA-gated" is enforced as a build-time invariant by four CI guards, so a new type cannot land ungated:
- Layer 1: type parity. The UniversalRebacResourceType union derives from a runtime const array (UNIVERSAL_REBAC_RESOURCE_TYPE_NAMES in ui/src/types/rbac-universal.ts). ui/src/lib/rbac/__tests__/fga-type-coverage.test.ts asserts the object-type set agrees across the authored model (deploy/openfga/model.fga), the deployed chart JSON, the union, and the runtime registry (UNIVERSAL_REBAC_RESOURCE_TYPES), modulo a documented subject-only allowlist (service_account, anonymous). Adding type foo to the model fails CI until foo is registered or allowlisted. (This guard also surfaced and reconciled the anonymous type, which existed in the chart JSON but not the authored model, and registered the previously-missing data_source / mcp_tool actionable types.)
- Layer 2: enforcement manifest. ui/src/lib/rbac/fga-enforcement-manifest.ts classifies every registered type (rebac_enforced / role_gated / rebac_shadowed / not_gated) with on-disk enforcement surfaces; fga-enforcement-manifest.test.ts rejects any unclassified type, verifies enforced surfaces exist, and only permits not_gated for an explicitly documented allowlist (secret_ref today). This manifest is the single artifact an auditor reads to answer "is type X gated, and where?".
- Layer 3: create-path ownership linter. scripts/validate-fga-create-paths.py (wired into make test-rbac-lint) asserts that every ownable type's ownership-write helper (reconcile*Relationships / write_*_ownership) is both defined and called from production (non-test) code, catching the "persisted a resource but forgot to write ownership tuples" bug.
- Layer 4: default-deny backstop. ui/src/lib/rbac/__tests__/default-deny-coverage.test.ts proves, parametrized over the live registry, that a subject with no tuples is denied read/use/manage on every type, that the org-admin bypass does not fire for non-admins, and that CAIPE_UNSAFE_RBAC_BYPASS is off by default. A newly-added type is auto-covered.
Two backfills register in the 0.5.8 manifest (registry.ts), runnable from the Migrations admin tab with dry-run/sample-diff/confirm:
- parent_kb_inheritance_backfill_v1 writes one data_source:<id> parent_kb knowledge_base:<id> edge per existing datasource (supersedes the per-grant data_source_grants_backfill_v1 mirror). Strictly additive, idempotent.
- creator_from_owner_backfill_v1 writes creator from each existing personal owner tuple on the four shareable types, retaining owner (no access removed).
Per-KB Share-with-Teams panel + reconciler (PR 3, 2026-05-27). KB admins (anyone with knowledge_base:<id>#can_manage) and org admins can share a Knowledge Base with additional teams from the new /knowledge-bases/sharing/[id] page (KbSharingPanel + TeamMultiPicker). The page calls PUT /api/rag/kbs/[id]/sharing, which reconciles the team list through reconcileKnowledgeBaseRelationships. The reconciler diffs nextSharedTeamSlugs vs previousSharedTeamSlugs and emits explicit deletes for removed teams (mirrors how reconcileAgentRelationships reconciles shared agent teams), so unchecking a team revokes the team:<slug>#member reader, team:<slug>#member ingestor, and team:<slug>#admin manager tuples in a single OpenFGA write. The release migration knowledge_base_shared_team_grants_backfill_v1 walks the legacy team_kb_ownership Mongo collection and writes the canonical team:<slug>#member reader knowledge_base:<id> + team:<slug>#member ingestor knowledge_base:<id> + team:<slug>#admin manager knowledge_base:<id> tuples for every (team, kb) row so existing readers/managers retain access once the per-resource gates ship.
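The shared-team diff described above can be sketched as follows. This is an illustrative reduction, not the body of reconcileKnowledgeBaseRelationships: the Tuple shape and the teamGrantTuples / diffSharedTeams helpers are hypothetical names, and the sketch only covers the three per-team tuples named in the text.

```typescript
interface Tuple {
  user: string;
  relation: string;
  object: string;
}

// The three canonical grants a shared team receives on one knowledge base.
function teamGrantTuples(teamSlug: string, kbId: string): Tuple[] {
  const object = `knowledge_base:${kbId}`;
  return [
    { user: `team:${teamSlug}#member`, relation: "reader", object },
    { user: `team:${teamSlug}#member`, relation: "ingestor", object },
    { user: `team:${teamSlug}#admin`, relation: "manager", object },
  ];
}

// Diff the previous and next shared-team lists into explicit writes and deletes.
function diffSharedTeams(
  kbId: string,
  previousSharedTeamSlugs: string[],
  nextSharedTeamSlugs: string[],
): { writes: Tuple[]; deletes: Tuple[] } {
  const writes: Tuple[] = [];
  const deletes: Tuple[] = [];
  for (const slug of nextSharedTeamSlugs) {
    if (!previousSharedTeamSlugs.includes(slug)) writes.push(...teamGrantTuples(slug, kbId));
  }
  for (const slug of previousSharedTeamSlugs) {
    if (!nextSharedTeamSlugs.includes(slug)) deletes.push(...teamGrantTuples(slug, kbId));
  }
  return { writes, deletes };
}
```

Teams present in both lists produce no tuples, so an unchanged share list yields an empty write. Both arrays could then be sent in one OpenFGA write request, as the paragraph above describes.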
Knowledge sidebar tab gates and empty states (PR 2, 2026-05-27). The Knowledge Base sidebar (KnowledgeSidebar) now consults GET /api/rbac/kb-tab-gates and renders any tab the user cannot see as a disabled-with-tooltip control. Org admins (per the PR 1 super-grant) get every tab true with kb_count=-1 and no empty-state banner. Non-admins get a tab visibility map driven by the count of knowledge_base:<id> objects on which they have can_read (resolved by listing /v1/datasources and filtering via filterResourcesByPermission with bypassForOrgAdmin: false). When has_any_kb=false the sidebar shows a "you don't have access to any knowledge bases yet" banner and the NoKbAccessEmpty component replaces the page-level body for Search / Data Sources / Graph / MCP Tools. The same RAG_ADMIN_BYPASS_DISABLED kill switch disables the org-admin short-circuit on this route, forcing every caller through the per-resource path. The hook fails closed: until the BFF responds every tab is hidden so the UI never exposes a control the BFF would 403.
Explicit "data source author" capability (spec 2026-06-03-explicit-ingest-capability). Creating a new data source is now a distinct, explicitly-granted org-level capability, no longer multiplexed off per-KB ingestor ("push into KB X"). The model adds organization#ingestor: [team#member, team#admin] and organization#can_ingest = ingestor or admin, so only org admins (intrinsically) and members of opted-in teams can author. Org admins opt teams in via the IngestCapabilityToggle in the team dialog's Knowledge Bases tab → PUT/DELETE /api/admin/teams/[id]/ingest-capability (org-admin gated, writes/deletes team:<slug>#member ingestor organization:<key>). The kb-tab-gates route now derives can_ingest from a direct organization#can_ingest check (the old ingest_kb_count per-KB enumeration heuristic is removed) so the Ingest tab no longer appears merely because a user can push into some existing KB. The Ingest form fetches authorable teams from GET /api/rbac/ingest-teams (org admins → all teams; others → capability-holding teams the user is a member of) and requires non-admins to pick an owning team, sending owner_team_slug to the create endpoints. Server-side, authorize_datasource_create (rag/server/.../rbac.py) gates both the web (/v1/ingest/webloader/url) and Confluence (/v1/ingest/confluence/page) create paths (org-admin bypass, else organization#can_ingest and caller membership in the named owning team), while appending to an existing datasource still goes through check_datasource_access. On a successful create, write_datasource_ownership writes the ownership tuples (team:<slug>#member ingestor + team:<slug>#admin manager on the new knowledge_base:<id>, data_source:<id> parent_kb knowledge_base:<id>, and user:<sub> creator …; or a personal owner tuple when an org admin authors without a team). Every check fails closed.
Explicit "search" capability (spec 2026-06-03-explicit-search-capability). Using search is now a distinct, explicitly-granted org-level capability: the feature-level gate, layered above the narrower per-tool mcp_tool#can_call and per-datasource data_source#can_read checks. This closes a leak where a tool shared org-wide (writing organization#member caller) let every org member invoke it, and where the built-in search/fetch_document tools (which have no mcp_tool object) were never gated at all: holding can_call on a shared tool no longer, by itself, permits search. The model adds organization#searcher: [team#member, team#admin] and organization#can_search = searcher or admin, so only org admins (intrinsically) and members of opted-in teams can search. Org admins opt teams in via the SearchCapabilityToggle in the team dialog's Knowledge Bases tab → PUT/DELETE /api/admin/teams/[id]/search-capability (org-admin gated, writes/deletes team:<slug>#member searcher organization:<key>). The kb-tab-gates route gates the Search tab via a direct organization#can_search check (search = can_search, decoupled from has_any_kb; see the tab-gate composition note below). The BFF rag proxy (requireSearchCapability in ui/.../api/rag/[...path]/route.ts) enforces can_search on /v1/query and /v1/mcp/invoke (built-in + custom tools) before the per-tool can_call gate; server-side, authorize_search (rag/server/.../rbac.py) enforces the same on both endpoints as defense-in-depth for direct/agent callers. Org admins bypass (kill-switchable via RAG_ADMIN_BYPASS_DISABLED); the per-datasource result ACL (constrainSearchBody / inject_kb_filter) still narrows results afterward. Every check fails closed. This is an opt-in capability with no backfill, a deliberate behavior change so the prior over-broad search default is closed.
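The layering of the feature-level can_search gate above the per-tool can_call gate might look like the sketch below. The InvokeContext fields and decideToolInvoke are hypothetical names; the booleans stand in for OpenFGA check results, and the RAG_ADMIN_BYPASS_DISABLED kill switch is deliberately not modelled.

```typescript
type InvokeDecision = "allow" | "deny_search" | "deny_call";

interface InvokeContext {
  isOrgAdmin: boolean;   // org admins bypass both gates in this sketch
  canSearch: boolean;    // result of the organization#can_search check
  isCustomTool: boolean; // true when an mcp_tool object exists for the tool
  canCall: boolean;      // result of mcp_tool#can_call; ignored for built-in tools
}

function decideToolInvoke(ctx: InvokeContext): InvokeDecision {
  if (ctx.isOrgAdmin) return "allow";
  // Feature-level gate first: can_call alone never permits search.
  if (!ctx.canSearch) return "deny_search";
  // Per-tool gate applies only to custom tools; built-ins have no mcp_tool object.
  if (ctx.isCustomTool && !ctx.canCall) return "deny_call";
  return "allow";
}
```

A caller holding can_call on an org-wide shared tool but lacking can_search is denied at the first gate, which is the leak the paragraph above describes as closed.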
KB tab-gate composition: capability-driven tabs are decoupled from has_any_kb (2026-06-04 fix). The original PR 2 sidebar derived every tab from the readable-KB count (has_any_kb), so an org admin who granted a team the explicit Search/Ingest capability but had not yet assigned any KB left members with all tabs greyed out. The capability was unreachable, contradicting the toggle's own copy ("results are still limited to the data sources each member can read"). kb-tab-gates now composes the non-admin gates as: search = can_search; data_sources = has_any_kb OR can_ingest; mcp_tools = has_any_kb OR can_search; graph = has_any_kb (graph stays purely read-driven because it needs readable content). A capability alone is therefore enough to reach its feature even before the first KB is assigned (Data Sources resolves the author-first chicken-and-egg; Search/MCP Tools render with an empty, server-scoped result set). This changes UI tab visibility only: the server-side data paths (requireSearchCapability + authorize_search, authorize_datasource_create) re-check the same capabilities and the per-datasource ACL still narrows results, so an enabled-but-empty tab never leaks data. The KnowledgeSidebar "ask an admin to share a KB" banner is likewise suppressed when the user holds any explicit capability, so it no longer contradicts the now-enabled tabs.
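The gate composition above maps directly onto a small pure function. The sketch below is an assumption-laden illustration: KbTabInputs, KbTabGates, and composeKbTabGates are hypothetical names, the org-admin branch reflects the "every tab true" behavior of the PR 2 route, and the RAG_ADMIN_BYPASS_DISABLED kill switch is not modelled.

```typescript
interface KbTabInputs {
  isOrgAdmin: boolean;
  hasAnyKb: boolean;  // at least one knowledge_base with can_read
  canSearch: boolean; // organization#can_search
  canIngest: boolean; // organization#can_ingest
}

interface KbTabGates {
  search: boolean;
  data_sources: boolean;
  mcp_tools: boolean;
  graph: boolean;
}

function composeKbTabGates(input: KbTabInputs): KbTabGates {
  if (input.isOrgAdmin) {
    return { search: true, data_sources: true, mcp_tools: true, graph: true };
  }
  return {
    search: input.canSearch,
    data_sources: input.hasAnyKb || input.canIngest,
    mcp_tools: input.hasAnyKb || input.canSearch,
    graph: input.hasAnyKb, // purely read-driven
  };
}
```

For a non-admin with the search capability but no readable KB, this yields Search and MCP Tools enabled and Data Sources and Graph disabled, which is the "capability alone is enough" case.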
Slack and Webex bot channel/space team resolution uses Mongo mappings (channel_team_mappings, webex_space_team_mappings) to find the owning CAIPE team. Membership prechecks are OpenFGA-first: the bot checks user:<sub> member team:<slug> and only falls back to legacy teams.members when the PDP is not configured or unavailable. A negative OpenFGA decision denies the bot interaction before OBO so users get the friendly "not a member" response. (Phase 3 of spec 2026-05-24-derive-team-from-channel removed the per-team OBO scope mint: the bot now mints a team-agnostic OBO token and the channel→team mapping is the sole source of team identity downstream.)
RAG accepts both browser user tokens and ingestor client-credentials tokens from Keycloak. For local Docker Compose, OIDC_DISCOVERY_URL and INGESTOR_OIDC_DISCOVERY_URL may be either the realm base URL (http://keycloak:7080/realms/caipe) or the full .well-known/openid-configuration URL; the server normalizes both forms before fetching metadata. Keycloak service-account tokens use preferred_username=service-account-<client>, so RAG treats that token shape as machine-to-machine and assigns RBAC_CLIENT_CREDENTIALS_ROLE; human tokens are identity-only and use OpenFGA for authorization.
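The two behaviors in this paragraph (accepting either discovery URL form, and recognizing service-account tokens by their preferred_username prefix) can be sketched as below. The RAG server itself is not written in TypeScript and its normalization direction is not stated here, so normalizeDiscoveryUrl (which normalizes toward the full .well-known URL) and isServiceAccountToken are hypothetical helpers that only illustrate the idea.

```typescript
const WELL_KNOWN_SUFFIX = "/.well-known/openid-configuration";

// Accept either the realm base URL or the full discovery URL and return the full form.
function normalizeDiscoveryUrl(raw: string): string {
  const trimmed = raw.trim().replace(/\/+$/, ""); // drop trailing slashes
  return trimmed.endsWith(WELL_KNOWN_SUFFIX) ? trimmed : trimmed + WELL_KNOWN_SUFFIX;
}

// Keycloak service-account tokens carry preferred_username=service-account-<client>.
function isServiceAccountToken(claims: { preferred_username?: string }): boolean {
  return (claims.preferred_username ?? "").startsWith("service-account-");
}
```

Under this sketch, http://keycloak:7080/realms/caipe and its full .well-known/openid-configuration form resolve to the same metadata URL, and a token without preferred_username is treated as a human (identity-only) token.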
User-facing Role Cleanup
The Admin UI intentionally separates team/resource authorization from raw Keycloak plumbing:
- Keycloak system roles (default-roles-caipe, offline_access, uma_authorization) are hidden from the table and role filter because they are OIDC/UMA plumbing, not product permissions.
- Teams are the human-facing source for membership and most resource grants.
- Legacy resource roles (agent_user:*, agent_admin:*, tool_user:*, kb_reader:*, task_user:*, skill_user:*) are stale compatibility data only; cleanup scripts can remove them from local/dev realms.
GET /api/admin/users exposes raw Keycloak protocol roles for platform-admin diagnostics. Non-admin callers must hold admin_surface:users#can_read and then receive a self-scoped response containing only their own Keycloak user row. GET /api/admin/users/[id] checks user_profile:<id>#can_read, which is granted by owner user_profile:<sub> for self reads and by organization:<org>#admin for admins. The baseline Users tab can show "my access" without leaking other users; mutation controls remain admin-only. Product authorization should be read through teams and OpenFGA relationships. Local/dev realms can remove stale legacy CAIPE roles with scripts/cleanup-local-keycloak-legacy-roles.py.
Do not delete Keycloak system roles as part of cleanup. They may be required by Keycloak or OIDC flows even though CAIPE hides them from the main admin UX.
External IdP Brokering (Duo SSO, Okta, or any OIDC provider)
Badge analogy: The partner agency desk. Whether it's Duo SSO, Okta, or any other corporate identity provider, they all speak the same language (OIDC). Keycloak is the single translator: it talks to whichever agency is configured and converts their badges into standard building badges. The rest of the building never needs to know which agency originally issued the contractor's credentials.
Keycloak acts as a relying party to the upstream IdP (OIDC). From the user's perspective it's invisible: they see only the upstream IdP login page. From a security perspective:
Browser ──OIDC auth code flow──▶ Keycloak
                                   │
                                   ├──OIDC auth code──▶ Upstream IdP (Duo SSO / Okta / any OIDC)
                                   │
                                   │◀── id_token ────────  (external claims: email, name, groups)
                                   │
                           Preserves external group claims for team sync
                           Issues new Keycloak JWT with identity claims
                                   │
Browser ◀── Keycloak JWT ──────────┘
Supported upstream IdPs. The init-idp.sh script configures any OIDC provider generically via OIDC discovery (/.well-known/openid-configuration):
| Provider | IDP_ALIAS (in realm) | IDP_ISSUER example | Notes |
|---|---|---|---|
| Duo SSO | duo-sso | https://sso-xxx.sso.duosecurity.com/oidc/xxx | Uses firstname/lastname (non-standard); extra IdP mappers handle both given_name and firstname |
| Okta (OIDC) | okta-oidc | https://your-org.okta.com or https://your-org.okta.com/oauth2/default | Standard OIDC claims; groups come from Okta's groups claim (requires Okta app config) |
| Okta (SAML) | okta-saml | n/a | SAML 2.0; configured as a SAML IdP in Keycloak; attribute mappers needed for groups |
| Microsoft Entra ID (OIDC) | entra-oidc | https://login.microsoftonline.com/{tenant-id}/v2.0 | Standard OIDC; groups claim requires Entra app manifest groupMembershipClaims config |
| Microsoft Entra ID (SAML) | entra-saml | n/a | SAML 2.0; common in enterprise M365 environments |
| Generic OIDC | any alias | any OIDC-compliant issuer URL | Works as long as the provider exposes /.well-known/openid-configuration |
To wire up a new IdP, set these env vars and run init-idp.sh (or restart the init-idp container; it is idempotent):
IDP_ALIAS=okta # short alias, used in kc_idp_hint
IDP_DISPLAY_NAME="Okta SSO" # shown on Keycloak login page (if visible)
IDP_ISSUER=https://your-org.okta.com # OIDC issuer URL
IDP_CLIENT_ID=<okta-app-client-id>
IDP_CLIENT_SECRET=<okta-app-client-secret>
IDP_ACCESS_GROUP=caipe-users         # Okta group → chat_user role (optional)
IDP_ADMIN_GROUP=caipe-admins         # Okta group → admin role (optional)
KEYCLOAK_ADMIN_FRONTEND_URL=http://localhost:18080 # optional private master-realm admin URL
KEYCLOAK_FORCE_IDP_REDIRECT=true # disable local app-realm login fallback
OIDC_IDP_HINT=okta # auto-redirect browser to this IdP alias
**OIDC_IDP_HINT** (set in ui/.env.local) is passed to Keycloak as kc_idp_hint on every auth request. It skips the Keycloak login page entirely and redirects straight to the named IdP. Set it to the same value as IDP_ALIAS.
**KEYCLOAK_FORCE_IDP_REDIRECT=true** makes the app realm configured-IdP only: init-idp.sh sets the browser flow's Identity Provider Redirector defaultProvider to IDP_ALIAS, marks that redirector as required, and disables the local username/password form. This prevents CAIPE users from seeing the Keycloak login screen even if a client omits kc_idp_hint. Keep the master realm admin console on its private URL for operational access.
If the upstream OIDC app requires PKCE on the Keycloak broker flow, enable keycloak.idp.pkce.enabled=true in Helm. The chart passes IDP_PKCE_ENABLED=true and IDP_PKCE_METHOD=S256 to init-idp.sh, which adds pkceEnabled=true and pkceMethod=S256 to the Keycloak OIDC identity-provider config. Leave it disabled when the upstream IdP does not require broker-side PKCE.
**KEYCLOAK_ADMIN_FRONTEND_URL** is optional and only affects the master realm admin console. Use it when public ingress intentionally exposes only /realms/caipe and /resources; the caipe realm issuer and Duo broker redirect remain on the public Keycloak hostname.
In production, the browser-facing issuer is Keycloak, not the upstream IdP. For the Grid RBAC environment the UI uses:
OIDC_ISSUER=https://idp.caipe.example.com/realms/caipe
OIDC_CLIENT_ID=caipe-ui
OIDC_IDP_HINT=duo-sso
NEXTAUTH_URL=https://caipe.example.com
Duo credentials stay on the Keycloak IdP broker only. The Duo application's redirect URI points to Keycloak's broker endpoint (https://idp.caipe.example.com/realms/caipe/broker/duo-sso/endpoint), while the Keycloak caipe-ui client allows NextAuth's callback (https://caipe.example.com/api/auth/callback/oidc). Keycloak must be started with a public hostname such as KC_HOSTNAME=https://idp.caipe.example.com and KC_PROXY_HEADERS=xforwarded so discovery metadata and JWT iss match the public issuer. A host-specific Docker Compose overlay (kept outside this repo) sets those Keycloak values alongside the UI/RAG/Dynamic Agents OIDC_ISSUER overrides; otherwise browser login links can point back at the local dev default (http://localhost:7080).
Claim mapping chain: The IdP sends email, given_name/firstname, family_name/lastname, and groups claims. Keycloak IdP mappers write identity attributes to the local user record. Group claims are input to the identity-group-to-team sync path, which writes OpenFGA team relationships; they are not translated into CAIPE realm roles.
The login sequence diagram (one-time login + the silent first-broker-login flow) lives in Workflows › Login.
Keycloak Auth Reconciliation Job
Keycloak browser-flow and identity-provider settings are persisted inside Keycloak's database, not in Kubernetes objects. Upgrades can recreate pods and chart resources without automatically reasserting the Identity Provider Redirector, local-login disablement, first-broker-login flow, or required-action settings. The durable design is:
- Keep an idempotent keycloak-auth-reconcile Job.
- Make it chart-owned, not a Grid-only extraDeploy override.
- Run it as an early ArgoCD/Helm sync hook on install and upgrade.
- Use BeforeHookCreation, HookSucceeded cleanup.
- Remove any temporary Grid-specific reconcile job once the chart contains the same behavior.
- Reassert realm token/session lifetimes on upgrade: access tokens remain short-lived at 1 hour, SSO idle timeout is 8 hours, and the absolute SSO max lifespan is 24 hours unless overridden through the Keycloak chart values.
A CronJob is intentionally avoided. Periodic reconciliation would hide ownership drift and repeatedly exercise Keycloak admin credentials when nothing changed. The desired model is one job pod per install/upgrade event, with idempotent Admin API calls that restore the browser-flow and IdP invariants for every downstream install.
User Profile & Custom Attributes
Keycloak 26+ enforces a user profile schema. Custom attributes are silently dropped unless declared or unmanagedAttributePolicy=ADMIN_EDIT is set on the user profile API. The Helm realm import JSON must not include unmanagedAttributePolicy as a top-level realm field because Keycloak 26.3 rejects that RealmRepresentation property during import. init-idp.sh patches both supported user-profile settings after the server starts:
- Adds slack_user_id to the user profile schema with admin-only view/edit permissions
- Sets unmanagedAttributePolicy=ADMIN_EDIT so other Admin API attribute writes succeed
- Makes firstName and lastName optional, disables Keycloak's VERIFY_PROFILE required-action provider, and clears any assigned VERIFY_PROFILE actions from existing users so enterprise SSO users are never stopped at Keycloak's "Update Account Information" form
The Keycloak container exposes login/API traffic on 8080 and management health on 9000; Helm readiness/liveness probes target the management port.
Account Linking (Slack)
Three onboarding paths, evaluated in order:
- Auto-bootstrap (default, SLACK_FORCE_LINK=false): bot looks up the Slack user's email, finds an existing Keycloak user, writes slack_user_id silently. Zero user action required.
- Just-In-Time user creation (default ON, SLACK_JIT_CREATE_USER=true, spec 103): when no existing Keycloak user matches, the bot creates a federated-only shell user via POST /admin/realms/{realm}/users using the same caipe-platform admin credential. Optional domain allowlist via SLACK_JIT_ALLOWED_EMAIL_DOMAINS. 409 races are resolved by re-querying.
- Explicit link (SLACK_FORCE_LINK=true, or fallback when JIT is off / not allowed / fails): bot sends an HMAC-signed link prompt; user clicks → SSO login → slack_user_id written via Admin API.
The full sequence (including HMAC URL shape, TTL enforcement, JIT request body, error kinds, and post-link OBO flow) is in Workflows › Slack identity linking.
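The ordering of the three onboarding paths can be sketched as a single decision function. This is a hedged illustration rather than the bot's code: SlackLinkConfig, chooseOnboardingPath, and emailDomainAllowed are hypothetical names, an empty allowlist is assumed to mean "all domains allowed", a missing Slack email is assumed to fall through to the explicit link, and the "JIT fails" fallback is not modelled.

```typescript
type OnboardingPath = "auto_bootstrap" | "jit_create" | "explicit_link";

interface SlackLinkConfig {
  forceLink: boolean;               // SLACK_FORCE_LINK
  jitCreateUser: boolean;           // SLACK_JIT_CREATE_USER
  jitAllowedEmailDomains: string[]; // SLACK_JIT_ALLOWED_EMAIL_DOMAINS, parsed into a list
}

// Assumption: an empty allowlist places no restriction on email domains.
function emailDomainAllowed(email: string, allowlist: string[]): boolean {
  if (allowlist.length === 0) return true;
  const at = email.lastIndexOf("@");
  const domain = at >= 0 ? email.slice(at + 1).toLowerCase() : "";
  return allowlist.some((d) => d.toLowerCase() === domain);
}

function chooseOnboardingPath(
  cfg: SlackLinkConfig,
  email: string | undefined,
  keycloakUserExists: boolean,
): OnboardingPath {
  if (cfg.forceLink || !email) return "explicit_link";
  if (keycloakUserExists) return "auto_bootstrap";
  if (cfg.jitCreateUser && emailDomainAllowed(email, cfg.jitAllowedEmailDomains)) {
    return "jit_create";
  }
  return "explicit_link";
}
```

With the documented defaults (force link off, JIT on, no allowlist), an unknown email resolves to JIT creation and a known one to silent auto-bootstrap.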
Account Linking (Webex)
Webex uses the same Keycloak identity boundary as Slack but stores the Webex
person identifier in webex_user_id. The Webex link callback lives in the Web UI
backend at /api/auth/webex-link and uses single-use, 10-minute nonces in
webex_link_nonces; HMAC links are converted into nonce-backed completion URLs
before the user reaches the OIDC session. The callback rejects attempts to bind
one Webex person ID to multiple Keycloak users.
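The single-use, 10-minute nonce behavior can be sketched with an in-memory store. The real callback persists nonces in the webex_link_nonces collection; LinkNonceStore and its methods are hypothetical names, and the clock is passed in explicitly so the example is deterministic.

```typescript
interface NonceEntry {
  webexUserId: string;
  expiresAt: number; // epoch milliseconds
}

class LinkNonceStore {
  private entries = new Map<string, NonceEntry>();
  private ttlMs: number;

  constructor(ttlMs: number = 10 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  issue(nonce: string, webexUserId: string, now: number): void {
    this.entries.set(nonce, { webexUserId, expiresAt: now + this.ttlMs });
  }

  // Returns the bound Webex person ID at most once; reused or expired nonces yield undefined.
  consume(nonce: string, now: number): string | undefined {
    const entry = this.entries.get(nonce);
    if (!entry) return undefined;
    this.entries.delete(nonce); // single-use: removed even when expired
    return now < entry.expiresAt ? entry.webexUserId : undefined;
  }
}
```

A database-backed version would need the read-and-delete step to be atomic so two concurrent callbacks cannot both consume the same nonce.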
For group spaces, the default Webex bootstrap path keeps signed linking URLs out of the shared room. The bot posts only a generic thread notice in the group, then sends the requesting person a 1:1 Adaptive Card with the SSO linking URL. If the 1:1 send fails, the group fallback still avoids posting the signed URL publicly. Slack-style implicit/profile linking is treated as a user-choice path, not the default: it should only be enabled when Webex org and verified-email trust checks can prove the Webex profile maps unambiguously to one Keycloak user.
After linking, the Webex bot exchanges its service-account token for a user OBO
token with the selected active team scope. The Webex bot clients are
caipe-webex-bot and caipe-webex-bot-admin; the caipe-ui client receives the
webex-bot-admin-audience mapper so runtime admin calls can use
client-credentials tokens. The full runtime sequence is in
Workflows › Webex space ReBAC.
Component 2: CAIPE UI - The Reception Desk
Badge analogy: The reception desk at each department entrance. When you badge in, it reads your chip (JWT), checks your clearance level for this department, and either waves you through or says "sorry, you don't have access here." It doesn't phone HR: the badge chip already carries everything needed to make the decision.
Technically: Next.js App Router with NextAuth (Auth.js v5) for OIDC session management. Every API route handler runs requireRbacPermission() which validates the server-side session and enforces role requirements before proxying to backend services.
Authentication Flow
1. Browser visits http://localhost:3000
2. NextAuth detects no session → 302 to Keycloak (OIDC auth code flow)
3. Keycloak → Duo SSO (kc_idp_hint=duo-sso auto-redirects, user never sees KC)
4. Duo SSO login → auth code returned to Keycloak
5. Keycloak issues JWT → NextAuth exchanges code for tokens
6. NextAuth stores small session metadata in the encrypted httpOnly cookie
7. Large OAuth tokens (access, refresh, ID token) stay in the UI server's in-process token cache and are rehydrated server-side
Security note: The session cookie is httpOnly, Secure, SameSite=Lax, and encrypted with NEXTAUTH_SECRET. Large OAuth tokens are kept out of the browser cookie to avoid oversized request headers when Keycloak emits RBAC scopes, groups, or relationship-derived claims. If the UI process restarts and the in-process token cache is lost while a browser still has a valid slim session cookie, the session is marked AccessTokenMissing and the token-expiry guard sends the user back through login instead of allowing tokenless backend proxy calls. For multi-replica deployments, use sticky sessions or replace the in-process token cache with a shared store.
Server-Side Authorization (api-middleware.ts)
// Every protected API route:
const { user, session } = await getAuthFromBearerOrSession(request);
await requireRbacPermission(session, "rag", "kb.query");
Two authorization paths, plus a temporary bootstrap fallback:
- Primary PDP: requireRbacPermission() calls Keycloak Authorization Services with the caller's bearer/session access token and the requested resource#scope.
- Role-based fallback: hasRoleFallback() checks realm_access.roles from the session JWT when the PDP is unavailable or not configured.
- Bootstrap admin path: isBootstrapAdmin(email) still provides a temporary break-glass fallback from BOOTSTRAP_ADMIN_EMAILS, but the same email list is also reconciled by the BFF into durable OpenFGA tuples. Prefer the durable tuple state shown in Admin → Security & Policy → Keycloak, and remove the email fallback once group/team-admin relationships are configured.
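The relationship between these paths can be sketched as a small decision function. This is an illustration only: decideAccess, its parameters, and the ordering (an explicit PDP answer is final; the role fallback and bootstrap list apply only when the PDP is unreachable) are assumptions, not the behavior of requireRbacPermission() itself.

```typescript
type PdpOutcome = "allow" | "deny" | "unavailable";

interface Caller {
  email: string;
  realmRoles: string[]; // realm_access.roles from the session JWT
}

interface Decision {
  allowed: boolean;
  via: "pdp" | "role_fallback" | "bootstrap_admin" | "none";
}

// Assumed ordering: the PDP answer wins whenever the PDP is reachable.
// fallbackRole and bootstrapAdminEmails are hypothetical inputs.
function decideAccess(
  pdp: PdpOutcome,
  caller: Caller,
  fallbackRole: string,
  bootstrapAdminEmails: string[],
): Decision {
  if (pdp === "allow") return { allowed: true, via: "pdp" };
  if (pdp === "deny") return { allowed: false, via: "pdp" };
  // PDP unavailable or not configured: consult roles carried in the JWT.
  if (caller.realmRoles.includes(fallbackRole)) {
    return { allowed: true, via: "role_fallback" };
  }
  // Temporary break-glass path driven by the bootstrap email list.
  if (bootstrapAdminEmails.includes(caller.email)) {
    return { allowed: true, via: "bootstrap_admin" };
  }
  return { allowed: false, via: "none" };
}
```

Returning the path alongside the verdict makes it easy to log which mechanism granted access, which helps when retiring the email fallback.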
Routes that have not yet been rewritten inline no longer remain session-only: the deprecated withAuth() compatibility wrapper now uses getAuthFromBearerOrSession(), resolves the route family to a least-privilege RBAC policy, and calls requireRbacPermission() before invoking the handler. The old generic supervisor umbrella is now split for basic user surfaces: profile and identity-link routes use self_profile#read/write, user search uses user_directory#read, chat/A2A/model discovery uses chat_supervisor#invoke, settings use user_settings#read/write, feedback/NPS uses feedback#submit, session files use user_files#read/write, AI assist uses ai_assist#invoke, credentials use credential_vault#use, and platform settings reads use system_config#read. Unmatched compatibility routes fall back to admin_ui#view for GET and admin_ui#manage for writes instead of a generic baseline-use capability. These user-surface capabilities map to organization-level OpenFGA relations (can_read_self, can_manage_self, can_search_directory, can_chat, can_submit_feedback, can_use_files, can_use_ai_assist, can_use_credentials) that derive from existing organization membership/admin relationships so upgrades preserve current access automatically.
Skill authoring is a member self-service surface (2026-06-04 fix). The coarse withAuth gate for the Skill Builder CRUD (/api/skills/configs POST/PUT/DELETE) and for minting the caller's own read-only catalog API keys (/api/catalog-api-keys) maps every skill capability (skill#view, skill#invoke, skill#configure, and skill#delete) to the member-level organization relation can_use (member or admin), not the admin-only can_manage. Per-skill mutation and deletion of an existing skill are still constrained per-resource by ownership inside the route handlers via requireResourcePermission({ type: "skill", action: "write" | "delete" }); the org gate only asserts "the Skill Builder exists for you at all." Before this fix skill#configure/skill#delete fell through organizationRelationFor to can_manage, so generic members hit 403 "You do not have permission to perform this action." when creating a skill. Sharing a skill with a team in the builder uses the same member-accessible GET /api/dynamic-agents/teams "teams available for sharing" endpoint as the RAG KB / MCP / Dynamic-Agent editors; members pick from their own teams (org admins from all teams) and the save writes team:<slug>#member user skill:<id> grants.
Credential APIs additionally keep concrete secret_ref checks for payload and metadata operations. credential_vault#use only opens the credential surface; it does not authorize retrieving or using a specific secret. Slack and Webex runtime access-check APIs likewise require slack_channel:<workspace>--<channel>#can_read or webex_space:<workspace>--<space>#can_read before they evaluate the requested channel/space grant and target user grant, preventing those endpoints from becoming permission oracles for messaging resources the caller cannot inspect. Platform org admins use the standard resource-authz admin bypass because they already hold global organization:<org_key>#can_manage.
For a route-by-route breakdown of which BFF /api/* endpoints use resource-scoped PDP, which still rely on the legacy withAuth wrapper, and which have a user.role === 'admin' bypass, see the PDP Coverage Audit. The audit also documents how to read audit_event_id rows and how to add explicit route capabilities.
Dynamic Agent Execution Gate
Dynamic Agent execution is a data-plane ReBAC decision, not a Keycloak UMA management-plane decision. The Web UI backend chat proxy routes authenticate the caller, extract the stable session or bearer-token subject, and check OpenFGA before proxying execution to Dynamic Agents:
user:<sub> can_use agent:<agent_id>
For compatibility with existing team data that was originally keyed by email,
the Web UI backend and Dynamic Agents runtime check the stable subject first and then
fall back to user:<email> can_use agent:<agent_id> when the token carries an
email claim. New relationship writers should prefer Keycloak sub values.
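A minimal sketch of the subject-first, email-second check order follows. TupleCheck and canUseAgent are hypothetical names, and the checker is synchronous only to keep the example self-contained; a real OpenFGA check is a network call.

```typescript
// Hypothetical checker signature: returns true when the tuple is allowed.
type TupleCheck = (user: string, relation: string, object: string) => boolean;

function canUseAgent(
  check: TupleCheck,
  agentId: string,
  sub: string,
  email?: string,
): boolean {
  const object = `agent:${agentId}`;
  // 1. Stable Keycloak subject is always tried first.
  if (check(`user:${sub}`, "can_use", object)) return true;
  // 2. Compatibility fallback, only when the token carries an email claim.
  if (!email) return false;
  return check(`user:${email}`, "can_use", object);
}
```

Trying the stable subject first means email-keyed tuples are only consulted for legacy data, so they can be removed once relationships have been rewritten to sub values.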
The UI auth middleware also persists the verified Keycloak subject into
MongoDB users.keycloak_sub and users.metadata.keycloak_sub during session or
bearer authentication. This gives migrations and admin tooling a durable
email-to-sub mapping without depending on transient session cookies.
For browser sessions, the Web UI backend forwards the Keycloak access token to
Dynamic Agents when it is present so the runtime can bind
current_user_token and pass the same bearer to AgentGateway-backed MCP calls.
If the slim NextAuth cookie survives a UI restart but the server-side token
cache is gone, Dynamic Agents proxy routes still forward the signed-in
X-User-Context fallback instead of blocking configuration reads, AI review,
or agent save flows. Token-backed AgentGateway tool calls may still require the
user to sign in again before they can be probed or invoked.
POST /api/v1/chat/stream/start, POST /api/v1/chat/invoke,
POST /api/v1/chat/stream/resume, and POST /api/v1/chat/stream/cancel
fail closed before any backend call unless the caller can use the selected
agent and can write the target conversation through implicit ownership or an
explicit OpenFGA relationship. The older plain SSE proxy at
POST /api/chat/stream also forwards the authenticated session access token to
the supervisor backend and applies the same implicit-or-explicit conversation
write check before proxying.
The Web UI backend emits a unified RBAC Audit event for every OpenFGA agent-use decision,
and the Dynamic Agents runtime persists the same structured openfga_rebac
event to MongoDB audit_events for direct bearer-token calls. Both use
pdp=openfga; the Web UI backend stores the checked tuple in a resource reference shaped
like:
user:<sub> can_use agent:<agent_id>
This gives operators a single RBAC Audit view for runtime OpenFGA allows,
denies, and PDP-unavailable failures alongside admin ReBAC graph/check actions.
The Admin UI's RBAC Audit type filter uses All as a literal unfiltered view
over MongoDB audit_events; selecting a specific type narrows the result to
auth, openfga_rebac, tool_action, or agent_delegation. The AgentGateway
openfga-authz-bridge also writes each external ext_authz decision into the
same audit_events collection with source=openfga_authz_bridge, so
gateway-level OpenFGA allow/deny/error decisions appear without a trace backend.
MongoDB is the durable audit record and the Admin UI reads it directly.
Personal DM Experience - Phase 2 (spec 2026-05-24)
Slack DMs and Webex 1:1 spaces dispatch through a personal chain. (The
legacy active_team JWT claim has been removed; see Phase 3 demolition
notes above and the deprecated Spec 104 section below.) The BFF owns three new routes,
the Web UI broadens its agent-use check to honor team-union grants, and
both bots intercept text/slash commands before route resolution.
| Surface | Endpoint | Purpose |
|---|---|---|
| Bot → BFF | POST /api/user/check_agent_access | Pure PDP probe for the DM dispatch chain. Wraps evaluateAgentAccess(subject, agent_id) (direct grant → team-union fallback) and returns {allowed, reason, path, matched_team_slug}. No team scope needed on the token. |
| Bot → BFF | GET /api/user/accessible-agents | Pagination-friendly list of agents the calling user can can_use. Drives /caipe-list (Slack) and list (Webex). |
| Bot → BFF | GET/PUT /api/user/preferences | Per-user saved dm_default_agent_id. PUT {"dm_default_agent_id": null} clears the preference (FR-029a, invoked by /caipe-use default). |
| Web UI | requireAgentUsePermission | New ALLOW_TEAM_UNION audit reason code. When direct user→agent grants miss, the helper probes the caller's team slugs (listUserTeamSlugs) and accepts team:<slug>#member can_use agent:<id>. This aligns the Web UI with the bots, which already honored team-mediated grants. |
The bots' DM dispatch chain is:
- Thread/space override (dm_thread_overrides.OverrideStore: LRU capped at 1000 entries, no TTL, cleared on bot restart or explicit /caipe-use default).
- Saved preference (user_preferences.dm_default_agent_id via the BFF).
- Deployment dm_agent_id (SLACK_INTEGRATION_DM_AGENT_ID / WEBEX_INTEGRATION_DM_AGENT_ID).
- Platform default agent (fallback). The Slack bot resolves this from platform_config.default_agent_id (set in Admin → Settings → Default Agent, the same value the Web UI uses) and falls back to the SLACK_INTEGRATION_DEFAULT_AGENT_ID env/YAML value when the DB is unset or unreachable. So the platform default now governs Slack channel fallback and DMs in addition to the Web UI.
Every candidate is re-checked via POST /api/user/check_agent_access
before being returned. A stale override that fails the PDP is auto-cleared
with a user-visible notice. A stale saved preference emits a notice but
is NOT auto-cleared (the user may be temporarily off-team). Deployment
defaults fall through silently on deny; org defaults failing is an ops
issue, not something to spam users about. PDP unavailability returns a
clean "try again later" response.
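The chain and its stale-candidate handling can be sketched as follows. This is a simplified illustration: resolveDmAgent and its types are hypothetical, the isAllowed callback stands in for POST /api/user/check_agent_access, and PDP unavailability is not modeled.

```typescript
type CandidateSource = "override" | "preference" | "deployment" | "platform_default";

interface Candidate {
  source: CandidateSource;
  agentId: string | null; // null when that tier has nothing configured
}

interface Resolution {
  agentId: string | null;
  source: CandidateSource | null;
  notices: string[];
  clearOverride: boolean;
}

// Walk the candidates in priority order and re-check each against the PDP.
function resolveDmAgent(
  candidates: Candidate[],
  isAllowed: (agentId: string) => boolean,
): Resolution {
  const notices: string[] = [];
  let clearOverride = false;
  for (const candidate of candidates) {
    if (candidate.agentId === null) continue;
    if (isAllowed(candidate.agentId)) {
      return { agentId: candidate.agentId, source: candidate.source, notices, clearOverride };
    }
    if (candidate.source === "override") {
      clearOverride = true; // stale override is auto-cleared
      notices.push("Your thread override is no longer accessible and was cleared.");
    } else if (candidate.source === "preference") {
      // Stale preference is reported but intentionally kept.
      notices.push("Your saved default agent is not accessible right now.");
    }
    // Deployment and platform defaults fall through silently on deny.
  }
  return { agentId: null, source: null, notices, clearOverride };
}
```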
Slack registers /caipe-help, /caipe-list, and /caipe-use Bolt
commands (see docs/integrations/slack-manifest.md). Webex parses
plain-text help / list / use <agent> / use default via
text_commands.parse_command_text and intercepts them in
handle_webex_message BEFORE route resolution so an unmapped 1:1 space
still gets a useful response. Both surfaces are rate-limited per user
(default 5 commands per 30s; SLACK_COMMAND_RATE_LIMIT /
WEBEX_COMMAND_RATE_LIMIT) and reply ephemerally (Slack
response_type=ephemeral; Webex DMs the issuer in group spaces, replies
inline in 1:1).
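A per-user limit of 5 commands per 30 s could be enforced along these lines. The sliding-window approach and the CommandRateLimiter class are assumptions for illustration; the text above does not say which algorithm the bots use.

```typescript
// Hypothetical sliding-window limiter keyed by user id.
class CommandRateLimiter {
  private readonly hits = new Map<string, number[]>();

  constructor(
    private readonly limit: number = 5,
    private readonly windowMs: number = 30000,
  ) {}

  // Returns true when the command may run, false when the user is throttled.
  allow(userId: string, nowMs: number): boolean {
    const recent = (this.hits.get(userId) ?? []).filter(
      (t) => nowMs - t < this.windowMs,
    );
    if (recent.length >= this.limit) {
      this.hits.set(userId, recent);
      return false;
    }
    recent.push(nowMs);
    this.hits.set(userId, recent);
    return true;
  }
}
```

Passing the clock in as a parameter keeps the limiter deterministic in tests; a bot would typically pass Date.now().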
Credential Exchange Authorization
Connections & Secrets OAuth tokens are never returned to the browser. Browser
users can start or relink OAuth provider connections and can run a server-side
profile check, but POST /api/credentials/connections/[connection_id]/profile
refreshes the token inside the BFF and returns only redacted provider profile
metadata or, for Atlassian, redacted accessible-resource metadata when
/me returns 403. The same response includes a redacted diagnostics checklist
for the Connections page modal so users can see which validation step passed,
failed, or needs follow-up without receiving token material. The Connections page
also calls
POST /api/credentials/connections/[connection_id]/refresh automatically for
the signed-in user's expired or expiring connected providers; that endpoint
persists the refreshed token server-side and returns only non-secret refresh
metadata.
Per-user scope selection. Each user may narrow which OAuth scopes their own
connection requests. The My Connections row exposes an "Advanced settings"
panel listing the connector's allowed scopes (the connector's scopes array
is both the admin-managed upper bound and the default selection; a user can
only narrow within it, never exceed it). The connect route accepts an optional
?scopes= selection; ProviderConnectionService.startConnection runs the pure
boundScopes(connectorScopes, requested) guard, which rejects any scope outside
the connector's allowed set, or an empty selection, with 400 VALIDATION_ERROR
so a tampered request cannot escalate and a zero-scope token is never minted.
The choice is carried through the signed OAuth state cookie and persisted as
requestedScopes (and grantedScopes from the token response scope claim)
on the per-user provider_connections document. The IdP still encodes the
granted scopes inside the issued token, so the token is valid without the
stored copy; persistence exists so relink pre-fills the user's prior choice
(rather than silently reverting to the full default), the UI can show what a
connection was granted, and the selection is auditable. Existing connections
without these fields and connects that do not open Advanced settings behave
exactly as before (connector default). Changing scopes requires a relink to
take effect; it does not retroactively alter an existing token.
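A guard with the described behavior might look like the sketch below. It is named boundScopesSketch to make clear it is an illustration; the result shape and the handling of an omitted selection (fall back to the connector default) are assumptions based on the paragraph above, not the real boundScopes signature.

```typescript
type BoundScopesResult =
  | { ok: true; scopes: string[] }
  | { ok: false; status: 400; code: "VALIDATION_ERROR"; reason: string };

// Pure function: no I/O, so it is easy to unit test.
function boundScopesSketch(
  connectorScopes: string[],
  requested?: string[],
): BoundScopesResult {
  // No selection supplied: use the connector default (the full allowed set).
  if (requested === undefined) {
    return { ok: true, scopes: connectorScopes.slice() };
  }
  const unique = Array.from(new Set(requested));
  if (unique.length === 0) {
    return { ok: false, status: 400, code: "VALIDATION_ERROR", reason: "empty scope selection" };
  }
  const allowed = new Set(connectorScopes);
  const outside = unique.filter((scope) => !allowed.has(scope));
  if (outside.length > 0) {
    return {
      ok: false,
      status: 400,
      code: "VALIDATION_ERROR",
      reason: `scopes outside connector bound: ${outside.join(", ")}`,
    };
  }
  return { ok: true, scopes: unique };
}
```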
Raw token exchange is reserved for service callers. POST /api/credentials/exchange
rejects browser-origin/session requests, verifies the service bearer JWT through
the OIDC JWKS path, requires the credential-service audience header, and can
resolve credentials in two ways:
- provider_connection_id: refreshes that specific connection, returning an access token only when the JWT subject owns it or has delegated use permission.
- provider: lists provider connections owned by the JWT sub, selects that user's connected provider record, refreshes it, and returns only that user's provider access token.
When a caller asks for a specific connection that is not owned by the JWT subject, the route only returns an access token when the subject has:
user:<service-sub> can_use secret_ref:provider_connection:<connection_id>
This keeps Dynamic Agents and MCP runtimes on a narrow service-to-service path
while preserving OpenFGA as the PDP for delegated provider-token use. Dynamic
Agents uses this path behind USE_IMPERSONATION_TOKENS=true and forwards every
exchanged provider token to the MCP runtime on X-CAIPE-Provider-Token, leaving
the normal Authorization header reserved for Keycloak MCP authentication.
Per-provider token handling:
| Provider | MCP auth | Notes |
|---|---|---|
| Atlassian (Jira/Confluence) | Bearer | MCP rewrites the OAuth base URL to api.atlassian.com/ex/jira/{cloudId} (cloud-ID auto-resolved & cached) before calling the API. |
| PagerDuty | Bearer (OAuth) or Token token= (static API key) | MCP picks Authorization: Bearer <token> when X-CAIPE-Provider-Token is present, otherwise falls back to the static PAGERDUTY_API_KEY with the legacy Token token= scheme. |
| GitHub / GitLab | Bearer | Upstream expects Authorization: Bearer <token>. See the hybrid (per-user OAuth + org PAT fallback) flow below. |
| Knowledge Base (RAG) | Bearer (Keycloak) | The RAG server enforces its own Keycloak/OIDC auth on /mcp. Dynamic Agents forwards the caller's user JWT (per-user RAG group RBAC); in non-user contexts (background reconcile/probe) it mints a caipe-platform client-credentials service token. See the hybrid flow below. |
GitHub / GitLab hybrid (per-user OAuth with org-PAT fallback)
GitHub and GitLab upstreams authenticate with Authorization: Bearer <token>
and historically used a single static org PAT injected at AgentGateway via a
backendAuth policy. That made every caller act as the org service account. The
hybrid model lets connected users act as themselves while unconnected callers
transparently fall back to the org token:
- Dynamic Agents resolves the credential for each credential_sources entry. When the caller has connected their personal GitHub/GitLab account, it exchanges that per-user OAuth token. When no per-user connection resolves, it reads the static org PAT from MCPCredentialSource.fallback_env (GITHUB_PERSONAL_ACCESS_TOKEN / GITLAB_PERSONAL_ACCESS_TOKEN on the Dynamic Agents pod).
- Either way, the resolved token is forwarded to AgentGateway on X-CAIPE-Provider-Token.
- A route-level AgentGateway transformation rewrites that header into the upstream Authorization: Bearer header: '"Bearer " + default(request.headers["x-caipe-provider-token"], "")'. The static backendAuth PAT is no longer configured at the gateway; the org PAT now lives only on Dynamic Agents as a fallback.
This is a header rewrite (route-level transformation), not the backend-level
extAuthz response-header injection that AgentGateway does not support. The
X-CAIPE-Provider-Token → Authorization transformation is configured in the
standalone static config (deploy/agentgateway/config.yaml,
deploy/agentgateway/config.caipe-rbac.yaml), the Docker Compose config bridge
(deploy/agentgateway/config_bridge.py), and both Helm routing paths
(templates/agentgateway-static-config.yaml for static routing,
templates/agentgateway-mcp.yaml as an AgentgatewayPolicy for the
Gateway-API path).
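The effect of the route-level expression can be mimicked in a few lines. This is not AgentGateway code: rewriteProviderToken is a hypothetical helper that mirrors the expression's result, assuming header names have already been lower-cased.

```typescript
type Headers = Record<string, string | undefined>;

// Mirrors: "Bearer " + default(request.headers["x-caipe-provider-token"], "")
function rewriteProviderToken(headers: Headers): Headers {
  const token = headers["x-caipe-provider-token"] ?? "";
  // The upstream Authorization header is overwritten with the provider token.
  return { ...headers, authorization: "Bearer " + token };
}
```

Note that a missing provider header yields the value "Bearer " with an empty token, exactly as the default(..., "") call implies, so the upstream rejects the call rather than receiving the caller's Keycloak bearer.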
GET /api/credentials/inject/atlassian remains available as a BFF-side injector
contract for future AgentGateway integrations.
Knowledge Base (RAG) hybrid (user JWT with caipe-platform service-token fallback)
The knowledge-base MCP server is backed by the RAG server, which enforces its
own Keycloak/OIDC authentication on /mcp (validating issuer, the
caipe-platform audience, signature, and expiry). AgentGateway does not forward the
incoming Authorization header to MCP backends by default, so the RAG server
previously received no token and returned HTTP 401, surfacing in the UI as
MCP server 'knowledge-base' is unavailable. The hybrid model supplies the right
identity for each call path:
- Dynamic Agents resolves a caller_token credential_sources entry for knowledge-base. When a per-request user JWT is present (the caller's Keycloak token in current_user_token, set by JwtAuthMiddleware), it forwards that user JWT so the RAG server can apply per-user group RBAC (team:<slug>#member reader knowledge_base:<id>).
- When there is no user context, e.g. the background tool reconcile/probe (conv=-), Dynamic Agents mints (and caches until ~30 s before expiry) a caipe-platform OAuth2 client-credentials service token via Keycloak (MCP_SERVICE_OIDC_*, defaulting to INGESTOR_OIDC_CLIENT_* / KEYCLOAK_URL).
- Either token is forwarded to AgentGateway on X-CAIPE-Provider-Token, and the same route-level transformation used by GitHub/GitLab rewrites it into the upstream Authorization: Bearer header: '"Bearer " + default(request.headers["x-caipe-provider-token"], "")'.
The transform is configured for the knowledge-base route in the standalone
static config (deploy/agentgateway/config.yaml,
deploy/agentgateway/config.caipe-rbac.yaml), the config bridge
(deploy/agentgateway/config_bridge.py →
DEFAULT_MCP_ROUTE_POLICY_OVERRIDES["knowledge-base"]), and the Helm static
routing path (knowledgeBaseTarget carries providerTokenAuth: true in
_helpers.tpl). The token-resolution logic lives in
ai_platform_engineering/dynamic_agents/src/dynamic_agents/services/mcp_client.py
(caller_token kind + mint_service_client_credentials_token) and the seed row
ships in dynamic_agents/services/config.yaml.
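The "cache until about 30 s before expiry" rule for the service token can be sketched as below. The real logic lives in the Python mcp_client.py module; this TypeScript ServiceTokenCache is a hypothetical illustration of the caching rule only, with the Keycloak call replaced by an injected mint function.

```typescript
interface ServiceToken {
  accessToken: string;
  expiresAtMs: number;
}

const REFRESH_SKEW_MS = 30000; // re-mint roughly 30 s before expiry

class ServiceTokenCache {
  private cached: ServiceToken | null = null;

  // mint is a stand-in for the client-credentials request to Keycloak.
  constructor(private readonly mint: (nowMs: number) => ServiceToken) {}

  get(nowMs: number): string {
    const stale =
      this.cached === null ||
      nowMs >= this.cached.expiresAtMs - REFRESH_SKEW_MS;
    if (stale) {
      this.cached = this.mint(nowMs);
    }
    return (this.cached as ServiceToken).accessToken;
  }
}
```

The skew avoids forwarding a token that would expire while the MCP request is still in flight.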
OpenFGA Relationship Backfill
Existing MongoDB team/resource assignments can be reconciled into OpenFGA with
scripts/backfill-universal-rebac.ts. The backfill is a production migration,
not a demo seed: it reads teams, team_membership_sources,
users, platform_config, and dynamic_agents, then writes idempotent
OpenFGA tuples plus Mongo provenance in team_membership_sources and
rebac_relationships.
It records first-run status in rbac_migrations using the stable migration id
openfga_relationship_backfill_v1.
For team membership subjects, the backfill prefers users.keycloak_sub, then
users.metadata.keycloak_sub, and only falls back to legacy subject fields. If
none of those mappings exist, it may use the member email for compatibility;
operators should run the migration after users have logged in at least once so
the stable subject mapping is available.
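The subject preference order can be expressed as a short fallback chain. UserRecord and resolveMemberSubject are hypothetical; in particular, legacySubject is a placeholder for the unnamed "legacy subject fields", whose real names are not given here.

```typescript
interface UserRecord {
  keycloak_sub?: string;
  metadata?: { keycloak_sub?: string };
  legacySubject?: string; // placeholder for the legacy subject fields
  email?: string;
}

// Returns the OpenFGA user id for a team member, or null when nothing maps.
function resolveMemberSubject(user: UserRecord): string | null {
  const subject =
    user.keycloak_sub ||
    user.metadata?.keycloak_sub ||
    user.legacySubject ||
    user.email || // compatibility-only fallback
    null;
  return subject === null ? null : `user:${subject}`;
}
```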
The migration materializes team grants such as:
user:<sub> member team:<slug>
user:<sub> admin team:<slug>
team:<slug>#member user agent:<agent_id>
team:<slug>#admin manager agent:<agent_id>
team:<slug>#member caller tool:<tool_prefix>
team:<slug>#member reader knowledge_base:<kb_id>
team:<slug>#member user skill:<skill_id>
team:<slug>#member user task:<task_id>
Skill Hub imports use the same skill:<id> resource model as locally-authored
skills. Hub skills are projected into stable catalog ids
hub-<hub_id>-<hub_skill_id>, so team grants write
team:<slug>#member user skill:hub-<hub_id>-<hub_skill_id>. The skills catalog
filters non-admin list responses with can_read skill:<id> and content-bearing
runtime responses with can_use skill:<id>; admins keep full catalog visibility.
The Skill Hubs admin card can list hub metadata for callers with
admin_surface:skills#can_read, but this is operational catalog metadata only.
Which hub skills a user can read or run remains enforced through OpenFGA
skill:<id> relationships after the hub has been crawled.
Locally-created team-visible skills and bulk .zip imports now reconcile selected
teams into OpenFGA skill#user relationships as part of save/import. Skill Hubs
also persist shared_with_teams; every force-refresh grants those teams access to
all refreshed hub skill ids, and the skill_hub_team_grants_backfill_v1 migration
does the same for hub skills that were already crawled before the hub-level team
policy existed.
Per-skill team shares converge on the shared shareable-resource reconciler
(2026-06-04 fix). Locally-authored skill create/update now route their team
shares through reconcileSkillTeamShares → reconcileShareableResource (the
same tuple-core agents, RAG KBs, and MCP tools use, per #1726), with
objectType: "skill", ownerTeamSlug: null (skills are user-owned, not
team-owned), and memberRelations: ["user"]. This closes two gaps in the old
write-only grantSkillsToTeams path: PUT /api/skills/configs previously wrote
nothing to OpenFGA, so editing shared_with_teams (or demoting away from
team visibility) updated Mongo but left the old team:<slug>#member user skill:<id> grants in place, and even POST only ever wrote and never revoked.
Because the reconciler diffs previousSharedTeamSlugs against
nextSharedTeamSlugs, un-sharing or re-scoping a skill now emits the delete
tuples for dropped teams instead of orphaning them. Bulk fan-out paths (.zip
import, Skill Hub force-refresh) intentionally keep the write-only
grantSkillsToTeams helper because they have no previous per-skill state to revoke.
Config (Mongo) stays the source of truth: an OpenFGA failure during reconcile is
logged but never fails the skill save.
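The previous-versus-next diff that drives revocation can be sketched as follows. diffSkillTeamShares is a hypothetical helper that shows only the tuple arithmetic; it is not the reconcileShareableResource implementation.

```typescript
interface TupleKey {
  user: string;
  relation: string;
  object: string;
}

interface TupleDiff {
  writes: TupleKey[];
  deletes: TupleKey[];
}

// Compare the previously shared team slugs with the new selection.
function diffSkillTeamShares(
  skillId: string,
  previousSharedTeamSlugs: string[],
  nextSharedTeamSlugs: string[],
): TupleDiff {
  const previous = new Set(previousSharedTeamSlugs);
  const next = new Set(nextSharedTeamSlugs);
  const toTuple = (slug: string): TupleKey => ({
    user: `team:${slug}#member`,
    relation: "user",
    object: `skill:${skillId}`,
  });
  return {
    // Newly added teams receive a grant.
    writes: nextSharedTeamSlugs.filter((slug) => !previous.has(slug)).map(toTuple),
    // Dropped teams are revoked instead of being orphaned.
    deletes: previousSharedTeamSlugs.filter((slug) => !next.has(slug)).map(toTuple),
  };
}
```

A write-only helper corresponds to calling this with an empty previous list, which is why it can never produce deletes.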
GitHub Skill Hub crawl/import uses the hub's validated credentials_ref when
configured, otherwise falls back to the server-side GITHUB_TOKEN environment
variable on caipe-ui. In dev compose, caipe-ui receives GITHUB_TOKEN from
.env or the shell, with GITHUB_PERSONAL_ACCESS_TOKEN as a local fallback.
To preserve the default chat path after Dynamic Agent PDP enforcement, the
OpenFGA model allows a typed wildcard subject on agent.user, and the
migration writes this tuple when a dynamic default agent is configured:
user:* user agent:<default_agent_id>
Default-agent resolution matches the Admin Settings feature: persisted
platform_config.default_agent_id first, then DEFAULT_AGENT_ID, then the
supervisor fallback. Supervisor fallback is not a Dynamic Agent and does not
produce a default-agent OpenFGA tuple. The Slack bot honors the same
platform_config.default_agent_id at runtime (via its
PlatformSettingsReader, with SLACK_INTEGRATION_DEFAULT_AGENT_ID as the
env/YAML fallback), so the one Admin → Settings → Default Agent value governs
the Web UI, Slack channel fallback, and Slack DMs. The backfill is still the bulk repair path
for existing environments, but the Web UI also reconciles this typed-wildcard
grant when an admin saves a default Dynamic Agent, when an admitted user logs in,
and before the chat-available Dynamic Agent picker filters candidates through
OpenFGA. The picker now also repairs the same typed-wildcard grant for every
enabled Dynamic Agent with visibility: "global" before filtering. That keeps the
runtime picker OpenFGA-only without requiring an admin to manually provision
default-agent or global-agent tuples.
Visibility is the source of truth for the wildcard grant (2026-06-04 fix).
The user:* user agent:<id> "everyone can use" grant is now reconciled from
visibility on both create and edit, closing a global → team demote leak:
- POST /api/dynamic-agents passes globalUserAccess: visibility === "global".
- PUT /api/dynamic-agents passes globalUserAccess: finalVisibility === "global" and previousGlobalUserAccess: currentVisibility === "global", so demoting an agent from global to team (or transferring it while scoping to a team) deletes the wildcard tuple instead of leaving everyone with can_use.
- The chat-available picker (GET /api/dynamic-agents/available) is self-healing: it writes the wildcard for global agents and revokes it for every non-global agent that is not the configured platform default. filterTupleDiff drops deletes for tuples that never existed, so this is safe to run on every request and cleans up agents demoted before this fix shipped.
Before this fix, a non-default agent flipped from global to team kept its
user:* user agent:<id> grant (Mongo said team, OpenFGA still said "everyone"),
so non-owner-team members retained can_use and could both see and chat with it.
The platform-default path already revoked its own wildcard on default change,
which is why removing an agent as the platform default correctly restricted it.
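The self-healing rule reduces to a small state table. The sketch below is an assumption-laden illustration: wildcardActionFor is a hypothetical name, and skipping writes for tuples that already exist is assumed for symmetry (the text only states that deletes for non-existent tuples are dropped).

```typescript
interface AgentWildcardState {
  visibility: string; // "global", "team", or another non-global value
  isPlatformDefault: boolean;
  wildcardExists: boolean; // whether user:* user agent:<id> is present
}

type WildcardAction = "write" | "delete" | "none";

// Decide what to do with the user:* user agent:<id> tuple for one agent.
function wildcardActionFor(agent: AgentWildcardState): WildcardAction {
  const shouldExist = agent.visibility === "global" || agent.isPlatformDefault;
  if (shouldExist) {
    return agent.wildcardExists ? "none" : "write";
  }
  // Non-global and not the platform default: revoke only if present.
  return agent.wildcardExists ? "delete" : "none";
}
```

Because the function returns "none" whenever the stored state already matches the desired state, running it on every picker request does not generate redundant tuple operations.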
Default agent is public by design
Selecting an agent in Admin → Settings → Default Agent writes the
user:* user agent:<id> tuple shown above. Every signed-in user (Web UI and
Slack/Webex DMs) is then allowed to can_use that agent, regardless of their
team memberships. To keep that contract visible and reversible:
- The Admin Settings picker shows a persistent banner explaining the consequence and a confirmation modal on save.
- PATCH /api/admin/platform-config rejects requests with 400 / PUBLIC_ACCESS_NOT_ACKNOWLEDGED unless acknowledge_public_access: true is included alongside a non-null default_agent_id. Clearing the default (null) does not require the ack; it only revokes the existing wildcard.
- Each platform-default change emits a structured audit line ([AUDIT] platform_default_agent_changed) with actor, previous, next, and at so log shippers can build an audit trail without a new collection.
- PUT /api/dynamic-agents rejects demoting visibility: global → team on the current platform default with 409 / AGENT_IS_PLATFORM_DEFAULT, and DELETE /api/dynamic-agents rejects deleting it with the same code. Both paths surface a plain-English message pointing the admin back to Admin → Settings to change the platform default first. The per-agent edit page mirrors this by disabling the visibility selector with an inline note when an agent is the current platform default.
- The single source of truth for the invariant is ui/src/lib/rbac/platform-default.ts (isPlatformDefaultAgent(id)), which reads platform_config.default_agent_id with the DEFAULT_AGENT_ID env var as a fallback.
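The acknowledgement rule for PATCH /api/admin/platform-config can be sketched as a pure validator. validateDefaultAgentPatch and its result type are hypothetical; only the field names, status, and error code come from the list above.

```typescript
interface PlatformConfigPatch {
  default_agent_id: string | null;
  acknowledge_public_access?: boolean;
}

type PatchValidation =
  | { ok: true }
  | { ok: false; status: 400; code: "PUBLIC_ACCESS_NOT_ACKNOWLEDGED" };

function validateDefaultAgentPatch(patch: PlatformConfigPatch): PatchValidation {
  // Clearing the default only revokes the wildcard, so no ack is required.
  if (patch.default_agent_id === null) return { ok: true };
  // Setting a default makes the agent usable by every signed-in user.
  if (patch.acknowledge_public_access !== true) {
    return { ok: false, status: 400, code: "PUBLIC_ACCESS_NOT_ACKNOWLEDGED" };
  }
  return { ok: true };
}
```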
Per-agent MCP tool restrictions are reconciled separately with
scripts/backfill-agent-tool-openfga.ts. That migration reads each dynamic
agent's allowed_tools map and reconciles tuples shaped as:
agent:<agent_id> caller tool:<server_id>/<tool_name>
agent:<agent_id> caller tool:<server_id>/*
Run it after enabling signed agent context so existing agents have the same
AgentGateway/OpenFGA enforcement as newly-created or edited agents. Apply mode
also removes stale agent-tool tuples that no longer match allowed_tools.
Schema-versioned migration agent_org_admin_inheritance_v1 backfills the
organization-admin inheritance tuple for existing Dynamic Agents:
organization:<org>#admin manager agent:<agent_id>
This grants organization admins can_manage through the OpenFGA model without
guessing owner teams for legacy agents. New agents get this tuple during create.
Self-service resource creation is PDP-backed. A signed-in user can create a
private Dynamic Agent, MCP server, or RAG data source and receives a direct
owner tuple (user:<sub> owner <resource>:<id>), which derives read/use/write
and manage permissions in OpenFGA. Config-driven and AgentGateway-synced MCP
servers seed organization:<org>#member read/use/invoke tuples and
organization:<org>#admin manager tuples, so admitted users can discover and use
system MCP servers while config-driven records remain immutable through the Web UI
mutation APIs. For team-scoped resources, the Web UI backend
first checks user:<sub> can_use team:<slug> before creation, then writes
team-scoped tuples so team members can use/read the resource and
team:<slug>#admin can manage it. MongoDB stores resource metadata such as
owner_team_slug, but OpenFGA remains the authorization source of truth.
Token Refresh
NextAuth holds the refresh token and silently refreshes the access token before it expires. The bundled Keycloak realm keeps access tokens at 1 hour, sets SSO idle timeout to 8 hours, and uses a 24-hour absolute SSO max lifespan. As long as the user keeps using the app and Keycloak accepts the refresh token, the UI asks Keycloak for a new access token instead of expiring the browser session based on local access-token staleness. If Keycloak rejects refresh (invalid_grant), the realm session is revoked, or Keycloak is unavailable, the user is redirected to login. The access token in the session is always the current live token; it is what gets forwarded to backend services.
Identity Group Sync Hybrid Source Model
Identity Group Sync deliberately has two upstream sources:
- OIDC memberOf/groups claims on login: Keycloak imports the upstream IdP groups claim into the idp_groups user attribute and emits it to the caipe-ui client as a multivalued groups claim in ID token/userinfo responses. Login-claim reconciliation is enabled by default; set IDENTITY_SYNC_LOGIN_CLAIMS_ENABLED=false only when a deployment needs to disable it. auth-config.ts extracts the signed-in user's group claims and runs a best-effort reconciliation for only that user. This is additive and fast: it refreshes the user's managed team_membership_sources and OpenFGA user:<sub> member team:<slug> tuples without storing the full group list in the session cookie. Login is not failed if reconciliation cannot run.
- Direct Okta directory API for admin dry-runs: /api/admin/identity-group-sync/dry-run can fetch full group inventory from Okta using server-side IdP credentials when fetch_from_provider=true and provider_id is an Okta provider. This path is the authoritative source for scheduled/admin sync because it can see users who are not actively logging in, detect removals, produce drift findings, and surface users that still need identity linking before tuples can be written.
The claim path is not a replacement for direct directory querying. It improves freshness for the current user while the directory connector remains responsible for complete inventory and removals. Admins can also use GET /api/admin/identity-group-sync/claim-suggestions from the Identity Group Sync tab to read the current admin's server-side cached login claim groups, convert them through the same OIDC claim mapper, run existing rules, and review suggested teams for unmatched groups before creating anything. The endpoint intentionally does not call the OIDC userinfo endpoint on demand; if the in-process session claim cache is empty after a UI restart, the admin signs out and back in to refresh the cached claim groups. The UI lets admins filter large AD group sets, select one or more detected groups, and apply a reviewed teams_to_create plan to create those CAIPE teams without granting memberships or deleting anything.
Reviewed admin apply flows can materialize missing teams from teams_to_create when a reviewed rule has auto_create_team=true. Login-time reconciliation is intentionally narrower: it reconciles existing teams only, and never creates teams or grants access to missing teams. Later syncs may remove managed membership sources and matching OpenFGA user:<sub> member/admin team:<slug> tuples when a user's IdP claim or group membership disappears, but Identity Group Sync never deletes teams it previously created. Dry-runs include safety warnings for disruptive removals such as admin membership loss, large removal batches, and teams that would be left without active managed identity-sync memberships. Apply requests that include acknowledged removal risks require an explicit acknowledge_removal_risks=true review flag before the Web UI backend removes access. These warnings are also the operator signal to inspect orphaned or abandoned resource grants on now-empty teams.
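The narrower login-time rule (existing teams only, never creating teams) can be sketched as follows. This is a minimal illustration under stated assumptions: `login_claim_tuples`, the `group_to_team` mapping, and the `existing_teams` set are hypothetical stand-ins for the rule evaluation and team lookup the Web UI backend performs.

```python
from typing import Dict, Iterable, List, Set, Tuple

Tuple3 = Tuple[str, str, str]  # (user, relation, object)


def login_claim_tuples(sub: str,
                       groups: Iterable[str],
                       group_to_team: Dict[str, str],
                       existing_teams: Set[str]) -> List[Tuple3]:
    """Build user:<sub> member team:<slug> tuples for teams that already exist."""
    tuples: List[Tuple3] = []
    seen: Set[str] = set()
    for group in groups:
        slug = group_to_team.get(group)
        # Skip unmapped groups, missing teams, and duplicate slugs:
        # login-time reconciliation never creates a team.
        if slug is None or slug not in existing_teams or slug in seen:
            continue
        seen.add(slug)
        tuples.append((f"user:{sub}", "member", f"team:{slug}"))
    return tuples
```

Groups that map to a team that does not exist yet are simply skipped here; in the flow described above they would surface through the reviewed teams_to_create plan instead.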
Identity Group Sync admin APIs use the shared getAuthFromBearerOrSession path before requireRbacPermission, so browser sessions and validated first-party bearer tokens both reach the same OpenFGA organization checks. Keycloak identity and user administration APIs follow the same pattern: list/detail/stats require organization can_audit, while self-scoped identity detail reads use user_profile:<id>#can_read. Profile updates, team membership edits, and relationship writes require organization can_manage. Admin observability APIs for skill statistics and checkpoint persistence statistics require organization can_audit before reaching MongoDB-backed metrics; the Prometheus instant/batch proxy requires admin_surface:metrics#can_read so baseline Metrics & Health viewers can load charts. Skill Hub list metadata requires admin_surface:skills#can_read, while hub registration, refresh, update, and deletion remain admin_ui#admin operations. This keeps Playwright persona tests and future service-triggered sync previews aligned with the Web UI backend authorization path.
Manual team management is also provenance-aware. Teams created through /api/admin/teams are stamped with source=manual, status=active, and creator/updater metadata. Manual membership edits create or remove non-managed team_membership_sources rows (source_type=manual, managed=false) so automated Okta/AD/OIDC sync can prune only managed sources. The Team Details members tab reads /api/admin/identity-group-sync/teams/[teamId]/membership-sources, reconstructs the visible member list from active source rows, and displays each member's manual/synced/stale/pending source labels; the embedded teams.members[] array is legacy fallback only. Team-level admins (members with role=owner or role=admin) can fully manage teams they own: rename and description edits (PATCH /api/admin/teams/[id]), team deletion (DELETE /api/admin/teams/[id]), realm role assignments (PUT /api/admin/teams/[id]/roles), agent/tool resource grants (PUT /api/admin/teams/[id]/resources), member add/remove (POST/DELETE /api/admin/teams/[id]/members), and OpenFGA reconciliation (POST /api/admin/teams/[id]/openfga/reconcile). All six routes share a single requireTeamMembershipManagementPermission(session, actorEmail, team) guard in ui/src/lib/rbac/team-admin-guards.ts that first tries requireRbacPermission(session, "admin_ui", "admin") for the platform-admin bypass and falls back to isScopedTeamAdmin(actorEmail, team) for the team-scoped path. Unrelated team edits remain denied unless the caller is a platform admin (issue #1509).
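The two-step guard (platform-admin bypass first, team-scoped fallback second) can be sketched in a few lines. This is an illustrative analogue rather than the TypeScript guard in team-admin-guards.ts: the function name, the `members` list shape, and the case-insensitive email match are assumptions.

```python
from typing import Dict, List

TEAM_ADMIN_ROLES = {"owner", "admin"}


def can_manage_team(is_platform_admin: bool,
                    actor_email: str,
                    members: List[Dict[str, str]]) -> bool:
    """Allow platform admins outright; otherwise require owner/admin on this team."""
    if is_platform_admin:
        return True  # platform-admin bypass
    actor = actor_email.strip().lower()  # assumption: emails compare case-insensitively
    return any(
        m.get("email", "").strip().lower() == actor and m.get("role") in TEAM_ADMIN_ROLES
        for m in members
    )
```

Because the fallback only inspects the members of the team being edited, an admin of one team gets no access to unrelated teams.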
OpenFGA ReBAC Admin UI
Admins can create and visualize OpenFGA policy/resource relationships at Admin → Security & Policy → OpenFGA ReBAC.
The older user-facing Policy tab has been removed. It edited CEL tab-visibility
and legacy policy surfaces that are no longer part of the operational model.
Admin tab visibility is now a deterministic Web UI backend gate (/api/rbac/admin-tab-gates)
based on session role plus feature flags; resource authorization is modeled in
OpenFGA relationships.
The Admin UI also includes a read-only effective-permissions simulator. Platform
admins can add simulate_type=user&simulate_id=<keycloak_sub> or
simulate_type=team&simulate_id=<slug>&simulate_relation=member|admin to the
Admin URL through the View As Effective Permissions control. The browser stays
authenticated as the real admin; the Web UI backend simply evaluates tab gates as
the simulated OpenFGA subject (user:<sub> or team:<slug>#admin). Simulation is
not Keycloak impersonation, never mints a token for the target principal, and
disables mutation-oriented integration panels while previewing.
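How the simulation query parameters map to an OpenFGA subject can be sketched as below. The parameter names come from the text above; the function name and the choice to return `None` for incomplete or unknown input are assumptions made for the example.

```python
from typing import Optional
from urllib.parse import parse_qs


def simulated_subject(query: str) -> Optional[str]:
    """Translate simulate_* query parameters into an OpenFGA subject string."""
    params = {key: values[0] for key, values in parse_qs(query).items()}
    sim_type = params.get("simulate_type")
    sim_id = params.get("simulate_id")
    if not sim_type or not sim_id:
        return None  # no simulation requested
    if sim_type == "user":
        return f"user:{sim_id}"
    if sim_type == "team":
        relation = params.get("simulate_relation")
        if relation not in ("member", "admin"):
            return None  # assumption: reject a missing or unknown relation
        return f"team:{sim_id}#{relation}"
    return None
```

Only the subject used for tab-gate evaluation changes; nothing in this mapping touches the admin's own session or token.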
The UI is intentionally Web UI backend first:
- The browser loads a safe catalog from /api/admin/openfga/catalog (teams, dynamic agents, MCP tool prefixes, known KB IDs, universal resources, and OpenFGA status).
- The Access Manager combines relationship authoring and effective-access checks in one catalog-driven form. It searches/selects subjects, resources, and actions; previews the derived check relation such as team:platform#member can_use agent:incident-agent; and applies admin grant/revoke mutations through the staged ReBAC change-set API.
- The Policy Graph calls /api/admin/rebac/graph and renders tuple usersets as typed nodes and edges so relationships across the universal resource catalog are visible without reading raw tuple rows. Admins can switch between a single-team scope and an all-relationships system scope, open a full-screen graph workspace, search/select catalog resources in the palette, drag resources onto the canvas, connect valid nodes to stage grants, select existing edges to stage revokes, and save the reviewed tuple diff through /api/admin/openfga/tuples.
- The OpenFGA Tuples tab is the default sub-tab. It calls /api/admin/openfga/tuples for capped, filtered reads and admin-only deletes, and can be deep-linked with openfgaTab=tuples.
The OpenFGA ReBAC sub-tabs are URL-addressable with openfgaTab=<tab> so admins can share links to specific views. Supported values are tuples, graph, and access; old builder and explorer links open Access Manager, while legacy rag, slack, and webex links canonicalize to Settings or Integrations.
Raw OpenFGA HTTP endpoints stay on the Docker/private service network. The browser never talks to OpenFGA directly, and the Web UI backend only accepts writable tuple shapes that match the CAIPE base model (user:<sub> member team:<slug>, team:<slug>#member user/manager agent:<id>, team:<slug>#member caller tool:<prefix>, and KB base relations). Materialized can_* relations are derived by the OpenFGA model for checks and are rejected on tuple writes.
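The writable-shape rule can be sketched as an allow-list check. This is a minimal illustration covering only the three shapes spelled out above; the KB base relations are omitted because they are not enumerated here, and the identifier patterns are assumptions rather than the backend's actual validation.

```python
import re

# (user pattern, allowed base relations, object pattern); patterns are illustrative.
WRITABLE_SHAPES = [
    (re.compile(r"^user:[^#\s]+$"), {"member"}, re.compile(r"^team:[^#\s]+$")),
    (re.compile(r"^team:[^#\s]+#member$"), {"user", "manager"}, re.compile(r"^agent:\S+$")),
    (re.compile(r"^team:[^#\s]+#member$"), {"caller"}, re.compile(r"^tool:\S+$")),
]


def is_writable(user: str, relation: str, obj: str) -> bool:
    """Accept only base-model tuples; derived can_* relations are never writable."""
    if relation.startswith("can_"):
        return False
    return any(
        user_re.match(user) and relation in relations and obj_re.match(obj)
        for user_re, relations, obj_re in WRITABLE_SHAPES
    )
```

An allow-list fails closed: any shape that is not explicitly listed is rejected, including shapes added to the model later.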
The universal ReBAC catalog lives behind /api/admin/rebac/catalog. It returns the complete protected resource vocabulary, per-type action map, and discovered resource instances from teams, users, dynamic agents, AgentGateway's mcp_gateway:list gate, MCP servers/tools, KB ownership, Slack mappings, Webex mappings, conversations, and built-in admin/system resources. /api/admin/rebac/enforcement-status reports transition state for every resource type (not_gated, role_gated, rebac_shadowed, rebac_enforced, or deprecated) by merging defaults with rebac_enforcement_status overrides. The older OpenFGA admin endpoints use the same session-or-bearer authentication path, and /api/admin/openfga/catalog now embeds these universal resources while preserving its legacy agents, tools, and knowledge_bases picker shape.
Policy authoring is staged through policy_change_sets instead of direct browser-to-tuple writes. The Web UI backend creates a draft change set, validates every requested grant/revocation against the universal action vocabulary, delegated-scope guardrails, circular-grant checks, and last-admin risk, then applies the validated diff to OpenFGA and records provenance in rebac_relationships. The OpenFGA admin tab uses this create/validate/apply sequence for Access Manager edits, graph edits, and tuple revocations so administrators see the staged diff before the write is committed.
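The create/validate/apply sequence can be sketched with an in-memory tuple store. This is a simplified illustration: the `VOCAB` table, the `ChangeSet` shape, and the status names are hypothetical, and delegated-scope guardrails, circular-grant checks, and provenance recording are left out.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Tuple3 = Tuple[str, str, str]  # (user, relation, object)

# Hypothetical vocabulary of writable relations per object type.
VOCAB = {"agent": {"user", "manager"}, "tool": {"caller"}, "team": {"member", "admin"}}


@dataclass
class ChangeSet:
    grants: List[Tuple3]
    revokes: List[Tuple3]
    status: str = "draft"
    errors: List[str] = field(default_factory=list)


def validate(cs: ChangeSet, store: Set[Tuple3]) -> ChangeSet:
    """Check every operation against the vocabulary and the last-admin rule."""
    cs.errors = []
    for _, relation, obj in cs.grants + cs.revokes:
        if relation not in VOCAB.get(obj.split(":", 1)[0], set()):
            cs.errors.append(f"unknown relation {relation} for {obj}")
    after = (store - set(cs.revokes)) | set(cs.grants)
    for _, relation, obj in cs.revokes:
        if relation == "admin" and not any(r == "admin" and o == obj for _, r, o in after):
            cs.errors.append(f"revocation would remove the last admin of {obj}")
    cs.status = "rejected" if cs.errors else "validated"
    return cs


def apply_change_set(cs: ChangeSet, store: Set[Tuple3]) -> Set[Tuple3]:
    """Return the new store; refuse anything that has not passed validation."""
    if cs.status != "validated":
        raise ValueError("change set must be validated before apply")
    cs.status = "applied"
    return (store - set(cs.revokes)) | set(cs.grants)
```

Keeping validation separate from apply is what lets the UI show the staged diff, with any blocking errors, before a write is committed.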
Graph and access explanation APIs read OpenFGA tuples and join them with rebac_relationships provenance. /api/admin/rebac/graph supports all-relationship views and scoped filters for team, subject, resource, and Slack channel, returning source metadata with each edge. /api/admin/rebac/check runs the same universal relationship check and explains allow outcomes with the recorded source path or deny outcomes with the missing OpenFGA prerequisite. Access Manager is catalog-driven: operators can search/select team, user, Slack channel, Webex space, external group, or service-account subjects and check any catalog resource type/action, including AgentGateway mcp_gateway:list and tool can_call paths. Admins can remediate denied results by creating the selected relationship, or revoke allowed results, through the same staged change-set validation/apply path used by the graph editor. The legacy /api/admin/openfga/graph endpoint delegates to the universal graph service so older UI code gets the same source-aware graph.
Slack channel ReBAC is managed through /api/admin/slack/channels and the per-channel resources/routes/access-check routes under /api/admin/slack/channels/[workspaceId]/[channelId]. The [workspaceId] value is the configured workspace alias from SLACK_WORKSPACE_ALIAS (for example, CAIPE), not Slack's opaque team_id. Channel management is team-owned: assigning a channel to a team writes team:<slug>#member user slack_channel:<workspace>--<channel> and team:<slug>#admin manager slack_channel:<workspace>--<channel>, and per-channel resource/route mutations check can_manage on that Slack channel instead of requiring global Admin UI permission. The top-level Slack channel list is resource-scoped: a non-admin caller sees only channels where OpenFGA grants can_read or can_manage, with can_manage returned for the UI. Admin tab gates also open the Integrations → Slack tab when the caller can manage at least one concrete Slack channel. The admin UI exposes the currently enforced Slack runtime path: channel-agent associations write base OpenFGA tuples such as slack_channel:CAIPE--C0123456789 user agent:<id>; runtime checks ask for derived can_use.
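The two tuples written when a channel is assigned to a team follow directly from the shapes above. A minimal sketch, assuming a hypothetical helper name and plain (user, relation, object) triples:

```python
from typing import List, Tuple

Tuple3 = Tuple[str, str, str]  # (user, relation, object)


def channel_assignment_tuples(team_slug: str, workspace_alias: str, channel_id: str) -> List[Tuple3]:
    """Build the team-to-channel tuples for one channel assignment."""
    channel = f"slack_channel:{workspace_alias}--{channel_id}"
    return [
        (f"team:{team_slug}#member", "user", channel),    # members may use the channel
        (f"team:{team_slug}#admin", "manager", channel),  # team admins may manage it
    ]
```

For example, `channel_assignment_tuples("platform", "CAIPE", "C0123456789")` yields tuples on the object `slack_channel:CAIPE--C0123456789`, keyed by the workspace alias rather than Slack's team_id.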
Team-cascade sharing model (intentional). The channel-dispatch access-check at
/api/integrations/slack/channels/[workspaceId]/[channelId]/access-check sends user_subject = "team:<slug>#member" (the channel's mapped team) rather than user:<sub>. This is the documented policy: any agent associated with a channel that is mapped to a team is callable in that channel by every member of that team, including members who were never granted the agent directly via user:<sub> can_use agent:<id>. The DM-dispatch chain (POST /api/user/check_agent_access) is user-scoped and is not subject to this cascade. The Slack and Webex ReBAC admin panels surface this trade-off both in the top-of-card "Sharing model" callout and in a per-channel heads-up under the agent-association form. See Workflows → Sharing model: assigning a channel to a team transitively shares its agents for the full rationale.
OpenFGA is the source of truth for whether a Slack channel may invoke a Dynamic Agent. slack_channel_agent_routes is retained only for dependent dispatch metadata such as listen mode and priority, and a metadata row is valid only while the matching OpenFGA tuple exists. The Slack bot resolves candidate agents from OpenFGA first, joins optional Mongo route metadata for ordering/listen filters, and never lets a stale Mongo route keep a deleted OpenFGA association alive. Deleting a channel-agent association removes both the OpenFGA tuple and the saved route metadata row. Route misses fail closed; user-visible Slack notices are reserved for explicit invocations, while ambient plain channel messages stay silent even when route diagnostics are recorded. The Admin Slack Channels panel exposes runtime diagnostics for the selected channel so operators can see OpenFGA read failures, stale Mongo metadata, missing tuples, listen-mode mismatches, and the latest Slack runtime audit error without checking container logs. Fix buttons in diagnostics repair common drift by removing stale route metadata when its OpenFGA tuple is gone, or by switching a tuple-backed route to listen to both mentions and plain messages.
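The "OpenFGA first, Mongo metadata second" resolution order can be sketched as below. This is an illustrative sketch, not the bot's resolver: the row shape, the ascending priority order, and the default of `listen: all` for a tuple-backed agent without metadata are assumptions.

```python
from typing import Dict, Iterable, List


def resolve_candidates(openfga_agent_ids: Iterable[str],
                       mongo_routes: List[Dict],
                       event_kind: str) -> List[str]:
    """Return agent ids allowed by OpenFGA, filtered and ordered by optional metadata.

    event_kind is "mention" or "message".
    """
    metadata = {row["agent_id"]: row for row in mongo_routes}
    ranked = []
    for agent_id in openfga_agent_ids:  # only tuple-backed agents are ever considered
        row = metadata.get(agent_id, {})
        listen = row.get("listen", "all")  # assumption: no metadata means listen to all
        if listen not in ("all", event_kind):
            continue
        ranked.append((row.get("priority", 0), agent_id))
    ranked.sort()  # assumption: lower priority value is tried first
    return [agent_id for _, agent_id in ranked]
```

Because the loop iterates over the OpenFGA result, a Mongo row whose tuple has been deleted can never produce a candidate, and an empty result is treated as a route miss that fails closed.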
Slack bot deployments now default to SLACK_AGENT_ROUTES_MODE=db_prefer, so OpenFGA-backed UI-managed routes are preferred when present and static Slack bot config remains the fallback; config remains available for static-only environments and db_only is available for canaries that should ignore static route bindings. At runtime, the Slack bot maps any incoming Slack team_id to SLACK_WORKSPACE_ALIAS, resolves the channel's team from channel_team_mappings, mints the user's team-scoped OBO token, selects an OpenFGA-backed channel agent, and authorizes the selected agent before dispatch. The request is denied unless both the channel association and the user's team/resource relationship allow the selected agent.
For hands-off channel onboarding, operators may set SLACK_AUTO_ASSIGN_UNMAPPED_CHANNELS=true with SLACK_DEFAULT_TEAM_SLUG and SLACK_DEFAULT_AGENT_ID. When a group-channel message arrives and no active channel_team_mappings row exists, the Slack bot writes the configured channel-team mapping, writes slack_channel:<workspace_alias>--<channel_id> user agent:<default_agent_id> to OpenFGA, and stores a slack_channel_agent_routes metadata row with listen: all. The feature is disabled by default in Helm and fails closed if MongoDB, OpenFGA, the default team, or either required env var is missing; existing active channel mappings are never overwritten.
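The fail-closed preconditions for auto-assignment can be sketched as a planning function that returns nothing unless every requirement holds. The env var names and tuple shape come from the text above; the function name, the boolean health flags, and the returned dictionary layout are assumptions for illustration.

```python
from typing import Dict, Optional


def plan_auto_assign(env: Dict[str, str], workspace_alias: str, channel_id: str,
                     has_active_mapping: bool, mongo_ok: bool,
                     openfga_ok: bool, team_exists: bool) -> Optional[Dict]:
    """Return the writes for an unmapped channel, or None when any precondition fails."""
    if env.get("SLACK_AUTO_ASSIGN_UNMAPPED_CHANNELS", "false").lower() != "true":
        return None  # feature is off by default
    team = env.get("SLACK_DEFAULT_TEAM_SLUG")
    agent = env.get("SLACK_DEFAULT_AGENT_ID")
    if not team or not agent:
        return None  # both env vars are required
    if has_active_mapping:
        return None  # never overwrite an existing active mapping
    if not (mongo_ok and openfga_ok and team_exists):
        return None  # fail closed on missing dependencies
    channel = f"slack_channel:{workspace_alias}--{channel_id}"
    return {
        "mapping": {"channel_id": channel_id, "team_slug": team},
        "tuple": (channel, "user", f"agent:{agent}"),
        "route": {"agent_id": agent, "listen": "all"},
    }
```

Returning a plan rather than writing directly keeps the precondition logic testable without MongoDB or OpenFGA.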
For migrations, the Slack Channels panel includes Slack Channel Association Default backed by GET/POST /api/admin/slack/channels/defaults. The UI shows the currently configured default team and Dynamic Agent from SLACK_DEFAULT_TEAM_SLUG and SLACK_DEFAULT_AGENT_ID. Admins may apply those defaults to all managed channels, or use bot-member discovery to select individual channels and override the team and Dynamic Agent per selected row. The Web UI backend writes the selected channel-team mappings, ensures slack_channel:<workspace_alias>--<channel_id> user agent:<id>, ensures team:<slug>#member user agent:<id> for each selected team/agent pair, ensures the inbound team:<slug>#member user slack_channel:<workspace>--<channel> and team:<slug>#admin manager slack_channel:<workspace>--<channel> visibility tuples (so the channel actually shows as Setup completed in the listing; /api/admin/slack/channels filters each row by can_read and silently drops channels with no inbound team→channel tuples), and optionally creates matching bootstrap routes in slack_channel_agent_routes. Those bootstrap routes are stamped with source_type: "bootstrap" and users.listen: "all" so the bot responds to both @mentions and plain channel messages by default. Admins who want quieter behaviour can narrow individual routes to mention or message from the Step-2a route picker. The same listen: "all" default applies to route rows the Web UI lazily materialises from an OpenFGA tuple that has no Mongo metadata yet (the "ghost route" path in /api/admin/slack/channels/{workspaceId}/{channelId}/routes); the equivalent Webex spaces endpoint mirrors this default. This is intentionally an explicit bulk write rather than an OpenFGA wildcard/default subject, so every relationship appears in the tuple store and Policy Graph.
The shared helpers slackChannelTeamVisibilityRelationships and webexSpaceTeamVisibilityRelationships are used by the onboarding writers and the messaging_team_visibility_v1 migration so admin-PUT, onboarding-defaults, and the backfill path all converge on identical tuple shapes.
The Slack Channels panel also includes Slack Bot Runtime Sync for the running bot process. Browser requests still terminate at the Web UI backend: caipe-ui checks the signed-in user's admin_ui#admin permission, obtains a Keycloak client-credentials token for the Slack bot admin audience, and calls the Slack bot's internal admin API. The Slack bot verifies that token with Keycloak JWKS before returning route-cache status, clearing its in-memory route cache, or upserting static YAML channel-agent routes into slack_channel_agent_routes and OpenFGA. Local no-SSO development can opt into an explicit dev-token path with SLACK_BOT_ADMIN_DEV_AUTH_ENABLED=true on the Web UI and SLACK_ADMIN_DEV_AUTH_ENABLED=true on the bot, with matching dev token values; this bypasses Keycloak only for the internal Slack bot admin API and must not be enabled in shared environments. The sync operation is intentionally upsert-only: it creates missing records and updates matching channel/agent metadata, but it does not delete existing UI-managed associations that are absent from static config.
The Preview YAML Import dry run returns the full per-channel/agent breakdown of what an import will write: channel name, each agent's listen modes, user_list/bot_list, overthink, and escalation (VictorOps/emoji/users/delete_admins). The Web UI backend annotates each channel with the team it is currently mapped to (from channel_team_mappings), flagging channels with no team so admins can see, before importing, which channels will still need a team assignment (via the Onboard tab) to become invokable. The static YAML is treated as a seed: once a channel exists in the DB, the per-channel route editor in the Configured tab can view and edit every field (users/bots enable + listen + allow lists, overthink, and escalation) and round-trips them through PUT /api/admin/slack/channels/{workspaceId}/{channelId}/routes without dropping the fields the import wrote. To reduce ID-copying mistakes, the editor uses Web UI backend lookups (/api/admin/slack/users/lookup and /api/admin/slack/emoji) for Slack user IDs and custom emoji names; those calls keep SLACK_BOT_TOKEN server-side, return minimal display fields, and fall back to raw ID/name entry when Slack lookup scopes are missing. Escalation configured on a DB route (not just static YAML) is honored at runtime: the bot's escalation/feedback handlers fall back to SlackAgentRouteResolver.escalation_for(...) when a channel has no static binding, so "Get help" works for UI-managed channels.
The Advanced tab also exposes a superadmin VictorOps escalation agent picker, persisted as platform_config.slack_victorops_escalation_agent_id via PATCH /api/admin/platform-config. The bot reads it at runtime (DB value first, SLACK_INTEGRATION_VICTOROPS_AGENT_ID env/YAML as fallback) when VictorOps escalation fires. Unlike the platform default agent, this setting grants no user access (it is only the agent the bot queries for on-call lookups), so it writes no OpenFGA tuple and requires no public-access acknowledgement.
Webex space ReBAC follows the same team-ownership shape with Webex-specific types and storage:
webex_space:<workspace_alias>--<space_id> user agent:<id> is the OpenFGA source
of truth, while webex_space_agent_routes stores dependent dispatch metadata
such as listen mode, priority, and enabled state. Team-space assignment writes
team:<slug>#member user webex_space:<workspace>--<space> and
team:<slug>#admin manager webex_space:<workspace>--<space>, and per-space
grant/route/diagnostic APIs check the derived Webex space permissions. The top-level
Webex space list is also resource-scoped, and the Integrations → Webex tab appears
for non-admin users who can manage at least one concrete webex_space. The Webex bot never trusts
workspace identifiers from incoming Webex events; policy namespace selection
comes from WEBEX_WORKSPACE_ALIAS or WEBEX_WORKSPACE_ID. Route reads use
server-side OpenFGA tuple filters for the selected webex_space subject and fail
closed on PDP outages.
Threaded Webex replies are anchored with Webex parentId. After an allow decision
and before Dynamic Agent dispatch, the bot may fetch bounded prior thread context
from the Webex Messages API: the root message plus recent replies filtered by the
same parentId and capped by WEBEX_THREAD_CONTEXT_MAX_MESSAGES /
WEBEX_THREAD_CONTEXT_MAX_CHARS. The context is sent only to the already selected
and authorized Dynamic Agent under the user's OBO token; fetch failures do not
weaken authorization and fall back to sending only the current message. Bot replies
include the selected agent_id and tell users to continue in the same Webex
thread. Whether the bot processes follow-up posts still depends on route listen
mode: mention, message, or all.
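The bounded thread-context step can be sketched as follows. How the two caps interact is not spelled out above, so this sketch assumes the root message is always kept first, the most recent replies fill the remaining message slots, and the character cap truncates from the end; the function name is hypothetical.

```python
from typing import List


def bound_thread_context(root: str, replies: List[str],
                         max_messages: int, max_chars: int) -> List[str]:
    """Keep the root plus the most recent replies within message and character caps."""
    if max_messages <= 0 or max_chars <= 0:
        return []
    # Reserve one slot for the root; the rest go to the newest replies.
    recent = replies[-(max_messages - 1):] if max_messages > 1 else []
    bounded: List[str] = []
    used = 0
    for text in [root] + recent:
        remaining = max_chars - used
        if remaining <= 0:
            break
        clipped = text[:remaining]
        bounded.append(clipped)
        used += len(clipped)
    return bounded
```

In this sketch `max_messages` and `max_chars` stand in for WEBEX_THREAD_CONTEXT_MAX_MESSAGES and WEBEX_THREAD_CONTEXT_MAX_CHARS; an empty result corresponds to sending only the current message.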
The Webex Spaces panel includes diagnostics and runtime sync through
/api/admin/webex/* BFF routes. The Web UI backend obtains a
caipe-webex-bot-admin audience token, calls the internal Webex bot admin API,
and the bot verifies that token with Keycloak JWKS. Runtime sync is upsert-only:
it creates or updates configured webex_space_agent_routes rows and corresponding
OpenFGA tuples, but it does not delete UI-managed associations absent from static
config. Diagnostics compares tuple-backed agents with Mongo route metadata and
offers one-click repairs for zero-agent spaces, stale metadata, and listen-mode
mismatches; the zero-agent repair creates a default/selected agent association
with listen: all through the same route API used by manual association saves.
For opt-in onboarding, WEBEX_AUTO_ASSIGN_UNMAPPED_SPACES=true with
WEBEX_DEFAULT_TEAM_SLUG and WEBEX_DEFAULT_AGENT_ID creates an explicit
space-team mapping, route metadata row, and OpenFGA tuple for a previously
unmapped space. The feature is disabled by default, writes MongoDB before
OpenFGA to avoid orphan grants, rolls back on failure, and never overwrites an
existing active space mapping. The onboarding writer
(webex-space-onboarding.ts) also emits the inbound
team:<slug>#member user webex_space:<workspace>--<space> and
team:<slug>#admin manager webex_space:<workspace>--<space> visibility tuples
so the space surfaces in /api/admin/webex/spaces (which filters each row by
can_read). Previously-onboarded spaces are backfilled by the same
messaging_team_visibility_v1 migration that handles Slack channels. Both
surfaces share the helper builders so admin-PUT, onboarding-defaults, and
the backfill emit identical tuple shapes.
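The "MongoDB before OpenFGA, roll back on failure" ordering used by auto-assignment can be sketched with three injected callables. The callables and the boolean return value are hypothetical; the real writer lives in webex-space-onboarding.ts.

```python
from typing import Callable


def onboard_space(write_mongo: Callable[[], None],
                  rollback_mongo: Callable[[], None],
                  write_openfga: Callable[[], None]) -> bool:
    """Write metadata first, then the grant; undo the metadata if the grant fails."""
    try:
        write_mongo()
    except Exception:
        return False  # nothing was written, so no grant can exist
    try:
        write_openfga()
    except Exception:
        rollback_mongo()  # remove the mapping so no half-onboarded space remains
        return False
    return True
```

Writing the OpenFGA tuple last means a grant never exists without its mapping and route metadata, which is the orphan-grant case the ordering is meant to avoid.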
Future PDP consolidation note: OpenFGA should remain the source of truth for all relationship decisions, but the OpenFGA auth bridge should not be treated as the universal application PDP until it exposes a stable, domain-neutral JSON authorization API in addition to its Envoy ext_authz adapter. Until then, keep the bridge focused on network enforcement for AgentGateway/MCP traffic and keep Slack using /api/admin/slack/channels/[workspaceId]/[channelId]/access-check for domain-aware dispatch checks. The later consolidation path is to extract shared OpenFGA decision helpers and audit/result shapes first, then optionally let Slack, Web UI backend routes, and the bridge call a common PDP service rather than duplicating tuple logic.
Legacy Keycloak realm roles may still appear in old local data, but they are not an authorization source. /api/rbac/enforcement-comparison remains available only as an engineer-facing migration aid for comparing stale role-shaped data with ReBAC decisions for a selected subject/action/resource.
Key Environment Variables
| Variable | Purpose | Security note |
|---|---|---|
OPENFGA_RECONCILE_ENABLED | Enables Team Resources → OpenFGA tuple reconciliation in the Web UI backend | Defaults to false so non-RBAC local UI runs do not require OpenFGA; enable only when the OpenFGA profile is healthy. |
OPENFGA_HTTP | Docker-internal OpenFGA HTTP API URL used by the Web UI backend tuple writer and Slack bot route resolver | Keep this on the private service network; do not point browser clients at OpenFGA. |
OPENFGA_STORE_NAME / OPENFGA_STORE_ID | Selects the OpenFGA store for tuple writes | Prefer OPENFGA_STORE_ID in locked-down deployments to avoid discovery ambiguity. |
BOOTSTRAP_ADMIN_EMAILS / RBAC_BOOTSTRAP_ADMIN_EMAILS | Comma-separated initial admin emails consumed by the Web UI BFF bootstrap reconciler; RBAC_BOOTSTRAP_ADMIN_EMAILS overrides the legacy fallback env var when set | Keep the list short. The BFF resolves emails to Keycloak sub values and writes durable OpenFGA tuples; do not hardcode user UUID tuples in Helm values for normal admin bootstrap. |
OPENFGA_SEED_TUPLES | JSON list of exact OpenFGA tuple keys consumed by the OpenFGA init hook after the authorization model is loaded | Chart-generated from openfga.init.seedTuples; reserve for non-user emergency tuples or recovery. Human bootstrap admins should use BOOTSTRAP_ADMIN_EMAILS so Keycloak UUIDs are resolved automatically. |
AGENT_GATEWAY_ADMIN_URL | Optional Web UI backend URL for AgentGateway admin config discovery; defaults to http://agentgateway:15000/config | Keep the AgentGateway admin port on the private service network. The browser calls only the Web UI backend discovery/sync APIs, which require mcp_server:agentgateway#can_discover for discovery and mcp_server:agentgateway#can_manage for sync. |
AGENT_GATEWAY_URL | AgentGateway data-plane base URL used when onboarding discovered MCP targets; defaults to http://agentgateway:4000 and the UI backend appends /mcp when needed | AgentGateway-discovered MCP server records should route through this URL so JWT/authz enforcement remains on the gateway path. The backend target URL from AgentGateway config is stored only as operator metadata. |
AGENTGATEWAY_CONFIG_BRIDGE_POLL_SECONDS | Docker Compose local-dev poll interval for the AgentGateway config bridge that renders standalone MCP routes from MongoDB mcp_servers rows | Local-only control plane helper. It writes only the shared generated AgentGateway config volume; Kubernetes uses native AgentgatewayBackend and HTTPRoute resources instead. |
CAIPE_AGENT_CONTEXT_HMAC_SECRET | Shared secret used by Dynamic Agents and the OpenFGA authz bridge to sign/verify agent_id context for per-agent MCP tool enforcement | Store only in runtime secrets. When unset, AgentGateway still enforces the coarse user mcp_gateway:list gate, but the bridge cannot enforce derived agent:<id> can_call tool:<server>/<tool> decisions. |
CAIPE_CREDENTIALS_ENABLED / CREDENTIAL_STORE_BACKEND | Enables the Connections & Secrets surface and selects the MongoDB envelope credential backend | Defaults disabled. Browsers can create or rotate credential values, but raw retrieval is limited to server-to-server callers. |
CREDENTIAL_KEY_PROVIDER / CREDENTIAL_KMS_CMK_ID / CREDENTIAL_KMS_REGION | Selects the credential data-key wrapper. Local development uses local-cmk; production should use aws-kms with a real CMK. | local-cmk and legacy dev-local fail closed in production. Do not put real CMK secrets in ConfigMaps; production KMS access must come from runtime identity and least-privilege key policy. |
CREDENTIAL_ALLOW_INSECURE_LOCAL_KEY_WRAP | Dev-only escape hatch. When true, lets the local-cmk/dev-local key wrappers run even under NODE_ENV=production so the credential store works on the prod-parity UI image (caipe-ui-prod) for local testing. Defaults false. | Insecure: data keys are wrapped with locally-derived material, not a real KMS/HSM. The wrapper logs a loud SECURITY WARNING on every construction. Must never be true in a real production deployment; use CREDENTIAL_KEY_PROVIDER=aws-kms there instead. |
CREDENTIAL_BOOTSTRAP_OAUTH_CONNECTORS / GITHUB_* / CONFLUENCE_* / WEBEX_* / PAGERDUTY_* / GITLAB_* | Lets the caipe-ui TypeScript startup bootstrap idempotently seed global GitHub, Atlassian/Confluence, Webex, PagerDuty, and GitLab OAuth connector records from environment variables | Docker Compose reads these from .env; Kubernetes must source them through ESO/ExternalSecret. Provider client secrets must never be placed in ConfigMaps or logs and are immediately written through MongoDB envelope encryption. |
CREDENTIAL_SERVICE_AUDIENCE / CREDENTIAL_API_URL | Audience and service URL used by Dynamic Agents and other internal services when retrieving secret refs or exchanging provider connections | Must match the issued service/OBO token audience. Browser-origin, session-only, and wrong-audience retrieval/exchange requests are denied before credential lookup. |
USE_IMPERSONATION_TOKENS | When true, Dynamic Agents resolves MCP credential_sources through the server-to-server credential exchange (per-user OAuth tokens) instead of session cookies | Required for the per-user Jira/PagerDuty/GitHub/GitLab provider-token flows. Leave false to keep only the coarse user-level AgentGateway/OpenFGA gate. |
GITHUB_PERSONAL_ACCESS_TOKEN / GITLAB_PERSONAL_ACCESS_TOKEN (on Dynamic Agents) | Static org-PAT fallback read via MCPCredentialSource.fallback_env when a caller has not connected their personal GitHub/GitLab account | Keeps GitHub/GitLab tools backward compatible for unconnected callers. The PAT now lives only on Dynamic Agents (no longer a gateway backendAuth key); connected users always get their own OAuth token instead. Source from runtime secrets. |
MONGODB_URI / MONGODB_DATABASE | Enables Python OpenFGA audit writers, including Dynamic Agents and openfga-authz-bridge, to persist durable openfga_rebac rows into audit_events | Store MONGODB_URI in runtime secrets for Helm/production; dev compose uses the local MongoDB service. |
SLACK_AGENT_ROUTES_MODE | Slack bot route source: db_prefer (default; prefer OpenFGA-backed UI-managed channel-agent routes, fall back to static config), config, or db_only | db_prefer and db_only require OpenFGA access; MongoDB is used only to enrich tuple-backed routes with listen/priority metadata. Use config only for static-only environments that should ignore UI-managed channel routes. |
SLACK_INTEGRATION_SILENCE_ENV | Initial setup switch that makes the Slack bot ignore inbound payloads before handlers can send user-visible Slack responses | Use only during bootstrap or broken-route setup windows. Admin/runtime diagnostics remain the place to inspect OpenFGA route health while end-user channel noise is suppressed. |
SLACK_WORKSPACE_ALIAS | Canonical Slack workspace namespace used by the Web UI backend, Slack bot, Mongo route/grant rows, and OpenFGA slack_channel:<alias>--<channel_id> subjects | Configure per deployment (for example, CAIPE or Splunk). The Slack bot maps incoming Slack team_id values to this alias before route and ReBAC lookups. |
SLACK_BOT_TOKEN | Web UI backend Slack Web API token used for admin Slack discovery and editor lookups (available-channels, users/lookup, emoji) | Source from Vault/ExternalSecret, normally the same bot token used by slack-bot. Never place the value in ConfigMaps or logs. User lookup needs users:read (and users:read.email for email lookup/profile email matching); emoji suggestions need emoji:read. |
DISCOVERY_CACHE_TTL_MINUTES | Bootstrap default for the in-process cache TTL on /api/admin/slack/available-channels, /api/admin/slack/users/lookup, /api/admin/slack/emoji, and /api/admin/webex/available-spaces; defaults to 60 and is overridden at runtime by platform_config.discovery_cache_ttl_minutes | Admins set the live value via the Discovery cache popover next to the connector discovery button on Admin → Integrations → Slack and Admin → Integrations → Webex (range 0–1440; 0 disables caching). The env var only sets the bootstrap value when no DB override exists. The same popover exposes a per-provider Refresh from Slack/Webex now button that drops the snapshot immediately for ad-hoc bot-membership changes. |
SLACK_AGENT_ROUTES_ENABLED | Legacy rollout alias; when true and SLACK_AGENT_ROUTES_MODE is unset, behaves as SLACK_AGENT_ROUTES_MODE=db_prefer | Prefer SLACK_AGENT_ROUTES_MODE for new deployments so the fallback behavior is explicit. |
SLACK_AGENT_ROUTES_TTL_SECONDS | Slack bot in-process cache TTL for OpenFGA-backed channel agent routes; defaults to 60 | Short TTLs make UI route changes visible faster at the cost of more OpenFGA reads and Mongo metadata joins. |
SLACK_INTEGRATION_DEFAULT_AGENT_ID / SLACK_INTEGRATION_DM_AGENT_ID | Env/YAML fallback for the Slack bot's channel fallback and DM agent. Overridden at runtime by platform_config.default_agent_id (Admin → Settings → Default Agent) | These are now bootstrap fallbacks only; the platform default agent set in the UI takes precedence so the same value governs Web UI and Slack. |
SLACK_INTEGRATION_VICTOROPS_AGENT_ID | Env/YAML fallback for the agent the Slack bot queries for VictorOps on-call lookups; overridden at runtime by platform_config.slack_victorops_escalation_agent_id | Superadmins set the live value in Admin → Integrations → Slack → Advanced. The env var only applies when no DB value is saved. |
SLACK_PLATFORM_SETTINGS_TTL_SECONDS | Slack bot in-process cache TTL for platform_config settings (default + VictorOps agents); defaults to 60 | Short TTLs surface UI setting changes faster at the cost of more Mongo reads. |
CAIPE_PLATFORM_AUDIENCE | Audience requested by Slack/Webex OBO exchanges for bot → CAIPE UI BFF access checks; defaults to caipe-platform | Keep this aligned with the Keycloak client accepted by the Web UI backend. Do not use agentgateway for bot pre-dispatch access checks because the next hop is the BFF. |
WEBEX_THREAD_CONTEXT_ENABLED | Enables Webex bot thread-context fetch before Dynamic Agent dispatch; defaults to true | Reads only messages visible to the bot in the same Webex thread and sends bounded context to the authorized agent under the user's OBO path. Set to false where message-history minimization is required. |
WEBEX_THREAD_CONTEXT_MAX_MESSAGES | Caps prior Webex thread replies fetched with the Webex Messages API; defaults to 10 | Keep this low to limit prompt size and data exposure. |
WEBEX_THREAD_CONTEXT_MAX_CHARS | Caps formatted Webex thread context sent to Dynamic Agents; defaults to 4000 | Prevents unbounded prompt growth and avoids sending entire long conversations to downstream agents. |
TENANT_ID / AUDIT_SUBJECT_SALT | Controls tenant scoping and privacy-preserving subject hashing for Python OpenFGA audit events | Keep the salt stable per environment so subject hashes remain correlatable without storing raw tokens. |
AUTHZ_TRACING_ENABLED | Enables optional Web UI backend OpenFGA/ReBAC OTLP span export | Defaults off in dev compose. Trace spans are observational only; do not put raw tokens, request bodies, or PII in span attributes. |
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | Optional OTLP HTTP endpoint for Web UI backend authz spans | Leave unset unless an external collector is explicitly configured. RBAC Audit uses MongoDB audit_events and does not need a trace backend. |
KEYCLOAK_ADMIN_CLIENT_ID | Confidential Keycloak client used by Web UI backend admin APIs for Keycloak Admin REST calls such as user listing, role assignment, client inspection, and Keycloak RBAC OBO permission repair | Use a service-account client with only the required realm-management roles: user roles (view-users, query-users, manage-users), client roles (query-clients, view-clients, manage-clients), and authorization roles (view-authorization, manage-authorization). Production should not rely on the dev admin-cli password-grant fallback. |
KEYCLOAK_ADMIN_CLIENT_SECRET | Matching client secret for KEYCLOAK_ADMIN_CLIENT_ID | Store in Vault/ExternalSecret/Kubernetes Secret only; never commit the secret value. |
KEYCLOAK_ACCESS_TOKEN_LIFESPAN | Keycloak init/reconcile job override for realm access-token lifetime; chart default is 3600 seconds | Keep access tokens short and rely on refresh tokens for active sessions. |
KEYCLOAK_SSO_SESSION_IDLE_TIMEOUT | Keycloak init/reconcile job override for realm SSO idle timeout; chart default is 28800 seconds | This is the user-facing idle logout window. Increasing it should be a deliberate security decision. |
KEYCLOAK_SSO_SESSION_MAX_LIFESPAN | Keycloak init/reconcile job override for absolute realm SSO max lifespan; chart default is 86400 seconds | Must be longer than the idle timeout if active users should keep refreshing throughout the workday. |
OIDC_ACCEPTED_AUDIENCES | Additional bearer JWT audiences accepted by the Web UI backend | The dev compose stack defaults this to caipe-platform so RBAC persona tokens minted by the Keycloak resource-server client can exercise Web UI backend routes; production deployments should set the narrow audience list they actually issue. |
IDENTITY_SYNC_LOGIN_CLAIMS_ENABLED | Controls best-effort login-time reconciliation from OIDC group claims | Defaults on; set to false to disable. Login remains best-effort and must not depend on directory sync health. |
IDENTITY_SYNC_OIDC_CLAIM_PROVIDER_ID | Provider id used to select mapping rules for claim-derived sync | Defaults to oidc-claims; keep separate from direct Okta providers so provenance stays clear. |
IDENTITY_SYNC_OKTA_ORG_URL / IDENTITY_SYNC_OKTA_API_TOKEN | Server-side Okta Management API connector for full inventory dry-runs | Store the token in runtime secrets only; never expose it to the browser or commit it. |
The deploy/keycloak/init-idp.sh bootstrap keeps the IdP group importer on per-mapper syncMode=FORCE, so the idp_groups attribute is refreshed on login without resetting unrelated user attributes such as Slack links. The same idempotent init job may seed identity-only test personas before e2e runs. The caipe-ui mapper intentionally leaves access.token.claim=false to avoid sending large group arrays through every downstream bearer-token path.
Component 3: Supervisor A2A Server - The Dispatcher
Badge analogy: The dispatcher at the internal mail room. When you drop off a work order, they scan your badge, note your name and clearance on the paperwork, and attach a photocopy of your badge to every sub-order sent to other departments. Downstream departments never need to ask who initiated the original request; it's stapled to everything.
Technically: A Starlette/FastAPI application running the LangGraph multi-agent supervisor. It has a layered middleware stack. The JWT is validated once at the outer layer, then decoded and stored in a per-request contextvar by JwtUserContextMiddleware so all downstream code can read user identity without re-parsing the header.
Middleware Stack (outermost → innermost)
CORSMiddleware
↓
PrometheusMetricsMiddleware (metrics, skips /health)
↓
OAuth2Middleware / SharedKeyMiddleware (validates JWT signature + expiry)
↓
JwtUserContextMiddleware (decodes claims → stores in contextvar)
↓
A2A request handler + LangGraph agent
JwtUserContextMiddleware is intentionally read-only. It does not re-validate the token, because the auth middleware above it has already done so. It decodes the JWT payload without verification, fetches the OIDC userinfo endpoint (cached 10 min) for authoritative email/name/groups, and stores the result in a ContextVar:
# Set once per request by JwtUserContextMiddleware
_jwt_user_context_var: ContextVar[JwtUserContext | None]
# Read anywhere in the same request (agent executor, tools, sub-calls)
ctx = get_jwt_user_context()
# ctx.email, ctx.name, ctx.groups, ctx.token
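The set-once, read-anywhere pattern above can be sketched as follows. This is a minimal illustration, not the project's code: the JwtUserContext field list and the set_jwt_user_context helper are assumptions made for the example.

```python
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class JwtUserContext:
    # Hypothetical shape; the real class may carry more fields.
    email: str
    name: str
    groups: Tuple[str, ...] = ()
    token: str = ""


_jwt_user_context_var: ContextVar[Optional[JwtUserContext]] = ContextVar(
    "jwt_user_context", default=None
)


def set_jwt_user_context(ctx: JwtUserContext):
    """Store the context for the current request; returns a reset token."""
    return _jwt_user_context_var.set(ctx)


def get_jwt_user_context() -> Optional[JwtUserContext]:
    """Read the context anywhere in the same request; None when unset."""
    return _jwt_user_context_var.get()


# Simulate one request: set, read, then reset so nothing leaks to the next one.
reset_token = set_jwt_user_context(
    JwtUserContext(email="alice@example.com", name="Alice", groups=("platform",))
)
ctx = get_jwt_user_context()
_jwt_user_context_var.reset(reset_token)
```

Resetting with the token returned by set() is what keeps one request's identity from bleeding into another when the variable is reused on the same task.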
JWT Forwarding to MCP Tools
When FORWARD_JWT_TO_MCP=true, the supervisor forwards the original, unmodified bearer token from the incoming request to AgentGateway. This means:
- The token that reaches AgentGateway has sub = the real user (or an OBO token with act.sub = bot)
- AgentGateway can evaluate the user's actual roles, not the supervisor's service account
- MCP servers that do their own JWT validation (e.g. RAG) see the real user identity
User JWT → Supervisor → (same JWT) → AgentGateway → MCP Server
Security implication: The supervisor must not modify or strip the bearer token before forwarding. If it substituted its own service account token, the entire per-user authorization chain would collapse.
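A minimal sketch of that passthrough rule, assuming a hypothetical build_mcp_headers helper (the supervisor's real forwarding code is not shown in this document). The point is that the Authorization value is copied as received and never rebuilt or replaced.

```python
from typing import Dict


def build_mcp_headers(incoming: Dict[str, str], forward_jwt: bool) -> Dict[str, str]:
    """Build outbound MCP headers, copying the caller's bearer unchanged."""
    outbound = {"Content-Type": "application/json"}
    if not forward_jwt:
        return outbound
    # HTTP header names are case-insensitive, so look the header up that way.
    for name, value in incoming.items():
        if name.lower() == "authorization" and value.lower().startswith("bearer "):
            outbound["Authorization"] = value  # same string, no re-minting
            break
    return outbound


user_headers = {"authorization": "Bearer user.jwt.value"}
forwarded = build_mcp_headers(user_headers, forward_jwt=True)
not_forwarded = build_mcp_headers(user_headers, forward_jwt=False)
```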
Key Environment Variables
| Variable | Purpose | Security note |
|---|---|---|
| A2A_AUTH_OAUTH2=true | Enable JWT signature validation | Off in dev; mandatory in prod |
| A2A_AUTH_SHARED_KEY | Shared-key auth alternative | Use only for service-to-service; not for user-facing flows |
| ENABLE_USER_INFO_TOOL=true | Extract identity from JWT (vs. "by user: email" prefix) | The JWT is the authoritative source; prefer this over message prefix |
| FORWARD_JWT_TO_MCP=true | Forward incoming JWT to MCP tools | Required for per-user enforcement at AgentGateway |
| ISSUER / OIDC_ISSUER | OIDC issuer for userinfo endpoint discovery | Must match iss claim in tokens |
Component 4: AgentGateway - The Security Checkpoint
Badge analogy: The armed security checkpoint at the entrance to the server room. Everyone must badge in: no exceptions, no tailgating. The checkpoint verifies the badge locally, then calls the central relationship desk (OpenFGA) to ask whether this person is allowed through.
Technically: AgentGateway is the single Policy Enforcement Point (PEP) for all MCP tool calls. It proxies HTTP/SSE requests to registered MCP backend servers, validates the Keycloak JWT, and calls OpenFGA through extAuthz for the PDP decision before allowing each request through. MCP servers still mount a shared custom middleware package for authentication defense-in-depth (JWT/shared-key validation, token passthrough context, and an optional local-dev localhost bypass). For embedded/local MCP servers that do not sit behind AgentGateway, the same package can also perform an optional Keycloak PDP scope check (for example mcp_jira#invoke) so they still have a real authz gate.
Request Flow
Supervisor POST /rag/v1/query
Authorization: Bearer <JWT>
        │
        ▼
AgentGateway
┌────────────────────────────────────────────┐
│ 1. Extract JWT from Authorization header   │
│ 2. Validate signature against JWKS         │
│ 3. ext_authz → OpenFGA Check               │
│ 4a. OpenFGA DENY → 403 Forbidden           │
│ 4b. OpenFGA ALLOW → proxy to MCP server    │
└────────────────────────────────────────────┘
        │ ALLOW
        ▼
RAG MCP Server
(receives same JWT for its own validation)
Authorization Model
AgentGateway uses jwtAuth for authentication and extAuthz for authorization.
The openfga-authz-bridge adapts Envoy's gRPC authorization check into an
OpenFGA Check, so gateway authorization is maintained through ReBAC tuples
rather than CEL policy authoring.
For observability and compliance, the bridge also writes a best-effort
openfga_rebac document to MongoDB audit_events for every terminal
authorization result: missing subject, OpenFGA allow, OpenFGA deny, and
OpenFGA unavailable. These writes never affect the allow/deny response returned
to AgentGateway.
ext_authz Timeout
The extAuthz policy fails closed (denyWithStatus: 403), so any error or
timeout reaching the openfga-authz-bridge denies the request; it never
fails open. The proxy's built-in ext_authz timeout is 200ms, which is too
tight in practice: enumerating tools fires one ext_authz Check per MCP route
concurrently, and against a cold or loaded OpenFGA those checks serialize
and individually exceed 200ms, returning fail-closed 403s that surface in the UI
as "MCP server unavailable" even for healthy, authorized servers.
The shipped default raises this to 10s (generous headroom; still bounds a stuck call) on the default static routing path:
| Knob | Default | Where |
|---|---|---|
| global.agentgateway.extAuth.timeout | 10s | Helm static routing: rendered into the extAuthz.timeout field of the AgentGateway static ConfigMap (agentgateway-static-config.yaml). Operator-tunable. |
| extAuthz.timeout (bootstrap) + DEFAULT_MCP_ROUTE_POLICIES (config-bridge) | 10s | Local Docker Compose dev path (deploy/agentgateway/config.yaml, deploy/agentgateway/config_bridge.py), kept in parity with the chart. |
Raising the timeout does not change the fail-closed posture: a genuine
OpenFGA DENY (or an unreachable bridge after the timeout) still returns 403.
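The fail-closed posture can be illustrated with a short Python sketch. It is not how AgentGateway implements ext_authz; the authorize function and its check callable are hypothetical stand-ins for the bridge call. The only point is that allow is the single path to success, while deny, error, and timeout all collapse to 403.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def authorize(check: Callable[[], bool], timeout_seconds: float = 10.0) -> int:
    """Return 200 only on an explicit allow; everything else is 403."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        allowed = future.result(timeout=timeout_seconds)
    except Exception:
        # Covers the timeout as well as any error raised by the check itself.
        return 403
    finally:
        # Do not block the caller on a stuck check.
        pool.shutdown(wait=False)
    return 200 if allowed is True else 403


def unreachable_bridge() -> bool:
    raise ConnectionError("bridge unreachable")


def slow_allow() -> bool:
    time.sleep(0.3)  # simulates a cold or loaded PDP
    return True


status_allow = authorize(lambda: True)
status_deny = authorize(lambda: False)
status_error = authorize(unreachable_bridge)
status_timeout = authorize(slow_allow, timeout_seconds=0.05)
```

A larger timeout only widens the window in which a slow allow can still succeed; it does not add any path by which a failed check returns 200.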
Gateway-API / CRD routing (opt-in): the AgentgatewayPolicy.traffic.extAuth resource has no timeout field, so this knob does not apply when routingMode: gateway-api. Tune the budget there via the ext_authz backend's requestTimeout (or a route-level request timeout) instead.
Data-Plane Ingress
The Helm chart can expose AgentGateway's MCP data path with
agentgateway.ingress.enabled=true. That ingress always routes to the service
HTTP port (service.port, default 4000). The admin listener
(service.adminPort, default 15000) is not exposed by the ingress and should
remain reachable only from inside the cluster.
Admin UI MCP Discovery and Migration
The Web UI backend owns AgentGateway MCP discovery and sync through
/api/mcp-servers/agentgateway/discover and /api/mcp-servers/agentgateway/sync.
Both routes check the singleton mcp_server:agentgateway resource directly:
can_discover for discovery and can_manage for sync/onboarding. Bootstrap
environments should seed that singleton grant explicitly instead of relying on a
session-role bypass.
Sync is intentionally one-click for migrations: the backend reads the private
AgentGateway admin config, imports every discovered target with status new as a
config-driven source: "agentgateway" MCP server, leaves already-managed targets
unchanged, and never overwrites conflicting legacy MCP servers. Conflicts are
returned as migration warnings with the legacy endpoint and AgentGateway target
endpoint so operators can remove or rename the old row manually.
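The sync rules above can be sketched as a pure planning function. This is an assumption-laden illustration: plan_sync, SyncPlan, and the dictionary shapes are hypothetical, and treating "same id, different source" as the conflict test is a simplification of whatever matching the backend actually performs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SyncPlan:
    imported: List[dict] = field(default_factory=list)
    unchanged: List[str] = field(default_factory=list)
    warnings: List[dict] = field(default_factory=list)


def plan_sync(discovered: Dict[str, str], existing: Dict[str, dict]) -> SyncPlan:
    """discovered maps target id to endpoint; existing maps server id to its stored row."""
    plan = SyncPlan()
    for target_id, endpoint in sorted(discovered.items()):
        row = existing.get(target_id)
        if row is None:
            # New target: import as a config-driven AgentGateway server.
            plan.imported.append(
                {"id": target_id, "endpoint": endpoint, "source": "agentgateway"}
            )
        elif row.get("source") == "agentgateway":
            # Already managed: leave untouched.
            plan.unchanged.append(target_id)
        else:
            # Legacy row with the same id: never overwrite, only warn.
            plan.warnings.append(
                {
                    "id": target_id,
                    "legacy_endpoint": row.get("endpoint"),
                    "agentgateway_endpoint": endpoint,
                }
            )
    return plan


plan = plan_sync(
    discovered={
        "rag": "http://agentgateway:4000/mcp/rag",
        "jira": "http://agentgateway:4000/mcp/jira",
        "github": "http://agentgateway:4000/mcp/github",
    },
    existing={
        "jira": {"source": "agentgateway", "endpoint": "http://agentgateway:4000/mcp/jira"},
        "github": {"source": "manual", "endpoint": "http://legacy-github:8000/mcp"},
    },
)
```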
Why This Is the Right Architecture for a PEP
- Decoupled policy from business logic: MCP servers implement domain logic, not authz. Changing a policy means editing config.yaml, not redeploying an MCP server.
- Consistent enforcement: Every tool (RAG, GitHub, ArgoCD, Slack) goes through the same gateway with the same JWT. No tool can be accidentally left unenforced.
- Externalized relationship decisions: OpenFGA gives us a remote PDP for relationship checks without putting that logic inside each MCP server.
- Token passthrough: AgentGateway forwards the JWT to the MCP backend unchanged. The backend can do its own secondary validation (e.g. tenant isolation).
Local / Embedded MCP Exception Path
Most production MCP traffic should still go through AgentGateway. The repository also ships a shared custom MCP middleware for the exception cases:
- Local dev: when an engineer runs a FastMCP server directly on localhost for mcp dev, MCP_TRUSTED_LOCALHOST=true can bypass auth for the real loopback peer only.
- Embedded MCPs: when an MCP lives inside another Python service and therefore cannot be registered as a standalone AgentGateway backend, the same package validates the bearer token locally and can optionally call Keycloak's PDP for a per-MCP scope decision.
That package lives under ai_platform_engineering/agents/common/mcp-auth/ and is intentionally authn-focused by default. In the normal standalone path, AgentGateway remains the source of truth for RBAC.
AgentGateway + OIDC + Keycloak: The Integrated Picture
Badge analogy: Duo SSO is the national ID office; it issues the underlying identity. Keycloak is HR; it takes that national ID, prints a CAIPE-branded employee badge with your roles stamped on it, and publishes a public fingerprint scanner (JWKS) in the lobby so anyone can verify a badge is really HR-issued. AgentGateway is the armed checkpoint at the server room door. The checkpoint verifies the badge locally, then calls the OpenFGA authorization desk through ext_authz before opening the door.
Technically: Keycloak, OpenFGA, and AgentGateway cooperate to put a verified, relationship-checked, role-carrying JWT in front of every MCP request. AG itself is the Policy Enforcement Point (PEP): it doesn't authenticate users, it doesn't store roles, and it never talks to Duo. It verifies that the JWT in the request was signed by Keycloak (using a cached copy of Keycloak's JWKS), then calls OpenFGA through extAuthz for the authorization decision.
| Layer | Role | What it owns | What it does NOT own |
|---|---|---|---|
| Upstream IdP (e.g. Duo SSO, Okta, Azure AD) | Identity provider | User authentication (password, MFA, device trust), email ownership | Application roles, per-tool access rules |
| Keycloak | OIDC AS + IdP broker | Realm roles (chat_user, admin), JWT issuance, JWKS publication, OBO token exchange (RFC 8693) | Tool-level decisions, user password (delegated to Duo) |
| OpenFGA | Remote PDP | Relationship decisions such as user:<sub> can_call mcp_gateway:list and team resource tuples (team:<slug>#member can_use agent:<id>) | JWT validation, token minting, proxying traffic |
| AgentGateway (PEP) | Policy Enforcement Point | jwtAuth, extAuthz, local JWT verification against cached JWKS | Identity store, role store, token minting, CEL policy storage |
Keycloak brokers the upstream IdP: Duo SSO doesn't issue the JWT that AG sees. Duo authenticates the user, returns an OIDC authorization code to Keycloak, and Keycloak then mints the CAIPE JWT whose identity claims feed the OpenFGA extAuthz check. From AG's perspective, Keycloak is the only issuer it trusts (iss = http://localhost:7080/realms/caipe); the existence of Duo is invisible to AG. This is the standard OIDC/OAuth 2.0 resource-server pattern applied to an MCP-aware proxy.
Identity Provenance: Duo SSO → Keycloak → JWT → AG → MCP
Read this as the badge's lifecycle:
- Duo SSO authenticates the human. It doesn't know about CAIPE roles. It only proves "this really is alice@example.com with working MFA" and hands an OIDC authorization code to Keycloak. Duo's issuer (IDP_ISSUER) is configured in Keycloak as IDP_ALIAS=duo-sso; this is the only direct contact between CAIPE and Duo.
- Keycloak brokers and rebrands the identity. It validates the Duo code, runs its IdP mappers (e.g. firstname → given_name to handle Duo's non-standard claim), and signs a fresh JWT with its own RS256 key. Product authorization is evaluated later through OpenFGA organization, team, and resource relationships. This is the only token CAIPE services ever see. Duo's identity token is discarded at the Keycloak boundary.
- Every CAIPE caller holds the same JWT. The Slack Bot additionally does an RFC 8693 token-exchange to produce an OBO (On-Behalf-Of) JWT that pins sub=alice and act.sub=caipe-slack-bot, but it's still a Keycloak-signed JWT with iss = http://localhost:7080/realms/caipe. From AG's perspective there's no difference between a UI JWT and an OBO JWT; both pass jwtAuth as long as they're signed by a key in AG's JWKS cache.
- AG verifies locally, calls OpenFGA, forwards unchanged. The JWT reaches the MCP server with Alice's identity intact, so MCP-level defense-in-depth checks (e.g. the RAG server's per-tenant document ACLs) see the real user, not the supervisor's service account and not the Slack bot.
The practical consequence: to switch CAIPE from Duo SSO to Okta or Azure AD you don't touch AgentGateway at all. You change IDP_ISSUER, IDP_CLIENT_ID, IDP_CLIENT_SECRET, IDP_ALIAS, and maybe a mapper in Keycloak, and every component downstream continues to trust Keycloak-issued JWTs. This is the whole point of making Keycloak the IdP broker instead of having each service integrate directly with the upstream IdP.
How AG Is Wired to Keycloak and OpenFGA (at boot and at steady state)
Three independent channels feed the AG decision:
| # | Channel | Direction | Purpose | Cadence |
|---|---|---|---|---|
| 1 | JWKS | AG → Keycloak | Fetch public keys to verify JWT signatures | On startup; on unknown kid; on Cache-Control TTL expiry |
| 2 | Token issuance | Client → Keycloak → Client | Users/bots obtain JWTs to present to AG; AG never mints tokens | On login / OBO exchange |
| 3 | Relationship decision | AG → openfga-authz-bridge → OpenFGA | Remote PDP check before MCP proxying | Every MCP request |
There is no direct API call from AG to Keycloak per request. JWKS fetching is a pure cache-refresh operation, not a live auth check.
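The JWKS cadence in channel 1 (startup, TTL expiry, unknown kid) can be sketched as a small cache. JwksCache, the injected fetch callable, and the fixed ttl_seconds are all assumptions for illustration; a real client would derive the TTL from the Cache-Control response header and hold parsed public keys rather than strings.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches kid -> key and refreshes only on expiry or an unknown kid."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._expires_at = float("-inf")  # forces the startup fetch

    def _refresh(self) -> None:
        self._keys = dict(self._fetch())
        self._expires_at = self._clock() + self._ttl

    def get_key(self, kid: str) -> Optional[str]:
        refreshed = False
        if self._clock() >= self._expires_at:
            self._refresh()
            refreshed = True
        if kid not in self._keys and not refreshed:
            # Unknown kid: exactly one forced refresh, then give up.
            self._refresh()
        return self._keys.get(kid)


fetch_calls = []
published = {"kid-1": "pubkey-1"}


def fake_fetch() -> Dict[str, str]:
    fetch_calls.append(1)
    return published


cache = JwksCache(fake_fetch)
first = cache.get_key("kid-1")    # startup fetch
again = cache.get_key("kid-1")    # served from cache
published["kid-2"] = "pubkey-2"   # simulated key rotation at the issuer
rotated = cache.get_key("kid-2")  # forced refresh picks up the new key
missing = cache.get_key("kid-9")  # forced refresh, still unknown
```

No call in this sketch happens per request unless the kid is unknown, which matches the statement that JWKS fetching is a cache-refresh operation rather than a live auth check.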
The Exact jwtAuth Contract (from config.yaml)
binds:
- port: 4000
  listeners:
  - protocol: HTTP
    policies:
      jwtAuth:
        mode: strict # reject request if no valid JWT present
        issuer: https://caipe.example.com/realms/caipe
        audiences: [caipe-platform, agentgateway]
        jwks:
          url: http://keycloak:7080/realms/caipe/protocol/openid-connect/certs
    routes:
    - policies:
        extAuthz:
          host: openfga-authz-bridge:9100
          failureMode:
            denyWithStatus: 403
          protocol:
            grpc:
              metadata:
                caipe.auth: '{"sub": jwt.sub}'
What mode: strict means in practice:
- **iss must equal issuer**: tokens from any other realm or IdP are rejected with 401.
- **aud must contain at least one of audiences**: protects against token substitution where a token was issued to a different service client.
- **exp, nbf, iat enforced**: expired or not-yet-valid tokens are rejected.
- **Signature verified against JWKS**: kid in the JWT header must match a cached key.
- **Unknown kid triggers one forced JWKS refresh**: handles Keycloak key rotation without manual intervention.
Only after jwtAuth passes does AG call extAuthz. AG sends an Envoy CheckRequest over gRPC with caipe.auth.sub metadata derived from jwt.sub; the OpenFGA bridge maps that subject to user:<sub> and calls OpenFGA Check. The route-level bridge checks the coarse mcp_gateway:list object for MCP browse/list/init traffic, while signed Dynamic Agent tools/call requests additionally check the agent/tool relationships. If jwtAuth fails, the request never reaches policy evaluation; if OpenFGA/bridge is unavailable or denies, AG returns 403 because failureMode.denyWithStatus=403.
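The claim checks that mode: strict implies can be sketched over an already-decoded payload. This is a hedged illustration only: check_claims is a hypothetical helper, signature verification against the JWKS is deliberately omitted (it must happen before any of these checks in a real verifier), and rejecting a future iat is one assumed reading of "iat enforced".

```python
import time
from typing import Iterable, Optional, Tuple


def check_claims(
    claims: dict,
    issuer: str,
    audiences: Iterable[str],
    now: Optional[float] = None,
) -> Tuple[bool, str]:
    """Validate iss/aud/exp/nbf/iat on a decoded payload. No signature check."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        return False, "issuer mismatch"
    aud = claims.get("aud", [])
    token_audiences = {aud} if isinstance(aud, str) else set(aud)
    if token_audiences.isdisjoint(audiences):
        return False, "audience mismatch"
    exp = claims.get("exp")
    if exp is None or now >= exp:
        return False, "expired"
    nbf = claims.get("nbf")
    if nbf is not None and now < nbf:
        return False, "not yet valid"
    iat = claims.get("iat")
    if iat is not None and iat > now:  # assumption: a future iat is rejected
        return False, "issued in the future"
    return True, "ok"


ISSUER = "https://caipe.example.com/realms/caipe"
ACCEPTED = ["caipe-platform", "agentgateway"]
good = {"iss": ISSUER, "aud": ["agentgateway"], "exp": 2000, "nbf": 1000, "iat": 1000}

result_ok = check_claims(good, ISSUER, ACCEPTED, now=1500)
result_iss = check_claims({**good, "iss": "https://other.example.com/realms/x"}, ISSUER, ACCEPTED, now=1500)
result_aud = check_claims({**good, "aud": "some-other-client"}, ISSUER, ACCEPTED, now=1500)
result_exp = check_claims(good, ISSUER, ACCEPTED, now=2500)
result_nbf = check_claims(good, ISSUER, ACCEPTED, now=500)
```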
OpenFGA ReBAC Model
The dev PDP model keeps the coarse AgentGateway gate and adds admin-configured team relationships:
| Type | Relation | Tuple written by |
|---|---|---|
mcp_gateway:list | can_call: [user] | openfga-init seed / manual bootstrap for the current AGW coarse browse/list gate |
team:<slug> | member: [user] | Team Resources save, using Keycloak sub values resolved from team member emails |
agent:<agent_id> | base user, manager; derived can_use, can_manage | Team Resources agent Use / Manage checkboxes write base relations |
tool:<server>_* and tool:* | base caller; derived can_call | Team Resources MCP-server prefix checkboxes and the All Tools wildcard write base relations |
knowledge_base:<id> | base reader, ingestor, manager; derived can_read, can_ingest, can_admin | Team Knowledge Base assignments and Settings → Knowledge Bases write team:<slug>#member reader/ingestor for read and ingest, and team:<slug>#admin manager for admin, before persisting Mongo assignment metadata. KB pages, sharing, and KB-scoped routes check these relationships. |
data_source:<id> | base reader (incl. user:* wildcard), ingestor, manager; derived can_read, can_ingest, can_manage | Datasource component grants are reconciled alongside Knowledge Base grants when a KB-backed datasource is created, shared, or assigned to a team (every knowledge_base:<id> grant is mirrored onto the matching data_source:<id> so the team can actually search, not just discover). Datasource lists, search filters, and ingest/reload operations check these relationships so read and write can differ per datasource. A user:* reader data_source:<id> tuple (written by POST /api/admin/rag/public-datasources) makes a datasource readable by every authenticated user. |
skill:<id> | base reader, user, writer, manager; derived can_read, can_use, can_write, can_manage | Team Resources skill selection writes user relationships for local and Skill Hub catalog ids; /api/skills filters by can_read/can_use. |
conversation:<id> | base owner, reader, writer, sharer, manager; derived can_read, can_write, can_share, can_delete | Chat list/read/write/share and Dynamic Agent stream/invoke/resume/cancel paths check implicit Mongo ownership first, then explicit OpenFGA conversation access. |
mcp_server:agentgateway | base reader, writer, manager; derived can_discover, can_read, can_manage | AgentGateway discovery uses can_discover; selected-server sync/onboarding uses can_manage. |
system_config:platform_settings | base reader, manager; derived can_read, can_manage | Platform config GET/PATCH checks the concrete system config object in addition to admin session gates. |
organization:<org_key> | base member, admin, auditor, ingestor, searcher; derived can_ingest (ingestor or admin), can_search (searcher or admin) | ingestor is the explicit "data source author" capability written/deleted per team by PUT/DELETE /api/admin/teams/[id]/ingest-capability (team:<slug>#member ingestor organization:<key>). searcher is the explicit "search" capability written/deleted per team by PUT/DELETE /api/admin/teams/[id]/search-capability (team:<slug>#member searcher organization:<key>). kb-tab-gates and the RAG server check can_ingest (authorize_datasource_create) to gate creating new data sources, and can_search (authorize_search, plus the BFF requireSearchCapability) to gate using search (/v1/query, /v1/mcp/invoke) for built-in and custom tools. |
The Web UI backend tuple writer is idempotent: it checks tuples before writes/deletes to avoid duplicate-write failures and to tolerate missing tuples during removals. It intentionally rejects writable can_* tuples; callers must write base relationships and let OpenFGA derive the can_* permissions.
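A minimal sketch of that writer contract, using an in-memory set in place of an OpenFGA store. TupleWriter and its method names are hypothetical; the sketch only shows the three behaviors described: duplicate writes are no-ops, deleting a missing tuple is tolerated, and derived can_* relations are refused.

```python
from typing import Set, Tuple

RelationTuple = Tuple[str, str, str]  # (user, relation, object)


class TupleWriter:
    def __init__(self) -> None:
        self._tuples: Set[RelationTuple] = set()  # stand-in for the OpenFGA store

    @staticmethod
    def _reject_derived(relation: str) -> None:
        if relation.startswith("can_"):
            raise ValueError(f"{relation} is derived; write a base relation instead")

    def write(self, user: str, relation: str, obj: str) -> bool:
        """Return True if a tuple was written, False if it already existed."""
        self._reject_derived(relation)
        key = (user, relation, obj)
        if key in self._tuples:
            return False  # check-before-write avoids a duplicate-write failure
        self._tuples.add(key)
        return True

    def delete(self, user: str, relation: str, obj: str) -> bool:
        """Return True if a tuple was removed, False if it was already absent."""
        self._reject_derived(relation)
        key = (user, relation, obj)
        if key not in self._tuples:
            return False  # tolerate missing tuples during removals
        self._tuples.remove(key)
        return True


writer = TupleWriter()
first_write = writer.write("team:platform#member", "user", "agent:rag-helper")
second_write = writer.write("team:platform#member", "user", "agent:rag-helper")
first_delete = writer.delete("team:platform#member", "user", "agent:rag-helper")
second_delete = writer.delete("team:platform#member", "user", "agent:rag-helper")
try:
    writer.write("user:alice", "can_use", "agent:rag-helper")
    derived_rejected = False
except ValueError:
    derived_rejected = True
```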
Team membership semantic: On the team type, member is now defined as [user, external_group#member] or admin, i.e. anyone with the admin relation on a team automatically satisfies team#member checks (and, by extension, team#member userset references such as the team:<slug>#member can_use agent:<id> Slack/Webex resource paths). This means an admin no longer needs a separate member tuple to use the team's agents, and bots can ask check(user, "member", team:<slug>) as a single question. admin continues to be a directly-written relation; only member gains the derived branch. Callers that legacy-listed both team#member and team#admin as subject sets still work but are now redundant.
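The member definition can be read as a three-branch check. The sketch below evaluates it over plain Python sets rather than calling OpenFGA; is_team_member, the tuple layout, and the group_members lookup are assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

RelationTuple = Tuple[str, str, str]  # (subject, relation, object)


def is_team_member(
    user: str,
    team: str,
    tuples: Set[RelationTuple],
    group_members: Optional[Dict[str, Set[str]]] = None,
) -> bool:
    """member = direct user, or external_group#member, or admin."""
    group_members = group_members or {}
    if (user, "member", team) in tuples:
        return True  # direct member tuple
    if (user, "admin", team) in tuples:
        return True  # derived branch: admins satisfy member checks
    for subject, relation, obj in tuples:
        if relation == "member" and obj == team and subject.endswith("#member"):
            group = subject[: -len("#member")]
            if user in group_members.get(group, set()):
                return True  # member through an external group userset
    return False


tuples = {
    ("user:alice", "admin", "team:platform"),
    ("external_group:eng#member", "member", "team:platform"),
}
groups = {"external_group:eng": {"user:bob"}}

alice_is_member = is_team_member("user:alice", "team:platform", tuples, groups)
bob_is_member = is_team_member("user:bob", "team:platform", tuples, groups)
carol_is_member = is_team_member("user:carol", "team:platform", tuples, groups)
```

Alice has no member tuple at all, yet the single member question still resolves to true through the admin branch.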
AgentGateway Policy Model
AgentGateway no longer maintains a Mongo-backed CEL policy surface for MCP
authorization. The checked-in deploy/agentgateway/config.yaml is intentionally
static: it authenticates with jwtAuth, delegates authorization to the OpenFGA
bridge through extAuthz, and then proxies to the configured MCP targets.
The Admin UI's former "AG MCP Policies" tab, /api/rbac/ag-policies,
/api/rbac/ag-sync-status, ag_mcp_policies, ag_mcp_backends, and
ag_sync_state are retired. Relationship changes should be modeled as OpenFGA
tuples through the ReBAC admin surfaces instead of editing AgentGateway CEL.
The Web UI backend's former CEL overlay is also retired: CEL_RBAC_EXPRESSIONS,
/api/rbac/admin-tab-policies, editable admin_tab_policies, and the browser CEL
editor are no longer part of the UI authorization path. Keep custom authorization
logic in OpenFGA tuples and audited ReBAC change sets.
Operational Guarantees
| Guarantee | Mechanism |
|---|---|
| AG restart does not invalidate user sessions | User JWTs are self-contained; AG just re-fetches JWKS on startup |
| Keycloak key rotation is zero-downtime | Unknown kid triggers one forced JWKS refresh; cached keys remain valid until exp |
| Policy update is zero-downtime | OpenFGA tuple writes are independent of AG process restarts; AG keeps using extAuthz |
| Admin UI edit audit trail | ReBAC relationship/policy surfaces write openfga_rebac audit events through the Web UI backend |
| MongoDB outage doesn't take AG down | AG uses static config plus OpenFGA; it does not depend on Mongo-rendered CEL rules |
| Keycloak outage doesn't take AG down for already-issued tokens | JWKS is cached; new logins fail at Keycloak, not at AG |
The end-to-end per-request sequence diagram (and the demo walkthrough that proves all three outcomes: 200, 403, 401) lives in Workflows › Per-request authorization. Use that to demo the system live.
Component 5: Dynamic Agents - The Workshop Floor
Badge analogy: A workshop where employees build and operate their own machines. The workshop checks your badge at the door (JWT validation on every request). Once inside, each machine has its own access tag: some are personal (Private), some are shared with your team (Team), some anyone can use (Global). Your badge level determines which machines you can touch. When a machine makes a tool call, it presents your badge, not its own, so the security checkpoint still sees you, not the machine.
Technically: A FastAPI service where every route handler uses get_current_user() as a FastAPI Depends(). Unlike the supervisor (which uses a middleware contextvar), the dynamic agents service validates the JWT on every request at the route level, giving precise control per endpoint.
JWT Validation Chain
# FastAPI dependency injection β runs before every protected handler
user: UserContext = Depends(get_current_user)
Inside get_current_user():
1. Extract Bearer token from Authorization header
2. Fetch JWKS from Keycloak (cached in-process)
3. Validate:
- Signature (RS256 against JWKS public key)
- expiry (exp)
- issuer (iss == OIDC_ISSUER)
- audience (aud == OIDC_CLIENT_ID, if set)
4. Call OIDC userinfo endpoint (cached 10 min by token hash)
   → authoritative email, name, groups (OIDC tokens often omit these)
5. Extract realm_access.roles from JWT claims
   (Keycloak puts roles here; also checked in userinfo)
6. Evaluate the configured required-access group (if set) → 403 if missing
7. Preserve group claims as identity context only; product admin is decided by OpenFGA organization relationships
8. Return UserContext { email, name, groups, access_token, obo_jwt }
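Step 4's "cached 10 min by token hash" can be sketched as a small TTL cache. UserinfoCache and the injected fetch callable are hypothetical; SHA-256 is assumed as the hash, chosen so the raw bearer token is never used as a dictionary key.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class UserinfoCache:
    """TTL cache for userinfo responses, keyed by a hash of the token."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, token: str) -> dict:
        key = hashlib.sha256(token.encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        info = self._fetch(token)
        self._entries[key] = (now + self._ttl, info)
        return info


class FakeClock:
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


userinfo_calls = []


def fake_userinfo(token: str) -> dict:
    userinfo_calls.append(token)
    return {"email": "alice@example.com", "groups": ["platform"]}


clock = FakeClock()
userinfo = UserinfoCache(fake_userinfo, ttl_seconds=600.0, clock=clock)
userinfo.get("token-a")
userinfo.get("token-a")   # within 10 minutes: cache hit
clock.now = 601.0
userinfo.get("token-a")   # expired: fetched again
calls_made = len(userinfo_calls)
```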
Agent-Level Authorization (OpenFGA Execution Gate)
After the bearer token is validated by JwtAuthMiddleware, Dynamic Agents
decodes the already-validated JWT payload only to extract sub and repeats the
same OpenFGA check used by the Web UI backend:
user:<sub> can_use agent:<agent_id>
The runtime check runs before agent lookup, MCP server lookup, runtime cache
creation, non-streaming invocation, or stream resume work. This second layer is
required because the runtime must not trust the Web UI backend as the only enforcement
point. Denials return 403 / pdp_denied; OpenFGA outages return
503 / pdp_unavailable; missing or malformed bearer context returns a
structured 401.
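The three outcomes above can be sketched as one gate function. execution_gate and the check callable are hypothetical, and the "missing_bearer" code is an assumed placeholder, since the document only says the 401 is structured; pdp_denied and pdp_unavailable are taken from the text.

```python
from typing import Callable, Optional, Tuple

Check = Callable[[str, str, str], bool]  # (user, relation, object) -> allowed


def execution_gate(sub: Optional[str], agent_id: str, check: Check) -> Tuple[int, str]:
    """Run before agent lookup, MCP lookup, or any runtime work."""
    if not sub:
        return 401, "missing_bearer"  # hypothetical error code
    try:
        allowed = check(f"user:{sub}", "can_use", f"agent:{agent_id}")
    except Exception:
        # The PDP could not answer: report an outage, not a denial.
        return 503, "pdp_unavailable"
    if not allowed:
        return 403, "pdp_denied"
    return 200, "ok"


def pdp_down(user: str, relation: str, obj: str) -> bool:
    raise ConnectionError("OpenFGA unreachable")


gate_allow = execution_gate("alice-uuid", "rag-helper", lambda u, r, o: True)
gate_deny = execution_gate("alice-uuid", "rag-helper", lambda u, r, o: False)
gate_outage = execution_gate("alice-uuid", "rag-helper", pdp_down)
gate_no_sub = execution_gate(None, "rag-helper", lambda u, r, o: True)
```

Distinguishing 503 from 403 lets a client retry an outage without treating it as a permission problem, while the request is still refused in both cases.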
The older visibility-rule and CEL authorization paths are no longer the authoritative execution gate for start, invoke, and resume. Downstream tool authorization continues to be enforced by AgentGateway and OpenFGA.
Token Forwarding to MCP Tools
The UserContext.obo_jwt (set from X-OBO-JWT header) or UserContext.access_token is forwarded as the Authorization: Bearer header on all MCP tool calls made by the agent runtime. This gives the same per-user enforcement at AgentGateway as the supervisor path provides.
Dynamic Agents also forwards the validated per-request bearer when probing MCP servers for tool manifests. The MCP client connection config carries an explicit Authorization header in addition to the HTTP client factory hook, because AgentGateway denies tokenless probe traffic before any upstream MCP server can return tools.
Only MCP server IDs listed in AGENT_GATEWAY_MCP_SERVER_IDS are rewritten to
AGENT_GATEWAY_URL/mcp/<server_id>. The special value all applies only to
gateway-managed rows (source: agentgateway, agentgateway_discovered: true,
or an endpoint already rooted at AGENT_GATEWAY_URL); manual/direct MCP rows
keep their stored endpoint so runtime-added tools do not get sent to missing
AgentGateway routes. Docker Compose defaults to all because
agentgateway-config-bridge reconciles enabled gateway-managed mcp_servers
rows into the standalone AgentGateway config. The Helm path uses AgentGateway's
native Kubernetes resources: global.agentgateway.knowledgeBaseTarget and
global.agentgateway.extraMcpTargets render AgentgatewayBackend and
HTTPRoute objects instead of running the Mongo polling bridge in-cluster.
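The rewrite rule can be sketched as follows. This is one reading of the paragraph above, not the shipped implementation: `resolve_mcp_endpoint` and the row keys `id` and `endpoint` are hypothetical names, while `source` and `agentgateway_discovered` are taken from the text. The sketch assumes that explicitly listed IDs are always rewritten and that `all` is filtered to gateway-managed rows.

```python
from typing import Any, Dict


def resolve_mcp_endpoint(row: Dict[str, Any], gateway_url: str, server_ids: str) -> str:
    """Return the endpoint Dynamic Agents should call for one mcp_servers row."""
    ids = {s.strip() for s in server_ids.split(",") if s.strip()}
    base = gateway_url.rstrip("/")
    stored = row.get("endpoint", "")

    # A row counts as gateway-managed if any of the three documented markers match.
    gateway_managed = (
        row.get("source") == "agentgateway"
        or row.get("agentgateway_discovered") is True
        or (bool(base) and stored.startswith(base))
    )

    if "all" in ids:
        use_gateway = gateway_managed  # "all" never captures manual/direct rows
    else:
        use_gateway = row["id"] in ids  # assumption: explicit IDs are always rewritten

    return f"{base}/mcp/{row['id']}" if use_gateway else stored
```

With `all`, a manually added row such as `{"id": "custom", "endpoint": "http://custom:9000/mcp", "source": "manual"}` keeps its stored endpoint, which is the behavior that prevents traffic to missing AgentGateway routes.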
For runtime tools/call requests, Dynamic Agents can also attach a signed
X-CAIPE-Agent-Context header containing the calling agent_id. The OpenFGA
bridge verifies this header with CAIPE_AGENT_CONTEXT_HMAC_SECRET, then checks
both relationships before allowing the call:
user:<sub> can_use agent:<agent_id>
agent:<agent_id> can_call tool:<server_id>/<tool_name>
The Web UI backend reconciles the second tuple family from each agent's
allowed_tools whenever an agent is created, updated, or deleted. Empty
per-server tool lists are represented as tool:<server_id>/* so the runtime
allowlist and the enforcement graph use the same wildcard semantics.
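The signing, verification, and tuple shapes above can be illustrated with the standard library. The wire format of `X-CAIPE-Agent-Context` is not specified in this document, so the `base64url(JSON).hex(HMAC-SHA256)` layout below is purely an assumption, and all function names are hypothetical. Only the header name, the secret's role, the two relations, and the `tool:<server_id>/*` wildcard come from the text.

```python
import base64
import hashlib
import hmac
import json
from typing import Dict, List, Optional, Tuple

Tuple3 = Tuple[str, str, str]


def sign_agent_context(agent_id: str, secret: bytes) -> str:
    """Produce a header value (assumed format: <b64url payload>.<hex digest>)."""
    payload = base64.urlsafe_b64encode(json.dumps({"agent_id": agent_id}).encode()).decode()
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def verify_agent_context(header: str, secret: bytes) -> Optional[str]:
    """Return the agent_id if the signature matches, else None."""
    payload, sep, sig = header.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    try:
        return json.loads(base64.urlsafe_b64decode(payload))["agent_id"]
    except (ValueError, KeyError, TypeError):
        return None


def required_checks(sub: str, agent_id: str, server_id: str, tool_name: str) -> List[Tuple3]:
    """Both relationships must hold before a tools/call request is allowed."""
    return [
        (f"user:{sub}", "can_use", f"agent:{agent_id}"),
        (f"agent:{agent_id}", "can_call", f"tool:{server_id}/{tool_name}"),
    ]


def tool_tuples(agent_id: str, allowed_tools: Dict[str, List[str]]) -> List[Tuple3]:
    """Expand allowed_tools into can_call tuples; an empty list means the wildcard."""
    out: List[Tuple3] = []
    for server_id, tools in allowed_tools.items():
        for name in tools or ["*"]:
            out.append((f"agent:{agent_id}", "can_call", f"tool:{server_id}/{name}"))
    return out
```

For example, `tool_tuples("a1", {"github": [], "jira": ["search"]})` yields one wildcard tuple for `github` and one explicit tuple for `jira/search`, mirroring how the reconciler keeps the runtime allowlist and the enforcement graph aligned.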
Key Environment Variables
| Variable | Default | Security note |
|---|---|---|
| AUTH_ENABLED | false | Must be true in production. false returns a hardcoded dev@localhost admin; never deploy with false. |
| OIDC_ISSUER | (none) | Validated against iss claim; tokens from other issuers are rejected |
| OIDC_CLIENT_ID | (none) | Identifies the Web UI client used by browser-facing flows. Dynamic Agents audience validation uses KEYCLOAK_AUDIENCE / OIDC_AUDIENCE. |
| KEYCLOAK_URL / KEYCLOAK_REALM | (none) | Cluster-internal Keycloak base URL and realm used to fetch JWKS. Required when OIDC_ISSUER is a public hostname that is not reachable through the pod's localhost. |
| KEYCLOAK_AUDIENCE / OIDC_AUDIENCE | caipe-platform,agentgateway | Comma-separated audiences accepted for Dynamic Agents bearer validation. Include caipe-ui when browser session tokens carry that audience. |
| OIDC_REQUIRED_GROUP | (none) | Optional deployment-specific Web UI admission gate; users missing this upstream group are denied before product authorization runs |
| OIDC_REQUIRED_ADMIN_GROUP | (none) | Deprecated for CAIPE product admin. Map enterprise admin groups to CAIPE teams through Identity Group Sync, then grant OpenFGA admin on organization:<org>. |
| DA_REQUIRE_BEARER | false | Set to true to require validated bearer identity for runtime OpenFGA enforcement |
| OPENFGA_HTTP | (none) (http://openfga:8080 in Docker Compose dev) | OpenFGA API base URL used for runtime can_use checks |
| OPENFGA_STORE_ID | (none) | Optional explicit OpenFGA store id; takes precedence over store-name discovery |
| OPENFGA_STORE_NAME | caipe-openfga | Store name used when discovering the OpenFGA store id; Docker Compose dev wires this into Dynamic Agents alongside the Web UI backend |
| AGENT_GATEWAY_MCP_SERVER_IDS | all | Comma-separated MCP server IDs that Dynamic Agents should reach through AGENT_GATEWAY_URL; all only includes gateway-managed rows, while manual/direct MCP servers keep their stored endpoints. |
| CAIPE_AGENT_CONTEXT_HMAC_SECRET | (none) | Optional shared secret for signing Dynamic Agents → AgentGateway agent_id context used by the OpenFGA bridge for per-agent MCP tool enforcement. Use a secret manager; do not commit values. |
| SLACK_BOT_ADMIN_URL | http://ai-platform-engineering-slack-bot:3001 | Web UI backend URL for the Slack bot internal admin API used for runtime route status, cache reload, and static-config sync. Keep cluster-internal. |
| OIDC_CLIENT_ID / OIDC_CLIENT_SECRET | caipe-ui / (none) | Web UI backend Keycloak confidential client credentials. The same caipe-ui client is used for browser OIDC login and server-side client-credentials calls to the Slack bot admin API. Store the secret in a secret manager; do not place it in ConfigMaps. |
| SLACK_ADMIN_API_ENABLED | false | Enables the Slack bot's internal admin API. It must remain internal-only and require JWKS-verified Bearer tokens. |
| SLACK_BOT_ADMIN_DEV_AUTH_ENABLED / SLACK_BOT_ADMIN_DEV_TOKEN | false / (none) | Web UI local-dev escape hatch for Slack bot admin API calls when Keycloak is intentionally not running. Sends the configured dev bearer token instead of minting a Keycloak client-credentials token. |
| SLACK_ADMIN_DEV_AUTH_ENABLED / SLACK_ADMIN_DEV_TOKEN | false / (none) | Slack bot side of the same local-dev escape hatch. The bot accepts the dev bearer token only when explicitly enabled. Never enable in shared, staging, or production environments. |
| SLACK_ADMIN_JWKS_URL | (none) | Optional Docker/cluster-internal JWKS URL for Slack bot token verification when the public issuer is not directly reachable from the bot container. |
| SLACK_ADMIN_JWT_AUDIENCE | caipe-slack-bot-admin | Expected audience for Web UI backend service tokens calling Slack bot admin endpoints. |
Service-to-Service Authentication (Slack bot → caipe-ui)
The Slack bot calls caipe-ui's API as a machine client, not as a logged-in user. It uses the OAuth2 client_credentials grant against the caipe realm:
| Env var | Purpose |
|---|---|
| SLACK_INTEGRATION_ENABLE_AUTH=true | Enables Bearer-token path in app.py |
| SLACK_INTEGRATION_AUTH_TOKEN_URL | ${KEYCLOAK_URL}/realms/caipe/protocol/openid-connect/token |
| SLACK_INTEGRATION_AUTH_CLIENT_ID | caipe-slack-bot (pre-created in realm-config.json) |
| SLACK_INTEGRATION_AUTH_CLIENT_SECRET | Fetched from Keycloak; see "Provisioning service-client secrets" below |
| OAUTH2_CLIENT_SECRET | Helm fallback env var for the same caipe-slack-bot client secret, normally sourced from the keycloak-bot Secret |
| KEYCLOAK_BOT_CLIENT_SECRET | Same secret again for the Slack OBO helper (utils/obo_exchange.py) |
Token shape (fields that matter):
- iss → ${KEYCLOAK_URL}/realms/caipe
- aud → [caipe-ui, caipe-platform]. Both audiences are needed. caipe-platform is added by Keycloak's default audience resolution; caipe-ui comes from an oidc-audience-mapper protocol mapper (aud-caipe-ui) on the caipe-slack-bot client. caipe-ui's JWT validator rejects tokens whose audience doesn't include OIDC_CLIENT_ID (i.e. caipe-ui), so this mapper is required.
- azp → caipe-slack-bot
- sub → service account UUID (stable)
- preferred_username → service-account-caipe-slack-bot
- scope → groups email profile org roles
The mapper is created automatically by deploy/keycloak/init-idp.sh (idempotent).
This token represents the bot, not the user. User identity is carried separately by the OBO flow in utils/obo_exchange.py (RFC 8693 token exchange), which produces a second token with act.sub=caipe-slack-bot and the real user's sub/email.
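As a rough illustration of the machine-client flow, the sketch below builds a standard OAuth2 `client_credentials` token request and applies the audience rule described for caipe-ui's validator. `build_token_request` and `audience_accepted` are hypothetical helpers, the host name and secret in the usage are placeholders, and sending the secret in the form body (rather than via HTTP Basic auth) is an assumption about how the bot authenticates. Signature and issuer validation are deliberately out of scope.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a client_credentials token request."""
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }
    ).encode()
    return Request(
        token_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def audience_accepted(claims: dict, required_audience: str) -> bool:
    """Mirror the documented rule: the aud claim must include OIDC_CLIENT_ID."""
    aud = claims.get("aud", [])
    if isinstance(aud, str):
        # JWT allows aud to be a single string or a list of strings.
        aud = [aud]
    return required_audience in aud


# Usage with placeholder values; pass the result to urllib.request.urlopen to send it.
req = build_token_request(
    "https://keycloak.example.com/realms/caipe/protocol/openid-connect/token",
    "caipe-slack-bot",
    "placeholder-secret",
)
```

The audience helper shows why the `aud-caipe-ui` mapper matters: a token carrying only `caipe-platform` fails the check against `caipe-ui`.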
Provisioning service-client secrets in production
In dev, secrets are embedded in deploy/keycloak/realm-config.json. In production, operators should treat them as rotating credentials:
Option A, manual (Keycloak Admin UI):
- Log into Keycloak Admin Console → caipe realm → Clients → caipe-slack-bot → Credentials tab.
- Copy the Secret value (or click Regenerate Secret for rotation).
- Store it in your secret manager (Vault, AWS SSM, K8s Secret) as SLACK_INTEGRATION_AUTH_CLIENT_SECRET.
- Redeploy / restart the Slack bot pod so it picks up the new secret.
Option B, scripted (deploy/keycloak/export-client-secrets.sh):
The script fetches secrets via the Keycloak Admin API and emits them in one of three formats:
# shell (source into current session)
eval "$(KC_URL=https://keycloak.example.com ./export-client-secrets.sh)"
# dotenv (append to a .env file)
KC_URL=https://keycloak.example.com FORMAT=dotenv \
./export-client-secrets.sh >> slack-bot.env
# kubernetes Secret (pipe to kubectl)
KC_URL=https://keycloak.example.com FORMAT=k8s \
K8S_NAMESPACE=caipe K8S_SECRET_NAME=caipe-service-secrets \
./export-client-secrets.sh | kubectl apply -f -
The Helm chart can wire this up as a post-install Job so fresh installs get the Secret populated without operator intervention. Rotation is the same call; the Secret is overwritten in place.
Slack bot → Keycloak Admin REST API (identity lookup)
Separate from the OBO flow above. The Slack bot also calls Keycloak's Admin REST API to find a Keycloak user by slack_user_id attribute (and to read/write team_id). This is the call that fires when someone @mentions the bot for the first time. It uses client_credentials and a different Keycloak client than the OBO flow.
| Env var | Purpose |
|---|---|
| KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_ID | Confidential Keycloak client for slack-bot's Admin API calls (lookup + JIT create). Default caipe-platform; that client's service account is granted view-users + query-users + manage-users on realm-management for user lookup/create, plus the client/authz roles needed by the Web UI BFF Keycloak RBAC migration. |
| KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_SECRET | Matching client_secret. In dev, defaults to caipe-platform-dev-secret. |
| KEYCLOAK_URL, KEYCLOAK_REALM | Same values as everywhere else. |
| SLACK_RBAC_ENABLED | Enables Slack-side identity lookup, team/channel resolution, OBO exchange, and channel ReBAC checks before the bot forwards a request. |
| SLACK_JIT_CREATE_USER (spec 103) | true (default) auto-creates a federated-only Keycloak shell user on first DM when no Keycloak user with the Slack email exists. false falls through to the HMAC link URL so onboarding requires the web UI. Reuses KEYCLOAK_SLACK_BOT_ADMIN_*; no new secret. See plan R-8 for the single-credential trade-off. |
| SLACK_JIT_ALLOWED_EMAIL_DOMAINS (spec 103) | Optional comma-separated allowlist (e.g. corp.com,acme.io). Empty = any domain. Recommended for prod when the federated IdP can return non-corporate emails. |
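The allowlist semantics of SLACK_JIT_ALLOWED_EMAIL_DOMAINS can be sketched as below. `jit_email_allowed` is a hypothetical helper, and the exact-match, case-insensitive comparison (no subdomain matching) is an assumption; the text only states that the value is comma-separated and that an empty value admits any domain.

```python
def jit_email_allowed(email: str, allowed_domains: str) -> bool:
    """Return True if JIT user creation may proceed for this email address."""
    _, sep, domain = email.rpartition("@")
    if not sep or not domain:
        # Not a usable email address at all.
        return False
    domains = {d.strip().lower() for d in allowed_domains.split(",") if d.strip()}
    if not domains:
        # Empty allowlist means any domain is accepted.
        return True
    # Assumption: exact domain match, so sub.corp.com does not match corp.com.
    return domain.lower() in domains
```

For instance, with the value `corp.com,acme.io`, an address at `corp.com` passes while one at any other domain falls through to whatever the bot does when JIT creation is refused.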
In Helm and GitOps installs, charts/ai-platform-engineering/charts/slack-bot/templates/deployment.yaml wires OAUTH2_CLIENT_SECRET and KEYCLOAK_BOT_CLIENT_SECRET from the Keycloak bot Secret, while the Slack tokens and KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_SECRET can come from an ExternalSecret such as Vault path projects/caipe/rbac/slackbot.
Why KEYCLOAK_SLACK_BOT_ADMIN_* and not just KEYCLOAK_ADMIN_* or KEYCLOAK_BOT_ADMIN_*? Two reasons:
- No collision with the Web UI backend. Pre-098 the slack-bot read the same KEYCLOAK_ADMIN_* env names as the Web UI backend. Both services share docker-compose.dev.yaml env interpolation, so a single KEYCLOAK_ADMIN_CLIENT_ID=admin-cli line in .env (intended for the UI's password-grant fallback) silently overrode the slack-bot's client_credentials path, producing HTTP 401 "Public client not allowed to retrieve service account" on every Slack mention.
- Room for future surfaces. The surface-specific prefix (KEYCLOAK_<surface>_BOT_ADMIN_*) means future bot integrations like KEYCLOAK_WEBEX_BOT_ADMIN_* or KEYCLOAK_TEAMS_BOT_ADMIN_* can each have their own dedicated namespace without yet another rename.
Required client config in Keycloak (any client you point this at):
- publicClient: false
- serviceAccountsEnabled: true
- clientAuthenticatorType: client-secret
- Service-account user has these realm-management client roles for Slack identity lookup/JIT: view-users, query-users, and manage-users. In the default caipe-platform wiring it also has query-clients, view-clients, manage-clients, view-authorization, and manage-authorization so the Web UI BFF can repair Keycloak OBO mappings.
The realm seeder already provisions caipe-platform with all of those, so the default values "just work" in dev.
Spec 104: active_team JWT claim (REMOVED by Phase 3 of spec 2026-05-24-derive-team-from-channel)
Status: removed. The active_team JWT claim mechanism described here has been demolished. Team identity is now derived from the channel_team_mappings collection at request time (BFF + AgentGateway PDP). Bots no longer mint team-<slug> client scopes, the OBO audience client no longer has any team-* default scope, and Keycloak no longer participates in team-identity negotiation.
See spec 2026-05-24-derive-team-from-channel for the full demolition rationale. The active_team mechanism never shipped to production, so no realm has legacy team-* scopes to clean up; Phase 3 is a pure code/Helm/UI deletion.
Components touched (post-demolition)
- Keycloak: no per-team client scopes, no active_team mapper, no team-personal DM-marker scope. Only the team-agnostic OBO permission wiring (token-exchange decision strategy, bot service-account impersonation roles, realm-wide users.impersonate scope-permission) remains in scope of the reconciliation migration.
- Web UI backend (caipe-ui): POST /api/admin/teams writes a Mongo team row + OpenFGA tuples only. DELETE /api/admin/teams/[id] removes those rows. Slack / Webex channel onboarding writes channel_team_mappings entries (no Keycloak touch).
- Slack / Webex bots: obo_exchange.impersonate_user() no longer requests a team-* scope and no longer verifies an active_team claim. Channel → team resolution lives entirely in channel_team_resolver, which reads from channel_team_mappings.
- Dynamic agents: request-bound auth context is the user OBO JWT only. No active_team claim is read or written.
- AgentGateway PDP / RAG server: both consume the user JWT plus the channel→team mapping. RAG's UserContext.active_team field is gone; _kb_cel_context now exposes user.teams as a list of teams the user belongs to (OpenFGA-sourced), not the single channel team.
Failure modes (post-demolition)
- Group channel without a team mapping → bot replies "this channel isn't assigned to a CAIPE team yet"; nothing reaches AGW.
- User not in the mapped team → bot replies "you aren't a member of <team>".
- DM with no dm_agent_id preference and no realm default → bot replies with the default_agent_id selection UI.
- DA receives a request without a user JWT → middleware logs WARNING, MCP call goes out without Authorization, AGW 401s.
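The first two failure modes above amount to a short decision made before anything is forwarded to AgentGateway. The sketch below is illustrative only: `resolve_channel_request` is a hypothetical function, the plain dictionary stands in for the channel_team_mappings collection, and the set of team names stands in for the user's OpenFGA-sourced memberships.

```python
from typing import Dict, Set, Tuple


def resolve_channel_request(
    channel_id: str,
    user_teams: Set[str],
    channel_team_mappings: Dict[str, str],
) -> Tuple[bool, str]:
    """Return (forward, detail): the team to use, or the reply the bot sends."""
    team = channel_team_mappings.get(channel_id)
    if team is None:
        # Unmapped group channel: reply locally, nothing reaches AGW.
        return False, "this channel isn't assigned to a CAIPE team yet"
    if team not in user_teams:
        return False, f"you aren't a member of {team}"
    return True, team
```

Because the team comes from the mapping rather than from a JWT claim, the same user token can be used in every channel, and only the lookup result changes.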