Research: Slack JIT Keycloak user creation with web-UI auto-merge
Companion to: spec.md, plan.md, tasks.md Created: 2026-04-22 Status: Final (post-implementation)
This document captures the decisions taken and the alternatives explicitly rejected while designing the JIT path. Everything here is load-bearing for the security posture of the feature; future maintainers should re-read it before changing the JIT flow.
D1 โ Single Keycloak admin client (caipe-platform) for both lookup and creationโ
Decisionโ
The Slack bot uses one Keycloak admin client
(caipe-platform, env: KEYCLOAK_SLACK_BOT_ADMIN_CLIENT_*) for both:
- Reading users (
view-users,query-users) - Creating JIT shell users (
manage-users) - Setting attributes on users it just created (
manage-users)
The service account holds exactly the three realm-management client
roles {view-users, query-users, manage-users} โ no more, no less.
Alternatives consideredโ
A1.a โ Dedicated caipe-slack-bot-provisioner clientโ
A separate Keycloak client with only manage-users, kept out of the
slack-bot's normal lookup token cache. The lookup path keeps reading
through the existing read-only client; only the JIT branch acquires a
write-capable token via the second client.
Pros:
- Strict privilege separation: a compromised lookup token cannot create users; a compromised provisioner token cannot enumerate them.
- Token-exchange audit logs in Keycloak cleanly attribute the two responsibility classes to different clients.
Rejected because:
- Doubles the secret-bootstrap surface (two
ExternalSecrets, twoSecretHelm templates, two rotation calendars) for a ~5 KLOC integration component. - The threat being mitigated (compromised lookup token escalating to
write) is already mitigated by the helper-shape constraint M1
(see ยงM1 below): the only
PUT /users/{id}call site is hard-coded to a UUID just returned by the samePOST, never user-supplied. - The audit story is recovered cheaply by always-on
event=slack_jit_user_createdlog lines from the bot itself, which also include theslack_user_idandmask_email(email)for human triage. - Operationally the team's Keycloak admin already had to be told
exactly one secret per environment for
caipe-platform; doubling that during the 098 rollout would have been a regression.
A1.b โ A separate Keycloak admin client per slack workspace / tenantโ
Multi-workspace tenancy was deferred to a later spec (cf. spec.md ยง"Out-of-scope"). Single workspace โ single client.
Trade-off accepted (R-8 in plan.md)โ
A compromise of caipe-platform's service-account credentials grants the
attacker both read and write on every user in the realm. Mitigations in
production:
- The credential is held only in Helm
ExternalSecrets(or in dev.env) and never logged (seeslack_bot/utils/log_redaction.py). realm-managementroles do not includemanage-realm/manage-clients/view-events, so the blast radius is bounded to the user database; an attacker cannot escalate to realm configuration or read auth events.- Keycloak admin events for
MANAGE_USER/CREATE_USERare forwarded to the SIEM (seedocs/docs/security/rbac/architecture.mdComponent 1) and an unexpected creation rate triggers theKeycloakUserCreationSpikealert.
D2 โ Auto-merge on first Duo sign-in (no user prompt)โ
Decisionโ
The Keycloak Identity Provider for Duo OIDC is configured with:
firstBrokerLoginFlowAlias=silent-broker-login(a custom flow containing onlyidp-create-user-if-unique(ALTERNATIVE) followed byidp-auto-link(ALTERNATIVE))trustEmail = truesyncMode = IMPORT(changed fromFORCEmid-session โ see D2.b)
The result: when a JIT-created shell user (with a verified email but no federated identity) signs in via Duo for the first time, Keycloak silently links the federated identity to the existing shell user. The user sees a normal Duo redirect and lands in the CAIPE web UI; they do not see Keycloak's default "We found an existing account, link?" confirmation prompt.
Alternatives consideredโ
A2.a โ Keycloak's default first broker login flow with confirmation promptโ
The out-of-the-box flow shows the user the email of the existing account and asks them to confirm linking it. This is the safe default for an unbounded multi-tenant Keycloak: it stops a malicious IdP from auto-attaching itself to a victim's existing account.
Rejected because:
- In our deployment the IdP is the corporate enterprise SSO; it authoritatively owns the email namespace. There is no scenario where a different person legitimately owns the same enterprise email at the IdP level.
- The user-visible confirmation prompt would be the only time the user ever sees the Keycloak realm; the rest of CAIPE hides Keycloak behind the BFF and Duo. Showing it once for what looks like an internal app bug is bad UX.
- All five auth-architecture peer reviewers preferred the silent-merge trade-off given the bounded threat model.
A2.b โ syncMode=FORCE (initial choice, reverted)โ
The initial implementation used syncMode=FORCE so attributes from the
IdP overwrite local attributes on every login. Mid-session we discovered
that this wiped the slack_user_id attribute that the JIT branch
had just written, because the IdP's userinfo response does not include
that attribute.
Reverted to syncMode=IMPORT which only sets attributes on first
import and never overwrites them on subsequent logins. The slack_user_id
attribute is now preserved across web-UI logins, which was always the
intended behavior.
The change is captured in:
charts/ai-platform-engineering/charts/keycloak/scripts/init-idp.sh(canonical script, IDPsyncMode: IMPORT)docs/docs/security/rbac/workflows.md(updated diagram)
D3 โ Default JIT to ON, gated by env flagโ
Decisionโ
SLACK_JIT_CREATE_USER defaults to true. Operators opt out by
setting it to false in the deployment env or Helm values
(slackBot.jit.createUser=false).
Alternatives consideredโ
A3.a โ Default OFF, opt-inโ
Rejected because the entire reason this spec exists is that the default behavior (refuse to talk to unknown Slack users) is bad UX in the corporate-Slack-only environment that is our primary deployment target. Defaulting OFF would mean every operator immediately flips it ON, which is a strong signal the default is wrong.
A3.b โ Default ON, no opt-out flagโ
Rejected because we have at least one known deployment (an external
partner pilot) where the slack workspace is multi-organization and
auto-creating Keycloak users for unknown email domains would let a
partner's user impersonate a CAIPE user just by joining a shared Slack
channel. The combination SLACK_JIT_CREATE_USER=true +
SLACK_JIT_ALLOWED_EMAIL_DOMAINS=corp.com covers this case explicitly.
D4 โ SLACK_JIT_ALLOWED_EMAIL_DOMAINS allowlist (CSV)โ
Decisionโ
When non-empty, SLACK_JIT_ALLOWED_EMAIL_DOMAINS is a CSV of email
domain suffixes (lowercase, no leading @). The JIT branch creates a
user only if the Slack profile's email's domain is in the list.
When empty (the default), all domains are allowed (subject only to JIT being enabled at all).
Alternatives consideredโ
A4.a โ Regex-based allowlistโ
Rejected as overkill for the actual operator need (typically 1-3 literal corporate domains). A CSV is easier to audit in PR review.
A4.b โ Denylist insteadโ
Rejected: an operator that needs precise control wants to specify the known good set, not chase down every public-email domain that might leak in via partner Slack federation.
D5 โ Reuse the existing admin token cache, no new token plumbingโ
Decisionโ
The JIT branch acquires its admin token via the same
KeycloakAdminConfig + admin token cache that the lookup path already
uses. No new connection pool, no new env var, no separate refresh
schedule.
Rationaleโ
Adding a second token client would have meant either a second
httpx.AsyncClient instance (doubling sockets per process) or
threading two tokens through every helper signature (boilerplate
explosion). Since D1 already settled on a single client, reusing the
single token cache is a free correctness win.
D6 โ email_masking.py for log redaction (FR-010, FR-011)โ
Decisionโ
A small mask_email(email) helper returns "<first 3 chars>***@<domain>"
(e.g. srz***@cisco.com). Used by every log line that needs to mention
a user identity for triage but must not leak the full email.
This is paired with a broader log_redaction.py (PII scrubber) wired
into the loguru sink at startup so existing call sites pick up redaction
without per-line code changes.
Alternatives consideredโ
A6.a โ Hash the email instead of maskingโ
Rejected because operators triaging a JIT failure need to recognize
the user from the log line. A hash provides correlation but no human
readability; masking gives both ("yeah that's sri***@cisco.com, that's
me").
A6.b โ Don't log the email at allโ
Rejected because it makes triage of "JIT created the wrong user for me" complaints impossible. Masked email is the right balance.
M1 โ Helper-shape mitigation: PUT /users/{id} is bound to the freshly-created UUIDโ
Mitigationโ
create_user_from_slack() returns a kc_user_id UUID parsed from the
Location header of the POST /admin/realms/caipe/users response.
Any subsequent PUT /users/{id} call inside the same flow (e.g. to set
a slack_user_id attribute defensively) is constrained to that exact
UUID โ the caller never gets a generic update_arbitrary_user(id, ...)
helper.
This is structural: the JIT module exports
create_user_from_slack(slack_user_id, email) which internally chains
the POST + optional PUT, but never exports a set_attribute_on_any_user
function. The keycloak_admin module's general-purpose set_user_attribute
is rate-limited and audit-logged separately.
Why this mattersโ
In a future bug where Slack profile data flows into the JIT path unsanitized, the worst an attacker could do is create a user with a wonky email/name and write attributes on that new user only. They cannot pivot to writing attributes on, say, the admin user.
Test coverage: test_post_users_url_targets_only_freshly_created_id in
test_keycloak_admin_jit.py exercises this invariant.
References usedโ
- Keycloak Server Administration โ IdP brokering flow customization:
idp-create-user-if-uniqueexecutionidp-auto-linkexecutionfirstBrokerLoginFlowAliasconfigurationsyncModesemantics (FORCE vs IMPORT vs LEGACY)
- Keycloak Admin REST API โ user representation requirements:
- PUT
/admin/realms/{realm}/users/{id}requiring full user-profile fields in Keycloak 26 (the round-trip fix inkeycloak_admin.py).
- PUT
- RFC 8693 โ OAuth 2.0 Token Exchange (background only; JIT does not use token-exchange).
docs/docs/security/rbac/architecture.mdComponent 1 โ Keycloak baseline configuration that this spec extends.
Open follow-ups (NOT for this spec)โ
- F1: Periodic CI assertion that
service-account-caipe-platformholds exactly{view-users, query-users, manage-users}and no others. Tracked separately from this spec. - F2: Multi-workspace Slack tenancy: per-workspace admin client and per-workspace JIT allowlist. Out of scope here; tracked in a future spec (104).
- F3: An admin UI surface to list "JIT-created users that never
completed Duo sign-in" (i.e. shell users older than N days with no
federatedIdentitiesentry) for cleanup.
Assisted-by: Claude:claude-opus-4-7