Feasibility — Remote PDP options for CAIPE
Status: Historical decision note. The current implementation has adopted
OpenFGA behind AgentGateway ext_authz; this page remains useful as rationale
for why OpenFGA was selected over Keycloak UMA, OPA, Cedar, and keeping inline
AgentGateway rules as the long-term policy surface.
Audience: Anyone evaluating PDP choices (OpenFGA, OPA, Cedar, Cerbos, …) for CAIPE's authorization layer. Read roles-scopes-comparison.md first for the current model.
TL;DR
- CAIPE is a two-PDP system today. AgentGateway delegates data-plane MCP authorization to OpenFGA through ext_authz; Keycloak Authorization Services (UMA 2.0) handles management-plane checks via require_rbac_permission() (ai_platform_engineering/utils/auth/keycloak_authz.py and friends). No CEL is involved on the Keycloak side: it's role/group/aggregated policies and (deprecated) JS.
- AgentGateway already supports remote PDPs out of the box: both gRPC ext_authz (Envoy-compatible, the same API OPA/OpenFGA/Cedar agents speak) and HTTP ext_authz, with failureMode: FailOpen | FailClosed (default closed). A remote PDP is a config change, not a code change in AGW. (See AGW external authz docs.)
- CAIPE's data is genuinely relationship-shaped: the simplified entity diagram in roles-scopes-comparison.md shows USER → TEAM → TOOL relationships that today are encoded by string-concatenating role names. ReBAC engines (OpenFGA, SpiceDB) express this natively.
- OpenFGA is the selected AGW data-plane PDP for CAIPE specifically. OPA is the safer, more general choice if you're planning to layer on many other policy domains (data, network, K8s admission). Cedar is intellectually elegant but has a smaller community. Keycloak's own PDP can be extended to the AGW hot path via ext_authz too; see the explicit "why not just use Keycloak" section below for the tradeoffs.
What problem are we solving?
| Pain | Does a remote PDP help? |
|---|---|
| The slug-vs-ObjectId bug (team_member:<oid> vs team_member:<slug>) | ❌ No. That was an admin-API consistency bug. Phase A (identity-service) is the fix. |
| 5 services duplicating Keycloak Admin API calls | ❌ No. Same as above — identity-service problem. |
| Gateway policy rules getting hard to maintain as we add resources | ✅ Yes. ReBAC moved these relationship-shaped decisions into OpenFGA tuples instead of a growing inline rule set. |
| Want "who has access to X?" reverse queries (e.g. "show me everyone who can invoke jira_search") | ✅ Yes. ReBAC engines do this in ms; doing it against Keycloak today requires walking every user's roles. |
| Want hierarchical/delegated permissions ("team A admin can grant access to team A's resources") | ✅ Yes. ReBAC models this natively. |
| Want to swap Keycloak for another IdP later | 🟡 Partial. PDP separation makes the IdP-switch cleaner because the IdP no longer owns policy. But the PDP is not itself an IdP abstraction. |
| Want policy-as-code with versioning, signing, bundles | ✅ Yes. OPA in particular is built around this. |
If the pains you care about are mostly the ✅ rows, a PDP makes sense. If they're mostly the top two (❌) rows, build identity-service first and revisit.
The four families of PDPs
Family A — Relationship/Graph PDPs (Zanzibar-style)
Store tuples like (user, relation, object). Answer "is there a path from this user to this object via these relations?" Originally Google Zanzibar.
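The path question above can be sketched as a small graph search. This is a minimal illustration over an in-memory tuple set, with a hypothetical `check` helper and sample tuples; real Zanzibar-style engines add a typed schema, indexing, and consistency controls that are omitted here.

```python
from collections import deque

# Each tuple is (subject, relation, object). A subject is either a concrete
# entity ("user:alice") or a userset ("team:platform#member").
# The sample data is illustrative only.
TUPLES = {
    ("user:alice", "member", "team:platform"),
    ("team:platform#member", "can_use", "tool:jira_search"),
}


def check(user: str, relation: str, obj: str, tuples=TUPLES) -> bool:
    """Return True if `user` reaches `obj` via `relation`, following usersets."""
    # Breadth-first search backwards from the (object, relation) pair.
    pending = deque([(obj, relation)])
    seen = set()
    while pending:
        current = pending.popleft()
        if current in seen:
            continue
        seen.add(current)
        current_obj, current_rel = current
        for subject, rel, target in tuples:
            if rel != current_rel or target != current_obj:
                continue
            if subject == user:
                return True
            if "#" in subject:
                # Userset: "team:platform#member" expands to (team:platform, member).
                next_obj, next_rel = subject.split("#", 1)
                pending.append((next_obj, next_rel))
    return False
```

Here `check("user:alice", "can_use", "tool:jira_search")` succeeds by hopping through the team's member userset, which is the "path" the paragraph describes.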
| PDP | Status | License | Native ext_authz | Best for |
|---|---|---|---|---|
| OpenFGA | CNCF Sandbox; donated by Auth0/Okta | Apache 2.0 | ✅ gRPC + HTTP | Hierarchical resources, sharing/delegation, "who has access to X?" reverse queries |
| SpiceDB | Open core; commercial backing (Authzed) | Apache 2.0 | ✅ gRPC | Same as OpenFGA + stronger consistency guarantees (ZedTokens, SpiceDB's equivalent of Zanzibar zookies) |
| Permify | Open source | Apache 2.0 | ✅ gRPC | Smaller footprint, Postgres-backed |
| Warrant | Commercial-only (Auth0 acquired) | Closed | ✅ via Auth0 FGA | Companies already on Auth0 |
Verdict for CAIPE: OpenFGA edges out the rest because (a) a community-maintained Keycloak event-publisher SPI already exists (keycloak-openfga-event-publisher), (b) it has CNCF Sandbox status, and (c) Auth0/Okta backing means the tooling around it (UI, SDKs, debugging) is well-funded.
Family B — General-purpose policy engines
Evaluate policies as code against arbitrary input documents. Far more flexible, but you model both the data and the policy yourself.
| PDP | Language | Status | Native ext_authz | Best for |
|---|---|---|---|---|
| OPA (Open Policy Agent) | Rego | CNCF Graduated | ✅ gRPC | When policies span many domains (auth + data + network + K8s admission); when "policies live next to code" with bundles + signing matters |
| Cedar (AWS) | Cedar | CNCF Sandbox; AWS-designed | ✅ via cedar-agent | Teams that find Rego painful; teams that want a formally verified type system |
| Topaz (Aserto) | Rego + relationship directory | Open source | ✅ gRPC | OPA's flexibility + pre-built RBAC/ReBAC scaffolding |
Verdict for CAIPE: OPA is overkill for the current 5-rule CEL footprint. Becomes attractive if you start adding many other policy domains (e.g. data-access policies, K8s admission, network policies). Cedar is intellectually clean but has a smaller community than OPA.
Family C — Commercial / managed PDPs
| PDP | Model | Notable for |
|---|---|---|
| Cerbos | Stateless decisions; YAML policies; sidecar | Apps that pass principal+resource attributes per request — no separate relationship store. Lowest latency. |
| Permit.io | OPA + OpenFGA underneath, with admin UI + SaaS | Teams that want the policy authoring UX more than the engine |
| Aserto | Managed Topaz | OPA + directory as a service |
| Auth0 FGA | Hosted OpenFGA | Already on Auth0 |
| Casbin | Embedded library, polyglot (Go/Python/Java/Rust/Node/etc.) | Embed-in-app use cases — not a gateway PDP |
Verdict for CAIPE: Cerbos is the most interesting commercial-friendly option if you want low latency and don't want a separate tuple store. Permit.io is worth a demo if the admin UX matters more than the engine choice.
Family D — Use what you already ship
| Approach | Notable for |
|---|---|
| Keep inline rules on AGW only | Rejected for the current branch. Relationship-shaped grants now live in OpenFGA so the Admin UI can answer and explain "who has access to X?". |
| Use Keycloak Authorization Services (UMA) more | Already in production for management-plane checks (Web UI backend, supervisor, MCP middleware, slack bot — see ai_platform_engineering/utils/auth/keycloak_authz.py). Could be extended to AGW via ext_authz at the cost of latency and Keycloak-on-hot-path coupling. See Why we don't use Keycloak's PDP for AGW today below. |
| Roll your own | The "small RBAC service we'll build in 2 weeks" is the most-rewritten artifact in the industry. Don't. |
Why we don't use Keycloak's PDP for AGW today
Keycloak ships its own PDP (Keycloak Authorization Services, UMA 2.0). It's already on the management plane: require_rbac_permission() in ai_platform_engineering/utils/auth/keycloak_authz.py calls Keycloak's /realms/<r>/protocol/openid-connect/token endpoint with grant_type=urn:ietf:params:oauth:grant-type:uma-ticket for every Web UI backend/supervisor/MCP-middleware permission check. The data plane now uses OpenFGA behind AgentGateway ext_authz; Keycloak UMA remains off the hot path.
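For orientation, a UMA decision request to that token endpoint might be assembled roughly as follows. This is a sketch, not the contents of keycloak_authz.py: the `audience`, `permission`, and `response_mode` form fields follow Keycloak's Authorization Services conventions, while the function name, realm, client, resource, and scope values are hypothetical.

```python
from urllib.parse import urlencode


def uma_decision_request(base_url: str, realm: str, audience: str,
                         resource: str, scope: str) -> tuple[str, str]:
    """Build the URL and form body for a Keycloak UMA permission decision."""
    url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    form = {
        "grant_type": "urn:ietf:params:oauth:grant-type:uma-ticket",
        "audience": audience,                 # resource-server client id
        "permission": f"{resource}#{scope}",  # resource and scope being checked
        "response_mode": "decision",          # ask for a yes/no answer, not an RPT
    }
    # The caller would POST this body with the user's access token in an
    # "Authorization: Bearer ..." header; sending is left out of the sketch.
    return url, urlencode(form)


# Hypothetical values, for illustration only.
url, body = uma_decision_request(
    "https://keycloak.example", "caipe", "caipe-platform", "admin_ui", "view"
)
```

Each such call is a full HTTP round trip to Keycloak, which is the cost the reasons below weigh against the gateway hot path.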
We don't put Keycloak's PDP on AGW's data plane for three reasons:
- Latency. Every tool call would add a Keycloak RPC (~5-30ms). CAIPE issues many tool calls per chat turn.
- Policy expressiveness. Gateway authorization references MCP resource names and per-request team context. To replicate this in Keycloak you'd pre-mint a resource per (tool × team), or use Keycloak's deprecated JS policies. Both are awkward.
- Operational coupling. Putting Keycloak on the per-request decision path means every tool call hard-depends on Keycloak liveness. With OpenFGA, AGW only depends on Keycloak for JWT signature validation (JWKS, cached), while relationship decisions use the OpenFGA bridge.
These are the same reasons people picking OpenFGA / OPA / Cedar at scale don't pick Keycloak's PDP: those engines are purpose-built for low-latency decision RPCs with caching, sharding, and decision-keyed replication patterns Keycloak's UMA wasn't optimized for.
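To make the latency argument concrete, the arithmetic can be sketched as below. The per-check ranges are the ones quoted in this document; the figure of 10 sequential tool calls per chat turn is an assumption chosen for illustration, not a measured CAIPE number.

```python
# Per-check latency ranges in milliseconds, as quoted in this document.
LATENCY_MS = {
    "OpenFGA": (2, 5),
    "OPA": (1, 3),
    "Keycloak UMA": (5, 30),
}


def added_latency_ms(pdp: str, tool_calls: int) -> tuple[int, int]:
    """Best-case and worst-case added latency if checks run one after another."""
    low, high = LATENCY_MS[pdp]
    return low * tool_calls, high * tool_calls


TOOL_CALLS_PER_TURN = 10  # assumption for illustration only

keycloak_cost = added_latency_ms("Keycloak UMA", TOOL_CALLS_PER_TURN)
openfga_cost = added_latency_ms("OpenFGA", TOOL_CALLS_PER_TURN)
```

Under that assumption Keycloak UMA adds 50 to 300 ms per chat turn versus 20 to 50 ms for OpenFGA; the gap widens linearly with the number of tool calls.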
Comparison matrix
| Dimension | OpenFGA | OPA | Cedar | Cerbos | Keycloak AuthZ (UMA) | Keep inline AGW rules |
|---|---|---|---|---|---|---|
| Relationship-shaped data fit | ✅ excellent | 🟡 you build it | 🟡 you build it | ❌ stateless model | 🟡 role/group only | 🟡 string roles work |
| Policy authoring complexity | DSL — moderate | Rego — high | Cedar — moderate | YAML — low | UI / JSON — low | CEL — low |
| ext_authz integration with AGW | ✅ native gRPC | ✅ native gRPC | ✅ via cedar-agent | ✅ HTTP | 🟡 HTTP, but awkward semantics | n/a (built-in) |
| "Who has access to X?" reverse queries | ✅ excellent | ❌ not really | 🟡 limited | ❌ no | 🟡 via Admin API | ❌ no |
| Per-request variables (e.g. mcp.tool.name) | ✅ via tuples or context | ✅ via input doc | ✅ via context | ✅ via input doc | ❌ requires pre-minting per-resource | ✅ native |
| Operational overhead | New service + DB | New service + bundles | New service | New sidecar | None — already deployed | None |
| Latency added per check | ~2-5ms | ~1-3ms | ~1-3ms | less than 1ms (sidecar) | ~5-30ms (JVM, JWT minting) | 0 |
| Maturity / community | High (CNCF Sandbox) | Highest (CNCF Graduated) | Medium | Medium | High (Red Hat) | n/a |
| Vendor lock | None | None | None | None (open core) | None | n/a |
| Multi-tenancy story | Good | DIY | DIY | Good | Per-realm isolation | DIY |
| When you outgrow CAIPE's scale | Scales well | Scales very well | Scales well | Scales (stateless) | UMA endpoint becomes a bottleneck | CEL scales fine |
| Custom policy logic | Yes (relations) | Yes (Rego — anything) | Yes (Cedar — typed) | Yes (YAML conditions) | 🟡 only via deprecated JS policies (KC ≤25) | Yes (CEL — anything) |
Recommendation
If you're committing to ReBAC long-term
Pick OpenFGA. Multi-team resource sharing, hierarchical agents, and complex delegation are all natural in ReBAC and painful in role-string concatenation. The data model in roles-scopes-comparison.md is already relationship-shaped — OpenFGA expresses it directly:
model
  schema 1.1

type user

type team
  relations
    define member: [user]

type tool
  relations
    define can_use: [team#member]
Tuples:
user:alice member team:platform
team:platform#member can_use tool:jira_search
Check:
check(user:alice, can_use, tool:jira_search)
→ true if alice is a member of any team that can_use that tool
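The same tuples and check can be expressed as request payloads for OpenFGA's HTTP API. The sketch below only builds the payloads, assuming the `/stores/{store_id}/write` and `/stores/{store_id}/check` endpoints; the base URL and store id are placeholders. The team grant is written with the userset `team:platform#member` so that it matches the `[team#member]` type restriction on `can_use`.

```python
import json

BASE_URL = "http://openfga:8080"          # assumption: placeholder address
STORE_ID = "01HYPOTHETICALSTOREID000000"  # assumption: placeholder store id


def tuple_key(user: str, relation: str, obj: str) -> dict:
    """One relationship tuple in the shape OpenFGA's API expects."""
    return {"user": user, "relation": relation, "object": obj}


def write_request(tuples: list[tuple[str, str, str]]) -> tuple[str, str]:
    """URL and JSON body for writing relationship tuples."""
    body = {"writes": {"tuple_keys": [tuple_key(*t) for t in tuples]}}
    return f"{BASE_URL}/stores/{STORE_ID}/write", json.dumps(body)


def check_request(user: str, relation: str, obj: str) -> tuple[str, str]:
    """URL and JSON body for a single authorization check."""
    body = {"tuple_key": tuple_key(user, relation, obj)}
    return f"{BASE_URL}/stores/{STORE_ID}/check", json.dumps(body)


write_url, write_body = write_request([
    ("user:alice", "member", "team:platform"),
    ("team:platform#member", "can_use", "tool:jira_search"),
])
check_url, check_body = check_request("user:alice", "can_use", "tool:jira_search")
```

POSTing `check_body` to `check_url` should return a JSON object with an `allowed` boolean once the tuples are written.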
If you're staying RBAC-shaped but want a real PDP
Pick OPA. The investment in Rego pays off across many policy domains beyond just RBAC.
Current decision
OpenFGA is now the selected data-plane PDP for AgentGateway. Keycloak
Authorization Services remains the management-plane PDP for Web UI backend, supervisor,
MCP middleware, and Slack bot checks. Do not reintroduce a separate AG MCP
policy CRUD surface; model gateway access as OpenFGA tuples and let AG call the
bridge through ext_authz.
Why not just lean harder on Keycloak's PDP and skip OpenFGA/OPA entirely?
It's the most defensible "do nothing new" answer. Keycloak's PDP is fine for management-plane checks and you already have the wiring (require_rbac_permission(), mcp_agent_auth.pdp, keycloak-authz.ts). You can put it on AGW's data plane via HTTP ext_authz too. The reasons not to are practical, not architectural — see Why we don't use Keycloak's PDP for AGW today. If you accept the latency budget (Keycloak UMA ~5-30ms × tool calls per chat turn) and live with the per-resource pre-minting limitation, this is the cheapest path.
The argument for OpenFGA/OPA over Keycloak as a PDP (separate from "as an IdP") is essentially:
- Decisions in low single-digit milliseconds (~1-5ms per the comparison matrix) vs Keycloak's ~5-30ms.
- Reverse queries ("who can do X?") that Keycloak's UMA doesn't support natively.
- Decision-shaped sharding/replication patterns built for hot-path PDP work.
- Per-request variables (like mcp.tool.name) without pre-minting a Keycloak resource per tool.
If those don't matter for CAIPE's expected scale and policy complexity, Keycloak's PDP for management-plane checks plus a simpler gateway policy can still be defensible. This branch chose OpenFGA because the team/resource graph and access explanation UI are now first-class requirements.
Migration considerations (if you do introduce one)
These apply broadly to any remote PDP, but OpenFGA-flavored examples are given for concreteness.
1. Cache coherency between IdP and PDP
When admin operations write to Keycloak (via identity-service) and to the PDP, there's a window where the two disagree. Options:
- Synchronous dual-write: identity-service writes Keycloak then PDP in the same transaction. Failure on either rolls back. Simple but couples liveness.
- Event-driven sync: identity-service writes Keycloak, then enqueues a PDP-tuple-write event. PDP eventually consistent. Tolerates PDP outages but admins see "permission granted" before it actually takes effect.
- PDP-as-source-of-truth: identity-service writes only to PDP; Keycloak is reduced to identity-only (no realm roles for resources). Cleanest but requires policy code in CEL/AGW to no longer reference realm_access.roles.
For CAIPE, option 2 (event-driven sync) with a small staleness budget (≤2s) is the natural fit: admin operations are infrequent and the PDP is the load-bearing path, not Keycloak.
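The event-driven option and its staleness window can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding: plain sets play the role of Keycloak and the PDP, and a `queue.Queue` plays the role of whatever event bus identity-service would use.

```python
import queue


def grant_membership(keycloak_members: set, outbox: queue.Queue,
                     user: str, team: str) -> None:
    """Write to the IdP first, then enqueue the matching PDP tuple write."""
    keycloak_members.add((user, team))
    outbox.put((f"user:{user}", "member", f"team:{team}"))


def drain_outbox(outbox: queue.Queue, pdp_tuples: set) -> int:
    """Apply queued tuple writes to the PDP; returns how many were applied."""
    applied = 0
    while True:
        try:
            tuple_key = outbox.get_nowait()
        except queue.Empty:
            return applied
        pdp_tuples.add(tuple_key)
        applied += 1


keycloak_members: set = set()
pdp_tuples: set = set()
outbox: queue.Queue = queue.Queue()

grant_membership(keycloak_members, outbox, "alice", "platform")
expected = ("user:alice", "member", "team:platform")
stale = expected not in pdp_tuples   # the admin sees "granted", the PDP lags
applied = drain_outbox(outbox, pdp_tuples)
synced = expected in pdp_tuples      # consistent once the worker has run
```

The interval between `grant_membership` returning and `drain_outbox` running is the staleness budget; a real worker would also need retries and idempotent writes.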
2. AGW fast-path / slow-path
Don't overcomplicate the gateway hot path. Keep AgentGateway focused on JWT validation and one ext_authz decision:
extAuthz:
  host: openfga-authz-bridge:9100
  failureMode:
    denyWithStatus: 403
Admin bypasses and resource relationships should be modeled in OpenFGA, not in a second policy surface inside AgentGateway.
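On the other side of that ext_authz call, the bridge's decision logic can stay equally small. The following is a sketch of the idea, not the actual bridge: the `authorize` function, the `can_use` relation name, and the injected `check` callable are assumptions for illustration.

```python
from typing import Callable

# A check callable takes (user, relation, object) and returns allow/deny.
CheckFn = Callable[[str, str, str], bool]


def authorize(sub: str, tool_name: str, check: CheckFn) -> int:
    """Map one gateway request to an HTTP status for ext_authz."""
    if not sub or not tool_name:
        return 403  # missing identity or resource: deny
    try:
        allowed = check(f"user:{sub}", "can_use", f"tool:{tool_name}")
    except Exception:
        # Fail closed: a PDP error is treated as a deny, matching
        # denyWithStatus: 403 in the gateway config above.
        return 403
    return 200 if allowed else 403


def fake_check(user: str, relation: str, obj: str) -> bool:
    """Stand-in PDP that knows a single grant."""
    return (user, relation, obj) == ("user:alice", "can_use", "tool:jira_search")


def broken_check(user: str, relation: str, obj: str) -> bool:
    """Stand-in PDP that is unreachable."""
    raise ConnectionError("PDP unreachable")
```

Keeping the bridge to a single tuple lookup means admin bypasses and sharing rules change by writing tuples, not by redeploying the gateway.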
3. Decision caching
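PDP decisions are deterministic given the same inputs and the same tuple state. Cache (sub, active_team, tool_name) → decision for ~60s in AGW (or in the PDP itself). For CAIPE's call patterns this drops PDP load by ~95%. The TTL also bounds how long a revoked grant can keep being honored, so choose it together with the staleness budget above.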
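A decision cache keyed this way might look like the sketch below. The `DecisionCache` class and its injectable clock are hypothetical; a production cache would also need size bounds and an invalidation hook for tuple writes.

```python
import time
from typing import Callable


class DecisionCache:
    """TTL cache keyed by (sub, active_team, tool_name)."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict = {}  # key -> (decision, expires_at)

    def get_or_check(self, sub: str, active_team: str, tool_name: str,
                     check: Callable[[str, str, str], bool]) -> bool:
        key = (sub, active_team, tool_name)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit: skip the PDP round trip
        decision = check(sub, active_team, tool_name)
        self._entries[key] = (decision, now + self._ttl)
        return decision
```

Both allow and deny decisions are cached, so a newly granted permission can take up to one TTL to become visible, just as a revoked one can take up to one TTL to disappear.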
4. Migration ordering
Recommended order — each step independently shippable:
- Add OpenFGA and tuple writers.
- Add AgentGateway ext_authz pointing at the OpenFGA bridge.
- Flip AGW to enforce the OpenFGA decision.
- Delete the AG MCP policy CRUD surface and the Mongo-backed config bridge.
This staging means you can abort at any point and roll back without data loss.
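For the first step, a tuple writer could start by translating existing role strings into tuples. This sketch assumes the `team_member:<slug>` role format mentioned earlier and a hypothetical `roles_to_tuples` helper; whether CAIPE's tuple writers actually read realm roles is not stated in this note.

```python
def roles_to_tuples(user_id: str, roles: list[str]) -> list[tuple[str, str, str]]:
    """Translate team_member:<team> role strings into membership tuples."""
    tuples = []
    for role in roles:
        prefix, sep, team = role.partition(":")
        # Skip roles that are not team memberships or have an empty team part.
        if prefix == "team_member" and sep and team:
            tuples.append((f"user:{user_id}", "member", f"team:{team}"))
    return tuples


# Hypothetical role list for illustration.
example = roles_to_tuples(
    "alice", ["team_member:platform", "offline_access", "team_member:"]
)
```

Because the translation is a pure function of the role list, it can be re-run safely if an earlier step has to be rolled back and repeated.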
5. What stays in Keycloak no matter what
Even if you adopt a PDP, Keycloak still owns:
- JWT issuance and signing
- OIDC login flows
- User/identity management (JIT, federated identities)
- Token-exchange (OBO)
- IdP brokering (Duo SSO, Cisco SSO, etc.)
The data-plane PDP takes over only gateway authorization decisions. Keycloak's AuthZ Services (the UMA-based PDP) was never on the AGW hot path and keeps serving management-plane checks, so adoption is additive, not destructive.
What this doc deliberately doesn't decide
- Whether to introduce a PDP at all. That's a roadmap decision; this doc only enumerates the options if/when you do.
- Whether to build identity-service. Tracked separately (TODO: write feasibility-authz-service.md).
- Vendor selection if a commercial PDP is chosen. Cerbos, Permit.io, Aserto, and Topaz all need product-level evaluation that exceeds this doc's scope.
- The model.fga file for CAIPE's full rule set. Once a decision is made, that becomes a spec deliverable.
References
- AgentGateway External Authorization docs
- Envoy External Authorization filter
- OpenFGA — Authorization Concepts
- Google Zanzibar paper (2019)
- OPA — Envoy ext_authz integration
- Cedar — Language reference
- keycloak-openfga-event-publisher SPI — sync Keycloak roles to OpenFGA tuples via event listener
- Spec 093 research doc — original architecture exploration that identified ext_authz as a future direction