AgentOps: Operations and Deployment Guide
Definition of AgentOps
AgentOps covers the lifecycle management and operationalization of agentic AI systems. Its core building blocks include:
- Agent Registry: A repository for managing agent versions, configurations, and artifact provenance to ensure traceability and reproducibility.
- Prompt Library: A versioned artifact library for managing and evaluating prompts and their effectiveness.
- MCP Registry: A repository for managing MCP server versions, configurations, and artifact provenance to maintain accountability and integrity.
Overview
This document describes the comprehensive AgentOps processes we follow for building, testing, deploying, and operating the AI Platform Engineering system. It covers GitHub Actions CI/CD, sanity checks, evaluations, Helm charts, and Kubernetes deployments.
Architecture Philosophy
CAIPE AgentOps follows a microservice deployment architecture for distributed agents. This architecture enables:
- Independent Scaling: Each agent can be scaled independently based on workload
- Isolated Deployments: Agents are deployed as separate microservices with their own containers
- Fault Isolation: Failures in one agent don't cascade to others
- Technology Flexibility: Different agents can use different runtime configurations
- Resource Optimization: Allocate resources per agent based on actual needs
- Independent Lifecycle: Update, rollback, or replace agents without affecting the entire system
- Container Registry Hosting: Agents and MCP server containers are hosted in container registries like GitHub Container Registry (GHCR) for version control, distribution, and deployment
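To make this layout concrete, here is a minimal, hypothetical Docker Compose sketch with the supervisor and each agent as separate services pulling their own GHCR images (service names and ports are illustrative, not the project's actual compose file):

services:
  supervisor:
    image: ghcr.io/cnoe-io/ai-platform-engineering:stable
    ports:
      - "8000:8000"   # A2A endpoint exposed by the orchestrator
  agent-github:
    image: ghcr.io/cnoe-io/agent-github:stable
  agent-argocd:
    image: ghcr.io/cnoe-io/agent-argocd:stable

Each service can be scaled, updated, or rolled back on its own, which is the property the bullets above describe.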
AgentOps Methodology
The AI Platform Engineering project follows a comprehensive AgentOps methodology that ensures:
- Automated Testing: Quick and detailed sanity checks validate agent functionality
- Continuous Integration: GitHub Actions workflows automate builds, tests, and deployments
- Quality Assurance: Evaluation frameworks validate agent routing and tool usage
- Infrastructure as Code: Helm charts enable reproducible Kubernetes deployments
- Observability: Distributed tracing and logging provide visibility into agent operations
- Microservice Architecture: Distributed agents deployed as independent services
Table of Contents
- GitHub Actions CI/CD
- Sanity Checks
- Evaluations
- Helm Charts
- Kubernetes Deployments
- Monitoring and Observability
- Best Practices
GitHub Actions CI/CD
Architecture and Implementation
CAIPE uses GitHub Actions for comprehensive CI/CD automation, implementing a workflow system that provides:
- Change Detection: Intelligently detects which agents/components changed to build only what's necessary
- Matrix Builds: Parallel builds across multiple agents and platforms for efficiency
- Security Hardening: Uses step-security/harden-runner to secure workflow execution
- Multi-Platform Support: Builds for both linux/amd64 and linux/arm64 architectures
- Caching: Leverages GitHub Actions cache for Docker layers to speed up builds
- Conditional Execution: Smart conditional logic to skip unnecessary builds
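As a rough sketch of how these features combine in a single workflow (job names and filters here are illustrative; the real logic lives in the ci-*.yml workflows described below):

name: ci-example-agent
on:
  push:
    branches: [main]
  pull_request:

jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      github: ${{ steps.filter.outputs.github }}
    steps:
      - uses: step-security/harden-runner@v2   # security hardening
        with:
          egress-policy: audit
      - uses: actions/checkout@v4
      - id: filter                             # change detection
        uses: dorny/paths-filter@v3
        with:
          filters: |
            github:
              - 'ai_platform_engineering/agents/github/**'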
Workflow Overview
Our CI/CD pipeline consists of multiple GitHub Actions workflows that run automatically on code changes:
1. Quick Sanity Integration Tests
Workflow: tests-quick-sanity-integration-dev.yml
Purpose: Fast integration tests that validate core agent functionality
Triggers:
- Push to main branch
- Manual trigger via workflow_dispatch
Execution:
- Runs on the dedicated caipe-integration-tests runner
- Uses docker-compose.dev.yaml with the p2p profile
- Executes make quick-sanity which runs test_prompts_quick_sanity.yaml
Test Configuration: See integration/test_prompts_quick_sanity.yaml
# integration/test_prompts_quick_sanity.yaml
prompts:
  - id: "quick_test_1"
    messages:
      - role: "user"
        content: "Test prompt"
    expected_keywords: ["keyword1", "keyword2"]
    category: "quick_sanity"
Key Steps:
- Cleanup: Remove Python cache files from previous runs
- Workspace Setup: Ensure directory exists with correct permissions
- Checkout: Clone repository code
- Secret Management: Create .env from GitHub Secrets (masked in logs)
- Docker Setup: Verify Docker and Docker Compose versions
- Python Setup: Install Python 3.13 for A2A client tests
- Service Startup: Start services with docker compose -f docker-compose.dev.yaml --profile=p2p up -d
- Log Streaming: Stream logs in background to file and console
- Readiness Check: Wait up to 3 minutes (36 retries × 5s) for service health
- Test Execution: Run make quick-sanity which executes test_prompts_quick_sanity.yaml
- Artifact Upload: Upload logs on failure for debugging
- Cleanup: Stop containers, remove volumes, clean Docker images
Implementation Details:
Dedicated Runner:
- Uses the caipe-integration-tests self-hosted runner
- Provides a persistent Docker daemon
- Faster startup times compared to GitHub-hosted runners
Readiness Check:
# Checks both health endpoints
curl -sfS http://localhost:8000/ >/dev/null || \
curl -sfS http://localhost:8000/.well-known/agent.json >/dev/null
Log Management:
- Logs streamed to the compose-live.log file
- A background process lets tests run while logs are captured
- Last 300 lines shown on failure
- Full logs uploaded as artifact
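A workflow step implementing this pattern might look roughly like the following sketch (not the exact step from the workflow file):

- name: Stream compose logs in background
  run: |
    # Background the stream so subsequent test steps run while logs accumulate
    docker compose -f docker-compose.dev.yaml logs -f --no-color > compose-live.log 2>&1 &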
Environment Setup:
- Creates a .env file with secrets from GitHub Actions secrets:
  - LLM provider credentials (Azure OpenAI)
  - Agent API tokens (ArgoCD, Backstage, Atlassian, GitHub, PagerDuty, Splunk, Komodor, Slack)
  - A2A transport configuration (A2A_TRANSPORT=p2p)
  - Tracing disabled (ENABLE_TRACING=false)
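A hedged sketch of such a step, with assumed secret names (GitHub Actions automatically masks secrets.* values in logs):

- name: Create .env from GitHub Secrets
  run: |
    # Secret names below are assumed for illustration
    cat > .env <<EOF
    AZURE_OPENAI_API_KEY=${{ secrets.AZURE_OPENAI_API_KEY }}
    A2A_TRANSPORT=p2p
    ENABLE_TRACING=false
    EOF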
2. Detailed Sanity Integration Tests
Workflow: tests-detailed-sanity-integration.yml
Purpose: Comprehensive integration tests with verbose output
Triggers:
- Manual trigger via workflow_dispatch only
Execution:
- Runs on the ubuntu-latest runner
- Uses the production docker-compose.yaml with the stable image tag
- Executes make detailed-test which runs test_prompts_detailed.yaml
Differences from Quick Sanity:
- More comprehensive test coverage
- Uses production Docker Compose configuration
- Verbose logging enabled (log_cli_level=INFO)
- Longer test execution time
3. Tag-Based Sanity Tests
Workflows:
- tests-quick-sanity-integration-on-latest-tag.yml - Tests against the latest tag
- tests-quick-sanity-integration-on-stable-tag.yml - Tests against the stable tag
Purpose: Validate that published container images work correctly
Triggers:
- Push to main branch
- Scheduled (nightly at 2 AM UTC for the latest tag)
- Manual trigger
Execution:
- Pulls specific image tags from GitHub Container Registry
- Runs same quick sanity tests against tagged images
- Ensures published images maintain quality
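The trigger block for the latest-tag workflow presumably combines all three triggers along these lines (the cron expression matches the nightly 2 AM UTC schedule; the rest is an assumption):

on:
  push:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"   # nightly at 2 AM UTC
  workflow_dispatch: {}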
4. Helm Chart Testing
Workflow: helm-chart-test.yml
Purpose: Validate Helm chart templates and configurations
Triggers:
- Push to main or develop branches (when helm/** changes)
- Pull requests (when helm/** changes)
- Scheduled (daily at 2 AM UTC)
Test Matrix:
- Tests against multiple Helm versions: v3.18.2, v3.17.3
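In workflow terms this is a one-dimensional matrix, roughly as follows (the setup action shown is an assumption):

strategy:
  matrix:
    helm-version: [v3.18.2, v3.17.3]
steps:
  - uses: azure/setup-helm@v4
    with:
      version: ${{ matrix.helm-version }}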
Test Scenarios:
- Chart Validation: Lint and package charts
- Dependency Management: Verify external dependencies (Milvus, Neo4j, External Secrets)
- Installation Tests: Dry-run installations with various configurations
- Resource Configuration: Test with custom CPU/memory limits
- Ingress Configuration: Test ingress setups
- Security Contexts: Validate security settings
- Storage Classes: Test persistent volume configurations
- Multi-Agent Configurations: Test different agent combinations
- SLIM Integration: Test SLIM transport configuration
- External Secrets: Validate external secrets integration
Example Test:
helm template test-all-services . \
--set ai-platform-engineering.enabled=true \
--set backstage-plugin-agent-forge.enabled=true \
--set kb-rag-stack.enabled=true \
--set graphrag.enabled=true
5. Helm Chart Publishing
Workflow: helm.yml
Purpose: Automatically publish Helm charts to GitHub Container Registry
Triggers:
- Push to main branch (when charts/** changes)
- Pull requests (when charts/** changes)
Process:
1. Version Bump Check (PR only):
Validates that chart versions are bumped when substantive changes are made:
# Detect which charts changed
CHART_CHANGES=$(git diff --name-only origin/$BASE_BRANCH...HEAD | grep "^charts/")

# Check whether changes are more than just Chart.lock
if echo "$CHART_CHANGES" | grep -qv "Chart.lock$"; then
  # Version bump required
  if [ "$CURRENT_VERSION" = "$BASE_VERSION" ]; then
    echo "❌ Error: Chart has changes but version was not bumped!"
    exit 1
  fi
fi
Logic:
- ✅ Skips version check for Chart.lock-only changes (dependency updates)
- ❌ Fails the PR on substantive changes without a version bump
- ✅ Allows new charts without version check
2. Chart Packaging:
# Update dependencies
helm dependency update charts/rag-stack/
helm dependency update charts/ai-platform-engineering/
# Package charts
helm package charts/rag-stack/ --destination ./packaged-charts/
helm package charts/ai-platform-engineering/ --destination ./packaged-charts/
Dependency Verification:
# Verify nested dependencies are included
tar -tzf charts/ai-platform-engineering/charts/rag-stack-*.tgz | \
  grep -q "^rag-stack/charts/neo4j/Chart.yaml"   # Must include neo4j
tar -tzf charts/ai-platform-engineering/charts/rag-stack-*.tgz | \
  grep -q "^rag-stack/charts/milvus/Chart.yaml"  # Must include milvus
3. Chart Publishing:
REGISTRY="oci://ghcr.io/${{ github.repository_owner }}/helm-charts"
# Check if version already exists
if helm pull $REGISTRY/$CHART_NAME --version $CHART_VERSION; then
  echo "⚠️ Version already exists, skipping"
else
  helm push "$CHART_FILE" $REGISTRY
fi
Publishing Logic:
- Only publishes on push to the main branch
- Skips if the version already exists (idempotent)
- Supports multiple charts: rag-stack, ai-platform-engineering
- Uses the OCI registry format (oci://)
Registry: oci://ghcr.io/cnoe-io/helm-charts
Installation:
# OCI registries are not added with `helm repo add`; install directly by OCI reference
helm install ai-platform-engineering oci://ghcr.io/cnoe-io/helm-charts/ai-platform-engineering
6. Agent Build Workflows
Workflows:
- ci-supervisor-agent.yml - Builds the supervisor agent
- ci-mcp-sub-agent.yml - Builds MCP sub-agents
- ci-a2a-sub-agent.yml - Builds A2A sub-agents
- ci-a2a-rag.yml - Builds RAG agents
- ci-agent-forge-plugin.yml - Builds the Backstage plugin
Purpose: Build and publish container images for individual agents and MCP servers
Container Registry: All agent and MCP server containers are hosted in GitHub Container Registry (GHCR) at ghcr.io/cnoe-io/
Implementation Details
1. Change Detection and Path Filtering
Workflows use intelligent change detection to build only affected components. See ci-a2a-sub-agent.yml for implementation:
# Example from ci-a2a-sub-agent.yml
- name: Detect changed paths
  id: filter
  uses: dorny/paths-filter@v3
  with:
    filters: |
      shared:
        - 'ai_platform_engineering/utils/a2a/**'
        - 'build/agents/Dockerfile.a2a'
      github:
        - 'ai_platform_engineering/agents/github/**'
        - '!ai_platform_engineering/agents/github/mcp/**'
Logic:
- Push to main or tags: Builds all agents
- Pull Requests: Only builds agents with changed files
- Shared changes: If shared utilities change, all agents are rebuilt
- Manual dispatch: Can force-build all agents with the build_all: true input
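Downstream build jobs then gate on the filter outputs; a sketch of that wiring (job and output names assumed):

build-github:
  needs: changes
  # Build on main or tags, or when the PR touched this agent's paths
  if: >-
    github.ref == 'refs/heads/main' ||
    startsWith(github.ref, 'refs/tags/') ||
    needs.changes.outputs.github == 'true'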
2. Matrix Build Strategy
Agents are built in parallel using GitHub Actions matrix:
strategy:
  matrix:
    agent: ${{ fromJson(needs.determine-agents.outputs.agents) }}
  fail-fast: false  # Continue building other agents if one fails
Supported Agents:
- A2A Agents: argocd, aws, backstage, confluence, github, jira, komodor, pagerduty, slack, splunk, template, webex, weather
- MCP Agents: argocd, backstage, confluence, jira, komodor, pagerduty, slack, splunk, webex
- RAG Components: agent-rag, agent-ontology, server, webui
Example: When ai_platform_engineering/agents/github/ changes, only agent-github and mcp-github images are built. When ai_platform_engineering/utils/a2a/ changes, all agents are rebuilt.
3. Security Hardening
All build workflows use step-security/harden-runner:
- name: 🔒 harden runner
  uses: step-security/harden-runner@95d9a5deda9de15063e7595e9719c11c38c90ae2
  with:
    egress-policy: audit  # Monitor outbound connections
Benefits:
- Prevents supply chain attacks
- Audits network egress
- Restricts unnecessary network access
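harden-runner also supports a stricter mode that blocks rather than audits egress; a sketch of how a workflow could opt in (the endpoint list is illustrative):

- uses: step-security/harden-runner@v2
  with:
    egress-policy: block        # deny-by-default instead of audit
    allowed-endpoints: >
      github.com:443
      ghcr.io:443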
4. Multi-Platform Builds
All images are built for both architectures:
- name: Set up QEMU
  uses: docker/setup-qemu-action@v3

- name: Build and Push Docker image
  uses: docker/build-push-action@v6
  with:
    platforms: linux/amd64,linux/arm64
Process:
- Docker Buildx: Sets up advanced build features
- QEMU: Enables cross-platform builds (ARM64 on AMD64 runners)
- Parallel Builds: Both architectures built simultaneously
- Manifest Lists: Creates multi-arch manifests automatically
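The Buildx setup step referenced above is typically a one-liner preceding the QEMU and build steps (a sketch):

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3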
5. Docker Layer Caching
Uses GitHub Actions cache for faster builds:
cache-from: type=gha # GitHub Actions cache
cache-to: type=gha,mode=max # Store all layers
Cache Strategy:
- Cache Key: Based on Dockerfile content and build context
- Cache Scope: Per workflow run, shared across matrix jobs
- Cache Mode: max stores all layers for maximum reuse
- Benefits: Reduces build time by 50-80% on subsequent runs
6. Image Tagging Strategy
Multiple tags are generated automatically:
tags: |
  type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
  type=ref,event=branch,prefix=
  type=ref,event=tag,prefix=
  type=sha,format=short,prefix=
Tag Examples:
- latest - Latest build from the main branch
- main - Branch name tag
- v1.2.3 - Semantic version tag
- abc1234 - Short SHA tag
- stable - Production release tag (manual)
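The tags block above uses docker/metadata-action syntax; a sketch of the surrounding step, with the image name parameterized on the build matrix (an assumption):

- name: Extract image metadata
  id: meta
  uses: docker/metadata-action@v5
  with:
    images: ghcr.io/cnoe-io/agent-${{ matrix.agent }}
    tags: |
      type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
      type=ref,event=branch
      type=ref,event=tag
      type=sha,format=short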
7. Conditional Push Logic
Images are only pushed on specific conditions:
push: ${{ github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/tags/') }}
Push Conditions:
- ✅ Push to main branch
- ✅ Push of tags (releases)
- ❌ Pull requests (build only, no push)
- ❌ Feature branches (unless manually triggered)
8. Dockerfile Resolution
Workflows support agent-specific Dockerfiles:
- name: Determine Dockerfile path
  id: dockerfile
  run: |
    # Write the resolved path to the step output so later steps can consume it
    if [ -f "${{ env.AGENT_DIR }}/build/Dockerfile.a2a" ]; then
      echo "path=${{ env.AGENT_DIR }}/build/Dockerfile.a2a" >> "$GITHUB_OUTPUT"
    else
      echo "path=build/agents/Dockerfile.a2a" >> "$GITHUB_OUTPUT"  # Fall back to the shared Dockerfile
    fi
Priority:
- Agent-specific Dockerfile: ai_platform_engineering/agents/{agent}/build/Dockerfile.{type} (e.g., ai_platform_engineering/agents/github/build/Dockerfile.a2a)
- Shared Dockerfile: build/agents/Dockerfile.{type} (e.g., build/agents/Dockerfile.a2a)
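The resolved path then feeds the build step through the step output (a sketch):

- name: Build and Push Docker image
  uses: docker/build-push-action@v6
  with:
    file: ${{ steps.dockerfile.outputs.path }}   # path resolved by the step above
    platforms: linux/amd64,linux/arm64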
9. Build Arguments
Agent-specific build arguments:
build-args: |
  AGENT_NAME=${{ matrix.agent }}
  AGENT_PACKAGE=${{ steps.agent_package.outputs.name }}
Special Cases:
- The template agent uses the petstore package name
- MCP agents use AGENT_NAME for MCP server configuration
- RAG components use component-specific build args
10. Pre-Release Workflows
Separate workflows for PR preview builds:
Workflows:
- pre-release-supervisor-agent.yaml
- pre-release-a2a-sub-agent.yaml
- pre-release-mcp-agent.yaml
- pre-release-a2a-rag.yml
- pre-release-agent-forge-plugin.yaml
Trigger: Pull requests with branch prefix prebuild/
Image Naming: ghcr.io/cnoe-io/prebuild/{component}:{pr-number}
Purpose: Preview images for testing before merge
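A sketch of how such a workflow could gate on the branch prefix (the guard condition is an assumption, not copied from the workflow files):

on:
  pull_request:

jobs:
  build-preview:
    # Assumed guard: only run for PRs from prebuild/* branches
    if: startsWith(github.head_ref, 'prebuild/')
    runs-on: ubuntu-latest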
Image Naming Conventions:
- Agents: ghcr.io/cnoe-io/agent-{name}:{tag} (e.g., ghcr.io/cnoe-io/agent-github:stable)
- MCP Servers: ghcr.io/cnoe-io/mcp-{name}:{tag} (e.g., ghcr.io/cnoe-io/mcp-argocd:latest)
- Supervisor: ghcr.io/cnoe-io/ai-platform-engineering:{tag}
- Backstage Plugin: ghcr.io/cnoe-io/backstage-plugin-agent-forge:{tag}
- RAG Components: ghcr.io/cnoe-io/caipe-rag-{component}:{tag} (e.g., ghcr.io/cnoe-io/caipe-rag-server:latest)
Example Build Output:
# After pushing to main branch
ghcr.io/cnoe-io/agent-github:latest
ghcr.io/cnoe-io/agent-github:main
ghcr.io/cnoe-io/agent-github:abc1234 # Short SHA
# After tagging v1.2.3
ghcr.io/cnoe-io/agent-github:v1.2.3
ghcr.io/cnoe-io/agent-github:1.2
ghcr.io/cnoe-io/agent-github:1
Registry Benefits:
- Version Control: Each container image is tagged with specific versions for reproducibility
- Distribution: Centralized registry enables easy deployment across environments
- Security: SBOM and attestations provide supply chain security
- Access Control: GitHub-based authentication and permissions
- Multi-Platform: Supports both AMD64 and ARM64 architectures
- Parallel Builds: Matrix strategy builds multiple agents simultaneously
- Smart Caching: Docker layer caching reduces build times significantly
Workflow Best Practices
- Secret Management: All secrets stored in GitHub Actions secrets, never hardcoded
- Artifact Retention: Logs and test results retained for debugging
- Cleanup: Always clean up Docker resources and workspace
- Failure Handling: Upload logs and artifacts on failure for debugging
- Parallel Execution: Independent workflows run in parallel for faster feedback
Sanity Checks
Quick Sanity
Command: make quick-sanity
Purpose: Fast validation of core agent functionality
Execution:
cd integration && A2A_PROMPTS_FILE=test_prompts_quick_sanity.yaml uv run pytest -o log_cli=true -o log_cli_level=DEBUG
Test File: integration/test_prompts_quick_sanity.yaml
Characteristics:
- Minimal test set (5-10 prompts)
- Fast execution (< 5 minutes)
- Validates basic routing and responses
- Used in CI/CD for quick feedback
Example:
# Run quick sanity locally
make quick-sanity
# Output shows:
# ✓ GitHub agent routes correctly
# ✓ ArgoCD agent responds to queries
# ✓ Multi-agent orchestration works
# All tests passed in 3m 42s
Detailed Sanity
Command: make detailed-sanity or make detailed-test
Purpose: Comprehensive validation of all agent capabilities
Execution:
cd integration && A2A_PROMPTS_FILE=test_prompts_detailed.yaml uv run pytest -o log_cli=true -o log_cli_level=INFO
Test File: integration/test_prompts_detailed.yaml
Characteristics:
- Comprehensive test coverage (20+ prompts)
- Tests all agent types (GitHub, ArgoCD, Jira, etc.)
- Validates complex multi-agent workflows
- Longer execution time (15-30 minutes)
- Used for pre-release validation
Test Categories:
- Single agent routing
- Multi-agent orchestration
- Tool usage validation
- Error handling
- Streaming responses
Example:
# Run detailed sanity
make detailed-sanity
# Tests 20+ scenarios including:
# - "Show GitHub repos and ArgoCD apps" → Parallel routing
# - "Who is on-call?" → Deep agent with PagerDuty + RAG
# - Error handling when agent unavailable
ArgoCD-Specific Sanity
Command: make argocd-sanity
Purpose: Validate ArgoCD agent functionality
Execution:
cd integration && A2A_PROMPTS_FILE=test_prompts_argocd_sanity.yaml uv run pytest -o log_cli=true -o log_cli_level=INFO
Test File: integration/test_prompts_argocd_sanity.yaml
Focus: ArgoCD-specific operations:
- Application listing
- Application status checks
- Resource queries
- Cluster information
Running Sanity Checks Locally
Prerequisites:
- Services running: docker compose -f docker-compose.dev.yaml --profile=p2p up -d
- Python 3.13+ with the uv package manager
- .env file configured with API credentials
Quick Sanity:
make quick-sanity
Detailed Sanity:
make detailed-sanity
ArgoCD Sanity:
make argocd-sanity
Adding New Sanity Tests
Method 1: YAML Configuration (Recommended)
Add to appropriate test file:
prompts:
  - id: "my_new_test"
    messages:
      - role: "user"
        content: "My test prompt"
    expected_keywords: ["keyword1", "keyword2"]
    category: "my_category"
Method 2: Python Test Function
Add to integration/integration_ai_platform_engineering.py:
async def test_my_new_functionality(self):
    """Test my new functionality"""
    response = await send_message_to_agent("my prompt")
    assert response is not None
    assert len(response) > 0
    # Add specific assertions
Evaluations
Overview
The evaluation system provides automated testing of Platform Engineer multi-agent workflows using Langfuse dataset evaluation capabilities.
Architecture
┌─────────────────┐
│ Langfuse UI │
│ Dashboard │
└────────┬────────┘
│
│ Trigger Evaluation
▼
┌─────────────────┐
│ Webhook Service│
│ langfuse_webhook│
└────────┬────────┘
│
│ Orchestrate
▼
┌─────────────────┐
│ Evaluation │
│ Runner │
└────────┬────────┘
│
┌────┴────┐
│ │
▼ ▼
┌────────┐ ┌──────────────┐
│ A2A │ │ Trace │
│ Client │ │ Extractor │
└───┬────┘ └──────┬───────┘
│ │
│ │
▼ ▼
┌─────────────────────────┐
│ Platform Engineer │
│ Multi-Agent System │
└─────────────┬───────────┘
│
│ Auto-trace
▼
┌─────────────────────────┐
│ Langfuse Server │
│ Trace Storage │
└─────────────┬───────────┘
│
│ Analyze
▼
┌─────────────────┐
│ Dual Evaluators │
├─────────────────┤
│ Routing │
│ Evaluator │
├─────────────────┤
│ Tool Match │
│ Evaluator │
└────────┬────────┘
│
│ Submit Scores
▼
┌─────────────────┐
│ Langfuse UI │
│ Results Display │
└─────────────────┘
Dual Evaluator System
The system uses two specialized evaluators:
- Routing Evaluator: Validates supervisor-to-agent routing decisions
  - Checks if the correct agent was selected
  - Validates routing logic
  - Scores: 0.0 - 1.0
- Tool Match Evaluator: Validates agent-to-tool usage patterns
  - Checks if the correct tools were used
  - Validates tool parameters
  - Scores: 0.0 - 1.0
Quick Start
1. Start the System:
docker compose -f docker-compose.dev.yaml --profile p2p-tracing up
2. Upload Dataset:
cd evals
python upload_dataset.py datasets/single_agent.yaml
3. Configure Webhook in Langfuse:
- Navigate to Langfuse UI: http://localhost:3000
- Go to Datasets → Select your dataset
- Click "Start Experiment" → "Custom Experiment" (⚡ button)
- Set webhook URL: http://evaluation-webhook:8000/evaluate
- Click "Run" to start the evaluation
4. Monitor Results:
- View evaluation progress in Langfuse dashboard
- Check individual trace scores and reasoning
- Analyze routing and tool usage patterns
Dataset Format
name: single_agent_tests
description: Single agent evaluation tests
prompts:
  - id: "github_repo_description"
    messages:
      - role: "user"
        content: "show repo description for ai-platform-engineering"
    expected_agents: ["github"]
    expected_behavior: "Should use GitHub agent to fetch repository description"
    expected_output: "The ai-platform-engineering repository is a platform engineering toolkit..."
Example: Upload and run evaluation:
# Upload dataset
cd evals
python upload_dataset.py datasets/single_agent.yaml
# Trigger evaluation via Langfuse UI
# Navigate to http://localhost:3000 → Datasets → Start Experiment
# Set webhook: http://evaluation-webhook:8000/evaluate
# Click "Run"
# Check results
curl http://localhost:8011/health
# Response shows evaluation status and scores
Evaluation Flow
- Dataset Upload: YAML datasets uploaded to Langfuse
- Webhook Trigger: Langfuse UI triggers evaluation via webhook
- Request Processing: Runner sends prompts to Platform Engineer via A2A
- Trace Analysis: Extract tool calls and agent interactions from traces
- Dual Evaluation:
- Route correctness (supervisor → agent)
- Tool alignment (agent → tool)
- Score Submission: Results submitted back to Langfuse with detailed reasoning
Environment Variables
# Langfuse configuration
export LANGFUSE_PUBLIC_KEY="pk-lf-your-key"
export LANGFUSE_SECRET_KEY="sk-lf-your-key"
export LANGFUSE_HOST="http://localhost:3000"
# Platform Engineer connection
export PLATFORM_ENGINEER_URL="http://localhost:8000"
# Optional: LLM evaluation (fallback uses pattern matching)
export OPENAI_API_KEY="your-openai-key"
Monitoring
Service Health:
curl http://localhost:8011/health
Response:
{
  "status": "healthy",
  "langfuse": "configured",
  "evaluators": ["routing", "tool_match"],
  "platform_engineer": "connected"
}
Helm Charts
Chart Structure
charts/
├── ai-platform-engineering/
│ ├── Chart.yaml
│ ├── values.yaml
│ ├── templates/
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ ├── ingress.yaml
│ │ ├── secret.yaml
│ │ └── ...
│ └── charts/
│ ├── agent/
│ ├── supervisor-agent/
│ └── backstage-plugin-agent-forge/
└── rag-stack/
├── Chart.yaml
├── values.yaml
└── charts/
├── agent-rag/
├── agent-ontology/
├── rag-server/
└── ...
Main Charts
1. AI Platform Engineering Chart
Location: charts/ai-platform-engineering/
Components:
- Supervisor agent (Platform Engineer)
- Individual agent deployments (GitHub, ArgoCD, Jira, etc.)
- MCP server deployments
- Backstage plugin agent-forge
Key Features:
- Microservice Architecture: Each agent deployed as independent Kubernetes deployment
- Multi-agent orchestration via supervisor agent
- Configurable prompt types (default, deep_agent, custom)
- External secrets support
- Ingress configuration per agent
- Horizontal Pod Autoscaling (HPA) per agent
- Resource limits and requests per agent
- Independent scaling and lifecycle management
2. RAG Stack Chart
Location: charts/rag-stack/
Components:
- RAG server
- RAG web UI
- RAG agents (rag, ontology)
- Redis (persistent)
- Neo4j (via dependency)
- Milvus (via dependency)
Dependencies:
- Neo4j Helm chart
- Milvus Helm chart
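In Helm terms these are declared as chart dependencies in Chart.yaml; a sketch (repository URLs and versions are illustrative, not taken from the chart):

# charts/rag-stack/Chart.yaml (sketch; pin concrete versions in practice)
dependencies:
  - name: neo4j
    version: "x.y.z"
    repository: "https://helm.neo4j.com/neo4j"
  - name: milvus
    version: "x.y.z"
    repository: "https://zilliztech.github.io/milvus-helm/"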
Prompt Configuration
The chart supports multiple prompt configurations with versioning:
Versioning Strategy
Prompt configuration files are versioned alongside the Helm chart:
- Chart Version: Tracked in Chart.yaml (version: 0.4.7)
- App Version: Tracked in Chart.yaml (appVersion: 0.2.1)
- ConfigMap Labels: Include version metadata for traceability:
labels:
  app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
Prompt Config Files:
- data/prompt_config.yaml - Default configuration
- data/prompt_config.deep_agent.yaml - Deep agent configuration
When the Helm chart version is bumped, prompt config changes are included in that chart version, ensuring:
- Reproducibility: Specific chart versions always use the same prompt config
- Traceability: ConfigMap labels include version information
- Rollback Safety: Downgrading chart version restores previous prompt config
Default Configuration
promptConfigName: "default" # Uses data/prompt_config.yaml
- Balanced orchestrator
- General platform engineering tasks
- Medium strictness
- Versioned with chart release
Deep Agent Configuration
promptConfigName: "deep_agent" # Uses data/prompt_config.deep_agent.yaml
- Strict zero-hallucination mode
- Mission-critical operations
- High strictness
- Versioned with chart release
Custom Configuration
promptConfig: |
  agent_name: "My Custom Platform Agent"
  agent_description: |
    Custom description...
  system_prompt_template: |
    Custom system prompt...
- Overrides versioned configs
- Not versioned by chart (managed separately)
- Use for specialized workflows
Versioned Config Selection
The Helm template selects config files based on promptConfigName:
# Template logic in templates/prompt-config.yaml
{{- $configName := .Values.promptConfigName | default "deep_agent" }}
{{ .Files.Get (printf "data/prompt_config.%s.yaml" $configName) | nindent 4 }}
Best Practice: Always specify promptConfigName explicitly in your values file to ensure consistent behavior across upgrades:
# values-secrets.yaml
promptConfigName: "deep_agent" # Explicitly set for production
Example: Deploying with default prompt config:
helm install ai-platform-engineering . \
--values values-secrets.yaml \
--set promptConfigName=default
Example: Deploying with deep agent prompt config:
helm install ai-platform-engineering . \
--values values-secrets.yaml \
--set promptConfigName=deep_agent
Deployment Options
Option 1: Simple Deployment (Port-Forward)
helm install ai-platform-engineering . --values values-secrets.yaml
Access:
kubectl port-forward service/ai-platform-engineering-agent-github 8001:8000
Option 2: Ingress Deployment
helm install ai-platform-engineering . \
--values values-secrets.yaml \
--values values-ingress.yaml
Configure DNS:
echo "$(minikube ip) agent-github.local" | sudo tee -a /etc/hosts
Secret Management
Option 1: Direct Secrets
cp values-secrets.yaml.example values-secrets.yaml
# Edit with your values
helm install ai-platform-engineering . --values values-secrets.yaml
Option 2: Existing Kubernetes Secrets
agent-argocd:
  secrets:
    secretName: "my-existing-secret"
Option 3: External Secrets (Recommended)
cp values-external-secrets.yaml.example values-external-secrets.yaml
# Configure external secrets store
helm install ai-platform-engineering . --values values-external-secrets.yaml
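With the External Secrets Operator, the chart consumes a Kubernetes Secret materialized from an external store by an ExternalSecret resource; a sketch with assumed store and key names:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: agent-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: my-secret-store        # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: agent-secrets          # resulting Kubernetes Secret
  data:
    - secretKey: GITHUB_PERSONAL_ACCESS_TOKEN
      remoteRef:
        key: caipe/github        # assumed path in the external store
        property: token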
Chart Testing
Local Testing:
helm template test . --values values-secrets.yaml
Dry-Run Installation:
helm install --dry-run --debug ai-platform-engineering . --values values-secrets.yaml
Chart Linting:
helm lint charts/ai-platform-engineering/
See charts/ai-platform-engineering/ and charts/rag-stack/ for chart source code.
Chart Publishing
Charts are automatically published to GitHub Container Registry on merge to main:
Registry: oci://ghcr.io/cnoe-io/helm-charts
Installation:
# OCI registries are not added with `helm repo add`; install directly by OCI reference
helm install ai-platform-engineering oci://ghcr.io/cnoe-io/helm-charts/ai-platform-engineering
Kubernetes Deployments
Microservice Architecture
CAIPE follows a microservice deployment architecture where each agent is deployed as an independent Kubernetes service:
- Independent Deployments: Each agent (GitHub, ArgoCD, Jira, etc.) runs in its own pod/deployment
- Service Isolation: Agents communicate via A2A protocol, not direct dependencies
- Individual Scaling: Scale agents independently based on workload (e.g., scale GitHub agent separately from ArgoCD)
- Resource Allocation: Set CPU/memory limits per agent based on actual usage patterns
- Health Checks: Each agent has its own liveness and readiness probes
- Rolling Updates: Update individual agents without affecting others
- Fault Tolerance: Agent failures are isolated and don't cascade
Example Architecture:
┌─────────────────────────────────────┐
│ Supervisor Agent (Orchestrator) │
│ - Routes requests to agents │
│ - Manages multi-agent workflows │
└──────────────┬──────────────────────┘
│
┌───────┴────────┐
│ A2A Protocol │
└───────┬────────┘
│
┌──────────┼──────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ GitHub │ │ ArgoCD │ │ Jira │
│ Agent │ │ Agent │ │ Agent │
│ Pod │ │ Pod │ │ Pod │
└────────┘ └────────┘ └────────┘
│ │ │
└──────────┼──────────┘
│
┌──────┴──────┐
│ MCP Servers │
│ (Separate) │
└─────────────┘
Prerequisites
- Kubernetes cluster (1.24+)
- Helm 3.14+
- kubectl configured
- Container Registry Access: Access to GitHub Container Registry (ghcr.io/cnoe-io/), where all agent and MCP server containers are hosted
  - Authenticate: echo $GITHUB_TOKEN | docker login ghcr.io -u USERNAME --password-stdin
  - Pull images: docker pull ghcr.io/cnoe-io/agent-github:stable
Deployment Steps
1. Configure Secrets
Create values-secrets.yaml:
global:
  imageRegistry: ghcr.io/cnoe-io

ai-platform-engineering:
  enabled: true
  image:
    repository: ai-platform-engineering
    tag: stable

agent-github:
  enabled: true
  secrets:
    secretName: agent-secrets
2. Create Kubernetes Secrets
kubectl create secret generic agent-secrets \
--from-literal=GITHUB_PERSONAL_ACCESS_TOKEN=your-token \
--from-literal=AZURE_OPENAI_API_KEY=your-key
3. Install Chart
helm install ai-platform-engineering . \
--namespace ai-platform \
--create-namespace \
--values values-secrets.yaml
4. Verify Deployment
kubectl get pods -n ai-platform
# Expected output:
# NAME READY STATUS RESTARTS AGE
# ai-platform-engineering-supervisor-0 1/1 Running 0 2m
# agent-github-7d8f9c4b5-abc12 1/1 Running 0 2m
# agent-argocd-6c7e8d9a0-def34 1/1 Running 0 2m
kubectl get services -n ai-platform
kubectl get ingress -n ai-platform
Example: Check specific agent logs:
kubectl logs -n ai-platform deployment/agent-github --tail=50
Resource Management
Resource Limits:
agent-github:
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 1Gi
Horizontal Pod Autoscaling:
agent-github:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 5
    targetCPUUtilizationPercentage: 80
High Availability
Multi-Replica Deployment:
agent-github:
  replicaCount: 3
  podDisruptionBudget:
    enabled: true
    minAvailable: 2
Node Affinity:
agent-github:
  nodeSelector:
    kubernetes.io/os: linux
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - agent-github
            topologyKey: kubernetes.io/hostname
Persistent Storage
Redis Persistence:
rag-redis:
  persistence:
    enabled: true
    storageClass: fast-ssd
    size: 10Gi
Network Policies
Ingress Configuration:
agent-github:
  ingress:
    enabled: true
    className: nginx
    hosts:
      - host: agent-github.example.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: agent-github-tls
        hosts:
          - agent-github.example.com
Monitoring Integration
ServiceMonitor for Prometheus:
agent-github:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
Troubleshooting
View Pod Logs:
kubectl logs -n ai-platform deployment/agent-github -f
Describe Pod:
kubectl describe pod -n ai-platform agent-github-xxx
Check Events:
kubectl get events -n ai-platform --sort-by='.lastTimestamp'
Port Forward for Debugging:
kubectl port-forward -n ai-platform service/agent-github 8001:8000
# Test agent at http://localhost:8001
Example: Debugging a failing agent:
# Check pod status
kubectl get pods -n ai-platform | grep agent-github
# Output: agent-github-xxx 0/1 CrashLoopBackOff 3 2m
# Check logs for errors
kubectl logs -n ai-platform agent-github-xxx --previous
# Check events for resource issues
kubectl describe pod -n ai-platform agent-github-xxx | grep -A 5 Events
# Common issues:
# - ImagePullBackOff: Check registry access
# - CrashLoopBackOff: Check application logs
# - OOMKilled: Increase memory limits
Monitoring and Observability
Distributed Tracing
Langfuse Integration:
- Automatic trace collection
- Tool call tracking
- Agent interaction visualization
- Performance metrics
Configuration:
agent-github:
  environment:
    - name: ENABLE_TRACING
      value: "true"
    - name: LANGFUSE_PUBLIC_KEY
      valueFrom:
        secretKeyRef:
          name: langfuse-secrets
          key: public-key
    - name: LANGFUSE_SECRET_KEY
      valueFrom:
        secretKeyRef:
          name: langfuse-secrets
          key: secret-key
    - name: LANGFUSE_HOST
      value: "http://langfuse-web:3000"
Logging
Log Levels:
- DEBUG: Detailed debugging information
- INFO: General operational information
- WARNING: Warning messages
- ERROR: Error conditions
Log Aggregation:
- Structured JSON logging
- Centralized log collection (via Fluentd/Fluent Bit)
- Log retention policies
Metrics
Key Metrics:
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Agent routing decisions
- Tool usage patterns
Prometheus Integration:
agent-github:
  serviceMonitor:
    enabled: true
    path: /metrics
    port: http
Health Checks
Liveness Probe:
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
Readiness Probe:
readinessProbe:
  httpGet:
    path: /.well-known/agent.json
    port: http
  initialDelaySeconds: 10
  periodSeconds: 5
Best Practices
Development
- Run Quick Sanity Before Committing: make quick-sanity
- Run Detailed Sanity Before Release: make detailed-sanity
- Test Helm Charts Locally: helm template test . --values values-secrets.yaml
- Validate Kubernetes Manifests: kubectl apply --dry-run=client -f manifests/
CI/CD
- Version Bumps: Always bump chart versions for substantive changes
- Secret Management: Never commit secrets, use GitHub Actions secrets
- Artifact Retention: Keep logs and test results for debugging
- Parallel Execution: Use workflow dependencies for parallel runs
Deployment
- Staging First: Deploy to staging before production
- Gradual Rollout: Use canary deployments for major changes
- Rollback Plan: Always have a rollback strategy
- Monitoring: Set up alerts before deployment
Operations
- Resource Limits: Always set resource limits and requests
- Health Checks: Configure liveness and readiness probes
- Logging: Enable structured logging and centralized collection
- Tracing: Enable distributed tracing for production workloads
Summary
This AgentOps guide provides a comprehensive overview of:
- CI/CD: Automated testing and deployment via GitHub Actions
- Sanity Checks: Quick and detailed validation of agent functionality
- Evaluations: Automated assessment of agent routing and tool usage
- Helm Charts: Infrastructure as code for Kubernetes deployments
- Kubernetes: Production-ready deployment configurations
Following these practices ensures reliable, scalable, and maintainable AI Platform Engineering deployments.