Spec: LangGraph Persistence β Checkpointer & Cross-Thread Store
Overviewβ
Implement full LangGraph persistence with two complementary layers: (1) per-thread Checkpointer for saving conversation state (messages) within a thread, including auto-trimming and orphan repair; and (2) cross-thread Store for long-term user-scoped memory (summaries, facts, preferences) that persists across threads. Additionally, integrate LangMem for automatic fact extraction from conversations.
Motivationβ
Without persistence, conversation state is lost on restart, threads are isolated, and agents have no memory of prior interactions. This spec covers:
- Checkpointer: Enables multi-turn conversation continuity within a thread. Without it, the agent forgets the conversation after each message.
- Cross-thread Store: When a user starts a new conversation, the agent can recall context from prior threads β no need to re-explain preferences, project details, or environment.
- Fact extraction: Actively extracts and persists facts from conversations so the agent proactively remembers details, even from short conversations that never trigger context compression.
Scopeβ
In Scopeβ
- Checkpointer (per-thread state persistence):
- InMemorySaver (default), with InMemorySaver disabled when
LANGGRAPH_DEVis set - Checkpointer wired into graph compilation for deep agent, GitHub, GitLab, Slack, AWS, Splunk agents
_trim_messages_if_neededauto-compression when context exceeds token limit_find_safe_split_indexrespects tool-call/tool-result boundaries during trimming- Repair fallback: resets thread state via
aupdate_statewhen orphan repair fails - Thread isolation: different
thread_idvalues produce isolated state context_idβthread_idmapping for A2A protocol
- InMemorySaver (default), with InMemorySaver disabled when
- Cross-thread Store (user-scoped long-term memory):
- Store factory with InMemoryStore (default), Redis, and Postgres backends
- Wiring the store through
deepagentsgraph compilation - Saving LangMem compression summaries to the store for cross-thread access
- Retrieving cross-thread context (summaries + memories) when starting new threads
- Propagating user identity from JWT middleware into agent config
- Automatic fact extraction from conversations using LangMem's
create_memory_store_manager - Environment variable configuration
- Unit tests for all layers
Out of Scopeβ
- AsyncRedisSaver / AsyncPostgresSaver checkpointer backends (future β infrastructure not yet wired)
- Explicit "remember this" / "forget this" user commands (future)
- Store-based RAG / semantic search over memories (future)
- Admin UI for viewing/managing stored memories
Designβ
Architectureβ
LangGraph persistence has two independent layers:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Layer 1: Checkpointer (per-thread state) β
β βββββββββββββββββββββββββββββββββββββββββββββ β
β Scope: thread_id β
β Stores: Raw messages (Human, AI, Tool, System) β
β Backends: InMemorySaver (default) β
β Features: β
β β’ Multi-turn conversation continuity β
β β’ Auto-trim when context exceeds token limit β
β β’ Safe split: respects tool-call/result pairs β
β β’ Orphan repair fallback: resets corrupted state β
β β’ Disabled when LANGGRAPH_DEV is set β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Layer 2: Store (cross-thread user memory) β
β βββββββββββββββββββββββββββββββββββββββββββββ β
β Scope: user_id (across all threads) β
β Stores: User memories, conversation summaries β
β Backends: InMemoryStore (default), Redis, Postgres β
β Features: β
β β’ Cross-thread recall on new conversations β
β β’ LangMem summary persistence after compression β
β β’ Automatic fact extraction (opt-in) β
β β’ User isolation: each user has own namespace β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Data Modelβ
Checkpointer:
thread_id -> [HumanMessage, AIMessage, ToolMessage, SystemMessage, ...]
Store Namespaces:
("memories", <user_id>) -> {key: uuid, value: {"data": "...", "source_thread": "...", "timestamp": ...}}
("summaries", <user_id>) -> {key: uuid, value: {"summary": "...", "thread_id": "...", "timestamp": ...}}
Components Affectedβ
- Multi-Agents (
ai_platform_engineering/multi_agents/) - deep_agent.py (checkpointer + store wiring), agent.py (repair fallback, fact extraction), agent_executor.py (user_id propagation), agent_registry.py - Utils (
ai_platform_engineering/utils/) - store.py, agent_memory/fact_extraction.py, a2a_common/langmem_utils.py, a2a_common/base_langgraph_agent.py (trim, safe split, checkpointer) - Deepagents (
deepagents/) - graph.py (store parameter) - Agents (
ai_platform_engineering/agents/) - GitHub, GitLab, Slack, AWS, Splunk all wire checkpointers via base_langgraph_agent - Documentation (
docs/) - MCP Servers
- Knowledge Bases (
ai_platform_engineering/knowledge_bases/) - UI (
ui/) - Helm Charts (
charts/)
Acceptance Criteriaβ
Checkpointer (per-thread persistence)β
- InMemorySaver attached to deep agent by default
- Checkpointer disabled when
LANGGRAPH_DEVenv var is set - Thread isolation: different thread_ids produce independent state
- Same-thread accumulation: messages persist across invocations
-
_trim_messages_if_neededtrims old messages when context exceeds token limit -
_find_safe_split_indexnever orphans tool-call/tool-result pairs - System messages preserved during trimming
- Repair fallback: adds reset message via
aupdate_statewhen orphan repair fails - Repair fallback skipped when checkpointer is None or thread_id is absent
-
context_idcorrectly mapped tothread_idin stream config - Individual agents (GitHub, GitLab, Slack, AWS, Splunk) wire checkpointers
- Graph compiles correctly with, without, and with None checkpointer
- Checkpoint tests pass (49 tests)
Cross-thread Store (user-scoped memory)β
- Store factory creates InMemoryStore by default
- Store factory supports Redis and Postgres via env vars
- Store is wired into graph compilation via deepagents
- User identity flows from JWT middleware to agent config
- LangMem summaries are saved to store after compression
- New threads retrieve cross-thread summaries/memories
- Graceful fallback when store is unavailable
- Store unit tests pass (86 tests)
Automatic Fact Extractionβ
- Fact extraction runs in background after each response (when enabled)
- Controlled by
ENABLE_FACT_EXTRACTIONenv var (default false) - Extracted facts persisted to ("memories", user_id) namespace via MemoryStoreManager
- Fact extraction unit tests pass (65 tests)
Documentation & Overallβ
- Documentation updated (ADR + env vars)
- All 289 persistence-related unit tests pass
Implementation Planβ
Phase 1: Checkpointer (per-thread persistence)β
- InMemorySaver wired into deep_agent.py with
LANGGRAPH_DEVtoggle - base_langgraph_agent.py uses
MemorySaverfor individual agent graphs -
_trim_messages_if_neededauto-compression with_find_safe_split_indexboundary safety - Repair fallback in agent.py when orphan repair fails (checks checkpointer presence)
- context_id β thread_id mapping and user_id/trace_id metadata propagation
Phase 2: Cross-Thread Store Infrastructureβ
- Create store factory (
ai_platform_engineering/utils/store.py) - Add
storeparameter to deepagents graph builder - Wire store into deep_agent.py
Phase 3: Cross-Thread Data Flowβ
- Propagate user_id from JWT middleware through executor to agent
- Save LangMem summaries to store
- Retrieve cross-thread context on new threads
Phase 4: Configuration & Testsβ
- Update .env.example and docker-compose.dev.yaml
- Write unit tests for store (86 tests)
- Create ADR for cross-thread store
Phase 5: Automatic Fact Extractionβ
- Create
ai_platform_engineering/utils/agent_memory/fact_extraction.pywith LangMemcreate_memory_store_managerintegration - Add background
asyncio.create_task()inagent.pystream() to extract facts after response - Add
ENABLE_FACT_EXTRACTIONandFACT_EXTRACTION_MODELenv vars - Verify
store_get_cross_thread_contexthandles MemoryStoreManager output format - Write unit tests for fact extraction (65 tests)
- Create ADR for automatic fact extraction decision
Phase 6: Checkpoint Testingβ
- Write comprehensive checkpoint tests (49 tests) covering:
- InMemorySaver lifecycle and thread isolation
- State round-trip (Human, AI, System, Unicode messages)
_find_safe_split_indexboundary safety with tool-call pairs_trim_messages_if_neededall branches (disabled, no state, under limit, over limit, system preserved)- Repair fallback with/without checkpointer/thread_id, error handling
- Concurrent checkpoint access (10 threads write, 10 concurrent reads)
- Graph compilation variants (with, without, None checkpointer)
- Agent checkpointer wiring verification (source inspection)
- Edge cases (long thread IDs, special chars, 50-message accumulation)
Testing Strategyβ
Unit Tests (289 total)β
| Test File | Count | Coverage |
|---|---|---|
tests/test_checkpoint.py | 49 | InMemorySaver lifecycle, thread isolation, state round-trip, _find_safe_split_index, _trim_messages_if_needed (all branches), repair fallback, context_idβthread_id, concurrent access, graph compilation, agent wiring, edge cases |
tests/test_store.py | 86 | Store factory, put memory/summary, cross-thread retrieval, user isolation, LangMem integration, user_id extraction/propagation, InMemoryStore integration, lazy Postgres |
tests/test_fact_extraction.py | 65 | Feature flag, config builder, extraction model, extractor creation/caching, extract-and-store, store compatibility, agent integration, edge cases |
tests/test_persistence_unit.py | 89 | _extract_tool_call_ids, _find_safe_summarization_boundary, summarize_messages, _fallback_summarize, preflight_context_check, _repair_orphaned_tool_calls, stream config wiring, deep_agent checkpointer wiring |
Integration Testsβ
integration/test_fact_extraction_live.py-- Seeds facts via multi-turn conversation, waits for background extraction, verifies recall on a new thread, and checks user isolationintegration/test_persistence_features.py-- End-to-end thread persistence, recall, isolation, multi-turn via A2A HTTP API
Manual verificationβ
- Multi-thread conversation with memory recall
Rollout Planβ
- Merge with InMemoryStore default (no infrastructure changes needed)
- Teams can opt-in to Redis/Postgres store via env vars
- Future: semantic search over memories, explicit remember/forget commands
Relatedβ
- ADR: 2026-02-26-cross-thread-langgraph-store
- PR: #861 (LangGraph Redis persistence)