Decision Log (ADRs)
As-of: 2026-05-09. Source: code archaeology across passkey-shell, v2-spec-validation.md, vault/architecture-sketch-v2.md, migration history, and service implementations.
ADR-001: Split IKeeperService into IVaultReadService + IVaultShareService
Decision: Decompose the monolithic IKeeperService interface into two purpose-specific interfaces: IVaultReadService (KSM SDK for reads) and IVaultShareService (Commander CLI for shares).
Context: v1 had a single IKeeperService that conflated read operations (fetching record metadata, listing folders) with write operations (creating one-time shares). The underlying Keeper platform uses two distinct APIs for these: the KSM SDK (pull-based, application-scoped) and the Commander CLI (interactive session, user-scoped). Forcing both through one interface created an impedance mismatch and made the factory’s mock/real swap confusing.
Options Considered:
- Keep single interface, add optional methods for share-specific behavior.
- Split into read + share interfaces with independent implementations.
- Wrap Commander in a REST sidecar, unify via HTTP.
Choice: Option 2.
Rationale: The two Keeper integration surfaces have fundamentally different authentication models (KSM Application token vs. Commander user session), error shapes, and availability characteristics. Separate interfaces make each independently mockable and testable. The factory (vault.factory.ts) ensures only valid combinations exist: local-mock (both mock), staging (real reads, stub shares), full (both real).
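Illustration (a sketch, not the actual vault.factory.ts): the interface split and the three allowed mode combinations. Only createOneTimeShare, VAULT_DEPLOYMENT_MODE, and the local-mock/staging/full modes are named elsewhere in this log; the remaining method names and implementation handles are assumptions for illustration.

```ts
// Sketch: the two interfaces and the factory's valid combinations.
export interface VaultRecord { uid: string; title: string; revision: number; }

export interface IVaultReadService {
  getRecord(uid: string): Promise<VaultRecord>;               // hypothetical method
  listFolderRecordUids(folderUid: string): Promise<string[]>; // hypothetical method
}

export interface IVaultShareService {
  createOneTimeShare(recordUid: string): Promise<{ shareUrl: string }>;
}

export type VaultDeploymentMode = 'local-mock' | 'staging' | 'full';

// The factory only enforces which read/share pairings are valid per mode;
// callers supply the concrete mock/stub/real implementations.
export function createVaultServices(
  mode: VaultDeploymentMode,
  impls: {
    mockRead: IVaultReadService;
    ksmRead: IVaultReadService;
    mockShare: IVaultShareService;
    stubShare: IVaultShareService;
    commanderShare: IVaultShareService;
  },
): { reads: IVaultReadService; shares: IVaultShareService } {
  switch (mode) {
    case 'local-mock': return { reads: impls.mockRead, shares: impls.mockShare };
    case 'staging':    return { reads: impls.ksmRead,  shares: impls.stubShare };
    case 'full':       return { reads: impls.ksmRead,  shares: impls.commanderShare };
    default: {
      const never: never = mode;
      throw new Error(`Unknown VAULT_DEPLOYMENT_MODE: ${never}`);
    }
  }
}
```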
Date: 2026-04. Status: Implemented. Legacy shim retained during convergence.
ADR-002: Naming Migration keeper-* → vault-*
Decision: Rename all Keeper-specific service names, file names, and env vars from keeper-* to vault-*.
Context: The original naming coupled the product to a specific vendor. If the vault backend ever changes (or if the product supports multiple vault providers), keeper-* names would be misleading. More immediately, the naming confused developers about which services were Keeper-specific vs. generic.
Options Considered:
- Keep keeper-* naming.
- Rename to vault-* globally.
- Rename to credential-*.
Choice: Option 2. vault-* is vendor-neutral and accurately describes the role.
Rationale: The interface boundary is “vault operations” — read records, create shares, revoke shares. The Keeper-specific implementation detail lives behind the interface. Env vars like VAULT_DEPLOYMENT_MODE are clearer than KEEPER_MODE for operators who may not know Keeper.
Date: 2026-04. Status: Complete. All source files, factory, env vars migrated. v2-spec-validation.md documents the full rename table.
ADR-003: KSM SDK for Reads, Commander CLI for Shares (Hybrid Integration)
Decision: Use KSM SDK for vault reads and Commander subprocess for share operations, rather than using a single integration path for both.
Context: Keeper provides two developer-facing interfaces:
- KSM SDK (@keeper-security/secrets-manager-core): Application-scoped, token-based, read-only by design. Can fetch records and folder metadata but cannot create one-time shares.
- Commander CLI (python3 -m keepercommander): User-session-scoped, supports share creation and revocation but requires a Python subprocess and interactive session management.
Options Considered:
- All-Commander (shell out for everything).
- All-KSM (wait for KSM to add a share-creation API, if it ever ships).
- Hybrid: KSM reads + Commander shares.
Choice: Option 3 — hybrid.
Rationale: KSM is purpose-built for programmatic reads with proper application-scoped auth. Commander is the only path to share creation. Going all-Commander would sacrifice KSM’s cleaner auth model for reads. Going all-KSM would block on a feature that may never ship. The hybrid approach uses each tool where it’s strongest. The factory pattern (vault.factory.ts) makes the split invisible to consumers.
Risk noted: Commander subprocess is a single point of failure for share creation. If Keeper’s CLI behavior changes, share creation breaks. See Risk Register.
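Illustrative subprocess wrapper for the Commander side of the hybrid (a sketch only: the Commander subcommand and its output format are not documented in this log, so those parts are placeholders). VAULT_COMMANDER_TIMEOUT_MS is the env var named below.

```ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Sketch of the share-creation path via a Commander subprocess. The subcommand
// and the stdout parsing are placeholders, not the real invocation.
export async function createOneTimeShareViaCommander(recordUid: string): Promise<string> {
  const timeoutMs = Number(process.env.VAULT_COMMANDER_TIMEOUT_MS ?? '30000');
  const { stdout } = await execFileAsync(
    'python3',
    ['-m', 'keepercommander', '<one-time-share subcommand>', recordUid], // placeholder args
    { timeout: timeoutMs }, // kill the subprocess if Commander hangs
  );
  // Placeholder parsing: assume the share URL is the last non-empty line of output.
  const lines = stdout.trim().split('\n');
  return lines[lines.length - 1];
}
```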
Date: 2026-04. Status: Implemented. Commander timeout configurable via VAULT_COMMANDER_TIMEOUT_MS.
ADR-004: 9-State RequestStatus Enum
Decision: Use a 9-state enum for RequestStatus: PENDING, APPROVED, ISSUED, DENIED, EXPIRED, RELEASED, RENEWAL_PENDING, REQUIRES_TRIAGE, UNFULFILLABLE.
Context: The v1 spec originally proposed 8 states. During v2 validation, the team identified that REQUIRES_TRIAGE (from the policy engine) and ISSUED/UNFULFILLABLE (from the sync-first/async-fallback issuance pattern) served orthogonal purposes and all needed to coexist.
Options Considered:
- 7 states (merge REQUIRES_TRIAGE into PENDING with a flag).
- 8 states (drop either ISSUED or UNFULFILLABLE).
- 9 states (keep all, each serves a distinct workflow need).
Choice: Option 3.
Rationale per state:
- REQUIRES_TRIAGE: The policy engine raises this when rules produce a TRIAGE decision. Admins can query the triage queue via the existing request list filter. It is orthogonal to the approval/denial/issuance flow.
- ISSUED: Indicates successful credential delivery: the share URL is available. Distinct from APPROVED (which means "approved but not yet delivered").
- UNFULFILLABLE: Commander couldn't deliver after exhausting retries. Distinct from DENIED (which is a policy decision, not a technical failure).
The v2-spec-validation.md analysis confirms: “REQUIRES_TRIAGE should be preserved — it serves the policy engine and is orthogonal to ISSUED/UNFULFILLABLE.”
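For reference, the enum as a TypeScript sketch. The nine state names are exactly those listed in this ADR; the grouping comments are interpretive.

```ts
// RequestStatus per ADR-004; grouping comments reflect the rationale above.
export enum RequestStatus {
  // Core approval flow
  PENDING = 'PENDING',
  APPROVED = 'APPROVED',
  DENIED = 'DENIED',
  // Issuance outcomes (sync-first / async-fallback, ADR-005)
  ISSUED = 'ISSUED',
  UNFULFILLABLE = 'UNFULFILLABLE',
  // Lease lifecycle
  EXPIRED = 'EXPIRED',
  RELEASED = 'RELEASED',
  RENEWAL_PENDING = 'RENEWAL_PENDING',
  // Policy engine
  REQUIRES_TRIAGE = 'REQUIRES_TRIAGE',
}
```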
Date: 2026-04. Status: Implemented. Migration 202604241200_harden_governance_issuance_additive added ISSUED and UNFULFILLABLE.
ADR-005: Sync-First / Async-Fallback Issuance Pattern
Decision: When a request is approved, attempt synchronous issuance immediately. If the vault call fails transiently, leave the request in APPROVED and let a background retry job pick it up.
Context: The approval and issuance steps are logically distinct but tightly coupled in user expectation — approvers expect the credential to be available immediately after clicking “Approve.” However, Commander subprocess calls can fail due to timeouts, rate limits, or transient vault issues.
Options Considered:
- Fully synchronous: block the approval response until issuance completes or fails permanently.
- Fully asynchronous: always queue issuance for background processing.
- Sync-first with async fallback: try immediately, queue on transient failure.
Choice: Option 3.
Rationale: Option 1 makes the approval endpoint slow and fragile. Option 2 adds unnecessary latency for the happy path (most issuances succeed immediately). Option 3 gets the best of both: fast happy path, resilient failure path. The lease.service.startLease() function implements this with a try/catch around vaultShareService.createOneTimeShare(), classifying the error via classifyCommanderError() into transient (retry), persistent (retry with longer backoff), or terminal (UNFULFILLABLE).
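Sketch of the shape startLease() takes under this decision. Only createOneTimeShare, classifyCommanderError, and the APPROVED/ISSUED/UNFULFILLABLE transitions are documented here; the helper signatures and status writer are assumptions.

```ts
type CommanderErrorClass = 'transient' | 'persistent' | 'terminal';

// Assumed helper signatures, declared ambiently for the sketch.
declare function classifyCommanderError(err: unknown): CommanderErrorClass;
declare function setRequestStatus(
  requestId: string,
  status: 'APPROVED' | 'ISSUED' | 'UNFULFILLABLE',
): Promise<void>;
declare const vaultShareService: {
  createOneTimeShare(recordUid: string): Promise<{ shareUrl: string }>;
};

// Sync-first: try issuance inline. On transient/persistent failure the request
// stays APPROVED so issuance-retry.job.ts can pick it up; terminal errors go
// straight to UNFULFILLABLE.
export async function startLease(requestId: string, recordUid: string): Promise<void> {
  try {
    const share = await vaultShareService.createOneTimeShare(recordUid);
    await setRequestStatus(requestId, 'ISSUED'); // happy path: credential delivered
    void share.shareUrl; // persisted/delivered elsewhere in the real service
  } catch (err) {
    if (classifyCommanderError(err) === 'terminal') {
      await setRequestStatus(requestId, 'UNFULFILLABLE'); // not retryable
    } else {
      await setRequestStatus(requestId, 'APPROVED'); // leave for the background retry job
    }
  }
}
```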
Date: 2026-04. Status: Implemented in lease.service.ts.
ADR-006: Persistent Retry Budget N=3 with Backoff [30s, 1m, 2m, 4m]
Decision: Failed issuances get a retry budget of 3 attempts with exponential backoff (slots of 30s, 1m, 2m, 4m).
Context: When startLease() fails to create a one-time share, the request stays in APPROVED status. The issuance-retry.job.ts background job picks up these requests and retries.
Options Considered:
- Infinite retries with linear backoff.
- Fixed budget with exponential backoff.
- No retries — immediately mark UNFULFILLABLE on first failure.
Choice: Option 2 — budget N=3, backoff [30s, 1m, 2m, 4m].
Rationale: Infinite retries risk hammering a consistently failing Commander instance, while giving up after the first failure is too aggressive: most transient failures resolve within a few minutes. A budget of 3 with exponential backoff covers typical transient failures (network blips, rate limits) without creating a thundering-herd problem. Once the budget is exhausted, the request transitions to UNFULFILLABLE and requires admin intervention.
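Sketch of the backoff decision the retry job has to make per stuck request. The budget and slot values are from this ADR; the field names and the "due" check are assumptions about how issuance-retry.job.ts tracks attempts.

```ts
// Backoff slots from ADR-006. Note there are four slots for a budget of three
// retries, so the final slot caps any further delay calculation.
const BACKOFF_MS = [30_000, 60_000, 120_000, 240_000];
const RETRY_BUDGET = 3;

interface RetryableRequest {
  id: string;
  attemptCount: number; // assumed field: issuance attempts made so far
  lastAttemptAt: Date;  // assumed field
}

// Decide whether a request stuck in APPROVED is due for another attempt.
export function nextAction(
  req: RetryableRequest,
  now: Date = new Date(),
): 'retry' | 'wait' | 'unfulfillable' {
  if (req.attemptCount >= RETRY_BUDGET) return 'unfulfillable';
  const slot = Math.min(req.attemptCount, BACKOFF_MS.length - 1);
  const dueAt = req.lastAttemptAt.getTime() + BACKOFF_MS[slot];
  return now.getTime() >= dueAt ? 'retry' : 'wait';
}
```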
Date: 2026-04. Status: Implemented in issuance-retry.job.ts.
ADR-007: Discovery Matching — Bootstrap Title-Match → UID-Pinned (Option B)
Decision: Use title-based matching for initial bootstrap (linking Postgres Record rows to Keeper vault UIDs), then pin the UID so discovery never re-evaluates the linkage.
Context: Records in Postgres need to be linked to their corresponding Keeper vault records. Two strategies were considered for ongoing discovery:
Options Considered:
- Option A: Continuous title-match (discovery job always re-matches by title).
- Option B: Bootstrap title-match, then pin UID. After initial linkage, set vaultRecordUidPinned = true and never re-evaluate.
Choice: Option B.
Rationale: Title-match is fragile — vault records can be renamed, and titles aren’t guaranteed unique. Once a Postgres record is linked to a vault UID, that linkage should be stable. The vaultRecordUidPinned flag (added in v2 schema) makes the distinction explicit. Discovery only considers unpinned records. Admin can manually re-link via the register endpoint if needed.
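Sketch of the pinning rule as the discovery job would apply it. vaultRecordUidPinned is the real flag from this ADR; the other field names and the duplicate-title handling are simplifications.

```ts
interface PostgresRecord {
  id: string;
  title: string;
  vaultRecordUid: string | null;
  vaultRecordUidPinned: boolean; // flag named in ADR-007; other fields assumed
}

interface VaultRecordSummary { uid: string; title: string; }

// Only unpinned records are eligible for title matching; once linked, the UID
// is pinned and never re-evaluated. (Real discovery must also handle duplicate
// titles, which this sketch collapses via the Map.)
export function linkByTitle(
  records: PostgresRecord[],
  vaultRecords: VaultRecordSummary[],
): PostgresRecord[] {
  const byTitle = new Map(vaultRecords.map((v) => [v.title, v]));
  return records.map((r) => {
    if (r.vaultRecordUidPinned) return r; // pinned: linkage is stable, skip
    const match = byTitle.get(r.title);
    if (!match) return r; // no candidate; leave for manual admin registration
    return { ...r, vaultRecordUid: match.uid, vaultRecordUidPinned: true };
  });
}
```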
Current state: Bootstrap linkage is done via seed-definition.ts and manual admin registration. The discovery job (discovery.job.ts) handles new vault records. Some hand-linked SQL mappings between Postgres rows and Keeper UIDs remain and need reconciliation or a reset before v3.
Date: 2026-04. Status: Implemented. Hand-linked SQL is tactical debt — see Risk Register.
ADR-008: Postgres as Cached Projection (Not Source of Truth)
Decision: Postgres stores a cached projection of Keeper vault state. The Keeper vault is the source of truth for credential data.
Context: The application needs to query credential metadata (title, owner, revision, folder membership) for governance decisions and UI rendering. Querying the Keeper vault on every request would be slow and rate-limited.
Options Considered:
- Postgres as source of truth (copy all credential data into Postgres, stop querying vault).
- Postgres as cached projection (sync vault state into Postgres periodically, vault remains authoritative).
- No local cache (always query vault in real time).
Choice: Option 2.
Rationale: Option 1 would create a dangerous divergence — Postgres and vault could disagree on credential content, and there’s no mechanism to push changes back to the vault. Option 3 would make the application unusable under load. Option 2 keeps the vault as the authority while giving the application fast local reads. The syncStatus enum (UNKNOWN/FRESH/STALE/ORPHANED) and vaultRevision field make staleness explicit and queryable.
Implementation: vault-sync.job.ts runs periodically, comparing vaultRevision against the live vault. Record.syncedAt + syncStatus give admins freshness visibility via GET /api/admin/vault-sync-status.
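Sketch of the staleness classification. The syncStatus values and vaultRevision are from this ADR; the comparison logic is an assumption about how vault-sync.job.ts uses them.

```ts
type SyncStatus = 'UNKNOWN' | 'FRESH' | 'STALE' | 'ORPHANED';

// Classify one cached record against the live vault.
// liveRevision === undefined means the record no longer exists in the vault.
export function classifySync(
  cachedRevision: number | null,
  liveRevision: number | undefined,
): SyncStatus {
  if (liveRevision === undefined) return 'ORPHANED';
  if (cachedRevision === null) return 'UNKNOWN';
  return cachedRevision === liveRevision ? 'FRESH' : 'STALE';
}
```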
Date: 2026-04. Status: Implemented. Phase 1 (eager sync on approval) complete. Phase 2 (scheduled background sync) complete.
ADR-009: seed-definition.ts as Single Source of Truth for Fixtures
Decision: All Postgres seed data and Vault fixture data are derived from a single seed-definition.ts file.
Context: The application needs deterministic seed data for local development and testing. Two persistence surfaces exist (Postgres and Keeper vault mock), and they need to agree on record IDs, UIDs, folder mappings, and authority structures.
Options Considered:
- Separate seed files for Postgres and vault.
- Single definition file that generates both.
Choice: Option 2.
Rationale: A single source eliminates drift between Postgres and vault fixtures. seed-definition.ts defines the canonical record/user/folder/authority shapes. emit-postgres-seed.ts and emit-vault-fixture.ts transform these into the format each target expects. Tests import from seed-definition.ts to get stable IDs for assertions.
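Sketch of the single-source shape and the two emitters. The file names are real; the record fields and the example entry are illustrative placeholders.

```ts
// seed-definition.ts (sketch): one canonical shape; both emitters derive from it.
export interface SeedRecord {
  id: string;             // stable Postgres id used by tests
  vaultRecordUid: string; // stable vault UID used by the vault fixture
  title: string;
  folder: string;
}

export const SEED_RECORDS: SeedRecord[] = [
  { id: 'rec-001', vaultRecordUid: 'uid-001', title: 'Example credential', folder: 'team-a' }, // placeholder entry
];

// emit-postgres-seed.ts and emit-vault-fixture.ts would map the same array
// into each target's format, roughly:
export const postgresRows = SEED_RECORDS.map((r) => ({
  id: r.id,
  title: r.title,
  vault_record_uid: r.vaultRecordUid,
}));

export const vaultFixture = SEED_RECORDS.map((r) => ({
  uid: r.vaultRecordUid,
  title: r.title,
  folder: r.folder,
}));
```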
Date: 2026-04. Status: Implemented in seed/.
ADR-010: HARDEN_GOVERNANCE_v1 Flag Gating Issuance Token-Exchange Flow
Decision: Gate the full issuance token-exchange flow behind a HARDEN_GOVERNANCE_v1 boolean flag. When false, the system operates in “soak mode” with relaxed issuance constraints. When true, the token-exchange is the sole path to credential delivery.
Context: The token-exchange flow (generate token at approval, verify at issuance, rate-limit, hash-only storage) is a significant security hardening. Deploying it immediately in production without a soak period risks breaking existing workflows.
Options Considered:
- Ship hardened flow immediately with no fallback.
- Feature flag with soak period.
- A/B test (some requests use old flow, some use new).
Choice: Option 2.
Rationale: The feature flag allows deploying the hardened code to staging/production in “observe mode” — the code paths execute, the issuance events are written, but enforcement is relaxed. Once the team confirms no regressions during the soak period, the flag is flipped to true and the hardened flow becomes the only path.
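Sketch of the gate. Reading the flag from an env var of the same name and the exact observe-mode behavior are assumptions; the enforce-vs-soak split is from the rationale above.

```ts
// When the flag is off, the hardened path still runs and records its issuance
// events ("soak mode"), but enforcement is relaxed. When on, token exchange is
// the sole issuance path.
const hardenGovernanceV1 = process.env.HARDEN_GOVERNANCE_v1 === 'true'; // assumed env var read

export function enforceTokenExchange(tokenValid: boolean): { allow: boolean; mode: 'enforce' | 'observe' } {
  if (hardenGovernanceV1) {
    return { allow: tokenValid, mode: 'enforce' };
  }
  // Soak mode: observe the verification outcome without blocking issuance.
  return { allow: true, mode: 'observe' };
}
```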
Current state: HARDEN_GOVERNANCE_v1=false in both staging and production. Flipping requires runtime validation — the flag has never been tested in true mode on staging.
Date: 2026-04. Status: Implemented. Flag exists. Not yet exercised in true mode — see Risk Register.