Safety Model - SikkerKey Docs

The sync agent manages database credentials. A bug here doesn't just lose data; it can lock users out of their production database. The safety model is designed around one principle: the live secret must always match what the database accepts.

Two-Phase Rotation

SikkerKey never updates the live secret value until the database is confirmed to have the new password. This is the core safety guarantee.

Without two-phase (naive approach)

SikkerKey rotates the password
Agent polls, sees the change, tries to apply
If the agent is down or the apply fails, the database has the old password but SikkerKey has the new one
Every service reading from SikkerKey gets credentials that don't work

With two-phase (what SikkerKey does)

SikkerKey generates a new password and holds it as pending
Agent detects the pending rotation and applies it to the database
Agent verifies the new password works by connecting as the managed user
Agent confirms back to SikkerKey
SikkerKey promotes the pending value to the live secret
If any step fails, the live secret stays at the old (working) password
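
The two-phase flow above can be sketched as a small state machine. This is a minimal illustration, not SikkerKey's implementation: the ManagedSecret record and the begin_rotation, confirm_rotation and reject_rotation functions are hypothetical names, and the password generator is an assumption.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManagedSecret:
    """Hypothetical server-side record; field names are illustrative."""
    live_value: str                      # what SDK/CLI readers receive
    pending_value: Optional[str] = None  # held until the agent confirms


def begin_rotation(secret: ManagedSecret) -> None:
    """Phase 1: generate a new password and hold it as pending."""
    secret.pending_value = secrets.token_urlsafe(32)


def confirm_rotation(secret: ManagedSecret) -> None:
    """Phase 2: promote the pending value after the agent has verified it."""
    if secret.pending_value is None:
        raise ValueError("no pending rotation")
    secret.live_value = secret.pending_value
    secret.pending_value = None


def reject_rotation(secret: ManagedSecret) -> None:
    """Failure path: drop the pending value; the live value is untouched."""
    secret.pending_value = None
```

Readers only ever see live_value, so a rotation that is rejected or never confirmed leaves them with the old, working password.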

The key difference: the secret value that machines read via SDK/CLI is only ever a password that the database has been confirmed to accept. A failed or abandoned rotation never leaves SikkerKey serving a password that was not applied.

Verification After Apply

The agent does not trust that ALTER ROLE succeeded just because it didn't return an error. After applying the new credentials, the agent makes a separate connection to the database using the managed username and the new password. Only if this test connection succeeds does the agent confirm.

This catches edge cases like:

The ALTER command succeeded but the password policy rejected the value
The connection pool returned a stale connection that didn't reflect the change
A different process reverted the password between the ALTER and the verify
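
A verify step along these lines could look like the sketch below. The open_connection parameter is a hypothetical factory (for example a thin wrapper around a database driver's connect call) that is assumed to raise on authentication failure; it is injected so that the check always uses a fresh connection rather than a pooled one.

```python
from typing import Callable, Tuple


def verify_new_password(
    open_connection: Callable[[str, str], object],
    username: str,
    new_password: str,
) -> Tuple[bool, str]:
    """Connect as the managed user with the new password.

    Returns (True, "") on success, or (False, reason) on any failure.
    """
    try:
        conn = open_connection(username, new_password)
    except Exception as exc:  # any failure means "do not confirm"
        return False, f"verify failed: {exc}"
    # Close the test connection if the driver object supports it.
    close = getattr(conn, "close", None)
    if callable(close):
        close()
    return True, ""
```

The agent would confirm the rotation only when the first element of the result is True.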

Rollback on Verify Failure

If the apply succeeds but verification fails (the new password doesn't work despite the ALTER returning success), the rotation is rolled back:

The agent fetches the current live secret from SikkerKey (which is still the old password)
The agent applies the old password back to the database
The agent rejects the rotation, reporting both the verify error and the rollback result
SikkerKey clears the pending state and logs the error

If the rollback also fails, the error message includes both failures so the operator can intervene manually.
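
The rollback path can be sketched as follows. Both callables are hypothetical stand-ins: fetch_live_secret for the SikkerKey read, and apply_password for whatever statement the agent uses to set the password. The message format is an assumption; the point is that the reject carries both outcomes.

```python
from typing import Callable


def rollback_after_failed_verify(
    fetch_live_secret: Callable[[], str],
    apply_password: Callable[[str], None],
    verify_error: str,
) -> str:
    """Restore the old password and build the error text for the reject."""
    try:
        old_password = fetch_live_secret()  # still the old, working value
        apply_password(old_password)        # put it back on the database
        rollback_result = "rollback succeeded"
    except Exception as exc:
        rollback_result = f"rollback failed: {exc}"
    # Always report both, so the operator can see whether to step in.
    return f"verify failed: {verify_error}; {rollback_result}"
```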

Agent Lock

Each managed secret is locked to exactly one machine. The first machine to fetch the sync config becomes the registered agent (agentMachineId is recorded). Subsequent requests from a different machine are rejected with HTTP 409 on sync-config fetch and HTTP 403 on confirm/reject.

This prevents:

Two agents racing to apply the same rotation (double ALTER ROLE)
A decommissioned machine's agent interfering with the replacement

To transfer agent ownership to a new machine, delete the managed secret and recreate it, or reset the agent lock from the employee portal.
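
As a rough sketch of the lock semantics, assuming a hypothetical SyncConfig record and handler names (only agentMachineId and the status codes come from the description above):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncConfig:
    agent_machine_id: Optional[str] = None  # mirrors agentMachineId


def fetch_sync_config_status(config: SyncConfig, machine_id: str) -> int:
    """The first caller claims the lock; a different machine gets 409."""
    if config.agent_machine_id is None:
        config.agent_machine_id = machine_id
    return 200 if config.agent_machine_id == machine_id else 409


def confirm_or_reject_status(config: SyncConfig, machine_id: str) -> int:
    """Only the registered agent may confirm or reject; others get 403."""
    return 200 if config.agent_machine_id == machine_id else 403
```

Resetting the lock corresponds to setting agent_machine_id back to None, after which the next machine to fetch the config becomes the registered agent.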

Zombie Prevention

The agent exits cleanly when the managed secret is deleted or access is revoked:

Secret deleted (404 on sync-config poll): agent prints "Secret deleted or access revoked" and exits
Access revoked (403 on sync-config poll): same behavior
Agent disabled (404 on sync-config poll, because no config with enabled = true is found): same behavior

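For agents running as system services (systemd/launchd), the exit triggers a restart. The restarted agent immediately hits the same 404/403 and exits again. Under systemd, repeated quick restarts eventually trip the unit's start rate limit (StartLimitBurst within StartLimitIntervalSec), after which systemd stops retrying.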
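
The exit behavior can be illustrated with a small poll loop. This is a sketch under stated assumptions: poll_status is a hypothetical callable returning the HTTP status of the sync-config poll, and max_polls exists only to keep the example finite. A real agent would sleep between polls and exit the process instead of returning.

```python
from typing import Callable

# 404: secret deleted or agent disabled; 403: access revoked
EXIT_STATUSES = {403, 404}


def run_poll_loop(poll_status: Callable[[], int], max_polls: int) -> str:
    """Poll until the server reports the secret is gone; say why we stopped."""
    for _ in range(max_polls):
        status = poll_status()
        if status in EXIT_STATUSES:
            print("Secret deleted or access revoked")
            return "exit"
        # Any other status: keep polling (sleep omitted for brevity).
    return "max_polls_reached"
```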

Pending Rotation Hold

SikkerKey will not generate a new pending rotation if:

A rotation is already pending: the previous one must be confirmed or rejected first
The agent is unhealthy: no heartbeat for 90+ seconds
The agent status is "no_agent": no agent has ever connected

This prevents:

Stacking multiple unconfirmed rotations
Generating rotations that nobody will apply
Wasting entropy on passwords that will never be used
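
The three hold conditions amount to a single guard. In the sketch below the function name and parameters are hypothetical; the 90-second threshold and the "pending" and "no_agent" values come from the list above.

```python
from typing import Optional

HEARTBEAT_TIMEOUT_SECONDS = 90


def may_generate_rotation(
    rotation_status: str,
    agent_status: str,
    seconds_since_heartbeat: Optional[float],
) -> bool:
    """Return True only if a new pending rotation may be generated."""
    if rotation_status == "pending":
        return False  # previous rotation must be confirmed or rejected first
    if agent_status == "no_agent" or seconds_since_heartbeat is None:
        return False  # no agent has ever connected
    if seconds_since_heartbeat >= HEARTBEAT_TIMEOUT_SECONDS:
        return False  # agent is unhealthy
    return True
```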

Idempotent Confirm

Each pending rotation has a unique rotationId (UUID). The confirm endpoint verifies the ID matches before promoting. If the agent sends a duplicate confirm (e.g. retry after network timeout), the second request is rejected with "Rotation ID mismatch" because the first confirm already cleared the pending state. The secret is not double-promoted.
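
A minimal sketch of that check, assuming a hypothetical SecretState record and a confirm function that returns a success flag and message (the real endpoint's response shape is not described here):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SecretState:
    live_value: str
    pending_value: Optional[str] = None
    rotation_id: Optional[str] = None  # mirrors rotationId


def confirm(state: SecretState, rotation_id: str) -> Tuple[bool, str]:
    """Promote the pending value only if the rotation ID matches."""
    if state.rotation_id is None or rotation_id != state.rotation_id:
        return False, "Rotation ID mismatch"
    state.live_value = state.pending_value
    # Clearing the pending state is what makes a duplicate confirm harmless.
    state.pending_value = None
    state.rotation_id = None
    return True, "confirmed"
```

A retried confirm with the same ID finds no pending rotation, so it is rejected and the live value is left as the first confirm set it.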

Retry After Failure

When a rotation fails, rotationStatus is set to failed with the error message. The next rotation timer tick generates a new pending rotation with a fresh password (the failed password is discarded). The user can also click "Retry rotation" in the dashboard to reset the status to idle, triggering a new rotation on the next cycle.

Failed passwords are never retried. This is intentional: a password that failed to apply should not be reused.
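
The failure and retry transitions can be sketched like this. The Rotation record and function names are hypothetical, the password generator is an assumption, and the tick deliberately ignores the agent health checks to keep the example short.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rotation:
    status: str = "idle"  # mirrors rotationStatus: idle, pending, failed
    error: Optional[str] = None
    pending_value: Optional[str] = None


def mark_failed(rotation: Rotation, error: str) -> None:
    """Record the failure and discard the password that failed to apply."""
    rotation.status = "failed"
    rotation.error = error
    rotation.pending_value = None


def retry_from_dashboard(rotation: Rotation) -> None:
    """What a 'Retry rotation' click does in this sketch: reset to idle."""
    rotation.status = "idle"
    rotation.error = None


def on_timer_tick(rotation: Rotation) -> None:
    """Start a new rotation with a fresh password unless one is pending."""
    if rotation.status == "pending":
        return
    rotation.pending_value = secrets.token_urlsafe(32)
    rotation.status = "pending"
    rotation.error = None
```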

What the Dashboard Shows

The dashboard surfaces agent health and rotation state for each managed secret:

Agent status: Healthy, Unhealthy, Error, No Agent
Rotation status: Idle (normal), Pending (waiting for agent), Failed (with error message and retry button)
Last heartbeat: when the agent last checked in
Last rotated: when the last rotation was confirmed (not attempted)

Status changes are audit-logged as agent_status_change. Rotation failures are logged as secret_rotate_denied.