SigningLog retention/prune reaper (unbounded growth from #28 usage caps) #37

Open
opened 2026-06-21 10:35:20 +00:00 by padreug · 0 comments
Owner

Follow-up deferred from #28 / #34 (per-rule windowed usage caps).

Problem

recordSigning() appends one SigningLog row for every allowed consequential signing (sign_event/encrypt/decrypt). The table is append-only and nothing ever deletes from it, so it grows unbounded for the life of the bunker. On a busy operator-IdP instance this is the highest-write table in the schema.

Caps count live against it (checkIfPubkeyAllowed step 4: COUNT(SigningLog WHERE keyUserId/method/kind AND createdAt > since)), so only rows inside the active window ever affect a verdict — older rows are dead weight that only slow the count and bloat the DB.
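To make the "only in-window rows matter" point concrete, here is a minimal in-memory sketch of a step-4 style count. The row shape and the `countUsage` helper are hypothetical stand-ins for illustration, not the actual code in `acl/index.ts`; the types of `keyUserId` and `kind` are assumptions.

```typescript
// Hypothetical row shape; the real SigningLog model may differ.
interface SigningLogRow {
  keyUserId: number;
  method: string;
  kind: number | null;
  createdAt: Date;
}

// Count rows for one (keyUserId, method, kind) tuple newer than `since`.
// `since === null` models a lifetime cap: there is no createdAt floor.
function countUsage(
  rows: SigningLogRow[],
  keyUserId: number,
  method: string,
  kind: number | null,
  since: Date | null,
): number {
  return rows.filter(
    (r) =>
      r.keyUserId === keyUserId &&
      r.method === method &&
      r.kind === kind &&
      (since === null || r.createdAt.getTime() > since.getTime()),
  ).length;
}
```

With a finite window, any row at or before `since` contributes nothing to the result, which is why it can be reaped without changing a verdict. With `since === null`, every matching row contributes.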

Proposed reaper

A periodic prune (daemon timer, or a maintenance RPC) that deletes SigningLog rows older than the longest active finite window across all PolicyRules:

DELETE FROM SigningLog WHERE createdAt < now() - max(windowSeconds across capped rules)

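The existing @@index([keyUserId, method, createdAt]) supports a createdAt-bounded delete when the statement also pins keyUserId and method. A delete bounded on createdAt alone cannot seek on the index's leading columns, so a single global cutoff may still scan the table.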

⚠️ Correctness constraint — do NOT prune rows a lifetime cap counts

A rule with windowSeconds = null is a lifetime cap: step 4 counts that (method, kind) all-time with no createdAt floor (acl/index.ts:145-146). If the reaper deletes old rows that a lifetime-capped rule would have counted, the lifetime count silently under-reports and the cap lifts itself — a security regression, not just a perf miss.

So the prune boundary is not simply "longest finite window." The reaper must:

  1. Compute the max finite windowSeconds among active rules → candidate cutoff.
  2. Exclude any (keyUserId, method, kind) that is matched by an active rule with maxUsageCount != null AND windowSeconds == null (lifetime cap) — those rows are load-bearing forever.
  3. Only then delete rows older than the cutoff for the remaining (keyUserId, method, kind) tuples.

If any active lifetime cap exists, the simplest correct v1 is to skip pruning the tuples it covers entirely and only reap rows for tuples governed solely by finite-window caps.
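The three steps above can be sketched as a pure selection function over in-memory data. This is an illustration under stated assumptions, not the bunker's implementation: the `PolicyRule` and `SigningLogRow` shapes, the `selectPrunable` and `tupleKey` names, and exact-match tuple comparison are all hypothetical. If real rule matching is broader than an exact `(keyUserId, method, kind)` match (for example, a rule that covers every kind), the exclusion in step 2 would have to widen accordingly.

```typescript
// Hypothetical shapes; only maxUsageCount and windowSeconds come from the issue text.
interface PolicyRule {
  keyUserId: number;
  method: string;
  kind: number | null;
  maxUsageCount: number | null; // null = uncapped
  windowSeconds: number | null; // null with a cap = lifetime cap
}

interface SigningLogRow {
  id: number;
  keyUserId: number;
  method: string;
  kind: number | null;
  createdAt: Date;
}

// Stable string key for a (keyUserId, method, kind) tuple.
function tupleKey(t: { keyUserId: number; method: string; kind: number | null }): string {
  return JSON.stringify([t.keyUserId, t.method, t.kind]);
}

// Return the rows a reaper could delete without changing any cap verdict.
function selectPrunable(rows: SigningLogRow[], rules: PolicyRule[], now: Date): SigningLogRow[] {
  const capped = rules.filter((r) => r.maxUsageCount !== null);

  // Step 1: longest finite window among capped rules gives the candidate cutoff.
  const finite = capped
    .map((r) => r.windowSeconds)
    .filter((w): w is number => w !== null);
  if (finite.length === 0) return []; // no finite window, so no cutoff: prune nothing
  const cutoffMs = now.getTime() - Math.max(...finite) * 1000;

  // Step 2: tuples under a lifetime cap are excluded entirely.
  const lifetimeKeys = capped.filter((r) => r.windowSeconds === null).map(tupleKey);

  // Step 3: everything else older than the cutoff is prunable.
  return rows.filter(
    (row) => row.createdAt.getTime() < cutoffMs && lifetimeKeys.indexOf(tupleKey(row)) === -1,
  );
}
```

Returning the rows rather than deleting them keeps the decision testable in isolation; a real reaper would translate the same predicate into a database delete.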

Notes / open questions

  • Cadence: hourly timer vs. on-startup vs. amortized (prune-on-write every N rows). A daemon timer mirroring the relay-reconnect timer pattern is probably cleanest.
  • Rows for (method, kind) tuples that are uncapped (no matching maxUsageCount) are never counted at all — those are always safe to prune to now() - shortest-window or even immediately, but keeping the logic simple (single global cutoff + lifetime exclusion) is fine for v1.
  • Consider whether the log has independent audit value (it's a record of every signing the bunker performed). If we ever want it for audit/forensics, retention becomes a policy decision, not just a cap-budget GC. Worth a one-line config knob (signingLogRetentionDays) defaulting to the cap-driven minimum.
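One way the `signingLogRetentionDays` knob could interact with the cap-driven minimum is sketched below. The `effectiveRetentionSeconds` helper is hypothetical, and clamping a too-small configured value up to the longest finite window is an assumption made here so that a short retention setting cannot undercount a windowed cap; the issue itself only specifies the default.

```typescript
const SECONDS_PER_DAY = 86400;

// Resolve how long SigningLog rows should be kept, in seconds.
// - maxFiniteWindowSeconds: the cap-driven minimum (longest finite window).
// - signingLogRetentionDays: optional operator override for audit retention.
function effectiveRetentionSeconds(
  maxFiniteWindowSeconds: number,
  signingLogRetentionDays?: number,
): number {
  if (signingLogRetentionDays === undefined) {
    return maxFiniteWindowSeconds; // default: keep only what the caps need
  }
  // Never retain less than the caps need, even if configured lower.
  return Math.max(maxFiniteWindowSeconds, signingLogRetentionDays * SECONDS_PER_DAY);
}
```

For example, with a longest finite window of one hour and `signingLogRetentionDays` set to 30, rows would be kept for 30 days; with the knob unset they would be kept for one hour. Lifetime-capped tuples stay excluded from pruning regardless of this value.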

Relevant code: src/daemon/lib/acl/index.ts (recordSigning, step-4 count), prisma/schema.prisma model SigningLog.

Reference
aiolabs/nsecbunkerd#37