Design discussion / RFC: enforce token + grant lifecycle at sign time (the root behind #24) #25
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Why this is a discussion, not a patch
#24 (token
expiresAtignored after connect) looks like a one-line fix. It isn't. It's one symptom of a structural decision, and that decision has at least three symptoms already shipped — so before patching #24 in isolation, let's agree on the foundation, because the choice we make here decides whether this class of bug can recur.This is pre-release / pre-public-launch code. We have time, no stored-data migration burden, and the freedom to make the strict-from-the-start choice now. That's exactly when this kind of decision is cheap to get right and expensive to defer.
The pattern (one cause, three symptoms)
At connect time,
applyToken(src/daemon/backend/index.ts:99) materializes a token's policy rules into per-KeyUserSigningConditionrows. At sign time,checkIfPubkeyAllowed(src/daemon/lib/acl/index.ts:23) reads those materialized rows first (step 3b) and short-circuits — so anything that lives on theToken/Policybut isn't copied onto theSigningConditionis invisible to signing:Token.expiresAtToken.revokedAtaiolabs/spirekeeper#22PolicyRule.maxUsageCountSigningConditionhas noexpiresAt/revokedAt/usage column at all (prisma/schema.prisma:59). Compounding it: "is this token valid right now?" is defined in two places —validateToken(redeem, checks expiry) and ACL step 4 (sign, doesn't) — and they have already drifted. That drift is the bug.Root cause, stated once: the materialized
SigningConditionset is a cache of a derivation, and it has no invalidation. Every lifecycle feature anyone adds will arrive broken for the same reason.The decision
Two principled ways to make a cached system correct:
applyTokenrecords only the binding (this pubkey is paired via this token) and nothing else. Every sign request computes the answer fresh from the realToken/Policy, checking revoke and expiry and usage in one indexed join. The whole bug family becomes impossible by construction — a new lifecycle rule is just one more predicate, never a forgotten photocopy. The usual "live is slow" objection doesn't apply: it's one indexed join on a path already spending ~5–10 ms on NIP-44 + Schnorr per round-trip.Recommendation: Option D, with one non-negotiable discipline that prevents the original drift from returning:
A clean result that falls out of D
The model also forces a useful distinction we should adopt in the vocabulary:
KeyUser.revokedAt): sticky, beats everything; un-banning is a deliberate admin action; re-pairing a banned subject should fail loudly (todayapplyToken'supdate:{}would silently create a dead binding).Token.expiresAt): not a ban; re-pairing with a fresh token simply adds a new live grant alongside. Naturally correct because expiry lives on the grant, never on the subject.This directly answers the re-pair wrinkle raised on #24 /
aiolabs/spirekeeper#22.The open question for the team
If we take D, one genuinely hard-to-reverse modelling choice remains — this is the part I want input on:
Token+Policy) and manual admin overrides (add_signing_condition, the web-approvalallowAllRequestsFromKeypath,create_accountbootstrap) as distinct kinds, but give the override layer a real lifecycle (createdAt/expiresAt/revokedAt) and run both through the one sharedgrantIsLive(now)predicate. Lower ceremony on the interactive path; clean audit trail. (My lean.)The robustness win (killing the bug family) comes from D itself, not from this sub-choice — so D1 vs D2 is a taste call about uniformity vs. simplicity. Both are defensible; I'd like a second opinion before committing to a schema.
Scope note
Strict-from-the-start: no back-compat shims, no "absent → pass" graceful paths. If we adopt D, the same change also fixes the revoke no-op (
aiolabs/spirekeeper#22) and unlocks live usage-cap enforcement — three findings closed at one seam.Cross-refs
expiresAt).aiolabs/spirekeeper#22— sibling finding: token-revoke is a post-redeem no-op for the same materialization reason.Please weigh in on Option C vs D, and (if D) on D1 vs D2.
Prior art:
lnbits/nostr_bunkeris a shipping Option DI went and read upstream
lnbits/nostr_bunker(services.py/models.py/crud.py, verified againstmainon 2026-06-19) to see whether anyone in the NIP-46 space has already made this exact decision. They have — and they landed on D, which I think strengthens the recommendation above with an empirical existence-proof.Their design is the degenerate-but-instructive case: the grant is the
bunker://URL record (UrlData). There's no materialization step — every signing request re-reads the live grant. Concretely, all three of our broken lifecycle rules are enforced live per request:lnbits/nostr_bunker_assert_url_is_active()checksexpires_atevery request_assert_post_rate_limit()enforced liveSo a real, shipping NIP-46 bunker enforces the exact trio we drop, with zero invalidation machinery. That's the direct answer to the "won't deciding live be too fiddly / too slow?" objection against D — somebody already runs it in production on the same per-request crypto path.
Two mechanisms worth stealing regardless of D1/D2
1. Usage caps by counting the source of truth, not maintaining a counter. This is the one I'd actually change in our schema. Our
PolicyRule.maxUsageCount/currentUsageCountis a mutable counter — a second cache you have to remember to decrement, which is its own drift hazard (and is partly why the usage sibling is broken). Upstream instead counts signing-request rows in the trailing window (get_signing_requests_since(24h)) — nothing to increment, nothing to invalidate, the count is derived from records we already write. If we go D, I'd dropcurrentUsageCountentirely and countRequestrows the same way. This is the same "source of truth, don't re-derive a copy" principle the RFC argues, applied to the usage rule.2.
_assert_url_is_active()as a single named predicate is literally the "define 'valid right now' in exactly one place" discipline from the RFC, already factored out. Concrete template for ourgrantIsLive(now).What it does NOT settle — and why it nudges D1, not D2
Upstream dodges the entire family by having no indirection to drift: no token redeem/handoff, no manual admin-grant path, one grant type, one tenant per wallet, flat permission strings on the grant. It's effectively the limit case of D2 ("one evaluation path") — but only because it deleted the manual path, not because it unified it.
That's not evidence for D2 in our context; it's evidence that uniformity is free when you have one grant kind, which we don't (per-device redeemable tokens and interactive admin overrides are both real requirements). What transfers is the evaluation strategy (live read, single predicate, derive-don't-count), not the schema. If anything it reinforces my lean toward D1: keep token-derived and manual grants as distinct typed sources, give the override layer a real lifecycle, and run both through one shared
grantIsLive(now)— you get upstream's robustness without amputating the manual path the way they did.tl;dr — upstream validates D outright and hands us the counting-not-counter fix for the usage sibling for free; it stays neutral-to-favorable on D1.
Prior art #2:
Letdown2491/signet— a re-architecture of our own codebase, and a cautionary oneSurveyed the OSS NIP-46 field for daemons with a real policy model. The standout is Signet (TS daemon + React UI + Android companion, MIT, very active — v1.11.0, 2026-06): an extensive fork of the same kind-0/nsecbunkerd codebase we maintain, re-architected around exactly this ACL/lifecycle problem. I read
acl.ts,nip46-backend.ts, andschema.prismadirectly. The conclusion is more useful than "copy it" — Signet independently shipped our #24 bug, which is strong corroboration of the root-cause framing above.What it solved, what it didn't (verified against source)
applyToken(nip46-backend.ts:807) is structurally identical to ours:Token.expiresAtonce, at redeem (nip46-backend.ts:895),policy.rulesintoSigningConditionrows (:845-862) — carryingmethod/kind/allowed, no expiry, no usage (theirSigningConditionis byte-identical to ours),acl.ts:checkRequestPermission) never readsTokenagain.Grep confirms the blast radius:
maxUsageCount/currentUsageCountare touched only in the policy CRUD route — never decremented, never checked on the hot path. So Signet ships the exact #24 (token TTL ignored after connect) and dead usage-caps. A second team fell into the same materialization trap → independent confirmation this is structural, not a one-off oversight.What Signet does add over us: a coarse-cache layer for subject-level state on
KeyUser—revokedAt,suspendedAt/suspendUntil,trustLevel— read live every request, invalidated on change (invalidateAclCache). That genuinely fixes live revoke (our siblingspirekeeper#22). Notably it puts revoke onKeyUser, notToken— corroborating the revoke=subject-level / expiry=grant-level split proposed above.Why this sharpens the C-vs-D call
The fix cleaves exactly along the revoke/expiry line:
KeyUserrow): Signet's coarse-cache-with-invalidation is the right, cheap tool. This is "Option C done carefully," and it works because the cached state is a single row whose every mutation has an invalidation hook.Token/PolicyRule): caching a materialized photocopy cannot work — Signet is the proof. This half needs the Option D live join.So Signet is not a drop-in: it solves the half we'd half-solved and leaves #24 proper open. The synthesis I'd propose for D (leaning D1):
KeyUsersubject-state:+ suspendedAt,+ suspendUntil, optional+ trustLevel, with@@index([revokedAt])/@@index([suspendedAt]). Coarse-cache it + invalidate on change. (We already haveKeyUser.revokedAt.)Requestindexing —+ keyUserIdFK,@@index([allowed, createdAt]),@@index([keyUserId])— to enable usage =COUNT(Request)in window (thelnbits/nostr_bunkerderive-don't-count pattern). DropPolicyRule.currentUsageCount— the mutable counter is itself a drift-prone cache.applyTokenmaterialization.applyTokenrecords only theToken.keyUserIdbinding; sign-time joinsToken → Policy → PolicyRulelive and runs everything through onegrantIsLive(now)predicate (Token.expiresAt∧Token.revokedAt∧ subject state ∧ usage). This is the line Signet kept and we should delete.ConnectionToken(handshake — validates, never auto-approves) from durable policy-backedToken. That's our two-typed-sources model; the manual-overrideSigningConditionlayer then needs its own lifecycle (+ createdAt/expiresAt/revokedAt) so both sources run through the same predicate.Full schema diff + the rest of the survey (promenade/FROST can't do NIP-04/44 decrypt — relevant to the #18 server-decrypt need; FROSTR's 3-layer revocation model; Amber's per-(app×method×kind×relay) grants; NDK's
Nip46PermitCallbackseam we sit behind) captured offline — happy to drop the schema-diff doc here if useful.tl;dr: Signet confirms (a) revoke belongs on the subject and must be live — adopt their cache; (b) grant-level TTL/usage cannot be materialized — they proved it by re-shipping #24. That's the strongest case yet for D, and their schema is a ready-made reference for D1 minus the one
applyTokenline we must not copy.Prior-art survey, source-verified — the complete picture
Read the actual source of every other NIP-46 signer worth learning from (clones at the commits cited; an initial automated pass overstated several of these, so each claim below is checked against code). Full writeup with all file:line citations lands in
docs/acl-prior-art-survey.md. Net: nothing unseats Option D, leaning D1 — and we now have a verified reference implementation for each piece.Does anyone enforce grant lifecycle live at sign time?
acceptUntil > now()every request; sweep is cleanup-onlyUntilonlyAmber is the positive template (verified)
IntentUtils.isRemembered()(IntentUtils.kt:1087-1101) is the per-request verdict and recomputes the deadline againstnow()every call; expired → returnsnull→ prompt. The 24hupdateExpiredPermissionssweep (ApplicationDao.kt:51) is non-load-bearing — correctness doesn't depend on it firing. Three things worth lifting straight into our D design:acceptUntilandrejectUntil): "reject for 5 min" decays back to a prompt instead of a permanent no.CachingApplicationDao): keeps thenow()re-check on every cache hit. Same lesson Signet's coarse-cache teaches for subject state.Corrections to the earlier secondhand summary
frost/ecdh.goimplements threshold ECDH; promenade chooses not to wire it (AuthorizeEncryption → false,GroupContext.Encrypt → "not implemented"). Relevant to the #18 "bunker for everything" endgame: threshold-protecting the server identity wouldn't mathematically preclude DM decryption — but keeping ECDH on a separate non-threshold key is the cheaper path. The functional "promenade can't decrypt" stands; the reason was wrong.status='revoked'), only per-grant revoke is implicit.What each implementation contributes to our redesign
Token/Policy/ConnectionTokendecomposition (itsConnectionToken-vs-Tokensplit is D1) — minus the oneapplyTokenmaterialization line we must not copy.revoked_atchecked first, last-used tracking, audit-event-on-grant-change).get_public_keybypassespubkeyAllowedentirely, so identity disclosure is ungated/unauditable through our ACL seam (every other method gates; this one doesn't). Worth a deliberate accept-or-override decision as part of the "one predicate on every request" goal.Bottom line for the open decision
Option D is the only design that closes the grant-level family, and now has a production existence-proof (Amber) plus a cautionary re-ship of our bug (Signet) on either side of it. D1 is corroborated by Signet's two-source schema and avoids promenade's revoke=re-key trap.
Option D (leaning D1) implemented and deployed to all servers via #27 (merge
992c6a8):grantIsLive(now)predicate used identically at redeem (validateToken) and sign (checkIfPubkeyAllowed)applyTokende-materialized — token grants evaluated live offToken → Policy → PolicyRuleSigningConditionlayer carries its own lifecycle (D1)The materialization-drift family is closed by construction. Spinoffs tracked separately: usage-cap enforcement #28, DB integration tests #29, NDK
get_public_keyseam #26. Prior-art survey + keep-our-fork decision landed indocs/.Closing the design RFC as delivered.