create_new_key races: backend listener not ready when caller proceeds to NIP-46 connect #9
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
create_new_keyreturns its response BEFORE the NIP-46 backend listener for the newly-provisioned key is fully started. Any client that chainscreate_new_key→ user-channel NIP-46connect(kind-24133) on the same target key races against the backend startup — when the race is lost the connect event arrives at a relay subscription that hasn't been set up yet, and the bunker is silent until the caller times out.This surfaced as a hard block for
aiolabs/lnbitsPR #33 (issue-32-eager-bind-bunker), which moves NIP-46connectfrom lazy-on-first-sign_eventinto the provisioning chain so the token is redeemed atomically. Every account-create round-trip now hits this race.Repro
Spin up nsecbunkerd, drive a single client that:
create_new_keyover kind-24134 → receivesRESPONSE_SENT(~10ms).create_new_tokenfor the target → receives ack.connectover kind-24133.Observed: kind-24133
connectis delivered to the relay (lnbits'snostrrelayextension logs_handle_eventon the right kind) but the bunker has no listener subscribed for that target's pubkey yet, so the event is dropped. The client side times out after its configured deadline. Fromaiolabs/lnbitsPR #33 with a 15s deadline:Cross-check on the bunker side: zero
REQUEST_IN method=connectlog lines for any kind-24133 event in the same window — the backend genuinely never saw the request.Root cause
src/daemon/run.ts:loadNsecis synchronous and fires-and-forgetsstartKey. So whensrc/daemon/admin/commands/create_new_key.ts:35does:the await is a no-op —
loadNsecalready returnedundefined. The very next line ships the response to the caller. The backend'sstart()(which is whereNDKNip46Backend.start()registers the relay subscription for kind-24133 events tagged#p: target_pubkey) is still running in the background, finishes some milliseconds later, but by then the caller may have already published itsconnectevent and started its 15s window.unlockKey(run.ts:253-263) has the same pattern.By contrast,
startKeys(run.ts:195-203) callsawait this.startKey(name, nsec);correctly — that's why the existing-keys-loaded-at-startup path works and the post-provisioning path doesn't.Proposed fix
One-line change in two places:
create_new_key.ts:35already awaitsloadNsec, so promotingloadNsecto async closes the race without any caller-side changes.Backwards-compat:
loadNsecwas only awaited at the call site above; all other callers don't depend on the timing.Why this didn't bite pre-PR-#33
Pre-PR-#33 lnbits provisioning did the four admin RPCs at create-account time (
create_new_key,_ensure_policy,create_new_token,get_key_tokens) and persisted the token, but DEFERRED the NIP-46 user-channel connect to firstsign_event— which only happens after a real user action, minutes-to-hours later, well outside any plausible race window with backend startup. The first sign_event always succeeded. Subsequent ones failed withToken already redeemed(the bug fixed by PR #33), but the connect itself reached the backend.PR #33 moves the connect into the same HTTP request as
create_new_key. The race window collapses to milliseconds, and now we lose every time.Evidence trail
aiolabs/lnbitsPR #33: https://git.atitlan.io/aiolabs/lnbits/pulls/33aiolabs/lnbitsissue #32: https://git.atitlan.io/aiolabs/lnbits/issues/32~/dev/coordination/log.mdentries2026-05-30T07:00Z(original bug report),2026-05-30T08:05Z(PR #33 fix announcement),2026-05-30T08:20Z(regtest stack pointed at the fix branch + smoke-test ask).regtest-nsecbunker-1: bunker logs showRESPONSE_SENT id=... kind=24134 elapsed=10msfor every admin RPC, but zero kind-24133REQUEST_IN method=connectlines aftercreate_new_keyruns. The relay (lnbits/nostrrelay) carries the connect event; the bunker just isn't subscribed yet.Update: race is real but latent (flake, not hard block)
Retested against the same bohm regtest stack after
aiolabs/lnbitsPR #33 was rebased ontodev(headb1694186) and the CORS allowlist landed via PR #31. Drove demo-account creation fromhttp://192.168.0.32:5181/login. Account creation succeeded end-to-end this time:Connect-then-sign-event handled cleanly for the freshly-provisioned key.
So I should downgrade the characterization in the original report: this isn't a deterministic block on PR #33's eager-bind path the way I described. The structural issue is still present in the code —
loadNsecIS sync,loadNsecIS not awaiting the asyncstartKey, theawaitatcreate_new_key.ts:35IS a no-op — and the timing window for the race is real (backend subscription is established insidebackend.start(), which can finish aftercreate_new_keyreturns). But in practice the relay subscription often completes before the client's connect event arrives, so the race resolves in our favor on most runs.What that means for the fix priority:
async/awaitfix. The race is a flake hazard, especially under load, on slower machines, or under conditions where the relay subscription handshake takes longer than usual (cold WebSocket, GC pause, etc.). A test suite that exercises this path will probably eventually catch the timeout intermittently.Repro for the fix's regression test: I'd suggest adding an integration test that drives
create_new_keyfollowed immediately by NIP-46connectagainst the new key (no artificial sleep), and assert connect returns within a tight bound (e.g. 500ms). Run it repeatedly (e.g. 100 iterations) — if the race ever loses, this catches it.Filed-as-blocker mischaracterization on
~/dev/coordination/log.mdentry2026-05-30T08:50Zis now corrected in a follow-up08:55Zentry.Fix in #10
Opened PR #10 with the fix: https://git.atitlan.io/aiolabs/nsecbunkerd/pulls/10
Correction to the previous comment
My
08:48Zcomment downgraded this from "hard blocker" to "latent flake" based on a single successful retry. That conclusion was wrong — subsequent retries reproduced the timeout deterministically. The hard-blocker characterization from the original report was correct.The deeper diagnosis from working the fix:
Layer 1 (
loadNsec/unlockKeynot awaitingstartKey) — real, but the surface symptom. Even after promoting both toasync+await, the connect still timed out.Layer 2 (
NDKNip46Backend.start()returns before relay registers the subscription) — the actual race-closer. Inspecting the running NDK2.8.1:this.ndk.subscribe(...)returns immediately — theNDKSubscriptionqueues a REQ on the relay connection but the relay's EOSE hasn't arrived yet. Layer 1's await was awaiting the wrong thing.The fix in #10 overrides
start()in ourBackendsubclass to await EOSE before resolving. The lucky 08:44Z success that triggered my flake-mischaracterization was almost certainly a case where EOSE happened to land before lnbits's client published its connect event — but the window is sub-100ms in the happy case and we lose it under any kind of jitter (cold WebSocket, GC pause, NDK pool churn). End-to-end verified on bohm regtest: with #10 applied, fresh demo-account creation + a follow-up event-publish round-trip both complete in ~3s with noNsecBunkerTimeoutError/ noToken already redeemed/ no silent publish failure.Closing once #10 merges.