aiolabs/nsecbunkerd

Fork 0

create_new_key races: backend listener not ready when caller proceeds to NIP-46 connect #9

New issue

Closed

opened 2026-05-30 08:42:56 +00:00 by padreug · 2 comments

padreug commented

2026-05-30 08:42:56 +00:00

Owner

Summary

create_new_key returns its response BEFORE the NIP-46 backend listener for the newly-provisioned key is fully started. Any client that chains create_new_key → user-channel NIP-46 connect (kind-24133) on the same target key races against the backend startup — when the race is lost the connect event arrives at a relay subscription that hasn't been set up yet, and the bunker is silent until the caller times out.

This surfaced as a hard block for aiolabs/lnbits PR #33 (issue-32-eager-bind-bunker), which moves NIP-46 connect from lazy-on-first-sign_event into the provisioning chain so the token is redeemed atomically. Every account-create round-trip now hits this race.

Repro

Spin up nsecbunkerd, drive a single client that:

Sends admin create_new_key over kind-24134 → receives RESPONSE_SENT (~10ms).
Sends admin create_new_token for the target → receives ack.
Immediately opens a fresh NIP-46 channel keyed to that brand-new target and sends connect over kind-24133.

Observed: kind-24133 connect is delivered to the relay (lnbits's nostrrelay extension logs _handle_event on the right kind) but the bunker has no listener subscribed for that target's pubkey yet, so the event is dropped. The client side times out after its configured deadline. From aiolabs/lnbits PR #33 with a 15s deadline:

no NIP-46 response for 'connect' within 15.0s

Cross-check on the bunker side: zero REQUEST_IN method=connect log lines for any kind-24133 event in the same window — the backend genuinely never saw the request.

Root cause

src/daemon/run.ts:

loadNsec(keyName: string, nsec: string) {           // line 265
    this.activeKeys[keyName] = nsec;
    this.startKey(keyName, nsec);                   // line 268 — promise discarded
}

async startKey(name: string, nsec: string) {         // line 227 — async
    ...
    const backend = new Backend(this.ndk, this.fastify, hexpk, cb, this.config.baseUrl);
    await backend.start();                           // line 250 — relay subscription happens inside
}

loadNsec is synchronous and fires-and-forgets startKey. So when src/daemon/admin/commands/create_new_key.ts:35 does:

await admin.loadNsec(keyName, nsec);

the await is a no-op — loadNsec already returned undefined. The very next line ships the response to the caller. The backend's start() (which is where NDKNip46Backend.start() registers the relay subscription for kind-24133 events tagged #p: target_pubkey) is still running in the background, finishes some milliseconds later, but by then the caller may have already published its connect event and started its 15s window.

unlockKey (run.ts:253-263) has the same pattern.

By contrast, startKeys (run.ts:195-203) calls await this.startKey(name, nsec); correctly — that's why the existing-keys-loaded-at-startup path works and the post-provisioning path doesn't.

Proposed fix

One-line change in two places:

async loadNsec(keyName: string, nsec: string) {
    this.activeKeys[keyName] = nsec;
    await this.startKey(keyName, nsec);
}

async unlockKey(keyName: string, passphrase: string): Promise<boolean> {
    ...
    this.activeKeys[keyName] = nsec;
    await this.startKey(keyName, nsec);
    return true;
}

create_new_key.ts:35 already awaits loadNsec, so promoting loadNsec to async closes the race without any caller-side changes.

Backwards-compat: loadNsec was only awaited at the call site above; all other callers don't depend on the timing.

Why this didn't bite pre-PR-#33

Pre-PR-#33 lnbits provisioning did the four admin RPCs at create-account time (create_new_key, _ensure_policy, create_new_token, get_key_tokens) and persisted the token, but DEFERRED the NIP-46 user-channel connect to first sign_event — which only happens after a real user action, minutes-to-hours later, well outside any plausible race window with backend startup. The first sign_event always succeeded. Subsequent ones failed with Token already redeemed (the bug fixed by PR #33), but the connect itself reached the backend.

PR #33 moves the connect into the same HTTP request as create_new_key. The race window collapses to milliseconds, and now we lose every time.

Evidence trail

aiolabs/lnbits PR #33: https://git.atitlan.io/aiolabs/lnbits/pulls/33
aiolabs/lnbits issue #32: https://git.atitlan.io/aiolabs/lnbits/issues/32
Cross-session coordination log: ~/dev/coordination/log.md entries 2026-05-30T07:00Z (original bug report), 2026-05-30T08:05Z (PR #33 fix announcement), 2026-05-30T08:20Z (regtest stack pointed at the fix branch + smoke-test ask).
Concretely on regtest-nsecbunker-1: bunker logs show RESPONSE_SENT id=... kind=24134 elapsed=10ms for every admin RPC, but zero kind-24133 REQUEST_IN method=connect lines after create_new_key runs. The relay (lnbits/nostrrelay) carries the connect event; the bunker just isn't subscribed yet.

## Summary `create_new_key` returns its response BEFORE the NIP-46 backend listener for the newly-provisioned key is fully started. Any client that chains `create_new_key` → user-channel NIP-46 `connect` (kind-24133) on the same target key races against the backend startup — when the race is lost the connect event arrives at a relay subscription that hasn't been set up yet, and the bunker is silent until the caller times out. This surfaced as a hard block for `aiolabs/lnbits` PR #33 (`issue-32-eager-bind-bunker`), which moves NIP-46 `connect` from lazy-on-first-`sign_event` into the provisioning chain so the token is redeemed atomically. Every account-create round-trip now hits this race. ## Repro Spin up nsecbunkerd, drive a single client that: 1. Sends admin `create_new_key` over kind-24134 → receives `RESPONSE_SENT` (~10ms). 2. Sends admin `create_new_token` for the target → receives ack. 3. Immediately opens a fresh NIP-46 channel keyed to that brand-new target and sends `connect` over kind-24133. Observed: kind-24133 `connect` is delivered to the relay (lnbits's `nostrrelay` extension logs `_handle_event` on the right kind) but the bunker has no listener subscribed for that target's pubkey yet, so the event is dropped. The client side times out after its configured deadline. From `aiolabs/lnbits` PR #33 with a 15s deadline: ``` no NIP-46 response for 'connect' within 15.0s ``` Cross-check on the bunker side: zero `REQUEST_IN method=connect` log lines for any kind-24133 event in the same window — the backend genuinely never saw the request. ## Root cause `src/daemon/run.ts`: ```ts loadNsec(keyName: string, nsec: string) { // line 265 this.activeKeys[keyName] = nsec; this.startKey(keyName, nsec); // line 268 — promise discarded } async startKey(name: string, nsec: string) { // line 227 — async ... const backend = new Backend(this.ndk, this.fastify, hexpk, cb, this.config.baseUrl); await backend.start(); // line 250 — relay subscription happens inside } ``` `loadNsec` is synchronous and fires-and-forgets `startKey`. So when `src/daemon/admin/commands/create_new_key.ts:35` does: ```ts await admin.loadNsec(keyName, nsec); ``` the await is a no-op — `loadNsec` already returned `undefined`. The very next line ships the response to the caller. The backend's `start()` (which is where `NDKNip46Backend.start()` registers the relay subscription for kind-24133 events tagged `#p: target_pubkey`) is still running in the background, finishes some milliseconds later, but by then the caller may have already published its `connect` event and started its 15s window. `unlockKey` (`run.ts:253-263`) has the same pattern. By contrast, `startKeys` (`run.ts:195-203`) calls `await this.startKey(name, nsec);` correctly — that's why the existing-keys-loaded-at-startup path works and the post-provisioning path doesn't. ## Proposed fix One-line change in two places: ```ts async loadNsec(keyName: string, nsec: string) { this.activeKeys[keyName] = nsec; await this.startKey(keyName, nsec); } async unlockKey(keyName: string, passphrase: string): Promise<boolean> { ... this.activeKeys[keyName] = nsec; await this.startKey(keyName, nsec); return true; } ``` `create_new_key.ts:35` already awaits `loadNsec`, so promoting `loadNsec` to async closes the race without any caller-side changes. Backwards-compat: `loadNsec` was only awaited at the call site above; all other callers don't depend on the timing. ## Why this didn't bite pre-PR-#33 Pre-PR-#33 lnbits provisioning did the four admin RPCs at create-account time (`create_new_key`, `_ensure_policy`, `create_new_token`, `get_key_tokens`) and persisted the token, but DEFERRED the NIP-46 user-channel connect to first `sign_event` — which only happens after a real user action, minutes-to-hours later, well outside any plausible race window with backend startup. The first sign_event always succeeded. Subsequent ones failed with `Token already redeemed` (the bug fixed by PR #33), but the connect itself reached the backend. PR #33 moves the connect into the same HTTP request as `create_new_key`. The race window collapses to milliseconds, and now we lose every time. ## Evidence trail - `aiolabs/lnbits` PR #33: <https://git.atitlan.io/aiolabs/lnbits/pulls/33> - `aiolabs/lnbits` issue #32: <https://git.atitlan.io/aiolabs/lnbits/issues/32> - Cross-session coordination log: `~/dev/coordination/log.md` entries `2026-05-30T07:00Z` (original bug report), `2026-05-30T08:05Z` (PR #33 fix announcement), `2026-05-30T08:20Z` (regtest stack pointed at the fix branch + smoke-test ask). - Concretely on `regtest-nsecbunker-1`: bunker logs show `RESPONSE_SENT id=... kind=24134 elapsed=10ms` for every admin RPC, but zero kind-24133 `REQUEST_IN method=connect` lines after `create_new_key` runs. The relay (`lnbits/nostrrelay`) carries the connect event; the bunker just isn't subscribed yet.

padreug commented

2026-05-30 08:48:44 +00:00

Author

Owner

Update: race is real but latent (flake, not hard block)

Retested against the same bohm regtest stack after aiolabs/lnbits PR #33 was rebased onto dev (head b1694186) and the CORS allowlist landed via PR #31. Drove demo-account creation from http://192.168.0.32:5181/login. Account creation succeeded end-to-end this time:

🔑 8d312bcf6bb6460f944f72073c3a496c is being requested to connect by npub1s4lamw8c...
🔎 npub1s4lamw8c... is allowed to connect with key 8d312bcf6bb6460f944f72073c3a496c
🔑 8d312bcf6bb6460f944f72073c3a496c is being requested to sign_event by npub1s4lamw8c...
🔎 npub1s4lamw8c... is allowed to sign_event with key 8d312bcf6bb6460f944f72073c3a496c

POST /api/v1/auth/register HTTP/1.1 200

Connect-then-sign-event handled cleanly for the freshly-provisioned key.

So I should downgrade the characterization in the original report: this isn't a deterministic block on PR #33's eager-bind path the way I described. The structural issue is still present in the code — loadNsec IS sync, loadNsec IS not awaiting the async startKey, the await at create_new_key.ts:35 IS a no-op — and the timing window for the race is real (backend subscription is established inside backend.start(), which can finish after create_new_key returns). But in practice the relay subscription often completes before the client's connect event arrives, so the race resolves in our favor on most runs.

What that means for the fix priority:

Still worth landing the one-line async/await fix. The race is a flake hazard, especially under load, on slower machines, or under conditions where the relay subscription handshake takes longer than usual (cold WebSocket, GC pause, etc.). A test suite that exercises this path will probably eventually catch the timeout intermittently.
Not a hard blocker for shipping PR #33. End-to-end smoke succeeds on bohm regtest as-is; PR #33 can merge without waiting on this.
The earlier deterministic-looking failure (15s timeout on every attempt from 08:21Z to 08:30Z) may have been a stuck relay subscription on the bunker side from a previous failed run that the bunker restart cleared, rather than the structural race itself. Hard to confirm retroactively without bunker-side state inspection.

Repro for the fix's regression test: I'd suggest adding an integration test that drives create_new_key followed immediately by NIP-46 connect against the new key (no artificial sleep), and assert connect returns within a tight bound (e.g. 500ms). Run it repeatedly (e.g. 100 iterations) — if the race ever loses, this catches it.

Filed-as-blocker mischaracterization on ~/dev/coordination/log.md entry 2026-05-30T08:50Z is now corrected in a follow-up 08:55Z entry.

## Update: race is real but latent (flake, not hard block) Retested against the same bohm regtest stack after `aiolabs/lnbits` PR #33 was rebased onto `dev` (head `b1694186`) and the CORS allowlist landed via PR #31. Drove demo-account creation from `http://192.168.0.32:5181/login`. **Account creation succeeded end-to-end** this time: ``` 🔑 8d312bcf6bb6460f944f72073c3a496c is being requested to connect by npub1s4lamw8c... 🔎 npub1s4lamw8c... is allowed to connect with key 8d312bcf6bb6460f944f72073c3a496c 🔑 8d312bcf6bb6460f944f72073c3a496c is being requested to sign_event by npub1s4lamw8c... 🔎 npub1s4lamw8c... is allowed to sign_event with key 8d312bcf6bb6460f944f72073c3a496c POST /api/v1/auth/register HTTP/1.1 200 ``` Connect-then-sign-event handled cleanly for the freshly-provisioned key. So I should downgrade the characterization in the original report: this **isn't a deterministic block** on PR #33's eager-bind path the way I described. The structural issue is still present in the code — `loadNsec` IS sync, `loadNsec` IS not awaiting the async `startKey`, the `await` at `create_new_key.ts:35` IS a no-op — and the timing window for the race is real (backend subscription is established inside `backend.start()`, which can finish after `create_new_key` returns). But in practice the relay subscription often completes before the client's connect event arrives, so the race resolves in our favor on most runs. What that means for the fix priority: - **Still worth landing the one-line `async`/`await` fix.** The race is a flake hazard, especially under load, on slower machines, or under conditions where the relay subscription handshake takes longer than usual (cold WebSocket, GC pause, etc.). A test suite that exercises this path will probably eventually catch the timeout intermittently. - **Not a hard blocker for shipping PR #33.** End-to-end smoke succeeds on bohm regtest as-is; PR #33 can merge without waiting on this. - **The earlier deterministic-looking failure** (15s timeout on every attempt from 08:21Z to 08:30Z) may have been a stuck relay subscription on the bunker side from a previous failed run that the bunker restart cleared, rather than the structural race itself. Hard to confirm retroactively without bunker-side state inspection. Repro for the fix's regression test: I'd suggest adding an integration test that drives `create_new_key` followed immediately by NIP-46 `connect` against the new key (no artificial sleep), and assert connect returns within a tight bound (e.g. 500ms). Run it repeatedly (e.g. 100 iterations) — if the race ever loses, this catches it. Filed-as-blocker mischaracterization on `~/dev/coordination/log.md` entry `2026-05-30T08:50Z` is now corrected in a follow-up `08:55Z` entry.

padreug referenced this issue from a commit

2026-05-30 10:25:49 +00:00

fix(#9): close race between create_new_key and NIP-46 connect

padreug referenced this issue from a pull request that will close it,

2026-05-30 10:26:39 +00:00

fix(#9): close race between create_new_key and NIP-46 connect #10

padreug commented

2026-05-30 10:27:05 +00:00

Author

Owner

Fix in #10

Opened PR #10 with the fix: https://git.atitlan.io/aiolabs/nsecbunkerd/pulls/10

Correction to the previous comment

My 08:48Z comment downgraded this from "hard blocker" to "latent flake" based on a single successful retry. That conclusion was wrong — subsequent retries reproduced the timeout deterministically. The hard-blocker characterization from the original report was correct.

The deeper diagnosis from working the fix:

Layer 1 (loadNsec / unlockKey not awaiting startKey) — real, but the surface symptom. Even after promoting both to async + await, the connect still timed out.
Layer 2 (NDKNip46Backend.start() returns before relay registers the subscription) — the actual race-closer. Inspecting the running NDK 2.8.1:
```
async start() {
    this.localUser = await this.signer.user();
    const sub = this.ndk.subscribe(
        { kinds: [24133], "#p": [this.localUser.pubkey] },
        { closeOnEose: false }
    );
    sub.on("event", (e) => this.handleIncomingEvent(e));
}
```
this.ndk.subscribe(...) returns immediately — the NDKSubscription queues a REQ on the relay connection but the relay's EOSE hasn't arrived yet. Layer 1's await was awaiting the wrong thing.

The fix in #10 overrides start() in our Backend subclass to await EOSE before resolving. The lucky 08:44Z success that triggered my flake-mischaracterization was almost certainly a case where EOSE happened to land before lnbits's client published its connect event — but the window is sub-100ms in the happy case and we lose it under any kind of jitter (cold WebSocket, GC pause, NDK pool churn). End-to-end verified on bohm regtest: with #10 applied, fresh demo-account creation + a follow-up event-publish round-trip both complete in ~3s with no NsecBunkerTimeoutError / no Token already redeemed / no silent publish failure.

Closing once #10 merges.

## Fix in #10 Opened PR #10 with the fix: <https://git.atitlan.io/aiolabs/nsecbunkerd/pulls/10> ## Correction to the previous comment My `08:48Z` comment downgraded this from "hard blocker" to "latent flake" based on a single successful retry. That conclusion was wrong — subsequent retries reproduced the timeout deterministically. The hard-blocker characterization from the original report was correct. The deeper diagnosis from working the fix: 1. **Layer 1 (`loadNsec` / `unlockKey` not awaiting `startKey`)** — real, but the surface symptom. Even after promoting both to `async` + `await`, the connect still timed out. 2. **Layer 2 (`NDKNip46Backend.start()` returns before relay registers the subscription)** — the actual race-closer. Inspecting the running NDK `2.8.1`: ```js async start() { this.localUser = await this.signer.user(); const sub = this.ndk.subscribe( { kinds: [24133], "#p": [this.localUser.pubkey] }, { closeOnEose: false } ); sub.on("event", (e) => this.handleIncomingEvent(e)); } ``` `this.ndk.subscribe(...)` returns immediately — the `NDKSubscription` queues a REQ on the relay connection but the relay's EOSE hasn't arrived yet. Layer 1's await was awaiting the wrong thing. The fix in #10 overrides `start()` in our `Backend` subclass to await EOSE before resolving. The lucky 08:44Z success that triggered my flake-mischaracterization was almost certainly a case where EOSE happened to land before lnbits's client published its connect event — but the window is sub-100ms in the happy case and we lose it under any kind of jitter (cold WebSocket, GC pause, NDK pool churn). End-to-end verified on bohm regtest: with #10 applied, fresh demo-account creation + a follow-up event-publish round-trip both complete in ~3s with no `NsecBunkerTimeoutError` / no `Token already redeemed` / no silent publish failure. Closing once #10 merges.

padreug closed this issue

2026-05-30 11:23:45 +00:00

padreug referenced this issue from a commit

2026-05-30 11:23:45 +00:00

Merge pull request 'fix(#9): close race between create_new_key and NIP-46 connect' (#10) from issue-9-fix-create-new-key-race into dev

padreug referenced this issue from a commit

2026-05-31 13:36:24 +00:00

feat(daemon): boot-time autounlock of encrypted keys (#16)

padreug referenced this issue

2026-06-03 16:22:31 +00:00

NDK gives up reconnecting to admin relay after ~3 ECONNREFUSED retries — bunker stays disconnected forever #20