Per-key Backend kind:24133 subscription sometimes fails to register on relay after a fresh boot #21

Closed
opened 2026-06-03 16:23:31 +00:00 by padreug · 0 comments
Owner

Symptom

After a clean bunker boot where the admin NDK connection lands successfully and nsecBunker ready to serve requests fires, a subsequent create_new_keyloadNsecstartKeynew Backend(...)backend.start() chain completes (the admin RPC response is sent back to lnbits within ~10–20 ms), but the per-key subscription {kinds:[24133], #p:[localUser.pubkey]} set up by Backend.start() never appears on the relay's incoming-REQ list.

Net effect: lnbits's subsequent NIP-46 calls against the new key (connect, get_public_key, sign_event, nip44_*) sometimes go through, sometimes timeout silently because the bunker isn't actually subscribed to receive them. In our reproduction the admin subscription registered but every per-key Backend subscription was lost.

Evidence

Patched lnbits/extensions/nostrrelay/relay/client_connection.py:_handle_request to log every incoming REQ filter:

logger.info(
    f"[NOSTRRELAY REQ] sub={subscription_id!r} "
    f"filter={nostr_filter.dict(exclude_none=True, exclude_defaults=True)}"
)

After a signup that provisioned 1 new bunker key, the bunker's WS connection (172.18.0.3 in our compose network) only registered:

[NOSTRRELAY REQ] sub='ntdsml-n1fmy' filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]}
[NOSTRRELAY REQ] sub='4ndw-5icz8'   filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]}

Both are the admin subscription (AdminInterface.connect() in src/daemon/admin/index.ts:145). The expected per-key Backend filter — {kinds:[24133], #p:[3114902068934b…new_user_key]} from Backend.start() in src/daemon/backend/index.ts:48never appears.

Meanwhile relay logs show lnbits publishing kind:24133 events tagged with the new user's pubkey:

nostr event: [24133, fa50d88a5c43…lnbits_session_client, ...]
[NOSTRRELAY CLIENT] ❌ Filter didn't match for event ...

— the relay tries to route to subscribed clients, finds no match for ["p", new_user_pubkey], and drops.

Hypotheses

  1. Race against the shared NDK pool's relay-state machine. Backend.start() awaits EOSE on its own subscription, but this.ndk is the same NDK instance used by the AdminInterface. If the relay-connectivity machine is in a "FLAPPING" or post-error state from earlier WebSocket churn (see #20 — the NDK retry-give-up bug), ndk.subscribe(...) may register internally but never flush a REQ to the wire.

  2. NDK 3.x outbox-routing interception. NDK 3.x routes subscriptions via outbox lookup by default. Without an explicit pool override, the Backend's subscription may be queued waiting for the new user pubkey's relay-list event (which doesn't exist) and never sent to the configured explicit relay.

  3. WebSocket reuse across Backend instances — the shared ndk.pool re-uses one socket for all per-key Backends. If that socket flapped at any point before a given Backend's REQ would have been sent, the queued REQ may have been lost without retry.

aio-demo doesn't reproduce this because its bunker has long-running uptime and never has its admin connection flapping. The bug surfaces locally where the dev cycle restarts lnbits often.

Suggested investigation

  • Pass relaySet explicitly to this.ndk.subscribe(...) in Backend.start() (a NDKRelaySet containing just the configured admin relay) so the subscription bypasses outbox routing.
  • Verify with DEBUG=ndk:* that ndk:subscription-manager emits a "send REQ" event for the new sub to the live relay socket. Cross-check the relay-side REQ log to confirm receipt.
  • If NDK's subscription queue stays in PENDING after the EOSE-await Promise resolves, that's the bug.

Context

  • nsecbunkerd dev branch at commit 131f689
  • NDK 3.0.3 (per #14)
  • aiolabs/lnbits#46 added eager default-merchant provisioning which is the practical trigger — every signup hits sign_event on the newly-provisioned key.
  • Related: aiolabs/lnbits PR #48 fixes a separate DEFAULT_POLICY_RULES kind-number bug that masked this routing bug.
  • Related: #20 (NDK retry-give-up after ECONNREFUSED) creates the FLAPPING relay state that triggers this.

🤖 Generated with Claude Code

## Symptom After a clean bunker boot where the admin NDK connection lands successfully and `nsecBunker ready to serve requests` fires, a subsequent `create_new_key` → `loadNsec` → `startKey` → `new Backend(...)` → `backend.start()` chain completes (the admin RPC response is sent back to lnbits within ~10–20 ms), **but the per-key subscription `{kinds:[24133], #p:[localUser.pubkey]}` set up by `Backend.start()` never appears on the relay's incoming-REQ list.** Net effect: lnbits's subsequent NIP-46 calls against the new key (`connect`, `get_public_key`, `sign_event`, `nip44_*`) sometimes go through, sometimes timeout silently because the bunker isn't actually subscribed to receive them. In our reproduction the admin subscription registered but every per-key Backend subscription was lost. ## Evidence Patched `lnbits/extensions/nostrrelay/relay/client_connection.py:_handle_request` to log every incoming REQ filter: ```python logger.info( f"[NOSTRRELAY REQ] sub={subscription_id!r} " f"filter={nostr_filter.dict(exclude_none=True, exclude_defaults=True)}" ) ``` After a signup that provisioned 1 new bunker key, the bunker's WS connection (172.18.0.3 in our compose network) only registered: ``` [NOSTRRELAY REQ] sub='ntdsml-n1fmy' filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]} [NOSTRRELAY REQ] sub='4ndw-5icz8' filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]} ``` Both are the **admin** subscription (`AdminInterface.connect()` in `src/daemon/admin/index.ts:145`). The expected per-key Backend filter — `{kinds:[24133], #p:[3114902068934b…new_user_key]}` from `Backend.start()` in `src/daemon/backend/index.ts:48` — **never appears**. Meanwhile relay logs show lnbits publishing kind:24133 events tagged with the new user's pubkey: ``` nostr event: [24133, fa50d88a5c43…lnbits_session_client, ...] [NOSTRRELAY CLIENT] ❌ Filter didn't match for event ... ``` — the relay tries to route to subscribed clients, finds no match for `["p", new_user_pubkey]`, and drops. ## Hypotheses 1. **Race against the shared NDK pool's relay-state machine.** `Backend.start()` awaits EOSE on its own subscription, but `this.ndk` is the same NDK instance used by the AdminInterface. If the relay-connectivity machine is in a "FLAPPING" or post-error state from earlier WebSocket churn (see #20 — the NDK retry-give-up bug), `ndk.subscribe(...)` may register internally but never flush a REQ to the wire. 2. **NDK 3.x outbox-routing interception.** NDK 3.x routes subscriptions via outbox lookup by default. Without an explicit pool override, the Backend's subscription may be queued waiting for the new user pubkey's relay-list event (which doesn't exist) and never sent to the configured explicit relay. 3. **WebSocket reuse across Backend instances** — the shared `ndk.pool` re-uses one socket for all per-key Backends. If that socket flapped at any point before a given Backend's REQ would have been sent, the queued REQ may have been lost without retry. aio-demo doesn't reproduce this because its bunker has long-running uptime and never has its admin connection flapping. The bug surfaces locally where the dev cycle restarts lnbits often. ## Suggested investigation - Pass `relaySet` explicitly to `this.ndk.subscribe(...)` in `Backend.start()` (a `NDKRelaySet` containing just the configured admin relay) so the subscription bypasses outbox routing. - Verify with `DEBUG=ndk:*` that `ndk:subscription-manager` emits a "send REQ" event for the new sub to the live relay socket. Cross-check the relay-side REQ log to confirm receipt. - If NDK's subscription queue stays in `PENDING` after the EOSE-await Promise resolves, that's the bug. ## Context - nsecbunkerd dev branch at commit `131f689` - NDK 3.0.3 (per #14) - aiolabs/lnbits#46 added eager default-merchant provisioning which is the practical trigger — every signup hits `sign_event` on the newly-provisioned key. - Related: aiolabs/lnbits PR #48 fixes a separate `DEFAULT_POLICY_RULES` kind-number bug that masked this routing bug. - Related: #20 (NDK retry-give-up after ECONNREFUSED) creates the FLAPPING relay state that triggers this. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
aiolabs/nsecbunkerd#21
No description provided.