Per-key Backend kind:24133 subscription sometimes fails to register on relay after a fresh boot #21

New issue

Closed

opened 2026-06-03 16:23:31 +00:00 by padreug · 0 comments

padreug commented

2026-06-03 16:23:31 +00:00

Owner

Symptom

After a clean bunker boot where the admin NDK connection lands successfully and nsecBunker ready to serve requests fires, a subsequent create_new_key → loadNsec → startKey → new Backend(...) → backend.start() chain completes (the admin RPC response is sent back to lnbits within ~10–20 ms), but the per-key subscription {kinds:[24133], #p:[localUser.pubkey]} set up by Backend.start() never appears on the relay's incoming-REQ list.

Net effect: lnbits's subsequent NIP-46 calls against the new key (connect, get_public_key, sign_event, nip44_*) sometimes go through, sometimes timeout silently because the bunker isn't actually subscribed to receive them. In our reproduction the admin subscription registered but every per-key Backend subscription was lost.

Evidence

Patched lnbits/extensions/nostrrelay/relay/client_connection.py:_handle_request to log every incoming REQ filter:

logger.info(
    f"[NOSTRRELAY REQ] sub={subscription_id!r} "
    f"filter={nostr_filter.dict(exclude_none=True, exclude_defaults=True)}"
)

After a signup that provisioned 1 new bunker key, the bunker's WS connection (172.18.0.3 in our compose network) only registered:

[NOSTRRELAY REQ] sub='ntdsml-n1fmy' filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]}
[NOSTRRELAY REQ] sub='4ndw-5icz8'   filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]}

Both are the admin subscription (AdminInterface.connect() in src/daemon/admin/index.ts:145). The expected per-key Backend filter — {kinds:[24133], #p:[3114902068934b…new_user_key]} from Backend.start() in src/daemon/backend/index.ts:48 — never appears.

Meanwhile relay logs show lnbits publishing kind:24133 events tagged with the new user's pubkey:

nostr event: [24133, fa50d88a5c43…lnbits_session_client, ...]
[NOSTRRELAY CLIENT] ❌ Filter didn't match for event ...

— the relay tries to route to subscribed clients, finds no match for ["p", new_user_pubkey], and drops.

Hypotheses

Race against the shared NDK pool's relay-state machine. Backend.start() awaits EOSE on its own subscription, but this.ndk is the same NDK instance used by the AdminInterface. If the relay-connectivity machine is in a "FLAPPING" or post-error state from earlier WebSocket churn (see #20 — the NDK retry-give-up bug), ndk.subscribe(...) may register internally but never flush a REQ to the wire.
NDK 3.x outbox-routing interception. NDK 3.x routes subscriptions via outbox lookup by default. Without an explicit pool override, the Backend's subscription may be queued waiting for the new user pubkey's relay-list event (which doesn't exist) and never sent to the configured explicit relay.
WebSocket reuse across Backend instances — the shared ndk.pool re-uses one socket for all per-key Backends. If that socket flapped at any point before a given Backend's REQ would have been sent, the queued REQ may have been lost without retry.

aio-demo doesn't reproduce this because its bunker has long-running uptime and never has its admin connection flapping. The bug surfaces locally where the dev cycle restarts lnbits often.

Suggested investigation

Pass relaySet explicitly to this.ndk.subscribe(...) in Backend.start() (a NDKRelaySet containing just the configured admin relay) so the subscription bypasses outbox routing.
Verify with DEBUG=ndk:* that ndk:subscription-manager emits a "send REQ" event for the new sub to the live relay socket. Cross-check the relay-side REQ log to confirm receipt.
If NDK's subscription queue stays in PENDING after the EOSE-await Promise resolves, that's the bug.

Context

nsecbunkerd dev branch at commit 131f689
NDK 3.0.3 (per #14)
aiolabs/lnbits#46 added eager default-merchant provisioning which is the practical trigger — every signup hits sign_event on the newly-provisioned key.
Related: aiolabs/lnbits PR #48 fixes a separate DEFAULT_POLICY_RULES kind-number bug that masked this routing bug.
Related: #20 (NDK retry-give-up after ECONNREFUSED) creates the FLAPPING relay state that triggers this.

🤖 Generated with Claude Code

## Symptom After a clean bunker boot where the admin NDK connection lands successfully and `nsecBunker ready to serve requests` fires, a subsequent `create_new_key` → `loadNsec` → `startKey` → `new Backend(...)` → `backend.start()` chain completes (the admin RPC response is sent back to lnbits within ~10–20 ms), **but the per-key subscription `{kinds:[24133], #p:[localUser.pubkey]}` set up by `Backend.start()` never appears on the relay's incoming-REQ list.** Net effect: lnbits's subsequent NIP-46 calls against the new key (`connect`, `get_public_key`, `sign_event`, `nip44_*`) sometimes go through, sometimes timeout silently because the bunker isn't actually subscribed to receive them. In our reproduction the admin subscription registered but every per-key Backend subscription was lost. ## Evidence Patched `lnbits/extensions/nostrrelay/relay/client_connection.py:_handle_request` to log every incoming REQ filter: ```python logger.info( f"[NOSTRRELAY REQ] sub={subscription_id!r} " f"filter={nostr_filter.dict(exclude_none=True, exclude_defaults=True)}" ) ``` After a signup that provisioned 1 new bunker key, the bunker's WS connection (172.18.0.3 in our compose network) only registered: ``` [NOSTRRELAY REQ] sub='ntdsml-n1fmy' filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]} [NOSTRRELAY REQ] sub='4ndw-5icz8' filter={'p': ['fb90174c…bunker_admin'], 'kinds': [24133, 24134]} ``` Both are the **admin** subscription (`AdminInterface.connect()` in `src/daemon/admin/index.ts:145`). The expected per-key Backend filter — `{kinds:[24133], #p:[3114902068934b…new_user_key]}` from `Backend.start()` in `src/daemon/backend/index.ts:48` — **never appears**. Meanwhile relay logs show lnbits publishing kind:24133 events tagged with the new user's pubkey: ``` nostr event: [24133, fa50d88a5c43…lnbits_session_client, ...] [NOSTRRELAY CLIENT] ❌ Filter didn't match for event ... ``` — the relay tries to route to subscribed clients, finds no match for `["p", new_user_pubkey]`, and drops. ## Hypotheses 1. **Race against the shared NDK pool's relay-state machine.** `Backend.start()` awaits EOSE on its own subscription, but `this.ndk` is the same NDK instance used by the AdminInterface. If the relay-connectivity machine is in a "FLAPPING" or post-error state from earlier WebSocket churn (see #20 — the NDK retry-give-up bug), `ndk.subscribe(...)` may register internally but never flush a REQ to the wire. 2. **NDK 3.x outbox-routing interception.** NDK 3.x routes subscriptions via outbox lookup by default. Without an explicit pool override, the Backend's subscription may be queued waiting for the new user pubkey's relay-list event (which doesn't exist) and never sent to the configured explicit relay. 3. **WebSocket reuse across Backend instances** — the shared `ndk.pool` re-uses one socket for all per-key Backends. If that socket flapped at any point before a given Backend's REQ would have been sent, the queued REQ may have been lost without retry. aio-demo doesn't reproduce this because its bunker has long-running uptime and never has its admin connection flapping. The bug surfaces locally where the dev cycle restarts lnbits often. ## Suggested investigation - Pass `relaySet` explicitly to `this.ndk.subscribe(...)` in `Backend.start()` (a `NDKRelaySet` containing just the configured admin relay) so the subscription bypasses outbox routing. - Verify with `DEBUG=ndk:*` that `ndk:subscription-manager` emits a "send REQ" event for the new sub to the live relay socket. Cross-check the relay-side REQ log to confirm receipt. - If NDK's subscription queue stays in `PENDING` after the EOSE-await Promise resolves, that's the bug. ## Context - nsecbunkerd dev branch at commit `131f689` - NDK 3.0.3 (per #14) - aiolabs/lnbits#46 added eager default-merchant provisioning which is the practical trigger — every signup hits `sign_event` on the newly-provisioned key. - Related: aiolabs/lnbits PR #48 fixes a separate `DEFAULT_POLICY_RULES` kind-number bug that masked this routing bug. - Related: #20 (NDK retry-give-up after ECONNREFUSED) creates the FLAPPING relay state that triggers this. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

padreug referenced this issue

2026-06-03 16:58:07 +00:00

fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20) #22

padreug referenced this issue from a commit

2026-06-03 17:05:08 +00:00

fix(backend): pin per-key kind:24133 subscription to explicit relays (#21)

padreug referenced this issue from a pull request that will close it,

2026-06-03 17:05:42 +00:00

fix(backend): pin per-key kind:24133 subscription to explicit relays (#21) #23

padreug closed this issue

2026-06-03 17:07:23 +00:00

padreug referenced this issue from a commit