Per-key Backend kind:24133 subscription sometimes fails to register on relay after a fresh boot #21
Symptom
After a clean bunker boot where the admin NDK connection lands successfully and `nsecBunker ready to serve requests` fires, a subsequent `create_new_key` → `loadNsec` → `startKey` → `new Backend(...)` → `backend.start()` chain completes (the admin RPC response is sent back to lnbits within ~10–20 ms), but the per-key subscription `{kinds:[24133], #p:[localUser.pubkey]}` set up by `Backend.start()` never appears on the relay's incoming-REQ list.

Net effect: lnbits's subsequent NIP-46 calls against the new key (`connect`, `get_public_key`, `sign_event`, `nip44_*`) sometimes go through and sometimes time out silently because the bunker isn't actually subscribed to receive them. In our reproduction the admin subscription registered but every per-key Backend subscription was lost.

Evidence
Patched `lnbits/extensions/nostrrelay/relay/client_connection.py:_handle_request` to log every incoming REQ filter. After a signup that provisioned 1 new bunker key, the bunker's WS connection (172.18.0.3 in our compose network) registered only two REQs, and both are the admin subscription (`AdminInterface.connect()` in `src/daemon/admin/index.ts:145`). The expected per-key Backend filter, `{kinds:[24133], #p:[3114902068934b…new_user_key]}` from `Backend.start()` in `src/daemon/backend/index.ts:48`, never appears.

Meanwhile, relay logs show lnbits publishing kind:24133 events tagged with the new user's pubkey. The relay tries to route them to subscribed clients, finds no match for `["p", new_user_pubkey]`, and drops them.

Hypotheses
1. Race against the shared NDK pool's relay-state machine. `Backend.start()` awaits EOSE on its own subscription, but `this.ndk` is the same NDK instance used by the AdminInterface. If the relay-connectivity machine is in a "FLAPPING" or post-error state from earlier WebSocket churn (see #20, the NDK retry-give-up bug), `ndk.subscribe(...)` may register internally but never flush a REQ to the wire.
2. NDK 3.x outbox-routing interception. NDK 3.x routes subscriptions via outbox lookup by default. Without an explicit pool override, the Backend's subscription may be queued waiting for the new user pubkey's relay-list event (which doesn't exist) and never sent to the configured explicit relay.
3. WebSocket reuse across Backend instances: the shared `ndk.pool` re-uses one socket for all per-key Backends. If that socket flapped at any point before a given Backend's REQ would have been sent, the queued REQ may have been lost without retry.

aio-demo doesn't reproduce this because its bunker has long uptime and its admin connection never flaps. The bug surfaces locally, where the dev cycle restarts lnbits often.
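
To make hypothesis 3 concrete, here is a toy model of a shared socket that queues REQs while disconnected. It is only a sketch: `ToySharedSocket`, `relaySawSubIds`, the `replayOnReconnect` flag, and the placeholder pubkeys are all hypothetical names for illustration, and none of this reflects NDK's actual internals. It shows how a REQ queued between two flaps can vanish while the client still believes the subscription is live, and how replaying known subscriptions on reconnect would avoid that.

```typescript
// Toy model only: none of these names come from NDK or nsecBunker.
type Filter = { kinds: number[]; "#p": string[] };
type Req = { subId: string; filter: Filter };

class ToySharedSocket {
  private connected = false;
  private queue: Req[] = []; // REQs waiting for the socket to open
  private active = new Map<string, Filter>(); // subs the client believes are live
  private replayOnReconnect: boolean;
  readonly wire: Req[] = []; // every REQ frame the relay actually received

  constructor(replayOnReconnect: boolean) {
    this.replayOnReconnect = replayOnReconnect;
  }

  subscribe(req: Req): void {
    this.active.set(req.subId, req.filter);
    if (this.connected) this.wire.push(req);
    else this.queue.push(req);
  }

  open(): void {
    this.connected = true;
    if (this.replayOnReconnect) {
      // Re-send every subscription the client believes is live.
      for (const [subId, filter] of this.active) this.wire.push({ subId, filter });
    } else {
      // Only flush whatever survived in the queue.
      for (const req of this.queue) this.wire.push(req);
    }
    this.queue = [];
  }

  flap(): void {
    this.connected = false;
    // Suspected failure mode: pending REQs are discarded, never retried.
    if (!this.replayOnReconnect) this.queue = [];
  }
}

// Replays the suspected sequence and returns the sub ids the relay saw, in order.
function relaySawSubIds(replayOnReconnect: boolean): string[] {
  const socket = new ToySharedSocket(replayOnReconnect);
  socket.open();
  // Placeholder filters; the pubkeys are illustrative strings, not real keys.
  socket.subscribe({ subId: "admin", filter: { kinds: [24133], "#p": ["admin_pubkey"] } });
  socket.flap(); // socket drops after the admin REQ went out
  socket.subscribe({ subId: "per-key", filter: { kinds: [24133], "#p": ["new_user_pubkey"] } });
  socket.flap(); // second flap before the queued REQ was flushed
  socket.open();
  return socket.wire.map((r) => r.subId);
}

console.log(relaySawSubIds(false)); // [ 'admin' ]
console.log(relaySawSubIds(true)); // [ 'admin', 'admin', 'per-key' ]
```

In the lossy variant the relay-side log ends up containing only the admin REQ, which is the same shape as the evidence above. Whether NDK's pool behaves like either variant is exactly what the investigation needs to establish.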
Suggested investigation
- Pass `relaySet` explicitly to `this.ndk.subscribe(...)` in `Backend.start()` (an `NDKRelaySet` containing just the configured admin relay) so the subscription bypasses outbox routing.
- Confirm with `DEBUG=ndk:*` that `ndk:subscription-manager` emits a "send REQ" event for the new sub to the live relay socket. Cross-check the relay-side REQ log to confirm receipt.
- […] `PENDING` after the EOSE-await Promise resolves, that's the bug.

Context
- […] `131f689` […] `sign_event` on the newly-provisioned key.
- […] `DEFAULT_POLICY_RULES` kind-number bug that masked this routing bug.

🤖 Generated with Claude Code