Watchdog reconnect re-opens the socket but never replays subscriptions → bunker goes silently deaf after a relay flap #41

Open
opened 2026-06-23 22:19:02 +00:00 by padreug · 0 comments
Owner

Summary

After a relay disconnect/flap, attachIndefiniteReconnect (src/daemon/lib/relay-reconnect.ts) re-establishes the WebSocket via relay.connect(), but the NIP-46 subscriptions are never replayed. The bunker ends up connected but deaf: it holds an open socket to the relay yet has no active REQ, so every incoming signing request (NIP-46 kind-24133) is published to a relay the bunker is no longer subscribed to and is never forwarded to it. All sign_event / nip44_encrypt / etc. requests time out client-side (no response in 10000ms) until the daemon is manually restarted. This affects every key, not just one binding.

Hit in production (demo, 2026-06-23)

  • nsecbunkerd 0.10.5 on aio-demo, single backend relay ws://127.0.0.1:5000/nostrrelay/demo.
  • At 19:55:47Z the relay dropped; the log shows the watchdog kicking in:
    🚫 Disconnected from ws://127.0.0.1:5000/nostrrelay/demo/
    🔁 backend: scheduling reconnect ... (attempt 1, overriding NDK give-up)
    ✅ Connected to ws://127.0.0.1:5000/nostrrelay/demo/
    ✅ backend: recovered ... after 1 manual reconnect attempt(s)
    
  • After that line the daemon logged nothing for any key — total silence — despite a remote NIP-46 client (a bitSpire ATM) reconnecting and retrying nip44_encrypt every ~10s for ~2.5 hours. The client only ever saw BunkerTimeoutError: bunker nip44_encrypt: no response in 10000ms ("signer unreachable").
  • Confirmed the request never reached the daemon (restarted the client, watched journalctl -u nsecbunkerd — zero new entries).
  • Restarting nsecbunkerd fully recovered it: fresh ✅ Connected, 🔓 autounlock: unlocked … for all keys, subscriptions re-established. The exact same client + seed + pubkey then worked. Nothing about the keys, tokens, bindings, or seed was wrong — purely the stale post-reconnect subscription.

Why the current watchdog doesn't cover this

relay-reconnect.ts solves connectivity (NDK 3.x gives up after isFlapping(), #20) by manually calling relay.connect(). But when NDK declares a relay FLAPPING it has already torn down / stopped tracking the relay's NDKSubscriptions, so re-opening the socket alone does not re-issue the REQs. The socket is healthy; the subscription set is empty.

Suggested fix

On the relay:connect recovery path (where n > 0, i.e. we just recovered via a manual reconnect), re-attach the daemon's active subscriptions to that relay: either re-run the per-key NIP-46 subscription setup, or explicitly trigger NDK's subscription replay for the recovered relay instead of assuming NDK does it on connect. This needs to cover both the admin subscription and every unlocked key's request subscription.
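
A minimal sketch of the replay-on-recovery idea, not the daemon's actual code. It assumes the daemon keeps its own registry of wanted subscriptions instead of relying on NDK's bookkeeping. RelayLike, SubscriptionRegistry, onRelayConnected and manualAttempts are hypothetical names introduced for illustration; the real wiring would go through NDK's relay and subscription objects.

```typescript
// Hypothetical minimal shapes; the real daemon would use NDK's relay and
// subscription objects instead of these.
type Filter = { kinds: number[]; "#p": string[] };

interface RelayLike {
  url: string;
  // Sends a REQ for the given filter over the currently open socket.
  sendReq(subId: string, filter: Filter): void;
}

// Registry of every subscription the daemon wants to keep alive
// (the admin subscription plus one per unlocked key).
class SubscriptionRegistry {
  private readonly wanted = new Map<string, Filter>();

  register(subId: string, filter: Filter): void {
    this.wanted.set(subId, filter);
  }

  unregister(subId: string): void {
    this.wanted.delete(subId);
  }

  // Re-issue every wanted REQ on the given relay; returns how many were sent.
  replayOn(relay: RelayLike): number {
    this.wanted.forEach((filter, subId) => relay.sendReq(subId, filter));
    return this.wanted.size;
  }
}

// Called from the connect handler. `manualAttempts` plays the role of `n`
// in this issue: > 0 means the watchdog, not NDK, restored the socket.
function onRelayConnected(
  relay: RelayLike,
  registry: SubscriptionRegistry,
  manualAttempts: number,
): number {
  if (manualAttempts === 0) return 0; // ordinary connect: nothing to replay here
  return registry.replayOn(relay);
}
```

Keeping the registry on the daemon side means the replay does not depend on whatever state NDK retains after it marks a relay as flapping. The cost is that unlock and lock paths would have to call register and unregister so the registry stays in sync with the set of unlocked keys.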

Acceptance: after a forced relay flap (kill the relay, let it come back), the bunker resumes serving signing requests without a manual restart. Add a regression/integration test that flaps the relay mid-session and asserts a subsequent sign_event still gets answered.
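
The invariant behind that acceptance test can be pinned down with an in-memory model before wiring up a real relay. The sketch below is an assumption-laden stand-in, not the proposed integration test: FakeRelay and deliveredAfterFlap are hypothetical names, and the model only captures the fact that a relay scopes subscriptions to a connection, so a fresh socket starts with none.

```typescript
// In-memory stand-in for a relay: it only delivers an event to the bunker
// if a REQ is currently active on the connection. All names are hypothetical.
class FakeRelay {
  private connected = false;
  private readonly activeReqs = new Set<string>();

  connect(): void {
    this.connected = true; // a fresh socket starts with no subscriptions
  }

  drop(): void {
    this.connected = false;
    this.activeReqs.clear(); // subscriptions do not survive the closed socket
  }

  req(subId: string): void {
    if (!this.connected) throw new Error("socket is closed");
    this.activeReqs.add(subId);
  }

  // True if a published signing request would reach the bunker.
  delivers(subId: string): boolean {
    return this.connected && this.activeReqs.has(subId);
  }
}

// Flap the relay once and report whether requests are delivered afterwards.
function deliveredAfterFlap(replaySubscriptions: boolean): boolean {
  const relay = new FakeRelay();
  const subId = "nip46-requests";
  relay.connect();
  relay.req(subId);

  relay.drop();    // relay goes away
  relay.connect(); // watchdog re-opens the socket
  if (replaySubscriptions) relay.req(subId); // the proposed fix

  return relay.delivers(subId);
}
```

With replaySubscriptions set to false the model reproduces the "connected but deaf" state described in this issue; with true, delivery resumes. A real regression test would still need to kill and restart an actual relay process and check that a sign_event round-trip succeeds.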

Severity

High for any deployment whose backend relay can blip (which is all of them — a co-located lnbits restart drops the relay, exactly the #20 scenario). The failure is silent: connected status looks healthy, signing just stops. A bitSpire ATM sat in "signer unreachable" maintenance for hours because of it.

refs: src/daemon/lib/relay-reconnect.ts, #20 (the reconnect watchdog this extends); demo incident 2026-06-23 (spire-4zVmKnVB7Mqp49MH2AQu4c / 13a9b624…).

Reference
aiolabs/nsecbunkerd#41