Watchdog reconnect re-opens the socket but never replays subscriptions → bunker goes silently deaf after a relay flap #41
Summary
After a relay disconnect/flap, attachIndefiniteReconnect (src/daemon/lib/relay-reconnect.ts) re-establishes the WebSocket via relay.connect(), but the NIP-46 subscriptions are never replayed. The bunker ends up connected but deaf: it holds an open socket to the relay yet has no active REQ, so every incoming signing request (NIP-46 kind-24133) is delivered to a relay the bunker is no longer subscribed on. All sign_event / nip44_encrypt / etc. requests time out client-side (no response in 10000ms) until the daemon is manually restarted. This affects every key, not one binding.
Hit in production (demo, 2026-06-23)
- nsecbunkerd 0.10.5 on aio-demo, single backend relay ws://127.0.0.1:5000/nostrrelay/demo.
- nip44_encrypt was requested every ~10s for ~2.5 hours. The client only ever saw BunkerTimeoutError: bunker nip44_encrypt: no response in 10000ms ("signer unreachable").
- The bunker logged nothing during that window (journalctl -u nsecbunkerd: zero new entries).
- Restarting nsecbunkerd fully recovered it: fresh ✅ Connected, 🔓 autounlock: unlocked … for all keys, subscriptions re-established. The exact same client + seed + pubkey then worked. Nothing about the keys, tokens, bindings, or seed was wrong; the only fault was the stale post-reconnect subscription.
Why the current watchdog doesn't cover this
relay-reconnect.ts solves connectivity (NDK 3.x gives up after isFlapping(), #20) by manually calling relay.connect(). But when NDK declares a relay FLAPPING it has already torn down / stopped tracking the relay's NDKSubscriptions, so re-opening the socket alone does not re-issue the REQs. The socket is healthy; the subscription set is empty.
Suggested fix
On the relay:connect recovery path (where n > 0, i.e. we just recovered from a manual reconnect), re-attach the daemon's active subscriptions to that relay, e.g. re-run the per-key NIP-46 subscription setup, or call NDK's subscription-replay for the recovered relay rather than relying on NDK to do it. Needs to cover both the admin subscription and every unlocked key's request subscription.
Acceptance: after a forced relay flap (kill the relay, let it come back), the bunker resumes serving signing requests without a manual restart. Add a regression/integration test that flaps the relay mid-session and asserts a subsequent sign_event still gets answered.
Severity
High for any deployment whose backend relay can blip (which is all of them — a co-located lnbits restart drops the relay, exactly the #20 scenario). The failure is silent: connected status looks healthy, signing just stops. A bitSpire ATM sat in "signer unreachable" maintenance for hours because of it.
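To make the suggested fix concrete, here is a minimal, dependency-free sketch of the replay idea together with the flap scenario from the acceptance criterion. Everything in it is hypothetical: FakeRelay, SubscriptionRegistry, the subscription ids, and the pubkey strings are stand-ins invented for illustration, not NDK or nsecbunkerd APIs. The real fix would presumably hook the daemon's existing relay:connect handler instead.

```typescript
// Hypothetical model, not NDK: a NIP-01 style filter for p-tagged events.
type Filter = { kinds: number[]; "#p": string[] };

// Stand-in relay: a socket state plus the REQs the relay currently holds.
class FakeRelay {
  connected = false;
  activeReqs: { [id: string]: Filter } = {};
  private listeners: Array<() => void> = [];

  onConnect(cb: () => void): void {
    this.listeners.push(cb);
  }

  // Opens the socket and notifies listeners (analogous to relay:connect).
  connect(): void {
    this.connected = true;
    this.listeners.forEach((cb) => cb());
  }

  // A flap drops the socket and every subscription with it.
  flap(): void {
    this.connected = false;
    this.activeReqs = {};
  }

  req(id: string, filter: Filter): void {
    if (!this.connected) throw new Error("relay not connected");
    this.activeReqs[id] = filter;
  }

  // Would an event of this kind, p-tagged to this pubkey, match any active REQ?
  wouldDeliver(kind: number, pubkey: string): boolean {
    return Object.keys(this.activeReqs).some((id) => {
      const f = this.activeReqs[id];
      return f.kinds.indexOf(kind) !== -1 && f["#p"].indexOf(pubkey) !== -1;
    });
  }
}

// Remembers every subscription the daemon wants and re-issues all of them
// each time the relay reports a (re)connect.
class SubscriptionRegistry {
  private relay: FakeRelay;
  private wanted: { [id: string]: Filter } = {};

  constructor(relay: FakeRelay) {
    this.relay = relay;
    relay.onConnect(() => this.replay());
  }

  add(id: string, filter: Filter): void {
    this.wanted[id] = filter;
    if (this.relay.connected) this.relay.req(id, filter);
  }

  replay(): void {
    Object.keys(this.wanted).forEach((id) => this.relay.req(id, this.wanted[id]));
  }
}

const NIP46_KIND = 24133;
const aliceFilter: Filter = { kinds: [NIP46_KIND], "#p": ["alice-pubkey"] };

// Reported behaviour: reconnect without replay leaves an open but deaf socket.
const bare = new FakeRelay();
bare.connect();
bare.req("key:alice", aliceFilter);
bare.flap();
bare.connect();
const deafWithoutReplay = bare.connected && !bare.wouldDeliver(NIP46_KIND, "alice-pubkey");

// Proposed behaviour: the registry replays admin and per-key REQs on reconnect.
const relay = new FakeRelay();
const registry = new SubscriptionRegistry(relay);
relay.connect();
registry.add("admin", { kinds: [NIP46_KIND], "#p": ["admin-pubkey"] });
registry.add("key:alice", aliceFilter);
relay.flap();
relay.connect();
const servedAfterReplay =
  relay.wouldDeliver(NIP46_KIND, "alice-pubkey") &&
  relay.wouldDeliver(NIP46_KIND, "admin-pubkey");
```

The design choice worth noting is that the registry, not the relay library, is the source of truth for what should be subscribed. That keeps the replay independent of whatever subscription bookkeeping NDK discards when it marks a relay as flapping, and it covers the admin subscription and the per-key subscriptions through the same path.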
refs: src/daemon/lib/relay-reconnect.ts, #20 (the reconnect watchdog this extends); demo incident 2026-06-23 (spire-4zVmKnVB7Mqp49MH2AQu4c / 13a9b624…).