fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20) #22

Merged

padreug merged 1 commit from fix-20-indefinite-relay-reconnect into dev

2026-06-03 16:58:48 +00:00

Author

SHA1

Message

Date

Padreug

a690596b85

fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20)

Docker image / build-and-push-image (push) Has been cancelled

NDK 3.x's per-relay connectivity machine effectively gives up after ~3
fast-fail (ECONNREFUSED) cycles. Three sub-second failures look
identical, so `isFlapping()` (std-dev < 1s) returns true and the relay
transitions to FLAPPING; NDKPool's `handleFlapping` then reschedules with
doubling backoff (5s → 10s → 20s → 40s → 80s …). For nsecbunkerd,
"disconnected for 80+s after every lnbits restart" is the failure mode
users hit on the regtest dev stack: the bunker container boots before
lnbits's nostrrelay extension is accepting WebSockets → ECONNREFUSED
storm → NDK flags the relay as FLAPPING → the bunker stays silently deaf
until a manual restart.
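
To illustrate why near-identical fast failures trip a std-dev check, here
is a minimal sketch. It is not NDK's source: `stdDevMs`,
`looksLikeFlapping`, and `doublingBackoffMs` are hypothetical names, and
the three-sample window and 5s base are assumptions taken from the
figures quoted above.

```typescript
// Hypothetical sketch, NOT NDK's implementation.
// Population standard deviation of a list of durations (milliseconds).
function stdDevMs(samples: number[]): number {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance);
}

// Assumed rule: the last `minSamples` connection durations are "flapping"
// when they differ from each other by less than `thresholdMs` (std-dev).
function looksLikeFlapping(
  durationsMs: number[],
  minSamples = 3,
  thresholdMs = 1000
): boolean {
  if (durationsMs.length < minSamples) return false;
  return stdDevMs(durationsMs.slice(-minSamples)) < thresholdMs;
}

// Unbounded doubling backoff: attempt 0 waits 5s, then 10s, 20s, 40s, 80s, ...
function doublingBackoffMs(attempt: number, baseMs = 5000): number {
  return baseMs * 2 ** attempt;
}
```

Three ECONNREFUSED failures that each take a few milliseconds have a
std-dev far below 1s, so under this assumed rule they are classified as
flapping, and the fifth rescheduled wait is already 80s.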

The symptom is particularly hostile because:
- `relay:connect` fires optimistically; the immediate ECONNREFUSED
  follow-up doesn't propagate to user-facing logs.
- `NSEC_BUNKER_DISABLE_WATCHDOG=1` (the dev-stack default) skips the
  exit-and-restart safety net.
- Manual `docker compose restart nsecbunker` is the only recovery.

Fix: attach a small supervisor (`attachIndefiniteReconnect`) to both
NDK instances (daemon's backend NDK in run.ts, AdminInterface's admin
NDK in admin/index.ts). On `relay:disconnect` or `flapping`, schedule
a manual `relay.connect()` with a SHORT capped delay (1s → 2s → 4s →
8s → 10s, capped at 10s instead of NDK's unbounded doubling). Successful
connect resets the attempt counter so a future disconnect storm starts
fresh.
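
The supervisor described above can be sketched roughly as follows. This
is an illustration under stated assumptions, not the code in this commit:
`RelayLike`, `PoolLike`, `reconnectDelayMs`, and
`attachIndefiniteReconnectSketch` are hypothetical stand-ins for the NDK
pool/relay objects and for the real `attachIndefiniteReconnect`, and the
log text is only approximate.

```typescript
// Hypothetical minimal shapes; the real code hooks NDK's pool and relays.
interface RelayLike {
  url: string;
  connect(): Promise<void>;
}
type PoolEvent = "relay:connect" | "relay:disconnect" | "flapping";
interface PoolLike {
  on(event: PoolEvent, cb: (relay: RelayLike) => void): void;
}

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 10000;

// attempt is 1-based: 1s, 2s, 4s, 8s, then capped at 10s.
function reconnectDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}

function attachIndefiniteReconnectSketch(pool: PoolLike, label: string): void {
  const attempts = new Map<string, number>();
  const timers = new Map<string, ReturnType<typeof setTimeout>>();

  const schedule = (relay: RelayLike): void => {
    if (timers.has(relay.url)) return; // at most one pending retry per relay
    const attempt = (attempts.get(relay.url) ?? 0) + 1;
    attempts.set(relay.url, attempt);
    const delay = reconnectDelayMs(attempt);
    console.log(
      `${label}: scheduling reconnect to ${relay.url} in ${delay}ms (attempt ${attempt})`
    );
    const timer = setTimeout(() => {
      timers.delete(relay.url);
      // A rejected connect() schedules the next attempt; never give up.
      relay.connect().catch(() => schedule(relay));
    }, delay);
    timers.set(relay.url, timer);
  };

  pool.on("relay:disconnect", schedule);
  pool.on("flapping", schedule);
  pool.on("relay:connect", (relay) => {
    // Success: reset the counter so the next storm starts again at 1s.
    attempts.delete(relay.url);
    const pending = timers.get(relay.url);
    if (pending !== undefined) {
      clearTimeout(pending);
      timers.delete(relay.url);
    }
  });
}
```

Keeping a single pending timer per relay URL means overlapping
`relay:disconnect` and `flapping` events for the same relay do not stack
retries.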

Coexists cleanly with the relay-connection watchdog (admin/index.ts:500):
- Brief disconnects (e.g. lnbits restart): supervisor recovers within
  seconds, watchdog never fires.
- Persistent disconnects (relay truly down): supervisor keeps trying
  every ≤10s; if it can't recover within 60s, watchdog still exits and
  the process supervisor restarts the bunker. So the watchdog becomes
  a long-tail safety net; this supervisor handles the common case.
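
As a rough check of how the two mechanisms interleave, the sketch below
lists when retries would be scheduled inside the 60s watchdog window.
`retryTimesWithinMs` is a hypothetical helper; it assumes the 1s base and
10s cap above and ignores the time each `connect()` attempt itself takes.

```typescript
// Hypothetical helper: cumulative times (ms since the disconnect) at which
// capped-backoff retries fire before `windowMs` elapses.
function retryTimesWithinMs(
  windowMs: number,
  baseMs = 1000,
  capMs = 10000
): number[] {
  const times: number[] = [];
  let elapsed = 0;
  for (let attempt = 1; ; attempt++) {
    elapsed += Math.min(baseMs * 2 ** (attempt - 1), capMs);
    if (elapsed > windowMs) break;
    times.push(elapsed);
  }
  return times;
}

// Retries at 1s, 3s, 7s, 15s, 25s, 35s, 45s and 55s: eight attempts
// before a 60s watchdog would exit the process.
const withinWatchdogWindow = retryTimesWithinMs(60000);
```

Under these assumptions the supervisor gets eight chances to recover
before the watchdog takes over.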

Operators with `NSEC_BUNKER_DISABLE_WATCHDOG=1` set as a workaround for
this bug can re-enable the watchdog once this lands.

Trade-off: we may hammer a permanently-down relay every 10s. Acceptable
because the bunker's primary relay is typically on the same host or LAN
(loopback or docker-internal); TCP RSTs are cheap. Public-relay setups
can layer external supervision on top.

Verified on the regtest dev stack (cold-boot race): the bunker now logs
  🔁 admin: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
  🔁 backend: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
on each disconnect; before the fix, it stayed silently deaf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-06-03 18:55:55 +02:00