fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20) #22

padreug merged 1 commit from fix-20-indefinite-relay-reconnect into dev 2026-06-03 16:58:48 +00:00

a690596b85 fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20)
NDK 3.x's per-relay connectivity machine gives up after ~3 fast-fail
(ECONNREFUSED) cycles. Three sub-second failures look identical, so
`isFlapping()` (std-dev < 1s) returns true and the relay transitions
to FLAPPING; NDKPool's `handleFlapping` then reschedules with doubling
backoff (5s → 10s → 20s → 40s → 80s …). For nsecbunkerd, the failure
mode users hit on the regtest dev stack is "disconnected for 80+s after
every lnbits restart": bunker container boots before lnbits's
nostrrelay extension is accepting WebSockets → ECONNREFUSED storm →
NDK flags the relay FLAPPING → bunker stays silently deaf until manual
restart.
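
To illustrate why fast failures trip the heuristic, here is a minimal
sketch. It is NOT NDK's code: `looksLikeFlapping`, `ndkStyleBackoffMs`,
the three-sample minimum, and the idea that the samples are
connection durations in milliseconds are all assumptions made for
illustration. Only the "std-dev < 1s" threshold and the 5s doubling
sequence come from the description above.

```typescript
// Population standard deviation of a list of millisecond samples.
function stdDevMs(samples: number[]): number {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance);
}

// Hypothetical stand-in for the flapping check: near-identical samples
// (std-dev under the threshold) are treated as flapping.
function looksLikeFlapping(durationsMs: number[], thresholdMs = 1000): boolean {
  if (durationsMs.length < 3) return false; // assumed minimum sample count
  return stdDevMs(durationsMs) < thresholdMs;
}

// Doubling backoff with no cap, matching the 5s → 10s → 20s … sequence.
function ndkStyleBackoffMs(cycle: number): number {
  return 5000 * 2 ** cycle;
}

// Three ECONNREFUSED failures that each die within a few milliseconds.
console.log(looksLikeFlapping([12, 8, 15])); // true
// Fifth reschedule (cycle 4) already waits 80 seconds.
console.log(ndkStyleBackoffMs(4)); // 80000
```

Under these assumptions, any burst of instant connection refusals has
a tiny spread and is indistinguishable from a genuinely flapping relay.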

The symptom is particularly hostile because:
- `relay:connect` fires optimistically; the immediate ECONNREFUSED
  follow-up doesn't propagate to user-facing logs.
- `NSEC_BUNKER_DISABLE_WATCHDOG=1` (the dev-stack default) skips the
  exit-and-restart safety net.
- Manual `docker compose restart nsecbunker` is the only recovery.

Fix: attach a small supervisor (`attachIndefiniteReconnect`) to both
NDK instances (daemon's backend NDK in run.ts, AdminInterface's admin
NDK in admin/index.ts). On `relay:disconnect` or `flapping`, schedule
a manual `relay.connect()` with a SHORT capped delay (1s → 2s → 4s →
8s → 10s, capped at 10s instead of NDK's unbounded doubling). Successful
connect resets the attempt counter so a future disconnect storm starts
fresh.
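
A rough sketch of the supervisor's shape, under stated assumptions. It
is not the code in this commit: `RelayLike` and `PoolLike` are
hypothetical structural types standing in for the NDK relay and pool,
and treating `relay:connect` as the success signal that resets the
counter is an assumption. The event names, `relay.connect()`, and the
1s → 10s capped schedule come from the description above.

```typescript
// Hypothetical minimal shapes so the sketch runs without NDK installed.
interface RelayLike {
  url: string;
  connect(): Promise<void>;
}
interface PoolLike {
  on(event: string, listener: (relay: RelayLike) => void): unknown;
}

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 10_000;

// attempt 1 → 1s, 2 → 2s, 3 → 4s, 4 → 8s, 5 and later → 10s.
function reconnectDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}

function attachIndefiniteReconnectSketch(pool: PoolLike, label: string): void {
  const attempts = new Map<string, number>();
  const timers = new Map<string, ReturnType<typeof setTimeout>>();

  const schedule = (relay: RelayLike): void => {
    if (timers.has(relay.url)) return; // at most one pending reconnect per relay
    const attempt = (attempts.get(relay.url) ?? 0) + 1;
    attempts.set(relay.url, attempt);
    const delay = reconnectDelayMs(attempt);
    console.log(
      `${label}: scheduling reconnect to ${relay.url} in ${delay}ms (attempt ${attempt})`,
    );
    const timer = setTimeout(() => {
      timers.delete(relay.url);
      // A rejected connect simply schedules the next attempt.
      relay.connect().catch(() => schedule(relay));
    }, delay);
    timers.set(relay.url, timer);
  };

  pool.on("relay:disconnect", schedule);
  pool.on("flapping", schedule);

  // Assumed success signal: clear pending work and reset the counter.
  pool.on("relay:connect", (relay) => {
    attempts.delete(relay.url);
    const pending = timers.get(relay.url);
    if (pending !== undefined) {
      clearTimeout(pending);
      timers.delete(relay.url);
    }
  });
}
```

Keying state by relay URL keeps one counter per relay, and the pending
timer guard avoids stacking reconnects when both `relay:disconnect`
and `flapping` fire for the same outage.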

Coexists cleanly with the relay-connection watchdog (admin/index.ts:500):
- Brief disconnects (e.g. lnbits restart): supervisor recovers within
  seconds, watchdog never fires.
- Persistent disconnects (relay truly down): supervisor keeps trying
  every ≤10s; if it can't recover within 60s, watchdog still exits and
  the process supervisor restarts the bunker. So the watchdog becomes
  a long-tail safety net; this supervisor handles the common case.
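
As a back-of-the-envelope check of how the two timers interleave, the
sketch below lists when reconnect attempts would fire before the 60s
watchdog deadline. It assumes each failed attempt returns instantly
and that the watchdog clock starts at the disconnect; both are
simplifications, and the function names are hypothetical.

```typescript
const WATCHDOG_MS = 60_000; // watchdog window quoted above

// Same capped schedule as the supervisor: 1s, 2s, 4s, 8s, then 10s.
function cappedDelayMs(attempt: number): number {
  return Math.min(1000 * 2 ** (attempt - 1), 10_000);
}

// Offsets (ms after the disconnect) at which attempts fire before the deadline.
function attemptTimesBefore(deadlineMs: number): number[] {
  const times: number[] = [];
  let elapsed = 0;
  for (let attempt = 1; ; attempt++) {
    elapsed += cappedDelayMs(attempt);
    if (elapsed > deadlineMs) break;
    times.push(elapsed);
  }
  return times;
}

console.log(attemptTimesBefore(WATCHDOG_MS));
// [1000, 3000, 7000, 15000, 25000, 35000, 45000, 55000]
```

Under those assumptions the supervisor gets eight attempts in before
the watchdog would exit the process.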

Operators with `NSEC_BUNKER_DISABLE_WATCHDOG=1` set as a workaround for
this bug can re-enable the watchdog once this lands.

Trade-off: we may hammer a permanently-down relay every 10s. Acceptable
because the bunker's primary relay is typically on the same host or LAN
(loopback or docker-internal); TCP RSTs are cheap. Public-relay setups
can layer external supervision on top.

Verified on regtest dev stack (cold-boot race): the bunker now logs
  🔁 admin: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
  🔁 backend: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
on each disconnect; pre-fix, it stayed silently deaf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-03 18:55:55 +02:00