NDK 3.x's per-relay connectivity machine gives up after ~3 fast-fail
(ECONNREFUSED) cycles. Three sub-second failures have near-identical
durations, so `isFlapping()` (std-dev < 1s) returns true and the relay
transitions to FLAPPING; NDKPool's `handleFlapping` then reschedules
with doubling backoff (5s → 10s → 20s → 40s → 80s …). For nsecbunkerd,
"disconnected for 80+s after every lnbits restart" is the failure mode
users hit on the regtest dev stack: the bunker container boots before
lnbits's nostrrelay extension is accepting WebSockets → ECONNREFUSED
storm → NDK flags the relay as FLAPPING → bunker stays silently deaf
until manual restart.
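To see why a refused-connection storm trips the heuristic, here is a
minimal sketch of a std-dev-based flapping check next to an unbounded
doubling backoff. It is an illustration only, not NDK's code:
`looksLikeFlapping`, `ndkStyleBackoffMs`, and the sample durations are
hypothetical, and the 1000ms threshold and 5s base are taken from the
figures quoted in this message.

```typescript
// Illustrative only: names and structure are hypothetical, not NDK's API.

/** Population standard deviation of a list of durations (ms). */
function stdDev(values: number[]): number {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

/** Flags a relay whose recent connection attempts all lasted about the same time. */
function looksLikeFlapping(durationsMs: number[], thresholdMs = 1000): boolean {
  return durationsMs.length >= 3 && stdDev(durationsMs) < thresholdMs;
}

/** Doubling backoff with no upper bound; `attempt` is 0-based. */
function ndkStyleBackoffMs(attempt: number, baseMs = 5000): number {
  return baseMs * 2 ** attempt;
}

// Three ECONNREFUSED failures, each lasting a few milliseconds (made-up values).
const refusedDurationsMs = [4, 6, 5];
console.log(looksLikeFlapping(refusedDurationsMs)); // true
console.log([0, 1, 2, 3, 4].map((a) => ndkStyleBackoffMs(a)));
// [5000, 10000, 20000, 40000, 80000]
```

Any run of instant failures has a spread far below one second, so under
these assumptions a relay that simply is not listening yet is
indistinguishable from one that is genuinely flapping.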
Symptom is particularly hostile because:
- `relay:connect` fires optimistically; the immediate ECONNREFUSED
follow-up doesn't propagate to user-facing logs.
- `NSEC_BUNKER_DISABLE_WATCHDOG=1` (the dev-stack default) skips the
exit-and-restart safety net.
- Manual `docker compose restart nsecbunker` is the only recovery.
Fix: attach a small supervisor (`attachIndefiniteReconnect`) to both
NDK instances (daemon's backend NDK in run.ts, AdminInterface's admin
NDK in admin/index.ts). On `relay:disconnect` or `flapping`, schedule
a manual `relay.connect()` with a SHORT capped delay (1s → 2s → 4s →
8s → 10s, capped at 10s instead of NDK's unbounded doubling). Successful
connect resets the attempt counter so a future disconnect storm starts
fresh.
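A rough sketch of what such a supervisor could look like, for readers
who want the shape without opening the diff. This is not the code in
this commit: `attachReconnectSketch`, `RelayLike`, and `PoolLike` are
hypothetical stand-ins, and the sketch assumes `connect()` returns a
promise that resolves on success and that the pool passes the relay to
`relay:disconnect` and `flapping` listeners.

```typescript
// Hypothetical minimal shapes; the real NDK types are richer.
interface RelayLike {
  url: string;
  connect(): Promise<void>;
}
interface PoolLike {
  on(event: string, listener: (relay: RelayLike) => void): unknown;
}

/** 1s, 2s, 4s, 8s, then capped at 10s. `attempt` is 1-based. */
function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 10_000): number {
  return Math.min(baseMs * 2 ** (attempt - 1), capMs);
}

function attachReconnectSketch(pool: PoolLike, label: string): void {
  const attempts = new Map<string, number>();
  const timers = new Map<string, ReturnType<typeof setTimeout>>();

  const schedule = (relay: RelayLike): void => {
    if (timers.has(relay.url)) return; // a retry is already pending
    const attempt = (attempts.get(relay.url) ?? 0) + 1;
    attempts.set(relay.url, attempt);
    const delay = reconnectDelayMs(attempt);
    console.log(`${label}: reconnect to ${relay.url} in ${delay}ms (attempt ${attempt})`);
    const timer = setTimeout(() => {
      timers.delete(relay.url);
      relay.connect().then(
        () => attempts.delete(relay.url), // success: next storm starts at attempt 1
        () => schedule(relay), // failure: keep trying, never give up
      );
    }, delay);
    timers.set(relay.url, timer);
  };

  pool.on("relay:disconnect", schedule);
  pool.on("flapping", schedule);
}
```

The pending-timer guard matters because a single failure may surface
both as a rejected `connect()` and as a fresh disconnect event; without
it the two paths would double-schedule and inflate the attempt counter.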
Coexists cleanly with the relay-connection watchdog (admin/index.ts:500):
- Brief disconnects (e.g. lnbits restart): supervisor recovers within
seconds, watchdog never fires.
- Persistent disconnects (relay truly down): supervisor keeps trying
every ≤10s; if it can't recover within 60s, watchdog still exits and
the process supervisor restarts the bunker. So the watchdog becomes
a long-tail safety net; this supervisor handles the common case.
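As a back-of-the-envelope check on that interaction, the snippet below
counts how many retries the capped schedule starts inside a 60s window.
It assumes each failed attempt returns instantly and that the window
starts at the disconnect; both are simplifications, and the helper
names are hypothetical.

```typescript
/** Capped schedule from this message: 1s, 2s, 4s, 8s, then 10s. `attempt` is 1-based. */
function cappedDelayMs(attempt: number): number {
  return Math.min(1000 * 2 ** (attempt - 1), 10_000);
}

/** Start times (ms after disconnect) of retries that begin inside the window. */
function retryTimesWithin(windowMs: number): number[] {
  const times: number[] = [];
  let elapsed = 0;
  for (let attempt = 1; ; attempt++) {
    elapsed += cappedDelayMs(attempt);
    if (elapsed > windowMs) return times;
    times.push(elapsed);
  }
}

console.log(retryTimesWithin(60_000));
// [1000, 3000, 7000, 15000, 25000, 35000, 45000, 55000]
```

Under those assumptions the supervisor gets eight attempts before the
watchdog would act, whereas a 5s → 10s → 20s → 40s doubling schedule
fits only three (at 5s, 15s, and 35s).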
Operators with `NSEC_BUNKER_DISABLE_WATCHDOG=1` set as a workaround for
this bug can re-enable the watchdog once this lands.
Trade-off: we may hammer a permanently-down relay every 10s. Acceptable
because the bunker's primary relay is typically on the same host or LAN
(loopback or docker-internal); TCP RSTs are cheap. Public-relay setups
can layer external supervision on top.
Verified on regtest dev stack (cold-boot race): bunker logs

    🔁 admin: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
    🔁 backend: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)

on each disconnect, where pre-fix the bunker stayed silently deaf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>