fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20) #22
Summary
Closes #20.
NDK 3.x's per-relay connectivity state machine transitions to `FLAPPING` after a small number of fast-fail connection attempts (3 sub-second `ECONNREFUSED` durations land below `FLAPPING_THRESHOLD_MS = 1s` std-dev → `isFlapping()` returns true → per-relay retry stops). `NDKPool.handleFlapping` then reschedules with doubling backoff (5s → 10s → 20s → 40s → 80s …), unbounded.

For nsecbunkerd, "disconnected for 80+ seconds after every dev-stack restart" is the failure mode users hit:
- nsecbunkerd boots before the `nostrrelay` extension is accepting WebSockets.
- The logs show `✅ Connected to ws://...` optimistically (from `relay:connect`), then nothing: the immediate `ECONNREFUSED` follow-up doesn't propagate to user-facing logs.
- With `NSEC_BUNKER_DISABLE_WATCHDOG=1` set (the regtest default, to avoid restart loops during heavy dev cycles), there's no safety net either. Manual `docker compose restart nsecbunker` is the only recovery.

Fix
Add `src/daemon/lib/relay-reconnect.ts`, a small supervisor that:
- Listens for `relay:disconnect` and `flapping` events on the pool.
- Schedules `relay.connect()` with a bounded, capped delay (1s → 2s → 4s → 8s → 10s, max 10s, vs NDK's unbounded doubling).
- Resets its backoff on `relay:connect` so a future disconnect storm starts fresh.
- Logs each scheduled retry (`🔁 <label>: scheduling reconnect to <url> in <delay>ms (attempt N, overriding NDK give-up)`).

Wired into both NDK instances:
- `Daemon.ndk` (`run.ts:174`): the per-key Backend instances' shared pool.
- `AdminInterface.ndk` (`admin/index.ts:73`): the admin-channel pool.

Coexistence with the connection watchdog
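Before looking at how the two mechanisms interact, the supervisor behavior described under Fix can be sketched roughly as follows. This is a minimal illustration, not the code in `src/daemon/lib/relay-reconnect.ts`: `RelayLike`, `PoolLike`, `reconnectDelay`, `superviseReconnects`, and the injectable `schedule` parameter are hypothetical names, and the sketch assumes the pool passes the affected relay to its `relay:disconnect`, `flapping`, and `relay:connect` handlers. Retrying when `connect()` rejects is also a choice of this sketch.

```typescript
// Hypothetical minimal shapes; the real NDK relay and pool types carry more.
interface RelayLike {
  url: string;
  connect(): Promise<void>;
}

interface PoolLike {
  on(event: string, handler: (relay: RelayLike) => void): void;
}

// Injectable timer so the sketch can be exercised without real waiting.
type Schedule = (fn: () => void, ms: number) => void;

const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 10000;

// Attempt 1 waits 1s, then 2s, 4s, 8s, and 10s for every later attempt.
function reconnectDelay(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), MAX_DELAY_MS);
}

function superviseReconnects(
  pool: PoolLike,
  label: string,
  schedule: Schedule = (fn, ms) => { setTimeout(fn, ms); },
): void {
  const attempts = new Map<string, number>(); // per-relay attempt counter
  const pending = new Set<string>();          // relays with a timer in flight

  const scheduleReconnect = (relay: RelayLike): void => {
    if (pending.has(relay.url)) return; // at most one timer per relay
    const attempt = (attempts.get(relay.url) ?? 0) + 1;
    attempts.set(relay.url, attempt);
    const delay = reconnectDelay(attempt);
    console.log(
      `🔁 ${label}: scheduling reconnect to ${relay.url} in ${delay}ms ` +
        `(attempt ${attempt}, overriding NDK give-up)`,
    );
    pending.add(relay.url);
    schedule(() => {
      pending.delete(relay.url);
      // Assumption of this sketch: a rejected connect() schedules another try.
      relay.connect().catch(() => scheduleReconnect(relay));
    }, delay);
  };

  pool.on("relay:disconnect", scheduleReconnect);
  pool.on("flapping", scheduleReconnect);
  // A successful connection resets the counter so the next storm starts at 1s.
  pool.on("relay:connect", (relay) => { attempts.delete(relay.url); });
}
```

Keeping the timer injectable is only there to make the schedule easy to check in isolation; in a daemon the default `setTimeout` path would be used.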
Both behaviors layer cleanly:
`process.exit(1)` → external supervisor restarts bunker

So the watchdog becomes a long-tail safety net; this supervisor handles the common case without involving the orchestrator. Operators with `NSEC_BUNKER_DISABLE_WATCHDOG=1` set as a workaround for #20 can re-enable the watchdog once this lands.

Trade-off
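To put the 10s figure in context, the two delay schedules quoted in this description can be compared numerically. The sketch below is only arithmetic on those quoted numbers (5s doubling without a cap, and 1s doubling capped at 10s). The function names are hypothetical and nothing here calls into NDK.

```typescript
// Schedule quoted for NDK's flapping handler: 5s, doubling, no cap.
function unboundedDoubling(attempt: number): number {
  return 5000 * Math.pow(2, attempt - 1);
}

// Schedule quoted for the supervisor: 1s, doubling, capped at 10s.
function cappedDoubling(attempt: number): number {
  return Math.min(1000 * Math.pow(2, attempt - 1), 10000);
}

// Total time spent waiting across the first `attempts` retries.
function cumulativeWait(
  delay: (attempt: number) => number,
  attempts: number,
): number {
  let total = 0;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    total += delay(attempt);
  }
  return total;
}

// Print both schedules side by side for the first six attempts.
for (let attempt = 1; attempt <= 6; attempt++) {
  console.log(
    `attempt ${attempt}: unbounded=${unboundedDoubling(attempt)}ms ` +
      `capped=${cappedDoubling(attempt)}ms`,
  );
}
```

Under these assumptions the fifth retry waits 80s on the unbounded schedule and 10s on the capped one, and every later capped retry stays at 10s, which is the steady-state cost weighed below.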
We may hammer a permanently-down relay every 10s. Acceptable because:
`ws://lnbits:5001/...` or `ws://127.0.0.1:.../`); TCP RSTs are cheap.

Test plan
Verified on the regtest dev stack (cold-boot race):
- `docker compose stop nsecbunker lnbits && docker compose start nsecbunker lnbits` (parallel boot → reproduces the original race).
- `pnpm run build` clean (the two pre-existing TS warnings in `src/daemon/admin/index.ts` are unrelated).

Notes
🤖 Generated with Claude Code