fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20) #22

Merged
padreug merged 1 commit from fix-20-indefinite-relay-reconnect into dev 2026-06-03 16:58:48 +00:00

Summary

Closes #20.

NDK 3.x's per-relay connectivity state machine transitions to FLAPPING after a small number of fast-fail connection attempts: three sub-second ECONNREFUSED failures have durations whose standard deviation falls below FLAPPING_THRESHOLD_MS (1s), so isFlapping() returns true and per-relay retry stops. NDKPool.handleFlapping then reschedules with unbounded doubling backoff (5s → 10s → 20s → 40s → 80s …).
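
To make the heuristic concrete, here is a minimal sketch of the check as this description states it. It is not NDK's source: `stdDevMs`, `looksLikeFlapping`, `unboundedDoublingMs`, the use of a population standard deviation, and the three-sample minimum are all assumptions made for illustration.

```typescript
// Illustrative sketch only: a re-creation of the heuristic as the description
// states it, not a copy of NDK's source.
const FLAPPING_THRESHOLD_MS = 1000; // threshold quoted in the description

/** Population standard deviation of a list of durations, in milliseconds. */
function stdDevMs(durationsMs: number[]): number {
    let sum = 0;
    for (const d of durationsMs) sum += d;
    const mean = sum / durationsMs.length;
    let squared = 0;
    for (const d of durationsMs) squared += (d - mean) * (d - mean);
    return Math.sqrt(squared / durationsMs.length);
}

/** Hypothetical check: enough samples, and all of them nearly the same length. */
function looksLikeFlapping(durationsMs: number[], minSamples = 3): boolean {
    if (durationsMs.length < minSamples) return false;
    return stdDevMs(durationsMs) < FLAPPING_THRESHOLD_MS;
}

/** Delays from doubling with no cap, starting at 5s as in the description. */
function unboundedDoublingMs(attempts: number, baseMs = 5000): number[] {
    const delays: number[] = [];
    for (let i = 0; i < attempts; i++) delays.push(baseMs * Math.pow(2, i));
    return delays;
}
```

Three refused connections lasting 12 ms, 8 ms, and 15 ms have a spread far below 1 s, so `looksLikeFlapping([12, 8, 15])` returns true. The first five unbounded delays add up to 155 s of waiting.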

For nsecbunkerd, "disconnected for 80+ seconds after every dev-stack restart" is the failure mode users hit:

  1. Bunker container boots before lnbits's nostrrelay extension is accepting WebSockets.
  2. NDK hits an ECONNREFUSED storm and declares the relay FLAPPING.
  3. Bunker logs ✅ Connected to ws://... optimistically (from relay:connect) then nothing — the immediate ECONNREFUSED follow-up doesn't propagate to user-facing logs.
  4. With NSEC_BUNKER_DISABLE_WATCHDOG=1 set (the regtest default to avoid restart loops during heavy dev cycles), there's no safety net either. Manual docker compose restart nsecbunker is the only recovery.

Fix

Add src/daemon/lib/relay-reconnect.ts — a small supervisor that:

  • Listens for relay:disconnect and flapping events on the pool.
  • On either, schedules a manual relay.connect() with a capped exponential delay (1s → 2s → 4s → 8s → 10s, then 10s for every later attempt, versus NDK's unbounded doubling).
  • Resets the attempt counter on successful relay:connect so a future disconnect storm starts fresh.
  • Logs the override transparently (🔁 <label>: scheduling reconnect to <url> in <delay>ms (attempt N, overriding NDK give-up)).
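
The delay schedule and log line above can be sketched as follows. The helper names `reconnectDelayMs` and `formatReconnectLog` and the two constants are hypothetical; the actual contents of relay-reconnect.ts are not shown in this PR description.

```typescript
// Hypothetical helpers illustrating the schedule described above;
// the names are not taken from relay-reconnect.ts.
const BASE_DELAY_MS = 1000; // first retry after 1s
const MAX_DELAY_MS = 10000; // never wait longer than 10s

/** Delay before the given 1-based attempt: doubles from 1s, capped at 10s. */
function reconnectDelayMs(attempt: number): number {
    const exponent = Math.max(0, attempt - 1);
    // Clamp the exponent so very large attempt counts cannot overflow.
    return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, Math.min(exponent, 30)));
}

/** Builds the log line in the format quoted in the description. */
function formatReconnectLog(label: string, url: string, attempt: number): string {
    const delay = reconnectDelayMs(attempt);
    return `🔁 ${label}: scheduling reconnect to ${url} in ${delay}ms (attempt ${attempt}, overriding NDK give-up)`;
}
```

Attempts 1 through 5 yield 1000, 2000, 4000, 8000, and 10000 ms; every later attempt stays at 10000 ms.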

Wired into both NDK instances:

  • Daemon.ndk (run.ts:174) — the per-key Backend instances' shared pool.
  • AdminInterface.ndk (admin/index.ts:73) — the admin-channel pool.
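
A rough sketch of how such a supervisor might attach to a pool is shown below. It uses minimal structural types (`PoolLike`, `RelayLike`) in place of NDK's classes so that it stands alone. The function name `attachReconnectSketch`, its signature, the injectable scheduler, and the one-timer-per-relay guard are assumptions for illustration, not the PR's actual code.

```typescript
// Structural stand-ins for the NDK pool and relay; assumptions for this sketch.
interface RelayLike { url: string; connect(): Promise<void> | void; }
type PoolEvent = "relay:connect" | "relay:disconnect" | "flapping";
interface PoolLike { on(event: PoolEvent, handler: (relay: RelayLike) => void): void; }
type Scheduler = (fn: () => void, delayMs: number) => void;

function attachReconnectSketch(
    pool: PoolLike,
    label: string,
    schedule: Scheduler = (fn, ms) => { setTimeout(fn, ms); },
    log: (msg: string) => void = (msg) => console.log(msg),
): void {
    const attempts = new Map<string, number>(); // per-relay attempt counter
    const pending = new Set<string>();          // relays with a timer already queued

    const onDown = (relay: RelayLike): void => {
        if (pending.has(relay.url)) return; // keep at most one timer per relay
        const attempt = (attempts.get(relay.url) ?? 0) + 1;
        attempts.set(relay.url, attempt);
        const delay = Math.min(10000, 1000 * Math.pow(2, Math.min(attempt - 1, 30)));
        log(`🔁 ${label}: scheduling reconnect to ${relay.url} in ${delay}ms (attempt ${attempt}, overriding NDK give-up)`);
        pending.add(relay.url);
        schedule(() => {
            pending.delete(relay.url);
            try {
                // A failed connect is treated like another disconnect.
                void Promise.resolve(relay.connect()).catch(() => onDown(relay));
            } catch {
                onDown(relay);
            }
        }, delay);
    };

    pool.on("relay:disconnect", onDown);
    pool.on("flapping", onDown);
    // A successful connect resets the counter so the next storm starts at 1s.
    pool.on("relay:connect", (relay) => { attempts.delete(relay.url); });
}
```

Under these assumptions, wiring it in would amount to one call per NDK instance, each with its own label (for example "backend" and "admin", matching the labels in the log lines of the test plan).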

Coexistence with the connection watchdog

Both behaviors layer cleanly:

| Scenario | Supervisor | Watchdog (60s, when enabled) |
|---|---|---|
| Brief disconnect (lnbits restart, ~5–15s) | Recovers within seconds | Never fires |
| Sustained disconnect (relay truly down) | Keeps trying every ≤10s | Fires after 60s → process.exit(1) → external supervisor restarts bunker |

So the watchdog becomes a long-tail safety net; this supervisor handles the common case without involving the orchestrator. Operators with NSEC_BUNKER_DISABLE_WATCHDOG=1 set as a workaround for #20 can re-enable the watchdog once this lands.
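
As a worked example of how the two timers interleave, the sketch below lists when reconnect attempts would fire before the watchdog's 60s deadline. It assumes each attempt fails immediately (so the next delay starts right away) and that the 60s window is measured from the initial disconnect; `attemptTimesBefore` is a hypothetical helper.

```typescript
/**
 * Times (ms after the disconnect) at which reconnect attempts fire before
 * the deadline, assuming every attempt fails instantly. Hypothetical helper.
 */
function attemptTimesBefore(deadlineMs: number, baseMs = 1000, capMs = 10000): number[] {
    const fireTimes: number[] = [];
    let elapsed = 0;
    for (let attempt = 1; ; attempt++) {
        // Same schedule as the supervisor: double from baseMs, capped at capMs.
        elapsed += Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
        if (elapsed >= deadlineMs) break;
        fireTimes.push(elapsed);
    }
    return fireTimes;
}
```

With the defaults, `attemptTimesBefore(60000)` gives attempts at 1s, 3s, 7s, 15s, 25s, 35s, 45s, and 55s, so under these assumptions the supervisor gets eight tries before the watchdog would exit the process.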

Trade-off

We may hammer a permanently-down relay every 10s. Acceptable because:

  • The bunker's primary relay is typically on the same host or LAN (ws://lnbits:5001/... or ws://127.0.0.1:.../); TCP RSTs are cheap.
  • Public-relay setups can layer external supervision on top if they care about retry pressure.

Test plan

Verified on the regtest dev stack (cold-boot race):

  • docker compose stop nsecbunker lnbits && docker compose start nsecbunker lnbits (parallel boot → reproduces the original race).
  • On the resulting disconnect, bunker logs both:
    🔁 admin:   scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
    🔁 backend: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
    
  • pnpm run build clean (the two pre-existing TS warnings in src/daemon/admin/index.ts are unrelated).
  • To verify: long-running soak — bunker survives 10 consecutive lnbits restarts without manual intervention. (Pre-fix this would fail on the first restart.)

Notes

  • The other half of the bunker-side regression we're tracking — per-key Backend subscriptions sometimes failing to register on the relay after fresh boot — is filed separately as #21 and not addressed here.
  • Cross-references: aiolabs/lnbits PR #48 (DEFAULT_POLICY_RULES NIP-15 fix), aiolabs/nostrmarket PR #8 (publish-timeout-or-fire-and-forget). All three were discovered together while debugging a regtest signup hang on 2026-06-03.

🤖 Generated with Claude Code

fix(daemon): keep retrying relay reconnect indefinitely, overriding NDK give-up (#20)
Some checks failed
Docker image / build-and-push-image (push) Has been cancelled
a690596b85
NDK 3.x's per-relay connectivity machine gives up after ~3 fast-fail
(ECONNREFUSED) cycles. Three sub-second failures look identical, so
`isFlapping()` (std-dev < 1s) returns true and the relay transitions
to FLAPPING; NDKPool's `handleFlapping` then reschedules with doubling
backoff (5s → 10s → 20s → 40s → 80s …). For nsecbunkerd, "disconnected
for 80+s after every lnbits restart" is the failure mode users hit on
the regtest dev stack: bunker container boots before lnbits's
nostrrelay extension is accepting WebSockets → ECONNREFUSED storm →
NDK flagged FLAPPING → bunker stays silently deaf until manual restart.

Symptom is particularly hostile because:
- `relay:connect` fires optimistically; the immediate ECONNREFUSED
  follow-up doesn't propagate to user-facing logs.
- `NSEC_BUNKER_DISABLE_WATCHDOG=1` (the dev-stack default) skips the
  exit-and-restart safety net.
- Manual `docker compose restart nsecbunker` is the only recovery.

Fix: attach a small supervisor (`attachIndefiniteReconnect`) to both
NDK instances (daemon's backend NDK in run.ts, AdminInterface's admin
NDK in admin/index.ts). On `relay:disconnect` or `flapping`, schedule
a manual `relay.connect()` with a SHORT capped delay (1s → 2s → 4s →
8s → 10s, capped at 10s instead of NDK's unbounded doubling). Successful
connect resets the attempt counter so a future disconnect storm starts
fresh.

Coexists cleanly with the relay-connection watchdog (admin/index.ts:500):
- Brief disconnects (e.g. lnbits restart): supervisor recovers within
  seconds, watchdog never fires.
- Persistent disconnects (relay truly down): supervisor keeps trying
  every ≤10s; if it can't recover within 60s, watchdog still exits and
  the process supervisor restarts the bunker. So the watchdog becomes
  a long-tail safety net; this supervisor handles the common case.

Operators with `NSEC_BUNKER_DISABLE_WATCHDOG=1` set as a workaround for
this bug can re-enable the watchdog once this lands.

Trade-off: we may hammer a permanently-down relay every 10s. Acceptable
because the bunker's primary relay is typically on the same host or LAN
(loopback or docker-internal); TCP RSTs are cheap. Public-relay setups
can layer external supervision on top.

Verified on regtest dev stack (cold-boot race): bunker logs
  🔁 admin: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
  🔁 backend: scheduling reconnect to ws://lnbits:5001/nostrrelay/test/ in 1000ms (attempt 1, overriding NDK give-up)
on each disconnect, where pre-fix the bunker stayed silently deaf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
padreug deleted branch fix-20-indefinite-relay-reconnect 2026-06-03 16:58:49 +00:00
Reference
aiolabs/nsecbunkerd!22