relayConnectionWatchdog: make threshold + poll-interval env-configurable + add soft-fail mode (don't process.exit on transient partitions) #18
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
relayConnectionWatchdoginsrc/daemon/admin/index.ts:499-518hard-codes a 60s partition threshold + 10s poll interval, and callsprocess.exit(1)when no relays are connected for ≥60s. Both the threshold and the exit-on-partition behavior need to be operator-tunable — the 60s default is too aggressive for several real operational patterns we've hit on the bohm regtest.What's happening
Current code:
The watchdog is a deliberate replacement for the older publish-echo watchdog (#4 + #7) — that change was correct. The issue isn't the design; it's the hard-coded threshold + hard-coded exit policy.
Real-world patterns where 60s is too short (bohm regtest, 2026-05-31)
lnbits container restart — lnbits hosts the relay (
nostrrelayextension). When we restart the lnbits container (docker compose restart lnbits— happens during dev iteration, deployment, env change), the bunker loses its only relay. Lnbits typically takes 30-60s to come back up + accept WS connections. Bunker hits the 60s threshold mid-restart and exits.Long bulk migrations — driving 50+ accounts through the bunker via
RemoteBunkerSigner.provision()from the lnbits-side migration tool keeps the admin channel busy + occasionally causes the underlying WS to momentarily drop. The bunker counts that as "no relays" and exits mid-migration, leaving accounts half-migrated.Network blips on multi-host deploys (future) — when the bunker runs on a different host than its relay, brief packet loss / restart of the relay host trips the watchdog deterministically.
The bunker exits even though NDK's pool will auto-reconnect on its own — the exit prevents the recovery from happening.
Proposed fix
Three changes:
1. Make thresholds env-configurable
Operators can tune for their relay topology. The defaults stay unchanged for backward compat.
2. Add a soft-fail mode
Useful for:
When set to
log, the watchdog continues to emit❌ No connected relays for Xs (action=log, NDK pool will reconnect)every poll interval until connectivity returns, then logs✅ Relay connectivity restored after Ys.3. Increase default threshold to something less aggressive
Current 60s is shorter than the lnbits container restart cycle. A 300s (5min) default would survive normal restarts without losing the watchdog's "permanent partition" detection capability.
This is the most opinionated of the three changes — happy to keep 60s default if you prefer + just let operators tune via env. But 300s would be more consistent with typical "I've lost my network entirely" semantics rather than "I had a brief WS reconnect."
Aside: exit code discrepancy
Code reads
process.exit(1)butdocker ps -areports the bunker asExited (0)after the watchdog fires. Compose policy isrestart: on-failurewhich doesn't trigger on exit 0, hence the bunker stays down. Either:Either way: changing this exit to
process.exit(1)explicitly OR documenting why it's 0 would let operators setrestart: on-failureand get the auto-restart they expect. Worth a separate investigation pass.Acceptance
NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS+NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDSenv vars wired. Defaults unchanged or bumped per discussion in §3.NSECBUNKER_RELAY_PARTITION_ACTION=log|exitenv var wired. Defaultexitpreserves backward compat.logmode is on + partition detected, daemon stays alive, emits periodic WARN, recovery emits a✅log line.exitmode fires — so dockerrestart: on-failuretriggers correctly. Investigate why current observed exit code is 0.Cross-references
src/daemon/admin/index.ts:499-518— current watchdog implaiolabs/nsecbunkerd#4+#7— the publish-echo watchdog this replaced (correct decision)2026-05-31T17:15Z(bohm regtest bulk-migration session) — the workload that surfaced this~/dev/local/docker/regtest/docker-compose.dev.yml—restart: on-failurepolicy that fails to trigger because of the exit-code thing