relayConnectionWatchdog: make threshold + poll-interval env-configurable + add soft-fail mode (don't process.exit on transient partitions) #18

Open
opened 2026-05-31 17:22:58 +00:00 by padreug · 0 comments
Owner

Summary

relayConnectionWatchdog in src/daemon/admin/index.ts:499-518 hard-codes a 60s partition threshold + 10s poll interval, and calls process.exit(1) when no relays are connected for ≥60s. Both the threshold and the exit-on-partition behavior need to be operator-tunable — the 60s default is too aggressive for several real operational patterns we've hit on the bohm regtest.

What's happening

Current code:

async function relayConnectionWatchdog(ndk: NDK) {
    const POLL_INTERVAL_MS = 10_000;
    const PARTITION_THRESHOLD_MS = 60_000;
    let lastConnectedAt = Date.now();

    setInterval(() => {
        const connectedCount = ndk.pool.connectedRelays().length;
        if (connectedCount > 0) {
            lastConnectedAt = Date.now();
            return;
        }
        const elapsed = Date.now() - lastConnectedAt;
        if (elapsed > PARTITION_THRESHOLD_MS) {
            console.log(`❌ No connected relays for ${Math.floor(elapsed / 1000)}s. Exiting.`);
            process.exit(1);
        }
    }, POLL_INTERVAL_MS);
}

The watchdog is a deliberate replacement for the older publish-echo watchdog (#4 + #7) — that change was correct. The issue isn't the design; it's the hard-coded threshold + hard-coded exit policy.

Real-world patterns where 60s is too short (bohm regtest, 2026-05-31)

  1. lnbits container restart — lnbits hosts the relay (nostrrelay extension). When we restart the lnbits container (docker compose restart lnbits — happens during dev iteration, deployment, env change), the bunker loses its only relay. Lnbits typically takes 30-60s to come back up + accept WS connections. Bunker hits the 60s threshold mid-restart and exits.

  2. Long bulk migrations — driving 50+ accounts through the bunker via RemoteBunkerSigner.provision() from the lnbits-side migration tool keeps the admin channel busy + occasionally causes the underlying WS to momentarily drop. The bunker counts that as "no relays" and exits mid-migration, leaving accounts half-migrated.

  3. Network blips on multi-host deploys (future) — when the bunker runs on a different host than its relay, brief packet loss / restart of the relay host trips the watchdog deterministically.

The bunker exits even though NDK's pool will auto-reconnect on its own — the exit prevents the recovery from happening.

Proposed fix

Three changes:

1. Make thresholds env-configurable

const POLL_INTERVAL_MS = Number(process.env.NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS ?? 10) * 1000;
const PARTITION_THRESHOLD_MS = Number(process.env.NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDS ?? 60) * 1000;

Operators can tune for their relay topology. The defaults stay unchanged for backward compat.
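One caveat with the one-liners above: `??` only falls back on `undefined`/`null`, so an env var set to an empty string yields `Number("") === 0`, and a non-numeric value yields `NaN`. A `NaN` threshold makes `elapsed > PARTITION_THRESHOLD_MS` always false, which silently disables the watchdog, and a `0` threshold fires on the first poll without a relay. A minimal sketch of a stricter parser follows; `readSecondsAsMs` and the `Env` type are hypothetical names, not existing nsecbunkerd code.

```typescript
// Hypothetical helper (not part of nsecbunkerd): read a positive number of
// seconds from an env-like map and convert it to milliseconds. Falls back to
// the default when the value is missing, empty, non-numeric, zero or negative.
type Env = Record<string, string | undefined>;

function readSecondsAsMs(env: Env, name: string, defaultSeconds: number): number {
    const raw = env[name];
    if (raw === undefined || raw.trim() === "") {
        return defaultSeconds * 1000;
    }
    const seconds = Number(raw);
    if (!Number.isFinite(seconds) || seconds <= 0) {
        console.warn(`Ignoring invalid ${name}=${JSON.stringify(raw)}; using ${defaultSeconds}s`);
        return defaultSeconds * 1000;
    }
    return seconds * 1000;
}

// Intended usage inside the watchdog (assumed wiring):
//   const POLL_INTERVAL_MS = readSecondsAsMs(process.env, "NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS", 10);
//   const PARTITION_THRESHOLD_MS = readSecondsAsMs(process.env, "NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDS", 60);
```

Taking the env map as a parameter rather than reading `process.env` directly keeps the helper easy to unit-test.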

2. Add a soft-fail mode

const ON_PARTITION = (process.env.NSECBUNKER_RELAY_PARTITION_ACTION ?? "exit").toLowerCase();
// "exit" — current behavior: process.exit(1)
// "log"  — log loudly + emit a periodic warning, let NDK's pool reconnect

Useful for:

  • Dev environments where supervisor restart adds friction (no autorestart wrapper in docker-compose, or operator wants to keep the daemon alive across relay flaps for log inspection)
  • Single-relay deployments where the relay being down is OK temporarily — exiting just adds N seconds of unavailability after the relay comes back

When set to log, the watchdog continues to emit ❌ No connected relays for Xs (action=log, NDK pool will reconnect) every poll interval until connectivity returns, then logs ✅ Relay connectivity restored after Ys.
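The exit/log behavior described above could be sketched as a small decision function that the existing `setInterval` callback calls on every poll. This is an illustration under assumptions, not the proposed patch: `pollOnce`, `parseAction`, `WatchdogState` and `WatchdogEvent` are hypothetical names, and unknown action values are assumed to fall back to `exit`.

```typescript
type PartitionAction = "exit" | "log";

// Assumption: anything other than "log" keeps today's exit behavior.
function parseAction(raw: string | undefined): PartitionAction {
    return (raw ?? "exit").trim().toLowerCase() === "log" ? "log" : "exit";
}

interface WatchdogState {
    lastConnectedAt: number; // ms timestamp of the last poll that saw a relay
    partitioned: boolean;    // true once the threshold has been crossed
}

type WatchdogEvent =
    | { kind: "ok" }
    | { kind: "restored"; downMs: number }
    | { kind: "waiting"; elapsedMs: number }
    | { kind: "partition"; elapsedMs: number; action: PartitionAction };

// Decide what one poll should do. Mutates `state`, performs no I/O.
function pollOnce(
    state: WatchdogState,
    connectedCount: number,
    now: number,
    thresholdMs: number,
    action: PartitionAction,
): WatchdogEvent {
    if (connectedCount > 0) {
        const wasPartitioned = state.partitioned;
        const downMs = now - state.lastConnectedAt;
        state.lastConnectedAt = now;
        state.partitioned = false;
        return wasPartitioned ? { kind: "restored", downMs } : { kind: "ok" };
    }
    const elapsedMs = now - state.lastConnectedAt;
    if (elapsedMs > thresholdMs) {
        state.partitioned = true;
        return { kind: "partition", elapsedMs, action };
    }
    return { kind: "waiting", elapsedMs };
}
```

The caller would log the ❌ line on every `partition` event and call `process.exit(1)` only when `action` is `"exit"`. In `log` mode it keeps polling and prints the ✅ line when a `restored` event arrives.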

3. Increase default threshold to something less aggressive

The current 60s threshold sits at the upper end of the 30-60s lnbits container restart cycle, so a slow restart is enough to trip it. A 300s (5min) default would survive normal restarts without losing the watchdog's "permanent partition" detection capability.

This is the most opinionated of the three changes — happy to keep 60s default if you prefer + just let operators tune via env. But 300s would be more consistent with typical "I've lost my network entirely" semantics rather than "I had a brief WS reconnect."
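To make the 60s vs. 300s trade-off concrete, here is a rough model of when the current check (`elapsed > PARTITION_THRESHOLD_MS`, evaluated once per poll) would fire for a given outage length. It is an idealized sketch: `wouldTrip` is a hypothetical helper, polls are assumed to land exactly on schedule, and the outage is assumed to start right after a successful poll. The outage the watchdog sees also includes the time NDK's pool takes to reconnect, not only the relay's own restart time.

```typescript
// Returns true if the watchdog would cross the threshold during an outage.
// Idealized model (assumption): the last successful poll is at t = 0, polls
// land exactly every pollMs, and no relay is connected for 0 < t < outageMs.
function wouldTrip(outageMs: number, pollMs: number, thresholdMs: number): boolean {
    for (let t = pollMs; t < outageMs; t += pollMs) {
        // This poll sees no relay, so elapsed since the last good poll is t.
        if (t > thresholdMs) {
            return true;
        }
    }
    return false;
}

// With the current defaults (10s poll, 60s threshold):
const trips45s = wouldTrip(45_000, 10_000, 60_000);       // false
const trips90s = wouldTrip(90_000, 10_000, 60_000);       // true
// With the proposed 300s threshold the same 90s outage is tolerated:
const trips90sAt300 = wouldTrip(90_000, 10_000, 300_000); // false
```

Under this model the 60s default fires at the 70s poll at the earliest, so any outage that lasts past that poll brings the bunker down.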

Aside: exit code discrepancy

Code reads process.exit(1) but docker ps -a reports the bunker as Exited (0) after the watchdog fires. Compose policy is restart: on-failure which doesn't trigger on exit 0, hence the bunker stays down. Either:

  • The exit code is getting transformed somewhere up the stack (Node.js process wrapper, shell wrapper, etc.)
  • Or the watchdog isn't actually the path that fired (different exit path with code 0)

Either way: making sure the partition path really exits with a non-zero code (the code already calls process.exit(1), so something else is producing the 0) OR documenting why it's 0 would let operators set restart: on-failure and get the auto-restart they expect. Worth a separate investigation pass.
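One low-cost aid for that investigation could be to log the exit code from inside the process. Node.js emits an `exit` event on `process` and passes the exit code to the listener, so the log shows which code the daemon itself used and can be compared with what Docker reports. A minimal sketch, where `installExitLogger` and `ExitSource` are hypothetical names and the placement in the daemon is an assumption:

```typescript
// Minimal structural type for anything that emits an "exit" event with a
// numeric code. Node's global `process` has a compatible `on("exit", ...)`.
interface ExitSource {
    on(event: "exit", listener: (code: number) => void): unknown;
}

// Register a listener that records the exit code just before the process ends.
// "exit" listeners must be synchronous, so `log` should not defer its work.
function installExitLogger(source: ExitSource, log: (line: string) => void): void {
    source.on("exit", (code) => {
        log(`process exiting with code ${code}`);
    });
}

// Assumed wiring at daemon startup:
//   installExitLogger(process, (line) => console.error(line));
```

If the log line reports code 1 while Docker still shows `Exited (0)`, the code is being changed outside the Node.js process. If the line reports 0 or never appears, a different exit path is the more likely cause.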

Acceptance

  • NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS + NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDS env vars wired. Defaults unchanged or bumped per discussion in §3.
  • NSECBUNKER_RELAY_PARTITION_ACTION=log|exit env var wired. Default exit preserves backward compat.
  • When log mode is on + partition detected, daemon stays alive, emits periodic WARN, recovery emits a log line.
  • Exit code on partition is 1 (not 0) when exit mode fires — so docker restart: on-failure triggers correctly. Investigate why current observed exit code is 0.
  • Docs section spelling out the three env vars + recommended values for various deployment patterns (single-host dev vs. multi-host prod).

Cross-references

  • src/daemon/admin/index.ts:499-518 — current watchdog impl
  • aiolabs/nsecbunkerd#4 + #7 — the publish-echo watchdog this replaced (correct decision)
  • Coord log 2026-05-31T17:15Z (bohm regtest bulk-migration session) — the workload that surfaced this
  • ~/dev/local/docker/regtest/docker-compose.dev.yml (restart: on-failure policy that fails to trigger because of the exit-code discrepancy)