relayConnectionWatchdog: make threshold + poll-interval env-configurable + add soft-fail mode (don't process.exit on transient partitions) #18

New issue

Open

opened 2026-05-31 17:22:58 +00:00 by padreug · 0 comments

padreug commented

2026-05-31 17:22:58 +00:00

Owner

Summary

relayConnectionWatchdog in src/daemon/admin/index.ts:499-518 hard-codes a 60s partition threshold + 10s poll interval, and calls process.exit(1) when no relays are connected for ≥60s. Both the threshold and the exit-on-partition behavior need to be operator-tunable — the 60s default is too aggressive for several real operational patterns we've hit on the bohm regtest.

What's happening

Current code:

async function relayConnectionWatchdog(ndk: NDK) {
    const POLL_INTERVAL_MS = 10_000;
    const PARTITION_THRESHOLD_MS = 60_000;
    let lastConnectedAt = Date.now();

    setInterval(() => {
        const connectedCount = ndk.pool.connectedRelays().length;
        if (connectedCount > 0) {
            lastConnectedAt = Date.now();
            return;
        }
        const elapsed = Date.now() - lastConnectedAt;
        if (elapsed > PARTITION_THRESHOLD_MS) {
            console.log(`❌ No connected relays for ${Math.floor(elapsed / 1000)}s. Exiting.`);
            process.exit(1);
        }
    }, POLL_INTERVAL_MS);
}

The watchdog is a deliberate replacement for the older publish-echo watchdog (#4 + #7) — that change was correct. The issue isn't the design; it's the hard-coded threshold + hard-coded exit policy.

Real-world patterns where 60s is too short (bohm regtest, 2026-05-31)

lnbits container restart — lnbits hosts the relay (nostrrelay extension). When we restart the lnbits container (docker compose restart lnbits — happens during dev iteration, deployment, env change), the bunker loses its only relay. Lnbits typically takes 30-60s to come back up + accept WS connections. Bunker hits the 60s threshold mid-restart and exits.
Long bulk migrations — driving 50+ accounts through the bunker via RemoteBunkerSigner.provision() from the lnbits-side migration tool keeps the admin channel busy + occasionally causes the underlying WS to momentarily drop. The bunker counts that as "no relays" and exits mid-migration, leaving accounts half-migrated.
Network blips on multi-host deploys (future) — when the bunker runs on a different host than its relay, brief packet loss / restart of the relay host trips the watchdog deterministically.

The bunker exits even though NDK's pool will auto-reconnect on its own — the exit prevents the recovery from happening.

Proposed fix

Three changes:

1. Make thresholds env-configurable

const POLL_INTERVAL_MS = Number(process.env.NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS ?? 10) * 1000;
const PARTITION_THRESHOLD_MS = Number(process.env.NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDS ?? 60) * 1000;

Operators can tune for their relay topology. The defaults stay unchanged for backward compat.

2. Add a soft-fail mode

const ON_PARTITION = (process.env.NSECBUNKER_RELAY_PARTITION_ACTION ?? "exit").toLowerCase();
// "exit" — current behavior: process.exit(1)
// "log"  — log loudly + emit a periodic warning, let NDK's pool reconnect

Useful for:

Dev environments where supervisor restart adds friction (no autorestart wrapper in docker-compose, or operator wants to keep the daemon alive across relay flaps for log inspection)
Single-relay deployments where the relay being down is OK temporarily — exiting just adds N seconds of unavailability after the relay comes back

When set to log, the watchdog continues to emit ❌ No connected relays for Xs (action=log, NDK pool will reconnect) every poll interval until connectivity returns, then logs ✅ Relay connectivity restored after Ys.

3. Increase default threshold to something less aggressive

Current 60s is shorter than the lnbits container restart cycle. A 300s (5min) default would survive normal restarts without losing the watchdog's "permanent partition" detection capability.

This is the most opinionated of the three changes — happy to keep 60s default if you prefer + just let operators tune via env. But 300s would be more consistent with typical "I've lost my network entirely" semantics rather than "I had a brief WS reconnect."

Aside: exit code discrepancy

Code reads process.exit(1) but docker ps -a reports the bunker as Exited (0) after the watchdog fires. Compose policy is restart: on-failure which doesn't trigger on exit 0, hence the bunker stays down. Either:

The exit code is getting transformed somewhere up the stack (Node.js process wrapper, shell wrapper, etc.)
Or the watchdog isn't actually the path that fired (different exit path with code 0)

Either way: changing this exit to process.exit(1) explicitly OR documenting why it's 0 would let operators set restart: on-failure and get the auto-restart they expect. Worth a separate investigation pass.

Acceptance

NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS + NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDS env vars wired. Defaults unchanged or bumped per discussion in §3.
NSECBUNKER_RELAY_PARTITION_ACTION=log|exit env var wired. Default exit preserves backward compat.
When log mode is on + partition detected, daemon stays alive, emits periodic WARN, recovery emits a ✅ log line.
Exit code on partition is 1 (not 0) when exit mode fires — so docker restart: on-failure triggers correctly. Investigate why current observed exit code is 0.
Docs section spelling out the three env vars + recommended values for various deployment patterns (single-host dev vs. multi-host prod).

Cross-references

src/daemon/admin/index.ts:499-518 — current watchdog impl
aiolabs/nsecbunkerd#4 + #7 — the publish-echo watchdog this replaced (correct decision)
Coord log 2026-05-31T17:15Z (bohm regtest bulk-migration session) — the workload that surfaced this
~/dev/local/docker/regtest/docker-compose.dev.yml — restart: on-failure policy that fails to trigger because of the exit-code thing

## Summary `relayConnectionWatchdog` in `src/daemon/admin/index.ts:499-518` hard-codes a 60s partition threshold + 10s poll interval, and calls `process.exit(1)` when no relays are connected for ≥60s. **Both the threshold and the exit-on-partition behavior need to be operator-tunable** — the 60s default is too aggressive for several real operational patterns we've hit on the bohm regtest. ## What's happening Current code: ```ts async function relayConnectionWatchdog(ndk: NDK) { const POLL_INTERVAL_MS = 10_000; const PARTITION_THRESHOLD_MS = 60_000; let lastConnectedAt = Date.now(); setInterval(() => { const connectedCount = ndk.pool.connectedRelays().length; if (connectedCount > 0) { lastConnectedAt = Date.now(); return; } const elapsed = Date.now() - lastConnectedAt; if (elapsed > PARTITION_THRESHOLD_MS) { console.log(`❌ No connected relays for ${Math.floor(elapsed / 1000)}s. Exiting.`); process.exit(1); } }, POLL_INTERVAL_MS); } ``` The watchdog is a deliberate replacement for the older publish-echo watchdog (#4 + #7) — that change was correct. The issue isn't the design; it's the hard-coded threshold + hard-coded exit policy. ## Real-world patterns where 60s is too short (bohm regtest, 2026-05-31) 1. **lnbits container restart** — lnbits hosts the relay (`nostrrelay` extension). When we restart the lnbits container (`docker compose restart lnbits` — happens during dev iteration, deployment, env change), the bunker loses its only relay. Lnbits typically takes 30-60s to come back up + accept WS connections. Bunker hits the 60s threshold mid-restart and exits. 2. **Long bulk migrations** — driving 50+ accounts through the bunker via `RemoteBunkerSigner.provision()` from the lnbits-side migration tool keeps the admin channel busy + occasionally causes the underlying WS to momentarily drop. The bunker counts that as "no relays" and exits mid-migration, leaving accounts half-migrated. 3. **Network blips on multi-host deploys** (future) — when the bunker runs on a different host than its relay, brief packet loss / restart of the relay host trips the watchdog deterministically. The bunker exits even though NDK's pool will auto-reconnect on its own — the exit prevents the recovery from happening. ## Proposed fix Three changes: ### 1. Make thresholds env-configurable ```ts const POLL_INTERVAL_MS = Number(process.env.NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS ?? 10) * 1000; const PARTITION_THRESHOLD_MS = Number(process.env.NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDS ?? 60) * 1000; ``` Operators can tune for their relay topology. The defaults stay unchanged for backward compat. ### 2. Add a soft-fail mode ```ts const ON_PARTITION = (process.env.NSECBUNKER_RELAY_PARTITION_ACTION ?? "exit").toLowerCase(); // "exit" — current behavior: process.exit(1) // "log" — log loudly + emit a periodic warning, let NDK's pool reconnect ``` Useful for: - Dev environments where supervisor restart adds friction (no autorestart wrapper in docker-compose, or operator wants to keep the daemon alive across relay flaps for log inspection) - Single-relay deployments where the relay being down is OK temporarily — exiting just adds N seconds of unavailability after the relay comes back When set to `log`, the watchdog continues to emit `❌ No connected relays for Xs (action=log, NDK pool will reconnect)` every poll interval until connectivity returns, then logs `✅ Relay connectivity restored after Ys`. ### 3. Increase default threshold to something less aggressive Current 60s is shorter than the lnbits container restart cycle. A 300s (5min) default would survive normal restarts without losing the watchdog's "permanent partition" detection capability. This is the most opinionated of the three changes — happy to keep 60s default if you prefer + just let operators tune via env. But 300s would be more consistent with typical "I've lost my network entirely" semantics rather than "I had a brief WS reconnect." ## Aside: exit code discrepancy Code reads `process.exit(1)` but `docker ps -a` reports the bunker as `Exited (0)` after the watchdog fires. Compose policy is `restart: on-failure` which doesn't trigger on exit 0, hence the bunker stays down. Either: - The exit code is getting transformed somewhere up the stack (Node.js process wrapper, shell wrapper, etc.) - Or the watchdog isn't actually the path that fired (different exit path with code 0) Either way: changing this exit to `process.exit(1)` explicitly OR documenting why it's 0 would let operators set `restart: on-failure` and get the auto-restart they expect. Worth a separate investigation pass. ## Acceptance - [ ] `NSECBUNKER_RELAY_POLL_INTERVAL_SECONDS` + `NSECBUNKER_RELAY_PARTITION_THRESHOLD_SECONDS` env vars wired. Defaults unchanged or bumped per discussion in §3. - [ ] `NSECBUNKER_RELAY_PARTITION_ACTION=log|exit` env var wired. Default `exit` preserves backward compat. - [ ] When `log` mode is on + partition detected, daemon stays alive, emits periodic WARN, recovery emits a `✅` log line. - [ ] Exit code on partition is 1 (not 0) when `exit` mode fires — so docker `restart: on-failure` triggers correctly. Investigate why current observed exit code is 0. - [ ] Docs section spelling out the three env vars + recommended values for various deployment patterns (single-host dev vs. multi-host prod). ## Cross-references - `src/daemon/admin/index.ts:499-518` — current watchdog impl - `aiolabs/nsecbunkerd#4` + `#7` — the publish-echo watchdog this replaced (correct decision) - Coord log `2026-05-31T17:15Z` (bohm regtest bulk-migration session) — the workload that surfaced this - `~/dev/local/docker/regtest/docker-compose.dev.yml` — `restart: on-failure` policy that fails to trigger because of the exit-code thing

padreug referenced this issue

2026-06-19 08:41:17 +00:00

Design discussion / RFC: enforce token + grant lifecycle at sign time (the root behind #24) #25

padreug referenced this issue

2026-06-19 12:42:22 +00:00

Design discussion / RFC: enforce token + grant lifecycle at sign time (the root behind #24) #25

padreug referenced this issue

2026-06-19 15:57:24 +00:00

fix(acl): enforce token grant lifecycle live at sign time (#24, #25) #27