pingOrDie self-watchdog false-positives → bunker exits every 30s on non-public relays #4

Closed
opened 2026-05-25 21:52:33 +00:00 by padreug · 0 comments
Owner

Symptom

After getting the bunker to boot (#1, #2, #3) and pointing it at a non-public relay channel (e.g. an LNbits nostrrelay/test instance running on the same host), the bunker successfully connects and reports:

✅ adminNpubs: npub1...
✅ Connected to ws://lnbits:5001/nostrrelay/test
✅ nsecBunker ready to serve requests.
✅ nsecBunker Admin Interface ready

…then 30 seconds later:

🔔 Sent ping event: 1779743172
🔔 Sent ping event: 1779743192
❌ No ping event received in 30 seconds. Exiting.

Container exits with code 0 (so restart: on-failure doesn't even kick in).

Root cause

src/daemon/admin/index.ts:pingOrDie is a self-watchdog: every 20s the bunker publishes a kind-24133 event tagged to its own pubkey and listens (via the same NDK instance) for matching events. If no echo arrives within 50s (the death timer; note that the log message says 30 seconds), it calls process.exit(1).
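
The watchdog pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual pingOrDie code: EchoDeadline, EchoTransport and startSelfWatchdog are hypothetical names, the transport interface stands in for the NDK publish/subscribe calls, and the 20s/50s defaults are taken from the description in this issue. Unlike a single reset-on-echo timeout, the sketch polls the deadline once per second.

```typescript
// Tracks when the last echo arrived and decides whether the deadline has passed.
class EchoDeadline {
  private lastEchoMs: number;
  private readonly timeoutMs: number;

  constructor(timeoutMs: number, startMs: number) {
    this.timeoutMs = timeoutMs;
    this.lastEchoMs = startMs;
  }

  recordEcho(nowMs: number): void {
    this.lastEchoMs = nowMs;
  }

  isExpired(nowMs: number): boolean {
    return nowMs - this.lastEchoMs >= this.timeoutMs;
  }
}

// Hypothetical stand-in for the NDK calls that publish the ping event
// and deliver matching events back to the process.
interface EchoTransport {
  publishPing(): void;
  onEcho(handler: () => void): void;
}

// Starts the watchdog and returns a function that stops it.
// onDeath would be something like () => process.exit(1) in a daemon.
function startSelfWatchdog(
  transport: EchoTransport,
  onDeath: () => void,
  pingIntervalMs: number = 20_000,
  deathTimeoutMs: number = 50_000,
): () => void {
  const deadline = new EchoDeadline(deathTimeoutMs, Date.now());
  transport.onEcho(() => deadline.recordEcho(Date.now()));

  // Publish a ping on a fixed interval.
  const pingTimer = setInterval(() => transport.publishPing(), pingIntervalMs);

  // Poll the deadline once per second; fire onDeath once it has passed.
  const checkTimer = setInterval(() => {
    if (deadline.isExpired(Date.now())) {
      clearInterval(pingTimer);
      clearInterval(checkTimer);
      onDeath();
    }
  }, 1_000);

  return () => {
    clearInterval(pingTimer);
    clearInterval(checkTimer);
  };
}
```

If the transport never delivers the echo, as reported here for NDK 2.8.1 with a single custom relay, the deadline expires even though the relay connection itself is healthy.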

Two separate problems make this fire spuriously on our setup:

  1. process.exit(1) is documented in the log message but the actual call is process.exit(0) somewhere upstream — the container exits with 0, so restart: on-failure doesn't restart it.
  2. NDK 2.8.1's pool / outbox model doesn't reliably deliver self-published events back through the same subscription when the relay set is one custom relay. We verified independently from a plain Python WebSocket client that the relay DOES echo kind-24133 events back to subscribers correctly (subscribe-then-publish-then-receive round-trip works in <1s). The failure is on the NDK side.

So the watchdog kills the bunker even though the admin RPCs are healthy: ping and create_new_key both succeed over the same relay channel (see aiolabs/lnbits/issues/18 for the spike findings).
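
For reference, the subscribe-then-publish-then-receive check mentioned above relies on the standard NIP-01 message framing: the client sends a REQ frame and the relay answers with EVENT frames for that subscription. The helpers below are a hedged sketch of that framing only; buildEchoSubscription and isEchoFor are hypothetical names, and the use of a "#p" tag filter is an assumption about how the ping event is tagged to the bunker's own pubkey.

```typescript
type RelayFilter = { kinds: number[]; "#p": string[] };
type ReqFrame = ["REQ", string, RelayFilter];

// Builds the NIP-01 REQ frame that subscribes to kind-24133 events
// tagged with the given pubkey (assumption: the tag is a "p" tag).
function buildEchoSubscription(subId: string, pubkeyHex: string): ReqFrame {
  return ["REQ", subId, { kinds: [24133], "#p": [pubkeyHex] }];
}

// Returns true when a raw relay message is an EVENT frame for our
// subscription that carries the event id we just published.
function isEchoFor(rawMessage: string, subId: string, eventId: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawMessage);
  } catch {
    return false; // not JSON, so not a relay frame we understand
  }
  if (!Array.isArray(parsed) || parsed.length < 3) return false;
  const [type, sub, event] = parsed;
  if (type !== "EVENT" || sub !== subId) return false;
  return (
    typeof event === "object" &&
    event !== null &&
    (event as { id?: unknown }).id === eventId
  );
}
```

A probe would send JSON.stringify(buildEchoSubscription(...)) over the WebSocket, publish the signed event, and then pass each incoming message to isEchoFor until it returns true or a timeout elapses.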

Fix we applied

We commented out the pingOrDie(this.ndk) call in src/daemon/admin/index.ts:125. The bunker stays up indefinitely afterward.

-            pingOrDie(this.ndk);
+            /* pingOrDie disabled — NDK 2.8.1 outbox model doesn't echo
+             * through non-public relay channels reliably; the watchdog
+             * fires false positives. Revisit when wiring real client work. */

Real fix candidates

In rough order of investment:

  1. Make the watchdog optional via env flag (DISABLE_PING_WATCHDOG=1) — quick fix, lets operators turn it off when they know their relay setup is fine.
  2. Upgrade NDK to a version that handles the publish-then-subscribe echo correctly. Worth testing 2.10+ first; the problem might already be fixed there.
  3. Replace the watchdog with a direct relay-connectivity check (e.g. pool.connectedRelayCount() > 0) rather than the round-trip-via-self pattern. Simpler, fewer moving parts.
  4. Fix the exit code — the message says exit 1 but the actual call is exit 0 (somewhere). Should at minimum be consistent so restart: on-failure works.
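
Candidates 1, 3 and 4 could be combined along these lines. This is a sketch under stated assumptions, not a patch against nsecbunkerd: RelayPoolLike, evaluateWatchdog and WATCHDOG_EXIT_CODE are hypothetical names, DISABLE_PING_WATCHDOG is the flag proposed in this issue, and connectedRelayCount() mirrors the example above and has not been checked against the real NDK pool API.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical minimal view of a relay pool; the method name follows the
// example in this issue and may not match the actual NDK API.
interface RelayPoolLike {
  connectedRelayCount(): number;
}

// A non-zero code so that a `restart: on-failure` policy restarts the container.
const WATCHDOG_EXIT_CODE = 1;

// The watchdog is skipped entirely when the operator sets the flag to "1".
function isWatchdogDisabled(env: Env): boolean {
  return env["DISABLE_PING_WATCHDOG"] === "1";
}

// Returns the exit code the process should use, or null to keep running.
function evaluateWatchdog(pool: RelayPoolLike, env: Env): number | null {
  if (isWatchdogDisabled(env)) return null;
  return pool.connectedRelayCount() > 0 ? null : WATCHDOG_EXIT_CODE;
}

// Possible wiring inside the daemon (left as a comment because it needs
// the Node.js globals and a real pool):
//
//   setInterval(() => {
//     const code = evaluateWatchdog(ndk.pool, process.env);
//     if (code !== null) process.exit(code);
//   }, 20_000);
```

Keeping the decision in a pure function makes the exit code explicit in one place and lets the flag and connectivity logic be tested without a relay.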

Acceptance

  • [x] Workaround applied (watchdog disabled).
  • [ ] NDK upgrade tested to see if it resolves the underlying issue.
  • [ ] Decision on whether to upstream a flag-controlled version of the watchdog.

Cross-refs

  • Discovered during the aiolabs/lnbits#18 phase 2 spike.
  • Spike findings: ~/dev/lnbits/nsec-bunker-spike-findings.md ("pingOrDie watchdog disabled" section).
  • See also: same NDK echo issue may bite get_keys responses (#5) and possibly future client-side signing flows. Worth investigating together.
Reference
aiolabs/nsecbunkerd#4