NDK gives up reconnecting to admin relay after ~3 ECONNREFUSED retries — bunker stays disconnected forever #20

Closed
opened 2026-06-03 16:22:31 +00:00 by padreug · 0 comments
Owner

Symptom

If the bunker container starts before its configured admin relay is accepting WebSockets (e.g. docker-compose brings up nsecbunker and lnbits in parallel, and lnbits's nostrrelay extension takes a few seconds to come up after uvicorn boots), NDK's relay-connectivity state machine retries ~3 times in ~3 seconds, fails on ECONNREFUSED, and then stops trying forever. The bunker process stays alive but is silently deaf — no kind:24134 admin event from lnbits ever reaches it.

Particularly nasty because:

  • The bunker logs ✅ Connected to ws://... optimistically from relay:connect; the ECONNREFUSED follow-up that actually invalidates the connection doesn't propagate to user-facing logs.
  • NSEC_BUNKER_DISABLE_WATCHDOG=1 keeps the process alive (no clean exit-and-supervisor-restart loop), so docker's restart: unless-stopped policy never fires either.
  • Manual docker compose restart nsecbunker is the only recovery.
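
One way to keep the user-facing log honest is to derive "connected" from the most recent connectivity event rather than from the first connect alone. The sketch below is a minimal illustration, not the bunker's or NDK's code: `RelayStatusTracker`, `onConnect`, and `onDisconnect` are hypothetical names, and it assumes the bunker can hook both a connect and a disconnect/error notification for each relay URL.

```typescript
// Hypothetical helper: records the latest connectivity event per relay URL,
// so a later failure overrides an earlier optimistic "connected" state.
class RelayStatusTracker {
  private readonly connected: Record<string, boolean> = {};

  // Call from whatever handler fires when a relay reports a connection.
  onConnect(url: string): void {
    this.connected[url] = true;
    console.log(`Connected to ${url}`);
  }

  // Call from whatever handler fires on disconnect or socket error.
  onDisconnect(url: string, reason?: string): void {
    this.connected[url] = false;
    console.warn(`Lost connection to ${url}${reason ? `: ${reason}` : ""}`);
  }

  // True only if the most recent event for this URL was a connect.
  isConnected(url: string): boolean {
    return this.connected[url] === true;
  }
}
```

With something like this in place, the ECONNREFUSED follow-up would at least surface as a warning and flip the recorded state back to disconnected.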

Reproduction (regtest dev stack)

  1. Bring stack up cold: docker compose -f docker-compose.dev.yml up -d --build (lnbits + nsecbunker both starting).
  2. nsecbunker boots faster than lnbits's extension loading; NDK starts trying ws://lnbits:5001/nostrrelay/test before nostrrelay's WS endpoint is ready.
  3. With DEBUG=ndk:* on the bunker, observe:
     ndk:relay:ws://lnbits:5001/nostrrelay/test:connectivity67 WebSocket error: ErrorEvent {
       error: Error: connect ECONNREFUSED 172.18.0.2:5001
     }
     ndk:relay:ws://lnbits:5001/nostrrelay/test:connectivity918 Using standard backoff, attempt 0, delay 1000ms
     ndk:relay:ws://lnbits:5001/nostrrelay/test:connectivity918 Reconnecting in 1000
     ... (3 retries, then nothing)
  4. After ~5 seconds NDK stops retrying. Bunker remains "connected" per its own logs but the relay never sees a REQ from the bunker IP. Verified by patching nostrrelay's _handle_request to log all REQ filters: only late-arriving REQs from lnbits's own admin client appear; the bunker's {kinds:[24133,24134], #p:[bunker_admin_pubkey]} never lands.
  5. Manual restart of just the nsecbunker container (with lnbits now fully up) → NDK connects on first try, admin REQ lands.
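
The check in step 4 amounts to asking whether any logged REQ filter matches the bunker's admin subscription. A minimal sketch of that predicate follows, for illustration only: `ReqFilter` and `isBunkerAdminReq` are hypothetical names, the pubkey value used with it is a placeholder, and the real verification was done by patching nostrrelay's Python `_handle_request`, not with this code.

```typescript
// Subset of a Nostr REQ filter relevant to the bunker's admin subscription.
interface ReqFilter {
  kinds?: number[];
  "#p"?: string[];
}

// Returns true if the filter asks for both admin kinds (24133 and 24134)
// addressed to the given bunker admin pubkey.
function isBunkerAdminReq(filter: ReqFilter, adminPubkey: string): boolean {
  const kinds = filter.kinds ?? [];
  const pTags = filter["#p"] ?? [];
  return (
    kinds.indexOf(24133) !== -1 &&
    kinds.indexOf(24134) !== -1 &&
    pTags.indexOf(adminPubkey) !== -1
  );
}
```

In the failing case described above, no logged filter would satisfy this predicate; after the manual restart in step 5, one would.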

Suggested fix

Tune NDK's backoff/retry configuration: the bunker should retry indefinitely, or with a much larger maxRetries, for its primary admin relay. If NDK 3.0.3 doesn't expose that knob ergonomically, fall back to wrapping the connect call in our own supervision loop that retries until success before start() is considered complete.
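
A rough sketch of the fallback supervision loop, under stated assumptions: `connect` stands in for whatever call opens the admin relay connection and is assumed to reject on failure (for example on ECONNREFUSED). `RetryOptions`, `backoffDelayMs`, and `connectUntilSuccess` are hypothetical names, not NDK APIs.

```typescript
// Hypothetical options; none of these names come from NDK.
interface RetryOptions {
  baseDelayMs: number; // delay after the first failure
  maxDelayMs: number; // cap so the delay does not grow without bound
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Exponential backoff: baseDelayMs * 2^attempt, capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `connect` until it resolves and never gives up.
// Resolves with the number of failed attempts before success.
async function connectUntilSuccess(
  connect: () => Promise<void>,
  opts: RetryOptions,
): Promise<number> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 0; ; attempt++) {
    try {
      await connect();
      return attempt;
    } catch (err) {
      const delay = backoffDelayMs(attempt, opts.baseDelayMs, opts.maxDelayMs);
      console.warn(`admin relay connect failed (attempt ${attempt}), retrying in ${delay}ms`, err);
      await sleep(delay);
    }
  }
}
```

Awaiting `connectUntilSuccess(...)` before `start()` resolves would keep the bunker from reporting itself ready while the admin relay is still refusing connections. The capped delay keeps recovery reasonably fast once the relay comes up.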

Alternatively, expose a healthcheck endpoint that fails until the admin relay subscription is confirmed registered (EOSE-acked), so docker / k8s can hold the bunker in unhealthy and restart it via the orchestrator's policy.
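
The healthcheck decision itself could be as small as the sketch below. It is illustrative only: `AdminRelayState`, `HealthResponse`, and `adminRelayHealth` are hypothetical names, and it assumes the bunker tracks whether the admin relay socket is open and whether the admin subscription has been EOSE-acked.

```typescript
// Hypothetical state the bunker would maintain for its admin relay.
interface AdminRelayState {
  socketOpen: boolean; // most recent connectivity event was a connect
  eoseReceived: boolean; // relay acknowledged the admin subscription with EOSE
}

interface HealthResponse {
  status: 200 | 503;
  body: string;
}

// Healthy only once the admin subscription is confirmed registered.
function adminRelayHealth(state: AdminRelayState): HealthResponse {
  if (!state.socketOpen) {
    return { status: 503, body: "admin relay socket not open" };
  }
  if (!state.eoseReceived) {
    return { status: 503, body: "admin subscription not confirmed (no EOSE)" };
  }
  return { status: 200, body: "ok" };
}
```

An HTTP handler would return this status and body for the orchestrator's probe. One caveat worth checking for the target deployment: a plain Docker restart policy reacts to container exit rather than to health status, so acting on `unhealthy` generally needs an orchestrator (or helper) that does.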

Context

  • nsecbunkerd dev branch at 131f689 (post-NDK-3.0.3 bump #14, post-Backend.start() EOSE-await race fix #9)
  • Discovered 2026-06-03 while debugging a regtest signup-via-bunker hang. Sister bugs documented in #N and aiolabs/nostrmarket#M (filed alongside).
  • aio-demo production isn't hit because its bunker has uninterrupted uptime; the bug only surfaces on heavy-restart dev cycles. Worth fixing before any production restart workflow that doesn't preserve bunker uptime.

🤖 Generated with Claude Code

Reference
aiolabs/nsecbunkerd#20