NDK gives up reconnecting to admin relay after ~3 ECONNREFUSED retries — bunker stays disconnected forever #20
Symptom
If the bunker container starts before its configured admin relay is accepting WebSockets (e.g. docker-compose brings up nsecbunker and lnbits in parallel, and lnbits's `nostrrelay` extension takes a few seconds to come up after uvicorn boots), NDK's relay-connectivity state machine retries ~3 times in ~3 seconds, fails on `ECONNREFUSED`, and then stops trying forever. The bunker process stays alive but is silently deaf: no kind:24134 admin event from lnbits ever reaches it.
Particularly nasty because:
- `✅ Connected to ws://...` is logged optimistically from `relay:connect`; the `ECONNREFUSED` follow-up that actually invalidates the connection doesn't propagate to user-facing logs.
- `NSEC_BUNKER_DISABLE_WATCHDOG=1` keeps the process alive (no clean exit-and-supervisor-restart loop), so docker's `restart: unless-stopped` policy never fires either.
- `docker compose restart nsecbunker` is the only recovery.
Reproduction (regtest dev stack)
1. `docker compose -f docker-compose.dev.yml up -d --build` (lnbits + nsecbunker both starting).
2. The bunker connects to `ws://lnbits:5001/nostrrelay/test` before nostrrelay's WS endpoint is ready.
3. With `DEBUG=ndk:*` on the bunker, observe the connection attempts fail with `ECONNREFUSED` and then stop.
4. Instrument the relay's `_handle_request` to log all REQ filters: only late-arriving REQs from lnbits's own admin client appear; the bunker's `{kinds:[24133,24134], #p:[bunker_admin_pubkey]}` never lands.
Suggested fix
- NDK's backoff/retry config: the bunker should be configured to retry indefinitely, or with a much larger maxRetries, for its primary admin relay. If NDK 3.0.3 doesn't expose that knob ergonomically, fall back to wrapping the connect call in our own supervision loop that retries until success, before considering `start()` complete.
- Alternatively, expose a healthcheck endpoint that fails until the admin relay subscription is confirmed registered (EOSE-acked), so docker / k8s can hold the bunker in `unhealthy` and restart it via the orchestrator's policy.
Context
`131f689` (post-NDK-3.0.3 bump #14, post-`Backend.start()` EOSE-await race fix #9)
🤖 Generated with Claude Code
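Sketch of the supervision-loop fallback
A minimal sketch of the "retry until success" wrapper mentioned under the suggested fix, assuming the admin relay connect call can be expressed as a function that rejects on failure. `connectUntilSuccess`, `ConnectFn`, and `RetryOptions` are hypothetical names, not NDK APIs, and the delay values are placeholders rather than tuned defaults.

```typescript
// Hypothetical helper, not part of NDK. `connect` stands in for whatever call
// opens the admin relay connection and rejects (e.g. on ECONNREFUSED).
type ConnectFn = () => Promise<void>;

interface RetryOptions {
  initialDelayMs: number; // wait after the first failed attempt
  maxDelayMs: number; // upper bound for the exponential backoff
  onRetry?: (attempt: number, err: unknown, nextDelayMs: number) => void;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Calls `connect` until it resolves, doubling the wait after each failure up
// to `maxDelayMs`. Never gives up. Resolves with the number of attempts made.
async function connectUntilSuccess(
  connect: ConnectFn,
  opts: RetryOptions = { initialDelayMs: 1000, maxDelayMs: 30000 },
): Promise<number> {
  let delay = opts.initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      await connect();
      return attempt;
    } catch (err) {
      // Surface every failure so refused connections show up in user-facing logs.
      opts.onRetry?.(attempt, err, delay);
      await sleep(delay);
      delay = Math.min(delay * 2, opts.maxDelayMs);
    }
  }
}
```

Under this assumption, `start()` would await `connectUntilSuccess(...)` before subscribing, and a healthcheck could report unhealthy until that promise resolves. The backoff is capped rather than unbounded so a relay that comes up late is picked up within `maxDelayMs`.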