Investigate NDK echo issue — RPC responses unreliably reach client subscriptions on custom relays #7


Open

opened 2026-05-25 21:57:14 +00:00 by padreug · 0 comments

padreug commented

2026-05-25 21:57:14 +00:00

Owner

Summary

A cross-cutting issue underlying both #4 (pingOrDie watchdog false-positives) and #5 (get_keys silent failure): when the bunker publishes a NIP-46 / admin response event to a non-public relay channel, the corresponding client subscription on the same channel sometimes (often?) doesn't receive it. The publish succeeds (the relay accepts it and ACKs with OK); the subscriber just never sees the event.

This is the single biggest blocker for the LNbits-side integration in aiolabs/lnbits#18 — if RPC responses don't reach the client reliably, no signing flow can be built on top.

Evidence so far

1. Plain WebSocket round-trip on the same relay works flawlessly. From a Python client connecting to ws://localhost:5001/nostrrelay/test (LNbits's nostrrelay extension channel): subscribe for {kinds: [24133], authors: [pk]}, then publish a kind-24133 event signed by pk and tagged to itself. The client receives OK immediately and the EVENT echo within milliseconds; the full round-trip takes under 1 second. Separately, lnbits/core/services/nostr_transport/ is verified to work against this same channel for the unrelated kind-21000 nostr-transport plumbing.
2. NDK 2.8.1's pingOrDie watchdog never receives its own ping. It publishes a kind-24133 event tagged to its own pubkey every 20s and subscribes for matching events, but never sees them. The bunker exits when the 50s death timer expires.
3. Admin RPCs work sporadically. ping (admin kind 24134) round-trips successfully via a hand-rolled Python client to the bunker's admin endpoint. But:
   - get_keys after a create_new_key never returns (this is partly #5: the bunker throws, but the error response should still come back).
   - The pattern across requests is roughly alternating success/failure, which suggests a timing problem rather than a method-specific one.
4. The bunker's other connection paths use NDK the same way. The same issue would likely affect sign_event responses from a NIP-46 client's perspective, which would make the whole signing flow unusable.
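The plain-WebSocket round-trip from the first evidence item can be sketched at the message level as below. This is a minimal illustration, not the actual spike script: the helper names (`event_id`, `build_self_ping`, `matches_filter`, `req_frame`, `is_echo`) are hypothetical, only the standard library is used, and signing is deliberately left out, so a real relay would reject these events until a `sig` field is added.

```python
import hashlib
import json
import time
from typing import Optional


def event_id(pubkey: str, created_at: int, kind: int, tags: list, content: str) -> str:
    """NIP-01 event id: sha256 over the compact JSON array serialization."""
    payload = json.dumps(
        [0, pubkey, created_at, kind, tags, content],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_self_ping(pubkey: str, content: str = "ping",
                    created_at: Optional[int] = None) -> dict:
    """Unsigned kind-24133 event tagged to its own author (no `sig` field)."""
    created_at = int(time.time()) if created_at is None else created_at
    tags = [["p", pubkey]]
    return {
        "id": event_id(pubkey, created_at, 24133, tags, content),
        "pubkey": pubkey,
        "created_at": created_at,
        "kind": 24133,
        "tags": tags,
        "content": content,
    }


def matches_filter(event: dict, flt: dict) -> bool:
    """Subset of NIP-01 filter matching: kinds, authors, single-letter tag filters."""
    if "kinds" in flt and event["kind"] not in flt["kinds"]:
        return False
    if "authors" in flt and event["pubkey"] not in flt["authors"]:
        return False
    for key, wanted in flt.items():
        if key.startswith("#") and len(key) == 2:
            have = {t[1] for t in event["tags"] if len(t) >= 2 and t[0] == key[1]}
            if not have.intersection(wanted):
                return False
    return True


def req_frame(sub_id: str, flt: dict) -> str:
    """Client-to-relay subscription request."""
    return json.dumps(["REQ", sub_id, flt])


def is_echo(raw: str, sub_id: str, expected_id: str) -> bool:
    """True only for an EVENT frame on our subscription carrying our event id."""
    msg = json.loads(raw)
    return (
        isinstance(msg, list)
        and len(msg) == 3
        and msg[0] == "EVENT"
        and msg[1] == sub_id
        and msg[2].get("id") == expected_id
    )
```

A test client would send `req_frame(...)` first, then the signed event, and treat the run as a success only when `is_echo` returns true for some incoming frame. An `OK` frame alone confirms acceptance by the relay, not delivery to the subscriber, which is exactly the distinction this issue is about.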

Hypotheses

In rough order of likelihood:

1. NDK's outbox model picks a different relay for publish vs. subscribe. NDK 2.x defaults to "outbox" routing: for each event, pick relays based on the author's NIP-65 relay list. If the bunker's pubkey has no NIP-65 list published anywhere, NDK falls back to some default, possibly not the one in config.nostr.relays. The plain-WebSocket test bypasses this because it has no outbox logic.
2. NDK opens separate WebSocket connections for publish vs. subscribe. Even on a single-relay setup, NDK might dial twice: once for the publisher half, once for the subscriber half. If the subscriber connection's filter is registered AFTER the publish hits the relay, the relay won't backfill (kinds 24133 and 24134 are ephemeral, i.e. in the 20000-29999 range, and are NOT stored by spec).
3. Race between subscribe-establishment and publish. The bunker's pingOrDie code subscribes via sub.start() and immediately schedules the publish via setInterval. If start() returns before the relay has actually accepted the REQ, the first publish goes through but no subscriber is registered yet to receive it.
4. A bug in @nostr-dev-kit/ndk@2.8.1 specifically. There is known churn in NDK around the 2.8/2.10 boundary, so it is worth testing the latest NDK separately.
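The ordering problem behind the second and third hypotheses can be illustrated with a toy in-memory relay. This is a simplified model written for this issue, not the nostrrelay extension's code and not NDK's behaviour; `ToyRelay` and its methods are hypothetical names. It only shows why an ephemeral event published before the REQ is registered gets an OK and is then lost for good.

```python
from dataclasses import dataclass, field


def is_ephemeral(kind: int) -> bool:
    """NIP-01 ephemeral range: relays are not expected to store these."""
    return 20000 <= kind < 30000


@dataclass
class ToyRelay:
    stored: list = field(default_factory=list)  # persisted (non-ephemeral) events
    subs: dict = field(default_factory=dict)    # sub_id -> (kinds, inbox list)

    def req(self, sub_id: str, kinds: list) -> list:
        """Register a subscription; backfill from stored events only."""
        inbox = [ev for ev in self.stored if ev["kind"] in kinds]
        self.subs[sub_id] = (set(kinds), inbox)
        return inbox

    def publish(self, event: dict) -> bool:
        """Deliver to live subscriptions, store if not ephemeral, always ACK."""
        for kinds, inbox in self.subs.values():
            if event["kind"] in kinds:
                inbox.append(event)
        if not is_ephemeral(event["kind"]):
            self.stored.append(event)
        return True  # OK is sent whether or not anyone was listening


relay = ToyRelay()
relay.publish({"id": "early", "kind": 24133})   # accepted, but nobody subscribed yet
inbox = relay.req("watchdog", [24133])          # no backfill for ephemeral kinds
relay.publish({"id": "late", "kind": 24133})    # delivered to the live subscription
print([ev["id"] for ev in inbox])
```

In this model only the event published after the REQ reaches the subscriber. If either hypothesis holds, waiting for the relay's EOSE on the subscription before the first publish should be enough to change the outcome.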

Suggested verification spikes (any single one might answer it)

1. Patch nsecbunkerd to log all NDK pool events (relay:connect, relay:disconnect, relay:notice, individual subscribe/publish events) at startup, to confirm whether NDK is actually opening one or two connections to the same relay.
2. Patch nsecbunkerd to use a non-outbox publish path: bypass NDK's outbox logic by directly calling relay.publish(event) on a specific relay handle. If this fixes the echo problem, hypothesis #1 is correct and the upstream fix is to explicitly opt out of outbox mode for the daemon's own self-traffic.
3. Upgrade NDK to latest (2.10+ or 2.11+ if available) and re-test. If fixed, just bump the dep.
4. Switch the channel to a public relay (damus.io) and re-test the same flows. If it works against damus but not against our internal nostrrelay channel, the issue is something specific about how the LNbits nostrrelay extension handles certain message patterns (delivery batching? filter parsing? subscription ID handling?). Earlier independent testing suggested the channel is fine, but NDK's specific traffic shape might trigger different code paths in the relay.
5. Implement the LNbits-side RemoteBunkerSigner as a plain WebSocket client (no NDK on the client side) and test against the same bunker. If signing round-trips work from the Python client where they fail from the NDK-using admin path, the answer is "don't use NDK for the LNbits client": we use plain websockets plus the NIP-44 v2 code we already have from PR #4. That sidesteps the whole investigation for our integration purposes.
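For the last spike, the part of a plain-WebSocket client that matters most to this issue is request/response correlation with a timeout, so that a lost response shows up as an explicit failure instead of a silent hang. A minimal sketch follows, assuming the decrypted payloads are the NIP-46 JSON shapes (`{id, method, params}` for requests, `{id, result, error}` for responses). `PendingRequests`, `call`, and the `send` callable are hypothetical names, and encryption, signing, and the WebSocket itself are left out.

```python
import asyncio
import json


class PendingRequests:
    """Maps request ids to futures that are completed by incoming responses."""

    def __init__(self):
        self._futures = {}  # request id -> asyncio.Future

    def expect(self, request_id: str) -> asyncio.Future:
        fut = asyncio.get_running_loop().create_future()
        self._futures[request_id] = fut
        return fut

    def forget(self, request_id: str) -> None:
        self._futures.pop(request_id, None)

    def resolve(self, decrypted_payload: str) -> bool:
        """Feed one decrypted response; returns False if nobody is waiting."""
        msg = json.loads(decrypted_payload)
        fut = self._futures.pop(msg.get("id"), None)
        if fut is None or fut.done():
            return False
        if msg.get("error"):
            fut.set_exception(RuntimeError(msg["error"]))
        else:
            fut.set_result(msg.get("result"))
        return True


async def call(pending, send, request_id, method, params, timeout=5.0):
    """Register the waiter BEFORE sending, then wait with a hard timeout."""
    fut = pending.expect(request_id)
    try:
        await send(json.dumps({"id": request_id, "method": method, "params": params}))
        return await asyncio.wait_for(fut, timeout)
    finally:
        pending.forget(request_id)
```

Registering the future before sending mirrors the subscribe-before-publish ordering discussed in the hypotheses. With this shape, a get_keys call that never gets an answer raises `asyncio.TimeoutError` after `timeout` seconds, which makes success/failure rates across repeated calls easy to count.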

Impact on the LNbits integration

This is the gating question for whether aiolabs/lnbits#18 proceeds with nsecbunkerd, OR pivots to a different bunker (e.g. building our own thin Go wrapper per the original aiobunker fallback plan).

Best-case scenario (most likely): plain-WebSocket client from LNbits sidesteps the NDK issue entirely. We get reliable round-trips, the integration ships, the NDK issue is something nsecbunkerd needs to fix in its own admin / NIP-46 server-side use of NDK but doesn't affect us.

Worst-case scenario: even raw-WebSocket clients can't reliably get responses from the bunker over our internal relay channel. Then we need a different transport (loopback unix socket, dedicated bunker-local relay) or a different bunker entirely.

Acceptance

- [ ] Spike #5 from the list above (plain-WebSocket client end-to-end): answers whether this blocks our LNbits integration.
- [ ] Spike #4 (public-relay control test): narrows down whether the issue is NDK-side or relay-side.
- [ ] If NDK-side: spike #2 or #3, to prove the underlying mechanism.
- [ ] Patch upstream once the mechanism is understood.

Cross-refs

- #4 (pingOrDie watchdog): manifestation #1.
- #5 (get_keys silent failure): manifestation #2.
- aiolabs/lnbits#18: the integration that depends on resolving this.
- ~/dev/lnbits/nsec-bunker-spike-findings.md: full spike log including the plain-WebSocket round-trip evidence.


padreug referenced this issue

2026-05-25 22:30:43 +00:00

startKey passes bech32 nsec to NDKPrivateKeySigner — every newly-created key fails to load #8

padreug referenced this issue from a commit

2026-05-25 22:34:13 +00:00

disable pingOrDie watchdog — false-positives on non-public relays

padreug referenced this issue from a commit

2026-05-26 07:21:16 +00:00

feat(core): NsecBunkerAdminClient — async admin RPC over plain WebSocket

padreug referenced this issue from a commit

2026-05-27 16:24:54 +00:00

diag(#7): env-gated per-relay transport instrumentation