Investigate NDK echo issue — RPC responses unreliably reach client subscriptions on custom relays #7
Summary
A cross-cutting issue underlying both #4 (pingOrDie watchdog false-positives) and #5 (get_keys silent failure): when the bunker publishes a NIP-46 / admin response event to a non-public relay channel, the corresponding client subscription on the same channel sometimes (often?) doesn't receive it. The publish succeeds (the relay accepts it and ACKs with OK); the subscriber just never sees the event.
This is the single biggest blocker for the LNbits-side integration in aiolabs/lnbits#18: if RPC responses don't reach the client reliably, no signing flow can be built on top.
Evidence so far
- Plain WebSocket round-trip on the same relay works flawlessly. From a Python client connecting to ws://localhost:5001/nostrrelay/test (LNbits's nostrrelay extension channel): subscribe for {kinds: [24133], authors: [pk]}, then publish a kind-24133 event signed by pk and tagged to itself. The client receives OK immediately and the EVENT echo within milliseconds; the full round-trip takes under 1 second. lnbits/core/services/nostr_transport/ has also been verified to work against this same channel for the entirely different kind-21000 nostr-transport plumbing.
- NDK 2.8.1's pingOrDie watchdog never receives its own ping. It publishes a kind-24133 event tagged to its own pubkey every 20s, subscribes for matching events, and never sees them. The bunker exits after the 50s death timer.
- Admin RPCs work sporadically. ping (admin kind 24134) round-trips successfully via a hand-rolled Python client to the bunker's admin endpoint. But get_keys after a create_new_key never returns (this is partly #5: the bunker throws, but the error response should still come back).
- The bunker's other connection paths use NDK the same way. The same issue would likely affect sign_event responses from a NIP-46 client perspective, which would make the whole signing flow unusable.
Hypotheses
In rough order of likelihood:
1. NDK's outbox model picks a different relay for publish vs. subscribe. NDK 2.x defaults to "outbox" routing: for each event, it picks relays based on the author's NIP-65 relay list. If the bunker's pubkey has no NIP-65 relay list published anywhere, NDK falls back to some default, possibly not the one in config.nostr.relays. The plain-WebSocket test bypasses this because there's no outbox logic.
2. NDK opens separate WebSocket connections for publish vs. subscribe. Even on a single-relay setup, NDK might dial twice: once for the publisher half, once for the subscriber half. If the subscriber connection's filter is registered AFTER the publish hits the relay, the relay won't backfill (kinds 24133 and 24134 are ephemeral, in the 20000-29999 range, and NOT stored by spec).
3. Race between subscribe-establishment and publish. The bunker's pingOrDie code subscribes via sub.start() and immediately schedules the publish via setInterval. If start() returns before the relay has actually accepted the REQ, the first publish goes through but no subscriber is registered yet to receive it.
4. A bug in @nostr-dev-kit/ndk@2.8.1 specifically. There's known churn in NDK around the 2.8/2.10 boundary. Worth testing the latest NDK separately.
Suggested verification spikes (any single one might answer it)
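Hypotheses 2 and 3 both reduce to the REQ not being live on the relay when the ephemeral EVENT arrives. A raw-WebSocket test client can rule that out on its own side by holding publishes until the relay has answered the subscription with EOSE. The following is a minimal Python sketch of that guard, not code from nsecbunkerd or NDK: the class name SubscribeThenPublish and the helper is_ephemeral are hypothetical, the transport is left out, and it assumes the relay sends EOSE only once the subscription is registered (which NIP-01 implies, but which has not been checked against the nostrrelay extension).

```python
import json

# NIP-01: kinds 20000-29999 are ephemeral; relays are not expected to store them.
EPHEMERAL_KINDS = range(20000, 30000)


def is_ephemeral(kind: int) -> bool:
    """True if a relay will not backfill this kind to a late subscriber."""
    return kind in EPHEMERAL_KINDS


class SubscribeThenPublish:
    """Hypothetical guard: queue EVENT frames until EOSE confirms our REQ."""

    def __init__(self, sub_id: str, filters: dict):
        self.sub_id = sub_id
        self.filters = filters
        self.ready = False
        self.pending: list[str] = []

    def req_frame(self) -> str:
        """The REQ frame to send first on the WebSocket."""
        return json.dumps(["REQ", self.sub_id, self.filters])

    def publish(self, event: dict) -> list[str]:
        """Return the frames that may be sent now; queue if the REQ is unconfirmed."""
        frame = json.dumps(["EVENT", event])
        if not self.ready:
            self.pending.append(frame)
            return []
        return [frame]

    def on_relay_frame(self, raw: str) -> list[str]:
        """Feed every relay message here; returns queued frames to flush, if any."""
        msg = json.loads(raw)
        if msg[0] == "EOSE" and msg[1] == self.sub_id and not self.ready:
            self.ready = True
            flushed, self.pending = self.pending, []
            return flushed
        return []
```

If the echo arrives reliably with this ordering enforced but not without it, the race is real and hypotheses 2 and 3 become the leading candidates; if the echo still goes missing, the ordering is not the cause.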
1. Patch nsecbunkerd to log all NDK pool events (relay:connect, relay:disconnect, relay:notice, individual subscribe/publish events) at startup, to confirm whether NDK is actually opening one or two connections to the same relay.
2. Patch nsecbunkerd to use a non-outbox publish path: bypass NDK's outbox logic by directly calling relay.publish(event) on a specific relay handle. If this fixes the echo problem, hypothesis #1 is correct and the upstream fix is to explicitly opt out of outbox mode for the daemon's own self-traffic.
3. Upgrade NDK to latest (2.10+ or 2.11+ if available) and re-test. If fixed, just bump the dep.
4. Switch the channel to a public relay (damus.io) and re-test the same flows. If it works against damus but not against our internal nostrrelay channel, the issue is something specific about how the LNbits nostrrelay extension handles certain message patterns (delivery batching? filter parsing? subscription ID handling?). Earlier independent testing suggested the channel is fine, but NDK's specific traffic shape might trigger different code paths in the relay.
5. Implement the LNbits-side RemoteBunkerSigner as a plain WebSocket client (no NDK on the client side) and test against the same bunker. If signing round-trips work from the Python client where they fail from the NDK-using admin path, the answer is "don't use NDK for the LNbits client": we use plain websockets + the NIP-44 v2 code we already have from PR #4. That sidesteps the whole investigation for our integration purposes.
Impact on the LNbits integration
This is the gating question for whether aiolabs/lnbits#18 proceeds with nsecbunkerd, OR pivots to a different bunker (e.g. building our own thin Go wrapper per the original aiobunker fallback plan).
Best-case scenario (most likely): a plain-WebSocket client from LNbits sidesteps the NDK issue entirely. We get reliable round-trips, the integration ships, and the NDK issue is something nsecbunkerd needs to fix in its own admin / NIP-46 server-side use of NDK but doesn't affect us.
Worst-case scenario: even raw-WebSocket clients can't reliably get responses from the bunker over our internal relay channel. Then we need a different transport (loopback unix socket, dedicated bunker-local relay) or a different bunker entirely.
Acceptance
Cross-refs
- #4 (pingOrDie watchdog): manifestation #1.
- #5 (get_keys silent failure): manifestation #2.
- aiolabs/lnbits#18: the integration that depends on resolving this.
- ~/dev/lnbits/nsec-bunker-spike-findings.md: full spike log including the plain-WebSocket round-trip evidence.
startKey passes bech32 nsec to NDKPrivateKeySigner: every newly-created key fails to load #8