Swap daemon relay transport from NDK to nostr-tools (root fix for the resubscribe-on-reconnect bug, #41) #42
Why
#41 (bunker goes silently deaf after a relay flap) is the latest in a long line of reconnect patches (#4 → #7 → #20 → #21), and it will not be the last, because the root cause is structural: NDK does not replay subscriptions on reconnect. An NDKRelaySubscription registers relay.once("ready", executeOnRelayReady), which fires once on the initial connect and then offs the listener, and onConnect() never iterates openSubs to re-issue REQs. So a running subscription is dead after any reconnect, by design, not by a bug we can wait for upstream to fix. Confirmed against the current NDK source (mirror 4b86acd1, 2026-04-05): core/src/relay/connectivity.ts + core/src/relay/subscription.ts.
nostr-tools, by contrast, resubscribes on reconnect natively (AbstractRelay reconnect + resubscribeBackoff). It even just hardened the exact failure mode we hit in fiatjaf/nostr-tools@455124e (2026-06-20): "a long-running daemon goes silently deaf on a subscription after the relay operator restarts the relay… the pool still considers the relay connected", which is our #41 nearly verbatim.
And it's what the field already converged on. Every comparable signing daemon over nostr uses nostr-tools, not NDK, and binds resubscribe to reconnect:
- lightning.pub (nostrRelayConnection.ts): Relay
- signet (relay-pool.ts): SimplePool
- […] pool-reset→recreate all
- […] ^2.x + ws
nsecbunkerd is the outlier still supervising NDK's split connection/subscription lifecycle.
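The difference between binding a REQ to the first "ready" only and re-issuing it on every "ready" can be modelled in a few lines. This is a minimal simulation, not NDK or nostr-tools code: FakeRelay, subscribeOnce, subscribeWithReplay, flap and the subscription id are all hypothetical names introduced for illustration.

```typescript
type Listener = () => void;

// Hypothetical stand-in for a relay connection (not a real NDK/nostr-tools class).
class FakeRelay {
  readonly sentReqs: string[] = [];
  private readyListeners: { fn: Listener; once: boolean }[] = [];

  on(fn: Listener): void {
    this.readyListeners.push({ fn, once: false });
  }

  once(fn: Listener): void {
    this.readyListeners.push({ fn, once: true });
  }

  // Simulates the socket (re)opening: every connect fires the "ready" listeners,
  // and one-shot listeners are dropped after their first call.
  connect(): void {
    const current = this.readyListeners;
    this.readyListeners = current.filter((l) => !l.once);
    for (const l of current) l.fn();
  }

  // Records a REQ that would have been sent over the wire.
  req(subId: string): void {
    this.sentReqs.push(subId);
  }
}

// One-shot pattern: the REQ is tied to the first "ready" only.
function subscribeOnce(relay: FakeRelay, subId: string): void {
  relay.once(() => relay.req(subId));
}

// Replay pattern: the REQ is re-issued on every "ready".
function subscribeWithReplay(relay: FakeRelay, subId: string): void {
  relay.on(() => relay.req(subId));
}

// Connects, then reconnects once (a relay flap), and returns the REQs sent.
function flap(subscribe: (relay: FakeRelay, subId: string) => void): string[] {
  const relay = new FakeRelay();
  subscribe(relay, "bunker-sub");
  relay.connect(); // initial connect
  relay.connect(); // reconnect after the flap
  return relay.sentReqs;
}

const onceReqs = flap(subscribeOnce);
const replayReqs = flap(subscribeWithReplay);
console.log(onceReqs.length, replayReqs.length); // prints "1 2"
```

In the one-shot case the second connect sends nothing, so the socket is up but no subscription exists on the relay, which matches the live-but-deaf symptom in #41.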
Scope
Replace only the daemon's relay/transport layer with a nostr-tools SimplePool (pinned ≥ 455124e). The ACL / token-lifecycle logic, our live-enforcement work from #25 and the reason we keep this fork over Signet (Signet re-ships our #24 expiry/usage-cap bug), is out of scope and stays untouched.
Concretely:
- […] pool-reset→recreate).
- […] run.ts) and the admin interface NDK (admin/index.ts) onto it; preserve the kind:24133 #p-pinned-to-explicit-relays behavior (#21) and the await-EOSE-before-resolve start semantics (#9).
- […] admin/index.ts:512, connectedRelays().length) with a session-liveness check + signet-style heartbeat time-jump (sleep/wake) detection, so the safety net can no longer be blinded by a live-but-deaf socket (the regression #20 introduced).
- […] relay-reconnect.ts (attachIndefiniteReconnect); its job moves into the pool.
Acceptance
- […] sign_event is still answered. (This test was missing every prior round; it's the durable artifact.)
Relationship to #41
#41 stays as the bug report / incident record (two demo outages, 2026-06-23 + 2026-06-26, recovered only by manual systemctl restart nsecbunkerd). This issue is the chosen root fix. The alternative, a targeted NDK patch that manually re-executes subs on every relay:connect, was considered and rejected as another patch on an architecture the rest of the ecosystem has left behind.
refs: #41, #21, #20, #9; fiatjaf/nostr-tools@455124e; lightning.pub nostrRelayConnection.ts, signet relay-pool.ts, FROSTR igloo; NDK core/src/relay/{connectivity,subscription}.ts (no reconnect-replay).
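The heartbeat time-jump (sleep/wake) detection named under Scope could be sketched roughly as follows. This is an illustration under stated assumptions, not signet's implementation: the Heartbeat class, its parameters, and the 10 s interval / 5 s tolerance are hypothetical, and the clock is passed in explicitly so the logic can be exercised without timers.

```typescript
interface HeartbeatResult {
  gapMs: number;     // wall-clock time since the previous tick
  timeJump: boolean; // true when the gap is too large for a normal tick
}

// Hypothetical detector: a timer is expected to tick every intervalMs.
// If the observed gap exceeds intervalMs + toleranceMs, the process was
// probably suspended (sleep/wake) and relay state should not be trusted.
class Heartbeat {
  private lastTickMs: number;
  private readonly thresholdMs: number;

  constructor(intervalMs: number, toleranceMs: number, nowMs: number) {
    this.thresholdMs = intervalMs + toleranceMs;
    this.lastTickMs = nowMs;
  }

  // Call on every timer tick with the current wall-clock time in ms.
  tick(nowMs: number): HeartbeatResult {
    const gapMs = nowMs - this.lastTickMs;
    this.lastTickMs = nowMs;
    return { gapMs, timeJump: gapMs > this.thresholdMs };
  }
}

// Assumed values: 10 s heartbeat with 5 s tolerance, starting at t = 0.
const hb = new Heartbeat(10_000, 5_000, 0);
const normal = hb.tick(10_200);      // ordinary tick, 10.2 s after start
const afterSleep = hb.tick(130_200); // next tick arrives 120 s later
console.log(normal.timeJump, afterSleep.timeJump); // prints "false true"
```

In a daemon this would presumably be fed Date.now() from a setInterval callback, with a detected jump triggering whatever recovery the transport layer provides; that wiring is left out because the source does not specify it.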