Swap daemon relay transport from NDK to nostr-tools (root fix for the resubscribe-on-reconnect bug, #41) #42

Open
opened 2026-06-26 21:19:21 +00:00 by padreug · 0 comments
Owner

Why

#41 (bunker goes silently deaf after a relay flap) is the latest in a long line of reconnect patches (#4 → #7 → #20 → #21), and it will not be the last, because the root cause is structural: NDK does not replay subscriptions on reconnect. An NDKRelaySubscription registers relay.once("ready", executeOnRelayReady), which fires once on the initial connect and then removes the listener, and onConnect() never iterates openSubs to re-issue REQs. So a running subscription is dead after any reconnect. That is by design, not a bug we can wait for upstream to fix. Confirmed against the current NDK source (mirror 4b86acd1, 2026-04-05): core/src/relay/connectivity.ts + core/src/relay/subscription.ts.
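
The one-shot listener problem described above can be reproduced in a toy model. This is a minimal sketch, not NDK's code: ToyRelay and its methods are hypothetical stand-ins that only mimic the difference between a listener registered with once and one registered with on.

```typescript
// Toy model only: ToyRelay is a hypothetical stand-in, not NDK's relay class.
type ReadyListener = () => void;

class ToyRelay {
  // REQs known to the current socket; a new socket starts with none.
  activeSubs = new Set<string>();
  private listeners: { fn: ReadyListener; once: boolean }[] = [];

  on(fn: ReadyListener): void {
    this.listeners.push({ fn, once: false });
  }

  once(fn: ReadyListener): void {
    this.listeners.push({ fn, once: true });
  }

  sendReq(subId: string): void {
    this.activeSubs.add(subId);
  }

  // The socket drops: every REQ issued on it is gone.
  disconnect(): void {
    this.activeSubs.clear();
  }

  // The socket becomes ready (initial connect or any reconnect).
  emitReady(): void {
    const current = this.listeners;
    // One-shot listeners are removed after their first "ready".
    this.listeners = current.filter((l) => !l.once);
    current.forEach((l) => l.fn());
  }
}

function subsAfterFlap(): string[] {
  const relay = new ToyRelay();
  relay.once(() => relay.sendReq("one-shot")); // registered once, as in the pattern above
  relay.on(() => relay.sendReq("replayed")); // re-issued on every "ready"

  relay.emitReady(); // initial connect: both REQs are sent
  relay.disconnect(); // relay flap
  relay.emitReady(); // reconnect: only the persistent listener fires

  return Array.from(relay.activeSubs);
}

console.log(subsAfterFlap()); // [ 'replayed' ]
```

The subscription registered with once never reaches the new socket, which is the "silently deaf" state: the connection is up, but no REQ is open on it.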

nostr-tools, by contrast, resubscribes on reconnect natively (AbstractRelay reconnect + resubscribeBackoff). It recently hardened the exact failure mode we hit, in fiatjaf/nostr-tools@455124e (2026-06-20): "a long-running daemon goes silently deaf on a subscription after the relay operator restarts the relay… the pool still considers the relay connected", which is our #41 nearly verbatim.

And it's what the field already converged on. Every comparable signing daemon over nostr uses nostr-tools, not NDK, and binds resubscribe to reconnect:

| Repo | Transport | Model |
|---|---|---|
| lightning.pub (nostrRelayConnection.ts) | nostr-tools Relay | per-relay connect-loop; resubscribe bound to reconnect |
| signet (relay-pool.ts) | nostr-tools SimplePool | subscription registry + heartbeat + pool-reset→recreate all |
| FROSTR / Bifrost+Igloo | nostr-tools ^2.x + ws | own reconnect over nostr-tools |

nsecbunkerd is the outlier still supervising NDK's split connection/subscription lifecycle.

Scope

Replace only the daemon's relay/transport layer with a nostr-tools SimplePool (pinned ≥455124e). The ACL / token-lifecycle logic — our live-enforcement work from #25, which is the reason we keep this fork over Signet (Signet re-ships our #24 expiry/usage-cap bug) — is out of scope and stays untouched.

Concretely:

  • Stand up a relay pool over nostr-tools that binds resubscribe to (re)connect (lightning.pub pattern), keeping an explicit registry of active subscriptions (the Backend's per-key kind:24133 subs + the admin sub) and recreating them on every reconnect (signet's pool-reset→recreate).
  • Port the daemon backend's NDK usage (run.ts) and the admin interface's NDK usage (admin/index.ts) onto it; preserve the kind:24133 #p-pinned-to-explicit-relays behavior (#21) and the await-EOSE-before-resolve start semantics (#9).
  • Replace the connection-only watchdog (admin/index.ts:512, connectedRelays().length) with a session-liveness check + signet-style heartbeat time-jump (sleep/wake) detection, so the safety net can no longer be blinded by a live-but-deaf socket (the regression #20 introduced).
  • Retire relay-reconnect.ts (attachIndefiniteReconnect) — its job moves into the pool.
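
The registry and heartbeat items above could be sketched roughly as follows. Everything here is an assumption made for illustration: SubSpec, Closable, OpenFn, SubscriptionRegistry and onHeartbeat are hypothetical names, not nostr-tools or signet APIs, and the 2x time-jump threshold is arbitrary. The point is only the shape: subscription specs are stored separately from live handles, so they can be re-issued on any (re)connect or after a detected sleep/wake.

```typescript
// Hypothetical shapes for illustration; not nostr-tools or nsecbunkerd types.
interface SubSpec {
  id: string;
  relays: string[];
  filter: { kinds: number[]; "#p"?: string[] };
}

interface Closable {
  close(): void;
}

// Whatever actually issues the REQ (for example a thin wrapper over the pool).
type OpenFn = (spec: SubSpec) => Closable;

class SubscriptionRegistry {
  private specs = new Map<string, SubSpec>();
  private live = new Map<string, Closable>();
  private open: OpenFn;

  constructor(open: OpenFn) {
    this.open = open;
  }

  // Remember the spec so it can be re-issued later, then open it now.
  add(spec: SubSpec): void {
    this.specs.set(spec.id, spec);
    this.reopen(spec);
  }

  remove(id: string): void {
    this.specs.delete(id);
    this.closeLive(id);
  }

  // Call on every (re)connect or pool reset: re-issue every registered REQ.
  recreateAll(): number {
    this.specs.forEach((spec) => this.reopen(spec));
    return this.specs.size;
  }

  private reopen(spec: SubSpec): void {
    this.closeLive(spec.id); // drop the stale handle, if any
    this.live.set(spec.id, this.open(spec));
  }

  private closeLive(id: string): void {
    const handle = this.live.get(id);
    if (handle) handle.close();
    this.live.delete(id);
  }
}

// Heartbeat tick. A tick arriving much later than scheduled suggests the
// process was suspended (sleep/wake), so every subscription is recreated.
// The 2x threshold is an arbitrary choice for this sketch.
function onHeartbeat(
  state: { lastTickMs: number },
  nowMs: number,
  intervalMs: number,
  registry: SubscriptionRegistry,
): boolean {
  const jumped = nowMs - state.lastTickMs > 2 * intervalMs;
  state.lastTickMs = nowMs;
  if (jumped) registry.recreateAll();
  return jumped;
}
```

In the daemon, recreateAll() would be called from the pool's reconnect path, and onHeartbeat from a timer. Passing nowMs in explicitly keeps the time-jump logic testable without real sleeps.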

Acceptance

  • After a forced relay flap mid-session (drop the relay, bring it back), the bunker resumes serving signing requests with no manual restart.
  • Regression test: flap the relay during an active session and assert a subsequent sign_event is still answered. (This test was missing from every prior round; it is the durable artifact.)
  • Existing ACL / token-lifecycle tests unchanged and green.
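
One possible shape for the regression test, reduced to an in-memory model. This is a sketch under stated assumptions: FakeRelay, startBunker and signAfterFlap are hypothetical, and a real test would flap an actual relay process and send a real sign_event request over kind:24133 rather than passing strings around.

```typescript
// In-memory stand-ins for illustration only; not the daemon's real classes.
type Handler = (request: string) => string;

class FakeRelay {
  private subs = new Map<string, Handler>();
  private reconnectHooks: Array<() => void> = [];

  subscribe(id: string, handler: Handler): void {
    this.subs.set(id, handler);
  }

  onReconnect(hook: () => void): void {
    this.reconnectHooks.push(hook);
  }

  // Relay restart: subscription state is lost, then clients reconnect.
  flap(): void {
    this.subs.clear();
    this.reconnectHooks.forEach((hook) => hook());
  }

  // Deliver a request to every live subscription and collect the replies.
  publish(request: string): string[] {
    const replies: string[] = [];
    this.subs.forEach((handler) => replies.push(handler(request)));
    return replies;
  }
}

// Hypothetical bunker start-up: subscribe once, optionally again on reconnect.
function startBunker(relay: FakeRelay, resubscribeOnReconnect: boolean): void {
  const subscribe = () => relay.subscribe("bunker", (req) => `signed:${req}`);
  subscribe();
  if (resubscribeOnReconnect) relay.onReconnect(subscribe);
}

// The regression scenario: start a session, flap the relay, then ask to sign.
function signAfterFlap(resubscribeOnReconnect: boolean): string[] {
  const relay = new FakeRelay();
  startBunker(relay, resubscribeOnReconnect);
  relay.flap();
  return relay.publish("sign_event");
}
```

The assertion that matters is that signAfterFlap(true) returns a reply while signAfterFlap(false) returns none; the second case models the current behavior reported in #41.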

Relationship to #41

#41 stays as the bug report / incident record (two demo outages, 2026-06-23 + 2026-06-26, recovered only by manual systemctl restart nsecbunkerd). This issue is the chosen root fix. The alternative — a targeted NDK patch that manually re-executes subs on every relay:connect — was considered and rejected as another patch on an architecture the rest of the ecosystem has left behind.

refs: #41, #21, #20, #9; fiatjaf/nostr-tools@455124e; lightning.pub nostrRelayConnection.ts, signet relay-pool.ts, FROSTR igloo; NDK core/src/relay/{connectivity,subscription}.ts (no reconnect-replay).

Reference
aiolabs/nsecbunkerd#42