From 5574d3bed3c113efde815819014ab4636964c9a4 Mon Sep 17 00:00:00 2001
From: Rob
Date: Fri, 8 May 2026 14:03:00 -0400
Subject: [PATCH] =?UTF-8?q?flows:=20add=20lxmf-outbound-retry=20=E2=80=94?=
 =?UTF-8?q?=20process=5Foutbound=20retry=20loop=20+=20state=20machine?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents the outbound retry layer that wraps the existing per-method
send-* flows. Pinned to LXMF 0.9.7 / RNS 1.2.4 with literal-quoted
upstream source for every claim:

- 4-second tick cadence (PROCESSING_INTERVAL × JOB_OUTBOUND_INTERVAL)
- All seven retry constants (MAX_DELIVERY_ATTEMPTS, DELIVERY_RETRY_WAIT,
  PATH_REQUEST_WAIT, MAX_PATHLESS_TRIES, MESSAGE_EXPIRY,
  LINK_MAX_INACTIVITY, P_LINK_MAX_INACTIVITY) at LXMRouter.py:30-38
- Eight-state machine (GENERATING/OUTBOUND/SENDING/SENT/DELIVERED/
  REJECTED/CANCELLED/FAILED) at LXMessage.py:13-22
- The four terminal-state branches at top of process_outbound (lines
  2517-2558) and the three per-method retry branches (OPPORTUNISTIC
  2566-2592, DIRECT 2596-2673, PROPAGATED 2677-2730)
- fail_message semantics at LXMRouter.py:2395-2402

Includes a "what does NOT happen" section calling out common
misconceptions: no automatic DIRECT→PROPAGATED fallback, no exponential
backoff, no in-router persistence of pending_outbound, MESSAGE_EXPIRY
governs the propagation-node store not per-sender retries, SENT is the
terminal success state for PROPAGATED (not DELIVERED).

No verifier needed per agent.md §1 — all claims are direct upstream
source citations.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 flows/README.md              |   1 +
 flows/lxmf-outbound-retry.md | 208 +++++++++++++++++++++++++++++++++++
 todo.md                      |   2 +
 3 files changed, 211 insertions(+)
 create mode 100644 flows/lxmf-outbound-retry.md

diff --git a/flows/README.md b/flows/README.md
index 53c7c77..0346ff1 100644
--- a/flows/README.md
+++ b/flows/README.md
@@ -20,6 +20,7 @@ The two views are complementary: SPEC.md tells you what each piece looks like; t
 | [`forward-announce.md`](forward-announce.md) (transport-node rebroadcast logic, announce_cap, queue) | ✅ |
 | [`send-propagated-lxmf.md`](send-propagated-lxmf.md) (PROPAGATED method, via a propagation node) | ✅ |
 | [`receive-propagated-lxmf.md`](receive-propagated-lxmf.md) (recipient pulling messages via `/get`) | ✅ |
+| [`lxmf-outbound-retry.md`](lxmf-outbound-retry.md) (process_outbound retry loop, per-message state machine, fail_message) | ✅ |
 
 ## Conventions
 
diff --git a/flows/lxmf-outbound-retry.md b/flows/lxmf-outbound-retry.md
new file mode 100644
index 0000000..8e2ae4a
--- /dev/null
+++ b/flows/lxmf-outbound-retry.md
@@ -0,0 +1,208 @@
+# Flow: LXMF outbound retry loop and per-message state machine
+
+What `LXMRouter.process_outbound` actually does on each tick — the layer that wraps [`send-opportunistic-lxmf.md`](send-opportunistic-lxmf.md), [`send-link-lxmf.md`](send-link-lxmf.md), and [`send-propagated-lxmf.md`](send-propagated-lxmf.md) and decides when each happy-path operation runs, retries, gives up, or falls through.
+
+The three send-* flows describe what happens for *one* attempt of each method. This doc describes how attempts are scheduled, how the per-message state advances, and when a message moves from retry-eligible to terminally `FAILED`. It is the missing piece for any client that wants delivery semantics matching upstream Sideband.
+
+Pinned against **RNS 1.2.4 / LXMF 0.9.7**. Line numbers below are from those versions.
+
+---
+
+## Cadence: how often process_outbound runs
+
+`LXMRouter.jobloop` (`LXMF/LXMRouter.py:889-899`) is a daemon thread that wakes every `PROCESSING_INTERVAL` seconds and calls `LXMRouter.jobs`, which dispatches to `process_outbound` whenever its tick counter is divisible by `JOB_OUTBOUND_INTERVAL`:
+
+| Constant | Value | File:line |
+|---|---|---|
+| `PROCESSING_INTERVAL` | `4` (seconds) | `LXMF/LXMRouter.py:31` |
+| `JOB_OUTBOUND_INTERVAL` | `1` | `LXMF/LXMRouter.py:852` |
+
+So the **effective outbound tick is every 4 seconds.** Any per-message timer (path-request defer, retry backoff, link-establish timeout) is sampled at this granularity — a 10-second backoff isn't actually 10 seconds, it's "first tick at or after `now + 10s`."
+
+`handle_outbound` also kicks `process_outbound` directly on a fresh thread when a new message is queued (`LXMF/LXMRouter.py:1691`), so the first attempt doesn't wait for the next jobloop tick.
+
+---
+
+## Constants that drive retry behavior
+
+All are class constants on `LXMRouter` (`LXMF/LXMRouter.py:30-38`):
+
+| Constant | Value | Meaning |
+|---|---|---|
+| `MAX_DELIVERY_ATTEMPTS` | `5` | Per-message attempt cap. Crossing this triggers `fail_message`. |
+| `DELIVERY_RETRY_WAIT` | `10` (seconds) | Wait between attempts when path is known but the prior attempt didn't yield delivery proof. |
+| `PATH_REQUEST_WAIT` | `7` (seconds) | Wait after issuing a `path?` request before the next attempt. |
+| `MAX_PATHLESS_TRIES` | `1` | OPPORTUNISTIC only — number of attempts before forcing a path request. |
+| `MESSAGE_EXPIRY` | `30*24*60*60` (30 days) | Used by propagation-node store cleanup, not the per-message retry path. |
+| `LINK_MAX_INACTIVITY` | `10*60` | Direct-link idle teardown threshold (`clean_links`). |
+| `P_LINK_MAX_INACTIVITY` | `3*60` | Propagation-link idle teardown threshold. |
+
+A full single-message retry budget for DIRECT or PROPAGATED is therefore **5 attempts × 10 seconds ≈ 50 seconds of wall-clock** before `fail_message` runs, plus whatever each attempt itself spends inside the link-establishment / proof-wait window.
+
+---
+
+## Per-message state machine
+
+States from `LXMF/LXMessage.py:13-22`:
+
+| State | Value | When |
+|---|---|---|
+| `GENERATING` | `0x00` | Stamp generation in progress (deferred-stamp messages only) |
+| `OUTBOUND` | `0x01` | Queued in `pending_outbound`; not currently transmitting |
+| `SENDING` | `0x02` | A send is in flight on the wire (packet sent / Resource transferring) |
+| `SENT` | `0x04` | Wire send completed, but no end-to-end PROOF yet — also the **terminal** state for PROPAGATED (delivery to the recipient is the propagation node's job) |
+| `DELIVERED` | `0x08` | End-to-end PROOF received from the final recipient — only reachable for OPPORTUNISTIC and DIRECT |
+| `REJECTED` | `0xFD` | Receiver explicitly rejected (e.g. stamp validation failed on a propagation node) |
+| `CANCELLED` | `0xFE` | Sender called `LXMessage.cancel` while still queued |
+| `FAILED` | `0xFF` | `MAX_DELIVERY_ATTEMPTS` exhausted, or unrecoverable error |
+
+The valid-method enum is `LXMessage.OPPORTUNISTIC = 0x01`, `DIRECT = 0x02`, `PROPAGATED = 0x03`, `PAPER = 0x05` (`LXMF/LXMessage.py:29-32`).
+
+---
+
+## Per-tick decision tree
+
+`process_outbound` (`LXMF/LXMRouter.py:2513`) holds `outbound_processing_lock` across the whole tick (lines 2514-2515) and walks `pending_outbound` once. For each message, the top-of-loop branches on terminal state first:
+
+| Branch | File:line | Effect |
+|---|---|---|
+| `state == DELIVERED` | 2517-2542 | Remove from queue. If method was DIRECT, perform backchannel-identify on the link so the recipient can reply over the same link. |
+| `method == PROPAGATED and state == SENT` | 2544-2546 | Remove from queue (PROPAGATED's terminal success state is SENT, not DELIVERED — see state table). |
+| `state == CANCELLED` | 2548-2552 | Remove and fire `failed_callback`. |
+| `state == REJECTED` | 2554-2558 | Remove and fire `failed_callback`. |
+| Else (`OUTBOUND` or `SENDING`) | 2560+ | Per-method retry/send branch — see below. |
+
+The non-terminal branch in turn switches on `lxmessage.method`:
+
+### OPPORTUNISTIC branch (`LXMF/LXMRouter.py:2566-2592`)
+
+```python
+if lxmessage.method == LXMessage.OPPORTUNISTIC:
+    if lxmessage.delivery_attempts <= LXMRouter.MAX_DELIVERY_ATTEMPTS:
+        if lxmessage.delivery_attempts >= LXMRouter.MAX_PATHLESS_TRIES \
+                and not RNS.Transport.has_path(lxmessage.get_destination().hash):
+            # Force a path request, defer PATH_REQUEST_WAIT seconds
+            ...
+            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
+        elif lxmessage.delivery_attempts == LXMRouter.MAX_PATHLESS_TRIES + 1 \
+                and RNS.Transport.has_path(...):
+            # Path is known but prior attempt failed — drop_path + re-discover
+            RNS.Reticulum.get_instance().drop_path(...)
+            ...
+            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
+        else:
+            if not hasattr(lxmessage, "next_delivery_attempt") \
+                    or time.time() > lxmessage.next_delivery_attempt:
+                lxmessage.delivery_attempts += 1
+                lxmessage.next_delivery_attempt = time.time() + LXMRouter.DELIVERY_RETRY_WAIT
+                lxmessage.send()
+    else:
+        self.fail_message(lxmessage)
+```
+
+Key behaviors:
+- **First attempt is "pathless-tolerant":** if `delivery_attempts < MAX_PATHLESS_TRIES (=1)` and there's no path, the message still tries a send (relying on `handle_outbound`'s pre-emptive `path?` at `LXMF/LXMRouter.py:1675-1679`).
+- **After the pathless tries are exhausted,** an explicit `path?` is fired and the message defers `PATH_REQUEST_WAIT (=7s)`.
+- **The `MAX_PATHLESS_TRIES + 1` case** is the "I have a stale path that didn't deliver" recovery: `Reticulum.drop_path` evicts the bad path table entry, then a fresh `path?` is requested.
+- **The `else` branch is the actual retransmit:** increment attempts, schedule `+ DELIVERY_RETRY_WAIT (=10s)`, fire `lxmessage.send()`.
+- **`fail_message` runs only after `delivery_attempts > MAX_DELIVERY_ATTEMPTS`** — i.e. attempts 1..5 are tried, attempt 6 trips `fail_message`.
+
+### DIRECT branch (`LXMF/LXMRouter.py:2596-2673`)
+
+Two sub-paths, decided by whether a usable link already exists in `direct_links` or `backchannel_links`:
+
+**Existing link, `status == ACTIVE` (lines 2616-2627):**
+- If `state != SENDING`, set the link as the delivery destination and call `lxmessage.send()`.
+- If `state == SENDING`, just log progress — the prior send is still pending its proof.
+
+**Existing link, `status == CLOSED` (lines 2628-2647):**
+- If the link was previously activated (`activated_at != None`), the link died unexpectedly — issue a fresh `path?` and schedule `PATH_REQUEST_WAIT`.
+- Else (link was never activated — LRPROOF never arrived on the prior attempt), retry the path request once via `path_request_retried`, then schedule `PATH_REQUEST_WAIT`.
+- Either way, **drop the dead link from `direct_links` / `backchannel_links`** and schedule the next attempt at `+ DELIVERY_RETRY_WAIT`.
+
+**No link exists (lines 2651-2670):**
+```python
+if not hasattr(lxmessage, "next_delivery_attempt") \
+        or time.time() > lxmessage.next_delivery_attempt:
+    lxmessage.delivery_attempts += 1
+    lxmessage.next_delivery_attempt = time.time() + LXMRouter.DELIVERY_RETRY_WAIT
+
+    if lxmessage.delivery_attempts < LXMRouter.MAX_DELIVERY_ATTEMPTS:
+        if RNS.Transport.has_path(lxmessage.get_destination().hash):
+            delivery_link = RNS.Link(lxmessage.get_destination())
+            delivery_link.set_link_established_callback(self.process_outbound)
+            self.direct_links[delivery_destination_hash] = delivery_link
+        else:
+            RNS.Transport.request_path(lxmessage.get_destination().hash)
+            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
+```
+
+The `set_link_established_callback(self.process_outbound)` re-entry is what lets the next tick after a successful LRPROOF immediately enter the "existing ACTIVE link" branch and fire `send()` — see [`send-link-lxmf.md`](send-link-lxmf.md) §2 for why this works.
+
+`fail_message` runs at lines 2671-2673 once `delivery_attempts > MAX_DELIVERY_ATTEMPTS`.
+
+### PROPAGATED branch (`LXMF/LXMRouter.py:2677-2730`)
+
+Structurally mirrors DIRECT but against `outbound_propagation_link` / `outbound_propagation_node` instead of per-recipient direct links. Two early failures:
+
+- `outbound_propagation_node == None` → immediate `fail_message` (lines 2680-2682). LXMF will not attempt PROPAGATED without an explicitly configured node — there is **no automatic fallback** from DIRECT/OPPORTUNISTIC to PROPAGATED. Sideband configures one via `LXMRouter.set_outbound_propagation_node` at startup; a clean-room client must do the same before the user picks PROPAGATED.
+- All `MAX_DELIVERY_ATTEMPTS` exhausted → `fail_message` (lines 2728-2730).
+
+Otherwise the link-state branching is identical to DIRECT: ACTIVE → send / CLOSED → drop and retry / no-link → establish-or-path-request.
+
+---
+
+## The terminal transition: `fail_message`
+
+`LXMF/LXMRouter.py:2395-2402`:
+
+```python
+def fail_message(self, lxmessage):
+    RNS.log(str(lxmessage)+" failed to send", RNS.LOG_DEBUG)
+
+    lxmessage.progress = 0.0
+    if lxmessage in self.pending_outbound: self.pending_outbound.remove(lxmessage)
+    if lxmessage.state != LXMessage.REJECTED: lxmessage.state = LXMessage.FAILED
+    if lxmessage.failed_callback != None and callable(lxmessage.failed_callback):
+        lxmessage.failed_callback(lxmessage)
+```
+
+A few non-obvious properties:
+
+- `REJECTED` is preserved when present (the receiver explicitly rejected — don't overwrite the reason).
+- The message **is removed from `pending_outbound` synchronously**; the `failed_callback` fires on the same thread as `process_outbound`. Callbacks must not block.
+- There is **no automatic re-queue or method change** on FAIL. A failed DIRECT message does not get re-tried as PROPAGATED. Apps that want that fallback have to implement it themselves on top of the `failed_callback`.
+
+---
+
+## What does NOT happen
+
+These are common assumptions that don't match upstream behavior. Listed here so reimplementers don't trust their intuition:
+
+- **No automatic DIRECT→PROPAGATED fallback** — see PROPAGATED branch above. The user (or app) chose `desired_method` at message construction time; LXMF never overrides it on failure.
+- **No exponential backoff** — `DELIVERY_RETRY_WAIT = 10s` is constant across attempts 1..5.
+- **No persistence of `pending_outbound` to disk by default** — pending outbound messages live in process memory. An LXMRouter restart drops them. (Sideband persists messages at the *app* level, not via LXMRouter.)
+- **`MESSAGE_EXPIRY` is not a per-message send timeout.** It governs the propagation-node *store* (how long the node retains a message for offline pickup); it does not bound how long a single sender will keep retrying. The retry loop bounds itself via `MAX_DELIVERY_ATTEMPTS`, which at ~10s per attempt is ~50 seconds, not 30 days.
+- **`SENT` is not `DELIVERED`.** PROPAGATED reaches `SENT` after the propagation node accepts the message; the recipient may pick it up minutes, hours, or days later. There is no end-to-end delivery proof for PROPAGATED messages until the recipient comes online and emits it (see [`send-propagated-lxmf.md`](send-propagated-lxmf.md) §6).
+- **Path-request preamble is OPPORTUNISTIC-only at submit time.** `handle_outbound` only fires the pre-emptive `path?` when `lxmessage.method == OPPORTUNISTIC` (`LXMF/LXMRouter.py:1675`). DIRECT and PROPAGATED rely on `process_outbound`'s no-link branch to discover the path on the first tick.
+
+---
+
+## Source map
+
+| Concern | File | Function / line |
+|---|---|---|
+| Class constants | `LXMF/LXMRouter.py` | 30-83 |
+| Job interval table | `LXMF/LXMRouter.py` | 852-859 |
+| `jobs` dispatcher | `LXMF/LXMRouter.py` | 860-887 |
+| `jobloop` daemon | `LXMF/LXMRouter.py` | 889-899 |
+| Pre-emptive path request on submit | `LXMF/LXMRouter.py` | 1675-1679 |
+| `handle_outbound` thread kick | `LXMF/LXMRouter.py` | 1691 |
+| `process_outbound` entry + lock | `LXMF/LXMRouter.py` | 2513-2515 |
+| Terminal-state branches (DELIVERED / SENT-PROPAGATED / CANCELLED / REJECTED) | `LXMF/LXMRouter.py` | 2517-2558 |
+| OPPORTUNISTIC retry branch | `LXMF/LXMRouter.py` | 2566-2592 |
+| DIRECT retry branch | `LXMF/LXMRouter.py` | 2596-2673 |
+| PROPAGATED retry branch | `LXMF/LXMRouter.py` | 2677-2730 |
+| `fail_message` | `LXMF/LXMRouter.py` | 2395-2402 |
+| Message states | `LXMF/LXMessage.py` | 13-22 |
+| Delivery methods | `LXMF/LXMessage.py` | 29-33 |
diff --git a/todo.md b/todo.md
index 177c60f..2ba26a9 100644
--- a/todo.md
+++ b/todo.md
@@ -448,6 +448,8 @@ strand when GitHub / PyPI stop being authoritative:
   frontmatter (per `agent.md` §7). Done — `RNS 1.2.0 / LXMF 0.9.6` is now in
   the document header.
 
+- [x] **`flows/lxmf-outbound-retry.md`** — outbound retry loop and per-message state machine (`MAX_DELIVERY_ATTEMPTS`, `DELIVERY_RETRY_WAIT`, `PATH_REQUEST_WAIT`, `MAX_PATHLESS_TRIES`, the OPPORTUNISTIC / DIRECT / PROPAGATED retry decision trees, `fail_message`). Source-cited against LXMF 0.9.7. Fills the gap between the per-method send-* flows (each describes one attempt) and the actual delivery semantics (5 attempts, ~50s budget, no automatic method fallback, `SENT` ≠ `DELIVERED` for PROPAGATED). No verifier needed — direct upstream source citations per `agent.md` §1.
+
 - [x] **`tools/verify_stamps.py`** runtime-locks §5.7. Done. Verifies
   workblock determinism (confirms exactly 768 KiB at 3000 rounds), PoW
   search-and-validate at target_cost=4 (fast),