flows: add lxmf-outbound-retry — process_outbound retry loop + state machine
Documents the outbound retry layer that wraps the existing per-method send-* flows. Pinned to LXMF 0.9.7 / RNS 1.2.4 with literal-quoted upstream source for every claim:

- 4-second tick cadence (PROCESSING_INTERVAL × JOB_OUTBOUND_INTERVAL)
- All seven retry constants (MAX_DELIVERY_ATTEMPTS, DELIVERY_RETRY_WAIT, PATH_REQUEST_WAIT, MAX_PATHLESS_TRIES, MESSAGE_EXPIRY, LINK_MAX_INACTIVITY, P_LINK_MAX_INACTIVITY) at LXMRouter.py:30-38
- Eight-state machine (GENERATING/OUTBOUND/SENDING/SENT/DELIVERED/REJECTED/CANCELLED/FAILED) at LXMessage.py:13-22
- The four terminal-state branches at top of process_outbound (lines 2517-2558) and the three per-method retry branches (OPPORTUNISTIC 2566-2592, DIRECT 2596-2673, PROPAGATED 2677-2730)
- fail_message semantics at LXMRouter.py:2395-2402

Includes a "what does NOT happen" section calling out common misconceptions: no automatic DIRECT→PROPAGATED fallback, no exponential backoff, no in-router persistence of pending_outbound, MESSAGE_EXPIRY governs the propagation-node store not per-sender retries, SENT is the terminal success state for PROPAGATED (not DELIVERED).

No verifier needed per agent.md §1: all claims are direct upstream source citations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parent: 784343a33f
Commit: 5574d3bed3
3 changed files with 211 additions and 0 deletions
@ -20,6 +20,7 @@ The two views are complementary: SPEC.md tells you what each piece looks like; t
| [`forward-announce.md`](forward-announce.md) (transport-node rebroadcast logic, announce_cap, queue) | ✅ |
| [`send-propagated-lxmf.md`](send-propagated-lxmf.md) (PROPAGATED method, via a propagation node) | ✅ |
| [`receive-propagated-lxmf.md`](receive-propagated-lxmf.md) (recipient pulling messages via `/get`) | ✅ |
| [`lxmf-outbound-retry.md`](lxmf-outbound-retry.md) (process_outbound retry loop, per-message state machine, fail_message) | ✅ |

## Conventions

208
flows/lxmf-outbound-retry.md
Normal file
@ -0,0 +1,208 @@
# Flow: LXMF outbound retry loop and per-message state machine

What `LXMRouter.process_outbound` actually does on each tick: the layer that wraps [`send-opportunistic-lxmf.md`](send-opportunistic-lxmf.md), [`send-link-lxmf.md`](send-link-lxmf.md), and [`send-propagated-lxmf.md`](send-propagated-lxmf.md) and decides when each happy-path operation runs, retries, gives up, or falls through.

The three send-* flows describe what happens for *one* attempt of each method. This doc describes how attempts are scheduled, how the per-message state advances, and when a message moves from retry-eligible to terminally `FAILED`. It is the missing piece for any client that wants delivery semantics matching upstream Sideband.

Pinned against **RNS 1.2.4 / LXMF 0.9.7**. Line numbers below are from those versions.

---

## Cadence: how often process_outbound runs

`LXMRouter.jobloop` (`LXMF/LXMRouter.py:889-899`) is a daemon thread that wakes every `PROCESSING_INTERVAL` seconds and calls `LXMRouter.jobs`, which dispatches to `process_outbound` whenever its tick counter is divisible by `JOB_OUTBOUND_INTERVAL`:

| Constant | Value | File:line |
|---|---|---|
| `PROCESSING_INTERVAL` | `4` (seconds) | `LXMF/LXMRouter.py:31` |
| `JOB_OUTBOUND_INTERVAL` | `1` | `LXMF/LXMRouter.py:852` |

So the **effective outbound tick is every 4 seconds.** Any per-message timer (path-request defer, retry backoff, link-establish timeout) is sampled at this granularity: a 10-second backoff isn't actually 10 seconds, it's "first tick at or after `now + 10s`."

`handle_outbound` also kicks `process_outbound` directly on a fresh thread when a new message is queued (`LXMF/LXMRouter.py:1691`), so the first attempt doesn't wait for the next jobloop tick.
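
The tick quantization described above can be made concrete with a small sketch. It assumes an idealized jobloop that ticks exactly every 4 seconds from a known start time and ignores the out-of-band `process_outbound` calls; `first_tick_after` and `loop_start` are illustrative names, not part of LXMF. The comparison is modeled as strictly "after", mirroring the `time.time() > next_delivery_attempt` checks quoted later in this document.

```python
import math

PROCESSING_INTERVAL = 4      # seconds, from the table above
JOB_OUTBOUND_INTERVAL = 1    # process_outbound runs on every jobloop tick


def first_tick_after(deadline: float, loop_start: float = 0.0) -> float:
    """Return the time of the first idealized outbound tick strictly after `deadline`.

    Hypothetical helper: models ticks at loop_start + k * tick for k = 1, 2, ...
    """
    tick = PROCESSING_INTERVAL * JOB_OUTBOUND_INTERVAL
    k = math.floor((deadline - loop_start) / tick) + 1
    return loop_start + max(k, 1) * tick


# A retry scheduled at t=4 with a 10 s wait has its deadline at t=14,
# but the first tick that can act on it is at t=16.
print(first_tick_after(4 + 10))
```

Under these assumptions a nominal 10-second wait stretches to as much as one extra tick, which is why timers in this layer are best read as lower bounds.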

---

## Constants that drive retry behavior

All are defined on `LXMRouter`, in one block of the module (`LXMF/LXMRouter.py:30-38`):

| Constant | Value | Meaning |
|---|---|---|
| `MAX_DELIVERY_ATTEMPTS` | `5` | Per-message attempt cap. Crossing this triggers `fail_message`. |
| `DELIVERY_RETRY_WAIT` | `10` (seconds) | Wait between attempts when path is known but the prior attempt didn't yield delivery proof. |
| `PATH_REQUEST_WAIT` | `7` (seconds) | Wait after issuing a `path?` request before the next attempt. |
| `MAX_PATHLESS_TRIES` | `1` | OPPORTUNISTIC only: number of attempts before forcing a path request. |
| `MESSAGE_EXPIRY` | `30*24*60*60` (30 days) | Used by propagation-node store cleanup, not the per-message retry path. |
| `LINK_MAX_INACTIVITY` | `10*60` (10 minutes) | Direct-link idle teardown threshold (`clean_links`). |
| `P_LINK_MAX_INACTIVITY` | `3*60` (3 minutes) | Propagation-link idle teardown threshold. |

A full single-message retry budget for DIRECT or PROPAGATED is therefore **5 attempts × 10 seconds ≈ 50 seconds of wall-clock** before `fail_message` runs, plus whatever each attempt itself spends inside the link-establishment / proof-wait window.
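
As a back-of-the-envelope check of that budget, the sketch below multiplies out the constants and compares the result with `MESSAGE_EXPIRY`. The values are copied from the table; the names `retry_budget_s` and `expiry_ratio` are illustrative, and time spent inside each attempt is deliberately ignored.

```python
MAX_DELIVERY_ATTEMPTS = 5
DELIVERY_RETRY_WAIT = 10            # seconds
MESSAGE_EXPIRY = 30 * 24 * 60 * 60  # seconds (30 days)

# Lower bound on the retry window: waits between attempts only.
retry_budget_s = MAX_DELIVERY_ATTEMPTS * DELIVERY_RETRY_WAIT

# How many such retry windows fit into the propagation-node expiry window.
expiry_ratio = MESSAGE_EXPIRY // retry_budget_s

print(retry_budget_s)  # 50
print(expiry_ratio)    # 51840
```

The several orders of magnitude between the two numbers is a quick way to see that `MESSAGE_EXPIRY` cannot be the bound on sender-side retries.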

---

## Per-message state machine

States from `LXMF/LXMessage.py:13-22`:

| State | Value | When |
|---|---|---|
| `GENERATING` | `0x00` | Stamp generation in progress (deferred-stamp messages only) |
| `OUTBOUND` | `0x01` | Queued in `pending_outbound`; not currently transmitting |
| `SENDING` | `0x02` | A send is in flight on the wire (packet sent / Resource transferring) |
| `SENT` | `0x04` | Wire send completed, but no end-to-end PROOF yet. Also the **terminal** state for PROPAGATED (delivery to the recipient is the propagation node's job) |
| `DELIVERED` | `0x08` | End-to-end PROOF received from the final recipient; only reachable for OPPORTUNISTIC and DIRECT |
| `REJECTED` | `0xFD` | Receiver explicitly rejected (e.g. stamp validation failed on a propagation node) |
| `CANCELLED` | `0xFE` | Sender called `LXMessage.cancel` while still queued |
| `FAILED` | `0xFF` | `MAX_DELIVERY_ATTEMPTS` exhausted, or unrecoverable error |

The valid-method enum is `LXMessage.OPPORTUNISTIC = 0x01`, `DIRECT = 0x02`, `PROPAGATED = 0x03`, `PAPER = 0x05` (`LXMF/LXMessage.py:29-32`).

---

## Per-tick decision tree

`process_outbound` (`LXMF/LXMRouter.py:2513`) holds `outbound_processing_lock` across the whole tick (lines 2514-2515) and walks `pending_outbound` once. For each message, the top of the loop branches on terminal state first:

| Branch | File:line | Effect |
|---|---|---|
| `state == DELIVERED` | 2517-2542 | Remove from queue. If method was DIRECT, perform backchannel-identify on the link so the recipient can reply over the same link. |
| `method == PROPAGATED and state == SENT` | 2544-2546 | Remove from queue (PROPAGATED's terminal success state is SENT, not DELIVERED; see state table). |
| `state == CANCELLED` | 2548-2552 | Remove and fire `failed_callback`. |
| `state == REJECTED` | 2554-2558 | Remove and fire `failed_callback`. |
| Else (`OUTBOUND` or `SENDING`) | 2560+ | Per-method retry/send branch, described below. |
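
The branch table can be restated as a small pure function. This is a minimal sketch for reimplementers, not upstream code: `classify` and its return labels are hypothetical, the numeric values are the ones listed in the tables above, and the DIRECT backchannel-identify side effect is not modeled.

```python
# State and method values as listed in the tables above.
GENERATING, OUTBOUND, SENDING, SENT, DELIVERED = 0x00, 0x01, 0x02, 0x04, 0x08
REJECTED, CANCELLED, FAILED = 0xFD, 0xFE, 0xFF
OPPORTUNISTIC, DIRECT, PROPAGATED, PAPER = 0x01, 0x02, 0x03, 0x05


def classify(method: int, state: int) -> str:
    """Hypothetical helper: name the top-of-loop branch a message would take."""
    if state == DELIVERED:
        return "remove"
    if method == PROPAGATED and state == SENT:
        return "remove"
    if state == CANCELLED:
        return "remove_and_fire_failed_callback"
    if state == REJECTED:
        return "remove_and_fire_failed_callback"
    # Everything else falls through to the per-method retry/send branch.
    return "per_method_branch"


print(classify(PROPAGATED, SENT))        # remove
print(classify(DIRECT, CANCELLED))       # remove_and_fire_failed_callback
print(classify(OPPORTUNISTIC, OUTBOUND)) # per_method_branch
```

Keeping the order of the checks identical to the table matters: the PROPAGATED/SENT test has to run before the fall-through, otherwise a successfully handed-off message would keep being retried.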

The non-terminal branch in turn switches on `lxmessage.method`:

### OPPORTUNISTIC branch (`LXMF/LXMRouter.py:2566-2592`)

```python
if lxmessage.method == LXMessage.OPPORTUNISTIC:
    if lxmessage.delivery_attempts <= LXMRouter.MAX_DELIVERY_ATTEMPTS:
        if lxmessage.delivery_attempts >= LXMRouter.MAX_PATHLESS_TRIES \
           and not RNS.Transport.has_path(lxmessage.get_destination().hash):
            # Force a path request, defer PATH_REQUEST_WAIT seconds
            ...
            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
        elif lxmessage.delivery_attempts == LXMRouter.MAX_PATHLESS_TRIES + 1 \
             and RNS.Transport.has_path(...):
            # Path is known but prior attempt failed: drop_path + re-discover
            RNS.Reticulum.get_instance().drop_path(...)
            ...
            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
        else:
            if not hasattr(lxmessage, "next_delivery_attempt") \
               or time.time() > lxmessage.next_delivery_attempt:
                lxmessage.delivery_attempts += 1
                lxmessage.next_delivery_attempt = time.time() + LXMRouter.DELIVERY_RETRY_WAIT
                lxmessage.send()
    else:
        self.fail_message(lxmessage)
```

Key behaviors:

- **First attempt is "pathless-tolerant":** if `delivery_attempts < MAX_PATHLESS_TRIES (=1)` and there's no path, the message still tries a send (relying on `handle_outbound`'s pre-emptive `path?` at `LXMF/LXMRouter.py:1675-1679`).
- **After the pathless tries are exhausted,** an explicit `path?` is fired and the message defers `PATH_REQUEST_WAIT (=7s)`.
- **The `MAX_PATHLESS_TRIES + 1` case** is the "I have a stale path that didn't deliver" recovery: `Reticulum.drop_path` evicts the bad path table entry, then a fresh `path?` is requested.
- **The `else` branch is the actual retransmit:** increment attempts, schedule `+ DELIVERY_RETRY_WAIT (=10s)`, fire `lxmessage.send()`.
- **`fail_message` runs only after `delivery_attempts > MAX_DELIVERY_ATTEMPTS`**, i.e. attempts 1..5 are tried, attempt 6 trips `fail_message`.
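
To check a reimplementation against these behaviors, the conditionals can be lifted into a pure function of the attempt counter and path availability. This is a sketch under stated assumptions: `opportunistic_branch` and its labels are hypothetical, only the conditions visible in the excerpt are modeled, and the elided bodies (`...`) and the `next_delivery_attempt` timer are left out.

```python
MAX_DELIVERY_ATTEMPTS = 5
MAX_PATHLESS_TRIES = 1


def opportunistic_branch(delivery_attempts: int, has_path: bool) -> str:
    """Hypothetical helper: which branch of the excerpt a message would enter."""
    if delivery_attempts > MAX_DELIVERY_ATTEMPTS:
        return "fail_message"
    if delivery_attempts >= MAX_PATHLESS_TRIES and not has_path:
        return "request_path"
    if delivery_attempts == MAX_PATHLESS_TRIES + 1 and has_path:
        return "drop_path_and_rediscover"
    # Retransmit branch; upstream additionally waits for next_delivery_attempt.
    return "send_when_due"


for attempts, has_path in [(0, False), (1, False), (1, True), (2, True), (6, True)]:
    print(attempts, has_path, opportunistic_branch(attempts, has_path))
```

The first row is the pathless-tolerant first attempt, the second is the forced path request, and the fourth is the stale-path recovery case.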

### DIRECT branch (`LXMF/LXMRouter.py:2596-2673`)

Three cases, decided by whether a link already exists in `direct_links` or `backchannel_links` and, if so, by its status:

**Existing link, `status == ACTIVE` (lines 2616-2627):**

- If `state != SENDING`, set the link as the delivery destination and call `lxmessage.send()`.
- If `state == SENDING`, just log progress; the prior send is still pending its proof.

**Existing link, `status == CLOSED` (lines 2628-2647):**

- If the link was previously activated (`activated_at != None`), the link died unexpectedly: issue a fresh `path?` and schedule `PATH_REQUEST_WAIT`.
- Else (link was never activated, i.e. the LRPROOF never arrived on the prior attempt), retry the path request once via `path_request_retried`, then schedule `PATH_REQUEST_WAIT`.
- Either way, **drop the dead link from `direct_links` / `backchannel_links`** and schedule the next attempt at `+ DELIVERY_RETRY_WAIT`.

**No link exists (lines 2651-2670):**

```python
if not hasattr(lxmessage, "next_delivery_attempt") \
   or time.time() > lxmessage.next_delivery_attempt:
    lxmessage.delivery_attempts += 1
    lxmessage.next_delivery_attempt = time.time() + LXMRouter.DELIVERY_RETRY_WAIT

    if lxmessage.delivery_attempts < LXMRouter.MAX_DELIVERY_ATTEMPTS:
        if RNS.Transport.has_path(lxmessage.get_destination().hash):
            delivery_link = RNS.Link(lxmessage.get_destination())
            delivery_link.set_link_established_callback(self.process_outbound)
            self.direct_links[delivery_destination_hash] = delivery_link
        else:
            RNS.Transport.request_path(lxmessage.get_destination().hash)
            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
```

The `set_link_established_callback(self.process_outbound)` re-entry is what lets the next tick after a successful LRPROOF immediately enter the "existing ACTIVE link" branch and fire `send()`; see [`send-link-lxmf.md`](send-link-lxmf.md) §2 for why this works.

`fail_message` runs at lines 2671-2673 once `delivery_attempts > MAX_DELIVERY_ATTEMPTS`.
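
The no-link excerpt has one subtlety that is easy to miss: the attempt counter is bumped and `DELIVERY_RETRY_WAIT` is scheduled first, and only then is the timer overwritten with `PATH_REQUEST_WAIT` when no path is known. The following sketch replays that ordering with plain data. `Msg`, `no_link_step`, and the returned labels are hypothetical, `None` stands in for the `hasattr` check, and no link or path request is actually created.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DELIVERY_ATTEMPTS = 5
DELIVERY_RETRY_WAIT = 10   # seconds
PATH_REQUEST_WAIT = 7      # seconds


@dataclass
class Msg:
    """Hypothetical stand-in for the two fields the excerpt touches."""
    delivery_attempts: int = 0
    next_delivery_attempt: Optional[float] = None


def no_link_step(msg: Msg, now: float, has_path: bool) -> str:
    """Replay one pass of the no-link excerpt and report what it would do."""
    if msg.next_delivery_attempt is not None and now <= msg.next_delivery_attempt:
        return "wait"
    msg.delivery_attempts += 1
    msg.next_delivery_attempt = now + DELIVERY_RETRY_WAIT
    if msg.delivery_attempts < MAX_DELIVERY_ATTEMPTS:
        if has_path:
            return "establish_link"
        # No path: the shorter path-request wait replaces the retry wait.
        msg.next_delivery_attempt = now + PATH_REQUEST_WAIT
        return "request_path"
    return "no_action_in_excerpt"


m = Msg()
print(no_link_step(m, 100.0, has_path=False), m.next_delivery_attempt)
print(no_link_step(m, 104.0, has_path=True))
print(no_link_step(m, 108.0, has_path=True), m.next_delivery_attempt)
```

Note that a pass which only issues a path request still consumes one of the attempts, since the increment happens before the path check.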

### PROPAGATED branch (`LXMF/LXMRouter.py:2677-2730`)

Structurally mirrors DIRECT but against `outbound_propagation_link` / `outbound_propagation_node` instead of per-recipient direct links. Two early failures:

- `outbound_propagation_node == None` → immediate `fail_message` (lines 2680-2682). LXMF will not attempt PROPAGATED without an explicitly configured node; there is **no automatic fallback** from DIRECT/OPPORTUNISTIC to PROPAGATED. Sideband configures one via `LXMRouter.set_outbound_propagation_node` at startup; a clean-room client must do the same before the user picks PROPAGATED.
- All `MAX_DELIVERY_ATTEMPTS` exhausted → `fail_message` (lines 2728-2730).

Otherwise the link-state branching is identical to DIRECT: ACTIVE → send / CLOSED → drop and retry / no-link → establish-or-path-request.

---

## The terminal transition: `fail_message`

`LXMF/LXMRouter.py:2395-2402`:

```python
def fail_message(self, lxmessage):
    RNS.log(str(lxmessage)+" failed to send", RNS.LOG_DEBUG)

    lxmessage.progress = 0.0
    if lxmessage in self.pending_outbound: self.pending_outbound.remove(lxmessage)
    if lxmessage.state != LXMessage.REJECTED: lxmessage.state = LXMessage.FAILED
    if lxmessage.failed_callback != None and callable(lxmessage.failed_callback):
        lxmessage.failed_callback(lxmessage)
```

A few non-obvious properties:

- `REJECTED` is preserved when present (the receiver explicitly rejected, so the reason is not overwritten).
- The message **is removed from `pending_outbound` synchronously**; the `failed_callback` fires on the same thread as `process_outbound`. Callbacks must not block.
- There is **no automatic re-queue or method change** on failure. A failed DIRECT message does not get re-tried as PROPAGATED. Apps that want that fallback have to implement it themselves on top of the `failed_callback`.
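
For apps that do want a DIRECT→PROPAGATED fallback, one possible shape is a small callback factory. This is a minimal sketch of the idea, not LXMF API: `make_fallback_callback`, `resubmit`, and `FakeMessage` are hypothetical, the message is assumed to expose a `method` attribute with the values from the state-machine section, and how the app rebuilds and re-queues the message is left entirely to `resubmit`.

```python
from typing import Callable

DIRECT, PROPAGATED = 0x02, 0x03   # method values from the state-machine section


def make_fallback_callback(resubmit: Callable[[object], None]) -> Callable[[object], None]:
    """Hypothetical app-level helper; not part of LXMF.

    Returns a callable intended for use as a message's failed_callback.
    When a DIRECT message fails it hands the message to `resubmit`, which
    the app must implement. Any other failure is left alone.
    """
    def on_failed(message) -> None:
        if getattr(message, "method", None) == DIRECT:
            resubmit(message)
    return on_failed


class FakeMessage:
    """Stand-in object so the sketch runs without LXMF installed."""
    def __init__(self, method: int):
        self.method = method


retried = []
callback = make_fallback_callback(retried.append)
callback(FakeMessage(DIRECT))       # handed to the fallback hook
callback(FakeMessage(PROPAGATED))   # ignored: no further fallback
print(len(retried))  # 1
```

Because the callback runs on the `process_outbound` thread, `resubmit` should do as little as possible there, for example appending to a queue that another thread drains.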

---

## What does NOT happen

These are common assumptions that don't match upstream behavior. Listed here so reimplementers don't rely on intuition:

- **No automatic DIRECT→PROPAGATED fallback**: see PROPAGATED branch above. The user (or app) chose `desired_method` at message construction time; LXMF never overrides it on failure.
- **No exponential backoff**: `DELIVERY_RETRY_WAIT = 10s` is constant across attempts 1..5.
- **No persistence of `pending_outbound` to disk by default**: pending outbound messages live in process memory. An `LXMRouter` restart drops them. (Sideband persists messages at the *app* level, not via LXMRouter.)
- **`MESSAGE_EXPIRY` is not a per-message send timeout.** It governs the propagation-node *store* (how long the node retains a message for offline pickup); it does not bound how long a single sender will keep retrying. The retry loop bounds itself via `MAX_DELIVERY_ATTEMPTS`, which at ~10s per attempt is ~50 seconds, not 30 days.
- **`SENT` is not `DELIVERED`.** PROPAGATED reaches `SENT` after the propagation node accepts the message; the recipient may pick it up minutes, hours, or days later. There is no end-to-end delivery proof for PROPAGATED messages until the recipient comes online and emits it (see [`send-propagated-lxmf.md`](send-propagated-lxmf.md) §6).
- **Path-request preamble is OPPORTUNISTIC-only at submit time.** `handle_outbound` only fires the pre-emptive `path?` when `lxmessage.method == OPPORTUNISTIC` (`LXMF/LXMRouter.py:1675`). DIRECT and PROPAGATED rely on `process_outbound`'s no-link branch to discover the path on the first tick.

---

## Source map

| Concern | File | Function / line |
|---|---|---|
| Class constants | `LXMF/LXMRouter.py` | 30-83 |
| Job interval table | `LXMF/LXMRouter.py` | 852-859 |
| `jobs` dispatcher | `LXMF/LXMRouter.py` | 860-887 |
| `jobloop` daemon | `LXMF/LXMRouter.py` | 889-899 |
| Pre-emptive path request on submit | `LXMF/LXMRouter.py` | 1675-1679 |
| `handle_outbound` thread kick | `LXMF/LXMRouter.py` | 1691 |
| `process_outbound` entry + lock | `LXMF/LXMRouter.py` | 2513-2515 |
| Terminal-state branches (DELIVERED / SENT-PROPAGATED / CANCELLED / REJECTED) | `LXMF/LXMRouter.py` | 2517-2558 |
| OPPORTUNISTIC retry branch | `LXMF/LXMRouter.py` | 2566-2592 |
| DIRECT retry branch | `LXMF/LXMRouter.py` | 2596-2673 |
| PROPAGATED retry branch | `LXMF/LXMRouter.py` | 2677-2730 |
| `fail_message` | `LXMF/LXMRouter.py` | 2395-2402 |
| Message states | `LXMF/LXMessage.py` | 13-22 |
| Delivery methods | `LXMF/LXMessage.py` | 29-33 |

2
todo.md
@ -448,6 +448,8 @@ strand when GitHub / PyPI stop being authoritative:
  frontmatter (per `agent.md` §7). Done: `RNS 1.2.0 / LXMF
  0.9.6` is now in the document header.

- [x] **`flows/lxmf-outbound-retry.md`**: outbound retry loop and per-message state machine (`MAX_DELIVERY_ATTEMPTS`, `DELIVERY_RETRY_WAIT`, `PATH_REQUEST_WAIT`, `MAX_PATHLESS_TRIES`, the OPPORTUNISTIC / DIRECT / PROPAGATED retry decision trees, `fail_message`). Source-cited against LXMF 0.9.7. Fills the gap between the per-method send-* flows (each describes one attempt) and the actual delivery semantics (5 attempts, ~50s budget, no automatic method fallback, `SENT` ≠ `DELIVERED` for PROPAGATED). No verifier needed: direct upstream source citations per `agent.md` §1.

- [x] **`tools/verify_stamps.py`** runtime-locks §5.7. Done.
  Verifies workblock determinism (confirms exactly 768 KiB at
  3000 rounds), PoW search-and-validate at target_cost=4 (fast),