Rob 5574d3bed3 flows: add lxmf-outbound-retry — process_outbound retry loop + state machine

Documents the outbound retry layer that wraps the existing per-method
send-* flows. Pinned to LXMF 0.9.7 / RNS 1.2.4 with literal-quoted
upstream source for every claim:

- 4-second tick cadence (PROCESSING_INTERVAL × JOB_OUTBOUND_INTERVAL)
- All seven retry constants (MAX_DELIVERY_ATTEMPTS, DELIVERY_RETRY_WAIT,
  PATH_REQUEST_WAIT, MAX_PATHLESS_TRIES, MESSAGE_EXPIRY,
  LINK_MAX_INACTIVITY, P_LINK_MAX_INACTIVITY) at LXMRouter.py:30-38
- Eight-state machine (GENERATING/OUTBOUND/SENDING/SENT/DELIVERED/
  REJECTED/CANCELLED/FAILED) at LXMessage.py:13-22
- The four terminal-state branches at top of process_outbound (lines
  2517-2558) and the three per-method retry branches (OPPORTUNISTIC
  2566-2592, DIRECT 2596-2673, PROPAGATED 2677-2730)
- fail_message semantics at LXMRouter.py:2395-2402

Includes a "what does NOT happen" section calling out common
misconceptions: no automatic DIRECT→PROPAGATED fallback, no
exponential backoff, no in-router persistence of pending_outbound,
MESSAGE_EXPIRY governs the propagation-node store not per-sender
retries, SENT is the terminal success state for PROPAGATED (not
DELIVERED).

No verifier needed per agent.md §1 — all claims are direct upstream
source citations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-09 10:09:02 -04:00


Flow: LXMF outbound retry loop and per-message state machine

What LXMRouter.process_outbound actually does on each tick — the layer that wraps send-opportunistic-lxmf.md, send-link-lxmf.md, and send-propagated-lxmf.md and decides when each happy-path operation runs, retries, gives up, or falls through.

The three send-* flows describe what happens for one attempt of each method. This doc describes how attempts are scheduled, how the per-message state advances, and when a message moves from retry-eligible to terminally FAILED. It is the missing piece for any client that wants delivery semantics matching upstream Sideband.

Pinned against RNS 1.2.4 / LXMF 0.9.7. Line numbers below are from those versions.

Cadence: how often process_outbound runs

LXMRouter.jobloop (LXMF/LXMRouter.py:889-899) is a daemon thread that wakes every PROCESSING_INTERVAL seconds and calls LXMRouter.jobs, which dispatches to process_outbound whenever its tick counter is divisible by JOB_OUTBOUND_INTERVAL:

Constant	Value	File:line
`PROCESSING_INTERVAL`	`4` (seconds)	`LXMF/LXMRouter.py:31`
`JOB_OUTBOUND_INTERVAL`	`1`	`LXMF/LXMRouter.py:852`

So the effective outbound tick is every 4 seconds. Any per-message timer (path-request defer, retry backoff, link-establish timeout) is sampled at this granularity — a 10-second backoff isn't actually 10 seconds, it's "first tick at or after now + 10s."

handle_outbound also kicks process_outbound directly on a fresh thread when a new message is queued (LXMF/LXMRouter.py:1691), so the first attempt doesn't wait for the next jobloop tick.
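
To make the sampling granularity concrete, here is a minimal sketch that computes when a deferred attempt can actually fire. It assumes ticks land on exact multiples of the tick period measured from a hypothetical reference time `tick_origin`, and that the timer check is a strict "now is later than the deadline" comparison. `first_eligible_tick` is an illustrative helper, not an LXMF function, and real ticks will drift because each tick also spends time doing work.

```python
import math

PROCESSING_INTERVAL = 4      # seconds, value from the table above
JOB_OUTBOUND_INTERVAL = 1    # process_outbound runs on every jobloop tick


def first_eligible_tick(deadline, tick_origin=0.0):
    """Return the time of the first outbound tick strictly after `deadline`.

    Assumption: ticks occur at tick_origin + k * period for integer k >= 0.
    """
    period = PROCESSING_INTERVAL * JOB_OUTBOUND_INTERVAL
    elapsed = max(deadline - tick_origin, 0.0)
    k = math.floor(elapsed / period) + 1
    return tick_origin + k * period


# A 10-second wait armed at t=0 is first observed at the t=12 tick,
# and a 7-second wait at the t=8 tick.
print(first_eligible_tick(10))  # 12.0
print(first_eligible_tick(7))   # 8.0
```

Under these assumptions a timer can overshoot its nominal value by up to one tick period.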

Constants that drive retry behavior

All are class constants on LXMRouter (LXMF/LXMRouter.py:30-38):

Constant	Value	Meaning
`MAX_DELIVERY_ATTEMPTS`	`5`	Per-message attempt cap. Crossing this triggers `fail_message`.
`DELIVERY_RETRY_WAIT`	`10` (seconds)	Wait between attempts when path is known but the prior attempt didn't yield delivery proof.
`PATH_REQUEST_WAIT`	`7` (seconds)	Wait after issuing a `path?` request before the next attempt.
`MAX_PATHLESS_TRIES`	`1`	OPPORTUNISTIC only — number of attempts before forcing a path request.
`MESSAGE_EXPIRY`	`30*24*60*60` (30 days)	Used by propagation-node store cleanup, not the per-message retry path.
`LINK_MAX_INACTIVITY`	`10*60`	Direct-link idle teardown threshold (`clean_links`).
`P_LINK_MAX_INACTIVITY`	`3*60`	Propagation-link idle teardown threshold.

A full single-message retry budget for DIRECT or PROPAGATED is therefore 5 attempts × 10 seconds ≈ 50 seconds of wall-clock before fail_message runs, plus whatever each attempt itself spends inside the link-establishment / proof-wait window.
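
The arithmetic behind that budget, and the contrast with MESSAGE_EXPIRY, can be checked directly. This is only a back-of-the-envelope sketch using the values from the table above; it ignores tick quantization and the time each attempt spends on the wire, and the variable names are illustrative.

```python
MAX_DELIVERY_ATTEMPTS = 5
DELIVERY_RETRY_WAIT = 10              # seconds
MESSAGE_EXPIRY = 30 * 24 * 60 * 60    # seconds

# Nominal wall-clock spent waiting between attempts before fail_message.
nominal_retry_budget = MAX_DELIVERY_ATTEMPTS * DELIVERY_RETRY_WAIT

# MESSAGE_EXPIRY expressed in days, for comparison with the retry budget.
expiry_days = MESSAGE_EXPIRY / (24 * 60 * 60)

print(nominal_retry_budget)  # 50
print(expiry_days)           # 30.0
```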

Per-message state machine

States from LXMF/LXMessage.py:13-22:

State	Value	When
`GENERATING`	`0x00`	Stamp generation in progress (deferred-stamp messages only)
`OUTBOUND`	`0x01`	Queued in `pending_outbound`; not currently transmitting
`SENDING`	`0x02`	A send is in flight on the wire (packet sent / Resource transferring)
`SENT`	`0x04`	Wire send completed, but no end-to-end PROOF yet — also the terminal state for PROPAGATED (delivery to the recipient is the propagation node's job)
`DELIVERED`	`0x08`	End-to-end PROOF received from the final recipient — only reachable for OPPORTUNISTIC and DIRECT
`REJECTED`	`0xFD`	Receiver explicitly rejected (e.g. stamp validation failed on a propagation node)
`CANCELLED`	`0xFE`	Sender called `LXMessage.cancel` while still queued
`FAILED`	`0xFF`	`MAX_DELIVERY_ATTEMPTS` exhausted, or unrecoverable error

The valid-method enum is LXMessage.OPPORTUNISTIC = 0x01, DIRECT = 0x02, PROPAGATED = 0x03, PAPER = 0x05 (LXMF/LXMessage.py:29-32).
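
A clean-room client might mirror these values as enums. The sketch below copies the numeric values from the table and the method list above; the class names `State` and `Method` and the helper `is_terminal_success` are hypothetical and do not exist in LXMF, which keeps these as plain class attributes on LXMessage.

```python
from enum import IntEnum


class State(IntEnum):
    GENERATING = 0x00
    OUTBOUND = 0x01
    SENDING = 0x02
    SENT = 0x04
    DELIVERED = 0x08
    REJECTED = 0xFD
    CANCELLED = 0xFE
    FAILED = 0xFF


class Method(IntEnum):
    OPPORTUNISTIC = 0x01
    DIRECT = 0x02
    PROPAGATED = 0x03
    PAPER = 0x05


def is_terminal_success(state, method):
    """True when `state` is the success end-state for `method`."""
    if method == Method.PROPAGATED:
        # Handing the message to the propagation node is as far as the sender gets.
        return state == State.SENT
    if method in (Method.OPPORTUNISTIC, Method.DIRECT):
        return state == State.DELIVERED
    # PAPER is not covered by this flow; treated as never "successful" here.
    return False
```

For example, `is_terminal_success(State.SENT, Method.DIRECT)` is false, because a DIRECT message in SENT is still waiting for its end-to-end proof.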

Per-tick decision tree

process_outbound (LXMF/LXMRouter.py:2513) holds outbound_processing_lock across the whole tick (lines 2514-2515) and walks pending_outbound once. For each message, the top of the loop branches on terminal state first:

Branch	File:line	Effect
`state == DELIVERED`	2517-2542	Remove from queue. If method was DIRECT, perform backchannel-identify on the link so the recipient can reply over the same link.
`method == PROPAGATED and state == SENT`	2544-2546	Remove from queue (PROPAGATED's terminal success state is SENT, not DELIVERED — see state table).
`state == CANCELLED`	2548-2552	Remove and fire `failed_callback`.
`state == REJECTED`	2554-2558	Remove and fire `failed_callback`.
Else (`OUTBOUND` or `SENDING`)	2560+	Per-method retry/send branch — see below.
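
The table can be read as a small pure function. The following is a sketch of the branch order only, not the upstream code: `top_of_loop_branch` is a hypothetical name, and the returned tuple is an illustrative encoding of the "Effect" column.

```python
# State and method values copied from the tables above.
OUTBOUND, SENDING, SENT, DELIVERED = 0x01, 0x02, 0x04, 0x08
REJECTED, CANCELLED = 0xFD, 0xFE
DIRECT, PROPAGATED = 0x02, 0x03


def top_of_loop_branch(state, method):
    """Return (remove_from_queue, fire_failed_callback, note) for one message."""
    if state == DELIVERED:
        note = "backchannel-identify on link" if method == DIRECT else ""
        return (True, False, note)
    if method == PROPAGATED and state == SENT:
        return (True, False, "")
    if state == CANCELLED or state == REJECTED:
        return (True, True, "")
    # Anything else falls through to the per-method branches.
    return (False, False, "per-method retry/send branch")
```

Note that a DIRECT message in SENT is not removed by this function: only PROPAGATED treats SENT as terminal.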

The non-terminal branch in turn switches on lxmessage.method:

OPPORTUNISTIC branch (`LXMF/LXMRouter.py:2566-2592`)

if lxmessage.method == LXMessage.OPPORTUNISTIC:
    if lxmessage.delivery_attempts <= LXMRouter.MAX_DELIVERY_ATTEMPTS:
        if lxmessage.delivery_attempts >= LXMRouter.MAX_PATHLESS_TRIES \
                and not RNS.Transport.has_path(lxmessage.get_destination().hash):
            # Force a path request, defer PATH_REQUEST_WAIT seconds
            ...
            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
        elif lxmessage.delivery_attempts == LXMRouter.MAX_PATHLESS_TRIES + 1 \
                and RNS.Transport.has_path(...):
            # Path is known but prior attempt failed — drop_path + re-discover
            RNS.Reticulum.get_instance().drop_path(...)
            ...
            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT
        else:
            if not hasattr(lxmessage, "next_delivery_attempt") \
                    or time.time() > lxmessage.next_delivery_attempt:
                lxmessage.delivery_attempts += 1
                lxmessage.next_delivery_attempt = time.time() + LXMRouter.DELIVERY_RETRY_WAIT
                lxmessage.send()
    else:
        self.fail_message(lxmessage)

Key behaviors:

- The first attempt is "pathless-tolerant": if delivery_attempts < MAX_PATHLESS_TRIES (=1) and there's no path, the message still tries a send (relying on handle_outbound's pre-emptive path? at LXMF/LXMRouter.py:1675-1679).
- After the pathless tries are exhausted, an explicit path? is fired and the message defers PATH_REQUEST_WAIT (=7s).
- The MAX_PATHLESS_TRIES + 1 case is the "I have a stale path that didn't deliver" recovery: Reticulum.drop_path evicts the bad path table entry, then a fresh path? is requested.
- The else branch is the actual retransmit: increment attempts, schedule the next attempt at time.time() + DELIVERY_RETRY_WAIT (=10s), and call lxmessage.send().
- fail_message runs only after delivery_attempts > MAX_DELIVERY_ATTEMPTS; that is, attempts 1..5 are tried and attempt 6 trips fail_message.
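
The branch selection above can be exercised without a running router. The sketch below reproduces only the conditions shown in the excerpt and reports which branch a message would take on a given tick; it deliberately does not model the lines elided with `...`, so it returns a label rather than mutating any state. `opportunistic_branch` and its labels are hypothetical names.

```python
MAX_DELIVERY_ATTEMPTS = 5
MAX_PATHLESS_TRIES = 1


def opportunistic_branch(attempts, has_path, now, next_attempt=None):
    """Name the OPPORTUNISTIC branch taken for one message on one tick.

    `next_attempt=None` stands in for a message that has no
    next_delivery_attempt attribute yet.
    """
    if attempts > MAX_DELIVERY_ATTEMPTS:
        return "fail_message"
    if attempts >= MAX_PATHLESS_TRIES and not has_path:
        return "request_path"
    if attempts == MAX_PATHLESS_TRIES + 1 and has_path:
        return "drop_path_and_rediscover"
    if next_attempt is None or now > next_attempt:
        return "send"
    return "wait"


print(opportunistic_branch(0, False, now=0.0))                     # send
print(opportunistic_branch(1, False, now=5.0, next_attempt=10.0))  # request_path
print(opportunistic_branch(1, True, now=5.0, next_attempt=10.0))   # wait
print(opportunistic_branch(6, True, now=99.0, next_attempt=10.0))  # fail_message
```

The first call shows the pathless-tolerant first attempt: with zero attempts recorded, a missing path does not block the send.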

DIRECT branch (`LXMF/LXMRouter.py:2596-2673`)

Two sub-paths, decided by whether a usable link already exists in direct_links or backchannel_links:

Existing link, status == ACTIVE (lines 2616-2627):

- If state != SENDING, set the link as the delivery destination and call lxmessage.send().
- If state == SENDING, just log progress; the prior send is still pending its proof.

Existing link, status == CLOSED (lines 2628-2647):

- If the link was previously activated (activated_at != None), the link died unexpectedly: issue a fresh path? and schedule PATH_REQUEST_WAIT.
- Else (the link was never activated because LRPROOF never arrived on the prior attempt), retry the path request once via path_request_retried, then schedule PATH_REQUEST_WAIT.
- Either way, drop the dead link from direct_links / backchannel_links and schedule the next attempt at time.time() + DELIVERY_RETRY_WAIT.

No link exists (lines 2651-2670):

if not hasattr(lxmessage, "next_delivery_attempt") \
        or time.time() > lxmessage.next_delivery_attempt:
    lxmessage.delivery_attempts += 1
    lxmessage.next_delivery_attempt = time.time() + LXMRouter.DELIVERY_RETRY_WAIT

    if lxmessage.delivery_attempts < LXMRouter.MAX_DELIVERY_ATTEMPTS:
        if RNS.Transport.has_path(lxmessage.get_destination().hash):
            delivery_link = RNS.Link(lxmessage.get_destination())
            delivery_link.set_link_established_callback(self.process_outbound)
            self.direct_links[delivery_destination_hash] = delivery_link
        else:
            RNS.Transport.request_path(lxmessage.get_destination().hash)
            lxmessage.next_delivery_attempt = time.time() + LXMRouter.PATH_REQUEST_WAIT

The set_link_established_callback(self.process_outbound) re-entry is what lets the next tick after a successful LRPROOF immediately enter the "existing ACTIVE link" branch and fire send() — see send-link-lxmf.md §2 for why this works.

fail_message runs at lines 2671-2673 once delivery_attempts > MAX_DELIVERY_ATTEMPTS.
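
As a summary of the DIRECT sub-paths, the following sketch maps a message's situation to the sub-path described above. It is an illustrative classifier only: the string link statuses stand in for RNS.Link.ACTIVE and RNS.Link.CLOSED, the ordering of the attempt-cap check ahead of the link checks is inferred from the line ranges cited here rather than quoted from source, and `direct_branch` and its labels are hypothetical.

```python
MAX_DELIVERY_ATTEMPTS = 5


def direct_branch(attempts, link_status, is_sending=False, was_activated=False):
    """Name the DIRECT sub-path for one message on one tick.

    link_status: "ACTIVE", "CLOSED", or None when no link is cached.
    """
    if attempts > MAX_DELIVERY_ATTEMPTS:
        return "fail_message"
    if link_status is None:
        return "establish_link_or_request_path"
    if link_status == "ACTIVE":
        return "log_progress" if is_sending else "send_over_link"
    if link_status == "CLOSED":
        if was_activated:
            return "link_died_request_path"
        return "never_activated_retry_path_request"
    # Link states not covered by the text above (assumption: nothing to do yet).
    return "other"
```

For instance, `direct_branch(1, "ACTIVE", is_sending=True)` yields `"log_progress"`, matching the case where the prior send is still pending its proof.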

PROPAGATED branch (`LXMF/LXMRouter.py:2677-2730`)

Structurally mirrors DIRECT but against outbound_propagation_link / outbound_propagation_node instead of per-recipient direct links. Two early failures:

- outbound_propagation_node == None → immediate fail_message (lines 2680-2682). LXMF will not attempt PROPAGATED without an explicitly configured node; there is no automatic fallback from DIRECT/OPPORTUNISTIC to PROPAGATED. Sideband configures one via LXMRouter.set_outbound_propagation_node at startup; a clean-room client must do the same before the user picks PROPAGATED.
- All MAX_DELIVERY_ATTEMPTS exhausted → fail_message (lines 2728-2730).

Otherwise the link-state branching is identical to DIRECT: ACTIVE → send / CLOSED → drop and retry / no-link → establish-or-path-request.

The terminal transition: `fail_message`

LXMF/LXMRouter.py:2395-2402:

def fail_message(self, lxmessage):
    RNS.log(str(lxmessage)+" failed to send", RNS.LOG_DEBUG)

    lxmessage.progress = 0.0
    if lxmessage in self.pending_outbound: self.pending_outbound.remove(lxmessage)
    if lxmessage.state != LXMessage.REJECTED: lxmessage.state = LXMessage.FAILED
    if lxmessage.failed_callback != None and callable(lxmessage.failed_callback):
        lxmessage.failed_callback(lxmessage)

A few non-obvious properties:

- REJECTED is preserved when present (the receiver explicitly rejected, so the reason is not overwritten).
- The message is removed from pending_outbound synchronously; the failed_callback fires on the same thread as process_outbound. Callbacks must not block.
- There is no automatic re-queue or method change on failure. A failed DIRECT message does not get retried as PROPAGATED. Apps that want that fallback have to implement it themselves on top of the failed_callback.
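
One way an app could layer a DIRECT→PROPAGATED fallback on top of failed_callback without blocking the process_outbound thread is to have the callback do nothing but enqueue, and to resubmit from an app-owned thread. This is a sketch under stated assumptions, not upstream or Sideband behavior: `on_failed`, `drain_fallbacks`, and the `resubmit` callable are hypothetical, and how `resubmit` rebuilds and queues the message is left to the app.

```python
import queue

PROPAGATED = 0x03  # LXMessage.PROPAGATED, from the method enum above

# Failed messages waiting for the app to retry them with another method.
retry_queue = queue.Queue()


def on_failed(lxmessage):
    """failed_callback: runs on the process_outbound thread, so only enqueue."""
    retry_queue.put(lxmessage)


def drain_fallbacks(resubmit):
    """Call from an app-owned thread; hands each failed message to `resubmit`.

    `resubmit(message, method)` is an app-supplied callable (hypothetical)
    that builds and queues a replacement message with the given method.
    Returns the number of messages handed over.
    """
    handled = 0
    while True:
        try:
            failed = retry_queue.get_nowait()
        except queue.Empty:
            return handled
        resubmit(failed, PROPAGATED)
        handled += 1
```

Keeping the callback to a single `put` matches the "callbacks must not block" constraint, and the app still has to guard against retry loops, for example by not registering `on_failed` on the replacement message.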

What does NOT happen

These are common assumptions that don't match upstream behavior. Listed here so reimplementers don't trust their intuition:

- No automatic DIRECT→PROPAGATED fallback: see PROPAGATED branch above. The user (or app) chose desired_method at message construction time; LXMF never overrides it on failure.
- No exponential backoff: DELIVERY_RETRY_WAIT = 10s is constant across attempts 1..5.
- No persistence of pending_outbound to disk by default: pending outbound messages live in process memory. An LXMRouter restart drops them. (Sideband persists messages at the app level, not via LXMRouter.)
- MESSAGE_EXPIRY is not a per-message send timeout. It governs the propagation-node store (how long the node retains a message for offline pickup); it does not bound how long a single sender will keep retrying. The retry loop bounds itself via MAX_DELIVERY_ATTEMPTS, which at ~10s per attempt is ~50 seconds, not 30 days.
- SENT is not DELIVERED. PROPAGATED reaches SENT after the propagation node accepts the message; the recipient may pick it up minutes, hours, or days later. There is no end-to-end delivery proof for PROPAGATED messages until the recipient comes online and emits it (see send-propagated-lxmf.md §6).
- Path-request preamble is OPPORTUNISTIC-only at submit time. handle_outbound only fires the pre-emptive path? when lxmessage.method == OPPORTUNISTIC (LXMF/LXMRouter.py:1675). DIRECT and PROPAGATED rely on process_outbound's no-link branch to discover the path on the first tick.

Source map

Concern	File	Function / line
Class constants	`LXMF/LXMRouter.py`	30-83
Job interval table	`LXMF/LXMRouter.py`	852-859
`jobs` dispatcher	`LXMF/LXMRouter.py`	860-887
`jobloop` daemon	`LXMF/LXMRouter.py`	889-899
Pre-emptive path request on submit	`LXMF/LXMRouter.py`	1675-1679
`handle_outbound` thread kick	`LXMF/LXMRouter.py`	1691
`process_outbound` entry + lock	`LXMF/LXMRouter.py`	2513-2515
Terminal-state branches (DELIVERED / SENT-PROPAGATED / CANCELLED / REJECTED)	`LXMF/LXMRouter.py`	2517-2558
OPPORTUNISTIC retry branch	`LXMF/LXMRouter.py`	2566-2592
DIRECT retry branch	`LXMF/LXMRouter.py`	2596-2673
PROPAGATED retry branch	`LXMF/LXMRouter.py`	2677-2730
`fail_message`	`LXMF/LXMRouter.py`	2395-2402
Message states	`LXMF/LXMessage.py`	13-22
Delivery methods	`LXMF/LXMessage.py`	29-33
