18 KiB
Exercise 305 post-mortem: BLE file transfer corruption under load
Date: 2026-05-20
Exercise 305 was created to move beyond BLE ping-pong messages and test a real bilateral file transfer over a negotiated microReticulum Link. Both ESP32-S3 T-Beam SUPREME boards run the same firmware, form a BLE transport connection, establish a Reticulum Link, and then send the selected text file at the same time.
The test exposed a failure that looked at first like a Reticulum cryptographic problem:
Decryption failed on link {...}. The contained exception was: Token token HMAC was invalid
RX FILE END ... received=0/... chunks=0/... status=VERIFY_FAIL
The important observation was that the failure followed BLE role/path, not a particular physical board. The server-side receiver could receive the bilateral US Constitution transfer, while the client-side receiver of server-originated chunks failed. In some runs a receiver accepted a small number of packets and then entered a long run of HMAC failures. That pattern suggested that the negotiated Link could be valid initially, and that later traffic pressure might be corrupting bytes, selecting the wrong Link context, overflowing a queue, or exhausting a constrained runtime resource.
This post-mortem records what was instrumented, what the live T-Beam serial sessions showed, and what was fixed.
Test setup
The active test was:
source /home/jlpoole/rnsenv/bin/activate
cd /usr/local/src/microreticulum/microReticulumTbeam
pio run -d exercises/305_microReticulum_ble_file_transfer -e tbeam_constitution_pi_zero_profile
The same firmware image was uploaded to both boards:
pio run -d exercises/305_microReticulum_ble_file_transfer -e tbeam_constitution_pi_zero_profile -t upload --upload-port /dev/ttytBOB
pio run -d exercises/305_microReticulum_ble_file_transfer -e tbeam_constitution_pi_zero_profile -t upload --upload-port /dev/ttytDAN
The serial devices used for the live debug session were:
/dev/ttytBOB
/dev/ttytDAN
The Pi-Zero-comparison profile sends:
US_Constitution.txt: 44225 bytes
Application chunk size: 300 bytes
Application chunk interval: 100 ms
Repeat rest: 10000 ms after each completed transmission
Those are application-level file chunks. microReticulum then wraps and encrypts them into Link packets, and the BLE adapter fragments those encrypted packets into BLE characteristic writes/notifications.
Initial hypotheses
The requested debug work was designed to distinguish several possible failure classes:
- Wrong Link/Fernet/HMAC key or wrong Link/session object.
- Corrupted, truncated, or shifted encrypted packet bytes before decrypt.
- BLE fragmentation or reassembly failure.
- Queue overflow or silent overwrite between BLE callback and Reticulum processing.
- Borrowed buffer lifetime or buffer reuse after a BLE callback returns.
- Heap, stack, or fragmentation pressure.
- Role-specific TX/RX sequencing failure.
The central idea was to fingerprint the same packet at multiple layers without dumping full payloads. If the sender's encrypted token CRC matched the receiver's token CRC but HMAC failed, the likely cause would be wrong key/session/Link context. If the CRC differed, the likely cause would be mechanical corruption below Reticulum.
Instrumentation added
Exercise 305 now has a lightweight debug layer in:
src/DebugStats.h
src/DebugStats.cpp
The BLE adapter instrumentation is in:
src/TBeamSupremeBleInterface.h
src/TBeamSupremeBleInterface.cpp
The exercise application emits Link and application-send debug lines from:
src/main.cpp
The local microReticulum fork also has Link encrypt/decrypt instrumentation in:
/usr/local/src/microreticulum/microReticulum/src/Link.cpp
The debug output uses stable single-line prefixes so a log analyzer can parse them:
RNSLINK Link establishment, close, announce TX/RX, peer hashes, Link ids.
RNSTX Application plaintext sends and encrypted Reticulum packets handed to BLE.
RNSRX Reassembled BLE packets immediately before Reticulum receives them.
RNSDEC Link encrypt/decrypt attempts and classified decrypt failures.
RNSBLE BLE connect, identity handshake, fragment TX/RX, and packet assembly.
RNSQUEUE BLE RX queue depth, pushes, pops, drops, allocation failures, high-water marks.
RNSMEM Heap, largest block, PSRAM, and current task stack high-water mark.
RNSERR Classified adapter errors such as reassembly gap, bad header, queue overflow.
All records are one-line key=value records. Packet records include length, CRC32, first 4 bytes, and last 4 bytes, but not full payloads. Normal packet fingerprint logging is rate-limited: first 25 packets, every 25th packet after that, and all packets briefly after a failure. It can be made fully verbose with RNS_DEBUG_VERBOSE=1.
Link/key context fingerprinting
The Reticulum Link layer now prints non-secret fingerprints around encryption and decryption:
RNSDEC event=encrypt ...
RNSDEC event=attempt ...
RNSDEC event=HMAC_INVALID ...
The fields include:
link_id
token_len
token_crc32
sign_key_crc32
enc_key_crc32
link_obj
The key CRCs are not keys. They are non-secret fingerprints that let the logs answer questions such as:
Did this packet use the same Link object?
Did this node try to decrypt with the same negotiated key material?
Did the ciphertext bytes change between sender and receiver?
This is the crucial distinction. Reticulum can only report that an encrypted token failed authentication. It cannot, by itself, know whether the bytes were damaged in the transport or whether the wrong key/session was selected. The added fingerprints separate those cases.
BLE fragmentation instrumentation
The BLE adapter was changed to use an explicit debug-friendly fragment header:
byte 0 fragment type
byte 1 fragment header version
bytes 2-3 fragment index
bytes 4-5 total fragments
bytes 6-9 message id
bytes 10-13 full message length
The fragment header size is 14 bytes.
Each transmitted and received fragment can now be logged with:
RNSBLE event=fragment_tx ...
RNSBLE event=fragment_rx ...
Important fields include:
frag_index
frag_total
msg_id
msg_off
msg_total
frag_len
payload_len
frag_crc32
The receiver now validates:
fragment header version
fragment type
fragment count
message id
exact expected message length
payload length versus remaining message bytes
stale/mismatched message id
mid-message restart
assembled packet length
When reassembly succeeds, it prints:
RNSBLE event=packet_assembled assembled_len=... assembled_crc32=...
When it fails, it prints RNSERR with a specific class such as:
rx_fragment_too_short
rx_fragment_header_bad
rx_reassembly_gap
rx_reassembly_restart
rx_reassembly_stale_fragment
rx_reassembly_len_mismatch
Queue and buffer ownership changes
The BLE callback boundary was audited so incoming packets are copied into owned storage before being handed upward. The adapter avoids passing a borrowed pointer or view into a BLE callback buffer that may be reused after the callback returns.
The RX queue now tracks:
depth
max_depth
pushes
pops
drops
overwrites
allocation failures
high-water mark
It emits RNSQUEUE once per second and immediately on errors. Silent overwrite is treated as unacceptable during this test; a queue overflow increments a counter and emits RNSERR.
In the successful run after the fix, queue pressure was not the cause. Representative live stats showed no drops, no overwrites, and no allocation failures:
RNSQUEUE ... pushes=153 pops=153 drops=0 overwrites=0 alloc_failures=0 highwater=1
RNSQUEUE ... pushes=302 pops=302 drops=0 overwrites=0 alloc_failures=0 highwater=3
Memory and stack instrumentation
RNSMEM prints once per second and immediately after important failures:
free_heap
min_heap
largest_block
free_psram
min_psram
task
stack_hw_words
stack_hw_bytes
The "thin stack margin" finding was about per-task stack, not total heap. On FreeRTOS, each task has its own stack. A task can run out of stack while the chip still has plenty of free heap.
The risky task was the ESP32 BLE/Bluetooth callback task, commonly shown as BTC_TASK. Earlier debug output showed this task with a very small stack high-water margin, while overall heap remained healthy. That means the board was not simply out of RAM. Instead, the code running inside a BLE callback had too much local stack pressure for that task.
One debug helper originally used a local formatting buffer:
char buffer[512]
That was acceptable on a large stack, but risky inside BTC_TASK. During the live session, that local buffer contributed to a stack canary panic. The fix was to move the formatting buffer to static storage and guard serial writes with a FreeRTOS mutex:
static char debug_log_buffer[512]
This reduced callback stack pressure while preserving parseable logs.
Configurability note:
The Arduino loop task stack is relatively easy to tune.
The BLE/Bluetooth host callback task stack is controlled deeper in the ESP32/Arduino/Bluetooth stack.
So this was not an immediate hardware RAM ceiling. It was a task-stack budget issue caused by where the code was running. The practical fix was to keep callback-local stack use small and move heavier formatting/storage out of that stack.
The decisive finding: MTU was not payload size
The root transport bug was an MTU accounting error.
The code previously treated BLE MTU 185 as if 185 bytes could be written as the characteristic value. That is not correct. BLE ATT notifications/writes have protocol overhead. The usable characteristic value payload is:
ATT value size = ATT MTU - 3
For this exercise:
BLE_ATT_MTU = 185
BLE_VALUE_SIZE = 185 - 3 = 182
FRAG_HEADER = 14
BLE payload = 182 - 14 = 168
Before the fix, the sender attempted to send 185-byte characteristic values. The receiver saw only 182 bytes. That means the first fragment of a multi-fragment Reticulum token lost exactly 3 bytes.
The live debug logs made this visible. A received fragment showed:
RNSBLE event=fragment_rx ... msg_total=184 frag_len=182 payload_len=168
RNSERR class=rx_reassembly_len_mismatch ... assembled_181 expected_184
For a larger encrypted Reticulum packet, the same accounting error showed up as token shrinkage. A packet that should have been 384 bytes arrived as 378 bytes after two fragments lost 3 bytes each.
That observation moved the failure out of Reticulum and into BLE fragmentation:
Reticulum token bytes were being truncated before decrypt.
The HMAC failures were a correct symptom of damaged ciphertext.
Once the BLE value size was corrected to MTU - 3, the receiver's token CRCs matched the sender's token CRCs before decrypt, and the HMAC failures stopped for the bilateral Constitution transfer.
Why HMAC failures were expected
Reticulum Link packets are authenticated. If even one byte of the encrypted token is missing, shifted, or changed, the Link decrypt step must reject it.
That is what HMAC is for. The HMAC failure did not mean the cryptographic layer was broken. It meant the cryptographic layer detected that the bytes it received were not the bytes that were originally encrypted.
The new instrumentation makes the distinction concrete:
Same sender token_crc32 and receiver token_crc32, but HMAC_INVALID:
suspect wrong key, wrong Link object, or wrong session context.
Different sender token_crc32 and receiver token_crc32:
suspect BLE fragmentation, reassembly, queue, or buffer corruption.
In this case the BLE fragment logs and reassembly length mismatch proved that the bytes were truncated before Reticulum saw them.
Result after the fix
After correcting the BLE value size and moving the debug log buffer off the BLE task stack, both boards completed the Pi-Zero-profile US Constitution transfer in both directions.
Representative successful receiver output:
RX FILE END: from=Node-A0935A43CA48 file=US_Constitution.txt received=44225/44225 chunks=148/148 crc=4C7F2C7A status=OK
RX FILE END: from=Node-DC935A43CA48 file=US_Constitution.txt received=44225/44225 chunks=148/148 crc=4C7F2C7A status=OK
The live packet fingerprints also showed matching encrypted token CRCs at the sender and receiver for packet after packet. For example, encrypted token CRCs such as:
C3D5244B
15472EDA
998AFC8C
39B9D34D
appeared on both sides before decrypt, followed by valid RX FILE DATA lines instead of HMAC failures.
That result is important because it validates the whole stack under bilateral pressure:
same firmware image on both boards
deterministic BLE client/server tie-breaker
BLE transport
microReticulum Link establishment
encrypted Link packet send/receive
application file protocol
full file CRC verification
repeat transfer loop
Limitations discovered
BLE MTU is not the usable application payload
The configured ATT MTU is not the number of application bytes that can be placed in a characteristic value. The BLE ATT layer consumes 3 bytes. For MTU 185, the characteristic value size is 182 bytes.
This matters any time the firmware does its own fragmentation. The fragmenter must budget against the characteristic value size, not the MTU.
Fragment headers reduce available payload further
Exercise 305 now uses a 14-byte fragment header. With MTU 185:
185 ATT MTU
- 3 ATT overhead
- 14 Exercise 305 fragment header
= 168 encrypted Reticulum packet bytes per BLE fragment
The application file chunk size is separate from this. A 300-byte file chunk becomes a larger encrypted Reticulum token, and that token is then split into 168-byte BLE payload fragments.
Stack pressure is role/task dependent
The BLE callback path can run on a smaller Bluetooth task stack than ordinary application code. Heavy local variables, formatting buffers, or deep call paths inside that callback can trigger stack canary failures even when global heap looks fine.
The practical rule for this firmware is:
Keep BLE callback code small.
Copy incoming bytes into owned storage at the callback boundary.
Do not allocate large local buffers on the BLE callback stack.
Move verbose formatting out of callback-local stack where possible.
Logging itself can change the system
The debug logging was necessary, but the first version added stack pressure. That is an important lesson: instrumentation on a microcontroller is not free. It can expose a bug, hide a bug, or create a new one if it uses too much stack, heap, serial bandwidth, or callback time.
The current instrumentation tries to stay lightweight by:
printing fingerprints instead of payloads
rate-limiting normal packet logs
printing every error
using a static log buffer
tracking queue counters instead of dumping queue contents
Serial output can interleave
Exercise 305 debug output uses a mutexed logger, but some microReticulum fork output still uses direct Serial.printf calls. Under simultaneous BLE, Reticulum, and application output, a few serial lines can interleave or appear visually truncated.
That is a logging cleanliness issue, not evidence of transport corruption by itself. For a fully machine-parseable long-duration capture, the remaining direct Serial.printf calls in the fork should eventually route through the same debug logger or another shared serial lock.
Why live access mattered
The fastest part of this investigation was being able to upload firmware to both T-Beams and monitor both serial consoles immediately:
pio device monitor -p /dev/ttytBOB -b 115200
pio device monitor -p /dev/ttytDAN -b 115200
That made the debug loop short:
patch instrumentation
build
upload both boards
watch both roles at once
compare sender and receiver CRCs
adjust instrumentation
repeat
Without live dual-console access, the key clue would have been much harder to isolate. The logs had to be compared across both boards at the exact layer boundary where Reticulum handed encrypted bytes to BLE and where BLE handed reassembled bytes back to Reticulum.
Final diagnosis
The primary failure was not a bad Reticulum Link negotiation and not a board-specific hardware fault.
The main failure was:
BLE fragmentation used the ATT MTU as the characteristic value size.
The actual characteristic value size was ATT MTU minus 3.
This truncated BLE fragments.
Truncated encrypted Reticulum tokens correctly failed HMAC validation.
A secondary instrumentation problem was:
The first debug logger used too much BLE callback task stack.
Moving the log buffer to static storage removed that stack pressure.
After both fixes, the bilateral Exercise 305 US Constitution transfer completed successfully in both directions with full byte count, full chunk count, and matching file checksum.
Practical takeaways
For future ESP32 BLE Reticulum work:
Budget BLE fragments against ATT MTU - 3, not ATT MTU.
Keep callback stack usage small.
Deep-copy BLE callback bytes before passing them upward.
Treat HMAC failures as a symptom, not a root cause.
Use packet CRC fingerprints at every layer boundary.
Track queues explicitly; never silently overwrite during transport tests.
Use full-file verification so a successful run is unambiguous.
Exercise 305 is now a useful stress test because it proves more than "a packet got through." It proves that both devices can run the same firmware, negotiate roles, establish a microReticulum Link over BLE, transmit encrypted packets simultaneously, reassemble those packets correctly, and verify a 44 KB file in both directions.