ble-reticulum/migration/Migration_Testing_Journal.md

256 lines
15 KiB
Markdown
Raw Permalink Normal View History

# Migration Testing Journal
May 17, 2026 3:30 AM - I had successfully testing the CPP classes on the 2 Rapsberry Pi Zero 2Ws using the Constitution. I then queried Codex and ask: how could I be certain I was ussing the compiled CPP library and not the older Python code. It made changes which I did not examine which would regurgitate what it was using. I pushed to Forgejo and then pulled to Zeros. That's when things went south. Suddenly the Constitution was not being cross transmitted in 36 seconds and the 60 seconds timeout trigger resulting in an incomplete transfer on both side. Then, I think, more modifications were made and deployed and then the BLE could not even advertise and/or register. Codex suggestion was to reboot the Pis. This is where I disagreed. Here is an opportunit to take a rock-solid hardware state and learn why it was disturbed, if it was disturbed. This occured about 6 hours ago on Saturday evening, May 15th. I've awakened and mulled this over and realized this may be an opportunity to identify a more low level bug. The latest runs were tmp/run13 where a connection could not even be established.
So, I'll do run14 using the exact same command clumps to see if the condition remains, or if the passage of time has changed something. Hopefully, run14 will have the same errors.
I'll also source control my Pi command clumps.
May 17, 2026 3:39 AM Using existing ssh shells, I reran the test and duplicated the error condition. Saved in run14. zerodev1 has announcements and no indication of error. zerodev2 has errors:
```bash
[03:39:27.352] Requesting path to <926e6d3b35b7d5940be7edeb47c41b78>
[2026-05-17 03:39:28] [Critical] BLEInterface[BLE Interface] driver critical: Scanner callback failure detected (0 callbacks for 19 consecutive scans). Bluetooth stack requires reboot. - Exception: BleakScanner callbacks not invoked
```
Codex advises:
```
Run14 supports “low-level BLE/scanner path,” not C++ fragmentation:
zerodev2 is forced Python.
It advertises successfully.
It never establishes a path/link.
It repeatedly reports BleakScanner callbacks not invoked.
zerodev1 shows no scanner callback errors in the shown log.
So the immediate code point to instrument is LinuxBluetoothDriver._perform_scan(), around BleakScanner(... detection_callback=...), scanner.start(), scanner.stop(), and callback_count[0] == 0.
```
2026-05-17 04:59:38 -07:00
So I'll try a run with no change on zerodev1, but force "cpp" for zerdev2, I'll keep the logs in run14.
May 17, 2026 4:45 AM Did a lot of forensics for run 14, finally concluded a reboot of both was last resort. Also, created symlinks under ~/.reticulum/interfaces to this project's files. Previous runs may not have been using the "shim" given static file under ~/.reticulum/interfaces. For example:
```bash
So this discrepancy may invalidate all prior testing? Should I try using soft links under ~/.reticulum/interfaces pointing to the git controlled project files?
jlpoole@zerodev1:/usr/local/src/ble-reticulum $ ls src/ble_reticulum/ -la
total 312
drwxrwxr-x 3 jlpoole jlpoole 4096 May 16 20:28 .
drwxrwxr-x 3 jlpoole jlpoole 4096 May 15 11:25 ..
-rw-rw-r-- 1 jlpoole jlpoole 9891 May 16 08:11 BLEAgent.py
-rw-rw-r-- 1 jlpoole jlpoole 20471 May 15 11:25 BLEFragmentation.py
-rw-rw-r-- 1 jlpoole jlpoole 2433 May 16 19:36 BLEFragmentationBackend.py
-rw-rw-r-- 1 jlpoole jlpoole 27232 May 16 08:11 BLEGATTServer.py
-rw-rw-r-- 1 jlpoole jlpoole 120620 May 16 20:28 BLEInterface.py
-rw-rw-r-- 1 jlpoole jlpoole 0 May 15 11:25 __init__.py
drwxrwxr-x 2 jlpoole jlpoole 4096 May 16 19:37 __pycache__
-rw-rw-r-- 1 jlpoole jlpoole 7340 May 15 11:25 bluetooth_driver.py
-rw-rw-r-- 1 jlpoole jlpoole 107923 May 16 08:11 linux_bluetooth_driver.py
jlpoole@zerodev1:/usr/local/src/ble-reticulum $ ls ~/.reticulum/interfaces/ -la
total 308
drwxrwxr-x 3 jlpoole jlpoole 4096 May 15 16:35 .
drwxrwxr-x 4 jlpoole jlpoole 4096 May 16 07:43 ..
-rw-rw-r-- 1 jlpoole jlpoole 9751 May 15 12:19 BLEAgent.py
-rw-rw-r-- 1 jlpoole jlpoole 20471 May 15 12:19 BLEFragmentation.py
-rw-rw-r-- 1 jlpoole jlpoole 27092 May 15 12:19 BLEGATTServer.py
-rw-rw-r-- 1 jlpoole jlpoole 119128 May 15 12:19 BLEInterface.py
-rw-rw-r-- 1 jlpoole jlpoole 0 May 15 12:19 __init__.py
drwxrwxr-x 2 jlpoole jlpoole 4096 May 15 16:35 __pycache__
-rw-rw-r-- 1 jlpoole jlpoole 7340 May 15 12:19 bluetooth_driver.py
-rw-rw-r-- 1 jlpoole jlpoole 107783 May 15 12:19 linux_bluetooth_driver.py
jlpoole@zerodev1:/usr/local/src/ble-reticulum
```
I tried CPP versions and I extended the timeout from 60 seconds to 90, the previous 60 was cutting it too close, though at the very early tests, the Constitution could be transferred in 36 seconds, so note should be made that something was introduced that slowed the transfer down.
```bash
(rnsenv) jlpoole@jp /usr/local/src/ble-reticulum/scripts $ date; ./analyze_reticulum_file_transfer_20260516_1130.pl ../tmp/run15/20260517_0439_zerodev1_Constitution_CPP.txt ../tmp/run15/20260517_0439_zerodev2_Constitution_CPP.txt
Sun May 17 04:52:33 PDT 2026
Reticulum BLE file transfer analysis
Generated: 2026-05-17 04:52:33 PDT
Input files:
../tmp/run15/20260517_0439_zerodev1_Constitution_CPP.txt
../tmp/run15/20260517_0439_zerodev2_Constitution_CPP.txt
Log provenance summary:
20260517_0439_zerodev1_Constitution_CPP.txt receiver=zerodev1 date='Sun May 17 04:44:06 PDT 2026' command_lines=21 post_marker_lines=591
20260517_0439_zerodev2_Constitution_CPP.txt receiver=zerodev2 date='Sun May 17 04:44:08 AM PDT 2026' command_lines=20 post_marker_lines=587
Chrony clock notes from logs:
20260517_0439_zerodev1_Constitution_CPP.txt
System time : 0.000602999 seconds fast of NTP time
System time : 0.000000797 seconds slow of NTP time
20260517_0439_zerodev2_Constitution_CPP.txt
System time : 0.000199690 seconds slow of NTP time
System time : 0.000000200 seconds fast of NTP time
Declared outbound sends observed in logs:
sender=zerodev1 file=US_Constitution.txt chunks= 140 bytes= 44225 chunk_data_bytes=316
sender=zerodev2 file=US_Constitution.txt chunks= 148 bytes= 44225 chunk_data_bytes=n/a
Direction: zerodev1->zerodev2
file : US_Constitution.txt
chunks received : 140 of 140
completeness : 100.00%
missing chunks : none
duplicate chunks : none
payload bytes RX : 44225
first chunk RX : 04:44:33.283
last chunk RX : 04:45:09.259
receiver span : 35.976 s
sender span : 14.299 s
payload rate RX span : 1229.3 B/s 9834.3 bit/s
payload rate TX span : 3092.8 B/s 24742.6 bit/s
one-way latency min/median/mean/p95/max/stddev: 246.163 / 11275.384 / 11238.875 / 21147.058 / 21922.931 / 6402.281 ms
receiver inter-chunk gap min/median/mean/p95/max/stddev: 191.000 / 244.000 / 258.820 / 297.300 / 341.000 / 36.970 ms
sender inter-chunk gap min/median/mean/p95/max/stddev: 101.924 / 102.945 / 102.872 / 104.299 / 109.563 / 0.917 ms
Direction: zerodev2->zerodev1
file : US_Constitution.txt
chunks received : 148 of 148
completeness : 100.00%
missing chunks : none
duplicate chunks : none
payload bytes RX : 44225
first chunk RX : 04:44:33.284
last chunk RX : 04:45:07.703
receiver span : 34.419 s
sender span : 15.470 s
payload rate RX span : 1284.9 B/s 10279.2 bit/s
payload rate TX span : 2858.7 B/s 22869.8 bit/s
one-way latency min/median/mean/p95/max/stddev: 254.424 / 9806.025 / 9750.358 / 18371.803 / 19203.224 / 5533.436 ms
receiver inter-chunk gap min/median/mean/p95/max/stddev: 145.000 / 243.000 / 234.143 / 246.700 / 292.000 / 24.080 ms
sender inter-chunk gap min/median/mean/p95/max/stddev: 103.738 / 104.216 / 105.239 / 108.147 / 122.931 / 2.539 ms
Hello/handshake RX records:
zerodev2 -> zerodev1 recv=04:44:32.951 latency= 118.171 ms message='hello'
zerodev1 -> zerodev2 recv=04:44:32.963 latency= 113.954 ms message='hello back'
Caution: one-way latency assumes sender and receiver clocks are synchronized.
Your chronyc tracking output helps bound this error, but it is not a substitute for ACK/round-trip timing.
(rnsenv) jlpoole@jp /usr/local/src/ble-reticulum/scripts $
```
2026-05-17 13:22:25 -07:00
May 17, 2026 1:21 PM Identities CPP works and tested in field with bilateral Constitution test. Report: migration/reports/identities_20260517_report.md
2026-05-18 14:44:25 -07:00
May 17, 2026 2:22 PM Codex & I went round and found, even with a reboot of zerodev1 documented in run16 (now preserved) and Codex concluded:
The end of run16 is very useful.
On `zerodev2`, the harness definitely used the C++ backend:
```text
BLEInterface[BLE Interface] fragmentation backend: cpp
(fragmenter=ble_protocol_core_cpp.BLEFragmenter, reassembler=ble_protocol_core_cpp.BLEReassembler)
```
It also successfully advertised:
```text
Advertisement registered
```
Then `zerodev1`s standalone Bleak probe saw that advertisement:
```text
B8:27:EB:54:BF:0E zerodev2 zerodev2 ['37145b00-442d-4a94-917f-8f42c5da28e3']
callbacks: 1
```
So `zerodev1` scanner callbacks are not fundamentally broken. They work when there is a known advertiser.
The confusing part is on `zerodev2`: while advertising successfully, its own scanner loop reports:
```text
Scanner callback failure detected (0 callbacks for 3...13 consecutive scans)
```
That may be a false-positive in quiet RF conditions or while the other Pi is not advertising. The current detector treats “zero callbacks” as a Bluetooth stack failure, but your evidence now shows zero callbacks can happen even while the node is otherwise healthy enough to register advertisements, and `btmgmt find` is not reliable on this platform.
My updated interpretation:
- `btmgmt find` should be removed from the health criteria for these Pis.
- “0 Bleak callbacks” should not automatically mean “Bluetooth stack requires reboot” unless there is a known advertiser expected during that scan window.
- The scanner callback critical log in `linux_bluetooth_driver.py` is likely too aggressive for this environment.
- The C++ backend is confirmed active in this run and is not implicated by these scanner warnings.
For the next code improvement, I would change the scanner diagnostic severity/message. Something like: after repeated zero-callback scans, log a warning that no BLE advertisements were observed, but do not call it “requires reboot” unless there is additional evidence such as scanner start failure, adapter not powered, HCI timeout, or known peer advertisement missed repeatedly.
That is separate from the protocol-core migration and should probably be handled as a BLE platform robustness fix.
Codex's carry-over summary for a new chat, Phase 2:
Yes, start a new chat for Phase 2. This one has a lot of forensic context mixed with migration work. Seed the new chat with this:
```text
Project: ble-reticulum migration to C++, Phase 2
Repo: /usr/local/src/ble-reticulum
Branch: c++migration
Current date context: May 17, 2026
Reticulum env on workstation: /home/jlpoole/rnsenv/bin/activate
On Pi Zero 2Ws I use system python3, not a venv.
Important current state:
- C++ protocol-core module exists under migration/protocol_core:
- migration/protocol_core/ble_protocol_core.cpp
- migration/protocol_core/setup.py
- pybind11 module name: ble_protocol_core_cpp
- Implemented/bound:
- BLEFragmenter
- BLEReassembler
- HDLCFramer
- compute_identity_hash(peer_identity)
- get_fragmenter_key(peer_identity, peer_address)
- Tests exist under migration/tests:
- test_fragmentation_cpp_equivalence.py
- test_fragmentation_backend_shim.py
- test_identity_helpers_cpp_equivalence.py
- In rnsenv, latest verification:
- identity helper tests: 20 passed, 1 skipped; real BLEInterface path used, source fallback skipped.
- fragmentation/backend tests: 36 passed.
- BLEFragmentationBackend.py shim exists and supports:
- BLE_RETICULUM_FRAGMENTATION_BACKEND=auto|cpp|python
- BLE_RETICULUM_FRAGMENTATION_BACKEND_REPORT=1
- BLEInterface.py was modified only to import fragmentation classes through the shim and report selected backend. Do not port BLEInterface wholesale.
- External Reticulum interface files on the Pis should be symlinked to repo files in ~/.reticulum/interfaces, especially:
- BLEInterface.py
- BLEFragmentation.py
- BLEFragmentationBackend.py
- linux_bluetooth_driver.py
- Live field test:
- bilateral U.S. Constitution transfer succeeded after symlinks and C++ backend.
- logs confirmed C++ backend line:
BLEInterface[BLE Interface] fragmentation backend: cpp
- Important forensic conclusion:
- `sudo btmgmt find` returns `0x0b Rejected` even after clean reboot on Pi Zero 2W, so it is not a valid health baseline on this platform.
- Standalone Bleak scanner probe can see zerodev2 advertisement from zerodev1.
- Scanner “0 callbacks requires reboot” diagnostic in linux_bluetooth_driver.py appears too aggressive in quiet RF/known-advertiser-absent conditions.
- This is platform robustness work, not evidence against the C++ protocol core.
Phase 2 intent:
- Continue conservative C++ extraction.
- Do not change live BLE behavior unless explicitly requested.
- Prefer tests first.
- Treat BLEInterface.py as Python integration/glue/platform unless extracting very small pure helper logic.
- Avoid porting _handle_identity_handshake unless explicitly asked.
```
Also mention what you want Phase 2 to mean. Good candidates are:
```text
For Phase 2, I want to extract the next small protocol-only unit, with tests first, and avoid BLEInterface behavioral changes.
```
or:
```text
For Phase 2, I want to address the scanner false-positive diagnostic separately as a platform robustness fix.
```
2026-05-19 09:44:03 -07:00
May 18, 2026 2:36 PM - saved a copy of the SQLite database under sql/archive. Memorialized prompt for Phase 2A from ChatGPT in Codex_prompt_State2A_20250518_1433.md.
Trying to reconcile Codex's report of warnings vs. what I get, ChatGPT had me run several tests and I provided the results and it concluded:
Codex report showed “2 warnings,” but on jp the warnings were not reproducible.
Rerun with -W always -rw showed:
fragmentation backend shim: 9 passed, no warnings
Gate 2C/2D/2E regression: 39 passed, 1 skipped, no warnings
2026-05-20 17:33:14 -07:00
The single skip is expected and is not a warning.
Tuesday, May 19, 2026 10:18 I'm looking for work I did with microreticulum and see that I may have conflated my own exercises under microreticulumTbeam thinking I had proven microreticulum can successfully use Reticulum over Lora. But I see my exercises were pure Lora, and not Reticulum and then proceeded to explore the hardware of the T-Beam Supreme. I did find in microreticulum/chad that I had built:
```bash
(rnsenv) jlpoole@jp /usr/local/src/microreticulum/chad $ date; find examples -type f -name "program" |xargs ls -lat
Tue May 19 10:22:31 PDT 2026
-rwxr-xr-x 1 jlpoole jlpoole 8950024 Jan 22 14:26 examples/udp_announce/.pio/build/native/program
-rwxr-xr-x 1 jlpoole jlpoole 8984072 Jan 22 12:46 examples/link_native/.pio/build/native/program
(rnsenv) jlpoole@jp /usr/local/src/microreticulum/chad $
```