fix(ble): Add connection race condition prevention and improve error handling

Implements comprehensive connection state tracking to prevent "Operation
already in progress" errors and connection retry storms.

BLE Interface changes:
- Record connection attempts before calling driver.connect()
- Add 5-second rate limiting between attempts to same peer
- Skip connections already in progress via _connecting_peers check
- Downgrade expected race conditions to DEBUG level
- Auto-blacklist MAC addresses on connection failures
- Add diagnostic logging for concurrent connection tracking

BLE Driver changes:
- Add _connecting_peers set to track in-progress connections
- Prevent concurrent connection attempts to same address
- Attach cleanup callbacks to connection Futures
- Add defense-in-depth cleanup in finally blocks
- Add detailed logging for connection state debugging

Documentation updates:
- Add deployment workflow documentation to README.md
- Update .github/workflows/README.md with CD workflow details
- Document containerized runner SSH configuration
- Update reference documentation (CLAUDE.md, BLE_PROTOCOL, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
torlando-tech 2025-11-07 22:32:00 -05:00
commit 12ff03d2fa
7 changed files with 444 additions and 5 deletions


@@ -40,6 +40,18 @@ class DriverState(Enum):
class BLEDriverInterface(ABC):
    """
    Abstract interface for a platform-specific BLE driver.

    Driver implementations should maintain connection state tracking
    to prevent race conditions from concurrent connection attempts:

        self._connecting_peers: set = set()  # addresses with pending connections
        self._connecting_lock: threading.Lock = threading.Lock()

    The connect() method should check this set before initiating a connection,
    and always clean up the set in a finally block to ensure proper state
    management even on connection failures. This prevents "Operation already
    in progress" errors when discovery callbacks trigger multiple simultaneous
    connection attempts to the same peer.
    """

    # --- Callbacks ---
@@ -256,6 +268,11 @@ This tier tests your actual `BleakDriver` implementation against real hardware.
* **Scanning Test:** Run a script that starts the driver and prints discovered devices. Verify that it finds your other test device.
* **Connection Test:** Write a script to connect to the test device. Verify that the `on_device_connected` callback fires and that `driver.connected_peers` is updated.
* **Data I/O Test:** After connecting, use `driver.send()` to send a simple "hello world" byte string. On the other device, verify that the bytes are received correctly. Test this in both directions.
* **Connection Race Condition Test:** Simulate rapid discovery callbacks for the same peer (e.g., by triggering `on_device_discovered` multiple times in quick succession). Verify that:
    - Only one connection attempt is made (while the attempt is in flight, `driver._connecting_peers` contains exactly one entry for that peer)
    - No "Operation already in progress" errors appear in the logs
    - The `_connecting_peers` set is emptied once the attempt finishes, whether it succeeds or fails
    - Subsequent connection attempts to the same peer are rate-limited (5-second minimum interval)
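
The race condition checks can also be rehearsed without hardware. The following is a minimal sketch, assuming a hypothetical `SketchDriver` stand-in rather than the real `BleakDriver`. The names `SketchDriver`, `AttemptLimiter`, `fire_discovery_burst`, and the `release` event are illustrative and not part of the project; only `_connecting_peers` and `_connecting_lock` come from the driver interface docstring. The 5-second limit is modeled separately with an explicit clock value so the check stays deterministic.

```python
import threading
import time


class SketchDriver:
    """Hypothetical stand-in for a driver; it only models the connect() guard."""

    def __init__(self, fail=False):
        self._connecting_peers = set()
        self._connecting_lock = threading.Lock()
        self.release = threading.Event()  # test hook: lets the pending attempt finish
        self.attempts = []                # addresses for which an attempt really started
        self._fail = fail

    def connect(self, address):
        """Return True on success, False if skipped as a duplicate or failed."""
        with self._connecting_lock:
            if address in self._connecting_peers:
                return False              # an attempt is already in progress
            self._connecting_peers.add(address)
            self.attempts.append(address)
        try:
            self.release.wait(timeout=5.0)  # simulated connection work
            if self._fail:
                raise ConnectionError("simulated failure")
            return True
        except ConnectionError:
            return False
        finally:
            # Always clean up, on success and on failure alike.
            with self._connecting_lock:
                self._connecting_peers.discard(address)


class AttemptLimiter:
    """Hypothetical per-peer rate limiter with an injected clock value."""

    def __init__(self, min_interval=5.0):
        self._min_interval = min_interval
        self._last_attempt = {}

    def allow(self, address, now):
        last = self._last_attempt.get(address)
        if last is not None and now - last < self._min_interval:
            return False
        self._last_attempt[address] = now
        return True


def fire_discovery_burst(driver, address, count=5):
    """Call connect() from several threads, as rapid discovery callbacks might."""
    results = []
    results_lock = threading.Lock()

    def worker():
        ok = driver.connect(address)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()

    # Wait until every duplicate call has been rejected, then snapshot the set.
    deadline = time.monotonic() + 2.0
    while time.monotonic() < deadline:
        with results_lock:
            if len(results) >= count - 1:
                break
        time.sleep(0.005)
    with driver._connecting_lock:
        in_flight = set(driver._connecting_peers)

    driver.release.set()                  # let the single real attempt complete
    for t in threads:
        t.join()
    return results, in_flight


if __name__ == "__main__":
    driver = SketchDriver()
    results, in_flight = fire_discovery_burst(driver, "AA:BB:CC:DD:EE:FF")
    assert driver.attempts == ["AA:BB:CC:DD:EE:FF"]
    assert in_flight == {"AA:BB:CC:DD:EE:FF"}
    assert driver._connecting_peers == set()
```

Against real hardware the same assertions apply, but the simulated wait is replaced by an actual connection attempt, so timing-dependent checks may need a longer deadline.
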
### Tier 3: End-to-End Testing (Full Stack)