From 7d0392b703bd2d2065efd484f735016ef3a9da84 Mon Sep 17 00:00:00 2001 From: Michal Humpula Date: Sun, 1 Mar 2026 07:39:32 +0100 Subject: [PATCH] cleanup docs --- README.md | 127 ++++++++++++------------ doc/API_DESIGN.md | 231 ------------------------------------------- doc/ARCHITECTURE.md | 167 ------------------------------- doc/TESTING.md | 233 ++++++-------------------------------------- 4 files changed, 93 insertions(+), 665 deletions(-) delete mode 100644 doc/API_DESIGN.md delete mode 100644 doc/ARCHITECTURE.md diff --git a/README.md b/README.md index d9503c4..9109578 100644 --- a/README.md +++ b/README.md @@ -8,31 +8,25 @@ Route-Switcher monitors connectivity to specified IP addresses via multiple netw ## Architecture -### Core Components +Route-Switcher consists of three main components: -1. **Async Pingers** (`src/pinger.rs`) - - Dual-interface ICMP monitoring - - Explicit interface binding (equivalent to `ping -I `) - - Configurable ping targets and intervals - - Async/await implementation with tokio +1. **Async Pingers** (`src/pinger.rs`) - ICMP monitoring with explicit interface binding +2. **Route Manager** (`src/routing.rs`) - Netlink-based route manipulation +3. **State Machine** (`src/main.rs`) - Failover logic with anti-flapping protection -2. **Route Manager** (`src/routing.rs`) - - Netlink-based route manipulation - - No external dependencies on `ip` command - - Route addition and deletion - - Metric-based route prioritization +### State Machine +``` +Boot → Primary: After 10 seconds of sampling +Primary → Fallback: After 3 consecutive failures AND secondary is healthy +Fallback → Primary: After 60 seconds of stable primary connectivity +``` -3. **State Machine** (`src/main.rs`) - - Failover logic with anti-flapping protection - - Three consecutive failures trigger failover - - One minute of stable connectivity triggers failback - - Prevents switching when both interfaces fail +### Route Management Strategy +- **Primary route**: metric 10 (default priority) +- **Secondary route**: metric 20 (lower priority) +- **Failover route**: metric 5 (highest priority, added only during failover) -4. **Configuration** - - Interface definitions (primary/secondary) - - Gateway configurations - - Ping targets and timing - - Route metrics +The system maintains both base routes continuously and adds/removes the failover route as needed. ## Key Features @@ -105,62 +99,74 @@ RUST_LOG=debug sudo cargo run RUST_LOG=info sudo cargo run ``` -## Testing Environment - -### Podman-Compose Setup -The project includes a complete testing environment using podman-compose: +## Testing +### Quick Test ```bash # Start test environment podman-compose up -d +# Run automated failover test +./scripts/test-failover.sh + # View logs podman-compose logs -f route-switcher -# Stop test environment +# Stop environment podman-compose down ``` -### End-to-End Testing +### Manual Testing ```bash -# Simulate primary interface failure -podman-compose exec primary ip link set eth0 down +# Test primary connectivity +podman-compose exec route-switcher ping -c 3 -I eth0 192.168.202.100 -# Observe failover in logs -podman-compose logs -f route-switcher +# Test secondary connectivity +podman-compose exec route-switcher ping -c 3 -I eth1 192.168.202.100 -# Restore primary interface -podman-compose exec primary ip link set eth0 up +# Simulate primary router failure +podman-compose exec primary-router ip link set eth0 down -# Observe failback after 1 minute +# Check routing table +podman-compose exec route-switcher ip route show ``` -## Implementation Details +## API (Optional) -### State Machine -``` -[Boot] -> [Primary] (after initial connectivity check) -[Primary] -> [Fallback] (after 3 consecutive failures) -[Fallback] -> [Primary] (after 60 seconds of stability) +The route-switcher includes an optional HTTP REST API for monitoring and control. + +### Configuration +```bash +# Enable API +API_ENABLED=true +API_USERNAME=admin +API_PASSWORD_HASH= +API_PORT=8080 ``` -### Route Management -- Primary route: `ip r add default via dev metric 10` -- Secondary route: `ip r add default via dev metric 20` -- Routes are managed via netlink, not external commands +### Endpoints +- **GET /api/state** - Returns current state and ping statistics +- **POST /api/state** - Manually set state (primary/secondary) -### Failover Logic -1. **Detection**: 3 consecutive ping failures on primary interface -2. **Verification**: Secondary interface must be responsive -3. **Switch**: Update routing table to use secondary gateway -4. **Monitor**: Continue monitoring both interfaces -5. **Recovery**: After 60 seconds of stable primary connectivity, switch back - -### Error Handling -- Graceful degradation on interface failures -- Comprehensive logging for debugging -- Signal handling for clean shutdown -- Recovery from temporary network issues +### Example Response +```json +{ + "state": "Primary", + "primary_stats": { + "success_rate": 95.5, + "failures": 2, + "total_pings": 44, + "last_ping": "Ok" + }, + "secondary_stats": { + "success_rate": 98.2, + "failures": 1, + "total_pings": 56, + "last_ping": "Ok" + }, + "last_failover": "2024-02-15T10:30:00Z" +} +``` ## Dependencies @@ -169,13 +175,8 @@ podman-compose exec primary ip link set eth0 up - `netlink-sys` - Netlink kernel communication - `anyhow` - Error handling - `log` + `env_logger` - Logging -- `crossbeam-channel` - Inter-thread communication -- `signal-hook` - Signal handling - -## Development Phases - -- [ ] End-to-end automated tests +- `clap` - Command line parsing ## License -GPLv3 \ No newline at end of file +GPLv \ No newline at end of file diff --git a/doc/API_DESIGN.md b/doc/API_DESIGN.md deleted file mode 100644 index 76f703a..0000000 --- a/doc/API_DESIGN.md +++ /dev/null @@ -1,231 +0,0 @@ -# Route-Switcher API Design - -## Overview - -HTTP REST API with Basic Authentication for Home Assistant integration, exposing state machine state and ping statistics. - -## Design Principles - -- **Minimal surface area**: Only expose necessary information -- **Simple authentication**: HTTP Basic Auth (no JWT complexity) -- **State-focused**: Centered on state machine state and ping history -- **Home Assistant friendly**: Structured for HA REST integration -- **Opt-in**: API disabled by default - -## API Endpoints - -### GET /api/state - -Returns current state machine state with ping statistics. - -**Response:** -```json -{ - "state": "Primary", - "primary_stats": { - "success_rate": 95.5, - "failures": 2, - "total_pings": 44, - "last_ping": "Ok" - }, - "secondary_stats": { - "success_rate": 98.2, - "failures": 1, - "total_pings": 56, - "last_ping": "Ok" - }, - "last_failover": "2024-02-15T10:30:00Z" -} -``` - -**Fields:** -- `state`: Current state machine state (Boot/Primary/Fallback) -- `primary_stats`: Ping statistics for primary interface -- `secondary_stats`: Ping statistics for secondary interface -- `last_failover`: ISO 8601 timestamp of last failover (null if never) - -### POST /api/state - -Manually set state machine state. - -**Request:** -```json -{ - "state": "fallback" -} -``` - -**Response:** -```json -{ - "state": "Fallback", - "previous_state": "Primary", - "primary_stats": { ... }, - "secondary_stats": { ... }, - "last_failover": "2024-02-15T10:30:00Z" -} -``` - -**Valid states:** `primary`, `fallback` - -## Authentication - -HTTP Basic Authentication with username/password configured via environment variables. - -**Security considerations:** -- Passwords stored as bcrypt hash -- HTTPS recommended for production -- Local network access only -- No token management (stateless) - -## Data Structures - -### PingStats - -Calculated from state machine ping history (60 entries per interface): - -```rust -struct PingStats { - success_rate: f64, // Percentage of successful pings - failures: usize, // Number of failed pings in history - total_pings: usize, // Total pings in history - last_ping: String, // "Ok" or "Failed" -} -``` - -### StateResponse - -```rust -struct StateResponse { - state: String, - primary_stats: PingStats, - secondary_stats: PingStats, - last_failover: Option, -} -``` - -## Home Assistant Integration - -### REST Sensor Configuration - -```yaml -sensor: - - platform: rest - name: Route Switcher State - resource: http://route-switcher.local:8080/api/state - username: !secret route_switcher_user - password: !secret route_switcher_pass - value_template: "{{ value_json.state }}" - json_attributes: - - primary_stats - - secondary_stats - - last_failover - - - platform: template - sensors: - route_switcher_primary_success_rate: - value_template: "{{ state_attr('sensor.route_switcher_state', 'primary_stats').success_rate | default(0) }}" - unit_of_measurement: "%" - route_switcher_secondary_success_rate: - value_template: "{{ state_attr('sensor.route_switcher_state', 'secondary_stats').success_rate | default(0) }}" - unit_of_measurement: "%" - route_switcher_primary_failures: - value_template: "{{ state_attr('sensor.route_switcher_state', 'primary_stats').failures | default(0) }}" - route_switcher_secondary_failures: - value_template: "{{ state_attr('sensor.route_switcher_state', 'secondary_stats').failures | default(0) }}" - -switch: - - platform: rest - name: Route Switcher Control - resource: http://route-switcher.local:8080/api/state - username: !secret route_switcher_user - password: !secret route_switcher_pass - body_on: '{"state": "fallback"}' - body_off: '{"state": "primary"}' - is_on_template: "{{ value_json.state == 'fallback' }}" -``` - -## Configuration - -### Environment Variables - -```bash -# API Configuration -API_ENABLED=true -API_BIND_ADDRESS=0.0.0.0 -API_PORT=8080 -API_USERNAME=admin -API_PASSWORD_HASH= - -# CORS Configuration -API_CORS_ORIGINS=http://homeassistant.local:8123 -``` - -### Password Hash Generation - -```bash -# Generate bcrypt hash -echo -n "your-password" | bcrypt -``` - -## Implementation Details - -### Dependencies - -```toml -axum = "0.7" -tokio = { version = "1.42", features = ["full"] } -tower = "0.4" -tower-http = { version = "0.5", features = ["cors", "auth"] } -serde = { version = "1.0", features = ["derive"] } -serde_json = "1.0" -chrono = { version = "0.4", features = ["serde"] } -bcrypt = "0.15" -base64 = "0.22" -``` - -### Architecture - -- **API Module**: `src/api.rs` - HTTP server and endpoints -- **State Sharing**: Thread-safe access to state machine and ping history -- **Authentication**: Basic Auth middleware with bcrypt validation -- **Error Handling**: Standardized JSON error responses -- **Integration**: Minimal changes to existing state machine - -### Thread Safety - -- `Arc>` for shared state access -- Non-blocking async operations -- Minimal locking duration - -## Error Handling - -Standardized error responses: - -```json -{ - "error": "Invalid state", - "message": "State must be 'primary' or 'fallback'" -} -``` - -HTTP Status Codes: -- 200: Success -- 400: Bad Request (invalid state) -- 401: Unauthorized (invalid credentials) -- 500: Internal Server Error - -## Security Considerations - -- Network access restrictions (local only recommended) -- HTTPS for credential protection -- Rate limiting considerations -- Audit logging for manual state changes -- No configuration exposure (state only) - -## Backward Compatibility - -- API disabled by default -- No changes to existing CLI functionality -- Service continues without API if disabled -- Graceful degradation on API errors diff --git a/doc/ARCHITECTURE.md b/doc/ARCHITECTURE.md deleted file mode 100644 index d18e123..0000000 --- a/doc/ARCHITECTURE.md +++ /dev/null @@ -1,167 +0,0 @@ -# Architecture Documentation - -## System Overview - -Route-Switcher is a network failover system that operates at the application layer to provide automatic network redundancy. The system monitors network connectivity through multiple interfaces and manages routing tables to ensure continuous connectivity. - -## Component Architecture - -``` -┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ -│ Main Thread │ │ Async Pingers │ │ Route Manager │ -│ │ │ │ │ │ -│ • State Machine │◄──►│ • Interface A │◄──►│ • Netlink API │ -│ • Decision Logic│ │ • Interface B │ │ • Route Add/Del │ -│ • Coordination │ │ • ICMP Monitoring│ │ • Metric Mgmt │ -└─────────────────┘ └──────────────────┘ └─────────────────┘ - │ │ │ - └───────────────────────┼───────────────────────┘ - │ - ┌──────────────────┐ - │ Linux Kernel │ - │ │ - │ • Routing Table │ - │ • Network Stack │ - │ • Netlink Socket │ - └──────────────────┘ -``` - -## Data Flow - -1. **Monitoring Phase** - - Async pingers send ICMP packets via both interfaces - - Results are collected and sent to main thread - - State machine evaluates connectivity patterns - -2. **Decision Phase** - - State machine determines if failover is needed - - Verifies secondary interface health - - Triggers route changes if conditions are met - -3. **Action Phase** - - Route manager updates kernel routing table - - Changes are applied via netlink interface - - System continues monitoring in new state - -## State Machine Design - -### States -- **Boot**: Initial state, gathering connectivity data -- **Primary**: Using primary interface for routing -- **Fallback**: Using secondary interface for routing - -### Transitions -``` -Boot → Primary: After 10 seconds of sampling (regardless of ping results) -Primary → Fallback: After 3 consecutive failures AND secondary is healthy -Fallback → Primary: After 60 seconds of stable primary connectivity -``` - -### Routing Behavior -- **Boot State**: Both routes are set up initially - primary (metric 10) and secondary (metric 20) -- **Primary State**: Primary route (metric 10) and secondary route (metric 20) present -- **Fallback State**: All three routes present - primary (metric 10), secondary (metric 20), and failover secondary (metric 5) -- **Exit**: Only the failover route (metric 5) is removed - -### Route Management Strategy -The system follows a "both routes always present, extra failover on-demand" approach: -1. **Initialization**: Set up primary route (metric 10) and secondary route (metric 20) -2. **Boot Phase**: Collect 10 seconds of ping samples to establish baseline connectivity -3. **Normal Operation**: Primary route serves traffic (metric 10), secondary available as backup (metric 20) -4. **Failover**: Add extra secondary route with highest priority (metric 5) for immediate failover -5. **Failback**: Remove extra failover route when primary recovers -6. **Cleanup**: Only remove the extra failover route on exit, preserving base routes - -### State Persistence -- Current state is maintained in memory -- State changes are logged for debugging -- No persistent storage required (state rebuilds on restart) - -## Interface Design - -### Pinger Interface -```rust -pub trait Pinger { - async fn ping(&self, target: Ipv4Addr, interface: &str) -> PingResult; - async fn start_monitoring(&self, targets: &[Ipv4Addr], interfaces: &[String]) -> Receiver; -} -``` - -### Route Manager Interface -```rust -pub trait RouteManager { - fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>; - fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>; - fn get_current_routes(&self) -> Result>; -} -``` - -## Threading Model - -### Main Thread -- Runs the state machine -- Handles signals and graceful shutdown -- Coordinates between components - -### Async Pinger Tasks -- One task per interface -- Non-blocking ICMP operations -- Results sent via channels - -### Route Manager -- Synchronous operations (netlink is sync) -- Called from main thread -- Thread-safe operations - -## Error Handling Strategy - -### Categories -1. **Network Errors**: Temporary connectivity issues -2. **System Errors**: Permission problems, interface not found -3. **Configuration Errors**: Invalid IP addresses, missing interfaces - -### Recovery Mechanisms -- **Network Errors**: Retry with exponential backoff -- **System Errors**: Log and exit (requires admin intervention) -- **Configuration Errors**: Validate on startup, exit if invalid - -## Security Considerations - -### Privileges -- Requires root privileges for route manipulation -- Drops unnecessary privileges where possible -- Validates all user inputs - -### Network Security -- Only sends ICMP packets to configured targets -- No arbitrary packet crafting -- Interface binding prevents traffic leakage - -## Performance Characteristics - -### Resource Usage -- **Memory**: Minimal (~10MB) -- **CPU**: Low (periodic ICMP packets) -- **Network**: Very low (only ping traffic) - -### Scalability -- Single target machine design -- Supports multiple ping targets -- Limited to 2 interfaces (current design) - -## Testing Architecture - -### Unit Tests -- Individual component testing -- Mock network interfaces -- State machine logic verification - -### Integration Tests -- Component interaction testing -- Real network interface usage -- Netlink operation verification - -### End-to-End Tests -- Full system testing in containers -- Network failure simulation -- Failover timing verification diff --git a/doc/TESTING.md b/doc/TESTING.md index 48c21b3..4720d0e 100644 --- a/doc/TESTING.md +++ b/doc/TESTING.md @@ -1,112 +1,43 @@ # Testing Guide -## Overview +## Test Environment -This document describes the testing strategy and environment for the Route-Switcher project. - -## Testing Environment - -### Podman-Compose Setup - -The testing environment uses podman-compose to create a realistic network topology with routers and a single ICMP target: +The testing environment uses podman-compose to create a network topology with routers and an ICMP target: ``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ Route-Switcher │ │ Primary Router│ │ │ +│ Route-Switcher │ │ Primary Router│ │ ICMP Target │ │ │ │ │ │ │ -│ eth0 ────────────┼────►│ eth0 ──────────┼────►│ ICMP Target │ +│ eth0 ────────────┼────►│ eth0 ──────────┼────►│ 192.168.202.100│ │ eth1 ────────────┼────►│ eth1 ──────────┼────►│ │ -│ │ │ │ │ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ - │ │ │ - │ │ │ - ▼ ▼ ▼ -primary-net secondary-net target-net -192.168.1.0/24 192.168.2.0/24 10.0.0.0/24 ``` -### Container Architecture - +### Container Setup - **route-switcher**: Dual interfaces (eth0→primary-net, eth1→secondary-net) -- **primary-router**: Connects primary-net ↔ target-net (192.168.1.1 ↔ 10.0.0.1) -- **secondary-router**: Connects secondary-net ↔ target-net (192.168.2.1 ↔ 10.0.0.2) -- **icmp-target**: Single IP on target-net (10.0.0.100), reachable via either router - -### Quick Start - -```bash -# Start the testing environment -podman-compose up -d - -# Run automated failover test -./scripts/test-failover.sh - -# View logs -podman-compose logs -f route-switcher - -# Stop environment -podman-compose down -``` - -### Network Configuration - -**Route-Switcher:** -- eth0: 192.168.1.10 (primary network) -- eth1: 192.168.2.10 (secondary network) -- Default gateway: 192.168.1.1 (primary router) - -**Primary Router:** -- eth0: 192.168.1.1 (primary network) -- eth1: 10.0.0.1 (target network) -- Routes traffic between networks with NAT - -**Secondary Router:** -- eth0: 192.168.2.1 (secondary network) -- eth1: 10.0.0.2 (target network) -- Routes traffic between networks with NAT - -**ICMP Target:** -- Single IP: 10.0.0.100 -- Default route: 10.0.0.1 (primary router) -- Responds to ping from both routers +- **primary-router**: Connects primary-net ↔ target-net (192.168.200.11 ↔ 192.168.202.11) +- **secondary-router**: Connects secondary-net ↔ target-net (192.168.201.11 ↔ 192.168.202.12) +- **icmp-target**: Single IP on target-net (192.168.202.100) ## Test Scenarios ### 1. Basic Connectivity Test -**Objective**: Verify basic ping functionality on both interfaces - ```bash -# Start environment podman-compose up -d - -# Test primary connectivity -podman-compose exec route-switcher ping -c 3 -I eth0 10.0.0.100 - -# Test secondary connectivity -podman-compose exec route-switcher ping -c 3 -I eth1 10.0.0.100 - -# Check routing table -podman-compose exec route-switcher ip route show +podman-compose exec route-switcher ping -c 3 -I eth0 192.168.202.100 +podman-compose exec route-switcher ping -c 3 -I eth1 192.168.202.100 ``` ### 2. Failover Test -**Objective**: Verify automatic failover when primary router fails - ```bash -# Start monitoring logs +# Monitor logs podman-compose logs -f route-switcher & # Simulate primary router failure podman-compose exec primary-router ip link set eth0 down -# Verify failover occurs (should see in logs) -# Wait for state change to Fallback - -# Check routing table after failover -podman-compose exec route-switcher ip route show - -# Test connectivity via secondary router -podman-compose exec route-switcher ping -c 3 10.0.0.100 +# Verify failover occurs and connectivity works +podman-compose exec route-switcher ping -c 3 192.168.202.100 # Restore primary router podman-compose exec primary-router ip link set eth0 up @@ -115,119 +46,45 @@ podman-compose exec primary-router ip link set eth0 up ``` ### 3. Dual Failure Test -**Objective**: Verify system doesn't failover when both routers fail - ```bash -# Start monitoring logs -podman-compose logs -f route-switcher & - -# Fail both routers +# Fail both routers - system should NOT switch podman-compose exec primary-router ip link set eth0 down podman-compose exec secondary-router ip link set eth0 down # Verify no routing changes occur -# System should remain in current state - -# Restore routers -podman-compose exec primary-router ip link set eth0 up -podman-compose exec secondary-router ip link set eth0 up ``` -### 4. Router Target Interface Failure -**Objective**: Test upstream network failure simulation +## Automated Testing +Run the comprehensive test script: ```bash -# Fail primary router's connection to target network -podman-compose exec primary-router ip link set eth1 down - -# Should trigger failover to secondary router -# Verify connectivity still works via secondary path - -# Restore primary router's target connection -podman-compose exec primary-router ip link set eth1 up -``` - -### 5. Automated Failover Test -**Objective**: Run complete automated test sequence - -```bash -# Run the comprehensive test script ./scripts/test-failover.sh - -# This script will: -# 1. Start the environment -# 2. Verify initial connectivity -# 3. Simulate primary router failure -# 4. Monitor failover -# 5. Restore primary router -# 6. Verify failback after 60 seconds ``` +This script: +1. Starts the test environment +2. Verifies initial connectivity +3. Simulates primary router failure +4. Monitors failover +5. Restores primary router +6. Verifies failback + ## Unit Tests -### Running Tests ```bash # Run all tests cargo test -# Run specific test module +# Run specific module cargo test pinger - -# Run with coverage -cargo tarpaulin --out Html +cargo test routing +cargo test state_machine ``` -### Test Structure -``` -tests/ -├── unit/ -│ ├── pinger_tests.rs -│ ├── routing_tests.rs -│ └── state_machine_tests.rs -├── integration/ -│ ├── netlink_tests.rs -│ └── dual_interface_tests.rs -└── e2e/ - └── failover_tests.rs -``` +## Debug Commands -## Performance Testing - -### Load Testing ```bash -# Test with multiple ping targets -cargo run -- --ping-target 8.8.8.8 - -# Monitor resource usage -podman stats route-switcher - -# Test long-running stability -# Run for 24 hours and monitor for memory leaks -``` - -### Network Latency Testing -```bash -# Measure failover time -# Start script to time the state transition -start_time=$(date +%s%N) -# Trigger failure -# Wait for state change -end_time=$(date +%s%N) -failover_time=$((($end_time - $start_time) / 1000000)) -echo "Failover time: ${failover_time}ms" -``` - -## Debugging Tests - -### Common Issues -1. **Permission Denied**: Ensure containers run with privileged mode -2. **Interface Not Found**: Check network configuration in compose file -3. **Netlink Errors**: Verify kernel supports required operations -4. **Timing Issues**: Adjust test timeouts for your environment - -### Debug Commands -```bash -# Check container network interfaces +# Check container interfaces podman-compose exec route-switcher ip addr show # Check routing table @@ -235,36 +92,4 @@ podman-compose exec route-switcher ip route show # Monitor network traffic podman-compose exec route-switcher tcpdump -i any icmp - -# Check system logs -podman-compose exec route-switcher dmesg | tail -20 ``` - -## Test Data - -### Sample Ping Results -```rust -// Mock data for testing -let mock_ping_results = vec![ - PingResult::Ok, // Normal operation - PingResult::Failed, // Single failure - PingResult::Failed, // Consecutive failure - PingResult::Failed, // Trigger failover -]; -``` - -### Network Configuration -```bash -# Test network setup -ip addr add 192.168.1.10/24 dev eth0 -ip addr add 192.168.2.10/24 dev eth1 -ip route add default via 192.168.1.1 dev eth0 metric 10 -ip route add default via 192.168.2.1 dev eth1 metric 20 -``` - -## Test Coverage Goals - -- **Unit Tests**: 90%+ code coverage -- **Integration Tests**: All major component interactions -- **E2E Tests**: All user scenarios and edge cases -- **Performance Tests**: Resource usage and timing validation