# Architecture Documentation

## System Overview

Route-Switcher is a network failover system that operates at the application layer to provide automatic network redundancy. The system monitors network connectivity through multiple interfaces and manages routing tables to ensure continuous connectivity.

## Component Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Main Thread   │    │  Async Pingers   │    │  Route Manager  │
│                 │    │                  │    │                 │
│ • State Machine │◄──►│ • Interface A    │◄──►│ • Netlink API   │
│ • Decision Logic│    │ • Interface B    │    │ • Route Add/Del │
│ • Coordination  │    │ • ICMP Monitoring│    │ • Metric Mgmt   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                       ┌──────────────────┐
                       │   Linux Kernel   │
                       │                  │
                       │ • Routing Table  │
                       │ • Network Stack  │
                       │ • Netlink Socket │
                       └──────────────────┘
```

## Data Flow

1. **Monitoring Phase**
   - Async pingers send ICMP packets via both interfaces
   - Results are collected and sent to the main thread
   - State machine evaluates connectivity patterns

2. **Decision Phase**
   - State machine determines if failover is needed
   - Verifies secondary interface health
   - Triggers route changes if conditions are met

3. **Action Phase**
   - Route manager updates kernel routing table
   - Changes are applied via netlink interface
   - System continues monitoring in new state

## State Machine Design

### States

- **Boot**: Initial state, gathering connectivity data
- **Primary**: Using primary interface for routing
- **Fallback**: Using secondary interface for routing

### Transitions

```
Boot → Primary: After 10 seconds of sampling (regardless of ping results)
Primary → Fallback: After 3 consecutive failures AND secondary is healthy
Fallback → Primary: After 60 seconds of stable primary connectivity
```

### Routing Behavior

- **Boot State**: Both routes are set up initially: primary (metric 10) and secondary (metric 20)
- **Primary State**: Primary route (metric 10) and secondary route (metric 20) present
- **Fallback State**: All three routes present: primary (metric 10), secondary (metric 20), and failover secondary (metric 5)
- **Exit**: Only the failover route (metric 5) is removed

### Route Management Strategy

The system follows a "both routes always present, extra failover on-demand" approach:

1. **Initialization**: Set up primary route (metric 10) and secondary route (metric 20)
2. **Boot Phase**: Collect 10 seconds of ping samples to establish baseline connectivity
3. **Normal Operation**: Primary route serves traffic (metric 10), secondary available as backup (metric 20)
4. **Failover**: Add an extra secondary route with the lowest metric (5), and therefore the highest priority, for immediate failover
5. **Failback**: Remove the extra failover route when primary recovers
6. **Cleanup**: Only remove the extra failover route on exit, preserving base routes

### State Persistence

- Current state is maintained in memory
- State changes are logged for debugging
- No persistent storage required (state rebuilds on restart)

## Interface Design

### Pinger Interface

```rust
use std::net::Ipv4Addr;

pub trait Pinger {
    async fn ping(&self, target: Ipv4Addr, interface: &str) -> PingResult;
    // Assumption: the returned channel carries `PingResult` values.
    async fn start_monitoring(&self, targets: &[Ipv4Addr], interfaces: &[String]) -> Receiver<PingResult>;
}
```

### Route Manager Interface

```rust
use std::net::Ipv4Addr;

pub trait RouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    // Assumption: returns a list of routes; `Route` is a placeholder type name.
    fn get_current_routes(&self) -> Result<Vec<Route>>;
}
```

## Threading Model

### Main Thread

- Runs the state machine
- Handles signals and graceful shutdown
- Coordinates between components

### Async Pinger Tasks

- One task per interface
- Non-blocking ICMP operations
- Results sent via channels

### Route Manager

- Synchronous operations (netlink requests are issued synchronously)
- Called from main thread
- Thread-safe operations

## Error Handling Strategy

### Categories

1. **Network Errors**: Temporary connectivity issues
2. **System Errors**: Permission problems, interface not found
3. **Configuration Errors**: Invalid IP addresses, missing interfaces

### Recovery Mechanisms

- **Network Errors**: Retry with exponential backoff
- **System Errors**: Log and exit (requires admin intervention)
- **Configuration Errors**: Validate on startup, exit if invalid

## Security Considerations

### Privileges

- Requires root privileges for route manipulation
- Drops unnecessary privileges where possible
- Validates all user inputs

### Network Security

- Only sends ICMP packets to configured targets
- No arbitrary packet crafting
- Interface binding prevents traffic leakage

## Performance Characteristics

### Resource Usage

- **Memory**: Minimal (~10 MB)
- **CPU**: Low (periodic ICMP packets)
- **Network**: Very low (only ping traffic)

### Scalability

- Single target machine design
- Supports multiple ping targets
- Limited to 2 interfaces (current design)

## Testing Architecture

### Unit Tests

- Individual component testing
- Mock network interfaces
- State machine logic verification

### Integration Tests

- Component interaction testing
- Real network interface usage
- Netlink operation verification

### End-to-End Tests

- Full system testing in containers
- Network failure simulation
- Failover timing verification
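
### Example: State Machine Test Sketch

To make the "state machine logic verification" item under Unit Tests concrete, the following is a minimal sketch of the documented transitions written as a pure function, which keeps them testable without any network access. It is not the project's actual implementation: the `Health` struct, its field names, the constant names, and `next_state` are all hypothetical, and only the thresholds (10 seconds, 3 failures, 60 seconds) come from the Transitions list.

```rust
use std::time::Duration;

/// Routing states from the State Machine Design section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Boot,
    Primary,
    Fallback,
}

// Thresholds taken from the Transitions list; the constant names are illustrative.
const BOOT_SAMPLING: Duration = Duration::from_secs(10);
const FAILURE_THRESHOLD: u32 = 3;
const FAILBACK_STABLE: Duration = Duration::from_secs(60);

/// Hypothetical snapshot of what the pingers have reported so far.
#[derive(Debug, Clone, Copy)]
struct Health {
    /// How long the machine has been in its current state.
    time_in_state: Duration,
    /// Consecutive failed pings on the primary interface.
    primary_consecutive_failures: u32,
    /// Whether the secondary interface currently answers pings.
    secondary_healthy: bool,
    /// How long the primary interface has answered without a failure.
    primary_stable_for: Duration,
}

/// Pure transition function: returns the next state for a given snapshot.
fn next_state(current: State, h: Health) -> State {
    match current {
        // Boot always ends after the sampling window, regardless of ping results.
        State::Boot if h.time_in_state >= BOOT_SAMPLING => State::Primary,
        // Fail over only when the primary is down AND the secondary is usable.
        State::Primary
            if h.primary_consecutive_failures >= FAILURE_THRESHOLD && h.secondary_healthy =>
        {
            State::Fallback
        }
        // Fail back only after the primary has been stable long enough.
        State::Fallback if h.primary_stable_for >= FAILBACK_STABLE => State::Primary,
        // Otherwise stay in the current state.
        other => other,
    }
}

#[test]
fn failover_requires_healthy_secondary() {
    let down = Health {
        time_in_state: Duration::from_secs(30),
        primary_consecutive_failures: 3,
        secondary_healthy: false,
        primary_stable_for: Duration::ZERO,
    };
    // Primary is failing, but there is nowhere to fail over to.
    assert_eq!(next_state(State::Primary, down), State::Primary);
    // Once the secondary is healthy, the same failures trigger failover.
    let ok = Health { secondary_healthy: true, ..down };
    assert_eq!(next_state(State::Primary, ok), State::Fallback);
}
```

Keeping the decision logic free of I/O in this way means the route manager only has to react to a change in the returned state, for example by adding or removing the metric 5 failover route.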