route-switcher/doc/ARCHITECTURE.md
Michal Humpula 5fbd72b370 working base
2026-02-15 17:36:08 +01:00


# Architecture Documentation
## System Overview
Route-Switcher is a network failover system that operates at the application layer to provide automatic network redundancy. It monitors connectivity over two interfaces using ICMP pings and manages default-route metrics in the kernel routing table so that traffic keeps flowing when the primary path fails.
## Component Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Main Thread   │    │  Async Pingers   │    │  Route Manager  │
│                 │    │                  │    │                 │
│ • State Machine │◄──►│ • Interface A    │◄──►│ • Netlink API   │
│ • Decision Logic│    │ • Interface B    │    │ • Route Add/Del │
│ • Coordination  │    │ • ICMP Monitoring│    │ • Metric Mgmt   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                       ┌──────────────────┐
                       │   Linux Kernel   │
                       │                  │
                       │ • Routing Table  │
                       │ • Network Stack  │
                       │ • Netlink Socket │
                       └──────────────────┘
```
## Data Flow
1. **Monitoring Phase**
   - Async pingers send ICMP packets via both interfaces
   - Results are collected and sent to the main thread
   - State machine evaluates connectivity patterns
2. **Decision Phase**
   - State machine determines if failover is needed
   - Verifies secondary interface health
   - Triggers route changes if conditions are met
3. **Action Phase**
   - Route manager updates kernel routing table
   - Changes are applied via netlink interface
   - System continues monitoring in new state
## State Machine Design
### States
- **Boot**: Initial state, gathering connectivity data
- **Primary**: Using primary interface for routing
- **Fallback**: Using secondary interface for routing
### Transitions
```
Boot → Primary: After 10 seconds of sampling (regardless of ping results)
Primary → Fallback: After 3 consecutive failures AND secondary is healthy
Fallback → Primary: After 60 seconds of stable primary connectivity
```
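
The transitions above can be sketched as a pure function. This is a minimal illustration, not the project's actual code: the `Observation` struct, its field names, and the constant names are assumptions introduced here; only the three states and the 10 s / 3 failures / 60 s thresholds come from the table above.

```rust
use std::time::Duration;

/// The three states described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

/// Hypothetical snapshot of what the monitoring side has observed so far.
#[derive(Debug, Clone, Copy)]
pub struct Observation {
    /// Time spent in the current state.
    pub time_in_state: Duration,
    /// Consecutive failed pings on the primary interface.
    pub primary_failures: u32,
    /// How long the primary interface has been answering without a failure.
    pub primary_stable_for: Duration,
    /// Whether the secondary interface currently looks healthy.
    pub secondary_healthy: bool,
}

const BOOT_SAMPLING: Duration = Duration::from_secs(10);
const FAILURE_THRESHOLD: u32 = 3;
const FAILBACK_STABILITY: Duration = Duration::from_secs(60);

/// Returns the next state for an observation; unchanged if no rule fires.
pub fn next_state(current: State, obs: &Observation) -> State {
    match current {
        // Boot always ends in Primary, regardless of ping results.
        State::Boot if obs.time_in_state >= BOOT_SAMPLING => State::Primary,
        // Fail over only when the secondary is known to be usable.
        State::Primary
            if obs.primary_failures >= FAILURE_THRESHOLD && obs.secondary_healthy =>
        {
            State::Fallback
        }
        // Fail back only after the primary has been stable long enough.
        State::Fallback if obs.primary_stable_for >= FAILBACK_STABILITY => State::Primary,
        other => other,
    }
}
```

Keeping the transition logic free of I/O makes it straightforward to unit-test each rule without touching real interfaces.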
### Routing Behavior
- **Boot State**: Both routes are set up initially - primary (metric 10) and secondary (metric 20)
- **Primary State**: Primary route (metric 10) and secondary route (metric 20) present
- **Fallback State**: All three routes present - primary (metric 10), secondary (metric 20), and failover secondary (metric 5)
- **Exit**: Only the failover route (metric 5) is removed
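
As an illustration of the list above, the expected set of default routes can be derived from the state alone. The `Route` and `Interface` types and the function names below are hypothetical; only the metrics (10, 20, 5) and the per-state behavior are taken from this document.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Interface {
    Primary,
    Secondary,
}

/// Hypothetical description of a default route (not the project's `RouteInfo`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route {
    pub interface: Interface,
    pub metric: u32,
}

pub const PRIMARY_METRIC: u32 = 10;
pub const SECONDARY_METRIC: u32 = 20;
pub const FAILOVER_METRIC: u32 = 5;

/// Routes that should be present in the routing table for a given state.
pub fn expected_routes(state: State) -> Vec<Route> {
    // The two base routes are present in every state.
    let mut routes = vec![
        Route { interface: Interface::Primary, metric: PRIMARY_METRIC },
        Route { interface: Interface::Secondary, metric: SECONDARY_METRIC },
    ];
    // Fallback adds a third, lower-metric route via the secondary interface.
    if state == State::Fallback {
        routes.push(Route { interface: Interface::Secondary, metric: FAILOVER_METRIC });
    }
    routes
}

/// On exit only the failover route is removed; the base routes stay.
pub fn removed_on_exit() -> Route {
    Route { interface: Interface::Secondary, metric: FAILOVER_METRIC }
}
```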
### Route Management Strategy
The system follows a "both routes always present, extra failover on-demand" approach:
1. **Initialization**: Set up primary route (metric 10) and secondary route (metric 20)
2. **Boot Phase**: Collect 10 seconds of ping samples to establish baseline connectivity
3. **Normal Operation**: Primary route serves traffic (metric 10), secondary available as backup (metric 20)
4. **Failover**: Add an extra secondary route with the lowest metric (5), which gives it the highest priority and diverts traffic immediately
5. **Failback**: Remove extra failover route when primary recovers
6. **Cleanup**: Only remove the extra failover route on exit, preserving base routes
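
The lifecycle above relies on the route with the lowest metric winning. A small in-memory model, assuming a hypothetical `RouteTable` type and placeholder interface names, shows why adding and removing the metric 5 route is enough to switch traffic without touching the base routes:

```rust
/// Hypothetical default-route entry; interface names are placeholders.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Route {
    pub interface: String,
    pub metric: u32,
}

/// In-memory stand-in for the kernel routing table (default routes only).
#[derive(Debug, Default)]
pub struct RouteTable {
    routes: Vec<Route>,
}

impl RouteTable {
    /// Add a route unless an identical one is already present.
    pub fn add(&mut self, interface: &str, metric: u32) {
        let route = Route { interface: interface.to_string(), metric };
        if !self.routes.contains(&route) {
            self.routes.push(route);
        }
    }

    /// Remove the route matching both interface and metric, if any.
    pub fn remove(&mut self, interface: &str, metric: u32) {
        self.routes.retain(|r| !(r.interface == interface && r.metric == metric));
    }

    /// Among competing default routes, the lowest metric is preferred.
    pub fn active(&self) -> Option<&Route> {
        self.routes.iter().min_by_key(|r| r.metric)
    }

    pub fn len(&self) -> usize {
        self.routes.len()
    }
}
```

With `eth0` (metric 10) and `eth1` (metric 20) installed, `active()` returns the `eth0` route; after `add("eth1", 5)` it returns the `eth1` route, and after `remove("eth1", 5)` it returns `eth0` again.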
### State Persistence
- Current state is maintained in memory
- State changes are logged for debugging
- No persistent storage required (state rebuilds on restart)
## Interface Design
### Pinger Interface
```rust
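use std::net::Ipv4Addr;

pub trait Pinger {
    /// Send a single ICMP echo to `target` through `interface`.
    async fn ping(&self, target: Ipv4Addr, interface: &str) -> PingResult;
    /// Start continuous monitoring; results arrive on the returned channel.
    async fn start_monitoring(&self, targets: &[Ipv4Addr], interfaces: &[String]) -> Receiver<PingResult>;
}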
```
### Route Manager Interface
```rust
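use std::net::Ipv4Addr;

pub trait RouteManager {
    /// Install a default route via `gateway` on `interface` with the given metric.
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// Remove the default route matching gateway, interface, and metric.
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// List the default routes currently present in the kernel routing table.
    fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}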
```
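
For tests that should not touch netlink, the trait can be backed by an in-memory mock. The sketch below is an assumption-heavy illustration: the fields of `RouteInfo`, the `RouteResult` alias (standing in for the unspecified `Result` type above), and `MockRouteManager` itself are all hypothetical.

```rust
use std::net::Ipv4Addr;
use std::sync::Mutex;

/// Assumed shape of `RouteInfo`; the real definition is not shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RouteInfo {
    pub gateway: Ipv4Addr,
    pub interface: String,
    pub metric: u32,
}

/// Stand-in for the `Result` alias used by the trait above (error type assumed).
pub type RouteResult<T> = Result<T, String>;

pub trait RouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()>;
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()>;
    fn get_current_routes(&self) -> RouteResult<Vec<RouteInfo>>;
}

/// Records routes in memory instead of talking to the kernel.
#[derive(Default)]
pub struct MockRouteManager {
    routes: Mutex<Vec<RouteInfo>>,
}

impl RouteManager for MockRouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut routes = self.routes.lock().unwrap();
        // Adding a route that already exists is reported as an error.
        if routes.contains(&route) {
            return Err(format!("route already exists: {:?}", route));
        }
        routes.push(route);
        Ok(())
    }

    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()> {
        let mut routes = self.routes.lock().unwrap();
        let before = routes.len();
        routes.retain(|r| !(r.gateway == gateway && r.interface == interface && r.metric == metric));
        // Deleting a route that does not exist is reported as an error.
        if routes.len() == before {
            return Err(format!("no route via {} dev {} metric {}", gateway, interface, metric));
        }
        Ok(())
    }

    fn get_current_routes(&self) -> RouteResult<Vec<RouteInfo>> {
        Ok(self.routes.lock().unwrap().clone())
    }
}
```

The `Mutex` keeps the mock usable behind `&self`, matching the trait's signatures, at the cost of a lock that a single-threaded test never contends.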
## Threading Model
### Main Thread
- Runs the state machine
- Handles signals and graceful shutdown
- Coordinates between components
### Async Pinger Tasks
- One task per interface
- Non-blocking ICMP operations
- Results sent via channels
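
The "one worker per interface, results via channels" pattern can be sketched with the standard library alone. Note the substitutions: OS threads and `std::sync::mpsc` stand in for the async tasks and whatever channel type the project uses, and the `PingResult` fields, the `probe` callback, and the `rounds` parameter are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical result record; the real `PingResult` is not shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PingResult {
    pub interface: String,
    pub success: bool,
}

/// Spawn one worker per interface; `probe` stands in for the real ICMP call.
pub fn start_monitoring(
    interfaces: &[String],
    rounds: usize,
    probe: fn(&str) -> bool,
) -> Receiver<PingResult> {
    let (tx, rx) = mpsc::channel();
    for interface in interfaces {
        // Each worker owns its own sender and interface name.
        let tx = tx.clone();
        let interface = interface.clone();
        thread::spawn(move || {
            for _ in 0..rounds {
                let result = PingResult {
                    success: probe(&interface),
                    interface: interface.clone(),
                };
                // The receiver may already be gone during shutdown; stop quietly.
                if tx.send(result).is_err() {
                    break;
                }
            }
        });
    }
    // The original sender is dropped on return, so iterating over the
    // receiver ends once every worker has finished.
    rx
}
```

The main thread then simply iterates over the receiver and feeds each result into the state machine, so all decisions stay on one thread.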
### Route Manager
- Synchronous operations (netlink calls are made as blocking requests)
- Called from main thread
- Thread-safe operations
## Error Handling Strategy
### Categories
1. **Network Errors**: Temporary connectivity issues
2. **System Errors**: Permission problems, interface not found
3. **Configuration Errors**: Invalid IP addresses, missing interfaces
### Recovery Mechanisms
- **Network Errors**: Retry with exponential backoff
- **System Errors**: Log and exit (requires admin intervention)
- **Configuration Errors**: Validate on startup, exit if invalid
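
Retrying with exponential backoff, as listed for network errors, might look like the following. This is a generic sketch: the function names, the cap on the delay, and the attempt limit are assumptions, since the document does not specify the backoff parameters.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Run `op` until it succeeds or `max_attempts` calls have failed.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base and a 1 s cap, the delays are 100 ms, 200 ms, 400 ms, 800 ms, and then 1 s for every later attempt.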
## Security Considerations
### Privileges
- Requires root privileges (or the `CAP_NET_ADMIN` capability) for route manipulation
- Drops unnecessary privileges where possible
- Validates all user inputs
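
Input validation at startup could be sketched as below. The `Config` struct, the function names, and the specific rules (non-empty target list, distinct interfaces, no `/` or whitespace in names) are assumptions for illustration; the 15-byte limit reflects the Linux interface name limit (`IFNAMSIZ` minus the terminating NUL).

```rust
use std::net::Ipv4Addr;

/// Hypothetical validated configuration; field names are illustrative.
#[derive(Debug, PartialEq, Eq)]
pub struct Config {
    pub targets: Vec<Ipv4Addr>,
    /// Primary first, secondary second (the design is limited to two).
    pub interfaces: [String; 2],
}

/// Linux interface names are at most 15 bytes long.
const MAX_IFNAME_LEN: usize = 15;

fn validate_interface(name: &str) -> Result<String, String> {
    let ok = !name.is_empty()
        && name.len() <= MAX_IFNAME_LEN
        && !name.contains(|c: char| c == '/' || c.is_whitespace());
    if ok {
        Ok(name.to_string())
    } else {
        Err(format!("invalid interface name: {:?}", name))
    }
}

/// Validate raw user input; any error should abort startup.
pub fn validate(targets: &[&str], primary: &str, secondary: &str) -> Result<Config, String> {
    if targets.is_empty() {
        return Err("at least one ping target is required".to_string());
    }
    // Reject the whole configuration on the first unparsable address.
    let targets = targets
        .iter()
        .map(|t| t.parse::<Ipv4Addr>().map_err(|e| format!("invalid target {:?}: {}", t, e)))
        .collect::<Result<Vec<_>, _>>()?;
    if primary == secondary {
        return Err("primary and secondary interfaces must differ".to_string());
    }
    Ok(Config {
        targets,
        interfaces: [validate_interface(primary)?, validate_interface(secondary)?],
    })
}
```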
### Network Security
- Only sends ICMP packets to configured targets
- No arbitrary packet crafting
- Interface binding prevents traffic leakage
## Performance Characteristics
### Resource Usage
- **Memory**: Minimal (~10 MB)
- **CPU**: Low (periodic ICMP packets)
- **Network**: Very low (only ping traffic)
### Scalability
- Designed to run on a single host
- Supports multiple ping targets
- Limited to 2 interfaces (current design)
## Testing Architecture
### Unit Tests
- Individual component testing
- Mock network interfaces
- State machine logic verification
### Integration Tests
- Component interaction testing
- Real network interface usage
- Netlink operation verification
### End-to-End Tests
- Full system testing in containers
- Network failure simulation
- Failover timing verification