# Architecture Documentation

## System Overview

Route-Switcher is a network failover system that operates at the application layer to provide automatic network redundancy. It monitors connectivity over two network interfaces (primary and secondary) using ICMP probes and adjusts the kernel routing table to keep traffic flowing when the primary link fails.

## Component Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Main Thread   │    │  Async Pingers   │    │  Route Manager  │
│                 │    │                  │    │                 │
│ • State Machine │◄──►│ • Interface A    │◄──►│ • Netlink API   │
│ • Decision Logic│    │ • Interface B    │    │ • Route Add/Del │
│ • Coordination  │    │ • ICMP Monitoring│    │ • Metric Mgmt   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                        ┌──────────────────┐
                        │   Linux Kernel   │
                        │                  │
                        │ • Routing Table  │
                        │ • Network Stack  │
                        │ • Netlink Socket │
                        └──────────────────┘
```

## Data Flow

1. **Monitoring Phase**
   - Async pingers send ICMP packets via both interfaces
   - Results are collected and sent to the main thread
   - State machine evaluates connectivity patterns

2. **Decision Phase**
   - State machine determines if failover is needed
   - Verifies secondary interface health
   - Triggers route changes if conditions are met

3. **Action Phase**
   - Route manager updates the kernel routing table
   - Changes are applied via the netlink interface
   - System continues monitoring in the new state

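The monitoring and decision phases above can be sketched as a small health tracker. This is a minimal illustration under assumptions, not the project's actual code: `InterfaceHealth` and `needs_failover` are hypothetical names, and the definition of "healthy" (at least one sample, most recent ping succeeded) is an assumption made for the example.

```rust
/// Rolling health record for one interface (hypothetical type for illustration).
#[derive(Debug, Default, Clone, Copy)]
pub struct InterfaceHealth {
    consecutive_failures: u32,
    total_samples: u64,
}

impl InterfaceHealth {
    /// Monitoring phase: fold one ping outcome into the record.
    pub fn record(&mut self, success: bool) {
        self.total_samples += 1;
        if success {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
    }

    /// Assumed definition: healthy once there is at least one sample
    /// and the most recent ping succeeded.
    pub fn is_healthy(&self) -> bool {
        self.total_samples > 0 && self.consecutive_failures == 0
    }

    pub fn consecutive_failures(&self) -> u32 {
        self.consecutive_failures
    }
}

/// Decision phase: fail over only if the primary has failed `threshold`
/// times in a row AND the secondary is currently healthy.
pub fn needs_failover(
    primary: &InterfaceHealth,
    secondary: &InterfaceHealth,
    threshold: u32,
) -> bool {
    primary.consecutive_failures() >= threshold && secondary.is_healthy()
}
```

With a threshold of 3, two failed pings on the primary do not trigger failover; the third does, provided the secondary's latest ping succeeded.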
## State Machine Design

### States
- **Boot**: Initial state, gathering connectivity data
- **Primary**: Using primary interface for routing
- **Fallback**: Using secondary interface for routing

### Transitions
```
Boot → Primary: After 10 seconds of sampling (regardless of ping results)
Primary → Fallback: After 3 consecutive failures AND secondary is healthy
Fallback → Primary: After 60 seconds of stable primary connectivity
```

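The transition rules can be written as a pure function, which keeps them easy to unit test. The sketch below is an assumption about how such a function might look: `Observation`, `next_state`, and the constant names are hypothetical, while the values (10 s, 3 failures, 60 s) come from the rules above.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

/// Inputs the state machine inspects on each tick (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct Observation {
    /// Time spent in the current state.
    pub time_in_state: Duration,
    /// Consecutive failed pings on the primary interface.
    pub primary_failures: u32,
    /// Whether the secondary interface currently passes its pings.
    pub secondary_healthy: bool,
    /// How long the primary has been continuously reachable.
    pub primary_stable_for: Duration,
}

pub const BOOT_SAMPLING: Duration = Duration::from_secs(10);
pub const FAILURE_THRESHOLD: u32 = 3;
pub const FAILBACK_STABILITY: Duration = Duration::from_secs(60);

/// Returns the state to move to, or `current` if no rule fires.
pub fn next_state(current: State, obs: &Observation) -> State {
    match current {
        // Boot always ends in Primary, regardless of ping results.
        State::Boot if obs.time_in_state >= BOOT_SAMPLING => State::Primary,
        State::Primary
            if obs.primary_failures >= FAILURE_THRESHOLD && obs.secondary_healthy =>
        {
            State::Fallback
        }
        State::Fallback if obs.primary_stable_for >= FAILBACK_STABILITY => State::Primary,
        other => other,
    }
}
```

Keeping timers and counters in `Observation` rather than inside the function means the caller owns all clocks, so tests can feed in arbitrary durations without sleeping.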
### Routing Behavior
- **Boot State**: Both base routes are installed at startup: primary (metric 10) and secondary (metric 20)
- **Primary State**: Primary route (metric 10) and secondary route (metric 20) are present; traffic uses the primary because it has the lower metric
- **Fallback State**: All three routes are present: primary (metric 10), secondary (metric 20), and failover secondary (metric 5); traffic uses the failover route
- **Exit**: Only the failover route (metric 5) is removed

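Because the kernel prefers the default route with the lowest metric, the route set for each state determines which link carries traffic. The following sketch models that relationship; `Link`, `Route`, `routes_for`, and `active_link` are hypothetical names used only for illustration, and the model ignores everything about a route except its link and metric.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Link {
    Primary,
    Secondary,
}

/// A default route reduced to the two properties this section discusses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route {
    pub link: Link,
    pub metric: u32,
}

pub const PRIMARY: Route = Route { link: Link::Primary, metric: 10 };
pub const SECONDARY: Route = Route { link: Link::Secondary, metric: 20 };
pub const FAILOVER: Route = Route { link: Link::Secondary, metric: 5 };

/// Routes expected to be installed while the system is in `state`.
pub fn routes_for(state: State) -> Vec<Route> {
    match state {
        State::Boot | State::Primary => vec![PRIMARY, SECONDARY],
        State::Fallback => vec![PRIMARY, SECONDARY, FAILOVER],
    }
}

/// The link that carries traffic: the route with the lowest metric wins.
pub fn active_link(routes: &[Route]) -> Option<Link> {
    routes.iter().min_by_key(|r| r.metric).map(|r| r.link)
}
```

In this model, `active_link(&routes_for(State::Fallback))` yields the secondary link because metric 5 undercuts the primary's metric 10, even though the primary route is still installed.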
### Route Management Strategy
The system follows a "both routes always present, extra failover on-demand" approach:
1. **Initialization**: Set up primary route (metric 10) and secondary route (metric 20)
2. **Boot Phase**: Collect 10 seconds of ping samples to establish baseline connectivity
3. **Normal Operation**: Primary route serves traffic (metric 10), secondary available as backup (metric 20)
4. **Failover**: Add an extra secondary route with the highest priority (lowest metric, 5) for immediate failover
5. **Failback**: Remove the extra failover route when the primary recovers
6. **Cleanup**: Remove only the extra failover route on exit, preserving the base routes

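One consequence of this strategy is that state changes only ever add or remove the single failover route. A minimal sketch of that mapping follows; `RouteOp`, `ops_for_transition`, and `ops_for_exit` are hypothetical names, and the base routes are assumed to have been installed once during initialization.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

/// Routing-table changes the strategy ever performs after initialization.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RouteOp {
    /// Add the secondary default route with metric 5.
    AddFailover,
    /// Delete the metric-5 route again.
    RemoveFailover,
}

/// Operations to run when the state machine moves from `from` to `to`.
pub fn ops_for_transition(from: State, to: State) -> Vec<RouteOp> {
    match (from, to) {
        (State::Primary, State::Fallback) => vec![RouteOp::AddFailover],
        (State::Fallback, State::Primary) => vec![RouteOp::RemoveFailover],
        // Boot -> Primary and self-transitions leave the table untouched.
        _ => Vec::new(),
    }
}

/// Operations to run on shutdown: base routes are always preserved.
pub fn ops_for_exit(current: State) -> Vec<RouteOp> {
    match current {
        State::Fallback => vec![RouteOp::RemoveFailover],
        State::Boot | State::Primary => Vec::new(),
    }
}
```

Since the metric 10 and metric 20 routes are never touched after startup, a crash in any state leaves the host with at least the two base routes.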
### State Persistence
- Current state is maintained in memory
- State changes are logged for debugging
- No persistent storage is required (state is rebuilt on restart, beginning again from Boot)

## Interface Design

### Pinger Interface
```rust
use std::net::Ipv4Addr;

pub trait Pinger {
    /// Sends one ICMP ping to `target` through `interface`.
    async fn ping(&self, target: Ipv4Addr, interface: &str) -> PingResult;
    /// Starts monitoring all targets on all interfaces; results arrive on the returned channel.
    async fn start_monitoring(&self, targets: &[Ipv4Addr], interfaces: &[String]) -> Receiver<PingResult>;
}
```

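### Route Manager Interface
```rust
use std::net::Ipv4Addr;

pub trait RouteManager {
    /// Adds a default route via `gateway` on `interface` with the given metric.
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// Deletes the default route matching `gateway`, `interface`, and `metric`.
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// Lists the routes currently installed.
    fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}
```
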
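Because `RouteManager` is a synchronous trait, it can be exercised without touching the kernel by swapping in an in-memory implementation. The sketch below is one possible mock, not the project's test double. The `Result` alias, the fields of `RouteInfo`, and `MockRouteManager` are all assumptions, since the document does not show those definitions; the trait is repeated so the example compiles on its own.

```rust
use std::net::Ipv4Addr;
use std::sync::Mutex;

// Assumed stand-ins for types the document references but does not define.
pub type Result<T> = std::result::Result<T, String>;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RouteInfo {
    pub gateway: Ipv4Addr,
    pub interface: String,
    pub metric: u32,
}

pub trait RouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}

/// In-memory route table for tests; it never talks to netlink.
#[derive(Default)]
pub struct MockRouteManager {
    // Mutex gives interior mutability behind `&self` and keeps the mock thread-safe.
    routes: Mutex<Vec<RouteInfo>>,
}

impl RouteManager for MockRouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut routes = self.routes.lock().map_err(|e| e.to_string())?;
        if routes.contains(&route) {
            return Err(format!("route already exists: {route:?}"));
        }
        routes.push(route);
        Ok(())
    }

    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut routes = self.routes.lock().map_err(|e| e.to_string())?;
        let before = routes.len();
        routes.retain(|r| r != &route);
        if routes.len() == before {
            return Err(format!("no such route: {route:?}"));
        }
        Ok(())
    }

    fn get_current_routes(&self) -> Result<Vec<RouteInfo>> {
        Ok(self.routes.lock().map_err(|e| e.to_string())?.clone())
    }
}
```

A state machine test can then hold a `&dyn RouteManager` pointing at the mock and assert on `get_current_routes()` after each transition. Rejecting duplicate adds and missing deletes is a choice made for this mock; the real netlink behavior may differ.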
## Threading Model

### Main Thread
- Runs the state machine
- Handles signals and graceful shutdown
- Coordinates between components

### Async Pinger Tasks
- One task per interface
- Non-blocking ICMP operations
- Results sent via channels

### Route Manager
- Synchronous operations (netlink requests are made as blocking calls)
- Called from the main thread
- Thread-safe operations

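The "one producer per interface, one consumer on the main thread" shape can be illustrated with the standard library alone. Note the substitution: the project uses async tasks, whereas this sketch uses OS threads and `std::sync::mpsc` so that it runs without an async runtime. `PingSample`, `collect_samples`, and the `probe` callback (a stand-in for the real ICMP call) are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// One ping outcome (hypothetical shape; the real `PingResult` is not shown here).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PingSample {
    pub interface: String,
    pub success: bool,
}

/// Spawns one worker per interface and collects every sample they send.
/// `probe(interface, sequence_number)` stands in for the real ICMP call.
pub fn collect_samples(
    interfaces: &[&str],
    samples_per_interface: usize,
    probe: fn(&str, usize) -> bool,
) -> Vec<PingSample> {
    let (tx, rx) = mpsc::channel();
    let mut workers = Vec::new();

    for iface in interfaces {
        let tx = tx.clone();
        let iface = iface.to_string();
        workers.push(thread::spawn(move || {
            for seq in 0..samples_per_interface {
                let sample = PingSample {
                    success: probe(&iface, seq),
                    interface: iface.clone(),
                };
                // Stop quietly if the consumer has gone away.
                if tx.send(sample).is_err() {
                    break;
                }
            }
        }));
    }
    // Drop the original sender so `rx.iter()` ends once all workers finish.
    drop(tx);

    let collected: Vec<PingSample> = rx.iter().collect();
    for worker in workers {
        worker.join().expect("pinger worker panicked");
    }
    collected
}
```

The consumer side is the piece that carries over to the async design: the main thread only ever reads from a single channel, so the state machine never needs to know how many pingers exist.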
## Error Handling Strategy

### Categories
1. **Network Errors**: Temporary connectivity issues
2. **System Errors**: Permission problems, interface not found
3. **Configuration Errors**: Invalid IP addresses, missing interfaces

### Recovery Mechanisms
- **Network Errors**: Retry with exponential backoff
- **System Errors**: Log and exit (requires admin intervention)
- **Configuration Errors**: Validate on startup, exit if invalid

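The mapping from error category to recovery action, including the backoff schedule, might look like the sketch below. The names `ErrorCategory`, `Recovery`, `backoff_delay`, and `recovery_for` are hypothetical, and the 500 ms base delay and 30 s cap are assumed values chosen for illustration; the document does not specify them.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    Network,
    System,
    Configuration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Recovery {
    /// Try again after waiting this long.
    RetryAfter(Duration),
    /// Log and terminate; an administrator has to fix the cause.
    Exit,
}

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Chooses a recovery action for an error of the given category.
pub fn recovery_for(category: ErrorCategory, attempt: u32) -> Recovery {
    match category {
        ErrorCategory::Network => Recovery::RetryAfter(backoff_delay(
            attempt,
            Duration::from_millis(500), // assumed base delay
            Duration::from_secs(30),    // assumed cap
        )),
        ErrorCategory::System | ErrorCategory::Configuration => Recovery::Exit,
    }
}
```

With these assumed values the retry delays run 0.5 s, 1 s, 2 s, 4 s and so on until they reach the 30 s cap. Production backoff often adds random jitter, which is omitted here to keep the example deterministic.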
## Security Considerations

### Privileges
- Requires root privileges (or the `CAP_NET_ADMIN` capability) for route manipulation
- Drops unnecessary privileges where possible
- Validates all user inputs

### Network Security
- Only sends ICMP packets to configured targets
- No arbitrary packet crafting
- Binding each ping to its interface prevents probe traffic from leaking out through the other link

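Input validation for the two kinds of user-supplied values, addresses and interface names, can be done with the standard library. The sketch below is illustrative only: `parse_ipv4` and `validate_interface_name` are hypothetical helpers, and the name rules encode the Linux kernel's interface naming limits (at most 15 bytes, no `/`, `:`, or whitespace, not `.` or `..`) rather than anything stated in this document.

```rust
use std::net::Ipv4Addr;

/// Parses a gateway or ping-target string, rejecting anything that is not
/// a dotted-quad IPv4 address. Surrounding whitespace is ignored.
pub fn parse_ipv4(input: &str) -> Result<Ipv4Addr, String> {
    input
        .trim()
        .parse::<Ipv4Addr>()
        .map_err(|e| format!("invalid IPv4 address {input:?}: {e}"))
}

/// Checks an interface name against the Linux naming limits.
pub fn validate_interface_name(name: &str) -> Result<(), String> {
    // IFNAMSIZ is 16 bytes including the trailing NUL.
    const MAX_LEN: usize = 15;

    if name.is_empty() || name.len() > MAX_LEN {
        return Err(format!("interface name {name:?} must be 1 to 15 bytes long"));
    }
    if name == "." || name == ".." {
        return Err(format!("interface name {name:?} is reserved"));
    }
    if name.chars().any(|c| c == '/' || c == ':' || c.is_whitespace()) {
        return Err(format!("interface name {name:?} contains a forbidden character"));
    }
    Ok(())
}
```

Passing this check only means the name is well-formed; whether the interface actually exists still has to be confirmed against the system at startup.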
## Performance Characteristics

### Resource Usage
- **Memory**: Minimal (~10 MB)
- **CPU**: Low (periodic ICMP packets)
- **Network**: Very low (only ping traffic)

### Scalability
- Single target machine design
- Supports multiple ping targets
- Limited to 2 interfaces (current design)

## Testing Architecture

### Unit Tests
- Individual component testing
- Mock network interfaces
- State machine logic verification

### Integration Tests
- Component interaction testing
- Real network interface usage
- Netlink operation verification

### End-to-End Tests
- Full system testing in containers
- Network failure simulation
- Failover timing verification