cleanup docs

Michal Humpula
2026-03-01 07:39:32 +01:00
parent 0cff05d623
commit 187093156f
3 changed files with 60 additions and 438 deletions

@@ -8,31 +8,25 @@ Route-Switcher monitors connectivity to specified IP addresses via multiple netw
## Architecture
### Core Components
Route-Switcher consists of three main components:
1. **Async Pingers** (`src/pinger.rs`)
- Dual-interface ICMP monitoring
- Explicit interface binding (equivalent to `ping -I <interface>`)
- Configurable ping targets and intervals
- Async/await implementation with tokio
1. **Async Pingers** (`src/pinger.rs`) - ICMP monitoring with explicit interface binding
2. **Route Manager** (`src/routing.rs`) - Netlink-based route manipulation
3. **State Machine** (`src/main.rs`) - Failover logic with anti-flapping protection
2. **Route Manager** (`src/routing.rs`)
- Netlink-based route manipulation
- No external dependencies on `ip` command
- Route addition and deletion
- Metric-based route prioritization
### State Machine
```
Boot → Primary: After 10 seconds of sampling
Primary → Fallback: After 3 consecutive failures AND secondary is healthy
Fallback → Primary: After 60 seconds of stable primary connectivity
```
3. **State Machine** (`src/main.rs`)
- Failover logic with anti-flapping protection
- Three consecutive failures trigger failover
- One minute of stable connectivity triggers failback
- Prevents switching when both interfaces fail
### Route Management Strategy
- **Primary route**: metric 10 (default priority)
- **Secondary route**: metric 20 (lower priority)
- **Failover route**: metric 5 (highest priority, added only during failover)
4. **Configuration**
- Interface definitions (primary/secondary)
- Gateway configurations
- Ping targets and timing
- Route metrics
The system maintains both base routes continuously and adds/removes the failover route as needed.
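As a rough illustration of this strategy, the sketch below lists which default routes would be present in each state and which one the kernel would prefer. The `State`, `Via`, and `Route` types and the `routes_for` and `preferred` functions are hypothetical names, not taken from the project source; only the metrics (10, 20, 5) come from the text above.

```rust
/// Hypothetical state enum mirroring the Boot/Primary/Fallback states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

/// Which gateway a default route points at (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Via {
    PrimaryGateway,
    SecondaryGateway,
}

/// A default route reduced to the two fields that matter for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route {
    pub via: Via,
    pub metric: u32,
}

// Metrics as documented above.
pub const PRIMARY_METRIC: u32 = 10;
pub const SECONDARY_METRIC: u32 = 20;
pub const FAILOVER_METRIC: u32 = 5;

/// Both base routes are always present; the failover route exists only in Fallback.
pub fn routes_for(state: State) -> Vec<Route> {
    let mut routes = vec![
        Route { via: Via::PrimaryGateway, metric: PRIMARY_METRIC },
        Route { via: Via::SecondaryGateway, metric: SECONDARY_METRIC },
    ];
    if state == State::Fallback {
        routes.push(Route { via: Via::SecondaryGateway, metric: FAILOVER_METRIC });
    }
    routes
}

/// Among otherwise equal default routes, the lowest metric wins.
pub fn preferred(routes: &[Route]) -> Option<Route> {
    routes.iter().copied().min_by_key(|r| r.metric)
}
```

With these definitions, `preferred(&routes_for(State::Fallback))` selects the metric 5 route via the secondary gateway, while the Boot and Primary states select the metric 10 route via the primary gateway.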
## Key Features
@@ -105,63 +99,38 @@ RUST_LOG=debug sudo cargo run
RUST_LOG=info sudo cargo run
```
## Testing Environment
### Podman-Compose Setup
The project includes a complete testing environment using podman-compose:
## Testing
### Quick Test
```bash
# Start test environment
podman-compose up -d
# Run automated failover test
./scripts/test-failover.sh
# View logs
podman-compose logs -f route-switcher
# Stop test environment
# Stop environment
podman-compose down
```
### End-to-End Testing
### Manual Testing
```bash
# Simulate primary interface failure
podman-compose exec primary ip link set eth0 down
# Test primary connectivity
podman-compose exec route-switcher ping -c 3 -I eth0 192.168.202.100
# Observe failover in logs
podman-compose logs -f route-switcher
# Test secondary connectivity
podman-compose exec route-switcher ping -c 3 -I eth1 192.168.202.100
# Restore primary interface
podman-compose exec primary ip link set eth0 up
# Simulate primary router failure
podman-compose exec primary-router ip link set eth0 down
# Observe failback after 1 minute
# Check routing table
podman-compose exec route-switcher ip route show
```
## Implementation Details
### State Machine
```
[Boot] -> [Primary] (after initial connectivity check)
[Primary] -> [Fallback] (after 3 consecutive failures)
[Fallback] -> [Primary] (after 60 seconds of stability)
```
### Route Management
- Primary route: `ip r add default via <primary-gw> dev <primary-iface> metric 10`
- Secondary route: `ip r add default via <secondary-gw> dev <secondary-iface> metric 20`
- Routes are managed via netlink, not external commands
### Failover Logic
1. **Detection**: 3 consecutive ping failures on primary interface
2. **Verification**: Secondary interface must be responsive
3. **Switch**: Update routing table to use secondary gateway
4. **Monitor**: Continue monitoring both interfaces
5. **Recovery**: After 60 seconds of stable primary connectivity, switch back
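The steps above can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the code in `src/main.rs`: the `Failover` struct, its `observe` method, and the idea of feeding it one pair of ping results per tick are all hypothetical, and the Boot state is left out for brevity. Only the thresholds (3 failures, 60 seconds) come from this document.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Primary,
    Fallback,
}

// Thresholds as documented above.
pub const FAILURE_THRESHOLD: u32 = 3;
pub const FAILBACK_AFTER: Duration = Duration::from_secs(60);

/// Hypothetical failover tracker; field and method names are illustrative.
pub struct Failover {
    pub state: State,
    consecutive_failures: u32,
    primary_stable_for: Duration,
}

impl Failover {
    pub fn new() -> Self {
        Failover {
            state: State::Primary,
            consecutive_failures: 0,
            primary_stable_for: Duration::ZERO,
        }
    }

    /// Feed one round of ping results; `elapsed` is the time since the previous round.
    pub fn observe(&mut self, primary_ok: bool, secondary_ok: bool, elapsed: Duration) -> State {
        match self.state {
            State::Primary => {
                if primary_ok {
                    self.consecutive_failures = 0;
                } else {
                    self.consecutive_failures = self.consecutive_failures.saturating_add(1);
                    // Switch only if the secondary path is actually usable.
                    if self.consecutive_failures >= FAILURE_THRESHOLD && secondary_ok {
                        self.state = State::Fallback;
                        self.primary_stable_for = Duration::ZERO;
                    }
                }
            }
            State::Fallback => {
                if primary_ok {
                    self.primary_stable_for += elapsed;
                    if self.primary_stable_for >= FAILBACK_AFTER {
                        self.state = State::Primary;
                        self.consecutive_failures = 0;
                    }
                } else {
                    // Any primary failure restarts the stability window.
                    self.primary_stable_for = Duration::ZERO;
                }
            }
        }
        self.state
    }
}
```

Passing the elapsed time in explicitly, rather than reading a clock inside `observe`, keeps the sketch deterministic and easy to unit test.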
### Error Handling
- Graceful degradation on interface failures
- Comprehensive logging for debugging
- Signal handling for clean shutdown
- Recovery from temporary network issues
## Dependencies
- `tokio` - Async runtime
@@ -169,12 +138,7 @@ podman-compose exec primary ip link set eth0 up
- `netlink-sys` - Netlink kernel communication
- `anyhow` - Error handling
- `log` + `env_logger` - Logging
- `crossbeam-channel` - Inter-thread communication
- `signal-hook` - Signal handling
## Development Phases
- [ ] End-to-end automated tests
- `clap` - Command line parsing
## License

@@ -1,167 +0,0 @@
# Architecture Documentation
## System Overview
Route-Switcher is a network failover system that runs in user space to provide automatic network redundancy. It monitors connectivity through multiple interfaces and updates the kernel routing table so that traffic keeps flowing when the primary path fails.
## Component Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Main Thread   │    │  Async Pingers   │    │  Route Manager  │
│                 │    │                  │    │                 │
│ • State Machine │◄──►│ • Interface A    │◄──►│ • Netlink API   │
│ • Decision Logic│    │ • Interface B    │    │ • Route Add/Del │
│ • Coordination  │    │ • ICMP Monitoring│    │ • Metric Mgmt   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                       ┌──────────────────┐
                       │   Linux Kernel   │
                       │                  │
                       │ • Routing Table  │
                       │ • Network Stack  │
                       │ • Netlink Socket │
                       └──────────────────┘
```
## Data Flow
1. **Monitoring Phase**
- Async pingers send ICMP packets via both interfaces
- Results are collected and sent to main thread
- State machine evaluates connectivity patterns
2. **Decision Phase**
- State machine determines if failover is needed
- Verifies secondary interface health
- Triggers route changes if conditions are met
3. **Action Phase**
- Route manager updates kernel routing table
- Changes are applied via netlink interface
- System continues monitoring in new state
## State Machine Design
### States
- **Boot**: Initial state, gathering connectivity data
- **Primary**: Using primary interface for routing
- **Fallback**: Using secondary interface for routing
### Transitions
```
Boot → Primary: After 10 seconds of sampling (regardless of ping results)
Primary → Fallback: After 3 consecutive failures AND secondary is healthy
Fallback → Primary: After 60 seconds of stable primary connectivity
```
### Routing Behavior
- **Boot State**: Both routes are set up initially - primary (metric 10) and secondary (metric 20)
- **Primary State**: Primary route (metric 10) and secondary route (metric 20) present
- **Fallback State**: All three routes present - primary (metric 10), secondary (metric 20), and failover secondary (metric 5)
- **Exit**: Only the failover route (metric 5) is removed
### Route Management Strategy
The system follows a "both routes always present, extra failover on-demand" approach:
1. **Initialization**: Set up primary route (metric 10) and secondary route (metric 20)
2. **Boot Phase**: Collect 10 seconds of ping samples to establish baseline connectivity
3. **Normal Operation**: Primary route serves traffic (metric 10), secondary available as backup (metric 20)
4. **Failover**: Add extra secondary route with highest priority (metric 5) for immediate failover
5. **Failback**: Remove extra failover route when primary recovers
6. **Cleanup**: Only remove the extra failover route on exit, preserving base routes
### State Persistence
- Current state is maintained in memory
- State changes are logged for debugging
- No persistent storage required (state rebuilds on restart)
## Interface Design
### Pinger Interface
```rust
pub trait Pinger {
async fn ping(&self, target: Ipv4Addr, interface: &str) -> PingResult;
async fn start_monitoring(&self, targets: &[Ipv4Addr], interfaces: &[String]) -> Receiver<PingResult>;
}
```
### Route Manager Interface
```rust
pub trait RouteManager {
fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}
```
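For unit tests, a trait like this could be backed by an in-memory fake instead of real netlink calls. The sketch below is an assumption-laden illustration: `RouteInfo` is given three hypothetical fields, a `RouteResult` alias with `String` errors stands in for the `Result` type used above (neither is defined in this document), and `FakeRouteManager` is an invented name.

```rust
use std::cell::RefCell;
use std::net::Ipv4Addr;

/// Hypothetical shape for `RouteInfo`; the real struct is not shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RouteInfo {
    pub gateway: Ipv4Addr,
    pub interface: String,
    pub metric: u32,
}

/// Stand-in for the project's `Result` alias (assumed, with `String` errors).
pub type RouteResult<T> = std::result::Result<T, String>;

pub trait RouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()>;
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()>;
    fn get_current_routes(&self) -> RouteResult<Vec<RouteInfo>>;
}

/// In-memory fake; `RefCell` is needed because the trait methods take `&self`.
#[derive(Default)]
pub struct FakeRouteManager {
    routes: RefCell<Vec<RouteInfo>>,
}

impl RouteManager for FakeRouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut routes = self.routes.borrow_mut();
        if routes.contains(&route) {
            return Err(format!("route already exists: {route:?}"));
        }
        routes.push(route);
        Ok(())
    }

    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> RouteResult<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut routes = self.routes.borrow_mut();
        match routes.iter().position(|r| *r == route) {
            Some(index) => {
                routes.remove(index);
                Ok(())
            }
            None => Err(format!("no such route: {route:?}")),
        }
    }

    fn get_current_routes(&self) -> RouteResult<Vec<RouteInfo>> {
        Ok(self.routes.borrow().clone())
    }
}
```

A state machine written against `&dyn RouteManager` could then be exercised without root privileges, with the test asserting on the contents of `get_current_routes()` after each transition.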
## Threading Model
### Main Thread
- Runs the state machine
- Handles signals and graceful shutdown
- Coordinates between components
### Async Pinger Tasks
- One task per interface
- Non-blocking ICMP operations
- Results sent via channels
### Route Manager
- Synchronous operations (netlink is sync)
- Called from main thread
- Thread-safe operations
## Error Handling Strategy
### Categories
1. **Network Errors**: Temporary connectivity issues
2. **System Errors**: Permission problems, interface not found
3. **Configuration Errors**: Invalid IP addresses, missing interfaces
### Recovery Mechanisms
- **Network Errors**: Retry with exponential backoff
- **System Errors**: Log and exit (requires admin intervention)
- **Configuration Errors**: Validate on startup, exit if invalid
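The "retry with exponential backoff" mechanism for network errors could look roughly like the following. The function names `backoff_delay` and `retry_with_backoff` and their parameters are hypothetical; the project's actual retry policy (base delay, cap, attempt limit) is not stated in this document. The sketch uses a blocking `std::thread::sleep` to stay dependency-free; inside a tokio task an async sleep would be the natural substitute.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}

/// Run `op` until it succeeds or `max_attempts` tries have failed.
/// `op` is always called at least once.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

Capping the delay matters for a failover tool: without a cap, a long outage would stretch the retry interval far beyond the time in which recovery should be noticed.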
## Security Considerations
### Privileges
- Requires root privileges for route manipulation
- Drops unnecessary privileges where possible
- Validates all user inputs
### Network Security
- Only sends ICMP packets to configured targets
- No arbitrary packet crafting
- Interface binding prevents traffic leakage
## Performance Characteristics
### Resource Usage
- **Memory**: Minimal (~10 MB)
- **CPU**: Low (periodic ICMP packets)
- **Network**: Very low (only ping traffic)
### Scalability
- Designed to run on a single host
- Supports multiple ping targets
- Limited to 2 interfaces (current design)
## Testing Architecture
### Unit Tests
- Individual component testing
- Mock network interfaces
- State machine logic verification
### Integration Tests
- Component interaction testing
- Real network interface usage
- Netlink operation verification
### End-to-End Tests
- Full system testing in containers
- Network failure simulation
- Failover timing verification

@@ -1,112 +1,43 @@
# Testing Guide
## Overview
## Test Environment
This document describes the testing strategy and environment for the Route-Switcher project.
## Testing Environment
### Podman-Compose Setup
The testing environment uses podman-compose to create a realistic network topology with routers and a single ICMP target:
The testing environment uses podman-compose to create a network topology with routers and an ICMP target:
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Route-Switcher │ │ Primary Router│ │
│ Route-Switcher │ │ Primary Router│ │ ICMP Target
│ │ │ │ │ │
│ eth0 ────────────┼────►│ eth0 ──────────┼────►│ ICMP Target
│ eth0 ────────────┼────►│ eth0 ──────────┼────►│ 192.168.202.100
│ eth1 ────────────┼────►│ eth1 ──────────┼────►│ │
│ │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
primary-net secondary-net target-net
192.168.1.0/24 192.168.2.0/24 10.0.0.0/24
```
### Container Architecture
### Container Setup
- **route-switcher**: Dual interfaces (eth0→primary-net, eth1→secondary-net)
- **primary-router**: Connects primary-net ↔ target-net (192.168.1.1 ↔ 10.0.0.1)
- **secondary-router**: Connects secondary-net ↔ target-net (192.168.2.1 ↔ 10.0.0.2)
- **icmp-target**: Single IP on target-net (10.0.0.100), reachable via either router
### Quick Start
```bash
# Start the testing environment
podman-compose up -d
# Run automated failover test
./scripts/test-failover.sh
# View logs
podman-compose logs -f route-switcher
# Stop environment
podman-compose down
```
### Network Configuration
**Route-Switcher:**
- eth0: 192.168.1.10 (primary network)
- eth1: 192.168.2.10 (secondary network)
- Default gateway: 192.168.1.1 (primary router)
**Primary Router:**
- eth0: 192.168.1.1 (primary network)
- eth1: 10.0.0.1 (target network)
- Routes traffic between networks with NAT
**Secondary Router:**
- eth0: 192.168.2.1 (secondary network)
- eth1: 10.0.0.2 (target network)
- Routes traffic between networks with NAT
**ICMP Target:**
- Single IP: 10.0.0.100
- Default route: 10.0.0.1 (primary router)
- Responds to ping from both routers
- **primary-router**: Connects primary-net ↔ target-net (192.168.200.11 ↔ 192.168.202.11)
- **secondary-router**: Connects secondary-net ↔ target-net (192.168.201.11 ↔ 192.168.202.12)
- **icmp-target**: Single IP on target-net (192.168.202.100)
## Test Scenarios
### 1. Basic Connectivity Test
**Objective**: Verify basic ping functionality on both interfaces
```bash
# Start environment
podman-compose up -d
# Test primary connectivity
podman-compose exec route-switcher ping -c 3 -I eth0 10.0.0.100
# Test secondary connectivity
podman-compose exec route-switcher ping -c 3 -I eth1 10.0.0.100
# Check routing table
podman-compose exec route-switcher ip route show
podman-compose exec route-switcher ping -c 3 -I eth0 192.168.202.100
podman-compose exec route-switcher ping -c 3 -I eth1 192.168.202.100
```
### 2. Failover Test
**Objective**: Verify automatic failover when primary router fails
```bash
# Start monitoring logs
# Monitor logs
podman-compose logs -f route-switcher &
# Simulate primary router failure
podman-compose exec primary-router ip link set eth0 down
# Verify failover occurs (should see in logs)
# Wait for state change to Fallback
# Check routing table after failover
podman-compose exec route-switcher ip route show
# Test connectivity via secondary router
podman-compose exec route-switcher ping -c 3 10.0.0.100
# Verify failover occurs and connectivity works
podman-compose exec route-switcher ping -c 3 192.168.202.100
# Restore primary router
podman-compose exec primary-router ip link set eth0 up
@@ -115,119 +46,45 @@ podman-compose exec primary-router ip link set eth0 up
```
### 3. Dual Failure Test
**Objective**: Verify that the system does not fail over when both routers fail
```bash
# Start monitoring logs
podman-compose logs -f route-switcher &
# Fail both routers
# Fail both routers - system should NOT switch
podman-compose exec primary-router ip link set eth0 down
podman-compose exec secondary-router ip link set eth0 down
# Verify no routing changes occur
# System should remain in current state
# Restore routers
podman-compose exec primary-router ip link set eth0 up
podman-compose exec secondary-router ip link set eth0 up
```
### 4. Router Target Interface Failure
**Objective**: Test upstream network failure simulation
## Automated Testing
Run the comprehensive test script:
```bash
# Fail primary router's connection to target network
podman-compose exec primary-router ip link set eth1 down
# Should trigger failover to secondary router
# Verify connectivity still works via secondary path
# Restore primary router's target connection
podman-compose exec primary-router ip link set eth1 up
```
### 5. Automated Failover Test
**Objective**: Run complete automated test sequence
```bash
# Run the comprehensive test script
./scripts/test-failover.sh
# This script will:
# 1. Start the environment
# 2. Verify initial connectivity
# 3. Simulate primary router failure
# 4. Monitor failover
# 5. Restore primary router
# 6. Verify failback after 60 seconds
```
This script:
1. Starts the test environment
2. Verifies initial connectivity
3. Simulates primary router failure
4. Monitors failover
5. Restores primary router
6. Verifies failback
## Unit Tests
### Running Tests
```bash
# Run all tests
cargo test
# Run specific test module
# Run specific module
cargo test pinger
# Run with coverage
cargo tarpaulin --out Html
cargo test routing
cargo test state_machine
```
### Test Structure
```
tests/
├── unit/
│ ├── pinger_tests.rs
│ ├── routing_tests.rs
│ └── state_machine_tests.rs
├── integration/
│ ├── netlink_tests.rs
│ └── dual_interface_tests.rs
└── e2e/
└── failover_tests.rs
```
## Debug Commands
## Performance Testing
### Load Testing
```bash
# Run against an external ping target
cargo run -- --ping-target 8.8.8.8
# Monitor resource usage
podman stats route-switcher
# Test long-running stability
# Run for 24 hours and monitor for memory leaks
```
### Network Latency Testing
```bash
# Measure failover time in milliseconds
start_time=$(date +%s%N)   # nanoseconds since the epoch (%N requires GNU date)
# Trigger failure
# Wait for state change
end_time=$(date +%s%N)
failover_time=$(( (end_time - start_time) / 1000000 ))
echo "Failover time: ${failover_time}ms"
```
## Debugging Tests
### Common Issues
1. **Permission Denied**: Ensure containers run in privileged mode
2. **Interface Not Found**: Check network configuration in compose file
3. **Netlink Errors**: Verify kernel supports required operations
4. **Timing Issues**: Adjust test timeouts for your environment
### Debug Commands
```bash
# Check container network interfaces
# Check container interfaces
podman-compose exec route-switcher ip addr show
# Check routing table
@@ -235,36 +92,4 @@ podman-compose exec route-switcher ip route show
# Monitor network traffic
podman-compose exec route-switcher tcpdump -i any icmp
# Check system logs
podman-compose exec route-switcher dmesg | tail -20
```
## Test Data
### Sample Ping Results
```rust
// Mock data for testing
let mock_ping_results = vec![
PingResult::Ok, // Normal operation
PingResult::Failed, // Single failure
PingResult::Failed, // Consecutive failure
PingResult::Failed, // Trigger failover
];
```
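Mock data like this is most useful when paired with a helper that the assertions can target. As a hedged sketch, the `trailing_failures` function below (a hypothetical name, as is the two-variant `PingResult` enum, whose real definition is not shown in this document) counts the failures at the end of a result sequence, which is the quantity compared against the failover threshold of 3.

```rust
/// Hypothetical two-variant result type matching the mock data above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PingResult {
    Ok,
    Failed,
}

/// Number of consecutive failures at the end of `results`.
pub fn trailing_failures(results: &[PingResult]) -> usize {
    results
        .iter()
        .rev()
        .take_while(|result| **result == PingResult::Failed)
        .count()
}
```

Applied to the mock sequence above (one success followed by three failures), the helper returns 3, which meets the documented failover threshold.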
### Network Configuration
```bash
# Test network setup
ip addr add 192.168.1.10/24 dev eth0
ip addr add 192.168.2.10/24 dev eth1
ip route add default via 192.168.1.1 dev eth0 metric 10
ip route add default via 192.168.2.1 dev eth1 metric 20
```
## Test Coverage Goals
- **Unit Tests**: 90%+ code coverage
- **Integration Tests**: All major component interactions
- **E2E Tests**: All user scenarios and edge cases
- **Performance Tests**: Resource usage and timing validation