cleanup docs
@@ -1,167 +0,0 @@
# Architecture Documentation

## System Overview

Route-Switcher is a network failover system that operates at the application layer to provide automatic network redundancy. The system monitors network connectivity through multiple interfaces and manages routing tables to ensure continuous connectivity.

## Component Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Main Thread   │    │  Async Pingers   │    │  Route Manager  │
│                 │    │                  │    │                 │
│ • State Machine │◄──►│ • Interface A    │◄──►│ • Netlink API   │
│ • Decision Logic│    │ • Interface B    │    │ • Route Add/Del │
│ • Coordination  │    │ • ICMP Monitoring│    │ • Metric Mgmt   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                       ┌──────────────────┐
                       │   Linux Kernel   │
                       │                  │
                       │ • Routing Table  │
                       │ • Network Stack  │
                       │ • Netlink Socket │
                       └──────────────────┘
```

## Data Flow

1. **Monitoring Phase**
   - Async pingers send ICMP packets via both interfaces
   - Results are collected and sent to main thread
   - State machine evaluates connectivity patterns

2. **Decision Phase**
   - State machine determines if failover is needed
   - Verifies secondary interface health
   - Triggers route changes if conditions are met

3. **Action Phase**
   - Route manager updates kernel routing table
   - Changes are applied via netlink interface
   - System continues monitoring in new state

## State Machine Design

### States
- **Boot**: Initial state, gathering connectivity data
- **Primary**: Using primary interface for routing
- **Fallback**: Using secondary interface for routing

### Transitions
```
Boot → Primary: After 10 seconds of sampling (regardless of ping results)
Primary → Fallback: After 3 consecutive failures AND secondary is healthy
Fallback → Primary: After 60 seconds of stable primary connectivity
```
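
The transition rules above can be sketched as a pure function. This is only an illustration under stated assumptions, not Route-Switcher's actual code: the `State` enum, the `Observation` struct with its field names, and `next_state` are hypothetical, and "3 consecutive failures" is read as "at least 3". Only the thresholds (10 s, 3 failures, 60 s) come from the table.

```rust
use std::time::Duration;

/// The three states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

/// Hypothetical snapshot of what the state machine knows when it is evaluated.
#[derive(Debug, Clone, Copy)]
pub struct Observation {
    /// Time spent in the current state.
    pub time_in_state: Duration,
    /// Consecutive failed pings on the primary interface.
    pub primary_failures: u32,
    /// Whether the secondary interface currently answers pings.
    pub secondary_healthy: bool,
    /// How long the primary has been continuously reachable.
    pub primary_stable_for: Duration,
}

// Thresholds taken from the transition table.
const BOOT_SAMPLING: Duration = Duration::from_secs(10);
const FAILURE_THRESHOLD: u32 = 3;
const FAILBACK_STABILITY: Duration = Duration::from_secs(60);

/// Returns the state to move to, or the current state if no rule fires.
pub fn next_state(current: State, obs: &Observation) -> State {
    match current {
        // Boot always ends after the sampling window, whatever the pings say.
        State::Boot if obs.time_in_state >= BOOT_SAMPLING => State::Primary,
        // Fail over only when the secondary path is known to work.
        State::Primary
            if obs.primary_failures >= FAILURE_THRESHOLD && obs.secondary_healthy =>
        {
            State::Fallback
        }
        // Fail back once the primary has been stable long enough.
        State::Fallback if obs.primary_stable_for >= FAILBACK_STABILITY => State::Primary,
        unchanged => unchanged,
    }
}
```

Keeping the rules in a function without side effects would make them easy to unit test without touching the network or the routing table.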

### Routing Behavior
- **Boot State**: Both routes are set up initially - primary (metric 10) and secondary (metric 20)
- **Primary State**: Primary route (metric 10) and secondary route (metric 20) present
- **Fallback State**: All three routes present - primary (metric 10), secondary (metric 20), and failover secondary (metric 5)
- **Exit**: Only the failover route (metric 5) is removed

### Route Management Strategy
The system follows a "both routes always present, extra failover on-demand" approach:
1. **Initialization**: Set up primary route (metric 10) and secondary route (metric 20)
2. **Boot Phase**: Collect 10 seconds of ping samples to establish baseline connectivity
3. **Normal Operation**: Primary route serves traffic (metric 10), secondary available as backup (metric 20)
4. **Failover**: Add an extra secondary route with the lowest metric (5), and therefore the highest priority, for immediate failover
5. **Failback**: Remove extra failover route when primary recovers
6. **Cleanup**: Only remove the extra failover route on exit, preserving base routes
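
The metric scheme above can be modelled in a few lines. The sketch below is an assumption-laden illustration: `State`, `Link`, `Route`, `expected_routes`, and `active_route` are hypothetical names, and only the metrics (10, 20, 5) come from the text. It relies on the fact that Linux prefers the matching route with the lowest metric.

```rust
/// Hypothetical state type, mirroring the states described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

/// Which uplink a default route points at.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Link {
    Primary,
    Secondary,
}

/// A default route reduced to the two properties that matter here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route {
    pub link: Link,
    pub metric: u32,
}

pub const PRIMARY_ROUTE: Route = Route { link: Link::Primary, metric: 10 };
pub const SECONDARY_ROUTE: Route = Route { link: Link::Secondary, metric: 20 };
pub const FAILOVER_ROUTE: Route = Route { link: Link::Secondary, metric: 5 };

/// Routes that should be installed in a given state: the two base routes
/// are always present, the failover route only in `Fallback`.
pub fn expected_routes(state: State) -> Vec<Route> {
    let mut routes = vec![PRIMARY_ROUTE, SECONDARY_ROUTE];
    if state == State::Fallback {
        routes.push(FAILOVER_ROUTE);
    }
    routes
}

/// The route that carries traffic: the one with the lowest metric.
pub fn active_route(routes: &[Route]) -> Option<Route> {
    routes.iter().copied().min_by_key(|route| route.metric)
}
```

Because the base routes never change, failover and failback each reduce to adding or deleting a single route, which keeps the window for inconsistent routing state small.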

### State Persistence
- Current state is maintained in memory
- State changes are logged for debugging
- No persistent storage required (state rebuilds on restart)

## Interface Design

### Pinger Interface
```rust
pub trait Pinger {
    /// Ping `target` through `interface` and report the result.
    async fn ping(&self, target: Ipv4Addr, interface: &str) -> PingResult;
    /// Start monitoring `targets` on `interfaces`; results arrive on the returned receiver.
    async fn start_monitoring(&self, targets: &[Ipv4Addr], interfaces: &[String]) -> Receiver<PingResult>;
}
```

### Route Manager Interface
```rust
pub trait RouteManager {
    /// Add a default route via `gateway` on `interface` with the given metric.
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// Delete the default route matching `gateway`, `interface`, and `metric`.
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// List the routes currently known to the manager.
    fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}
```
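
One benefit of putting route manipulation behind a trait is that tests can substitute an in-memory implementation. The sketch below shows what such a test double might look like. Several pieces are assumptions because the document does not show them: the `Result` alias (a `String` error is used here), the fields of `RouteInfo`, and the `InMemoryRouteManager` type itself. Rejecting duplicate routes is a choice made for this mock, not a statement about the real implementation.

```rust
use std::net::Ipv4Addr;
use std::sync::Mutex;

/// Assumed error type; the project's own `Result` alias is not shown here.
pub type Result<T> = std::result::Result<T, String>;

/// Assumed shape of `RouteInfo`; the real fields are not shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RouteInfo {
    pub gateway: Ipv4Addr,
    pub interface: String,
    pub metric: u32,
}

pub trait RouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}

/// Hypothetical test double that keeps routes in a vector instead of the kernel.
#[derive(Default)]
pub struct InMemoryRouteManager {
    // A mutex gives interior mutability behind `&self` and keeps the mock thread-safe.
    routes: Mutex<Vec<RouteInfo>>,
}

impl RouteManager for InMemoryRouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut routes = self.routes.lock().map_err(|e| e.to_string())?;
        if routes.contains(&route) {
            return Err(format!("route already exists: {route:?}"));
        }
        routes.push(route);
        Ok(())
    }

    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut routes = self.routes.lock().map_err(|e| e.to_string())?;
        let before = routes.len();
        routes.retain(|existing| existing != &route);
        if routes.len() == before {
            return Err(format!("no such route: {route:?}"));
        }
        Ok(())
    }

    fn get_current_routes(&self) -> Result<Vec<RouteInfo>> {
        let routes = self.routes.lock().map_err(|e| e.to_string())?;
        Ok(routes.clone())
    }
}
```

State machine tests could then assert on `get_current_routes()` after each transition without needing root privileges or a network namespace.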

## Threading Model

### Main Thread
- Runs the state machine
- Handles signals and graceful shutdown
- Coordinates between components

### Async Pinger Tasks
- One task per interface
- Non-blocking ICMP operations
- Results sent via channels

### Route Manager
- Synchronous operations (netlink is sync)
- Called from main thread
- Thread-safe operations
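
The "one worker per interface, results sent via channels" pattern can be shown with the standard library alone. Note the substitution: the design above uses async tasks, while this sketch uses OS threads and `std::sync::mpsc` so that it stays dependency-free. `PingReport`, `start_monitoring`, and the `probe` callback (a stand-in for a real ICMP echo) are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical message sent from a pinger worker to the main thread.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PingReport {
    pub interface: String,
    pub success: bool,
}

/// Spawns one worker per interface. Each worker calls `probe` `rounds` times
/// and sends every outcome over the shared channel, then exits.
pub fn start_monitoring(
    interfaces: &[&str],
    rounds: usize,
    probe: fn(&str) -> bool,
) -> Receiver<PingReport> {
    let (sender, receiver) = mpsc::channel();
    for interface in interfaces {
        let sender = sender.clone();
        let interface = interface.to_string();
        thread::spawn(move || {
            for _ in 0..rounds {
                let report = PingReport {
                    interface: interface.clone(),
                    success: probe(&interface),
                };
                // Stop quietly if the receiving side has gone away.
                if sender.send(report).is_err() {
                    break;
                }
            }
        });
    }
    // The original sender is dropped when this function returns, so the
    // receiver's iterator ends once every worker has finished.
    receiver
}
```

The main thread can then drain the receiver in its event loop and feed each report into the state machine, which keeps all routing decisions on a single thread.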

## Error Handling Strategy

### Categories
1. **Network Errors**: Temporary connectivity issues
2. **System Errors**: Permission problems, interface not found
3. **Configuration Errors**: Invalid IP addresses, missing interfaces

### Recovery Mechanisms
- **Network Errors**: Retry with exponential backoff
- **System Errors**: Log and exit (requires admin intervention)
- **Configuration Errors**: Validate on startup, exit if invalid
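
The mapping from error category to recovery action, including the exponential backoff for network errors, might look like the following. This is a sketch: the type and function names are hypothetical, and the base delay (500 ms) and cap (30 s) are assumed values that the document does not specify.

```rust
use std::time::Duration;

/// The three error categories listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    Network,
    System,
    Configuration,
}

/// What the caller should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Recovery {
    RetryAfter(Duration),
    LogAndExit,
    RejectAtStartup,
}

// Assumed tuning values; not taken from the project.
const BASE_DELAY: Duration = Duration::from_millis(500);
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Delay before retry number `attempt` (0-based): the base delay doubled
/// `attempt` times, capped at `MAX_DELAY`. Overflow saturates instead of panicking.
pub fn backoff_delay(attempt: u32) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    BASE_DELAY.saturating_mul(factor).min(MAX_DELAY)
}

/// Maps an error category to the recovery mechanism described above.
pub fn recovery_for(category: ErrorCategory, attempt: u32) -> Recovery {
    match category {
        ErrorCategory::Network => Recovery::RetryAfter(backoff_delay(attempt)),
        ErrorCategory::System => Recovery::LogAndExit,
        ErrorCategory::Configuration => Recovery::RejectAtStartup,
    }
}
```

Capping the delay matters for a failover tool: without a ceiling, a long outage would push retries so far apart that recovery of the primary path could go unnoticed for a long time.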

## Security Considerations

### Privileges
- Requires root privileges for route manipulation
- Drops unnecessary privileges where possible
- Validates all user inputs

### Network Security
- Only sends ICMP packets to configured targets
- No arbitrary packet crafting
- Interface binding prevents traffic leakage

## Performance Characteristics

### Resource Usage
- **Memory**: Minimal (~10MB)
- **CPU**: Low (periodic ICMP packets)
- **Network**: Very low (only ping traffic)

### Scalability
- Single target machine design
- Supports multiple ping targets
- Limited to 2 interfaces (current design)

## Testing Architecture

### Unit Tests
- Individual component testing
- Mock network interfaces
- State machine logic verification

### Integration Tests
- Component interaction testing
- Real network interface usage
- Netlink operation verification

### End-to-End Tests
- Full system testing in containers
- Network failure simulation
- Failover timing verification
233
doc/TESTING.md
@@ -1,112 +1,43 @@
# Testing Guide

## Overview
## Test Environment

This document describes the testing strategy and environment for the Route-Switcher project.

## Testing Environment

### Podman-Compose Setup

The testing environment uses podman-compose to create a realistic network topology with routers and a single ICMP target:
The testing environment uses podman-compose to create a network topology with routers and an ICMP target:

```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Route-Switcher │ │ Primary Router│ │ │
│ Route-Switcher │ │ Primary Router│ │ ICMP Target │
│ │ │ │ │ │
│ eth0 ────────────┼────►│ eth0 ──────────┼────►│ ICMP Target │
│ eth0 ────────────┼────►│ eth0 ──────────┼────►│ 192.168.202.100│
│ eth1 ────────────┼────►│ eth1 ──────────┼────►│ │
│ │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
primary-net secondary-net target-net
192.168.1.0/24 192.168.2.0/24 10.0.0.0/24
```

### Container Architecture

### Container Setup
- **route-switcher**: Dual interfaces (eth0→primary-net, eth1→secondary-net)
- **primary-router**: Connects primary-net ↔ target-net (192.168.1.1 ↔ 10.0.0.1)
- **secondary-router**: Connects secondary-net ↔ target-net (192.168.2.1 ↔ 10.0.0.2)
- **icmp-target**: Single IP on target-net (10.0.0.100), reachable via either router

### Quick Start

```bash
# Start the testing environment
podman-compose up -d

# Run automated failover test
./scripts/test-failover.sh

# View logs
podman-compose logs -f route-switcher

# Stop environment
podman-compose down
```

### Network Configuration

**Route-Switcher:**
- eth0: 192.168.1.10 (primary network)
- eth1: 192.168.2.10 (secondary network)
- Default gateway: 192.168.1.1 (primary router)

**Primary Router:**
- eth0: 192.168.1.1 (primary network)
- eth1: 10.0.0.1 (target network)
- Routes traffic between networks with NAT

**Secondary Router:**
- eth0: 192.168.2.1 (secondary network)
- eth1: 10.0.0.2 (target network)
- Routes traffic between networks with NAT

**ICMP Target:**
- Single IP: 10.0.0.100
- Default route: 10.0.0.1 (primary router)
- Responds to ping from both routers
- **primary-router**: Connects primary-net ↔ target-net (192.168.200.11 ↔ 192.168.202.11)
- **secondary-router**: Connects secondary-net ↔ target-net (192.168.201.11 ↔ 192.168.202.12)
- **icmp-target**: Single IP on target-net (192.168.202.100)

## Test Scenarios

### 1. Basic Connectivity Test
**Objective**: Verify basic ping functionality on both interfaces

```bash
# Start environment
podman-compose up -d

# Test primary connectivity
podman-compose exec route-switcher ping -c 3 -I eth0 10.0.0.100

# Test secondary connectivity
podman-compose exec route-switcher ping -c 3 -I eth1 10.0.0.100

# Check routing table
podman-compose exec route-switcher ip route show
podman-compose exec route-switcher ping -c 3 -I eth0 192.168.202.100
podman-compose exec route-switcher ping -c 3 -I eth1 192.168.202.100
```

### 2. Failover Test
**Objective**: Verify automatic failover when primary router fails

```bash
# Start monitoring logs
# Monitor logs
podman-compose logs -f route-switcher &

# Simulate primary router failure
podman-compose exec primary-router ip link set eth0 down

# Verify failover occurs (should see in logs)
# Wait for state change to Fallback

# Check routing table after failover
podman-compose exec route-switcher ip route show

# Test connectivity via secondary router
podman-compose exec route-switcher ping -c 3 10.0.0.100
# Verify failover occurs and connectivity works
podman-compose exec route-switcher ping -c 3 192.168.202.100

# Restore primary router
podman-compose exec primary-router ip link set eth0 up
@@ -115,119 +46,45 @@ podman-compose exec primary-router ip link set eth0 up
```

### 3. Dual Failure Test
**Objective**: Verify the system does not fail over when both routers fail

```bash
# Start monitoring logs
podman-compose logs -f route-switcher &

# Fail both routers
# Fail both routers - system should NOT switch
podman-compose exec primary-router ip link set eth0 down
podman-compose exec secondary-router ip link set eth0 down

# Verify no routing changes occur
# System should remain in current state

# Restore routers
podman-compose exec primary-router ip link set eth0 up
podman-compose exec secondary-router ip link set eth0 up
```

### 4. Router Target Interface Failure
**Objective**: Test upstream network failure simulation
## Automated Testing

Run the comprehensive test script:
```bash
# Fail primary router's connection to target network
podman-compose exec primary-router ip link set eth1 down

# Should trigger failover to secondary router
# Verify connectivity still works via secondary path

# Restore primary router's target connection
podman-compose exec primary-router ip link set eth1 up
```

### 5. Automated Failover Test
**Objective**: Run complete automated test sequence

```bash
# Run the comprehensive test script
./scripts/test-failover.sh

# This script will:
# 1. Start the environment
# 2. Verify initial connectivity
# 3. Simulate primary router failure
# 4. Monitor failover
# 5. Restore primary router
# 6. Verify failback after 60 seconds
```

This script:
1. Starts the test environment
2. Verifies initial connectivity
3. Simulates primary router failure
4. Monitors failover
5. Restores primary router
6. Verifies failback

## Unit Tests

### Running Tests
```bash
# Run all tests
cargo test

# Run specific test module
# Run specific module
cargo test pinger

# Run with coverage
cargo tarpaulin --out Html
cargo test routing
cargo test state_machine
```

### Test Structure
```
tests/
├── unit/
│   ├── pinger_tests.rs
│   ├── routing_tests.rs
│   └── state_machine_tests.rs
├── integration/
│   ├── netlink_tests.rs
│   └── dual_interface_tests.rs
└── e2e/
    └── failover_tests.rs
```
## Debug Commands

## Performance Testing

### Load Testing
```bash
# Test with multiple ping targets
cargo run -- --ping-target 8.8.8.8

# Monitor resource usage
podman stats route-switcher

# Test long-running stability
# Run for 24 hours and monitor for memory leaks
```

### Network Latency Testing
```bash
# Measure failover time
# Start script to time the state transition
start_time=$(date +%s%N)
# Trigger failure
# Wait for state change
end_time=$(date +%s%N)
# Convert the nanosecond difference to milliseconds
failover_time=$(( (end_time - start_time) / 1000000 ))
echo "Failover time: ${failover_time}ms"
```

## Debugging Tests

### Common Issues
1. **Permission Denied**: Ensure containers run with privileged mode
2. **Interface Not Found**: Check network configuration in compose file
3. **Netlink Errors**: Verify kernel supports required operations
4. **Timing Issues**: Adjust test timeouts for your environment

### Debug Commands
```bash
# Check container network interfaces
# Check container interfaces
podman-compose exec route-switcher ip addr show

# Check routing table
@@ -235,36 +92,4 @@ podman-compose exec route-switcher ip route show

# Monitor network traffic
podman-compose exec route-switcher tcpdump -i any icmp

# Check system logs
podman-compose exec route-switcher dmesg | tail -20
```

## Test Data

### Sample Ping Results
```rust
// Mock data for testing
let mock_ping_results = vec![
    PingResult::Ok,     // Normal operation
    PingResult::Failed, // Single failure
    PingResult::Failed, // Consecutive failure
    PingResult::Failed, // Trigger failover
];
```
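
To show how mock data like this could drive a test, here is a sketch of a consecutive-failure counter. It assumes `PingResult` is a plain two-variant enum, which matches how it is used above but may not match the real type; `FAILURE_THRESHOLD` and `failover_trigger_index` are hypothetical names. The real failover decision also requires a healthy secondary interface, which this sketch leaves out.

```rust
/// Assumed shape of `PingResult`, based on the mock data above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PingResult {
    Ok,
    Failed,
}

/// Number of consecutive failures that triggers failover.
pub const FAILURE_THRESHOLD: u32 = 3;

/// Returns the index of the sample at which the failure threshold is first
/// reached, or `None` if it never is. Any success resets the counter.
pub fn failover_trigger_index(results: &[PingResult]) -> Option<usize> {
    let mut consecutive_failures = 0u32;
    for (index, result) in results.iter().enumerate() {
        match result {
            PingResult::Ok => consecutive_failures = 0,
            PingResult::Failed => consecutive_failures += 1,
        }
        if consecutive_failures >= FAILURE_THRESHOLD {
            return Some(index);
        }
    }
    None
}
```

With the mock sequence above, the threshold is reached on the fourth sample (index 3); a sequence where a success interrupts the failures never reaches it.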

### Network Configuration
```bash
# Test network setup
ip addr add 192.168.1.10/24 dev eth0
ip addr add 192.168.2.10/24 dev eth1
ip route add default via 192.168.1.1 dev eth0 metric 10
ip route add default via 192.168.2.1 dev eth1 metric 20
```

## Test Coverage Goals

- **Unit Tests**: 90%+ code coverage
- **Integration Tests**: All major component interactions
- **E2E Tests**: All user scenarios and edge cases
- **Performance Tests**: Resource usage and timing validation