route-switcher/doc/ARCHITECTURE.md
Michal Humpula 5fbd72b370 working base
2026-02-15 17:36:08 +01:00

Architecture Documentation

System Overview

Route-Switcher is a network failover system that runs as a user-space application and provides automatic network redundancy. It monitors connectivity through multiple interfaces and adjusts the kernel routing table so that traffic keeps flowing when one link fails.

Component Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Main Thread   │    │  Async Pingers   │    │  Route Manager  │
│                 │    │                  │    │                 │
│ • State Machine │◄──►│ • Interface A    │◄──►│ • Netlink API   │
│ • Decision Logic│    │ • Interface B    │    │ • Route Add/Del │
│ • Coordination  │    │ • ICMP Monitoring│    │ • Metric Mgmt   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌──────────────────┐
                    │   Linux Kernel   │
                    │                  │
                    │ • Routing Table  │
                    │ • Network Stack  │
                    │ • Netlink Socket │
                    └──────────────────┘

Data Flow

  1. Monitoring Phase

    • Async pingers send ICMP echo requests via both interfaces
    • Results are collected and sent to the main thread
    • The state machine evaluates connectivity patterns
  2. Decision Phase

    • State machine determines if failover is needed
    • Verifies secondary interface health
    • Triggers route changes if conditions are met
  3. Action Phase

    • Route manager updates kernel routing table
    • Changes are applied via netlink interface
    • System continues monitoring in new state

State Machine Design

States

  • Boot: Initial state, gathering connectivity data
  • Primary: Using primary interface for routing
  • Fallback: Using secondary interface for routing

Transitions

Boot → Primary: After 10 seconds of sampling (regardless of ping results)
Primary → Fallback: After 3 consecutive ping failures on the primary, and only if the secondary is healthy
Fallback → Primary: After 60 seconds of stable primary connectivity
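
The transition rules above can be sketched as a pure function. This is a minimal illustration, not the project's actual implementation: the `Observation` struct and its field names are hypothetical, and "after N" is interpreted here as "greater than or equal to N".

```rust
use std::time::Duration;

// Thresholds taken from the transition rules above.
const BOOT_SAMPLING: Duration = Duration::from_secs(10);
const FAILURE_THRESHOLD: u32 = 3;
const FAILBACK_STABLE: Duration = Duration::from_secs(60);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

/// Hypothetical snapshot of what the monitoring side has observed so far.
#[derive(Debug, Clone, Copy)]
pub struct Observation {
    /// Time spent in the current state.
    pub time_in_state: Duration,
    /// Consecutive failed pings on the primary interface.
    pub primary_failures: u32,
    /// How long the primary has been answering without a failure.
    pub primary_stable_for: Duration,
    /// Whether the secondary interface currently looks healthy.
    pub secondary_healthy: bool,
}

/// Returns the next state for the given observation; no side effects.
pub fn next_state(state: State, obs: &Observation) -> State {
    match state {
        // Boot always ends in Primary once sampling is done, whatever the ping results.
        State::Boot if obs.time_in_state >= BOOT_SAMPLING => State::Primary,
        // Fail over only when the primary is down AND the secondary is usable.
        State::Primary
            if obs.primary_failures >= FAILURE_THRESHOLD && obs.secondary_healthy =>
        {
            State::Fallback
        }
        // Fail back only after a sustained period of primary connectivity.
        State::Fallback if obs.primary_stable_for >= FAILBACK_STABLE => State::Primary,
        // Otherwise stay where we are.
        other => other,
    }
}
```

Keeping the decision logic free of I/O like this would let the state machine be unit tested without touching sockets or the routing table.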

Routing Behavior

  • Boot State: Both routes are set up initially - primary (metric 10) and secondary (metric 20)
  • Primary State: Primary route (metric 10) and secondary route (metric 20) present
  • Fallback State: All three routes present - primary (metric 10), secondary (metric 20), and failover secondary (metric 5)
  • Exit: Only the failover route (metric 5) is removed
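
To illustrate why the extra metric 5 route takes over in the Fallback state, the sketch below lists the routes expected in each state and picks the one with the lowest metric, which is how Linux chooses among otherwise equal default routes. The type and function names (`DefaultRoute`, `routes_for`, `effective_link`) are hypothetical and exist only for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Boot,
    Primary,
    Fallback,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Link {
    Primary,
    Secondary,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DefaultRoute {
    pub link: Link,
    pub metric: u32,
}

// Metrics as listed in the routing behavior above.
pub const PRIMARY: DefaultRoute = DefaultRoute { link: Link::Primary, metric: 10 };
pub const SECONDARY: DefaultRoute = DefaultRoute { link: Link::Secondary, metric: 20 };
pub const FAILOVER: DefaultRoute = DefaultRoute { link: Link::Secondary, metric: 5 };

/// Default routes expected to be present in each state.
pub fn routes_for(state: State) -> Vec<DefaultRoute> {
    match state {
        State::Boot | State::Primary => vec![PRIMARY, SECONDARY],
        State::Fallback => vec![PRIMARY, SECONDARY, FAILOVER],
    }
}

/// The link that carries traffic: the default route with the lowest metric wins.
pub fn effective_link(routes: &[DefaultRoute]) -> Option<Link> {
    routes.iter().min_by_key(|r| r.metric).map(|r| r.link)
}
```

With these definitions, `effective_link(&routes_for(State::Primary))` selects the primary link, while the Fallback set selects the secondary link because of the metric 5 entry.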

Route Management Strategy

The system follows a "both routes always present, extra failover on-demand" approach:

  1. Initialization: Set up primary route (metric 10) and secondary route (metric 20)
  2. Boot Phase: Collect 10 seconds of ping samples to establish baseline connectivity
  3. Normal Operation: Primary route serves traffic (metric 10), secondary available as backup (metric 20)
  4. Failover: Add an extra secondary route with the lowest metric (5), which gives it the highest priority, for immediate failover
  5. Failback: Remove extra failover route when primary recovers
  6. Cleanup: Only remove the extra failover route on exit, preserving base routes
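
The lifecycle above can be expressed as a small table of route operations per event. This is a sketch under assumptions: the `Event` and `RouteOp` types are invented for illustration, the routing table is modeled as a plain vector, and the `failover_active` flag is a hypothetical way to skip the delete when the failover route was never added.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Link {
    Primary,
    Secondary,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Init,
    Failover,
    Failback,
    Exit { failover_active: bool },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RouteOp {
    Add(Link, u32),
    Delete(Link, u32),
}

/// Route operations for each lifecycle event; base routes are never deleted.
pub fn ops_for(event: Event) -> Vec<RouteOp> {
    match event {
        Event::Init => vec![
            RouteOp::Add(Link::Primary, 10),
            RouteOp::Add(Link::Secondary, 20),
        ],
        Event::Failover => vec![RouteOp::Add(Link::Secondary, 5)],
        Event::Failback | Event::Exit { failover_active: true } => {
            vec![RouteOp::Delete(Link::Secondary, 5)]
        }
        // Nothing to clean up if the failover route is not installed.
        Event::Exit { failover_active: false } => Vec::new(),
    }
}

/// Applies operations to an in-memory stand-in for the routing table.
pub fn apply(table: &mut Vec<(Link, u32)>, ops: &[RouteOp]) {
    for op in ops {
        match *op {
            RouteOp::Add(link, metric) => table.push((link, metric)),
            RouteOp::Delete(link, metric) => table.retain(|r| *r != (link, metric)),
        }
    }
}
```

Running Init, then Failover, then Exit against an empty table leaves exactly the two base routes behind, which matches the cleanup rule.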

State Persistence

  • Current state is maintained in memory
  • State changes are logged for debugging
  • No persistent storage required (state rebuilds on restart)

Interface Design

Pinger Interface

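use std::net::Ipv4Addr;
// `PingResult` and `Receiver` come from elsewhere in the crate or its channel library.
pub trait Pinger {
    /// Send a single ICMP echo request to `target` through `interface`.
    async fn ping(&self, target: Ipv4Addr, interface: &str) -> PingResult;
    /// Start monitoring all targets on all interfaces; results arrive on the returned receiver.
    async fn start_monitoring(&self, targets: &[Ipv4Addr], interfaces: &[String]) -> Receiver<PingResult>;
}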

Route Manager Interface

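use std::net::Ipv4Addr;
pub trait RouteManager {
    /// Install a default route via `gateway` on `interface` with the given metric.
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// Remove the default route matching `gateway`, `interface`, and `metric`.
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    /// Return the current routes.
    fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}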
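
For tests that should not touch the kernel, the `RouteManager` trait could be backed by an in-memory table. The sketch below is one possible shape, not the project's code: `RouteError`, the `Result` alias, the fields of `RouteInfo`, and `InMemoryRoutes` are all assumptions made so the example compiles on its own. A `Mutex` is used because the trait methods take `&self`.

```rust
use std::net::Ipv4Addr;
use std::sync::Mutex;

// Hypothetical error and result types; the real crate may define these differently.
#[derive(Debug, PartialEq, Eq)]
pub enum RouteError {
    AlreadyExists,
    NotFound,
}
pub type Result<T> = std::result::Result<T, RouteError>;

// Hypothetical layout of `RouteInfo`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RouteInfo {
    pub gateway: Ipv4Addr,
    pub interface: String,
    pub metric: u32,
}

pub trait RouteManager {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()>;
    fn get_current_routes(&self) -> Result<Vec<RouteInfo>>;
}

/// Test double that records routes in memory instead of talking to netlink.
#[derive(Default)]
pub struct InMemoryRoutes {
    table: Mutex<Vec<RouteInfo>>,
}

impl RouteManager for InMemoryRoutes {
    fn add_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut table = self.table.lock().unwrap();
        if table.contains(&route) {
            return Err(RouteError::AlreadyExists);
        }
        table.push(route);
        Ok(())
    }

    fn delete_default_route(&self, gateway: Ipv4Addr, interface: &str, metric: u32) -> Result<()> {
        let route = RouteInfo { gateway, interface: interface.to_string(), metric };
        let mut table = self.table.lock().unwrap();
        match table.iter().position(|r| *r == route) {
            Some(index) => {
                table.remove(index);
                Ok(())
            }
            None => Err(RouteError::NotFound),
        }
    }

    fn get_current_routes(&self) -> Result<Vec<RouteInfo>> {
        Ok(self.table.lock().unwrap().clone())
    }
}
```

A state machine written against the trait can then be exercised with `InMemoryRoutes::default()` and checked by inspecting `get_current_routes()`.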

Threading Model

Main Thread

  • Runs the state machine
  • Handles signals and graceful shutdown
  • Coordinates between components

Async Pinger Tasks

  • One task per interface
  • Non-blocking ICMP operations
  • Results sent via channels
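
The "one task per interface, results via channels" pattern can be sketched with the standard library alone. Note the substitutions: OS threads and `std::sync::mpsc` stand in for the async tasks and channel type the project actually uses, the `probe` closure replaces real ICMP, and the `PingResult` fields and `rounds` parameter are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

// Hypothetical result shape; the real `PingResult` may carry more data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PingResult {
    pub interface: String,
    pub success: bool,
}

/// Spawns one worker per interface. Each worker probes `rounds` times and
/// sends every result through a shared channel to the caller.
pub fn start_monitoring<F>(interfaces: &[&str], rounds: usize, probe: F) -> Receiver<PingResult>
where
    F: Fn(&str) -> bool + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel();
    for iface in interfaces {
        let tx = tx.clone();
        let probe = probe.clone();
        let iface = iface.to_string();
        thread::spawn(move || {
            for _ in 0..rounds {
                let success = probe(&iface);
                let result = PingResult { interface: iface.clone(), success };
                if tx.send(result).is_err() {
                    break; // The receiver was dropped; stop probing.
                }
            }
        });
    }
    // The original sender is dropped here, so the receiver's iterator
    // ends once every worker has finished.
    rx
}
```

The main thread only sees a stream of `PingResult` values, so it does not need to know how many workers exist or how they probe.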

Route Manager

  • Synchronous operations (netlink requests are issued as blocking calls)
  • Called from main thread
  • Thread-safe operations

Error Handling Strategy

Categories

  1. Network Errors: Temporary connectivity issues
  2. System Errors: Permission problems, interface not found
  3. Configuration Errors: Invalid IP addresses, missing interfaces

Recovery Mechanisms

  • Network Errors: Retry with exponential backoff
  • System Errors: Log and exit (requires admin intervention)
  • Configuration Errors: Validate on startup, exit if invalid
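
The mapping from error category to recovery action, including exponential backoff, might look like the sketch below. The base delay of 500 ms and the 30 second cap are assumed values, since the document does not specify them, and the `ErrorKind` and `Recovery` names are hypothetical.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    Network,
    System,
    Configuration,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Recovery {
    RetryAfter(Duration),
    Exit,
}

// Assumed values; the document does not state the actual delays.
const BASE_DELAY: Duration = Duration::from_millis(500);
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
pub fn backoff_delay(attempt: u32) -> Duration {
    // `checked_shl` returns None for shifts of 32 or more; treat that as "very large".
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    BASE_DELAY.saturating_mul(factor).min(MAX_DELAY)
}

/// Chooses the recovery action for an error category.
pub fn recovery_for(kind: ErrorKind, attempt: u32) -> Recovery {
    match kind {
        // Temporary connectivity problems are retried with growing delays.
        ErrorKind::Network => Recovery::RetryAfter(backoff_delay(attempt)),
        // Permission or configuration problems need a human; give up.
        ErrorKind::System | ErrorKind::Configuration => Recovery::Exit,
    }
}
```

Capping the delay keeps a long outage from pushing the retry interval so far out that recovery goes unnoticed.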

Security Considerations

Privileges

  • Requires root privileges (or at least the CAP_NET_ADMIN capability) for route manipulation
  • Drops unnecessary privileges where possible
  • Validates all user inputs
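
As an example of input validation at startup, the sketch below checks ping targets and interface names before any privileged operation runs. The function names and `ConfigError` type are hypothetical. The interface check only approximates the Linux naming rules (non-empty, at most 15 bytes, no `/`, `:`, or whitespace) and does not verify that the interface exists.

```rust
use std::net::Ipv4Addr;

// Linux interface names must fit in IFNAMSIZ (16 bytes) including the trailing NUL.
const MAX_IFNAME_LEN: usize = 15;

#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    InvalidAddress(String),
    InvalidInterface(String),
}

/// Parses a ping target, rejecting anything that is not a valid IPv4 address.
pub fn parse_target(raw: &str) -> Result<Ipv4Addr, ConfigError> {
    raw.trim()
        .parse::<Ipv4Addr>()
        .map_err(|_| ConfigError::InvalidAddress(raw.to_string()))
}

/// Checks that an interface name is syntactically plausible.
pub fn validate_interface(name: &str) -> Result<&str, ConfigError> {
    let ok = !name.is_empty()
        && name.len() <= MAX_IFNAME_LEN
        && name != "."
        && name != ".."
        && !name.chars().any(|c| c == '/' || c == ':' || c.is_whitespace());
    if ok {
        Ok(name)
    } else {
        Err(ConfigError::InvalidInterface(name.to_string()))
    }
}
```

Rejecting malformed values up front means a configuration error surfaces at startup rather than as a failed route operation later.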

Network Security

  • Only sends ICMP packets to configured targets
  • No arbitrary packet crafting
  • Interface binding prevents traffic leakage

Performance Characteristics

Resource Usage

  • Memory: Minimal (~10 MB)
  • CPU: Low (periodic ICMP packets)
  • Network: Very low (only ping traffic)

Scalability

  • Designed for a single host (manages routes only on the machine it runs on)
  • Supports multiple ping targets
  • Limited to 2 interfaces (current design)

Testing Architecture

Unit Tests

  • Individual component testing
  • Mock network interfaces
  • State machine logic verification

Integration Tests

  • Component interaction testing
  • Real network interface usage
  • Netlink operation verification

End-to-End Tests

  • Full system testing in containers
  • Network failure simulation
  • Failover timing verification