How to Ensure a More Reliable ISP Fiber Connection with Failover

Hello folks,

Recently, I have been getting into homelabbing, and I wanted to host a bit more at home. Because I am also running a Tezos node, I wanted my connection to be as reliable as possible. That need became more pressing when my ISP experienced an outage that lasted a couple of days.

My QoS journey

QoS stands for Quality of Service, and here is how Wikipedia defines it:


Quality of service (QoS) is the description or measurement of the overall performance of a service, such as a telephony or computer network, or a cloud computing service, particularly the performance seen by the users of the network

Source: https://en.wikipedia.org/wiki/Quality_of_service

As I was saying, I had some ISP (fiber) outages, and I wanted to make my home internet more reliable. I started by changing ISPs. Then, I looked into the idea of "replacing your ISP box with a router." There are plenty of solutions out there, but they are quite expensive, and you never know in advance whether they will work.

The vast majority of what you'll find involves mimicking what the ISP router does so that your router is "authenticated" as legitimate on the network and can decode the fiber signal. I wanted to do that, but I also wanted to reduce the complexity of my setup.

So, here's what I decided in the end: let's just stick with the ISP router and try to build a resilient connection system around it.

The bad idea

Initially, I thought subscribing to two ISP fiber connections was the solution. That way, I could easily fail over from one to the other in case of an outage. However, this is not doable, at least where I live in France. Most of the time, only one connection port is allocated to your home, and you can't have two fiber lines coming in.

The first alternative

This is where I started digging into the 4G/5G router option. There are a few 4G/5G routers where you simply insert a SIM card, and the router shares the mobile connection with your devices. The good thing about this setup is that you can even connect your ISP router to it. Depending on the 4G/5G router, you can fail over to 4G/5G as a backup in case your primary connection (fiber) fails.

This is the path I decided to go with.

What you'll need:

  1. An ISP connection and router
  2. A 4G or 5G router
  3. A mobile 4G or 5G SIM card and plan

I chose the TP-Link Deco 5G

It has one WAN/LAN 2.5 Gbps port where you can plug your ISP router. The two other ports are gigabit ports if you want to connect devices.

This worked pretty well, so I started testing the failover option. However, it was not what I was expecting. I wanted a fast, efficient failover: I don't want to wait 30 seconds for my 4G/5G router to realize I've lost the signal from my ISP.

I started looking into different solutions. There are routers that can perform load balancing and may be a little more efficient, but they would add additional layers of complexity to my infrastructure and increase the cost. This is when I started thinking about VRRP.

The Virtual Router Redundancy Protocol (VRRP) is a computer networking protocol that provides for automatic assignment of available Internet Protocol (IP) routers to participating hosts. This increases the availability and reliability of routing paths via automatic default gateway selections on an IP subnetwork.

source: https://en.wikipedia.org/wiki/Virtual_Router_Redundancy_Protocol

The second alternative

The wiring is similar to the first option, but the failover decision moves elsewhere: instead of relying on the 4G/5G router to treat the ISP router as its primary connection, a host connected to both routers decides which one to use.

I have a server, which is a NUC, and that NUC has two Ethernet connections. You don’t necessarily need a NUC; you can use a laptop or computer with a dongle or external/internal network card to achieve the same. Once your host device has two entry points for Ethernet, your operating system will have two interfaces with connectivity.

MSI Cubi NUC 1M-024BEU

Now, the network wiring looks a little more like this:

  1. Primary Connection: The ISP router is connected to one Ethernet interface on the host device.
  2. Backup Connection: The 4G/5G router is connected to the second Ethernet interface on the host device.
  3. The host device (NUC, laptop, or PC with dual Ethernet) manages the failover and routing between the two connections.

Basically, the host will have two network interfaces, e.g., eth0 and eth1. One will have priority over the other, and you can define these priorities through the network manager or by configuring route metrics directly (on Linux, the route with the lowest metric wins).

Metrics (networking) - Wikipedia
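
To make the role of the metric concrete, here is a minimal sketch of how the preferred default route is chosen. The `DefaultRoute` type, the `pickPreferredRoute` helper, the interface names, and the metric values are all illustrative assumptions, not part of any real tool.

```typescript
// Hypothetical model of a default route: an interface name plus its metric.
interface DefaultRoute {
  iface: string;
  metric: number;
}

// Linux prefers the matching route with the LOWEST metric,
// so the active uplink is simply the minimum by metric.
function pickPreferredRoute(routes: DefaultRoute[]): DefaultRoute | undefined {
  let best: DefaultRoute | undefined = undefined;
  for (const route of routes) {
    if (best === undefined || route.metric < best.metric) {
      best = route;
    }
  }
  return best;
}

const preferredRoute = pickPreferredRoute([
  { iface: "eth0", metric: 100 }, // ISP router (primary)
  { iface: "eth1", metric: 300 }, // 4G/5G router (backup)
]);
console.log(preferredRoute?.iface); // prints "eth0"
```

Swapping the two metric values is all it takes to make the backup interface the preferred one, which is the lever the rest of this post relies on.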

From that point, I needed a way to prioritize my primary connection, determine when to fall back to my backup connection (5G), and decide how to switch back once the primary connection is restored.

To do that there are a couple of options:

  • Internet Connection Bonding: This technique aggregates multiple connections. The computer sticks to a single connection entry point (the bonded one), which binds to both connections behind the scenes. In effect, this acts as a kind of failover because you have connectivity whether the primary or the backup is up. I don't remember exactly why I ditched that option, but it wasn't working as expected.

    Bonding protocol - Wikipedia

  • MPTCP (Multipath TCP): This is essentially a protocol layer that lets a single connection use multiple network paths. It can help achieve failover, but I found it a bit heavy and complex to implement. It would have been my last resort if I hadn't found another solution to achieve what I wanted to do.

    Multipath TCP - Wikipedia

  • Keepalived: Keepalived is routing software written in C. The main goal of this project is to provide simple and robust facilities for load balancing and high availability to Linux systems and Linux-based infrastructures. The load balancing framework relies on the well-known and widely used Linux Virtual Server (IPVS) kernel module, which provides Layer 4 load balancing.

This was the option I looked into most seriously. You basically install it, configure your interfaces, and it monitors the connection, falling back to your backup depending on the results of the monitoring tests. To test connectivity, you have to provide scripts; Keepalived runs them for you and uses the results to make decisions.

This is where I was a little skeptical, as it didn't have built-in capabilities for the checks; it only had the logic to balance traffic from one connection to another. Since I had to provide the scripts anyway, I wondered why I shouldn't code the entire orchestration myself.

Keepalived for Linux
Keepalived provides robust High-Availability and Load Balancing features for Linux critical infrastructures

From the second alternative to a homemade solution

Network and wiring-wise, the second alternative was the right choice. I now only needed the software part that would:

  1. Check the connectivity regularly
  2. Switch to the backup connectivity quickly
  3. Recover to the primary connection when it's back up

I started my R&D on my desktop computer, which runs the Solus distribution and uses NetworkManager (controlled through nmcli).


My R&D environment was as follows:

  • ISP fiber over Ethernet
  • Phone hotspot through USB

So, I had two connections plugged into my computer, and I started hacking away.

The application is configured through the following environment variables:

PRIMARY_CONNECTION=eth0
PRIMARY_CHECK_INTERVAL_IN_SECONDS=5
BACKUP_CONNECTION=eth1
BACKUP_CHECK_INTERVAL_IN_SECONDS=30

These variables define the connection names (primary and backup) and the delay, in seconds, between checks for each connection.
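
As a rough illustration, the variables above could be read and validated like this. This is a sketch, not the project's actual code: the `FailoverConfig` shape and the helper names are assumptions made for the example, and the environment is passed in as a plain object so the snippet does not depend on any particular runtime.

```typescript
// Hypothetical config shape; field names are assumptions for this sketch.
interface FailoverConfig {
  primaryConnection: string;
  primaryCheckIntervalSeconds: number;
  backupConnection: string;
  backupCheckIntervalSeconds: number;
}

type Env = Record<string, string | undefined>;

// Return a non-empty string or fail loudly at startup.
function requireString(env: Env, key: string): string {
  const value = env[key]?.trim();
  if (!value) {
    throw new Error(`Missing environment variable ${key}`);
  }
  return value;
}

// Intervals must be whole, positive numbers of seconds.
function requirePositiveInt(env: Env, key: string): number {
  const raw = requireString(env, key);
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

function loadFailoverConfig(env: Env): FailoverConfig {
  return {
    primaryConnection: requireString(env, "PRIMARY_CONNECTION"),
    primaryCheckIntervalSeconds: requirePositiveInt(env, "PRIMARY_CHECK_INTERVAL_IN_SECONDS"),
    backupConnection: requireString(env, "BACKUP_CONNECTION"),
    backupCheckIntervalSeconds: requirePositiveInt(env, "BACKUP_CHECK_INTERVAL_IN_SECONDS"),
  };
}

const exampleConfig = loadFailoverConfig({
  PRIMARY_CONNECTION: "eth0",
  PRIMARY_CHECK_INTERVAL_IN_SECONDS: "5",
  BACKUP_CONNECTION: "eth1",
  BACKUP_CHECK_INTERVAL_IN_SECONDS: "30",
});
console.log(exampleConfig.primaryCheckIntervalSeconds); // prints 5
```

In a real program you would pass the runtime's environment object (for example `process.env`) instead of a literal. Failing at startup on a missing or malformed value is a deliberate choice here: a failover tool with a silently wrong interface name is worse than one that refuses to start.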

The program will do the following:

  1. Pick a well-known website randomly from a list.
  2. Send an HTTP request to that website using curl, bound to the interface being tested.
  3. Based on the result (whether the primary is down, the backup is down, or both are down), take action.
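
The first two steps can be sketched as follows. The target list, the helper names, and the 3-second timeout are assumptions for illustration (the real project's values may differ); the curl options themselves (`--interface`, `--head`, `--silent`, `--output`, `--max-time`) are standard curl flags.

```typescript
// Illustrative list of well-known sites; not the project's actual list.
const CHECK_TARGETS: readonly string[] = [
  "https://www.google.com",
  "https://www.cloudflare.com",
  "https://www.wikipedia.org",
];

// Pick one target at random. The random source is injectable for testing.
function pickTarget(
  targets: readonly string[],
  random: () => number = Math.random,
): string {
  const target = targets[Math.floor(random() * targets.length)];
  if (target === undefined) {
    throw new Error("Target list must not be empty");
  }
  return target;
}

// Build the curl argument list for a check bound to one interface.
function buildCurlArgs(iface: string, url: string, timeoutSeconds = 3): string[] {
  return [
    "--interface", iface,                 // send the request through this interface only
    "--head",                             // headers are enough to prove reachability
    "--silent",                           // no progress output
    "--output", "/dev/null",              // discard the response
    "--max-time", String(timeoutSeconds), // give up quickly on a dead link
    url,
  ];
}

const exampleArgs = buildCurlArgs("eth0", pickTarget(CHECK_TARGETS));
console.log(["curl", ...exampleArgs].join(" "));
```

The argument array can be handed to whatever process-spawning API your runtime provides; a curl exit code of 0 means the site was reachable through that interface. Rotating between several targets avoids mistaking one site's outage for a dead connection.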

Actions:

  • If the primary is up and the backup is up, stick to the primary.
  • If the primary is down and the backup is up, switch to the backup.
  • If the primary is back up and the backup is active, switch back to the primary.
  • If both primary and backup are down, do nothing.
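
These four rules fit in a small pure function. The sketch below is my own condensed version, not the repository's code; the `PRIMARY`/`BACKUP` state names match the ones that appear in the program's logs, while the action names and `decideAction` are made up for the example.

```typescript
type ActiveConnection = "PRIMARY" | "BACKUP";
type FailoverAction = "STAY" | "SWITCH_TO_BACKUP" | "SWITCH_TO_PRIMARY";

// Decide what to do from the currently active connection and the two check results.
function decideAction(
  active: ActiveConnection,
  primaryUp: boolean,
  backupUp: boolean,
): FailoverAction {
  // Primary is down while we are on it: fail over only if the backup works.
  if (active === "PRIMARY" && !primaryUp && backupUp) {
    return "SWITCH_TO_BACKUP";
  }
  // Primary came back while we are on the backup: always prefer the primary.
  if (active === "BACKUP" && primaryUp) {
    return "SWITCH_TO_PRIMARY";
  }
  // Everything else (both up on primary, both down, ...) changes nothing.
  return "STAY";
}

console.log(decideAction("PRIMARY", false, true)); // prints "SWITCH_TO_BACKUP"
console.log(decideAction("BACKUP", true, true));   // prints "SWITCH_TO_PRIMARY"
console.log(decideAction("PRIMARY", false, false)); // prints "STAY"
```

Keeping the decision separate from the side effects (running curl, changing routes) makes the logic easy to test without touching the network.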

To implement this, I wrote a small Nest.js application in TypeScript that runs on Bun.

NestJS - A progressive Node.js framework
NestJS is a framework for building efficient, scalable Node.js web applications. It uses modern JavaScript, is built with TypeScript and combines elements of OOP (Object Oriented Programming), FP (Functional Programming), and FRP (Functional Reactive Programming).
Bun — A fast all-in-one JavaScript runtime
Bundle, install, and run JavaScript & TypeScript — all in Bun. Bun is a new JavaScript runtime with a native bundler, transpiler, task runner, and npm client built-in.

You’ll find the implementation details and all the requirements for the program in the GitHub README.md, as the repository is open source. More information can be found here:

GitHub - iiAku/virtual-failover

This is what it looks like when it's running:

[16:50:00.438] WARN (1470255): Primary connection seems to be down, checking again
[16:50:00.438] INFO (1470255): Checking connectivity against
[16:50:00.438] INFO (1470255): Checking connectivity against
[16:50:01.453] INFO (1470255): Current check interval is 5 seconds
[16:50:01.453] INFO (1470255): Primary connection is down ❌
[16:50:01.453] INFO (1470255): Backup connection is up ✅
[16:50:01.453] INFO (1470255): Connection state is PRIMARY
[16:50:02.515] INFO (1470255): Setting route priority for connection eth0
[16:50:03.427] INFO (1470255): Setting route priority for connection eth1
[16:50:03.738] INFO (1470255): Connection (eth0) took 1223ms to restart.
[16:50:03.738] INFO (1470255): Connection (eth0) ipv4.route-metric=300 ✅
[16:50:03.738] INFO (1470255): Connection (eth0) ipv6.route-metric=300 ✅
[16:50:03.738] INFO (1470255): Connection (eth0) connection.autoconnect-priority=200 ✅
[16:50:03.751] INFO (1470255): Connection (eth1) took 323ms to restart.
[16:50:03.751] INFO (1470255): Connection (eth1) ipv4.route-metric=100 ✅
[16:50:03.751] INFO (1470255): Connection (eth1) ipv6.route-metric=100 ✅
[16:50:03.751] INFO (1470255): Connection (eth1) connection.autoconnect-priority=400 ✅
[16:50:03.751] INFO (1470255): Changing from PRIMARY to BACKUP connection is active.
[16:50:03.751] INFO (1470255): Primary connection is down ❌ - Activating backup 🔄
[16:50:04.423] INFO (1470255): Checking connectivity against
[16:50:04.424] INFO (1470255): Checking connectivity against
[16:50:05.440] INFO (1470255): Current check interval is 30 seconds
[16:50:05.440] INFO (1470255): Primary connection is up ✅
[16:50:05.440] INFO (1470255): Backup connection is down ❌
[16:50:05.440] INFO (1470255): Connection state is BACKUP
[16:50:06.448] INFO (1470255): Setting route priority for connection eth0
[16:50:07.334] INFO (1470255): Setting route priority for connection eth1
[16:50:07.622] INFO (1470255): Connection (eth1) took 288ms to restart.
[16:50:07.622] INFO (1470255): Connection (eth1) ipv4.route-metric=300 ✅
[16:50:07.622] INFO (1470255): Connection (eth1) ipv6.route-metric=300 ✅
[16:50:07.622] INFO (1470255): Connection (eth1) connection.autoconnect-priority=200 ✅
[16:50:07.637] INFO (1470255): Connection (eth0) took 1188ms to restart.
[16:50:07.637] INFO (1470255): Connection (eth0) ipv4.route-metric=200 ✅
[16:50:07.637] INFO (1470255): Connection (eth0) ipv6.route-metric=200 ✅
[16:50:07.637] INFO (1470255): Connection (eth0) connection.autoconnect-priority=300 ✅
[16:50:07.637] INFO (1470255): Changing from BACKUP to PRIMARY connection is active.
[16:50:07.637] INFO (1470255): Primary connection is back up ✅ - Switching back to primary.
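
The log shows three NetworkManager properties being set on each connection before it is restarted. Here is a sketch of how the matching nmcli invocations could be built. Whether the project issues exactly these commands is an assumption on my part; the property names and values are taken from the first half of the log, and `ConnectionPriority` and `buildPriorityCommands` are hypothetical names.

```typescript
interface ConnectionPriority {
  connection: string;          // NetworkManager connection name, e.g. "eth0"
  routeMetric: number;         // lower metric = preferred default route
  autoconnectPriority: number; // higher value = preferred when auto-connecting
}

// Build the nmcli commands that apply one connection's priority settings.
function buildPriorityCommands(p: ConnectionPriority): string[][] {
  const metric = String(p.routeMetric);
  return [
    [
      "nmcli", "connection", "modify", p.connection,
      "ipv4.route-metric", metric,
      "ipv6.route-metric", metric,
      "connection.autoconnect-priority", String(p.autoconnectPriority),
    ],
    // Re-activate the connection so the new metric takes effect.
    ["nmcli", "connection", "up", p.connection],
  ];
}

// Values copied from the log: failing over from eth0 (primary) to eth1 (backup).
const failoverToBackup: ConnectionPriority[] = [
  { connection: "eth0", routeMetric: 300, autoconnectPriority: 200 },
  { connection: "eth1", routeMetric: 100, autoconnectPriority: 400 },
];

for (const priority of failoverToBackup) {
  for (const command of buildPriorityCommands(priority)) {
    console.log(command.join(" "));
  }
}
```

After these commands, eth1 holds the lowest route metric, so traffic leaves through the 4G/5G router until the values are swapped back.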

With that, I had a working software failover system that switches between two connections on Linux. 😄

I hope you enjoyed this blog post.