How to run the BFD protocol on kubernetes for fast BGP convergence
2021-08-18 - Metallb does not support it yet
Tags: k3s kubernetes
Introduction
I am currently playing with metallb for a bare metal setup of mine that sits behind a router/firewall I cannot have kubernetes reconfigure. I am therefore planning to have that router statically map public ips to virtual ips managed by metallb, a load balancer implementation for bare metal kubernetes clusters made just for this kind of situation.
Metallb has two ways of advertising its virtual ips to the world. The first is the layer2 mode, which is unsatisfactory to me because metallb does not speak vrrp: the nodes advertise the virtual ips with their own mac addresses. Because of that, failing over when a node fails (even if you drain it gracefully) takes a long time, and there is no way to speed it up.
That leaves me with the bgp mode, which works fine as long as there is no abrupt failure of the node the router/firewall is currently routing to. When an abrupt failure happens, you have to wait for the bgp session timeout before the router/firewall converges. Draining a node works because the bgp session gets properly closed; only abrupt failures are a problem in this mode.
This problem is well known and usually solved with bfd, but according to https://github.com/metallb/metallb/issues/396 it is neither supported nor planned.
Bird to the rescue
There are not many well known software BFD implementations. There are several github projects, but I wanted something robust and battle tested, and looked to bird for that. It is an amazing and very solid piece of software that has served me well for years and that I trust. It supports BFD, so let's use it!
One easy way to solve the problem would be to install bird directly on the nodes and be done with it! But that would be too easy, and at this point I wanted to try running BFD as a daemonset and see how it goes from there.
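For completeness, the router/firewall side also has to speak bfd and tie its bgp sessions to it, otherwise nothing changes. If that box ran bird too, the relevant configuration would look something like this sketch (the AS numbers and addresses are made up for the example):
protocol bgp node1 {
  local as 64512;                # made-up AS number for the router/firewall
  neighbor 10.2.21.10 as 64513;  # made-up address of a kubernetes node
  bfd on;                        # tear the bgp session down as soon as bfd declares the peer dead
  ipv4 {
    import all;
    export none;
  };
}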
Making an image
Since bird is packaged in Alpine Linux, I wrote the following script to build an image. As I am learning buildah, that is what I tried to use here:
#!/usr/bin/env bash
set -eu

# Find the name of the latest Alpine minirootfs tarball
ALPINE_LATEST=$(curl --silent https://dl-cdn.alpinelinux.org/alpine/latest-stable/releases/x86_64/ |
    perl -lane '$latest = $1 if $_ =~ /^<a href="(alpine-minirootfs-\d+\.\d+\.\d+-x86_64\.tar\.gz)">/; END {print $latest}')

# Download it unless we already have it
if [ ! -e "./${ALPINE_LATEST}" ]; then
    echo "Fetching ${ALPINE_LATEST}..."
    curl --silent "https://dl-cdn.alpinelinux.org/alpine/latest-stable/releases/x86_64/${ALPINE_LATEST}" \
        --output "./${ALPINE_LATEST}"
fi

# Build from scratch: unpack the minirootfs, install bird and add the entry point
ctr=$(buildah from scratch)
buildah add "$ctr" "${ALPINE_LATEST}" /
buildah run "$ctr" /bin/sh -c 'apk add --no-cache bird'
buildah add "$ctr" entry-point.sh /
buildah config \
    --author 'Julien Dessaux' \
    --cmd '[ "/usr/sbin/bird", "-d", "-u", "bird", "-g", "bird", "-s", "/run/bird.ctl", "-R", "-c", "/etc/bird.conf" ]' \
    --entrypoint '[ "/entry-point.sh" ]' \
    --port '3784/udp' \
    "$ctr"
buildah commit "$ctr" adyxax/bfd
buildah rm "$ctr"
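Assuming the script above is saved as build.sh (the file name is mine), building and pushing the result goes something like this; the tag matches the date-based one referenced in the manifest below:
buildah login quay.io
./build.sh
buildah push adyxax/bfd docker://quay.io/adyxax/bfd:2021081806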
I wrote the following entry-point script to generate the configuration. It needs to be dynamic because the router id has to be set to the host IP of the node running the pod, which is only known at runtime:
#!/bin/sh
set -eu

# Write the node specific router id first, then append the static template
printf 'router id %s;\n' "${BIRD_HOST}" > /etc/bird.conf
cat /etc/bird-template.conf >> /etc/bird.conf

# Hand over to bird (the image's cmd)
exec "$@"
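For example, on a node whose host IP is 10.2.21.10 (a made-up address), the generated /etc/bird.conf starts with:
router id 10.2.21.10;
followed by the contents of /etc/bird-template.conf.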
Running the image
The image is publicly available; it is referenced in the following manifest. Just remember to check the tags on https://quay.io/repository/adyxax/bfd?tab=tags in case I updated the image for a new Alpine or bird release.
For now I chose to run bfd from its own namespace:
apiVersion: v1
kind: Namespace
metadata:
  name: bfd
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: bfd
  name: config
data:
  bird.conf: |
    protocol device {
    }
    protocol direct {
      disabled; # Disable by default
      ipv4;
      ipv6;
    }
    protocol kernel {
      ipv4 {
        export all;
      };
    }
    protocol kernel {
      ipv6 { export all; };
    }
    protocol static {
      ipv4;
    }
    protocol bfd firewalls {
      neighbor 10.2.21.1;
      neighbor 10.2.21.2;
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: bfd
  name: bird
  labels:
    app: bird
spec:
  selector:
    matchLabels:
      app: bfd
  template:
    metadata:
      labels:
        app: bfd
    spec:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
      containers:
        - name: bfd
          image: quay.io/adyxax/bfd:2021081806
          ports:
            - containerPort: 3784
              hostPort: 3784
              protocol: UDP
              name: bfd
          volumeMounts:
            - name: config-volume
              mountPath: /etc/bird-template.conf
              subPath: bird.conf
          env:
            - name: BIRD_HOST
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]
          securityContext:
            capabilities:
              add: ["NET_BIND_SERVICE", "NET_RAW", "NET_ADMIN", "NET_BROADCAST"]
      volumes:
        - name: "config-volume"
          configMap:
            name: config
      hostNetwork: true
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
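Assuming the manifest above is saved as bfd.yaml (the file name is mine), deploying it and checking that a pod runs on every node looks like:
kubectl apply -f bfd.yaml
kubectl -n bfd get pods -o wide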
I took the list of capabilities from bird's source code, and the idea of fetching the host IP address through the downward API came from metallb's daemonset manifest. Given all this, it works perfectly!
Diagnosing
You can exec into a container and run the bird client from there:
kubectl -n bfd exec -ti bird-55sl7 -- birdc
BIRD 2.0.8 ready.
bird> show bfd sessions
firewalls:
IP address                Interface  State      Since         Interval  Timeout
10.2.21.1                 eth0       Up         16:51:23.162    1.000    0.000
10.2.21.2                 eth0       Down       16:51:23.162    1.000    0.000
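Since the image's cmd runs bird in the foreground with -d, its log messages also end up in the pod logs, which you can follow with a label selector:
kubectl -n bfd logs -l app=bfd --tail=20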
Ideas for improvements
A good first improvement would be to handle BFD authentication.
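I have not tried it yet, but bird 2.0.8 added BFD authentication support, so something along these lines inside the bfd protocol block should be the starting point (the password and options are placeholders, check bird's documentation for the exact syntax):
protocol bfd firewalls {
  interface "*" {
    authentication meticulous keyed sha1;
    password "some-secret" { id 1; };
  };
  neighbor 10.2.21.1;
  neighbor 10.2.21.2;
}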
Another, more challenging improvement would be to run this in metallb's namespace, use metallb's configmap to get the peers' IP addresses, and respect the node selector expressions so that each bird process only handles what is relevant to its node.