Topics

Is there a complete solution to the split brain issue caused by network loss between the MASTER/BACKUP nodes?

pingao.yu@...
 

Hi,
We are exploring Keepalived as one potential tool for managing floating virtual IP addresses for an Active/Standby high-availability pair. If the network connection between the pair is lost for some reason, the keepalived process on each node will take on the MASTER role, assuming the other end is dead. As a result, we run into a split-brain situation where both hosts own the VIPs.

So, I wonder whether there is a complete solution to this issue.

Thanks
Pingao

vot4anto@...
 

I think one possible approach is at the design level: avoid having just a pair of nodes and use at least three of them.

Quentin Armitage
 

On Mon, 2019-07-15 at 07:38 -0700, pingao.yu@... wrote:
Hi,
We are exploring Keepalived as one potential tool for managing floating virtual IP addresses for an Active/Standby high-availability pair. If the network connection between the pair is lost for some reason, the keepalived process on each node will take on the MASTER role, assuming the other end is dead. As a result, we run into a split-brain situation where both hosts own the VIPs.

So, I wonder whether there is a complete solution to this issue.

Thanks
Pingao

There is no complete solution to the split brain problem, although for any specific network/system configuration it may be possible to run some checks via track scripts to determine the situation. However, it will always be difficult to determine whether the other system is simply not reachable or is down. Also, if a system is totally unreachable, does it matter if it becomes master, since presumably no other system is able to see it? If there is a loss of network connectivity between two VRRP instances, then presumably there are also bigger network issues involved as well.

VRRP should be seen as an opportunity to improve network resilience, but it is not a perfect solution on its own. Indeed, there never can be a perfect solution.
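As an illustration of the track-script idea mentioned above, something like the following could make an isolated node give up the VIPs when it can no longer reach a known upstream address. The gateway IP, VIP, and interface name are placeholders, and whether this check is appropriate depends entirely on your topology:

```
# Hypothetical sketch: treat loss of the upstream gateway as a fault,
# so a node that is cut off releases the VIPs instead of holding them.
vrrp_script chk_gateway {
    script "/usr/bin/ping -c 1 -W 1 192.0.2.1"   # placeholder gateway address
    interval 2    # run every 2 seconds
    fall 3        # 3 consecutive failures -> script considered down
    rise 2        # 2 consecutive successes -> script considered up
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10/24   # placeholder VIP
    }
    track_script {
        chk_gateway
    }
}
```

With no `weight` configured on the script, a script failure puts the instance into FAULT state rather than merely reducing its priority.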

Quentin Armitage


Bernd Naumann
 

On 15.07.19 16:38, pingao.yu@... wrote:
Hi,
We are exploring Keepalived as one potential tool for managing floating virtual IP addresses for an Active/Standby high-availability pair. If the network connection between the pair is lost for some reason, the keepalived process on each node will take on the MASTER role, assuming the other end is dead. As a result, we run into a split-brain situation where both hosts own the VIPs.
So, I wonder whether there is a complete solution to this issue.
Hey,

It depends on your setup, but if, for example, you have VRRP between two physical machines, you can plug a dedicated cable in between both, so the chances are really low that *this* cable, or one of its interfaces, breaks.

Furthermore, you can set up two vrrp instances, one for your dedicated link and another for the interface where you want to assign the VIP, and then group these vrrp instances together. How to do this is well documented.

Also, in each vrrp instance you can add a track_interface block, so the vrrp instance not only does the announcements but also detects when /another/ interface goes down. This can be extended with track scripts and so on.
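A minimal sketch of that layout (interface names, router IDs, and the address are placeholders):

```
# Instance watching the dedicated back-to-back link
vrrp_instance VI_LINK {
    state BACKUP
    interface eth1            # dedicated cable between the two machines
    virtual_router_id 52
    priority 100
    advert_int 1
}

# Instance owning the VIP on the service network
vrrp_instance VI_SERVICE {
    state BACKUP
    interface eth0            # interface where the VIP is assigned
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10/24
    }
    track_interface {
        eth1                  # also fail over if the dedicated link drops
    }
}

# Keep both instances in the same state
vrrp_sync_group VG_1 {
    group {
        VI_LINK
        VI_SERVICE
    }
}
```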

If you run vrrp between two VMs, you can attach more interfaces to your VMs, but making the underlying network resilient is left as an exercise. My experience is that if you break the underlay network and your overlay dies with it, you indeed have another issue to solve first ;)


Hope this helps in some way. My suggestion is to study the man page (a.k.a. the documentation), get an idea of what keepalived is able to track and perform, and then start with a minimal subset of it.

But IMHO many of the default options are not good because they are too slow. Take the GARP delay: if you wait around 5 seconds before the master starts to send GARP after a takeover, you lose a lot of time, and the chances are high that network flows break.
I send the GARP immediately (delay=0) and refresh it from time to time (every 60 s). If you need sub-second failure detection you can make use of BFD, which is implemented in keepalived and also well documented. (Having a look at the BFD RFC does not do any harm either.)
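Expressed as keepalived configuration, those values would look like this (these are the global settings; per-instance equivalents garp_master_delay and garp_master_refresh also exist):

```
global_defs {
    vrrp_garp_master_delay 0      # no delayed second GARP burst after takeover
    vrrp_garp_master_refresh 60   # re-send gratuitous ARP every 60 seconds
}
```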

Best,
Bernd

Quentin Armitage
 

On Thu, 2019-10-17 at 12:01 +0200, Bernd Naumann wrote:
But IMHO many of the default options are not good because they are too slow. Take the GARP delay: if you wait around 5 seconds before the master starts to send GARP after a takeover, you lose a lot of time, and the chances are high that network flows break.
I send the GARP immediately (delay=0) and refresh it from time to time (every 60 s). If you need sub-second failure detection you can make use of BFD, which is implemented in keepalived and also well documented. (Having a look at the BFD RFC does not do any harm either.)

Best,
Bernd

Hi,

garp_master_delay is the delay before the SECOND set of GARP messages is sent. The first set of GARP messages is always sent exactly when the VRRP instance transitions to master (if you run keepalived with the -D command line option, it will log when it is sending GARP messages). Setting garp_master_delay to 0 means that the second set of GARP messages is not sent, and what you are seeing is the first set of GARP messages.

VRRP version 3 supports sub-second failure detection, since the interval between adverts is specified in centi-seconds (1/100ths of a second), and so if advert_int is 0.01 and the VRRP instances are using priorities 250 or higher, transition from backup to master will occur within 0.03024 seconds of a failure of the old master. If that is not quick enough, then indeed BFD can provide faster transition times.
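As a sketch of the sub-second setup described above (the instance details are placeholders):

```
global_defs {
    vrrp_version 3            # default is 2; v3 allows centi-second adverts
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 250              # a high priority keeps the skew time small
    advert_int 0.01           # 10 ms advert interval
    virtual_ipaddress {
        203.0.113.10/24
    }
}
```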

Bernd, if there are other parameters that you think should be changed to make keepalived quicker, although it would not be right to change the defaults now, we could add a "fast" option to set a number of parameters to make keepalived quicker. If you provide a list of options, along with suggested values, that you think should be set with a "fast" option, we can look at implementing that.

If you are using VMACs, you may find that with the latest code garp_refresh is no longer needed. There was a problem with IGMP/MLD messages being sent from the VMAC interface of backup instances, causing switches to update their MAC address cache to point to the backup instance for the VRRP MAC address (00:00:5e:00:0x:xx) until the master sends its next advert. Commits I pushed on Monday, and updated today, resolve this issue at last.

I hope that helps,
Quentin

Bernd Naumann
 

On 17.10.19 13:39, Quentin Armitage wrote:

garp_master_delay is the delay before the SECOND set of GARP messages is sent. The first set of GARP messages is always sent
exactly when the VRRP instance transitions to master (if you run keepalived with the -D command line option, it will log
when it is sending GARP messages). Setting garp_master_delay to 0 means that the second set of GARP messages is not sent,
and what you are seeing is the first set of GARP messages.
Oops, I may have confused this. Thanks for pointing it out!

VRRP version 3 supports sub-second failure detection, since the interval between adverts is specified in centi-seconds
(1/100ths of a second), and so if advert_int is 0.01 and the VRRP instances are using priorities 250 or higher, transition
from backup to master will occur within 0.03024 seconds of a failure of the old master. If that is not quick enough, then
indeed BFD can provide faster transition times.
I re-read the doc, and maybe it's time to switch to vrrp version 3 :) (the default is still 2)

```
# VRRP Advert interval in seconds (e.g. 0.92) (use default)
advert_int 1
```
I don't think this needs to be changed in general, but a user may want smaller intervals.

Bernd, if there are other parameters that you think should be changed to make keepalived quicker, although it would not be
right to change the defaults now, we could add a "fast" option to set a number of parameters to make keepalived quicker. If
you provide a list of options, along with suggested values, that you think should be set with a "fast" option, we can look
at implementing that.
If you are using VMACs, you may find that with the latest code garp_refresh is no longer needed. There was a problem with
IGMP/MLD messages being sent from the VMAC interface of backup instances, causing switches to update their MAC address cache
to point to the backup instance for the VRRP MAC address (00:00:5e:00:0x:xx) until the master sends its next advert. Commits
I pushed on Monday, and updated today, resolve this issue at last.
I think we can safely ignore my statement about the slowness, since I misinterpreted the option ;) Maybe I should re-read the whole doc and its defaults.

I think I read in the RFC a while ago that the "original" vrrp could not handle sub-second timers, right?


Thanks for clearing up my incomplete knowledge.


PS: Sadly I have not yet had the ability to test and use VMACs, but it indeed sounds much saner to use these special MAC addresses for VIPs, rather than the hardware address.

Quentin Armitage
 

On Thu, 2019-10-17 at 14:19 +0200, Bernd Naumann wrote:
I think I read in the RFC a while ago that the "original" vrrp could not handle sub-second timers, right?

Yes indeed, VRRP version 2 only allowed advert intervals as a whole number of seconds. VRRP version 3 added support for centi-second intervals.

PS: Sadly I have not yet had the ability to test and use VMACs, but it indeed sounds much saner to use these special MAC addresses for VIPs, rather than the hardware address.

Just add "use_vmac" in the vrrp_instance block, and keepalived does all the rest for you!
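For example (interface, router ID, and address are placeholders):

```
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    use_vmac                  # keepalived creates a macvlan interface with the
                              # VRRP MAC 00:00:5e:00:01:33 for virtual_router_id 51
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        203.0.113.10/24
    }
}
```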

Quentin