During a recent project, I built a Juniper SRX cluster where a Reth was connected via a LAG to a switch, which in turn connected it to the Internet. In case of a failure, the Reth should fail over to the second node, where the second half of a 4-cable LAG was configured. That LAG was connected to a second LAG on the same switch. Just like the documentation says it should be.
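As a rough sketch of that setup (the interface names, redundancy-group number, and addresses below are hypothetical, not the customer's actual config), a Reth built on top of a LAG looks something like this in Junos: two child links per node are given the same redundant-parent, and LACP is enabled on the Reth itself:

```
set chassis cluster reth-count 2
set chassis cluster redundancy-group 1 node 0 priority 100
set chassis cluster redundancy-group 1 node 1 priority 1
set interfaces ge-0/0/2 gigether-options redundant-parent reth0
set interfaces ge-0/0/3 gigether-options redundant-parent reth0
set interfaces ge-5/0/2 gigether-options redundant-parent reth0
set interfaces ge-5/0/3 gigether-options redundant-parent reth0
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth0 redundant-ether-options lacp active
set interfaces reth0 unit 0 family inet address 192.0.2.2/24
```

On the switch side, the two links to each node form their own LAG, so only the bundle facing the active node carries traffic.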
Sounds easy right?..
Well.. it did, BUT.. during extensive Systems Acceptance Tests, we found out that on regular occasions the second node, the one that was NOT primary for the Reth, suffered from ip-monitoring reachability problems when testing the Internet connectivity.
The way it is supposed to work is that the secondary node should, on a regular basis, verify connectivity to the monitored IP address (usually the default gateway/router) via a secondary IP address.
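A hedged sketch of such an ip-monitoring configuration (weights, addresses, and the RG number are made up for illustration; the secondary-ip-address is what the backup node sources its probes from):

```
set chassis cluster redundancy-group 1 ip-monitoring global-weight 255
set chassis cluster redundancy-group 1 ip-monitoring global-threshold 100
set chassis cluster redundancy-group 1 ip-monitoring family inet 192.0.2.1 weight 255
set chassis cluster redundancy-group 1 ip-monitoring family inet 192.0.2.1 interface reth0.0 secondary-ip-address 192.0.2.3
```

When the accumulated weight of failed monitored IPs reaches the global-threshold, the global-weight is deducted from the redundancy group's threshold and a failover is triggered.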
And for some inexplicable reason.. that started failing intermittently. Not very nice, as it meant that the secondary node declared the Reth on that node unfit for action. In other words: the primary node's Reth could NOT FAIL OVER.
On this RG, both ip-monitoring and interface monitoring were configured, which are apparently both dataplane functionalities.
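For completeness, interface monitoring on the same RG would look roughly like this (again with hypothetical interface names matching the LAG sketch above; each child link gets its own weight):

```
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/2 weight 255
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/3 weight 255
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/2 weight 255
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/3 weight 255
```

With a weight of 255 per interface, a single link-down event is enough to trip the RG's failover threshold.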
I cannot find anything in the documentation that states that ip-monitoring and interface monitoring on a Reth that consists of a LAG is not supported. But because of a remark on this Juniper website, where they state:
“…do not recommend configuring chassis cluster IP monitoring on Redundancy Group 0 (RG0) for SRX Series devices.”
I became suspicious.
So.. I turned OFF interface monitoring and left ip-monitoring in place. And voilà, it works!
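In terms of the example stanzas above, the workaround amounts to deleting the interface-monitor configuration from the affected redundancy group (RG number hypothetical; in a chassis cluster the commit is synchronized to the other node automatically):

```
delete chassis cluster redundancy-group 1 interface-monitor
commit
```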
The downside is of course that in case of an interface failure, your failover time is now a lot longer, as the ip-monitoring will have to time out first, whereas interface monitoring fails over virtually instantaneously. But the customer had no problem with that, so this was accepted as a workable solution.
For all the other Reths, which did not use ip-monitoring, interface monitoring was left in place and worked admirably.
I hope this helps you; the ip-monitoring failures on the 2nd node were intermittent and would NOT go away with any restart I could find. 🙁
And let’s face it: who wants to restart ANYTHING on a running production cluster?!