Dropping Connections

Posted: 2016/05/27 19:43:26
by rotorboy
We're having a strange issue in one of our networks used for website hosting and email services. We have a number of XenServer hosts, each with any number of CentOS VMs. Within the VMs there may be one or more public IP addresses, using eth0:1 to eth0:xx.
Up until recently these were all working normally. A few days ago we started seeing SSH connections dropping randomly. At first it was intermittent, and then on some VMs it became frequent to the point that we couldn't do anything. This also seemed to be affecting the Apache connections, as the websites would suddenly go offline. The only way to restore service was to run a small script that goes through and does an ifup eth0:xx for each virtual interface as well as the base eth0 interface.
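
For anyone wanting to reproduce the workaround, here is a rough sketch of that kind of script. It is written in Python 3 purely for illustration and is not the script actually in use. The interface name, the number of aliases, and the choice to bring up the base interface first are all assumptions; adjust them to match the VM.

```python
import subprocess

def ifup_commands(base="eth0", alias_count=3):
    """Return the ifup commands for the base interface and its aliases.

    base and alias_count are illustrative assumptions, not values from the post.
    """
    names = [base] + [f"{base}:{n}" for n in range(1, alias_count + 1)]
    return [["ifup", name] for name in names]

def bring_up_all(base="eth0", alias_count=3, dry_run=True):
    """Run (or, with dry_run=True, only print) each ifup command in turn."""
    for cmd in ifup_commands(base, alias_count):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Needs root; a failure on one alias does not stop the rest.
            subprocess.call(cmd)

if __name__ == "__main__":
    bring_up_all()  # dry run: prints the commands without touching the network
```

With dry_run left at True it only prints what it would do, which is a safe way to check the list of interfaces before running it for real.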

Having seen this before, I thought there was an IP conflict; however, extensive searching has not turned up any duplicate IPs. Next we searched for duplicate MAC addresses. We didn't find any of these either, but to be safe we went ahead and changed some of the MAC addresses manually. Fast forward a few days and it seems like whatever is causing this problem is getting worse. It's affecting about 20% of the CentOS 5 VMs on the network. If we run the ifup on the virtual interfaces every few seconds, the sites and SSH connections stay up most of the time.

I've been checking logs and running arpwatch, but so far I'm not finding any clues that point to the cause. We've cold booted some of the VMs, the host machines, the main switch, and an upstream switch. Loads and network traffic have been within normal ranges. No strange logins or other activity that we have found so far. Has anyone else seen something like this?

Re: Dropping Connections

Posted: 2016/05/28 17:13:22
by aks
Hmm, a bit of a mystery.

So let's be clear, the interface is going into the "down" state "all by itself" - otherwise you'd have to ifdown and then ifup the interface?

If that's the case, Google link flap and link flap prevention. Also check CPU utilization.

Re: Dropping Connections

Posted: 2016/05/28 19:26:32
by TrevorH
Mostly it sounds like duplicate IP addresses. If you're hosting other people then maybe they are just adding an extra IP that they shouldn't be using?

Re: Dropping Connections

Posted: 2016/05/28 20:08:20
by rotorboy
I'll do some googling on link flap etc. I was using different terms so maybe that'll get me what I need.

There's nothing in the logs to indicate the interface has been brought down or disconnected. The connection simply stops responding to some or all IPs. It seems exactly like what you'd get if an IP is being brought up in duplicate across a different system or possibly even if another system fires up with a duplicate MAC address, except I'm not finding any evidence that anyone else is trying to bring up my IPs and I've tried changing MAC addresses to no avail.

Over the last 18 hours or so arpwatch has reported the correct MAC and IP for most of the IPs in the block. So far it has reported only one case where an IP moved between two MAC addresses. I thought I had found something there until I realized that the second MAC was eth1 on the same machine. I'm not sure why the server had the IP on eth0 and then switched it to eth1 for a little while, but I don't think that was anything more than an unrelated blip. The IP didn't actually go down during that time.
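
The check described above (which IPs have been seen with more than one MAC, and whether all of those MACs belong to the same box) could be sketched roughly like this. It assumes the observations have already been pulled out of arpwatch into (ip, mac) pairs by some other means; the function names and the sample addresses are made up for illustration.

```python
from collections import defaultdict

def ips_with_multiple_macs(observations):
    """Map each IP seen with more than one MAC to the sorted list of those MACs."""
    seen = defaultdict(set)
    for ip, mac in observations:
        seen[ip].add(mac.lower())  # normalise case so aa:BB and AA:bb compare equal
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}

def explained_by_local_nics(macs, local_macs):
    """True when every MAC seen for an IP belongs to one known machine."""
    return set(macs) <= {m.lower() for m in local_macs}

if __name__ == "__main__":
    # Hypothetical sample data: documentation IPs and invented MACs.
    sample = [
        ("192.0.2.10", "00:16:3e:00:00:01"),
        ("192.0.2.10", "00:16:3E:00:00:02"),
        ("192.0.2.11", "00:16:3e:00:00:03"),
    ]
    flips = ips_with_multiple_macs(sample)
    for ip, macs in flips.items():
        same_box = explained_by_local_nics(
            macs, ["00:16:3e:00:00:01", "00:16:3e:00:00:02"]
        )
        print(ip, macs, "same machine" if same_box else "possible conflict")
```

An IP whose MACs are all on one machine (the eth0/eth1 case) gets flagged as harmless; anything else is worth a closer look.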

I'm in the data centre now and will try swapping switches to see if maybe the switch is messed up. If the switch has a memory problem or something else is screwed up, I'm thinking that whenever we run the ifup eth0:xx the server advertises the MAC and the route gets re-established, at least temporarily. I might just be grasping at straws.
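
To make the "advertises the MAC" idea concrete: the usual mechanism for that is a gratuitous ARP, a broadcast ARP packet in which the sender and target IP are the same, so switches and neighbours refresh their tables. Whether ifup on these VMs actually sends one is an assumption here, not something confirmed in this thread. The sketch below only builds the frame bytes so the layout can be inspected; it does not send anything, and the MAC and IP in the example are invented.

```python
import socket
import struct

def gratuitous_arp_frame(mac, ip):
    """Build a broadcast ARP request in which sender IP == target IP."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + mac_bytes + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes    # sender hardware and protocol address
    arp += b"\x00" * 6 + ip_bytes  # target hardware zeroed, target IP = own IP
    return ethernet + arp

if __name__ == "__main__":
    frame = gratuitous_arp_frame("00:16:3e:00:00:01", "192.0.2.10")
    print(len(frame), frame.hex())
```

Some implementations announce with an ARP reply (opcode 2) instead of a request, so a capture on the wire may not match this byte for byte.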

Thanks for the help so far.