Hosted VPS server very slow... but only at specific times of day

Posted: 2012/03/02 07:29:35
by Miles

I began using a hosted VPS with CentOS 5 about a month ago, and until 3-4 days ago it had been working great.

However, about 3-4 days ago, I began seeing a seriously strange issue: the server begins to slow way down for nearly all connections only at specific times of day.

These times are (on server time, which is MST): 8 AM, 1 PM, 6 PM, and 10 PM. Each time, it lasts for almost exactly one hour, sometimes bleeding into the next hour by about 10 minutes (that is: the 1 PM 'episode' lasts until about 2:10, and then clears up).

I'll try to provide as much information as I can, as well as the troubleshooting steps I've gone through on my own:

My server runs an application that is based on Perl and PostgreSQL; users access it as a web site. PHP and MySQL are installed, but PHP is only used for phpPgAdmin, and MySQL isn't used at all. For the first three weeks on this server, it worked fine (though with occasional timeout errors from PostgreSQL here and there that were generally 'fixed' by refreshing the page).

I noticed some maintenance scripts that came with the application I'm running (I didn't write it; it's a software package), which were meant to be run as cron jobs. They are simple scripts that routinely delete old entries in the database and send emails to certain users about certain events. I tested the scripts, and they ran fine, so I went ahead and set up some cron jobs for them. I did notice one of the cron jobs complaining about a missing Perl module (ParanoidAgent), so I installed it with CPAN. Everything seemed fine. I just happened to set up the 'run once an hour' cron job just before 1 PM.

The "issue" in particular is that the server seems to bog down at the specific times of day mentioned above; this began on the 27th at 1 PM. Specifically, ssh, Perl, and PostgreSQL all seem to be affected, but a plain HTML document loads just fine with no issues. Commands issued over ssh sometimes take upward of a minute to respond, and attempts to connect by ssh during one of these hours always take a long time and sometimes time out. Perl and PostgreSQL return timeout errors to anyone who tries to use them. In particular, these errors:

[code]DBD::Pg::st execute failed: ERROR: canceling statement due to statement timeout

Timeout waiting for output from CGI script[/code]


I've scoured every log I can find and can only find references similar to those above: timeout errors with Perl and PostgreSQL that start at the beginning of certain hours and end at the turn of the next hour. There are no clues as to what could be causing it in the first place.
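To make the hourly pattern easier to see, the timeout lines in a log can be bucketed by hour. This is only a minimal sketch: it assumes Python 3 is available (CentOS 5 does not ship it by default), that each log line carries an HH:MM:SS timestamp, and that the relevant lines contain the word "timeout". The function name `timeouts_per_hour` is made up for illustration, and the log path is passed in by hand.

```python
import re
import sys
from collections import Counter

# Assumption: the first HH:MM:SS on a line is that line's timestamp.
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def timeouts_per_hour(lines):
    """Count lines mentioning 'timeout', keyed by the hour of their timestamp."""
    counts = Counter()
    for line in lines:
        if "timeout" not in line.lower():
            continue  # not a timeout message
        match = TIME_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 this_script.py /path/to/some.log
    with open(sys.argv[1], errors="replace") as handle:
        for hour, count in sorted(timeouts_per_hour(handle).items()):
            print("%02d:00  %d" % (hour, count))
```

If the counts pile up in the same four hours every day, that is at least something concrete to hand to the host instead of a description.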

As you could probably imagine, my first guess was that it was the cron job. Maybe something went wrong? I disabled the cron jobs, but I didn't see any relief from the server. My next instinct was to try to restart the services (both crond and postgresql); the command took a long time, but it did restart the services... and nothing changed. My next instinct was to reboot the VM with the 'reboot' command. Even after booting back up, the horrible lag continued (and in fact, it took a long time to reboot). Going through logs later, I realized that in actuality, the problems started mere minutes before the cron job was meant to take place: the cron job was scheduled for 3 minutes into every hour, and the problem began at exactly 1 PM.

So, I finally broke down and contacted my host's customer support to ask if maybe something happened on their end. My VM didn't show any unusual activity, but that doesn't mean it wasn't something on the host machine or even in their network somewhere. They said that there was an 'abusive user' on the node that they had 'removed,' and that was that. Taking their word for it, I went on with my day (as by the time they got back to me, the lag had stopped). I decided to turn the cron jobs back on, and they worked just fine.

When it happened again later the same day, I sent them another email, and in the meantime used top and iftop to see if anything was going wrong there. Nothing seemed to be taking up an unusual amount of resources, and nothing unusual was running. The response to my email was again "there was an abusive user, we removed them." Despite the host saying that it was an external influence, and my monitoring tools showing nothing out of the ordinary taking up resources, I went ahead and used CPANPLUS to uninstall ParanoidAgent, since that was something I remembered installing just before the problems started. I noticed here that the cron jobs were executing without any hassle even during these hour-long episodes.
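One thing that per-process tools like top and iftop can miss on a VPS is CPU "steal" time, i.e. time the hypervisor spent running something other than this VM. The sketch below estimates it from two readings of the aggregate `cpu` line in /proc/stat. It assumes Python 3 and a virtualization platform that actually reports steal (some container-based setups may always show zero); the helper names are hypothetical. top normally shows the same figure as `st` in its CPU summary line.

```python
import os
import time


def parse_cpu_line(line):
    """Return (steal, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal [guest ...]
    steal = values[7] if len(values) > 7 else 0
    # Guest time is already counted in user, so sum only the first eight fields.
    total = sum(values[:8])
    return steal, total


def steal_percent(before, after):
    """Percentage of CPU time stolen between two 'cpu' line samples."""
    steal_a, total_a = parse_cpu_line(before)
    steal_b, total_b = parse_cpu_line(after)
    elapsed = total_b - total_a
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal_b - steal_a) / elapsed


def read_cpu_line(path="/proc/stat"):
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_line()
    time.sleep(1)
    print("steal: %.1f%%" % steal_percent(first, read_cpu_line()))
```

A high steal percentage during the slow hours and a low one otherwise would point toward contention on the host node rather than anything inside the VM.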

The next day, I counted three or four more instances, and was beginning to notice the pattern I mentioned above. I again contacted my host, and they responded rather simply with "There were three abusive users; we removed them." Just to be safe, I disabled the cron jobs again (simply by removing their lines from the crontab, if you were wondering).

However, I had noticed that the problems were starting at EXACTLY the times I mentioned above. I could use 'date' to watch the seconds count by, and as soon as the clock hit 1PM, the problems began again. To test if it was something in my VM reacting to that time period specifically or not, I changed the time zone to something else... to find that it didn't make any difference on when the problems occurred. I sent another email with the information I had gathered, telling them about the pattern I had found, and how the server issues begin at precisely the beginning of an hour according to the system clock. I haven't received a response yet.
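Changing the VM's time zone only changes how the clock is displayed, not the underlying instant, so anything scheduled outside the VM would keep firing at the same moments. When reporting the pattern to the host, it may help to state the episode start times in UTC as well. A small sketch, assuming MST is a fixed UTC-7 offset (no daylight saving in effect) and using an arbitrary example date:

```python
from datetime import datetime, timedelta, timezone

# Assumption: MST as a fixed UTC-7 offset.
MST = timezone(timedelta(hours=-7), "MST")


def mst_hours_to_utc(hours, day=datetime(2012, 3, 1)):
    """Convert whole hours on the given MST day to the matching UTC hours."""
    result = []
    for hour in hours:
        local = day.replace(hour=hour, tzinfo=MST)
        result.append(local.astimezone(timezone.utc).hour)
    return result


if __name__ == "__main__":
    # Episode start times from the post: 8 AM, 1 PM, 6 PM, 10 PM MST
    print(mst_hours_to_utc([8, 13, 18, 22]))  # prints [15, 20, 1, 5]
```

The last two values fall on the following UTC day.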

Are there any suggestions at all as to what could be causing the issue? I'm torn between "The host is screwing something up" and "I screwed something up," and I keep looking for ways to fix it on my end. I've undone pretty much everything I've done over the past 5-6 days, and I can't think of anything else that could possibly be causing the problem. I've looked at the httpd error_log and access_log; the PostgreSQL log only shows me the same errors that they do ('canceled request due to timeout' type errors). I've also looked at the 'secure' log and the 'cron' log. Nothing seems to complain about anything in particular except for timeouts during these specific times.

Posted: 2012/03/04 02:38:49
by pschaff
Welcome to the CentOS fora. Please see the recommended reading for new users linked in my signature.

It's impossible to say anything with certainty from the abundant but largely anecdotal data you have supplied. I'd suspect the hoster and/or still more abusive clients, but that is just a guess.