Severe performance problem requiring reboots

fugtruck
Posts: 39
Joined: 2006/08/18 18:24:08

Severe performance problem requiring reboots

Post by fugtruck » 2011/04/29 16:26:47

I am not sure exactly how to describe the problem I am having, so I will list the symptoms.

I am having a recurring (and seemingly random) problem where a server of mine becomes virtually unresponsive. I can SSH into the server, and some commands execute properly while others just hang. For example, the 'uptime' command executes and returns data, but 'w' just hangs. Some details about the server: CentOS 5.6 x64 with kernel 2.6.18-238.9.1.el5 installed. It has 4 CPU cores and 6GB RAM. It is a virtual machine running on VMware ESXi on a VMFS datastore. The partitions are a combination of ext3 and ext4. The server had been performing fine for the past year and a half and only started having problems within the past week or two.

The uptime command shows a load average greater than 300 (normal for this server is between 5 and 10); however, the vSphere client shows CPU activity dropping to almost nothing when the problem occurs. When it does occur, I have to power the server off and back on, because the restart and shutdown commands just hang. Once the server boots back up, it goes back to performing just fine.
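
A note on the numbers above: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a load of 300 with an idle CPU usually means many tasks are stuck waiting on I/O. The following is only a rough sketch of how one might list them by reading /proc/<pid>/stat. The function names are made up for illustration, it assumes a Python 3 interpreter (not the stock Python on CentOS 5), and it only looks at one entry per process, not at individual threads.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'sys'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found

if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the list is long while the problem is happening, that would be consistent with the load figure coming from blocked I/O rather than CPU work.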

I found the following errors in the syslog when the problem occurs (see below). Any suggestions on what to do about this?

Apr 28 11:55:29 servername kernel: INFO: task pdflush:339 blocked for more than 120 seconds.
Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:55:29 servername kernel: pdflush D ffff8101bbc4f000 0 339 71 340 338 (L-TLB)
Apr 28 11:55:29 servername kernel: ffff8101bf369b70 0000000000000046 ffff81003bf1ed98 ffff8101bf369be8
Apr 28 11:55:29 servername kernel: 0000000000000001 000000000000000a ffff8101bf9a80c0 ffff81000a502080
Apr 28 11:55:29 servername kernel: 00007b9830595b62 00000000000568d1 ffff8101bf9a82a8 000000038004817a
Apr 28 11:55:29 servername kernel: Call Trace:
Apr 28 11:55:29 servername kernel: [] write_cache_pages+0x2ac/0x332
Apr 28 11:55:29 servername kernel: [] :ext4:__mpage_da_writepage+0x0/0x162
Apr 28 11:55:29 servername kernel: [] :jbd2:start_this_handle+0x2e9/0x3b3
Apr 28 11:55:29 servername kernel: [] autoremove_wake_function+0x0/0x2e
Apr 28 11:55:29 servername kernel: [] alternate_node_alloc+0x70/0x8c
Apr 28 11:55:29 servername kernel: [] :jbd2:jbd2_journal_start+0xa1/0xd8
Apr 28 11:55:29 servername kernel: [] :ext4:ext4_da_writepages+0x296/0x4fc
Apr 28 11:55:29 servername kernel: [] do_writepages+0x20/0x2f
Apr 28 11:55:29 servername kernel: [] __writeback_single_inode+0x19e/0x318
Apr 28 11:55:29 servername kernel: [] delayacct_end+0x5d/0x86
Apr 28 11:55:29 servername kernel: [] dequeue_task+0x18/0x37
Apr 28 11:55:29 servername kernel: [] sync_sb_inodes+0x1b5/0x26f
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] writeback_inodes+0x82/0xd8
Apr 28 11:55:29 servername kernel: [] wb_kupdate+0xd4/0x14e
Apr 28 11:55:29 servername kernel: [] pdflush+0x0/0x1fb
Apr 28 11:55:29 servername kernel: [] pdflush+0x151/0x1fb
Apr 28 11:55:29 servername kernel: [] wb_kupdate+0x0/0x14e
Apr 28 11:55:29 servername kernel: [] kthread+0xfe/0x132
Apr 28 11:55:29 servername kernel: [] child_rip+0xa/0x11
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] kthread+0x0/0x132
Apr 28 11:55:29 servername kernel: [] child_rip+0x0/0x11
Apr 28 11:55:29 servername kernel:
Apr 28 11:55:29 servername kernel: INFO: task kswapd0:340 blocked for more than 120 seconds.
Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:55:29 servername kernel: kswapd0 D ffff8101bbc4f000 0 340 71 341 339 (L-TLB)
Apr 28 11:55:29 servername kernel: ffff8101bf36dc10 0000000000000046 0000000000000001 ffffffff800c8994
Apr 28 11:55:29 servername kernel: ffff8101063420c0 000000000000000a ffff8101bf9a8820 ffff8100136b2820
Apr 28 11:55:29 servername kernel: 00007b95f5267fab 000000000009b63d ffff8101bf9a8a08 0000000200000001
Apr 28 11:55:29 servername kernel: Call Trace:
Apr 28 11:55:29 servername kernel: [] __remove_from_page_cache+0x1f/0x6c
Apr 28 11:55:29 servername kernel: [] __pagevec_free+0x21/0x2e
Apr 28 11:55:29 servername kernel: [] release_pages+0x14d/0x15a
Apr 28 11:55:29 servername kernel: [] :jbd2:start_this_handle+0x2e9/0x3b3
Apr 28 11:55:29 servername kernel: [] autoremove_wake_function+0x0/0x2e
Apr 28 11:55:29 servername kernel: [] alternate_node_alloc+0x70/0x8c
Apr 28 11:55:29 servername kernel: [] :jbd2:jbd2_journal_start+0xa1/0xd8
Apr 28 11:55:29 servername kernel: [] :ext4:ext4_release_dquot+0x42/0x7f
Apr 28 11:55:29 servername kernel: [] dqput+0x1be/0x200
Apr 28 11:55:29 servername kernel: [] dquot_drop+0x30/0x5e
Apr 28 11:55:29 servername kernel: [] clear_inode+0xb4/0x123
Apr 28 11:55:29 servername kernel: [] dispose_list+0x41/0xe0
Apr 28 11:55:29 servername kernel: [] shrink_icache_memory+0x1b7/0x1e6
Apr 28 11:55:29 servername kernel: [] shrink_slab+0xdc/0x153
Apr 28 11:55:29 servername kernel: [] kswapd+0x35d/0x495
Apr 28 11:55:29 servername kernel: [] autoremove_wake_function+0x0/0x2e
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] kswapd+0x0/0x495
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] kthread+0xfe/0x132
Apr 28 11:55:29 servername kernel: [] child_rip+0xa/0x11
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] kthread+0x0/0x132
Apr 28 11:55:29 servername kernel: [] child_rip+0x0/0x11
Apr 28 11:55:29 servername kernel:
Apr 28 11:55:29 servername kernel: INFO: task jbd2/sdg1-8:2610 blocked for more than 120 seconds.
Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:55:29 servername kernel: jbd2/sdg1-8 D ffff8101bffa29c0 0 2610 71 2611 2600 (L-TLB)
Apr 28 11:55:29 servername kernel: ffff8101b8623d60 0000000000000046 0000000000000282 ffffffff8002232e
Apr 28 11:55:29 servername kernel: ffff8101b8623cf0 000000000000000a ffff8101bf31b7a0 ffff810137ed2820
Apr 28 11:55:29 servername kernel: 00007b95b09f8cf9 0000000000000be6 ffff8101bf31b988 0000000300000001
Apr 28 11:55:29 servername kernel: Call Trace:
Apr 28 11:55:29 servername kernel: [] __up_read+0x19/0x7f
Apr 28 11:55:29 servername kernel: [] :jbd2:jbd2_journal_commit_transaction+0x191/0x1068
Apr 28 11:55:29 servername kernel: [] autoremove_wake_function+0x0/0x2e
Apr 28 11:55:29 servername kernel: [] lock_timer_base+0x1b/0x3c
Apr 28 11:55:29 servername kernel: [] try_to_del_timer_sync+0x7f/0x88
Apr 28 11:55:29 servername kernel: [] :jbd2:kjournald2+0x9a/0x1ec
Apr 28 11:55:29 servername kernel: [] autoremove_wake_function+0x0/0x2e
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] :jbd2:kjournald2+0x0/0x1ec
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] kthread+0xfe/0x132
Apr 28 11:55:29 servername kernel: [] child_rip+0xa/0x11
Apr 28 11:55:29 servername kernel: [] keventd_create_kthread+0x0/0xc4
Apr 28 11:55:29 servername kernel: [] kthread+0x0/0x132
Apr 28 11:55:29 servername kernel: [] child_rip+0x0/0x11
Apr 28 11:55:29 servername kernel:
Apr 28 11:55:29 servername kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 28 11:55:29 servername kernel: ffff8101891cbd78 0000000000000082 ffff810168a210f8 ffffffff8000d044
Apr 28 11:55:29 servername kernel: ffff8101bf595780 000000000000000a ffff8101b48b3820 ffff810011f72860
Apr 28 11:55:29 servername kernel: 00007b95e2033232 000000000003271a ffff8101b48b3a08 00000001891cbea8
Apr 28 11:55:29 servername kernel: Call Trace:
Apr 28 11:55:29 servername kernel: [] do_lookup+0x65/0x1e6
Apr 28 11:55:29 servername kernel: [] __link_path_walk+0xf90/0xfb9
Apr 28 11:55:29 servername kernel: [] :jbd2:start_this_handle+0x2e9/0x3b3
Apr 28 11:55:29 servername kernel: [] autoremove_wake_function+0x0/0x2e
Apr 28 11:55:29 servername kernel: [] do_gettimeofday+0x40/0x90
Apr 28 11:55:29 servername kernel: [] :jbd2:jbd2_journal_start+0xa1/0xd8
Apr 28 11:55:29 servername kernel: [] :ext4:ext4_setattr+0x1b5/0x339
Apr 28 11:55:29 servername kernel: [] notify_change+0x145/0x2f3
Apr 28 11:55:29 servername kernel: [] do_truncate+0x67/0x82
Apr 28 11:55:29 servername kernel: [] audit_syscall_entry+0x1a4/0x1cf
Apr 28 11:55:29 servername kernel: [] sys_ftruncate+0xe4/0x101
Apr 28 11:55:29 servername kernel: [] tracesys+0xd5/0xe0
Apr 28 11:55:29 servername kernel:
.
.
.
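
With this many traces it can help to condense them. Below is a minimal sketch that reads syslog lines like the ones above and reports which tasks were flagged as blocked and how often each kernel module shows up in the call traces. The regular expressions are my own guess at the line layout based on this excerpt, and the name summarize is just for illustration.

```python
import re
from collections import Counter

# "INFO: task pdflush:339 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# "[...] :jbd2:start_this_handle+0x2e9/0x3b3" (the module prefix is optional)
FRAME_RE = re.compile(r"kernel: \[.*?\] (?::(\w+):)?(\w+)\+0x")

def summarize(lines):
    """Return (blocked tasks, per-module frame counts) for hung-task reports."""
    tasks = []
    modules = Counter()
    for line in lines:
        hung = HUNG_RE.search(line)
        if hung:
            tasks.append((hung.group(1), int(hung.group(2))))
            continue
        frame = FRAME_RE.search(line)
        if frame and frame.group(1):
            modules[frame.group(1)] += 1  # count only frames inside a module
    return tasks, modules

# A few lines copied from the log above.
SAMPLE = """\
Apr 28 11:55:29 servername kernel: INFO: task pdflush:339 blocked for more than 120 seconds.
Apr 28 11:55:29 servername kernel: [] write_cache_pages+0x2ac/0x332
Apr 28 11:55:29 servername kernel: [] :ext4:__mpage_da_writepage+0x0/0x162
Apr 28 11:55:29 servername kernel: [] :jbd2:start_this_handle+0x2e9/0x3b3
Apr 28 11:55:29 servername kernel: INFO: task jbd2/sdg1-8:2610 blocked for more than 120 seconds.
Apr 28 11:55:29 servername kernel: [] :jbd2:jbd2_journal_commit_transaction+0x191/0x1068
""".splitlines()

if __name__ == "__main__":
    tasks, modules = summarize(SAMPLE)
    print(tasks)
    print(modules.most_common())
```

In the traces posted here every blocked task passes through jbd2, the ext4 journalling layer, and the journal thread named in the log is jbd2/sdg1-8.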

gerald_clark
Posts: 10642
Joined: 2005/08/05 15:19:54
Location: Northern Illinois, USA

Severe performance problem requiring reboots

Post by gerald_clark » 2011/04/29 16:37:29

Dying or sleeping drive?
By any chance are you running WD Green drives in a RAID?

fugtruck
Posts: 39
Joined: 2006/08/18 18:24:08

Re: Severe performance problem requiring reboots

Post by fugtruck » 2011/04/29 16:46:26

No, and no. I am running on EqualLogic storage arrays. Other servers on the same physical storage with the same or similar configuration as this one are not having any noticeable problems.

gerald_clark
Posts: 10642
Joined: 2005/08/05 15:19:54
Location: Northern Illinois, USA

Re: Severe performance problem requiring reboots

Post by gerald_clark » 2011/04/29 16:51:40

I see the kernel is current.
There was a glibc update yesterday, or the day before.
Has it been applied?

fugtruck
Posts: 39
Joined: 2006/08/18 18:24:08

Re: Severe performance problem requiring reboots

Post by fugtruck » 2011/04/29 17:01:19

I have glibc version 2.5-58.el5_6.3 installed, which is the latest being offered to me by yum.

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Severe performance problem requiring reboots

Post by TrevorH » 2011/04/29 18:55:49

I'd compare the date the problem started with the patches listed in /var/log/yum.log and see whether the kernel was updated around that time. If it was, you might want to try an older kernel rather than a newer one. The stack trace you posted mentions ext4, so that is a possible candidate for the problem, but I'd also look at whatever driver is in use for the disks themselves.
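
As a rough illustration of that comparison, here is a small sketch that parses yum.log entries and lists what changed inside a date window. It assumes the usual "Mon DD HH:MM:SS Action: package" layout and an English locale for month names. yum.log lines carry no year, so the year is passed in. The helper names and the sample dates are invented for the example and are not taken from the poster's system.

```python
from datetime import datetime

def parse_yum_log(lines, year):
    """Turn yum.log lines into (timestamp, action, package) tuples."""
    events = []
    for line in lines:
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # not a regular log entry
        month, day, clock, action, package = parts
        try:
            when = datetime.strptime(
                "%d %s %s %s" % (year, month, day, clock), "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # timestamp did not match the assumed layout
        events.append((when, action.rstrip(":"), package.strip()))
    return events

def changed_between(events, start, end, name_prefix=""):
    """Keep events inside [start, end] whose package name starts with name_prefix."""
    return [e for e in events
            if start <= e[0] <= end and e[2].startswith(name_prefix)]

if __name__ == "__main__":
    # Hypothetical sample lines; on a real system read /var/log/yum.log instead.
    sample = [
        "Apr 14 09:12:01 Installed: kernel-2.6.18-238.9.1.el5.x86_64",
        "Apr 27 08:30:15 Updated: glibc-2.5-58.el5_6.3.x86_64",
    ]
    events = parse_yum_log(sample, 2011)
    window = changed_between(events, datetime(2011, 4, 10), datetime(2011, 4, 29))
    for when, action, package in window:
        print(when, action, package)
```

Passing name_prefix="kernel" narrows the output to kernel packages, which is the first thing to check against the date the hangs began.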
