hang in ext4_journal_start_sb making an lvm snapshot

Bug #620242 reported by Martin Pool
28
This bug affects 4 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Expired
Undecided
Unassigned

Bug Description

[maverick]
I'm trying to use lvm snapshots under schroot to create a reproducible build environment. The first few uses of them worked, but after that I find more and more processes are getting locked up like this:

[168841.490770] INFO: task flush-252:13:27159 blocked for more than 120 seconds.
[168841.490777] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[168841.490783] flush-252:13 D 000000010100cf86 0 27159 2 0x00000000
[168841.490793] ffff8801f6643ac0 0000000000000046 ffff880100000000 0000000000015900
[168841.490802] ffff8801f6643fd8 0000000000015900 ffff8801f6643fd8 ffff88008269c4a0
[168841.490810] 0000000000015900 0000000000015900 ffff8801f6643fd8 0000000000015900
[168841.490818] Call Trace:
[168841.490831] [<ffffffff81201aad>] ext4_journal_start_sb+0x7d/0x130
[168841.490841] [<ffffffff8107e490>] ? autoremove_wake_function+0x0/0x40
[168841.490850] [<ffffffff811ef6ff>] ext4_da_writepages+0x25f/0x640
[168841.490859] [<ffffffff810530e6>] ? load_balance_fair+0x1a6/0x2a0
[168841.490869] [<ffffffff81109ea1>] do_writepages+0x21/0x40
[168841.490877] [<ffffffff81173db6>] writeback_single_inode+0xe6/0x3f0
[168841.490884] [<ffffffff81174525>] writeback_sb_inodes+0x195/0x280
[168841.490892] [<ffffffff81174f7c>] wb_writeback+0x12c/0x2b0
[168841.490900] [<ffffffff8106ef4c>] ? lock_timer_base+0x3c/0x70
[168841.490908] [<ffffffff8106fa42>] ? del_timer_sync+0x22/0x30
[168841.490915] [<ffffffff811751a9>] wb_do_writeback+0xa9/0x190
[168841.490922] [<ffffffff8106f060>] ? process_timeout+0x0/0x10
[168841.490930] [<ffffffff811752e7>] bdi_writeback_task+0x57/0x160
[168841.490936] [<ffffffff8107e357>] ? bit_waitqueue+0x17/0xd0
[168841.490945] [<ffffffff81119406>] bdi_start_fn+0x86/0x100
[168841.490952] [<ffffffff81119380>] ? bdi_start_fn+0x0/0x100
[168841.490957] [<ffffffff8107df36>] kthread+0x96/0xa0
[168841.490965] [<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10
[168841.490971] [<ffffffff8107dea0>] ? kthread+0x0/0xa0
[168841.490977] [<ffffffff8100aee0>] ? kernel_thread_helper+0x0/0x10
[168841.490984] INFO: task lvcreate:27338 blocked for more than 120 seconds.

Tags: kj-triage
Revision history for this message
Martin Pool (mbp) wrote :
Revision history for this message
Toby McLaughlin (toby-jarpy) wrote :

Also on observed on one of my machines.

  - Linux RAID10 (f2)
  - SATA disks
  - Ext4
  - KVM
  - Multi gigabyte files (qcow2 images)
  - Multiple frozen, unkillable 'lvcreate --snapshot' commands

[1360167.174975] INFO: task flush-251:4:479 blocked for more than 120 seconds.
[1360167.187216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1360167.211359] flush-251:4 D ffff88031d2e33d8 0 479 2 0x00000000
[1360167.211363] ffff88031f7c9bb0 0000000000000046 0000000000015bc0 0000000000015bc0
[1360167.211366] ffff880322589ab0 ffff88031f7c9fd8 0000000000015bc0 ffff8803225896f0
[1360167.211369] 0000000000015bc0 ffff88031f7c9fd8 0000000000015bc0 ffff880322589ab0
[1360167.211371] Call Trace:
[1360167.211377] [<ffffffff811f632d>] ext4_force_commit+0x8d/0xe0
[1360167.211381] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
[1360167.211385] [<ffffffff811de405>] ext4_write_inode+0x75/0x110
[1360167.211388] [<ffffffff81166a64>] writeback_single_inode+0x2b4/0x3d0
[1360167.211390] [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0
[1360167.211393] [<ffffffff810586b0>] ? finish_task_switch+0x50/0xe0
[1360167.211395] [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0
[1360167.211398] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
[1360167.211402] [<ffffffff81558d9d>] ? schedule_timeout+0x19d/0x300
[1360167.211404] [<ffffffff81167abd>] wb_do_writeback+0x12d/0x1a0
[1360167.211406] [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0
[1360167.211409] [<ffffffff8110f546>] bdi_start_fn+0x86/0x100
[1360167.211411] [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100
[1360167.211413] [<ffffffff81084d16>] kthread+0x96/0xa0
[1360167.211416] [<ffffffff810141ea>] child_rip+0xa/0x20
[1360167.211418] [<ffffffff81084c80>] ? kthread+0x0/0xa0
[1360167.211419] [<ffffffff810141e0>] ? child_rip+0x0/0x20

Revision history for this message
nutznboltz (nutznboltz-deactivatedaccount) wrote :

"bdi_sync_writeback()" is the key.

Search for "writeback" in
https://edge.launchpad.net/ubuntu/+source/linux/2.6.32-25.43

Also look for "writeback" in

Bug #543617
Bug #585092
Bug #601361
Bug #620242
Bug #628047

Try the proposed kernel version 2.6.32-25.43, see
https://wiki.ubuntu.com/Testing/EnableProposed

Revision history for this message
Steve Conklin (sconklin) wrote :

We need some testing of the proposed kernel to see whether this has been resolved - please see previous comment for information about how to get the proposed kernel.

Thanks

tags: added: kj-triage
Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 620242] Re: hang in ext4_journal_start_sb making an lvm snapshot

I have 2.6.35-22.32 installed, I've made a snapshot, and I'll report
if the problem recurs.

--
Martin

Revision history for this message
Martin Pool (mbp) wrote :

I'm confused, do you specifically want me to test 2.6.32-25.43, or is 2.6.35-22.32 useful enough?

At any rate after about 24 hours it's good so far.

Revision history for this message
Arie Skliarouk (skliarie) wrote :

Also observed on one of my servers:
 - linux-image-2.6.35-25-server (package version 2.6.35-25.44) (latests from updates of 10.10)
 - software RAID5 on 5 SATA disks
 - LVM, ext4

"ps awx" shows process "[flush-251:0]" in D state. This (and related process) got locked up when there was much of disk contention going on.

Revision history for this message
Arie Skliarouk (skliarie) wrote :

Sorry, my dmesg is different:

[ 2394.850285] Call Trace:
[ 2394.850293] [<ffffffff8159e18d>] schedule_timeout+0x22d/0x310
[ 2394.850299] [<ffffffff8104ee62>] ? enqueue_entity+0x132/0x1b0
[ 2394.850303] [<ffffffff81058884>] ? check_preempt_wakeup+0x1b4/0x290
[ 2394.850306] [<ffffffff8159d306>] wait_for_common+0xd6/0x180
[ 2394.850309] [<ffffffff81056d40>] ? default_wake_function+0x0/0x20
[ 2394.850312] [<ffffffff8159d48d>] wait_for_completion+0x1d/0x20
[ 2394.850317] [<ffffffff811763a9>] sync_inodes_sb+0x89/0xc0
[ 2394.850320] [<ffffffff8117ada0>] ? sync_one_sb+0x0/0x30
[ 2394.850323] [<ffffffff8117ad88>] __sync_filesystem+0x88/0xa0
[ 2394.850325] [<ffffffff8117adc0>] sync_one_sb+0x20/0x30
[ 2394.850329] [<ffffffff811559db>] iterate_supers+0x8b/0xd0
[ 2394.850332] [<ffffffff8117ae15>] sys_sync+0x45/0x70
[ 2394.850337] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b

The "fastpath" makes me think it relates to software RAID somehow.

Revision history for this message
Arie Skliarouk (skliarie) wrote :

And again I am sorry - it looks like it IS related to the start_sb. Because when I look at the stack track of the stuck flush-251:0 process (cat /proc/1428/stack) I see the following output:
[<ffffffff8117dae5>] sync_buffer+0x45/0x50
[<ffffffff8117da96>] __wait_on_buffer+0x26/0x30
[<ffffffff8117ea99>] sync_dirty_buffer+0x69/0xe0
[<ffffffff8122d24d>] jbd2_journal_update_superblock+0xad/0x170
[<ffffffff8122a757>] jbd2_cleanup_journal_tail+0x107/0x1e0
[<ffffffff8122af16>] jbd2_log_do_checkpoint+0x186/0x260
[<ffffffff8122b09e>] __jbd2_log_wait_for_space+0xae/0x1d0
[<ffffffff81226867>] start_this_handle+0x107/0x4b0
[<ffffffff81226de5>] jbd2_journal_start+0xb5/0x100
[<ffffffff81202a68>] ext4_journal_start_sb+0xf8/0x130
[<ffffffff811f06ff>] ext4_da_writepages+0x25f/0x640
[<ffffffff8110aab1>] do_writepages+0x21/0x40
[<ffffffff811759d6>] writeback_single_inode+0xe6/0x3f0
[<ffffffff81176135>] writeback_sb_inodes+0x195/0x280
[<ffffffff81176b8c>] wb_writeback+0x12c/0x2b0
[<ffffffff81176db9>] wb_do_writeback+0xa9/0x190
[<ffffffff81176ef3>] bdi_writeback_task+0x53/0x160
[<ffffffff8111a196>] bdi_start_fn+0x86/0x100
[<ffffffff8107ec56>] kthread+0x96/0xa0
[<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

Revision history for this message
Arie Skliarouk (skliarie) wrote :

Ok, found the real description of the bug over here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/624877

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 620242

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.