Ubuntu
linux package

hang in ext4_journal_start_sb making an lvm snapshot

Bug #620242 reported by Martin Pool on 2010-08-19

This bug affects 4 people

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Expired	Undecided	Unassigned

Bug Description

[maverick]
I'm trying to use lvm snapshots under schroot to create a reproducible build environment. The first few uses of them worked, but after that I find more and more processes are getting locked up like this:

[168841.490770] INFO: task flush-252:13:27159 blocked for more than 120 seconds.
[168841.490777] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[168841.490783] flush-252:13 D 000000010100cf86 0 27159 2 0x00000000
[168841.490793] ffff8801f6643ac0 0000000000000046 ffff880100000000 0000000000015900
[168841.490802] ffff8801f6643fd8 0000000000015900 ffff8801f6643fd8 ffff88008269c4a0
[168841.490810] 0000000000015900 0000000000015900 ffff8801f6643fd8 0000000000015900
[168841.490818] Call Trace:
[168841.490831] [<ffffffff81201aad>] ext4_journal_start_sb+0x7d/0x130
[168841.490841] [<ffffffff8107e490>] ? autoremove_wake_function+0x0/0x40
[168841.490850] [<ffffffff811ef6ff>] ext4_da_writepages+0x25f/0x640
[168841.490859] [<ffffffff810530e6>] ? load_balance_fair+0x1a6/0x2a0
[168841.490869] [<ffffffff81109ea1>] do_writepages+0x21/0x40
[168841.490877] [<ffffffff81173db6>] writeback_single_inode+0xe6/0x3f0
[168841.490884] [<ffffffff81174525>] writeback_sb_inodes+0x195/0x280
[168841.490892] [<ffffffff81174f7c>] wb_writeback+0x12c/0x2b0
[168841.490900] [<ffffffff8106ef4c>] ? lock_timer_base+0x3c/0x70
[168841.490908] [<ffffffff8106fa42>] ? del_timer_sync+0x22/0x30
[168841.490915] [<ffffffff811751a9>] wb_do_writeback+0xa9/0x190
[168841.490922] [<ffffffff8106f060>] ? process_timeout+0x0/0x10
[168841.490930] [<ffffffff811752e7>] bdi_writeback_task+0x57/0x160
[168841.490936] [<ffffffff8107e357>] ? bit_waitqueue+0x17/0xd0
[168841.490945] [<ffffffff81119406>] bdi_start_fn+0x86/0x100
[168841.490952] [<ffffffff81119380>] ? bdi_start_fn+0x0/0x100
[168841.490957] [<ffffffff8107df36>] kthread+0x96/0xa0
[168841.490965] [<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10
[168841.490971] [<ffffffff8107dea0>] ? kthread+0x0/0xa0
[168841.490977] [<ffffffff8100aee0>] ? kernel_thread_helper+0x0/0x10
[168841.490984] INFO: task lvcreate:27338 blocked for more than 120 seconds.

Tags:

Revision history for this message

Martin Pool (mbp) wrote on 2010-08-19:

attachment Edit (141.2 KiB, application/binary)

Revision history for this message

Toby McLaughlin (toby-jarpy) wrote on 2010-08-30:

dmesg output (and mount, mdstat etc.) Edit (97.8 KiB, text/plain)

Also on observed on one of my machines.

  - Linux RAID10 (f2)
  - SATA disks
  - Ext4
  - KVM
  - Multi gigabyte files (qcow2 images)
  - Multiple frozen, unkillable 'lvcreate --snapshot' commands

[1360167.174975] INFO: task flush-251:4:479 blocked for more than 120 seconds.
[1360167.187216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1360167.211359] flush-251:4 D ffff88031d2e33d8 0 479 2 0x00000000
[1360167.211363] ffff88031f7c9bb0 0000000000000046 0000000000015bc0 0000000000015bc0
[1360167.211366] ffff880322589ab0 ffff88031f7c9fd8 0000000000015bc0 ffff8803225896f0
[1360167.211369] 0000000000015bc0 ffff88031f7c9fd8 0000000000015bc0 ffff880322589ab0
[1360167.211371] Call Trace:
[1360167.211377] [<ffffffff811f632d>] ext4_force_commit+0x8d/0xe0
[1360167.211381] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
[1360167.211385] [<ffffffff811de405>] ext4_write_inode+0x75/0x110
[1360167.211388] [<ffffffff81166a64>] writeback_single_inode+0x2b4/0x3d0
[1360167.211390] [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0
[1360167.211393] [<ffffffff810586b0>] ? finish_task_switch+0x50/0xe0
[1360167.211395] [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0
[1360167.211398] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
[1360167.211402] [<ffffffff81558d9d>] ? schedule_timeout+0x19d/0x300
[1360167.211404] [<ffffffff81167abd>] wb_do_writeback+0x12d/0x1a0
[1360167.211406] [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0
[1360167.211409] [<ffffffff8110f546>] bdi_start_fn+0x86/0x100
[1360167.211411] [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100
[1360167.211413] [<ffffffff81084d16>] kthread+0x96/0xa0
[1360167.211416] [<ffffffff810141ea>] child_rip+0xa/0x20
[1360167.211418] [<ffffffff81084c80>] ? kthread+0x0/0xa0
[1360167.211419] [<ffffffff810141e0>] ? child_rip+0x0/0x20

Revision history for this message

nutznboltz (nutznboltz-deactivatedaccount) wrote on 2010-09-03:

"bdi_sync_writeback()" is the key.

Search for "writeback" in
https://edge.launchpad.net/ubuntu/+source/linux/2.6.32-25.43

Also look for "writeback" in

Bug #543617
Bug #585092
Bug #601361
Bug #620242
Bug #628047

Try the proposed kernel version 2.6.32-25.43, see
https://wiki.ubuntu.com/Testing/EnableProposed

Revision history for this message

Steve Conklin (sconklin) wrote on 2010-09-15:

We need some testing of the proposed kernel to see whether this has been resolved - please see previous comment for information about how to get the proposed kernel.

Thanks

Jeremy Foshee (jeremyfoshee) on 2010-09-16

tags:

added: kj-triage

Revision history for this message

Martin Pool (mbp) wrote on 2010-09-20: Re: [Bug 620242] Re: hang in ext4_journal_start_sb making an lvm snapshot

I have 2.6.35-22.32 installed, I've made a snapshot, and I'll report
if the problem recurs.

--
Martin

Revision history for this message

Martin Pool (mbp) wrote on 2010-09-21:

I'm confused, do you specifically want me to test 2.6.32-25.43, or is 2.6.35-22.32 useful enough?

At any rate after about 24 hours it's good so far.

Revision history for this message

Arie Skliarouk (skliarie) wrote on 2011-02-23:

Also observed on one of my servers:
- linux-image-2.6.35-25-server (package version 2.6.35-25.44) (latests from updates of 10.10)
- software RAID5 on 5 SATA disks
- LVM, ext4

"ps awx" shows process "[flush-251:0]" in D state. This (and related process) got locked up when there was much of disk contention going on.

Revision history for this message

Arie Skliarouk (skliarie) wrote on 2011-02-23:

Sorry, my dmesg is different:

[ 2394.850285] Call Trace:
[ 2394.850293] [<ffffffff8159e18d>] schedule_timeout+0x22d/0x310
[ 2394.850299] [<ffffffff8104ee62>] ? enqueue_entity+0x132/0x1b0
[ 2394.850303] [<ffffffff81058884>] ? check_preempt_wakeup+0x1b4/0x290
[ 2394.850306] [<ffffffff8159d306>] wait_for_common+0xd6/0x180
[ 2394.850309] [<ffffffff81056d40>] ? default_wake_function+0x0/0x20
[ 2394.850312] [<ffffffff8159d48d>] wait_for_completion+0x1d/0x20
[ 2394.850317] [<ffffffff811763a9>] sync_inodes_sb+0x89/0xc0
[ 2394.850320] [<ffffffff8117ada0>] ? sync_one_sb+0x0/0x30
[ 2394.850323] [<ffffffff8117ad88>] __sync_filesystem+0x88/0xa0
[ 2394.850325] [<ffffffff8117adc0>] sync_one_sb+0x20/0x30
[ 2394.850329] [<ffffffff811559db>] iterate_supers+0x8b/0xd0
[ 2394.850332] [<ffffffff8117ae15>] sys_sync+0x45/0x70
[ 2394.850337] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b

The "fastpath" makes me think it relates to software RAID somehow.

Revision history for this message

Arie Skliarouk (skliarie) wrote on 2011-02-23:

And again I am sorry - it looks like it IS related to the start_sb. Because when I look at the stack track of the stuck flush-251:0 process (cat /proc/1428/stack) I see the following output:
[<ffffffff8117dae5>] sync_buffer+0x45/0x50
[<ffffffff8117da96>] __wait_on_buffer+0x26/0x30
[<ffffffff8117ea99>] sync_dirty_buffer+0x69/0xe0
[<ffffffff8122d24d>] jbd2_journal_update_superblock+0xad/0x170
[<ffffffff8122a757>] jbd2_cleanup_journal_tail+0x107/0x1e0
[<ffffffff8122af16>] jbd2_log_do_checkpoint+0x186/0x260
[<ffffffff8122b09e>] __jbd2_log_wait_for_space+0xae/0x1d0
[<ffffffff81226867>] start_this_handle+0x107/0x4b0
[<ffffffff81226de5>] jbd2_journal_start+0xb5/0x100
[<ffffffff81202a68>] ext4_journal_start_sb+0xf8/0x130
[<ffffffff811f06ff>] ext4_da_writepages+0x25f/0x640
[<ffffffff8110aab1>] do_writepages+0x21/0x40
[<ffffffff811759d6>] writeback_single_inode+0xe6/0x3f0
[<ffffffff81176135>] writeback_sb_inodes+0x195/0x280
[<ffffffff81176b8c>] wb_writeback+0x12c/0x2b0
[<ffffffff81176db9>] wb_do_writeback+0xa9/0x190
[<ffffffff81176ef3>] bdi_writeback_task+0x53/0x160
[<ffffffff8111a196>] bdi_start_fn+0x86/0x100
[<ffffffff8107ec56>] kthread+0x96/0xa0
[<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

Revision history for this message

Arie Skliarouk (skliarie) wrote on 2011-02-23:

#10

Ok, found the real description of the bug over here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/624877

Revision history for this message

Brad Figg (brad-figg) wrote on 2011-09-17: Missing required logs.

#11

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 620242

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Revision history for this message

Launchpad Janitor (janitor) wrote on 2011-11-17:

#12

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status:	Incomplete → Expired

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

Bug #1312454

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntulinux package

hang in ext4_journal_start_sb making an lvm snapshot

Bug Description

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
linux package