fsck not repairing corruption on boot

Bug #209416 reported by sam tygier
Affects              Status         Importance   Assigned to   Milestone
e2fsprogs (Ubuntu)   Invalid        Undecided    Unassigned
sysvinit (Ubuntu)    Fix Released   High         Martin Pitt

Bug Description

Binary package hint: e2fsprogs

I suffered some disk corruption; see Bug #209346.

On rebooting, fsck stopped at 15% and the computer rebooted. This loop of fsck runs and reboots continued for a while.

I booted with nosplash and got the following:

* Checking root file system...
1254
fsck 1.40.8 (13-Mar-2008)
/dev/sda9 contains a file system with error, check forced.
Checking drive /dev/sda9: 0% (stage 1/5, 1/79)
Checking drive /dev/sda9: 1% (stage 1/5, 1/79)
...
Checking drive /dev/sda9: 11% (stage 1/5, 1/79)
/dev/sda9: Inodes that were not part of a corrupted orphan linked list found. fsck died with exit status 4

When I booted Gutsy from another partition, I was able to repair the problem with fsck:

sam@oberon:~$ sudo fsck -y /dev/sda9
fsck 1.40.2 (12-Jul-2007)
e2fsck 1.40.2 (12-Jul-2007)
/dev/sda9 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix? yes

Inode 294516 was part of the orphaned inode list. FIXED.
Inode 1153183 was part of the orphaned inode list. FIXED.

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 97482: 232785 232786 232787 232788 232789
Multiply-claimed block(s) in inode 97821: 223622 223623 232785 232786 232787 232788 232789 232790 232790
Multiply-claimed block(s) in inode 98127: 538587 538588 538589 538590 538593 538593
Multiply-claimed block(s) in inode 98135: 538576 538577 538578 538579 538580 538581 538582 538583 538583 538587 538588 538589
Multiply-claimed block(s) in inode 131581: 538576
Multiply-claimed block(s) in inode 131584: 538577 538578 538579 538580 538581 538582 538590
Multiply-claimed block(s) in inode 1169003: 223622 223623
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 7 inodes containing multiply-claimed blocks.)

File /var/lib/mlocate/mlocate.db (inode #97482, mod time Sat Mar 29 12:45:10 2008)
  has 5 multiply-claimed block(s), shared with 1 file(s):
        /var/log/syslog.0 (inode #97821, mod time Sat Mar 29 12:38:09 2008)
Clone multiply-claimed blocks? yes

File /var/log/syslog.0 (inode #97821, mod time Sat Mar 29 12:38:09 2008)
  has 9 multiply-claimed block(s), shared with 2 file(s):
        /var/lib/mlocate/mlocate.db (inode #97482, mod time Sat Mar 29 12:45:10 2008)
        /home/sam/.mozilla/firefox/3dcgopyv.default/Cache/05ADE8E7d01 (inode #1169003, mod time Sun Mar 30 10:56:19 2008)
Clone multiply-claimed blocks? yes

File /var/log/messages (inode #98127, mod time Sun Mar 30 10:54:00 2008)
  has 6 multiply-claimed block(s), shared with 2 file(s):
        /var/cache/apt/archives/linux-ubuntu-modules-2.6.24-12-generic_2.6.24-12.17_amd64.deb (inode #131584, mod time Tue Mar 11 13:04:01 2008)
        /var/log/kern.log (inode #98135, mod time Sun Mar 30 10:54:13 2008)
Clone multiply-claimed blocks? yes

File /var/log/kern.log (inode #98135, mod time Sun Mar 30 10:54:13 2008)
  has 12 multiply-claimed block(s), shared with 3 file(s):
        /var/log/messages (inode #98127, mod time Sun Mar 30 10:54:00 2008)
        /var/cache/apt/archives/linux-ubuntu-modules-2.6.24-12-generic_2.6.24-12.17_amd64.deb (inode #131584, mod time Tue Mar 11 13:04:01 2008)
        /var/cache/apt/archives/linux-image-2.6.24-12-generic_2.6.24-12.22_amd64.deb (inode #131581, mod time Thu Mar 13 01:04:22 2008)
Clone multiply-claimed blocks? yes

File /var/cache/apt/archives/linux-image-2.6.24-12-generic_2.6.24-12.22_amd64.deb (inode #131581, mod time Thu Mar 13 01:04:22 2008)
  has 1 multiply-claimed block(s), shared with 1 file(s):
        /var/log/kern.log (inode #98135, mod time Sun Mar 30 10:54:13 2008)
Multiply-claimed blocks already reassigned or cloned.

File /var/cache/apt/archives/linux-ubuntu-modules-2.6.24-12-generic_2.6.24-12.17_amd64.deb (inode #131584, mod time Tue Mar 11 13:04:01 2008)
  has 7 multiply-claimed block(s), shared with 2 file(s):
        /var/log/messages (inode #98127, mod time Sun Mar 30 10:54:00 2008)
        /var/log/kern.log (inode #98135, mod time Sun Mar 30 10:54:13 2008)
Multiply-claimed blocks already reassigned or cloned.

File /home/sam/.mozilla/firefox/3dcgopyv.default/Cache/05ADE8E7d01 (inode #1169003, mod time Sun Mar 30 10:56:19 2008)
  has 2 multiply-claimed block(s), shared with 1 file(s):
        /var/log/syslog.0 (inode #97821, mod time Sat Mar 29 12:38:09 2008)
Multiply-claimed blocks already reassigned or cloned.

Pass 2: Checking directory structure
Entry 'dpkg.status.1.gz' in /var/backups (102619) has deleted/unused inode 97829. Clear? yes

Entry '%gconf.xml' in /home/sam/.gconf/apps/file-roller/listing (1203731) has deleted/unused inode 1201104. Clear? yes

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 97482 ref count is 1, should be 2. Fix? yes

Unattached inode 97686
Connect to /lost+found? yes

Inode 97686 ref count is 2, should be 1. Fix? yes

Unattached inode 1204250
Connect to /lost+found? yes

Inode 1204250 ref count is 2, should be 1. Fix? yes

Pass 5: Checking group summary information
Block bitmap differences: -(204832--204834) -(396515--396519) -516406 -(516585--516590) -(516599--516602) -520920 +(538553--538554) -713526 -(1086379--1086426) -(2356655--2356658)
Fix? yes

Free blocks count wrong for group #1 (21, counted=0).
Fix? yes

Free blocks count wrong for group #6 (1, counted=4).
Fix? yes

Free blocks count wrong for group #12 (383, counted=388).
Fix? yes

Free blocks count wrong for group #15 (11864, counted=11876).
Fix? yes

Free blocks count wrong for group #16 (6344, counted=6363).
Fix? yes

Free blocks count wrong for group #21 (6, counted=7).
Fix? yes

Free blocks count wrong for group #33 (0, counted=48).
Fix? yes

Free blocks count wrong for group #71 (98, counted=102).
Fix? yes

Free blocks count wrong (172633, counted=172704).
Fix? yes

Inode bitmap differences: -99319 -292135 -294516 -1153183 -1201104 +1204250
Fix? yes

Free inodes count wrong for group #6 (9591, counted=9592).
Fix? yes

Free inodes count wrong for group #18 (13732, counted=13734).
Fix? yes

Free inodes count wrong for group #71 (8505, counted=8506).
Fix? yes

Free inodes count wrong (1058740, counted=1058744).
Fix? yes

/dev/sda9: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda9: 222952/1281696 files (2.8% non-contiguous), 2387647/2560351 blocks
sam@oberon:~$ sudo fsck -y /dev/sda9
fsck 1.40.2 (12-Jul-2007)
e2fsck 1.40.2 (12-Jul-2007)
/dev/sda9: clean, 222952/1281696 files, 2387647/2560351 blocks

I was then able to mount sda9.

It seems to me that Hardy's fsck should have been able to fix this, and not get stuck in a reboot loop.

Revision history for this message
Mary Gardiner (puzzlement) wrote :

The "getting stuck in a reboot loop" part is possibly a duplicate of bug 204097.

Revision history for this message
Theodore Ts'o (tytso) wrote :

This looks like an upstart bug. There are certain bugs which e2fsck will not fix automatically, but where it wants a human to look at the filesystem and decide what the right thing is before just going ahead and blindly fixing the problem. (For example, the filesystem corruption might result in an access control list file for sudo or some firewall to go missing, and that might leave the system unprotected.) In those cases, e2fsck will print a message:

 fprintf(stderr, _("\n\n%s: UNEXPECTED INCONSISTENCY; "
  "RUN fsck MANUALLY.\n\t(i.e., without -a or -p options)\n"),
        ctx->device_name);

And then exit with an error code of 4 (filesystem errors left uncorrected) OR'ed into the exit code. See the fsck man page for more details about the error code reporting convention.
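
The exit code described above is a bit mask, so a single status can carry several conditions at once. A minimal sketch of decoding it in shell follows; the bit values are the ones listed in the fsck(8) man page, while the function name `decode_fsck_status` is made up for illustration and is not part of e2fsprogs.

```shell
#!/bin/sh
# Decode an fsck exit status into the conditions listed in fsck(8).
# decode_fsck_status is a hypothetical helper, not an e2fsprogs tool.
decode_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # Each line of the table below is "<bit value> <meaning>".
    while read -r bit meaning; do
        if [ $((status & bit)) -ne 0 ]; then
            echo "$meaning"
        fi
    done <<'EOF'
1 filesystem errors corrected
2 system should be rebooted
4 filesystem errors left uncorrected
8 operational error
16 usage or syntax error
32 checking canceled by user request
128 shared-library error
EOF
}

# The status from this report: errors were found but deliberately not fixed.
decode_fsck_status 4
```

A status of 5 would print both the "corrected" and the "left uncorrected" lines, which is why boot scripts should test individual bits or ranges rather than compare against a single value.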

This error:

/dev/sda9: Inodes that were not part of a corrupted orphan linked list found. fsck died with exit status 4

is one that will result in e2fsck calling preenhalt(). But it appears that upstart isn't making the message visible (maybe because it's only paying attention to stdout and not stderr?). And if it then reboots, the same thing will happen again, and again....

Changed in e2fsprogs:
status: New → Invalid
Revision history for this message
sam tygier (samtygier) wrote :

the fsck man page says
" -y For some filesystem-specific checkers, the -y option will cause
              the fs-specific fsck to always attempt to fix any detected
              filesystem corruption automatically. Sometimes an expert may be
              able to do better driving the fsck manually. Note that not all
              filesystem-specific checkers implement this option. In
              particular fsck.minix(8) and fsck.cramfs(8) does not support the -y
              option as of this writing."

That makes the benefit of manual checking seem pretty small. Maybe the -y option could be added by default.

Could a 'these files may have been damaged' list be given to the user at next log in?

Revision history for this message
Theodore Ts'o (tytso) wrote :

There are two problems with using -y. First of all, the idea of giving the user "these files may have been damaged" at the next login doesn't work if these are access control files for controlling which hosts are allowed to talk to a server, or other forms of security-critical files, if the machine is running in an unattended server configuration. As another example, suppose the filesystem contains a database or some other critical application data which is now corrupt. It may be better to not let the system come back up, since there are many business applications where serving wrong data is far, far worse than serving no data at all. (Think financial applications....)

Secondly, e2fsck -y won't always do the best job if the goal is to recover as many files as possible.

I'm willing to consider adding a parameter to the e2fsck.conf file to enable "reckless mode", which in preen mode blindly tries to fix everything according to heuristics, with no care as to whether a system administrator with human judgement could do a better job. I am concerned about this, because Ubuntu users seem to have disk corruption issues more frequently than I've seen from other distros. Maybe it's because some segment of Ubuntu users are not as careful about the sort of hardware they choose and are using cheaper hardware (as the old joke goes, "whatever falls off the boat from Taiwan, as long as it's cheapest"); or maybe because people are encouraged to file Launchpad bugs over hardware issues; or maybe it's because of a difference in the maintenance strategy of the distro kernel. So the problem with reckless mode is that they might lose files without even noticing that something bad had happened (i.e., they click away or delete the message about filesystem problems because they don't understand it). Of course these "less-clueful users" are also much less likely to be doing regular backups as well.....

In any case, regardless of whether it is a good idea or not to provide a "reckless mode" for e2fsck, upstart **MUST** display output which is printed by the fsck drivers on standard output and upstart **MUST** respect the fsck driver's wishes if it exits saying that a system administrator should stop and look at the filesystem. At least for a server configuration (and I thought Hardy was going to be targeted at servers), this is a MUST.

Revision history for this message
Martin Pitt (pitti) wrote :

I agree that we should really fix this for Hardy. I'm currently not sure where the problem is (in upstart, or the usplash integration, etc.).

Incidentally, is it possible to safely fake this situation in order to test this? Well, I guess temporarily replacing e2fsck with an "exit 4" shell script should do. :-)
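
A rough sketch of such a stand-in script is below. It is an assumption about what the test stub could look like, not something taken from this report: the name `fake-e2fsck` is hypothetical, and the stub is written to a scratch directory so nothing on the real system is replaced. It mimics the preenhalt message on stderr and exits with status 4.

```shell
#!/bin/sh
# Hypothetical stand-in for e2fsck that always reports "errors left
# uncorrected". It lives in a scratch directory; the real e2fsck is untouched.
stub_dir=$(mktemp -d)
cat > "$stub_dir/fake-e2fsck" <<'EOF'
#!/bin/sh
# Treat the last argument as the device name, as e2fsck would.
dev=unknown
for arg in "$@"; do dev=$arg; done
# Same wording as the preenhalt message, sent to stderr.
printf '\n\n%s: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.\n\t(i.e., without -a or -p options)\n' "$dev" >&2
exit 4
EOF
chmod +x "$stub_dir/fake-e2fsck"

# Run the stub the way a boot script might, and capture its status.
status=0
sh "$stub_dir/fake-e2fsck" -p /dev/sda9 || status=$?
echo "exit status: $status"
```

Whether a boot script picks up such a stub depends on how it locates the checker, so actually swapping it in for the real binary should only be tried on a disposable test system.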

FWIW, I think it would be a bad idea to use something like a "reckless" fsck repair mode by default. Forcefully mounting and booting a broken fs might do more damage than the boot is worth, and it invites users to ignore the error.

Changed in upstart:
importance: Undecided → High
milestone: none → ubuntu-8.04
Revision history for this message
Martin Pitt (pitti) wrote :

I'll do some reproducing and checking where exactly the bug lies.

Changed in upstart:
assignee: nobody → pitti
Revision history for this message
Theodore Ts'o (tytso) wrote :

Sam,

In your original report, you said:

>I did the nosplash thing and got the following
>
>* Checking root file system...
>1254
>fsck 1.40.8 (13-Mar-2008)
>/dev/sda9 contains a file system with error, check forced.
>Checking drive /dev/sda9: 0% (stage 1/5, 1/79)
>Checking drive /dev/sda9: 1% (stage 1/5, 1/79)
>...
>Checking drive /dev/sda9: 11% (stage 1/5, 1/79)
>/dev/sda9: Inodes that were not part of a corrupted orphan linked list found. fsck died with exit status 4

Was this *really* all you saw? There should have been a message saying that you needed to run e2fsck without the -p (preen) option, and it should have offered to drop you into single-user mode, after demanding a root password.

Did you see any evidence of this, in splash or nosplash mode?

And when you say "reboot loop", did you have to hit return or otherwise decline to enter single-user mode?

It should have **not** been necessary to boot a rescue floppy to recover from this. You don't have to on Ubuntu Gutsy in non-splash mode, nor on any other major distribution as far as I know. I leapt to the assumption that this was an upstart problem, since upstart was new to Hardy, but maybe it's something else.....

Revision history for this message
sam tygier (samtygier) wrote :

I snipped out some lines from the middle with incrementing percentages (and there were normal verbose boot lines above). I did not see any message telling me to run fsck.

The reboot loops happened with no interaction by me. It would just go through BIOS, then grub, then show the splash, then show the fsck progress, then it would go back to BIOS. I let it do this 3 or 4 times.

Is there a way to deliberately cause this sort of disk corruption? I had a google for a way to do it, but it seems that most people want to fix corruption ;-)

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 209416] Re: fsck not repairing corruption on boot

sam tygier [2008-04-02 1:14 -0000]:
> Is there a way to deliberately cause this sort of disk corruption? I had
> a google for a way to do it, but it seems that most people want to fix
> corruption ;-)

In my experience, this triggers a fatal error which is easily
correctable:

  sudo tune2fs -s 0 /dev/...; sudo tune2fs -s 1 /dev/...

I.e., flip the 'sparse superblock' bit back and forth.

Revision history for this message
Theodore Ts'o (tytso) wrote :

Actually, that won't replicate the problem you're trying to replicate. In order to do that, you need to induce an error which causes e2fsck to exit with a preenhalt statement when you run it with the -p option.

So for example:

# debugfs -w -R "write /etc/motd test-file" /tmp/foo.img
debugfs 1.40.8 (13-Mar-2008)
Allocated inode: 12
# debugfs -w -R "unlink test-file" /tmp/foo.img
debugfs 1.40.8 (13-Mar-2008)
# debugfs -w -R "set_super_value state 2" /tmp/foo.img
debugfs 1.40.8 (13-Mar-2008)
# e2fsck -p /tmp/foo.img
/tmp/foo.img contains a file system with errors, check forced.
/tmp/foo.img: Unattached inode 12

/tmp/foo.img: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)
# echo $?
4

Revision history for this message
sam tygier (samtygier) wrote :

I tried Theodore's debugfs commands. At boot, the fsck happens during the splash screen, and when it fails I am taken to a prompt (though it is not very clear, as there is no # or $); see attachment.

Is there any reason that "Inodes that were not part of a corrupted orphan linked list found" errors might be treated differently?

Revision history for this message
Colin Watson (cjwatson) wrote :

Ted: "since upstart was new to Hardy" - for the record, upstart was introduced in Edgy, i.e. Ubuntu 6.10.

Revision history for this message
Martin Pitt (pitti) wrote :

With Ted's example commands I can replicate the reboot loop in splash mode. I'll have a look at this now.

Setting to sysvinit since I suspect the bug is in checkroot.sh (for not displaying the text message properly) and the usplash fsck integration (for not killing usplash on failure).

Changed in upstart:
status: New → In Progress
Revision history for this message
Martin Pitt (pitti) wrote :

When I corrupt the disk and boot without 'splash' (either immediately or after a few reboot loops), I get the attached screenshot. This looks fine to me: I get the "RUN fsck MANUALLY" notice and a root shell.

However, Sam's screenshot is different: given the check percentage lines on his system, the usplash process is still running (since the usplash integration scripts kick in), but not displayed any more. This suggests that you booted with splash, but usplash crashed somewhere in between?

Revision history for this message
Martin Pitt (pitti) wrote :

I fixed the fsck integration to quit usplash on fsck errors > 1. Now I end up with the same screen as Sam. The "RUN fsck MANUALLY" message is not printed. I do get a sulogin shell, but it does not show any prompt. I suspect that the terminal attributes get wrecked. Looking into that now.

Revision history for this message
Martin Pitt (pitti) wrote :

I got it now, I think. I uploaded a new sysvinit which is now sitting in the UNAPPROVED queue, and will be accepted after the Hardy RC release.

FYI, I attach the debdiff.

Revision history for this message
Martin Pitt (pitti) wrote :

I tested this with the following cases:

 - clean fs, routine check, cancel
 - clean fs, routine check, no cancel
 - two clean fses with routine check, and mixed cancelling
 - one ext3 and one reiserfsck
 - corrupt ext3 root fs (I get to the console and get the fsck message, as well as usable sulogin)
 - corrupt ext3 non-root fs (ditto about sulogin)

Changed in sysvinit:
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package sysvinit - 2.86.ds1-14.1ubuntu45

---------------
sysvinit (2.86.ds1-14.1ubuntu45) hardy; urgency=low

  * Fix handling of fatal fsck errors in the usplash integration. (LP: #209416)
    - usplash-fsck-functions.sh: When fsck exits with an error > 1, this
      signals a non-correctable failure, which will trigger sulogin. Quit
      usplash in this case and restore stdin/out/err, so that the following
      sulogin is actually usable.
    - check{root,fs}.sh: Redirect fsck's stdout/err to /dev/console in usplash
      mode, so that the user will see the "RUN fsck MANUALLY" warning.

 -- Martin Pitt <email address hidden> Wed, 16 Apr 2008 17:34:20 +0200
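
The behaviour described in this changelog entry can be illustrated with a short shell sketch. This is not the code from usplash-fsck-functions.sh or checkroot.sh: the function name `run_check` and the `CONSOLE` variable are hypothetical, and the splash and sulogin handling is reduced to a message so the example can be run anywhere.

```shell
#!/bin/sh
# Sketch only, not the sysvinit implementation.
# run_check CHECKER [ARGS...] sends the checker's stdout and stderr to the
# console device and treats any exit status above 1 as a fatal failure.
run_check() {
    console=${CONSOLE:-/dev/console}   # CONSOLE is a hypothetical override
    status=0
    "$@" >"$console" 2>&1 || status=$?
    if [ "$status" -gt 1 ]; then
        echo "fatal: stop the splash screen and hand over to sulogin"
    elif [ "$status" -eq 1 ]; then
        echo "errors corrected: continue booting"
    else
        echo "clean: continue booting"
    fi
    return "$status"
}

# Demo with a fake checker that exits 4, keeping output off the real console.
CONSOLE=/dev/null
run_check sh -c 'exit 4' || echo "checker status: $?"
```

The point of the redirection is that the "RUN fsck MANUALLY" warning reaches the user even while a splash screen owns the display; what happens after a fatal status is up to the real boot scripts.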

Changed in sysvinit:
status: Fix Committed → Fix Released