Buggy BIOS hard disk workaround missing; causes: "Geom Error"

Bug #555500 reported by TJ
This bug affects 1 person
Affects          Status        Importance  Assigned to  Milestone
grub             Unknown       Unknown
grub2 (Ubuntu)   Fix Released  High        Unassigned
  Lucid          Fix Released  High        Unassigned

Bug Description

Binary package hint: grub2

Many people are reporting that GRUB2 fails to boot, mostly on Karmic and more recently on Lucid. A forum thread describes the workaround people are using, which is to install LILO instead:

http://ubuntuforums.org/showthread.php?t=1374209

I had Jaunty running fine on an Acer Travelmate C100 and decided to test Lucid. I booted using PXE over the network from a Xubuntu Live i386 CD image, ran the installer, and rebooted.

As soon as BIOS hands over to GRUB2 the screen shows:

GRUB
Geom Error

and that's it - nothing else.

GRUB1 had worked fine with the exact same partition layout on the disk:

1 ntfs 13GB Windows
2 extended
5 ext3 26GB Linux
6 swap ~1GB

In March 2009 I was diagnosing a problem with a USB key failing to boot in a similar way. The USB key used the syslinux project boot loader, so I wrote a diagnostic master boot record (MBR) that reports succinctly what the BIOS tells the boot code about which device it is booting from. It also lets you hold down the Shift or Ctrl keys to change its behaviour. The MBR code is only 435 bytes long.

I installed mbr-diag.bin into the MBR of the C100. It reveals that the BIOS is passing some very weird values to the boot code regardless of what the BIOS's Startup Configuration, Boot Order settings are.

Explanation of usage and output codes of mbr-diag.bin:

If a shift key is held down at boot, CHS addressing mode is forced
If Ctrl key is held down, drive number 0x80 is forced

L | C                 LBA or CHS addressing mode
D     drive number    BIOS-reported drive number
C     cylinders       Geometry of drive according to BIOS
H     heads
S     sectors
P     partition       active partition number (first partition flagged active). '?' if no active partition
O     offset          absolute sector offset of active partition. '????????' if no active partition
M     magic           magic bytes of active partition boot sector (sector <offset> as read by BIOS).
                      '????' if no active partition. Value is reset to 0xDEAD before the sector is read
                      to avoid inheriting the MBR magic on error
E     error           error code returned by BIOS 'read sector' interrupt (0x02 or 0x42, int 0x13).
                      '??' if no active partition.

It shows:

C D5F C000 H01 S01 P1 O0000003F MDEAD E01

That decodes as: CHS addressing mode, drive 0x5F (95 decimal), 0 cylinders, 1 head, 1 sector, active partition #1 at an offset of 63 sectors, and magic bytes never read (still 0xDEAD) because the BIOS returned error 0x01.
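
For anyone who wants to decode these status lines mechanically, here is a minimal sketch in C. The struct and function names are made up for illustration and are not part of mbr-diag.bin. It assumes the fields always appear in the order shown above and that every field except the partition number is hexadecimal.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of one mbr-diag.bin status line (names are illustrative only). */
struct mbr_diag {
    char mode;            /* 'L' = LBA, 'C' = CHS */
    unsigned drive;       /* BIOS-reported drive number */
    unsigned cylinders;   /* geometry according to the BIOS */
    unsigned heads;
    unsigned sectors;
    unsigned partition;   /* active partition number */
    unsigned long offset; /* absolute sector offset of the active partition */
    unsigned magic;       /* boot sector magic, 0xDEAD if the read failed */
    unsigned error;       /* int 0x13 error code */
};

/* Returns 1 if all nine fields were parsed, 0 otherwise
   (for example when a field holds '?' placeholders). */
int parse_mbr_diag(const char *line, struct mbr_diag *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, " %c D%x C%x H%x S%x P%u O%lx M%x E%x",
                   &out->mode, &out->drive, &out->cylinders, &out->heads,
                   &out->sectors, &out->partition, &out->offset,
                   &out->magic, &out->error);
    return n == 9;
}
```

Fed the failing line from the C100, this gives drive 0x5F, offset 63 and magic 0xDEAD, matching the reading above.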

I then tried holding the Ctrl key down to force hard disk 0x80 to be used:

L D80 C3FE HFF S3F P1 O0000003F M0000 E0

I'd have expected to see something close to this, which is an example of a 'good' set of BIOS boot parameters:

L D80 C3D9 HFF S3F P1 O00000020 MAA55 E00

However, I'd moved the Windows partition to the end of the disk to avoid any problems with the BIOS not being able to address beyond cylinder 1024. The new layout is:

   1   30  0x83  ext4  (250MB ext4 /boot)
  31  124  0x82  swap  (750MB swap)
 125 3208  0x83  ext4  (26GB Linux /)
3209 4864  0x07  ntfs  (13GB Windows)

So P1 points to the Linux /boot partition which doesn't have a volume boot sector and *does* contain 0x0000 in the magic bytes slots.

To confirm that forcing drive 0x80 makes the BIOS read the correct device, I changed the active partition to #4 (Windows), which does have a volume boot sector ending in the magic bytes 0x55 0xAA (shown as AA55 when read as a little-endian word). When the PC was rebooted it showed (without Ctrl pressed):

C D5F C000 H01 S01 P4 O03126288 MDEAD E01

Well, progress! Partition #4 has been seen as the active one, but reads still fail, as the magic bytes and error code show.

I tried again, this time pressing Ctrl key:

L D80 C3FE HFF S3F P4 O03126288 MAA55 E00

Success! The magic bytes show the BIOS was able to read the volume boot sector from partition #4, and the initial "L" shows it was in LBA mode so was able to address beyond the 1024 cylinder limit.
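
To see why LBA mode matters for partition #4, the offset can be converted back to cylinder/head/sector form. The sketch below assumes the HFF and S3F values mean 255 heads and 63 sectors per track, which is my reading of the output rather than something the tool documents. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct chs {
    uint32_t cylinder;  /* zero-based */
    uint32_t head;      /* zero-based */
    uint32_t sector;    /* one-based, as in int 0x13 */
};

/* Convert an absolute sector number to cylinder/head/sector. */
struct chs lba_to_chs(uint32_t lba, uint32_t heads, uint32_t sectors_per_track)
{
    assert(heads > 0 && sectors_per_track > 0);
    struct chs r;
    r.cylinder = lba / (heads * sectors_per_track);
    r.head = (lba / sectors_per_track) % heads;
    r.sector = lba % sectors_per_track + 1;
    return r;
}

/* The CHS form of int 0x13 carries a 10-bit cylinder number (0..1023). */
int needs_lba(uint32_t lba, uint32_t heads, uint32_t sectors_per_track)
{
    return lba_to_chs(lba, heads, sectors_per_track).cylinder > 1023;
}
```

With those assumptions, offset 0x03126288 falls on cylinder 3208, head 0, sector 1, well past cylinder 1023, while offset 0x3F is cylinder 0, head 1, sector 1.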

My next step will be to create a patch for the GRUB2 boot sector similar to the one I contributed to the syslinux project that allows the use of the Ctrl key pressed at boot to force disk 0x80 and LBA mode.

TJ (tj)
Changed in grub2 (Ubuntu):
status: New → Confirmed
status: Confirmed → In Progress
importance: Undecided → High
assignee: nobody → TJ (tj)
TJ (tj) wrote :

This is a problem in grub-setup.

grub2 ships a 'default' boot sector /boot/grub/boot.img created from boot/i386/pc/boot.S

grub-setup is supposed to modify the code, overwriting one 2-byte instruction with no-operation instructions (nop, 0x90) if it knows it is installing onto the first hard disk of the target:

  /* If DEST_DRIVE is a hard disk, enable the workaround, which is
     for buggy BIOSes which don't pass boot drive correctly. Instead,
     they pass 0x00 or 0x01 even when booted from 0x80. */
  if (dest_dev->disk->id & 0x80)
    /* Replace the jmp (2 bytes) with double nop's. */
    *boot_drive_check = 0x9090;

The result of this is that the two bytes at offset 0x66 (decimal 102) in the sector written to the hard disk should be nops to replace the jmp instruction at 0x66:

00000065 FA cli
00000066 EB07 jmp short 0x6f
00000068 F6C280 test dl,0x80
0000006B 7502 jnz 0x6f
0000006D B280 mov dl,0x80

I manually wrote the nops to the boot sector and grub2 started correctly. I'll now figure out why grub-setup is not doing this over-write itself.
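
To make the effect of those five instructions explicit, here is a small C model of what the boot sector does with the drive number the BIOS passes in DL. This is an illustration of the disassembly above, not GRUB code. The function name and the workaround_enabled flag are made up for the example.

```c
#include <stdint.h>

/* Model of the drive check in the boot sector.
   workaround_enabled == 0: the jmp at 0x66 is still present.
   workaround_enabled != 0: the jmp has been replaced by two nops. */
uint8_t boot_drive(uint8_t dl, int workaround_enabled)
{
    if (!workaround_enabled)
        return dl;      /* jmp short 0x6f: DL is used as the BIOS passed it */
    if (dl & 0x80)
        return dl;      /* test dl,0x80 ; jnz 0x6f: looks like a hard disk */
    return 0x80;        /* mov dl,0x80: force the first hard disk */
}
```

With the jmp in place, the 0x5F reported on the C100 is used unchanged. With the nops in place it is replaced by 0x80, which has the same effect as holding Ctrl in the diagnostic MBR.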

As a temporary workaround for this issue:

1. Boot from a LiveCD image from CD or network (via PXE).
2. Open a terminal (there are two ways)
 a. press Ctrl+Alt+F1 *twice* to get to virtual console #1
 b. Applications > Accessories > Terminal
3. Create a file containing the nops:
 echo -e -n "\0220\0220" >/tmp/nop.bin
4. Write the nops into the boot sector (replace /dev/sda if necessary with the boot device name on *your* system):
 sudo dd if=/tmp/nop.bin of=/dev/sda bs=1 count=2 seek=102
5. Restart and test.

TJ (tj) wrote :

My reading of the source-code path for grub-setup.c indicates that dest_dev->disk->id is not set by the time the drive number test is done.

utils/i386/pc/grub-setup.c::setup()
   dest_dev = grub_device_open (dest);
   kern/device.c::grub_device_open(const char *name)
    disk = grub_disk_open(name);
    kern/disk.c::grub_disk_open(const char *name)
      disk = (grub_disk_t) grub_zalloc(sizeof(*disk));
      disk->name = grub_strdup (name);
      ...
      disk->dev = dev;
      ...
      return disk;
   dev->disk = disk;
   return dev;
 ...
 /* If DEST_DRIVE is a hard disk, enable the workaround, which is
    for buggy BIOSes which don't pass boot drive correctly. Instead,
    they pass 0x00 or 0x01 even when booted from 0x80. */
 if (dest_dev->disk->id & 0x80)
   /* Replace the jmp (2 bytes) with double nop's. */
   *boot_drive_check = 0x9090;

To test this I added a small patch to report the value of disk->id (attached) and built the binary. When run on the target system it reveals:

./grub-setup: info: the size of hd0 is 78140160
./grub-setup: info: setting the root device to 'hd0,1'.
./grub-setup: info: disk->id = 0.

and the resulting boot sector contained the jmp instruction.

TJ (tj) wrote :
TJ (tj)
summary: - Acer Travelmate C100 fails to boot: "Geom Error"
+ Buggy BIOS hard disk workaround missing; causes: "Geom Error"
description: updated
TJ (tj) wrote :

*** IMPORTANT ***

I've noticed that the workaround in comment #1 gives an incorrect set of options for using dd to write to the disk. Please DO NOT use that command as it will try to truncate the output.

It should read:

4. Write the nops into the boot sector (replace /dev/sda if necessary with the boot device name on *your* system):

 sudo dd if=/tmp/nop.bin of=/dev/sda conv=notrunc bs=1 count=2 seek=102
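
The same two-byte overwrite can be tried on a saved copy of the boot sector before touching the real disk. The following is a minimal sketch under that assumption. The function name is made up, it takes an ordinary image file, and it refuses to write unless the jmp (EB 07) from the disassembly is actually present at byte offset 0x66.

```c
#include <stdio.h>

#define DRIVE_CHECK_OFFSET 0x66L  /* byte offset of the jmp in the boot sector */

/* Replace the jmp (EB 07) at DRIVE_CHECK_OFFSET with two nops (90 90).
   Returns 0 on success, -1 on I/O error or if the expected jmp is not there. */
int patch_drive_check(const char *path)
{
    unsigned char insn[2];
    const unsigned char nops[2] = { 0x90, 0x90 };
    int rc = -1;

    /* "r+b" opens for update without truncating the existing contents. */
    FILE *f = fopen(path, "r+b");
    if (f == NULL)
        return -1;

    if (fseek(f, DRIVE_CHECK_OFFSET, SEEK_SET) == 0 &&
        fread(insn, 1, sizeof insn, f) == sizeof insn &&
        insn[0] == 0xEB && insn[1] == 0x07 &&
        fseek(f, DRIVE_CHECK_OFFSET, SEEK_SET) == 0 &&
        fwrite(nops, 1, sizeof nops, f) == sizeof nops)
        rc = 0;

    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

Checking for EB 07 first means a second run, or a run against a boot sector with a different layout, leaves the file untouched and returns -1.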

TJ (tj) wrote :

After adding several debugging fprintf statements the reason for this problem becomes clear. grub-setup calls grub_disk_open(), which in turn calls an open() function via a list of function pointers. Here's the source code peppered with my debug print statements from kern/disk.c::grub_disk_open():

  grub_dprintf("disk", "grub_disk_dev_list %s\n", raw);
  for (dev = grub_disk_dev_list; dev; dev = dev->next)
    {
      grub_dprintf("disk", "dev->open (%p)\n", (void *) dev->open);
      if ((dev->open) (raw, disk) == GRUB_ERR_NONE) {
        grub_dprintf("disk", "%s\n", "GRUB_ERR_NONE");
        break;
      }
      else if (grub_errno == GRUB_ERR_UNKNOWN_DEVICE) {
        grub_errno = GRUB_ERR_NONE;
        grub_dprintf("disk", "%s\n", "GRUB_ERR_UNKNOWN_DEVICE");
      }
      else
        goto fail;
    }

Running grub-setup with extra debug info output to stdout:

sudo ./grub-setup -vvv '(hd0)' >grub-setup.log 2>grub-setup.stderr.log

shows:

grub-setup: opening destination 'hd0'
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:245: Opening `hd0'...
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:268: grub_disk_dev_list hd0
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:271: dev->open (0x80754f7)
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:278: GRUB_ERR_UNKNOWN_DEVICE
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:271: dev->open (0x80740b0)
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:278: GRUB_ERR_UNKNOWN_DEVICE
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:271: dev->open (0x804b6d6)
grub_util_biosdisk_open(hd0)
find_grub_drive(hd0)
i=0 hd0
find_grub_drive(hd0) returned 0
setting disk->id=0
/home/all/SourceCode/grub/grub2-1.98/kern/disk.c:273: GRUB_ERR_NONE
grub-setup: testing: disk->id = 0

So find_grub_drive() is the function returning 0.

That function simply iterates through an array called 'map' that contains the contents of /boot/grub/device.map. The first entry in 'map' (map[0]) represents the first line in the device.map file.

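From that it is now possible to see why the id is 0: 'hd0' is the first line of device.map, so find_grub_drive() returns index 0 and disk->id is set to that index. Since 0 & 0x80 is 0, the test in grub-setup fails and the buggy BIOS work-around never gets applied to the boot sector.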
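
The chain of events can be reproduced in a few lines of C. This is a simplified stand-in, not the GRUB source: the table contents and both function names are hypothetical, and the real find_grub_drive() works on the parsed device.map rather than a fixed array.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the parsed device.map, one entry per line. */
static const char *const device_map[] = { "hd0", "hd1" };
#define MAP_ENTRIES (sizeof device_map / sizeof device_map[0])

/* Return the index of NAME in the table, or -1 if it is not listed. */
int find_drive_index(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < MAP_ENTRIES; i++)
        if (strcmp(device_map[i], name) == 0)
            return (int) i;
    return -1;
}

/* The test grub-setup applies before overwriting the jmp with nops. */
int workaround_applies(unsigned long id)
{
    return (id & 0x80) != 0;
}
```

Here find_drive_index("hd0") is 0, so workaround_applies(0) is false and the jmp stays in the boot sector. Only an id with bit 7 set, such as 0x80, passes the test.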

TJ (tj) wrote :

This patch may not be the perfect solution (that may need upstream's input), but it follows the same process used in the biosdisk routines: where the GRUB device name begins with "hd", set bit 7 (| 0x80) to indicate a hard disk.

In this case we don't have access to the same data, but we can still apply the workaround when 'dest' begins with "hd". It is slightly ropier in that I've also allowed it to match on drive->id == 0, since that, we now know, only reflects which line of 'device.map' the device is on. I'm relying on 'device.map' usually listing the BIOS's first hard disk first.

grub2 (1.98-1ubuntu4) lucid; urgency=low

  * Ensure buggy BIOS workaround is applied by grub-setup (LP: #555500).

 -- TJ <email address hidden> Thu, 08 Apr 2010 04:30:00 +0100
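
For illustration only, the "hd" naming convention described above can be sketched as follows. This is not the attached patch and not GRUB code. The helper name is hypothetical, and it simply shows how an id derived from the name would satisfy the existing `id & 0x80` test.

```c
#include <string.h>

/* Hypothetical helper: derive a disk id from a GRUB drive name and its
   device.map index, setting bit 7 when the name begins with "hd". */
unsigned long disk_id_for(const char *name, unsigned long map_index)
{
    if (strncmp(name, "hd", 2) == 0)
        return map_index | 0x80;   /* hard disk: bit 7 set */
    return map_index;              /* anything else: index unchanged */
}
```

Under this convention "hd0" on the first line of device.map would get id 0x80 rather than 0, so the nops would be written.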

TJ (tj) wrote :
Colin Watson (cjwatson) wrote :

Thanks for your investigation and your proposed patch. I have a slightly different plan here involving getting util/hostdisk.c to check for "hd", which I think would be neater. Work in progress ...

Colin Watson (cjwatson) wrote :

Could you try this patch:

  http://bazaar.launchpad.net/~ubuntu-core-dev/ubuntu/lucid/grub2/lucid/annotate/head%3A/debian/patches/975_hostdisk_hd.diff

Based on inspection of the resulting boot sector, it seems to be doing the right thing for me.

TJ (tj) wrote : Re: [Bug 555500] Re: Buggy BIOS hard disk workaround missing; causes: "Geom Error"

On Thu, 2010-04-08 at 17:06 +0000, Colin Watson wrote:
> Could you try this patch:
>
> http://bazaar.launchpad.net/~ubuntu-core-dev/ubuntu/lucid/grub2/lucid/annotate/head%3A/debian/patches/975_hostdisk_hd.diff
>
> Based on inspection of the resulting boot sector, it seems to be doing
> the right thing for me.

Confirmed, it works. Thanks for figuring out the 'proper' way to do it.
It helps when someone is familiar with the source :)

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package grub2 - 1.98-1ubuntu4

---------------
grub2 (1.98-1ubuntu4) lucid; urgency=low

  * Only use the first word of GRUB_DISTRIBUTOR for --class, to avoid
    problems if somebody puts spaces in GRUB_DISTRIBUTOR (LP: #557606).
  * Probe all devices in 'grub-probe --target=drive' if
    /boot/grub/device.map is missing (LP: #549980).
  * Adjust hostdisk id for hard disks, allowing grub-setup to use its
    standard workaround for broken BIOSes (thanks to TJ for detailed
    investigation; LP: #555500).
 -- Colin Watson <email address hidden> Fri, 09 Apr 2010 09:46:44 +0100

Changed in grub2 (Ubuntu Lucid):
status: In Progress → Fix Released
TJ (tj)
Changed in grub2 (Ubuntu):
assignee: TJ (tj) → nobody
Changed in grub2 (Ubuntu Lucid):
assignee: TJ (tj) → nobody