Replace a failed drive in a software RAID

View the kernel log to detect a possibly failing hard drive

root@ubuntu:~# dmesg
[ 886.492585] sdb: Current: sense key: Recovered Error
[ 886.497903] Additional sense: Recovered data with retries
[ 886.504060] Info fld=0xdf82e1
[ 919.421181] sdb: Current: sense key: Recovered Error
[ 919.426474] Additional sense: Recovered data without ECC - recommend rewrite
[ 919.434375] Info fld=0xd66a9a
[ 1728.424643] sdb: Current: sense key: Recovered Error
[ 1728.429945] Additional sense: Recovered data without ECC - data auto-reallocated
[ 1728.438197] Info fld=0xccc0fe
[ 1731.086946] sdb: Current: sense key: Recovered Error
[ 1731.092252] Additional sense: Recovered data without ECC - data auto-reallocated
[ 1731.100514] Info fld=0xccb675
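
On a busy server these messages are easy to miss among other kernel output. The following is a minimal sketch, not part of the original procedure, that counts "sense key" messages per device. The function name count_sense_errors is hypothetical, and the parsing assumes the message layout shown above (device name directly before "Current:").

```shell
# Count "sense key" messages per device in kernel log text read from stdin.
# Prints one "<device> <count>" line per device, sorted by device name.
count_sense_errors() {
    awk '/sense key:/ {
             # The device name is the field just before "Current:", e.g. "sdb:".
             for (i = 2; i <= NF; i++)
                 if ($i == "Current:") {
                     dev = $(i - 1)
                     sub(/:$/, "", dev)
                     n[dev]++
                 }
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Usage on a live system:
#   dmesg | count_sense_errors
```

A device whose count keeps growing between runs is a good candidate for the SMART test in the next step.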

Perform SMART test on drive

Install SMART tools

root@ubuntu:~# aptitude install smartmontools

Run a long SMART self-test, then review the results once it has finished (about 37 minutes on this drive)

root@ubuntu:~# smartctl --test=long /dev/sdb
root@ubuntu:~# smartctl -a /dev/sdb
smartctl version 5.34 [x86_64-unknown-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
 
Device: FUJITSU MAV2073RCSUN72G Version: 0301
Serial number: 000535S00AUB
Device type: disk
Transport protocol: SAS
Local Time is: Sat Jan 29 14:22:13 2011 CST
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK
 
Current Drive Temperature: 27 C
Drive Trip Temperature: 65 C
Manufactured in week 35 of year 2005
Current start stop count: 43 times
Recommended maximum start stop count: 10000 times
Elements in grown defect list: 355
 
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   530114      1342      1342          0      78930.620           0
write:         0        2         0         0          0      38013.435           0
 
Non-medium error count: 44
 
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       9   42754          13399317 [0x3 0x11 0x1]
# 2  Background long   Failed in segment -->       9   42635          13399317 [0x3 0x11 0x1]
# 3  Background short  Completed                   -   42635                 - [-   -    -]
# 4  Background long   Failed in segment -->       9   42634          13398730 [0x3 0x11 0x1]
 
Long (extended) Self Test duration: 2233 seconds [37.2 minutes]
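
Note that the health status above still reads OK even though the drive has 355 grown defects and three failed long self-tests. The sketch below pulls just those two indicators out of the report. The function name smart_summary is hypothetical, and the patterns assume the SCSI/SAS report layout shown here; ATA drives print a different report.

```shell
# Summarise SCSI/SAS "smartctl -a" output read from stdin.
# Prints the grown defect count and the number of failed self-tests.
smart_summary() {
    awk '/^Elements in grown defect list:/ { defects = $NF }
         /^# *[0-9]+ .*Failed in segment/  { failed++ }
         END {
             if (defects == "") defects = "unknown"
             printf "grown_defects=%s failed_selftests=%d\n", defects, failed
         }'
}

# Usage on a live system:
#   smartctl -a /dev/sdb | smart_summary
```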

Review the partition layout and RAID status

root@ubuntu:~# fdisk -l
 
Disk /dev/sda: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sda2              13        8924    71585640   fd  Linux raid autodetect
 
Disk /dev/sdb: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sdb2              13        8924    71585640   fd  Linux raid autodetect
 
Disk /dev/sdc: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1        8924    71681998+  83  Linux
 
Disk /dev/sdd: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1        8924    71681998+  83  Linux
 
Disk /dev/md0: 98 MB, 98566144 bytes
2 heads, 4 sectors/track, 24064 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
 
Disk /dev/md0 doesn't contain a valid partition table
 
Disk /dev/md1: 73.3 GB, 73303588864 bytes
2 heads, 4 sectors/track, 17896384 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
 
Disk /dev/md1 doesn't contain a valid partition table
 
Disk /dev/md2: 73.4 GB, 73402286080 bytes
2 heads, 4 sectors/track, 17920480 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
 
Disk /dev/md2 doesn't contain a valid partition table
root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
 
md1 : active raid1 sda2[0] sdb2[1]
      71585536 blocks [2/2] [UU]
 
md0 : active raid1 sda1[0] sdb1[1]
      96256 blocks [2/2] [UU]
 
unused devices: <none>
root@ubuntu:~# mdadm --query --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Feb  8 17:29:05 2006
     Raid Level : raid1
     Array Size : 96256 (94.02 MiB 98.57 MB)
    Device Size : 96256 (94.02 MiB 98.57 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent
 
    Update Time : Mon Jan 31 06:26:13 2011
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
 
           UUID : 96c88b09:82b06262:679309e4:bbe2fe4f
         Events : 0.20160
 
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
root@ubuntu:~# mdadm --query --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Wed Feb  8 17:29:25 2006
     Raid Level : raid1
     Array Size : 71585536 (68.27 GiB 73.30 GB)
    Device Size : 71585536 (68.27 GiB 73.30 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent
 
    Update Time : Mon Jan 31 17:42:26 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
 
           UUID : 6154cd5a:edf5f628:28d7a268:ad434b95
         Events : 0.59383068
 
    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
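
The output above shows that /dev/sdb contributes one partition each to md0 and md1, so both arrays are affected. As a rough sketch, the same mapping can be read from /proc/mdstat. The function name members_on_disk is hypothetical, and the parsing assumes the mdstat layout shown above.

```shell
# List the md arrays that contain a partition of the given disk.
# Reads /proc/mdstat-style text from stdin; prints "<array> <partition>" lines.
members_on_disk() {
    disk=$1
    awk -v disk="$disk" '$2 == ":" && $3 == "active" {
             for (i = 4; i <= NF; i++) {
                 part = $i
                 sub(/\[.*$/, "", part)   # strip "[1]" and "(F)" suffixes
                 if (part ~ ("^" disk "[0-9]+$")) print $1, part
             }
         }'
}

# Usage on a live system:
#   members_on_disk sdb < /proc/mdstat
```

Every partition listed has to be failed and removed from its array before the drive is pulled.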

Remove the Failed Drive

root@ubuntu:~# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
 
md1 : active raid1 sda2[0] sdb2[1]
      71585536 blocks [2/2] [UU]
 
md0 : active raid1 sda1[0] sdb1[2](F)
      96256 blocks [2/1] [U_]
 
unused devices: <none>
root@ubuntu:~# mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1
root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
 
md1 : active raid1 sda2[0] sdb2[1]
      71585536 blocks [2/2] [UU]
 
md0 : active raid1 sda1[0]
      96256 blocks [2/1] [U_]
 
unused devices: <none>
root@ubuntu:~# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
 
md1 : active raid1 sda2[0] sdb2[2](F)
      71585536 blocks [2/1] [U_]
 
md0 : active raid1 sda1[0]
      96256 blocks [2/1] [U_]
 
unused devices: <none>
root@ubuntu:~# mdadm --manage /dev/md1 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2
root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
 
md1 : active raid1 sda2[0]
      71585536 blocks [2/1] [U_]
 
md0 : active raid1 sda1[0]
      96256 blocks [2/1] [U_]
 
unused devices: <none>
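
The fail and remove steps are the same for every member of the drive, so they can be generated rather than typed. The sketch below is a dry run only: it prints the mdadm commands used above instead of executing them. The function name print_removal_commands and its "array partition" input format are assumptions made for this example.

```shell
# Dry run: print the mdadm commands that fail and then remove each member.
# Reads "<array> <partition>" pairs from stdin, for example "md0 sdb1".
print_removal_commands() {
    while read -r array part; do
        # Skip blank or incomplete lines.
        [ -n "$array" ] && [ -n "$part" ] || continue
        echo "mdadm --manage /dev/$array --fail /dev/$part"
        echo "mdadm --manage /dev/$array --remove /dev/$part"
    done
}

# Usage:
#   printf 'md0 sdb1\nmd1 sdb2\n' | print_removal_commands
```

Review the printed commands before running them; failing the wrong member of a two-disk mirror leaves the array without a good copy.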

Replace Drive

Power down the server and replace the failed physical drive.

Add new Drive to RAID

Verify current partition information

root@ubuntu:~# sfdisk -d /dev/sda
# partition table of /dev/sda
unit: sectors
 
/dev/sda1 : start=       63, size=   192717, Id=fd, bootable
/dev/sda2 : start=   192780, size=143171280, Id=fd
/dev/sda3 : start=        0, size=        0, Id= 0
/dev/sda4 : start=        0, size=        0, Id= 0

Copy the partition information over

root@ubuntu:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ...
OK
 
Disk /dev/sdb: 8924 cylinders, 255 heads, 63 sectors/track
 
sfdisk: ERROR: sector 0 does not have an msdos signature
 /dev/sdb: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0
 
   Device Boot    Start       End   #sectors  Id  System
/dev/sdb1   *        63    192779     192717  fd  Linux raid autodetect
/dev/sdb2        192780 143364059  143171280  fd  Linux raid autodetect
/dev/sdb3             0         -          0   0  Empty
/dev/sdb4             0         -          0   0  Empty
Successfully wrote the new partition table
 
Re-reading the partition table ...
 
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
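
Copying the partition table only works if the replacement drive is at least as large as the surviving one. Here both are 73407865856 bytes, but a replacement from another vendor may be slightly smaller. A minimal sketch of a pre-check follows; the function name can_clone_layout is hypothetical, and it compares only raw sizes in bytes.

```shell
# Succeed when a disk of dst_bytes can hold the layout of a disk of src_bytes.
can_clone_layout() {
    src_bytes=$1
    dst_bytes=$2
    [ "$dst_bytes" -ge "$src_bytes" ]
}

# Usage on a live system (blockdev --getsize64 prints the size in bytes):
#   can_clone_layout "$(blockdev --getsize64 /dev/sda)" \
#                    "$(blockdev --getsize64 /dev/sdb)" \
#       && sfdisk -d /dev/sda | sfdisk /dev/sdb
```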

Verify partition information

root@ubuntu:~# fdisk -l /dev/sda /dev/sdb
 
Disk /dev/sda: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sda2              13        8924    71585640   fd  Linux raid autodetect
 
Disk /dev/sdb: 73.4 GB, 73407865856 bytes
255 heads, 63 sectors/track, 8924 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          12       96358+  fd  Linux raid autodetect
/dev/sdb2              13        8924    71585640   fd  Linux raid autodetect
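
Comparing the two listings by eye works for two partitions but gets error-prone with more. As a sketch, the dumps from sfdisk -d can be compared once the device names are stripped. The function name normalize_dump is hypothetical, and the pattern assumes device names of the form /dev/sdXN as used in this article.

```shell
# Strip the device name from "sfdisk -d" output so two disks can be compared.
# Reads a dump from stdin; keeps only the per-partition lines, renamed "p<N>".
normalize_dump() {
    sed -n 's#^/dev/[a-z]*\([0-9][0-9]*\) *:#p\1 :#p'
}

# Usage on a live system:
#   [ "$(sfdisk -d /dev/sda | normalize_dump)" = \
#     "$(sfdisk -d /dev/sdb | normalize_dump)" ] && echo "layouts match"
```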

Add new drive partitions to software RAID

root@ubuntu:~# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: hot added /dev/sdb1
root@ubuntu:~# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: hot added /dev/sdb2
root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
 
md1 : active raid1 sdb2[2] sda2[0]
      71585536 blocks [2/1] [U_]
      [>....................]  recovery =  0.1% (97408/71585536) finish=73.3min speed=16234K/sec
 
md0 : active raid1 sdb1[1] sda1[0]
      96256 blocks [2/2] [UU]
 
unused devices: <none>

Verify that the RAID rebuild eventually finishes successfully

root@ubuntu:~# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sdc1[0] sdd1[1]
      71681920 blocks [2/2] [UU]
 
md1 : active raid1 sdb2[1] sda2[0]
      71585536 blocks [2/2] [UU]
 
md0 : active raid1 sdb1[1] sda1[0]
      96256 blocks [2/2] [UU]
 
unused devices: <none>
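
The rebuild of md1 was estimated at over an hour, so it is convenient to poll for completion instead of re-running cat by hand. The following sketch reports arrays that are still recovering or degraded. The function name array_problems is hypothetical, and the patterns assume the mdstat layout shown above.

```shell
# Print the arrays that are rebuilding or degraded, from mdstat text on stdin.
# Output lines look like "md1 recovering" or "md0 degraded"; no output means
# every array has all of its members in sync.
array_problems() {
    awk '$2 == ":" && $3 == "active" { array = $1; next }
         array != "" && /blocks/ && /\[[U_]*_[U_]*\]/ { state[array] = "degraded" }
         array != "" && /(recovery|resync) *=/ { state[array] = "recovering" }
         END { for (a in state) print a, state[a] }' | sort
}

# Usage on a live system: wait until no array reports a rebuild.
#   while array_problems < /proc/mdstat | grep -q recovering; do sleep 60; done
```

An array that still shows as degraded after the loop ends did not pick up its new member and needs another look.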

Make Disks Bootable with Grub

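If the drive you replaced held a copy of the boot partition, reinstall GRUB on it so the system can still boot from either disk. The transcripts below use the GRUB legacy shell and install the boot loader on both disks.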

/dev/sda

root@ubuntu:~# grub
Probing devices to guess BIOS drives. This may take a long time.
 
       [ Minimal BASH-like line editing is supported.   For
         the   first   word,  TAB  lists  possible  command
         completions.  Anywhere else TAB lists the possible
         completions of a device/filename. ]
grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  16 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd0)1+16 p (hd0,0)/grub/stage2 /grub/menu.lst"... succeeded
Done.
grub> quit

/dev/sdb

root@ubuntu:~# grub
Probing devices to guess BIOS drives. This may take a long time.
 
       [ Minimal BASH-like line editing is supported.   For
         the   first   word,  TAB  lists  possible  command
         completions.  Anywhere else TAB lists the possible
         completions of a device/filename. ]
grub> device (hd1) /dev/sdb
grub> root (hd1,0)
grub> setup (hd1)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd1)"...  16 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd1) (hd1)1+16 p (hd1,0)/grub/stage2 /grub/menu.lst"... succeeded
Done.
grub> quit
root@ubuntu:~#
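
The two sessions differ only in the drive number and device name, so the command sequence can be generated. This is a sketch under assumptions: the function name grub_setup_script is hypothetical, it assumes /boot is the first partition as in the sessions above, and it relies on the GRUB legacy shell accepting commands on standard input in batch mode.

```shell
# Print the GRUB legacy shell commands that install the boot loader on one disk.
# Arguments: BIOS drive number (0 for hd0) and the matching Linux device.
grub_setup_script() {
    n=$1
    dev=$2
    printf 'device (hd%s) %s\nroot (hd%s,0)\nsetup (hd%s)\nquit\n' \
        "$n" "$dev" "$n" "$n"
}

# Usage on a live system (inspect the printed commands first):
#   grub_setup_script 1 /dev/sdb | grub --batch
```

Check the output for the same "succeeded" lines as in the interactive sessions before relying on the disk to boot.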

References

  • http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array

2 thoughts on “Replace a failed drive in a software RAID”

  1. Very good and complete howto. Thanks for including the output of the commands (most howtos have the commands only). Thanks also for including the grub/make-bootable steps; I think I forgot to do this when I set up raid1 originally and so I got lucky that sdb (not sda) failed. Very helpful!

  2. First step under “Add New Drive.” If you replaced your original boot disk, you must enter BIOS to change your boot disk to the RAID counterpart (usually the sdb disk) in order to complete this process.
