Checking Hard Drive health on GNU/Linux (Part I)

A few days ago I ran into I/O problems with two of my USB hard drives. I use these disks to store my personal information, so I was quite worried. It also seemed very strange that two disks would fail at the same time, but everybody knows that weird things happen sometimes.

So I decided to find out what was going on, starting with what happens when I plug in one of my USB external disks:

 [ 6481.036062] usb 2-1: new high-speed USB device number 7 using ehci_hcd
 [ 6481.169575] usb 2-1: New USB device found, idVendor=1058, idProduct=1010
 [ 6481.169584] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
 [ 6481.169591] usb 2-1: Product: External HDD
 [ 6481.169597] usb 2-1: Manufacturer: Western Digital
 [ 6481.169602] usb 2-1: SerialNumber: 57442D575832304136395738343933
 [ 6481.170377] scsi11 : usb-storage 2-1:1.0
 [ 6482.177044] scsi 11:0:0:0: Direct-Access     WD       5000BEV External 1.75 PQ: 0 ANSI: 4
 [ 6482.179348] sd 11:0:0:0: Attached scsi generic sg2 type 0
 [ 6482.181668] sd 11:0:0:0: [sdc] 976773168 512-byte logical blocks: (500 GB/465 GiB)
 [ 6482.182191] sd 11:0:0:0: [sdc] Write Protect is off
 [ 6482.182194] sd 11:0:0:0: [sdc] Mode Sense: 23 00 00 00
 [ 6482.182666] sd 11:0:0:0: [sdc] No Caching mode page present
 [ 6482.182669] sd 11:0:0:0: [sdc] Assuming drive cache: write through
 [ 6482.184429] sd 11:0:0:0: [sdc] No Caching mode page present
 [ 6482.184434] sd 11:0:0:0: [sdc] Assuming drive cache: write through
 [ 6482.216041]  sdc: sdc1
 [ 6482.222538] sd 11:0:0:0: [sdc] No Caching mode page present
 [ 6482.222542] sd 11:0:0:0: [sdc] Assuming drive cache: write through
 [ 6482.222545] sd 11:0:0:0: [sdc] Attached SCSI disk
$ lsusb
...
Bus 002 Device 007: ID 1058:1010 Western Digital Technologies, Inc. Elements External HDD
...
# mount -o rw,noexec,relatime,nosuid,nodev,noauto,user,uid=1000,gid=1000 /dev/sdc1 /mnt/seagate/
$ mount|grep sdc
/dev/sdc1 on /mnt/seagate type fuseblk (rw,nosuid,nodev,noexec,relatime,user_id=0,group_id=0,default_permissions,allow_other,blksize=4096,user)
$ cd /mnt/seagate
bash: cd: /mnt/seagate: Input/output error

The next step was to check my disks with smartctl, from the smartmontools package, which relies on S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology).

From its man page:

smartctl controls the Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) system built into most ATA/SATA and SCSI/SAS hard drives and solid-state drives.
The purpose of S.M.A.R.T. is to monitor the reliability of the hard drive and predict drive failures, and to carry out different types of drive self-tests.

 # aptitude install smartmontools

Now we can use smartctl to find out what’s wrong with the drives. Let’s start by scanning for devices:

# smartctl --scan
 /dev/sda -d scsi # /dev/sda, SCSI device
 /dev/sdb -d scsi # /dev/sdb, SCSI device
 /dev/sdc -d sat # /dev/sdc [SAT], ATA device
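
If you want to feed the scan result into a script, the device name and the type smartctl guessed for it can be pulled out of each line. This is only a small sketch: scan_devices is a helper name I made up, and it assumes the scan lines keep the "device -d type # comment" layout shown above.

```shell
#!/bin/sh
# Print "device type" pairs from `smartctl --scan` output read on stdin.
# Assumed line layout: /dev/sdc -d sat # /dev/sdc [SAT], ATA device
scan_devices() {
    # awk ignores leading blanks, so indented lines are handled too.
    awk '$2 == "-d" { print $1, $3 }'
}

# Typical use (as root):
#   smartctl --scan | scan_devices
```

The type in the second column is what you would pass to smartctl with -d if it ever guesses wrong for a USB bridge.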

/dev/sda and /dev/sdb are my internal SSDs, so the device we need information about is /dev/sdc:

 # smartctl -i /dev/sdc
 smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.2.0-4-amd64] (local build)
 Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
 === START OF INFORMATION SECTION ===
 Model Family:     Western Digital Scorpio Blue Serial ATA
 Device Model:     WDC WD5000BEVT-22ZAT0
 Serial Number:    WD-WX20A69W8493
 LU WWN Device Id: 5 0014ee 2587682d9
 Firmware Version: 01.01A01
 User Capacity:    500,107,862,016 bytes [500 GB]
 Sector Size:      512 bytes logical/physical
 Rotation Rate:    5400 rpm
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ATA8-ACS (minor revision not indicated)
 SATA Version is:  SATA 2.6, 3.0 Gb/s
 Local Time is:    Sat Mar 14 16:45:11 2015 CET
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

I was lucky: despite being an old drive, it has S.M.A.R.T. capability. In contrast, my other external disk, which shows the same I/O problem, lacks it:

 === START OF INFORMATION SECTION ===
 Vendor:               SAMSUNG
 Product:              HM500LI
 User Capacity:        500,107,862,016 bytes [500 GB]
 Logical block size:   512 bytes
 Serial number:        152D20329000
 Device type:          disk
 Local Time is:        Sat Mar 14 15:21:35 2015 CET
 SMART support is:     Unavailable - device lacks SMART capability.
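
The "SMART support is:" lines in the two information sections above are easy to check from a script. The function below is a rough sketch that only looks at those lines; smart_support is a hypothetical helper name, and the strings it matches are the ones shown in the outputs above (plus "Disabled", which I am assuming is what smartctl prints when support is switched off).

```shell
#!/bin/sh
# Classify SMART support from `smartctl -i` output read on stdin.
# Prints one of: enabled, disabled, available, unavailable, unknown.
smart_support() {
    awk -F': *' '
        /^ *SMART support is:/ {
            # A later line (Enabled/Disabled) overrides an earlier "Available".
            if ($2 ~ /^Enabled/)          state = "enabled"
            else if ($2 ~ /^Disabled/)    state = "disabled"
            else if ($2 ~ /^Unavailable/) state = "unavailable"
            else if ($2 ~ /^Available/)   state = "available"
        }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Typical use (as root):
#   smartctl -i /dev/sdc | smart_support
```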

Before running any tests to check how healthy the device is, we must run this command to enable S.M.A.R.T. support, attribute autosave and automatic offline data collection:

# smartctl -s on -o on -S on /dev/sdc
 ...
 === START OF ENABLE/DISABLE COMMANDS SECTION ===
 SMART Enabled.
 SMART Attribute Autosave Enabled.
 SMART Automatic Offline Testing Enabled every four hours.

Our first check:

 # smartctl -H /dev/sdc
 ...
 === START OF READ SMART DATA SECTION ===
 SMART Status command failed: scsi error medium or hardware error (serious)
 SMART overall-health self-assessment test result: PASSED
 Warning: This result is based on an Attribute check.

If the previous command doesn’t return PASSED, you’re in trouble and should back up all your data immediately.
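
Besides the text it prints, smartctl reports problems through a bitmask in its exit status, documented in the EXIT STATUS section of smartctl(8). The sketch below decodes only the four lowest bits; explain_status is a name I invented and the messages are my own short paraphrases of the man page, not its exact wording.

```shell
#!/bin/sh
# Decode the four lowest bits of a smartctl exit status.
# Bit meanings are paraphrased from the EXIT STATUS section of smartctl(8).
explain_status() {
    status=$1
    [ $(( status & 1 )) -ne 0 ] && echo "command line did not parse"
    [ $(( status & 2 )) -ne 0 ] && echo "device open failed"
    [ $(( status & 4 )) -ne 0 ] && echo "a SMART command failed"
    [ $(( status & 8 )) -ne 0 ] && echo "disk failing"
    return 0
}

# Typical use (as root):
#   smartctl -H /dev/sdc; explain_status $?
```

This can help with a mixed result like the one above, where a "command failed" line appears next to PASSED: a failed command and a failing disk are reported on different bits.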

Now we must make sure that our drive supports self-tests. The following command also gives the estimated time for each test:

# smartctl -c /dev/sda
 ...
 === START OF READ SMART DATA SECTION ===
 General SMART Values:
 Offline data collection status:  (0x00) Offline data collection activity
                                         was never started.
                                         Auto Offline Data Collection: Disabled.
 Self-test execution status:      (   0) The previous self-test routine completed
                                         without error or no self-test has ever
                                         been run.
 Total time to complete Offline
 data collection:                (    0) seconds.
 Offline data collection
 capabilities:                    (0x7b) SMART execute Offline immediate.
                                         Auto Offline data collection on/off support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         Conveyance Self-test supported.
                                         Selective Self-test supported.
 SMART capabilities:            (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
 Error logging capability:        (0x01) Error logging supported.
                                         General Purpose Logging supported.
 Short self-test routine
 recommended polling time:        (   1) minutes.
 Extended self-test routine
 recommended polling time:        (  48) minutes.
 Conveyance self-test routine
 recommended polling time:        (   2) minutes.
 SCT capabilities:              (0x0021) SCT Status supported.
                                         SCT Data Table supported.
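
The recommended polling times are the useful part of that output if you plan to script the tests. Here is a minimal sketch that extracts one of them; polling_minutes is a hypothetical helper, and it assumes the two-line layout shown above ("Short self-test routine" followed by "recommended polling time: ( 1) minutes.").

```shell
#!/bin/sh
# Print the recommended polling time, in minutes, for one kind of self-test,
# reading `smartctl -c` output on stdin.
# $1 is the first word of the heading line: Short, Extended or Conveyance.
polling_minutes() {
    awk -v kind="$1" '
        $1 == kind && /self-test routine/ { wanted = 1; next }
        wanted && /recommended polling time/ {
            gsub(/[^0-9]/, "")   # keep only the digits of the line
            print
            exit
        }
        { wanted = 0 }           # the heading must directly precede the time
    '
}

# Typical use (as root):
#   smartctl -c /dev/sdc | polling_minutes Extended
```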

So we can run a short test:

 # smartctl -t short /dev/sdc
 ...
 === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
 Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
 Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
 Testing has begun.
 Please wait 2 minutes for test to complete.
 Test will complete after Sat Mar 14 17:30:03 2015
Use smartctl -X to abort test.

Unfortunately “smartctl -t” returns immediately and doesn’t wait for the test to finish, so you will have to run “smartctl -l selftest” afterwards to see the result. You can poll for it like this (or simply wait the estimated time, around 2 minutes):

 $ while :; do sudo smartctl -l selftest /dev/sdc; sleep 5; done;
=== START OF READ SMART DATA SECTION ===
 SMART Self-test log structure revision number 1
 Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
 # 1  Short offline       Completed without error       00%       289         
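
To act on the result from a script instead of reading it by eye, the status of the most recent entry can be extracted from the log. This is just a sketch: last_selftest_status is a made-up name, and it assumes the column layout shown above, with columns separated by at least two spaces and the newest test listed as "# 1".

```shell
#!/bin/sh
# Print the Status column of the newest entry ("# 1") in the self-test log,
# reading `smartctl -l selftest` output on stdin.
last_selftest_status() {
    awk -F'  +' '
        { sub(/^ +/, "") }               # tolerate leading indentation
        $1 == "# 1" { print $3; exit }   # columns are split on 2+ spaces
    '
}

# Typical use:
#   sudo smartctl -l selftest /dev/sdc | last_selftest_status
```

Anything other than "Completed without error" in that column deserves a closer look at the full log.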

So it seems clear that, for at least one of my external disks, this isn’t a hardware problem with the drive. The question now is why I get I/O errors when I plug external disks into my USB ports.

Just to be sure, I plugged both external USB disks into another laptop and was able to view their contents without any problems.

I’ll have to keep investigating, because I’m afraid my personal laptop could have a hardware failure in its USB subsystem (I tested the drive on all of its USB ports).

I want to thank the author of this blog entry because it has been very useful.

That’s all folks!!!

“The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”
Terry Pratchett (RIP)
