What SMART Stats Tell Us About Hard Drives

esaym · 9 years ago

Ten years ago, when I was trying to learn how to "program", I wrote this bash script (to be added into /etc/cron.daily) that dumps a few SMART stats that are normally 0 or slow changing, compares them with the copy from the previous run, and if anything is different (and cron is configured right) emails you the result. Every Linux machine I touch gets this file dropped onto it. I've replaced many hard drives because of it.

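    #!/bin/bash
    # Daily SMART check: report when slow-changing failure counters change.

    # Dump all SMART data, then keep only the attributes of interest.
    smartctl -a /dev/sda > /root/smartStates
    grep Reallocated_Sector_Ct /root/smartStates > /root/stats
    grep Current_Pending_Sector /root/smartStates >> /root/stats
    grep Offline_Uncorrectable /root/smartStates >> /root/stats
    grep UDMA_CRC_Error_Count /root/smartStates >> /root/stats

    # Make sure a previous snapshot exists on the first run, then compare.
    touch /root/statsOld
    cmp /root/stats /root/statsOld
    result=$?

    # cmp exits 0 (identical), 1 (different), or greater than 1 (trouble).
    if [[ $result -ne "1" && $result -ne "0" ]]
      then
        echo "Something went wrong"
        exit -1
    fi

    # Anything printed here ends up in the mail that cron sends.
    if [[ $result -eq "1" ]]
      then
        echo "Files are different"
        echo
        cat /root/stats
    fi

    # Keep the current snapshot for the next run and clean up.
    mv /root/stats /root/statsOld
    rm /root/smartStates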

sn · 9 years ago

I'm not sure why you wouldn't use smartd, which has sane defaults, can immediately alert based on arbitrary smart properties changing, and also handles scheduling smart tests. An arbitrary command can be run instead of sending email.

For example, reallocated sectors are not alerted on by default, so we added '-R 5!' to our smartd config. The full config we have is:

    DEVICESCAN -a -s (L/../../6/01) -l selftest -l error -m <email> -M daily -M test -R 5!

djsumdog · 9 years ago

My familiarity with SMART was on the POST screen. I never ran any daemons; I didn't know they were a thing. Like so many other nerds who grew up in the 90s, I've experienced data loss. At best it was just some porn; at worst it was those rare VHS tapes you ripped of random high school crap.

I feel like there weren't enough tools for users to simply monitor SMART stats, or enough awareness of how it works. Even in this article, it seems to take a lot of analysis to see if the reported flags are significant.

koolba · 9 years ago

(Warning: pedantic script review)

Add a "set -e" to catch errors, for example if the disk can't be read or a file can't be written.

Why reuse the same temp file? Make a new one with mktemp and auto clean it via an exit trap. As it's written this isn't concurrently safe.

Exiting -1 on error? Don't use negatives.

Wrap it all in a main() function and use locals instead of global vars.
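
For illustration, here is a minimal sketch of how those suggestions might fit together. It is not the original author's script: `extract_stats`, the argument defaults, and the alert text are illustrative names chosen here, and it assumes `smartctl` is installed and that the script runs as root.

```shell
#!/bin/bash
# Sketch of the review suggestions applied; names and paths are illustrative.
set -e  # abort on any unhandled command failure

# Keep only the slow-changing attributes from `smartctl -a` output on stdin.
extract_stats() {
    grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count'
}

main() {
    local device="${1:-/dev/sda}"
    local previous="${2:-/root/statsOld}"
    local current

    # Fresh temp file per run, removed on exit even if a later step fails.
    current="$(mktemp)"
    trap "rm -f '$current'" EXIT

    smartctl -a "$device" | extract_stats > "$current"
    touch "$previous"

    # cmp's non-zero status is consumed by `if`, so `set -e` does not trip.
    if ! cmp -s "$current" "$previous"; then
        echo "SMART stats changed on $device:"
        cat "$current"
    fi

    cp "$current" "$previous"
}

# main "$@"   # uncomment when installing the script under /etc/cron.daily
```

One design note: `smartctl` reports its status as a bitmask that can be non-zero even on a working drive, so the sketch deliberately does not check it; only the final stage of the pipeline decides whether `set -e` aborts the run.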

jimmaswell · 9 years ago

It's just a small bash script, and one that's apparently worked well for 10 years. Rewriting it to J2EE standards would just be a waste of time; the best outcome is that it still works the same, and the other outcome is that you introduced a new bug refactoring it.

caf · 9 years ago

You don't even need the 'smartStates' temporary file at all, since you can do that with just one grep:

    smartctl -a /dev/sda |
      grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count' > /root/stats

ptman · 9 years ago

Also https://www.shellcheck.net/

alphapapa · 9 years ago

This is basically what smartmontools does:

    /var/log/syslog:Oct  6 08:14:10 hostname smartd[573]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 62
    /var/log/syslog:Oct  6 09:44:10 hostname smartd[573]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 61
    /var/log/syslog:Oct  6 10:14:10 hostname smartd[573]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 60
    /var/log/syslog:Oct  6 18:44:10 hostname smartd[573]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 59

lazyjones · 9 years ago

I respectfully disagree. Judging from the output you posted, it just spams logfiles with irrelevant info, training the user to ignore it, so they will never notice when important values change in an interesting way.

ekianjo · 9 years ago

Did you keep any data on how often you had to replace your drives because of such signs?

mynameislegion · 9 years ago

Now you can just `apt install smart-notifier`

jtl999 · 9 years ago

Looks useful

redtexture · 9 years ago

How about: a comment on (1) what this bash script does, (2) and why, and (3) the reason for dumping a disk as a result.

boxfire · 9 years ago

1) How about reading the script? 2) How about reading the subject? 3) How about understanding it?

mjb · 9 years ago

SMART is a fantastic exercise in sensitivity and specificity. As Backblaze is showing with this data, SMART stats have poor sensitivity, but what's much worse for those who run big fleets of drives is their poor specificity. Lots of healthy drives are reported unhealthy by SMART. If I'm running a gold-plated database server, that doesn't matter. A couple of extra planned drive replacements is a small price to pay for avoiding unplanned failures. If I'm running a huge drive cluster, it's much, much more expensive.

Take Backblaze's 0.01% for a group of four failures. That's replacing an extra 100 drives per million, at random, and only getting the benefit of correctly predicting failures 10.4% of the time.
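
The "100 drives per million" figure is just the 0.01% rate applied to a fleet of one million. As a quick sketch of that arithmetic (the function name is made up for illustration):

```shell
# Number of drives flagged in a fleet, given a flag rate in percent.
flagged_drives() {
    local fleet_size="$1" rate_percent="$2"
    awk -v n="$fleet_size" -v p="$rate_percent" 'BEGIN { printf "%.0f\n", n * p / 100 }'
}

flagged_drives 1000000 0.01   # prints 100
```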

This is great data to have.

Retric · 9 years ago

Thresholds are often useful with these kinds of stats. For example, a drive reallocating one sector might mean nothing, but reallocating 30 in a week could be a great predictor. Further, they only have 70k drives across a range of product lines, so what predicts drive X failing very well might say little about drive Y.

PS: Remember that all RAM gets bit-flip errors over time, which is one of the reasons rebooting is often so useful, but it also means one-off errors are often meaningless.
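
A rough sketch of the rate-threshold idea for sector counts, assuming you keep the raw reallocated-sector count from a week ago and compare it with today's. The function name is hypothetical, and the default limit of 30 is only the figure used above, not a validated cutoff:

```shell
# Succeeds (exit 0) when the raw reallocated-sector count grew by at least
# `limit` between two readings, e.g. taken a week apart.
# All three arguments are non-negative integers.
reallocation_alert() {
    local previous="$1" current="$2" limit="${3:-30}"
    (( current - previous >= limit ))
}

if reallocation_alert 12 45; then
    echo "reallocated sectors rising fast: consider replacing the drive"
fi
```

A single new sector (say 12 to 13) stays silent, while a jump of 33 in the same period triggers the message. The right limit would likely differ per drive model.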

sn · 9 years ago

ECC RAM will either correct the error or raise a non-maskable interrupt if it can't be corrected.

joosters · 9 years ago

It's unfair to blame SMART stats for this. On its own, SMART does not declare a drive 'healthy' or 'unhealthy'; it's up to the user to decide. Each stat on its own seems very specific, e.g. the number of reallocated sectors. I'm not sure how you think this is 'poor specificity': the stats are measuring one thing exactly, and you can't get more specific than that!

fnj · 9 years ago

Actually, it does declare. Direct quote from smartctl output: "SMART overall-health self-assessment test result: PASSED"

devonkim · 9 years ago

SMART stats are only one source of data that can predict drive failures, though. Hard drives can experience latency spikes and erratic performance compared to peers before SMART records problems, for example. This is part of why predictive monitoring can be so difficult - the data you didn't know you needed is probably among the data you didn't ingest into your metrics, and you can't dump literally everything in /proc every other second to your metrics system without sacrificing some CPU or network bandwidth either.

andy4blaze · 9 years ago

Nice analysis.

latitude · 9 years ago

Might be a good time to plug my little baby - https://diskovery.io

If you want to have a quick, but in-depth look at your drives, it'll give you lots of data, including the SMART table interpreted in a vendor-specific way. It also understands some RAID setups, and more support for this is upcoming. Windows only, at the moment.

To give a bit of context - SMART data comprises a set of attributes, and each attribute has a value, a threshold and a raw value. Values are opaque 8-bit somethings that are only meant to be compared to thresholds. When they fall under them, it may indicate a problem. They aren't really interesting. What's interesting is the "raw" values, but as the name implies, they are vendor-specific and require decoding. Some vendors publish the specs, but most don't. Specs that are published are often incomplete or plain wrong. So there's a LOT of reverse engineering and guesswork involved, which makes writing a SMART tool both frustrating and interesting at the same time. But if you need just the "dying / healthy" indicator, it's a very easy thing to extract from a drive.
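
As a small illustration of the value/threshold comparison, here is a sketch that reads an attribute table in the column layout printed by `smartctl -A` and lists attributes whose normalized value is at or below a non-zero threshold. The function name and the sample rows are made up for the example, and it makes no attempt to decode the raw values:

```shell
# Print the names of attributes whose normalized value has fallen to or
# below a non-zero threshold. Reads a `smartctl -A` style table on stdin.
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0; thresh = $6 + 0
        if (thresh > 0 && value <= thresh) print $2
    }'
}

# Sample rows (made-up numbers) in the column order:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
failing_attributes <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   030   030   036    Pre-fail  Always   FAILING_NOW 1912
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
EOF
```

With these sample rows only `Reallocated_Sector_Ct` is listed. The second row has a threshold of 000, which the sketch treats as "never flag".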

djsumdog · 9 years ago

Has anyone ported your work to Linux or macOS? I guess not, since it looks like it would be very OS specific. It looks like an incredible tool.

latitude · 9 years ago

It is indeed pretty OS specific.

Not the SMART part, but how you talk to the drives and controllers and how storage is generally sliced into partitions, volumes, etc. Windows has a fairly comprehensive version of Software RAID, but in true Microsoft fashion they do things ass-backwards in more than one place. For example, striped volumes (RAID 0) will use only a part of a partition for each stripe, but to learn that you'd have to talk to Virtual Disk Service rather than regular Disk/Volume management API. This is, basically, as unportable as it gets.

daveguy · 9 years ago

Wow. They provide all of the raw log data from the drives[0]. Looks like an interesting source of data for a Kaggle competition.

[0]https://www.backblaze.com/b2/hard-drive-test-data.html

tedunangst · 9 years ago

Isn't the reverse stat more interesting? What percentage of drives reporting an error fail within X weeks? I don't want to know how many failed drives had errors, I want to know how many errored drives fail. (A more accurate title might be "What failed drives tell us about SMART stats".)
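
To make the distinction concrete, here is a sketch that computes both directions from three fleet counts. The function name is hypothetical and the counts are invented purely for illustration; they are not Backblaze's numbers:

```shell
# Given counts from a fleet, print both conditional fractions:
#   P(error|failed) = share of failed drives that had reported the error
#   P(failed|error) = share of error-reporting drives that went on to fail
# Arguments: failed_with_error failed_total errored_total
conditional_rates() {
    awk -v both="$1" -v failed="$2" -v errored="$3" 'BEGIN {
        printf "P(error|failed)=%.2f P(failed|error)=%.2f\n", both / failed, both / errored
    }'
}

# Made-up counts, only to show that the two numbers can differ widely.
conditional_rates 80 100 400   # prints P(error|failed)=0.80 P(failed|error)=0.20
```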

arielhn · 9 years ago

Small nitpicking here, but the moment a site pops up a modal dialog asking me to enter my email for some kind of subscription, I simply close the tab. I have been doing this for three months now on any unknown website I visit.

Such a nuisance for what might be a good read.

teh_klev · 9 years ago

Shame...you could've just dismissed the popup and not missed out on an interesting article; it's the same energy expended but with a net gain instead of your loss. A small price to pay for Backblaze being willing to share interesting stuff like this, and hardly the most egregious example of this type of thing. Also, these types of complaints have been done to death here over the years and are really, really tedious. Please complain to Backblaze instead of trying to take this thread off-topic.

arielhn · 9 years ago

First of all, I was on public transport when I clicked on that link, so my 'consuming' experience was already not optimal from a reader's point of view. Many technical people, like me, are busy people with little tolerance for things that detract from what they're supposed to read or comprehend. Unless I can read right there, right away, I'm just going to skip to the next tab.

Secondly, I've noticed that this is a trend right now: you get to a page and after a few seconds a dialog is thrown in your face with little regard for the fact that you (the reader) are trying to concentrate on the content. To me that is rude. You don't go to a bookstore and, while you're reading the table of contents, have a salesman grab the book from you and ask "would you like me to take your email address so that we can notify you when we have new books available?" without wondering what kind of establishment allows this kind of behavior.

Third, I got the link from HN, so it was easier for me to go back to this tab, log in, and hit reply than to register a Disqus account and enter a comment there.

With that said, I don't want to blog about this on Medium or wherever, and I don't need clicks from moaning about every little thing. This is my way of protesting what I perceive is happening right now, and that's why I started with "small nitpicking".

jlarocco · 9 years ago

On the contrary, those popups are a really stupid and annoying trend, and I think it's good people voice their dissatisfaction with them.

Without people leaving the sites and complaining on places like HN, web designers will have no feedback that it's such a stupid idea.

CamperBob2 · 9 years ago

> Shame...you could've just dismissed the popup and not missed out on an interesting article

But I don't know that until I've already given up the goods, do I? This approach to life, the universe, and the Internet simply doesn't scale.

djsumdog · 9 years ago

Yea, I already run uBlock Origin. I don't mind mailing list popups. The authors should make sure someone gets at least 50% through the page before showing them. I have a feeling that will get a high click through rate ... err...sign up rate.

I hate ads. I block all of them. But I will help your crowd-funding or Patreon or buy some swag to help you promote your thing.

d3lxa · 9 years ago

As a data scientist, I would be curious to see the application of machine learning to this problem. I'd start with naive Bayes, logistic regression and SVM.

@backblaze I'm pretty sure you can automate a large portion of your investigation that way.

StillBored · 9 years ago

First, these counters vary in meaning and support by vendor/model, and it would be nice if someone were to come along and mandate further standardized ones. Instead you have to tune everything for each drive model. In this regard SCSI is a little better (more on that later).

Second, timeouts and uncorrectable errors are generally being reported to the controller as part of normal operation. So having SMART tracking them is just a bonus. Either of those two conditions is usually sufficient to kick a drive out of a functional RAID array because those are data loss events. Most drives have layers and layers of ECC, so in order to get an uncorrectable error a lot of bits need to be flipped in the target sector. For that to happen it likely indicates there is something mechanical going on which is likely to affect adjacent tracks/sectors. Of course, if you never scrub your drives it's possible bitrot accumulates on a perfectly functional device until sectors aren't recoverable.

In my previous life I found it much more interesting to track the rate of soft error counts during scrub operations, particularly in larger arrays, because sometimes a drive would start getting slower (which is frequently caused by read retries in the drive itself or problems tracking the embedded servo, etc.) and the correctable error counts would start to rise steadily, followed by actual timeouts/uncorrectable errors. Of course these days, it seems most drives won't show the correctable error counts because it would freak people out. Instead you have to infer it from seek errors and reallocated sector counts, although it might now be considered a SAS/SATA differentiator. SCSI has standardized log pages with more detailed information (random Google hit: http://www.seagate.com/staticfiles/support/disc/manuals/scsi... page 238). Note the errors are categorized as corrected without delay, corrected with substantial delay, and corrected on a retry. By comparison the SMART data isn't particularly "smart".