Backblaze's storage infrastructure, which it refers to as "Storage Pods." The company has almost 40,000 drives in its data center, and a study over the past several years shows that SMART software stats often don't relate to the health of those drives. Credit: Backblaze
Hard drive firmware that IT administrators use to monitor hard drive health is highly inconsistent from drive to drive and manufacturer to manufacturer, according to figures collected from nearly 40,000 spindles.
The data, released today from cloud service provider Backblaze, also indicated which five of the 70 metrics that SMART stats cover are likely to predict a hard drive failure.
SMART, or Self-Monitoring, Analysis, and Reporting Technology, is nearly ubiquitous firmware that vendors embed as tools to alert IT admins to impending problems.
Due to a lack of industrywide SMART software and hardware standards, SMART data cannot be exchanged between vendor products. Vendors can also use SMART data to analyze issues across drive lines.
For several years, Backblaze has collected data on hard drive failures. It has released that data in company blogs, highlighting which manufacturers' drives failed more often than others.
Backblaze's most recent study delved into SMART alerts based on the 40,000 or so hard drives the company has in its data center.
It found that five SMART stats do predict drive failures, according to Backblaze CEO Gleb Budman.
SMART software reports drive issues as values under numbered attributes, or categories, which range from SMART stat 1 to 253 (not all numbers in between are used). For example, SMART 1 tracks data read error rates, which are displayed as a decimal number. SMART 240 tracks the amount of time that a drive spends positioning read/write heads.
Backblaze's analysis of nearly 40,000 drives showed five SMART metrics that correlate strongly with impending disk drive failure:
- SMART 5 - Reallocated_Sector_Count.
- SMART 187 - Reported_Uncorrectable_Errors.
- SMART 188 - Command_Timeout.
- SMART 197 - Current_Pending_Sector_Count.
- SMART 198 - Offline_Uncorrectable.
Backblaze counts a drive as failed when it is removed from a storage array and replaced because it has totally stopped working or because it has shown evidence of failing soon.
A drive is considered to have stopped working when the drive appears physically dead (e.g. won't power up), it doesn't respond to console commands or the RAID system reports that the drive can't be read or written.
"To determine if a drive is going to fail soon we use SMART statistics as evidence to remove a drive before it fails catastrophically or impedes the operation of the Storage Pod volume," Budman said.
For example, SMART stat 187 reports the number of reads that could not be corrected using hardware error correction code (ECC). Drives with 0 uncorrectable errors hardly ever fail, Budman said, "but once SMART 187 goes above 0, we schedule the drive for replacement."
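As a rough illustration of how an administrator might apply these findings, the sketch below checks a drive's raw SMART values against the five attributes Backblaze identified. The attribute numbers and names come from the list above, and the "above 0" rule for SMART 187 is the one Budman describes. Everything else is an assumption made for illustration: the function names, the dictionary of raw values as input, and the use of a zero threshold for the other four attributes, which Backblaze does not state here.

```python
# Hypothetical sketch, not Backblaze's actual tooling.
# Attribute IDs and names are taken from the article's list.
PREDICTIVE_ATTRIBUTES = {
    5: "Reallocated_Sector_Count",
    187: "Reported_Uncorrectable_Errors",
    188: "Command_Timeout",
    197: "Current_Pending_Sector_Count",
    198: "Offline_Uncorrectable",
}


def attributes_to_review(raw_values):
    """Return the names of the five attributes whose raw value is above 0.

    raw_values maps a SMART attribute ID to its raw value (assumed input
    shape). A zero threshold for all five attributes is an assumption;
    the article only confirms it for SMART 187.
    """
    return [
        name
        for attr_id, name in PREDICTIVE_ATTRIBUTES.items()
        if raw_values.get(attr_id, 0) > 0
    ]


def should_schedule_replacement(raw_values):
    """Apply the one rule quoted in the article: SMART 187 above 0."""
    return raw_values.get(187, 0) > 0
```

A drive reporting `{5: 3, 187: 0}` would be listed for review because of its reallocated sectors, but it would not be scheduled for replacement under the SMART 187 rule alone.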