Hard disk drives are central to data storage, but as with all computing hardware, failure is a question of when, not if.
When failure occurs, the data stored on these drives is lost. Recovery may be an option, but it can cost up to $7,500 per drive.
Sherweb Cloud Solutions found that “While 57% of IT managers have a backup solution in place, 75% of them were not able to restore all of their lost data. In fact, 23% of people with a backup solution in place weren’t able to recover any data at all” (Painchaud, 2018).
Keeping a constant backup of every drive could double costs and still not guarantee complete data safety.
Our business's data and the operations built on it are simply too important to leave to chance. We need to do more than just back up data for when a main drive stops working. We need a solution that can help us determine when a drive is likely to fail, so that we can back up its data and replace the drive before loss occurs.
Study factors significantly indicate impending hard disk drive failure and can be used to predict failure on the day it happens.
Every analysis starts by checking and cleaning data.
Before analysis, the data was carefully split into three sets.
Keeping the sets completely separate gives confidence that the solutions will work with data they have never encountered before.
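The split described above can be sketched as follows. This is a minimal illustration, not the study's actual procedure: it assumes the three sets are training, validation, and test sets, that the split is stratified by failure label, and that the proportions are 60/20/20. The function name `split_three_ways` and the toy labels are hypothetical.

```python
import random

def split_three_ways(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, validation, test) index lists, stratified by label.

    The 60/20/20 fractions are an assumption for illustration only.
    """
    rng = random.Random(seed)

    # Group row indices by class so that each set keeps some rare failures.
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)

    train, validation, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, validation, test

# Toy labels: 1 = failure, 0 = healthy (far less imbalanced than the real data).
labels = [1] * 10 + [0] * 990
train, validation, test = split_three_ways(labels)
```

Because every index lands in exactly one list, no row can leak from one set into another.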
The dataset started with nearly 11 million instances, yet had only 678 failures. Drive failure in the 4th quarter was a 1 in 16,210 chance, but 7.4 drives on average failed every single day. Events this rare require specialized techniques to model.
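The figures above can be cross-checked with a few lines of arithmetic. This sketch assumes that the quarter covers the 92 days of October through December and that the 1 in 16,210 figure is failures per instance; neither assumption is stated explicitly in the study.

```python
failures = 678
days_in_q4 = 31 + 30 + 31  # October, November, December

# Average number of failures per calendar day.
failures_per_day = failures / days_in_q4

# Instance count implied by a 1 in 16,210 failure rate.
implied_instances = failures * 16_210

print(f"{failures_per_day:.1f} failures per day")      # 7.4 failures per day
print(f"{implied_instances:,} instances implied")      # 10,990,380 instances implied
```

Both results agree with the reported figures: about 7.4 failures per day and just under 11 million instances.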
The Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the training dataset.
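The core idea of SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that idea, not the implementation used in the study (which is not specified). The function name `smote`, the default `k=5`, and the toy points are assumptions for illustration.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority points."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        base_index = rng.randrange(len(minority))
        base = minority[base_index]

        # Find its k nearest minority neighbours (excluding itself).
        others = [p for i, p in enumerate(minority) if i != base_index]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]

        # Place the new point somewhere on the segment base -> neighbour.
        neighbour = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class: four failure records with two numeric attributes each.
failures = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(failures, n_synthetic=20, k=2)
```

Oversampling of this kind would be applied to the training set only, so that the held-out sets still reflect the true rarity of failures.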
The analysis formally begins by calculating how each attribute relates to every other attribute.
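One common way to quantify such pairwise relations is a Pearson correlation matrix. The sketch below assumes that this is the kind of measure meant; the study does not name the statistic it used. The column names `attr_a`, `attr_b`, and `failure` are placeholders, not the study's attributes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return covariance / (spread_x * spread_y)

def correlation_matrix(columns):
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Placeholder data: two attributes and the failure label for five drive-days.
columns = {
    "attr_a": [0, 1, 2, 3, 10],
    "attr_b": [5, 5, 4, 4, 1],
    "failure": [0, 0, 0, 0, 1],
}
matrix = correlation_matrix(columns)
```

Reading the `failure` row of such a matrix gives a first ranking of which attributes move together with drive failure.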
Finally, these results were used to create six different models. These models leverage machine learning and artificial intelligence techniques to predict drive failure.
The attributes most closely related to drive failure are:
Model | Sensitivity | Specificity | Precision | Error Rate | ROC AUC |
---|---|---|---|---|---|
Logistic Regression | 0.6397 | 0.9732 | 1.1478e-3 | 2.68% | 0.8729 |
Decision Tree | 0.4412 | 0.9690 | 0.8829e-3 | 3.10% | 0.6900 |
Random Forest | 0.3603 | 0.9903 | 2.2900e-3 | 0.98% | 0.7974 |
Class-Weighted Random Forest | 0.4044 | 0.9717 | 0.8858e-3 | 2.83% | 0.7998 |
Simple DNN | 0.6176 | 0.9185 | 0.4696e-3 | 8.15% | 0.7681 |
Complex DNN | 0.7132 | 0.9364 | 0.6946e-3 | 6.36% | 0.8248 |
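For readers unfamiliar with the column headings, the sketch below shows how sensitivity, specificity, precision, and error rate follow from a confusion matrix, assuming the standard definitions. The counts are invented for illustration and are not the study's results. ROC AUC is omitted because it depends on the model's full range of scores rather than a single confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # share of real failures that were caught
        "specificity": tn / (tn + fp),   # share of healthy drives left alone
        "precision": tp / (tp + fp),     # share of alarms that were real failures
        "error_rate": (fp + fn) / total, # share of all predictions that were wrong
    }

# Invented counts: 100 failures among 100,100 drive-days.
metrics = classification_metrics(tp=64, fp=2_700, tn=97_300, fn=36)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Even with specificity above 97%, the false alarms in this toy example far outnumber the true detections, because healthy drive-days vastly outnumber failures. The same effect explains why the precision values in the table are so small while specificity remains high.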
The metrics of success all show that these models, which rely on the study factors, perform significantly better than chance.
Integrate either the logistic regression model or the more complex DNN model into the daily drive diagnostics and backup pipeline.
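As a rough sketch of what such an integration might look like, the code below scores each drive's daily reading with a logistic model and returns the drives to back up first. Everything here is hypothetical: the weights, bias, threshold, attribute names, and serial numbers are placeholders, and a real deployment would load the coefficients of the fitted model instead.

```python
import math

# Placeholder coefficients for illustration only; not the study's fitted model.
WEIGHTS = {"attr_a": 0.8, "attr_b": 0.5}
BIAS = -4.0

def failure_probability(reading):
    """Logistic-regression score for one drive's daily attribute reading."""
    z = BIAS + sum(WEIGHTS[name] * reading[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def drives_to_back_up(daily_readings, threshold=0.5):
    """Return serial numbers whose predicted failure probability meets the threshold."""
    return [
        serial
        for serial, reading in daily_readings.items()
        if failure_probability(reading) >= threshold
    ]

# Hypothetical daily diagnostics for two drives.
daily_readings = {
    "DRIVE-001": {"attr_a": 0.0, "attr_b": 0.0},
    "DRIVE-002": {"attr_a": 6.0, "attr_b": 4.0},
}
flagged = drives_to_back_up(daily_readings)
print(flagged)  # ['DRIVE-002']
```

The threshold controls the trade-off between missed failures and unnecessary backups, so it would need to be chosen against the cost of each.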
Until this solution is in place, special care should be taken with drives whose SMART attribute values are elevated for the three main study factors:
After the model solution is in place, additional research beyond the scope of this study is warranted.
Backblaze. (2020). data_Q4_2019. San Mateo, CA; Backblaze.
Painchaud, A. (2018, October 31). 8 Reasons on How Data Loss Can Negatively Impact Your Business. https://www.sherweb.com/blog/security/statistics-on-data-loss/.