The Agony of Disk Failure
Monday, April 30th, 2007
After many years of steadily rising hard drive reliability claims, a recent study by Carnegie Mellon University and Google has found a much higher failure rate than manufacturers usually claim. The study, cited in an article on PC World’s site, suggests failures occurred at least four times as often as industry figures indicate. Its point of comparison was the MTTF, or “mean time to failure,” specification that vendors often use to indicate the reliability of their drives. For obvious reasons, this figure isn’t based on long-term studies of current drives; instead it’s based on historical models of similar disks.
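To put an MTTF number in perspective, it helps to turn it into a rough yearly failure rate. The short Python sketch below does that arithmetic. The 1,000,000-hour MTTF is a hypothetical datasheet value chosen for illustration, not a figure from the study, and the simple division is only an approximation that holds when the MTTF is much longer than a year and the drive runs around the clock.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def annualized_failure_rate(mttf_hours: float) -> float:
    """Approximate fraction of drives expected to fail in a year of 24x7 use.

    This is the common rule-of-thumb conversion (hours per year / MTTF).
    It assumes a constant failure rate, which real drives may not follow.
    """
    return HOURS_PER_YEAR / mttf_hours


# Hypothetical vendor claim, used only for illustration.
claimed_mttf_hours = 1_000_000
claimed_afr = annualized_failure_rate(claimed_mttf_hours)

# The article's "at least four times as often" applied to that claim.
observed_afr = 4 * claimed_afr

print(f"Claimed:  {claimed_afr:.2%} of drives per year")   # 0.88%
print(f"Observed: {observed_afr:.2%} of drives per year")  # 3.50%
```

In other words, under these assumptions a claim that sounds like "over a century per drive" works out to a bit under one drive in a hundred failing per year, and four times that rate is roughly one drive in thirty.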
The study didn’t state that more drives are failing catastrophically, but rather that more are reporting errors during read/write operations than would be expected based on manufacturer claims. A catastrophic failure is somewhat rare (and can be dramatic, especially if the drive motor seizes or a head snaps loose), whereas a data error means a disk was unable to read or write data at one or more specific locations. When that happens, the disk usually marks the affected sector as unusable internally and re-vectors the write attempt to a different location. That works well for writes, but if the disk was attempting to read data when the failure occurred, you’ve lost something that was previously written successfully to that location. This could well mean an entire file has become corrupted (it all depends on the exact nature of the error and whether Scandisk or another utility can recover all or part of the data), so the consequences of such a failure can also be fairly severe.
Drives can fail for a number of reasons. Heat and vibration are the biggest enemies, along with shock (mostly in laptop disks) and, rarely, exposure to very strong magnetic fields. Many people don’t vent their PC’s case adequately, and tend to jam tower machines into cramped quarters that prevent adequate air exchange. This can lead to disk (and CPU, and memory…) failure since the temperatures inside such a case can skyrocket to 150 degrees Fahrenheit or more. Serious gamers, whose extreme hardware generates higher than normal heat loads, often install multiple fans or even liquid cooling systems in order to minimize case temperatures and improve component lifetime.
Shock can also be caused by someone kicking or hitting a PC, and can cause localized drive failures. The heads in a disk ride on an extremely small cushion of air just above the spinning platter, and it’s possible for a strong shock to cause one or more heads to slap against the platters. If this happens, it’s bye-bye data (and maybe the entire disk).
The best defense against this type of data loss is a simple one: back up your data on a regular basis. Putting a copy on a separate drive or tape is the best way to ensure it’ll be there when you need it. You should also run Scandisk or another disk-checking tool on a regular basis; if it reports any errors, consider replacing the disk immediately. Once a drive starts throwing errors, it’s usually heading rapidly down the path toward a major failure.
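For readers comfortable with a little scripting, here is a minimal Python sketch of the "put a copy on a separate drive" advice. It copies a folder and then compares checksums so you know the copy is actually readable. The function names and the example paths are made up for illustration, it needs Python 3.9 or later, and it is not a substitute for a real backup program (it keeps no history and never deletes stale files from the backup).

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, destination: Path) -> list[Path]:
    """Copy the source tree to destination, then re-read both sides.

    Returns the source files whose backup copy does not match.
    An empty list means every file was copied and read back intact.
    """
    # dirs_exist_ok lets the same destination be reused on later runs.
    shutil.copytree(source, destination, dirs_exist_ok=True)

    mismatched = []
    for src_file in source.rglob("*"):
        if src_file.is_file():
            copy = destination / src_file.relative_to(source)
            if sha256_of(src_file) != sha256_of(copy):
                mismatched.append(src_file)
    return mismatched


# Example use (hypothetical paths; adjust for your own machine):
#   bad = backup_and_verify(Path("C:/Users/me/Documents"), Path("E:/Backup/Documents"))
#   print("Backup OK" if not bad else f"Problems with: {bad}")
```

Because the verification step reads every file in full on both drives, a read error on either disk shows up as an exception or a mismatch, which is a useful early warning in its own right.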
A final tip: listen to your PC. If it starts making any sort of squealing or loud whirring noise, it’s telling you that a spinning component (power supply fan, CPU fan, or disk drive) is starting to fail. If this happens, have someone look at the system immediately.