The postmortem and what you need to know
We’ve known for some time that a solid-state drive (SSD) can endure only a finite number of writes. This is because the NAND flash memory commonly used for SSD data storage can only be erased a certain number of times.
SSD manufacturers publish this specification as life expectancy, endurance, or mean time between failures (MTBF). More expensive datacenter-type SSDs have a longer endurance specification.
To mitigate drive failures and extend drive life, manufacturers set aside a certain number of blocks beyond the published capacity and use these as spare blocks. When a block of memory wears out, a spare block is used in its place.
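To make the spare-block idea concrete, here is a toy model in Python. It is only a sketch: the class name `ToySSD` and its parameters are made up for illustration, and it ignores wear leveling and everything else a real drive controller does. It shows just the one behavior described above, where a worn-out block is swapped for a spare until the spares run out.

```python
class ToySSD:
    """Toy model of spare-block remapping (hypothetical, not a real controller)."""

    def __init__(self, visible_blocks, spare_blocks, erase_limit):
        self.erase_limit = erase_limit
        # Logical block i starts out mapped to physical block i.
        self.mapping = list(range(visible_blocks))
        # Erase counters for every physical block, spares included.
        self.erase_counts = [0] * (visible_blocks + spare_blocks)
        # Physical blocks held in reserve beyond the published capacity.
        self.spares = list(range(visible_blocks, visible_blocks + spare_blocks))

    def erase(self, logical_block):
        physical = self.mapping[logical_block]
        if self.erase_counts[physical] >= self.erase_limit:
            # The block is worn out: swap in a spare if one is left.
            if not self.spares:
                raise RuntimeError("no spare blocks left")
            physical = self.spares.pop(0)
            self.mapping[logical_block] = physical
        self.erase_counts[physical] += 1


ssd = ToySSD(visible_blocks=2, spare_blocks=1, erase_limit=3)
for _ in range(4):
    ssd.erase(0)  # the fourth erase forces a remap to the spare
print("logical block 0 now lives on physical block", ssd.mapping[0])
print("spares remaining:", len(ssd.spares))
```

Once the spares are gone, the next worn-out block has nowhere to go, which is the point at which a real drive starts to misbehave.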
To complicate matters, if you use SSDs in a server application behind a hardware RAID card, drive health information is hidden from typical operating system monitoring tools. This is why a recent failure caught us by surprise; normally our monitoring tools would have alerted us. To examine drive health, we had to remove the drives and plug them into a non-RAID configuration where the Intel SSD tool could interrogate them directly.
We were surprised to learn we had written over 500TB to these datacenter-grade drives, at least 2x the rated “endurance” specification. The MTBF of 2 million hours was never reached, but the write capacity was exhausted. In the end, the drives lasted approximately two years in a write-intensive database application. That is a far shorter service life than we would expect from spinning media.
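The arithmetic behind those numbers is worth spelling out. The sketch below uses the figures from this post (500TB written over roughly two years). The rated endurance of 250TB is an assumption inferred from the "at least 2x" remark, not a number taken from a datasheet, and the function names are ours.

```python
def average_daily_writes_tb(total_written_tb, days_in_service):
    """Average write volume per day, in TB."""
    return total_written_tb / days_in_service


def days_until_exhausted(rated_endurance_tb, daily_writes_tb):
    """Days of service before the rated write endurance is used up."""
    return rated_endurance_tb / daily_writes_tb


total_written_tb = 500      # from the post: "over 500TB"
days_in_service = 2 * 365   # from the post: "approximately two years"
rated_endurance_tb = 250    # ASSUMPTION: "at least 2x" implies a rating of at most ~250TB

daily = average_daily_writes_tb(total_written_tb, days_in_service)
rated_days = days_until_exhausted(rated_endurance_tb, daily)

print(f"average writes: {daily:.2f} TB/day")
print(f"rated endurance used up after about {rated_days:.0f} days")
```

Under these assumptions the workload writes a little under 0.7TB per day, so a drive rated for 250TB would be past its endurance rating after about one year. Running the same calculation against a candidate drive's published endurance figure before buying gives a rough service-life estimate for a known workload.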
Fortunately, no data was lost – the problem manifested itself as an extreme performance slowdown – write latency in the 4-5 second range. We were able to copy the data off to spinning media and get the server back up.
So what is to be learned from this? Next time, we’ll buy even more expensive datacenter drives with a higher write endurance. We use software RAID at the operating system level whenever possible as a best practice, but it was not possible in this application. With software mirroring or RAID, you’ll get alerts in the event log that professional monitoring tools can trap. If you’re stuck with hardware RAID, buy the highest endurance SSDs available, or use spinning media.
Also note that in a workstation application, this is probably nothing to worry about – but you do want to keep an eye on drive health with a SMART check (just like you should be doing with spinning drives).
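As a rough illustration of a scripted SMART check, the Python sketch below reads the attribute table printed by `smartctl -A` (from the smartmontools package) and flags attributes whose normalized value is close to its failure threshold. Several things here are assumptions rather than anything from our setup: that smartmontools is installed, that the device path you pass is correct, that the output uses the standard ten-column attribute table, and that a margin of 10 is a sensible warning level. The function names are ours. Attribute names vary by vendor, and behind a hardware RAID card this may report nothing useful, which is exactly the problem described above.

```python
import subprocess


def parse_smart_attributes(text):
    """Parse the attribute table printed by `smartctl -A` into a dict.

    Assumes the standard layout: ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST,
    THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip headers and blanks.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = {
                "id": int(fields[0]),
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": " ".join(fields[9:]),
            }
        except ValueError:
            # Row did not match the expected numeric columns; ignore it.
            continue
    return attrs


def attributes_near_threshold(attrs, margin=10):
    """Names of attributes whose normalized value is within `margin` of the
    threshold. The margin is an arbitrary illustrative choice."""
    return sorted(
        name for name, a in attrs.items() if a["value"] <= a["thresh"] + margin
    )


def read_smart_attributes(device):
    """Run `smartctl -A <device>` and parse its output.

    smartctl's exit status is a bitmask, so a non-zero status is not
    treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_smart_attributes(result.stdout)
```

Calling `attributes_near_threshold(read_smart_attributes("/dev/sda"))` from a scheduled job (the device path is only an example, and smartctl usually needs administrator rights) would give a list of attribute names to alert on.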