How to check the raid controller on Linux:
# lspci | grep -i raid
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
You should get to know /proc/mdstat and look at it often. It tells you the state of your arrays and, most importantly, whether any drives have failed and whether any arrays are degraded. Check, and check regularly!
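A quick look for degraded arrays can be scripted so you don't have to read the whole file each time. The following is only a sketch: degraded_arrays is a made-up helper name, and it assumes the usual mdstat layout, where each array's "mdN :" line is followed by a status line ending in a member map such as [UU] or [U_].

```shell
#!/bin/sh
# degraded_arrays (hypothetical helper): print the name of every md array
# whose member map, e.g. [U_], contains "_" (a missing or failed member).
# Reads /proc/mdstat unless another file is given as the first argument.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }                 # array header line
        name != "" && /\[[U_]*_[U_]*\]/ {          # member map with a gap
            print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument and getting no output means no array is reporting a missing member. It says nothing about drives that are failing but still in the array, so it is no substitute for reading mdstat yourself.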
xosview is a venerable utility, and one of the author’s favourites. It can display the state of raid arrays, but unfortunately that code is currently broken: it reads mdstat, and doesn’t understand the current output format. As of 2016 it is being updated to read the status directly from /sys, and should hopefully soon be able to display raid status correctly. The author leaves xosview running permanently on his desktop to provide an overview of system performance.
mdadm --monitor --scan --mail firstname.lastname@example.org
This will fire up mdadm to keep an eye on your arrays, sending an email to the specified address if it detects any problems related to a disk failure. Add --daemonise if you want it to detach and run in the background. This is good for remote monitoring, BUT it won’t tell you if anything goes wrong with the monitoring itself! Even if you put this in your boot-up sequence, as you should, you cannot assume that you will be notified about important events. It’s not unknown for the daemon to fail.
Don’t rely on this! Check regularly on a manual basis!
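One thing a manual check can include is whether the monitor is still alive. This is a rough sketch, not a complete watchdog: monitor_running is a made-up helper name, and it simply looks for an mdadm process in monitor mode in a process listing fed to it on stdin.

```shell
#!/bin/sh
# monitor_running (hypothetical helper): succeed if a process listing read
# from stdin contains an mdadm process started in monitor mode
# (--monitor, --follow and -F are the same mode in mdadm).
monitor_running() {
    grep -Eq '(^|/)mdadm .*(--monitor|--follow|-F)'
}

# Example:
#   ps -eo args | monitor_running || echo "WARNING: mdadm monitor is not running"
```

A running process does not prove that mail actually gets delivered. mdadm's monitor mode also accepts --oneshot and --test, so something like "mdadm --monitor --scan --oneshot --test" should generate a test alert for each array it finds, which lets you check the mail path end to end.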
smartctl tells you all sorts of information about your drives. When you read the “When things go wrogn” section, you will see that it is a very important diagnostic tool, but it also provides a lot of proactive information to help you anticipate a drive failure.
Several S.M.A.R.T. attributes are worth watching because they provide clues:
Attribute | Description |
SMART 5 | Reallocated Sectors Count |
SMART 187 | Reported Uncorrectable Errors |
SMART 188 | Command Timeout |
SMART 197 | Current Pending Sector Count |
SMART 198 | Uncorrectable Sector Count |
Backblaze.com (who run huge raid arrays) have a lot of interesting information on their site. They point out that maybe a quarter of the drives that fail do so with all five of these statistics at 0, so a healthy SMART report does not necessarily mean a healthy drive, but almost none of their drives survive having errors on all five counts.
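To keep an eye on those five attributes without reading the full report, the raw values can be filtered out of the attribute table. This is a sketch only: smart_warnings is a made-up name, and it assumes the ATA attribute table that smartctl -A prints, with the attribute ID in the first column and the raw value in the tenth. Vendors encode some raw values differently, so treat a hit as a prompt to look closer, not as a verdict.

```shell
#!/bin/sh
# smart_warnings (hypothetical helper): read `smartctl -A` output on stdin and
# print ID, name and raw value for any of the five attributes that is non-zero.
smart_warnings() {
    awk '$1 ~ /^(5|187|188|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Example (needs root; /dev/sda is a placeholder device):
#   smartctl -A /dev/sda | smart_warnings
```

No output means all five raw values are zero, which, as noted above, is reassuring but no guarantee.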
smartctl also reports on things like drive temperature, how long the drive has been powered on, and how many times it has been started and shut down. It’s no surprise that drives that get too hot or are otherwise stressed beyond normal limits tend to fail early.
Smartmontools for RAID