Recovering a Failed LSI or Intel RAID Array
This article will show you how to recover a failed RAID 5 array on an Intel or LSI hardware RAID controller. The same or similar process can also be used with other RAID arrays such as RAID 1 or RAID 6.
It is important to understand first how drive configuration information is stored, and what a "foreign configuration" is, as you may see messages that refer to one.
What is a Foreign Configuration?
Intel or LSI Hardware RAID controllers store RAID volume configuration information on both the drives and the controller. When the system is booted or a drive is inserted, the drives and configurations are examined and any configurations that don't match each other are flagged as a foreign configuration. The RAID controller will use its own configuration as the master record, to identify which configurations are invalid.
Each physical drive's configuration information contains a list of the drives involved in the associated logical drive.
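The comparison described above can be sketched in Python. This is a simplified model for illustration only, not the controller firmware's actual logic: the `ArrayConfig` record, its fields and the `find_foreign_drives` function are all hypothetical names, and a real controller stores considerably more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayConfig:
    """Hypothetical stand-in for the record kept on a drive or on the controller."""
    logical_drive: str
    member_slots: frozenset   # slots of the drives that make up the logical drive
    missing_slots: frozenset  # slots that the record lists as missing or failed


def find_foreign_drives(controller_config, drive_configs):
    """Return the slots whose on-drive record differs from the controller's record.

    The controller's own copy is treated as the master record.
    """
    return sorted(
        slot for slot, config in drive_configs.items()
        if config != controller_config
    )


# The controller and drives 0 and 1 have recorded that drive 2 dropped out.
current = ArrayConfig("LD0", frozenset({0, 1, 2}), frozenset({2}))
# Drive 2 was absent when that update was written, so its copy is out of date.
stale = ArrayConfig("LD0", frozenset({0, 1, 2}), frozenset())

foreign = find_foreign_drives(current, {0: current, 1: current, 2: stale})
print(foreign)  # [2]
```

In this model a drive is flagged purely because its stored record no longer matches the master record, which is why a drive that disappears and later returns shows up as foreign.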
What's the difference between Optimal, Degraded, Critical and Offline?
RAID controllers use these terms to show the health of the logical drive.
- Optimal: Everything should be normal; all drives are present.
- Degraded: A failure has occurred; likely a drive is missing. However, the RAID array may be able to tolerate further failures. The bad drive should still be replaced as soon as possible.
- Critical: A failure has occurred and the RAID array health is critical. Any further failures will lead to the loss of data.
- Offline: The virtual drive has failed and the data will be inaccessible. Recovery can be attempted but is not guaranteed.
For example, RAID 6 arrays can lose up to two drives. The loss of one drive will result in the logical drive being reported as degraded, not critical.
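As a rough illustration, the states listed above can be expressed as a function of how many drives have failed and how many failures the RAID level can tolerate. This is a minimal sketch that follows the definitions in this article; the function name and table are hypothetical, RAID 1 is assumed to be a two-drive mirror, and a real controller may report states using slightly different wording.

```python
# Number of drive failures each RAID level can survive.
# RAID 1 is assumed here to be a simple two-drive mirror.
TOLERATED_FAILURES = {"RAID 1": 1, "RAID 5": 1, "RAID 6": 2}


def logical_drive_state(raid_level, failed_drives):
    """Map a failed-drive count to the health terms used in this article."""
    tolerated = TOLERATED_FAILURES[raid_level]
    if failed_drives == 0:
        return "Optimal"
    if failed_drives > tolerated:
        return "Offline"    # more failures than the array can survive
    if failed_drives == tolerated:
        return "Critical"   # any further failure will lose data
    return "Degraded"       # the array can still tolerate another failure


for failed in range(4):
    print("RAID 6,", failed, "failed:", logical_drive_state("RAID 6", failed))
```

Under these assumptions a RAID 5 array goes straight from Optimal to Critical on its first drive failure, because it has no remaining redundancy after one drive is lost.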
Scenario 1 - The system is booted but not all of the drives are plugged in, possibly due to a loose cable.
Messages such as the one above require prompt action:
- Switch off the server and check all the cables and connections, especially the seating of any multi-lane SFF-8087 cables. Reseat all drives and re-test.
- Do not accept the message above or load the configuration utility: doing so will write a configuration update to the remaining drives detailing the drives which are missing.
Scenario 2 - A drive has failed, but now a message about a Foreign Configuration has appeared
Hard drives which encounter problems such as bad blocks or a controller fault may stop being detected. The drive may then become "available" again, perhaps after the drive has reallocated the bad blocks. However, when the drive failed, the RAID card will have written a configuration update to itself and the remaining drives, recording that the drive has failed. The configuration on the failed drive is now out of date and does not match the other drives, so it is flagged as having a foreign configuration.
- If the RAID array is degraded or critical, it is usually safer to attempt to rebuild onto a fresh drive, rather than trying to either re-use a problem drive (i.e. rebuilding onto it), or worse, trying to force the previously failed drive online. This is because any disk writes that have occurred to the RAID pack - even just the normal day-to-day running of an idle Windows system - will mean that the drive that failed and has now re-appeared contains out-of-date data. Attempting to put this drive into use without rebuilding will result in serious system corruption.
- If the RAID array is marked as failed, always try to import the foreign configuration, making sure that the drive that failed last is the drive that you try to re-use. Drives that failed earlier (as perhaps you could have in a RAID 5 or RAID 6 situation) will not only have an out-of-date or foreign configuration, but more importantly the data on the drive will likely be out of date if an operating system has been booted or running since the failure.
Scenario 3 covers importing a foreign configuration.
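The rule of re-using the drive that failed last can be sketched as follows. This is only an illustration: the `pick_drive_to_reuse` function is hypothetical, and the slot numbers and timestamps are made-up example values that you would replace with the failure times shown in your own controller event log.

```python
from datetime import datetime


def pick_drive_to_reuse(failure_times):
    """Return the slot of the drive that failed most recently.

    failure_times maps a drive slot number to the time that drive failed.
    The most recent failure has seen the fewest writes to the array since
    it dropped out, so its data is the least out of date.
    """
    if not failure_times:
        raise ValueError("no failed drives recorded")
    return max(failure_times, key=failure_times.get)


# Made-up example values: slot 3 failed first, slot 5 failed half an hour later.
failure_times = {
    3: datetime(2024, 1, 10, 2, 15),
    5: datetime(2024, 1, 10, 2, 47),
}

candidate = pick_drive_to_reuse(failure_times)
print(candidate)  # 5
```

The drive in slot 3 should be treated as stale and rebuilt onto (or replaced) once the array is back in a degraded state.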
Scenario 3 - The drive was disconnected accidentally, my array is offline and I need to recover it.
Disconnected drives will either be marked as Foreign, or Unconfigured Bad. Unconfigured Bad drives cannot have the foreign configuration imported until you mark the drive as "Unconfigured Good".
Foreign drives detected during RAID card boot may be flagged as below.
In this instance, either press F, or better still, press CTRL+G to enter the RAID BIOS and manage the foreign configuration process.
Some older RAID cards may allow you to Import a foreign configuration without Previewing it. However, we recommend you always preview the configuration to be imported and check that all logical drives - including those that have no problem - are correctly shown.
If you have different drives with different foreign configurations then you may have multiple choices to try and import. Try "All Configurations" first. If this does not work, try each configuration in turn, making sure you preview it first. It is likely that only the configuration from the last drive that failed will be importable.
Tip: If you can still boot your operating system such as Windows - for example, you are managing a problem with your data logical drive, but your operating system logical drive is still functioning - you may find it easier to boot Windows and use the Intel RAID Web Console to manage the array recovery. Right-click on the RAID controller from inside RAID Web Console and select "Scan Foreign Configuration" to start the import process.
Scenario 4 - I'm Running RAID 5 and my system crashed. I appear to have had two hard drives fail.
This can happen especially if you have a hot-spare drive. When the first drive fails, the RAID card will start rebuilding onto the spare. The unusual workload may cause another weak drive to then experience a failure.
Tip: Using RAID 6 instead of a RAID 5 plus hot-spare configuration prevents you from being in this situation. RAID 6 will still require a logical drive rebuild but RAID 6 can withstand up to two drive failures without going offline.
In this scenario you may or may not get messages about a foreign configuration. If you do not, or if the foreign configuration does not import, follow the steps below:
- Enter the RAID BIOS console, and check for the presence of any drives marked as unconfigured bad.
- Click on the drive, and then choose the option to "Mark as Unconfigured Good", then click on Go.
- The RAID card likely will NOT prompt to import the foreign configuration at this time. You will need to reboot and then go back into the RAID BIOS to follow the foreign configuration import process.
Note: In some situations when you have multiple drive problems you will not be able to go into the RAID BIOS when there is a foreign configuration present, even if it will not import. Do not clear the foreign configuration, as you should always attempt the least destructive operations first. In this situation, determine which are the problem drives by the orange LEDs on the front of the server. Turn off the server, and then remove one drive. Boot the server back up and attempt the import process, or mark the drive as Unconfigured Good and then reboot, and attempt the import process again. Only the last drive to have failed will likely import, but it does mean that this last drive needs to be healthy enough to get the RAID array back to a degraded or non-critical state.
In All Situations
- Always ensure that the RAID array is rebuilt when necessary. You can check the progress from the RAID BIOS or from the Intel RAID Web Console in Windows.
- Never disable the audible alarm. Silence it instead, even if you need to do so multiple times.
- Always contact Stone support if a drive is marked as failed, or as having bad blocks/media errors. Drives in this situation should be replaced.
- When dealing with RAID problems, check the Media Error count or Predictive failure count to see if you are managing multiple problems.
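The counter check in the last point can be sketched as a small helper. This is a hypothetical illustration: the function name, the dictionary keys and the sample values are all assumptions, and the counts themselves would be read by hand from the RAID BIOS or RAID Web Console for each drive.

```python
def drives_with_errors(counters):
    """Return the slots of drives reporting any media or predictive-failure errors.

    counters maps a drive slot number to a dict with the (hypothetical) keys
    "media_errors" and "predictive_failures". Missing keys are treated as zero.
    """
    return sorted(
        slot for slot, counts in counters.items()
        if counts.get("media_errors", 0) > 0
        or counts.get("predictive_failures", 0) > 0
    )


# Made-up example values for a four-drive array.
sample = {
    0: {"media_errors": 0, "predictive_failures": 0},
    1: {"media_errors": 12, "predictive_failures": 0},
    2: {"media_errors": 0, "predictive_failures": 0},
    3: {"media_errors": 0, "predictive_failures": 1},
}

suspect = drives_with_errors(sample)
print(suspect)  # [1, 3]
```

More than one slot in the result suggests you are managing multiple problems at once, so plan the order of any rebuilds with that in mind.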
What Can I do to Minimise the Chances of Unexpected Failures?
- Always have the correct version of RAID Web Console installed if you are running Windows, or use a Remote RAID Console plus the Management component if running VMware ESXi.
- Set up RAID Web Console for email alerting of problems. This may alert you to minor problems or growing issues before a drive fails completely, for example.
- Take advantage of the option for a scheduled consistency check available in RAID Web Console for Windows. Right-click on the controller and then select "Schedule Consistency Check".
- To ensure your system can manage drive problems more reliably, always use a recent RAID driver, RAID firmware and RAID Web Console software when commissioning your system. Contact Stone support for further assistance.
Applies to: Server and Workstation products with LSI or Intel Hardware RAID Controllers up to and including the 2.5 Generation 6G SAS Adapters (excludes the third generation adapters).