Public Announcements ~ s271 Hard Drive Failure
Posted: Tue Apr 24, 2007 6:31 pm
s271 sustained a hard drive failure yesterday. We have been working around the clock to resolve the issue; however, we have not been able to restore the failed drive or the RAID array it was part of.
To avoid any further downtime, we are restoring all accounts from our daily backups onto a new working server. There will be no IP changes, so there is no need to update DNS.
We are still investigating the exact cause of the failure. At this stage we can only conclude that one of the drives in the RAID array failed and that the data on the second drive in the array was corrupted. We will continue our investigation into this matter and take any necessary measures to ensure this does not happen again.
We sincerely apologize for the inconvenience caused by this recent hardware failure, and we can assure you that we are working as quickly as possible to restore all accounts.
Kind Regards,
Aaron Weller
Crucial Paradigm

Posted: Wed Apr 25, 2007 11:55 am
Status Update: s271 Hard Drive Failure
We have now restored approximately 60% of the accounts and estimate that the rest will be restored within 10 hours.
Please accept our sincere apologies for the downtime. You can be assured that we have been working as quickly as possible to restore the accounts and will continue doing so until all accounts have been restored.
Kind Regards,
Aaron Weller
Crucial Paradigm

Posted: Wed Apr 25, 2007 3:05 pm
Rundown of the exact events of the failure, and the preventative steps being taken to stop a similar failure from happening:
Within MINUTES of detecting the drive failure, we were already working on this issue. Initially one drive failed, but this brought the entire server down. We attempted to repair the RAID 1 array (which is configured to survive a single drive failure and continue operating); however, after about 9 hours it was still not repairing correctly. We then investigated the drive that was not faulty, which should have held an exact copy of the data on the failed drive, but for some reason the data partition on this drive had been corrupted. For the next 3 hours we attempted to recover the corrupted partition on the second RAID drive, but the data appeared to be lost. All of the above was done to try to bring the sites back online as quickly as possible, which is the reason we use RAID on our servers.
Please keep in mind that we keep 7 sets of backups: RAID (instant backup), Daily Local Backup, Weekly Local Backup, Monthly Local Backup, Daily Remote Backup, Weekly Remote Backup, and Monthly Remote Backup. By doing so we are able to ensure your data is near 100% safe, and we can retain your data even in the worst of situations. Depending on the seriousness of the failure, restoring the data can take some time. s271's dual hard drive failure is considered a very serious and unusual failure, hence the time it has taken to restore the accounts. Files have to be restored from a remote backup server (in a different datacenter) and must be copied and restored onto the new server. This is a lengthy process.
We will be taking preventative measures to further improve our backup strategies. The main one will be to implement hardware RAID in place of software RAID. In the past we reviewed both options and found that software RAID would work best for us; however, in light of the recent failure we will start using hardware RAID on all our hosting servers. We will also be including a third drive in all our servers to further improve our local backups and to speed up recovery in the unlikely event of a failure of the RAID array.
Kind Regards,
Aaron Weller
Crucial Paradigm

Posted: Thu Apr 26, 2007 2:53 pm
Update:
All accounts were restored approximately 15 hours ago. If you have any problems with any of your accounts, please be sure to submit a support ticket.
If you wish to request an SLA refund, please submit a ticket to Billing with the period of outage and your account details.
Thanks, everyone, for your patience with this matter.
Kind Regards,
Aaron Weller
Crucial Paradigm

All times are GMT + 10 Hours