Public Announcements ~ s134 Outage
Posted: Mon Mar 17, 2008 4:54 pm
There is currently an outage on s134, and we are working to resolve it as quickly as possible.
The server initially crashed, and on reboot it hit a kernel panic. The server is currently undergoing an fsck (a forced filesystem check).
Please accept our sincere apologies for this inconvenience, and be assured we are working as quickly as possible to resolve this!
Kind Regards,
Aaron Weller
Crucial Paradigm

Posted: Mon Mar 17, 2008 5:35 pm
The fsck is currently 76% complete.

Posted: Mon Mar 17, 2008 6:23 pm
The fsck reported errors and must be completed manually. We are proceeding with this now.

Posted: Mon Mar 17, 2008 10:35 pm
As there appears to be corruption in the boot partition, we are running a verify on the RAID array. This may take a little while.

Posted: Mon Mar 17, 2008 10:54 pm
The RAID verify is now at 35%. Once it has completed, we will need to do another manual fsck.
If you have any questions or concerns, please do not hesitate to contact us.

Posted: Mon Mar 17, 2008 11:47 pm
The verify run on the RAID array has completed with no errors. We will now continue with another manual fsck on the server.

Posted: Tue Mar 18, 2008 6:08 am
The manual fsck on the server completed successfully.

Posted: Tue Mar 18, 2008 6:09 am
Trying to reboot with a kernel now.

Posted: Tue Mar 18, 2008 8:04 am
We tried to reboot the server, but it went down at the login prompt. We then checked the server for memory errors and decided to replace the RAM.

Posted: Tue Mar 18, 2008 11:00 am
The RAM is currently being replaced, and the server should be back online shortly.

Posted: Tue Mar 18, 2008 4:19 pm
The server has been back online for a few hours now. We will write up a complete report regarding this shortly!

Posted: Thu Mar 20, 2008 10:55 pm
Dear Valued Customers,
s134 is now back up and running. Please accept our sincere apologies for this extended outage, and note that we will be providing all customers affected by this outage with a 100% refund for this month upon request. To receive the refund, please submit a ticket to billing requesting an SLA refund. Please be sure to include your name, your domain name, and a note that the outage relates to s134.
Run Down of Events:
Initially the server crashed, and upon rebooting it required an fsck (file system check). We ran the fsck, however it kept restarting and not completing. As it looked as though there was some data corruption on the server, we kept trying to get the fsck to complete. We attempted this several times without luck, and then ran a test (verify) on the RAID array to check whether any portions of the array were corrupted. The results showed the array running fine.
Even so, the symptoms still pointed towards data corruption on the array. As a last attempt, we ran another fsck to try to get the server online. After this was complete, we thought we would test the RAM on the server "just in case", and the test immediately showed faulty RAM.
At this stage we replaced the RAM, and the server was brought back online.
What we are doing in future:
In the aftermath of this outage, it was clear that we could have resolved this issue much more quickly. Unfortunately, RAM issues are sometimes hard to diagnose and can appear to be another fault, and a RAM test also takes some time to complete (4 hours). In future, if such symptoms present themselves, we will run a RAM test or immediately replace the RAM in the server with new RAM to prevent any further downtime.
Notes:
We always use server-grade hardware in our servers, and this server was running every precaution possible, even running RAID 10 for the greatest speed and data protection. However, as the issue at hand was not detected early on, it took a lot longer than it should have to resolve.
Please do not hesitate to contact us if you have any questions or concerns.
Kind Regards,
Aaron Weller
Crucial Paradigm