Public Announcements ~ s134 Outage
Aaron
Posted: Mon Mar 17, 2008 4:54 pm



Joined: 05 Feb 2004
Posts: 474

There is currently an outage on s134; we are working to resolve it as quickly as possible.

The server initially crashed, and on reboot it hit a kernel panic. The server is currently undergoing an fsck (a forced filesystem check).

Please accept our sincere apologies for this inconvenience, and be assured we are working as quickly as possible to resolve this!

Kind Regards,
Aaron Weller
Crucial Paradigm
Aaron
Posted: Mon Mar 17, 2008 5:35 pm



Joined: 05 Feb 2004
Posts: 474

The fsck is currently 76% complete.
Aaron
Posted: Mon Mar 17, 2008 6:23 pm



Joined: 05 Feb 2004
Posts: 474

The fsck reported errors and must be completed manually; we are proceeding with this now.
Aaron
Posted: Mon Mar 17, 2008 10:35 pm



Joined: 05 Feb 2004
Posts: 474

As there appears to be corruption in the boot partition, we are running a verify on the RAID array. This may take a little while.
Aaron
Posted: Mon Mar 17, 2008 10:54 pm



Joined: 05 Feb 2004
Posts: 474

The RAID verify is now at 35%. Once it has completed, we will need to do another manual fsck.

If you have any questions or concerns, please do not hesitate to contact us.
Aaron
Posted: Mon Mar 17, 2008 11:47 pm



Joined: 05 Feb 2004
Posts: 474

The verify run on the RAID array has completed with no errors; we will continue with another manual fsck on the server.
CPTech
Posted: Tue Mar 18, 2008 6:08 am



Joined: 05 Jun 2007
Posts: 3

The manual fsck on the server completed successfully.
CPTech
Posted: Tue Mar 18, 2008 6:09 am



Joined: 05 Jun 2007
Posts: 3

Trying to reboot with a kernel now.
CPTech
Posted: Tue Mar 18, 2008 8:04 am



Joined: 05 Jun 2007
Posts: 3

We tried to reboot the server, but it went down at the login prompt. We then checked the server for memory errors and decided to replace the RAM.
Aaron
Posted: Tue Mar 18, 2008 11:00 am



Joined: 05 Feb 2004
Posts: 474

The RAM is currently being replaced, and the server should be back online shortly.
Aaron
Posted: Tue Mar 18, 2008 4:19 pm



Joined: 05 Feb 2004
Posts: 474

The server has been back online for a few hours now. We will write up a complete report on this shortly!
Aaron
Posted: Thu Mar 20, 2008 10:55 pm



Joined: 05 Feb 2004
Posts: 474

Dear Valued Customers,

s134 is now back up and running. Please accept our sincere apologies for this extended outage. We will be providing all customers affected by this outage with a 100% refund for this month upon request. To receive the refund, please submit a ticket to billing requesting an SLA refund. Please be sure to include your name, your domain name, and a note that the outage relates to s134.


Run Down of Events:

Initially the server crashed, and upon rebooting it required an fsck (filesystem check). We ran the fsck, however it kept restarting and not completing. As it looked as though there was some data corruption on the server, we kept trying to get the fsck to complete. We attempted this several times without luck, and then ran a test (verify) on the RAID array to check whether any portions of the array were corrupted. The verify reported no errors on the array.

The other symptoms still pointed towards data corruption on the array. As a last attempt, we ran another fsck to try to get the server online. After this was complete, we thought we would test the RAM on the server "just in case", and the test immediately showed faulty RAM.

At this stage we replaced the RAM, and the server was brought back online.


What we are doing in future:

In the aftermath of this outage, it was clear that we could have resolved this issue much more quickly. Unfortunately, RAM issues are sometimes hard to diagnose and can appear to be another fault, and a RAM test also takes some time to complete (4 hours). In future, if such symptoms present themselves, we will make sure we either run a RAM test or replace the RAM in the server with new RAM immediately, to prevent any further downtime.


Notes:

We always use server-grade hardware in our servers, and this server had every possible precaution in place, including RAID 10 for the greatest speed and data protection. However, as the underlying issue was not detected earlier on, the outage took a lot longer than it should have to resolve.


Please do not hesitate to contact us if you have any questions or concerns.

Kind Regards,
Aaron Weller
Crucial Paradigm
All times are GMT + 10 Hours