An update on the recent stability issues
Posted (July 19th, 2006 at 2:55 pm PST) by danwy
We are still offloading data from the filer that caused the problems over the last couple of days; we began shortly after it started failing. There is a LOT of data on that filer, and it is going to take a while to move it all off. The filer keeps crashing, so somebody is here 24 hours a day to restart it, swap out drives, and resume offloading where it left off. Through this process, we’ve learned a lot of new tricks for recognizing these types of filer problems before they happen, so we shouldn’t experience problems of this magnitude in the future.
Unfortunately, this hasn’t been the only problem. The filer crash started a chain reaction that heavily affected other parts of our network, saturating their network interfaces (which is the cause of the slowness/outages some of you are still seeing). We have worked around some of these issues by splitting up switches so that less traffic is pushed through each one. We’re also auditing our entire network to ensure that NFS traffic won’t be affected like this again in the case of a hardware outage. Quite a few things have contributed to the problems of the past few days, and our admins are busy getting all of them ironed out. We still have a lot of work ahead of us, but we are making a lot of progress, and once things are back to normal they should be a lot more stable than they were before. Thank you for your patience while we work through this.

