Emergency OS patch on file server
Per an open ticket with Sun, we need to apply two patches to one of our file servers. We hope this will fix a degraded zpool that will not finish a parity rebuild. There are exactly 78 users in the frisky cluster on this file server. Patching will take your email and web services offline for the time being. The patch should require only about 15 minutes of downtime. I apologize for doing this patch during peak hours, but we really need to get this data back up to full integrity.
Tech nerd details: The RAID array is a raidz2 operating with one failed disk. It has been rebuilding onto a hot spare for a day or two, and the rebuild reset itself after getting to 99%. We contacted Sun and, after analyzing the troubleshooting information, believe a kernel + ZFS patch should resolve the problem. Fortunately, this is a raidz2, so it can sustain a disk failure and still be fault tolerant.
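For the curious, the fault-tolerance arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and table names are made up for this post and are not part of any ZFS tooling. It relies on the fact that a raidz1 vdev carries one disk's worth of parity and a raidz2 vdev carries two.

```python
# Illustrative only: these names are hypothetical, not a ZFS API.
# Number of parity disks per raidz level, i.e. how many disk
# failures a single vdev can absorb without losing data.
PARITY_DISKS = {"raidz1": 1, "raidz2": 2}


def remaining_tolerance(level: str, failed_disks: int) -> int:
    """Return how many more disk failures the vdev can survive.

    Zero means the vdev is still readable but has no redundancy left;
    a negative value means data has been lost.
    """
    return PARITY_DISKS[level] - failed_disks


# Our situation: a raidz2 with one failed disk.
print(remaining_tolerance("raidz2", 1))  # 1
```

With one disk down, a raidz2 can still lose one more disk, whereas a raidz1 in the same state would have no redundancy left.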
Update: Well, one of the two patches was installed. It seems SunSolve is rejecting our service contract when we try to download a specific patch in the dependency tree. We’ve updated our case with Sun. The system is back online and serving files!
May 9th, 2008 at 8:51 am
This is another case where DreamHost could use two metrics (also see previous status issue):
High Severity (those affected are 100% down)
Low Impact (small group of users affected)
May 9th, 2008 at 9:02 am
Thanks for the tech details. Always interesting to me what tech/troubleshooting goes on behind the scenes in DH land.
May 9th, 2008 at 9:17 am
Two metrics would be pointless. If they post it on here, it is bound to be severe to some group of people; otherwise it wouldn’t be worth bothering. The severity is a good enough indication for people to make a common-sense decision about how it will affect them.
May 9th, 2008 at 10:11 am
Okay, this resolution is a bit funny. So you started doing a bunch of patching on a server without having the patches downloaded, and then had to abort when halfway through you found out you couldn’t get the patches? Wonderful. It would have also been nice to have had some warning in advance for a site outage.
Also, this may have had limited impact on the customer base as a whole, as do most DreamHost issues, but for the 78 of us who were affected, I’d say the severity was high. “Low” severity is more appropriate for things like “we’re lagging a bit” or whatever.
May 9th, 2008 at 10:15 am
Craig, I understand there are plenty of ways to consider these things. Since this is a customer information blog, I think it would be clearer if Severity indicated how serious the problem is to my hosting. Low severity should not give me much cause for concern. Contrast that with this alert, which is Low Severity, yet my site will be gone from the world for a while (albeit _hopefully_ only 15 minutes).
I’d define Severity as:
Low Severity = Internal upgrade, maybe slowdown, no downtime
Med Severity = Possible intermittent/brief downtime for some services
High Severity = Definite downtime for some users
Then, the Impact metric tells customers whether this is an isolated incident or wide-spread. This is essentially informational; you must read the post to see if it pertains to you.
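To make the two-metric idea concrete, here is a rough Python sketch. Everything in it (the class names, the `label` helper, the two Impact levels) is hypothetical and only illustrates the proposal; it says nothing about how the status blog is actually built.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """How serious the problem is for an affected customer."""
    LOW = "Internal upgrade, maybe slowdown, no downtime"
    MED = "Possible intermittent/brief downtime for some services"
    HIGH = "Definite downtime for some users"


class Impact(Enum):
    """How many customers are affected (assumed two-level scale)."""
    LOW = "Small group of users affected"
    HIGH = "Wide-spread"


@dataclass
class StatusPost:
    title: str
    severity: Severity
    impact: Impact

    def label(self) -> str:
        # Report both metrics side by side instead of merging them.
        return (f"{self.severity.name.title()} Severity / "
                f"{self.impact.name.title()} Impact")


# This outage: 78 users fully down, everyone else unaffected.
post = StatusPost("Emergency OS patch on file server",
                  Severity.HIGH, Impact.LOW)
print(post.label())  # High Severity / Low Impact
```

Keeping the two values separate means a reader can tell "a few people are completely down" apart from "everyone is slightly slow" at a glance.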
May 9th, 2008 at 10:20 am
I like that idea.
May 9th, 2008 at 10:47 am
Thanks, Eric. This ranking is similar to software bug severity, with which I have much experience. That is, if there are no High Severity bugs, I know my software is not “broken,” and there is no need to alert any users of a possible catastrophic failure. Med Severity bugs indicate something is wrong, which the users may need to know about. Low Severity means the issue does not degrade any features, which usually implies an enhancement that users do not need to worry about.
May 9th, 2008 at 4:31 pm
Ah, gotta love Sun’s archaic patch download system. Almost as good as trying to get ahold of Cisco software. Good luck guys.
May 9th, 2008 at 6:08 pm
@Nyhm,
I hope that they do implement this. I think what they’re doing now is averaging the two, although this is faulty: you can hardly call something that affects all users very, very slightly medium severity, and if somebody’s site is gone, it’s more than low severity to them.
May 10th, 2008 at 12:42 am
wait wait wait…
you broke a sun server?
you touched it didn’t you, you dirty monkeys
that said, I’m quite surprised to hear of a zpool breaking, I’ve never seen that, even on 10 year old sun hardware with poor reliability…
May 10th, 2008 at 7:24 am
This is good info!
May 10th, 2008 at 9:52 am
Re: Patching, yes, I misread one of the patch dependencies, which turned into a chain of them that I was missing. Fortunately, networking works in single-user mode, so I could scp them in from another server! Then SunSolve decided it didn’t want to play with me anymore.
Re: setsuna-xero: You have 10 year old zpools?
It is strange. There are certainly kinks to work out, but overall it’s pretty slick!
I’m sorry for touching the Solaris server. It obviously knows better than I do about what should be done.
May 10th, 2008 at 10:55 am
I have heard of a few instances when the zpools have gone screwy - a couple of times it was bad enough to require recovery from backups.
I know some places think they can ignore backups if they use zpools - but that’s assuming everything works and nothing breaks.
I hope there are some backups here instead of just depending on zpool to not go on a rampage and eat up a bunch of files ……
May 10th, 2008 at 5:42 pm
@Kelly
no the hardware was 10 years old
the zpool was only a few months old
but seriously the only time I’ve seen a Solaris Server go down was someone playing with it, otherwise they just keep chugging along for years and years, other than catastrophic hardware failures such as our SCSI box emitting some nice blue smoke of death…
I will agree that ZFS needs some more work, but it’s still a baby really in relation to the other volume managers/FSes
I also don’t get why many places don’t do proper backups, disk space is pretty cheap now, unless you’re looking at a few dozen terabytes, but by that point your accountants should realize how much money will be lost WHEN you need those backups that don’t exist…
May 10th, 2008 at 6:48 pm
my email is still down
people are still not receiving emails from me using simple php mail
June 13th, 2008 at 12:26 pm
UNABLE TO CONNECT TO WEBSITE WITH DREAMWEAVER FOR UPDATES AND SITE ADMINISTRATION.
HAS USERNAME OR PASSWORD CHANGED?
IT APPEARS TO BE CONNECTING, AND THEN IT GETS TO THE END AND SAYS UNABLE TO CONNECT TO SERVER, AND FTP UNAVAILABLE.
PLEASE HELP.
THANK YOU.