Emergency OS patch on file server

Per an open ticket with Sun, we need to apply two patches to one of our file servers. The goal is to fix a degraded zpool that will not finish its parity rebuild. There are exactly 78 users in the frisky cluster on this file server. If you are one of them, your email and web services will be offline while we patch. The patch should only require about 15 minutes of downtime. I apologize for doing this patch during peak hours, but we really need to get this data back up to full integrity.

Tech nerd details: The RAID array is a raidz2 operating with one failed disk. It has been rebuilding onto a hot spare for a day or two, and the rebuild reset itself after getting to 99%. We contacted Sun and, after analyzing the troubleshooting information, believe a kernel + ZFS patch should resolve the problem. Fortunately this is a raidz2, so it can sustain a disk failure and still be fault tolerant.
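
For anyone who wants the fault-tolerance arithmetic spelled out, here is a minimal sketch in Python. The function name and the lookup table are illustrative assumptions, not anything taken from ZFS itself; the sketch only encodes the rule that a raidz1 vdev survives one failed disk and a raidz2 vdev survives two.

```python
# Illustrative model only: number of failed disks each raidz level
# can absorb per vdev without losing data.
PARITY_DISKS = {"raidz1": 1, "raidz2": 2}


def remaining_tolerance(level: str, failed_disks: int) -> int:
    """Return how many more disk failures the vdev can absorb.

    A negative result means more disks have failed than parity can cover.
    """
    return PARITY_DISKS[level] - failed_disks


# Our situation: a raidz2 with one failed disk.
print(remaining_tolerance("raidz2", 1))
```

With one disk already gone, the raidz2 can still lose one more disk, whereas a raidz1 in the same state would have no margin left.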

Update: Well, one of the two patches was installed. It seems SunSolve is rejecting our service contract when we try to download a specific patch in the dependency tree. We’ve updated our case with Sun. The system is back online and serving files!


Severity: Low   Resolved: Yes

15 Responses to “Emergency OS patch on file server”

  1. 1
    Nyhm Says:

    This is another case where DreamHost could use two metrics (also see previous status issue):

    High Severity (those affected are 100% down)
    Low Impact (small group of users affected)

  2. 2
    hawk82 Says:

    Thanks for the tech details. Always interesting to me what tech/troubleshooting goes on behind the scenes in DH land.

  3. 3
    Craig Says:

    Two metrics would be pointless. If they post it on here, it is bound to be severe to some group of people; otherwise it wouldn’t be worth bothering. The severity is a good enough indication for people to make a common-sense decision about how it will affect them.

  4. 4
    fluffy Says:

    Okay, this resolution is a bit funny. So you started doing a bunch of patching on a server without having the patches downloaded, and then had to abort when halfway through you found out you couldn’t get the patches? Wonderful. It would have also been nice to have had some warning in advance for a site outage.

    Also, this may have had limited impact on the customer base as a whole, but so do most Dreamhost issues - but for the 78 of us who were affected, I’d say the severity was high. “Low” severity is more appropriate for things like “we’re lagging a bit” or whatever.

  5. 5
    Nyhm Says:

    Craig, I understand there are plenty of ways to consider these things. Since this is a customer information blog, I think it would be more clear if Severity indicated how serious the problem is to my hosting. Low severity should not give me much cause for concern. Contrast with this alert, which is Low Severity, but my site will be gone from the world for a while (albeit _hopefully_ only 15 minutes).

    I’d define Severity as:
    Low Severity = Internal upgrade, maybe slowdown, no downtime
    Med Severity = Possible intermittent/brief downtime for some services
    High Severity = Definite downtime for some users

    Then, the Impact metric tells customers whether this is an isolated incident or wide-spread. This is essentially informational; you must read the post to see if it pertains to you.

  6. 6
    Eric Says:

    I like that idea.

  7. 7
    Nyhm Says:

    Thanks, Eric. This ranking is similar to software bug severity, with which I have much experience. That is, if there are no High Severity bugs, I know my software is not “broken,” and there is no need to alert any users of a possible catastrophic failure. Med Severity bugs indicate something is wrong, which the users may need to know about. Low Severity means the issue does not degrade any features, which usually implies an enhancement, which users do not need to worry about.

  8. 8
    Kyle J Says:

    Ah, gotta love Sun’s archaic patch download system. Almost as good as trying to get ahold of Cisco software. Good luck guys.

  9. 9
    Eric Says:

    @Nyhm,

    I hope that they do implement this. I think what they’re doing now is averaging the two; although this is faulty because you can hardly call something that affects all users very, very slightly medium severity; and if somebody’s site is gone, it’s more than low severity to them.

  10. 10
    setsuna-xero Says:

    wait wait wait…
    you broke a sun server?
    you touched it didn’t you, you dirty monkies :P

    that said, I’m quite surprised to hear of a zpool breaking, I’ve never seen that, even on 10 year old sun hardware with poor reliability…

  11. 11
    Sharon McCormick Says:

    This is good info!

  12. 12
    Kelly@Dreamhost Says:

    Re: Patching, yes I misread one of the patch dependencies, which turned into a chain of them that I was missing. Fortunately networking works in single user mode, so I could scp them in from another server! Then SunSolve decided it didn’t want to play with me anymore.

    Re: setsuna-xero: You have 10 year old zpools? ;) It is strange. There are certainly kinks to work out, but overall it’s pretty slick!

    I’m sorry for touching the Solaris server. It obviously knows better than I do about what should be done. :(

  13. 13
    AJ Says:

    I have heard of a few instances when the zpools have gone screwy - a couple of times it was bad enough to require recovery from backups.

    I know some places think they can ignore backups if they use zpools - but that’s assuming everything works and nothing breaks.

    I hope there are some backups here instead of just depending on zpool to not go on a rampage and eat up a bunch of files …… ;)

  14. 14
    setsuna-xero Says:

    @Kelly
    no the hardware was 10 years old
    the zpool was only a few months old

    but seriously the only time I’ve seen a Solaris Server go down was someone playing with it, otherwise they just keep chugging along for years and years, other than catastrophic hardware failures such as our SCSI box emitting some nice blue smoke of death…

    I will agree that ZFS needs some more work, but it’s still a baby really in relation to the other volume managers/FSes

    I also don’t get why many places don’t do proper backups, disk space is pretty cheap now, unless you’re looking at a few dozen terabytes, but by that point your accountants should realize how much money will be lost WHEN you need those backups that don’t exist…

  15. 15
    ems Says:

    my email is still down

    people are still not receiving emails from me using simple php mail
