Filer problems with blingy cluster.

We are currently having a problem with a filer which has crashed and is recovering at this time. While this is happening some customers in the blingy cluster will experience problems loading their websites/email. We apologize for the outage and service is expected to return to normal as soon as the filer recovers.

UPDATE 3:01 AM PDT

The filer has finished recovering and all services are back up and running. We are working with the filer vendor to find the source of the crash to prevent any further outages.

Update 24/03/08 10am: We’re working on the file server again to alleviate the load that’s causing problems with web, mail and mysql services. Sorry about that.

Update 27/03/08: We are doing emergency data moves to stem the problems recently caused by your file server. During these moves, your data may be inaccessible. We are moving as much as we can off as fast as possible. Very sorry about the continued inconvenience!

Update 27/03/08: This series of moves has finished. We are going to keep an eye on things to see how much it helped and may have to do more moves tonight and tomorrow morning to get everything working smoothly again. This post will be updated with more information as soon as possible.

Update 29/03/08

We are continuing to move data off the problematic file server, but it’s a bit of a catch-22 because customers on that machine are continuing to add data at a very high rate. It filled up this morning for a while, causing device full errors as well as mail problems and issues serving websites (when these fill up it causes problems across the board).

To explain in more detail: when we move data, it does not immediately disappear. A ‘snapshot’ of the old data remains in case there was a problem with the move. That ensures that we do not lose customer data, but until the admin team can check the move to make sure it went through properly we cannot delete the old data. We just did some of that and have some breathing room again, and of course more moves are still in progress, but we are asking customers on this cluster to help us by holding off on any non-essential uploads of data for the next couple of days.

As soon as we have a significant portion of the data removed, the problematic file server will begin to function properly (and additional moves will go much more quickly and smoothly), but right now we’re having trouble moving data more quickly than it’s being added. If everyone could please limit uploads to absolutely essential data until we reach the turning point where everything is working, this will be resolved much more quickly. In other words, if for example you are setting up a repository of large files, you’ll actually be better off waiting a couple of days and getting the all clear from us on this issue, because you’ll be able to access that data reliably instead of cramming it on there now and slowing the recovery process.

In the meantime we’ll be doing everything we can to safely and quickly move data off and get things back to normal.

Added information: Some of the people recently moved to the new file server are seeing errors because the data did not get set up completely (loading the site will work but just show an empty index). The admin team has been running an rsync that will fully restore all data and should hopefully finish by 9 PM PST - once that is finished all site and email data will be available for those users.

Update 30/03/08

We’re still racing to keep ahead of new data being added so any help we can get on that front is greatly appreciated (we’re still asking for customers to limit uploads as much as possible to speed up the recovery process). Some customers who are being moved are seeing blank directories still but those are due to moves in progress and the data will be fully restored when those complete.

Update April 1, 2008

We seem to be ahead of the curve right now: we are moving data off the primary volume and onto a secondary one faster than new data is being uploaded. The volume hasn’t filled up completely in a few days. We are working closely with the technical support team to see how we can speed up the process further. Thank you for your patience.

Update, April 1, 2008

I apologize for the late update but we’ve been going over our options (while moving data of course). While we’re not seeing any real relief in terms of data uploads, we do have some very large moves that are almost complete. Once those finish we can start deleting the data (for example, one is around half a TB, or around 500 GB, which will be 4-5% of the total, but it’s going to take until around Friday to delete it all, so we’re dealing with a ton of data). Tomorrow’s update should be earlier in the day and hopefully we’ll have some progress to report from the large moves being complete.

Update, April 3, 2008

The data moves to other file servers have been running constantly, but last night and this morning some complications came up with the moves, requiring admin attention. To clear up some space there had to be a short interruption in file serving. This is now finished, space is available and the moves are continuing. The admins are fixing up the last of the web servers which were having issues after file serving was restored. Our apologies again for the continued issues.

Update, April 4, 2008

Today has been a pretty good day of progress. We were able to complete even more moves and free up more space on the file server. Moves have been going quicker and stability is dramatically improving. Monitoring of the servers and email in the blingy cluster today has shown a significant decrease in problems. Issues do still exist but the problem is noticeably getting better. We are also pleased to note that we have more storage coming early next week. We believe that this will go a long way in helping us fix this major problem.

Update, April 4, 2008

Things are continuing to improve today - when I got in I was pleased to see that we had held firm and even gained a percent (the affected file server was down to 95%, which is as low as I have seen it in the last week, and since I have been working it has dropped to 94%). Performance should improve as we gain ground (this will speed up moving data off as well). This progress, along with the added storage space we are expecting early next week, should hopefully allow us to restore service for our customers to normal.

5:45 PM PST: The moves we started just a while ago seem to be causing server problems. We’re looking into it and should have it resolved shortly (they were run just like the ones that had completed, so we have to determine why these specifically caused an issue). Update: this resolved itself before we could determine the cause, but we’re monitoring the situation to ensure that it’s not a recurring issue (we have no indication that it will be).

Update, April 8, 2008

Please see our other posting for details on the work we did on the affected file server:

http://www.dreamhoststatus.com/2008/04/06/30-min-blingy-downtime-tonight/

We are also continuing to offload data and are making good progress (it’s never as fast as we would like it to be of course). There’s excellent detail here in case you missed it:

http://blog.dreamhost.com/2008/04/07/another-anatomy/

which chronicles the situation and fills you in pretty much up to today. We’re seeing the data dip to 90% so we’re hoping to have it down in the 80’s by the end of the week (every percent we gain helps and as performance improves we can speed up the rate of moving but we’re still looking to hit critical mass where you get the proper level of performance).

Update, April 9, 2008

As we had hoped progress is speeding up as we free up more space - while the file server is showing 95% usage, around 11% of that is data that has already been moved and is no longer in use. Due to a software issue we haven’t been able to remove it yet (the admin team is working on the best way to execute that), but once that is gone we should be around 85% usage which is another large step forward.
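To make the arithmetic in this update concrete, here is a minimal Python sketch. It is illustrative only: the function name `projected_usage` is hypothetical, and since the figures above do not say whether the 11% is measured against the whole volume or against the space currently in use, the sketch computes both readings. It is not the tooling the admin team uses.

```python
def projected_usage(current_pct, reclaimable, relative=False):
    """Estimate the percent of the volume still used after freeing space.

    relative=False: `reclaimable` is in percentage points of the whole volume.
    relative=True:  `reclaimable` is a percentage of the space currently used.
    Both interpretations are assumptions about how to read "around 11%".
    """
    if relative:
        return current_pct * (1 - reclaimable / 100)
    return current_pct - reclaimable


# Figures quoted in this update: 95% used, around 11% of it reclaimable.
as_points = projected_usage(95, 11)
as_share = projected_usage(95, 11, relative=True)
print(f"points reading: {as_points:.2f}%  share reading: {as_share:.2f}%")
```

Either reading lands within a point of the "around 85%" figure quoted above.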

In terms of effect, I have already seen improvement in site function for many customers as well as greatly increased speed in moving chunks of data off as well as receiving reports that mail is functioning quite a bit better. That said this issue remains at the High severity rating and in unresolved status as we have not reached a normal level of service. I can’t stress enough how sorry I am that our customers have had to put up with this but I thank those of you who have stuck with us (check the newsletter for details on what we’re doing for Blingy customers) and look forward to providing you with the level of service we strive for at DreamHost.

Update, April 9th, 2008 21:59 PDT

Unfortunately, we need to unmount the volume again to kill these snapshots before they leave us with 0 bytes of free space. In 2 hours (midnight) I will be taking the problematic volume offline to delete the phantom snapshots. Total downtime will be between 10 and 30 minutes. Sorry for the short notice and additional outage!

Update, April 10, 2008

Well, the snapshot mentioned yesterday is gone and we’re actually at 83% used today, which is below the level where we were hoping to see marked improvement (85%). Of course we’re still moving data off (which adds load on the file server), so that won’t fully translate into improved service for customers yet, but it should be quite a bit better and keep improving until we stop moving data.

Update, April 14, 2008

Okay, we’re finally getting ready to mark this as resolved. Things have seemed pretty much okay for a while now. But, just to be sure, we’re dropping the severity to Medium for now and leaving it as unresolved.

Update, April 17, 2008

We’re still hearing some reports of site slowness - we were able to resolve an issue causing high loads today which should help but we’re not going to consider this resolved until everyone is receiving good service.


Severity: Medium   Resolved: No

1505 Responses to “Filer problems with blingy cluster.”

  1. 301
    JustJoined Says:

    I joined a few days ago and nothing works.

    No email (barely works — when it does the loads times are ridiculous, e.g. 3min for a 1kb email), website loads only 1/2 the time.

    Go to the support page and submit a ticket, get a bunch of auto-generated replies and no ETA or solution.

    On same support page (which itself was down several times) a notice saying they know their service is broken.

    I can’t even copy my emails off.

    Their chat function leads to a “domain cloacking error” page.

    Their forum link leads to a “Internal Server Error” page.

    How do I get out of this hell?

  2. 302
    Sub1 Says:

    Travis, first stage is to sign up with google apps here: http://www.google.com/a/help/intl/en/admins/editions.html
    After your domains been verified, you can then configure googlemail to be the email server.

    MK, All your email for your domain will go directly to google. When you activate it you can specify various email addresses/catchall etc. I would guess that your current email’s will stay on the dreamhost server, then all new ones will go directly to google.

  3. 303
    Just joined Says:

    We just joined about four weeks ago, and we’re thinking of taking Dreamhost up on their 97-day MB guarantee … Since we paid a year up front, we think a downtime rebate should be in order. Let us all ask for a free month or two, otherwise we’ll split. Voices of the masses.

  4. 304
    Adam Says:

    This is really depressing. My website runs a blog and pictures so my family can stay up to date on our daughter. My wife thinks the website sucks now because its so slow. It keeps going down, and I keep telling her they are working on it, but this is getting ridiculous. 1and1, as bad as they are, never had this many outages in the 2 years I used them. I’ve only been with dreamhost for 2.5 weeks.

  5. 305
    Dave Says:

    Blingy is still having issues?? I’m still having problems. Anyone else??

  6. 306
    Steve Says:

    Yes, I’m still having the problems with the server. Why not just move us to a different one?

  7. 307
    nron Says:

    yes email was up yesterday and is down again

    and who names a server blingy anyways

    Can anyone please request a callback from DH. I am abroad, and they wont call me. we need info!!

  8. 308
    TC Says:

    Still no personal response to any support requests. Still no email for one of my clients since Monday. Still no apparent urgency from DH on this devastating multi-day failure.

    I’ve got a few hosts in mind to move to, and have been using this site to check on uptimes. http://uptime.besthostratings.com/webhosts-uptime.php

    Good luck to everyone sticking with DH. My clients have all demanded that I move them to a new service.

    (the key word is SERVICE)

  9. 309
    bob cobb Says:

    I keep having to reboot my private server, anyone else having this problem? My usage spikes and it wont recover unless I reboot, and sometimes it still wont recover :mad:

  10. 310
    Sean Says:

    Here’s a news flash, how about fixing it at night and letting us get our email during the day, because it sseems your doing the exact opposite of that… New customer, been here less than a month and not liking the service at all so far.

  11. 311
    NkM Says:

    Ummm… did the file server totally die or something? Getting forbidden errors on websites… and if you SSH into your account, your home directory is GONE! I tried cd into and the device/address DOES NOT EXIST.

    Hey Dreamhost… I suggest moving some of your customers to other working clusters and quit overloading this one. I don’t know how many of your customers are on this cluster… but in time I’m sure it’ll sort itself out since many of us will be leaving.

  12. 312
    Loni Says:

    Isn’t the point of a status page to give status updates? I mean, if you aren’t going to e-mail us all and let us know what’s going on, you can at least update a blog page. Or is the point of this just to give us all a place to vent in the hopes of making us feel better? Unfortunately, I don’t feel any better. Well … maybe a little.

  13. 313
    Forbidden Says:

    I’ve a 403 Forbidden on all my web sites.
    I think that my web sites are more ofen down than up.
    Please fix it quickly.

  14. 314
    Kip Says:

    They couldn’t email us, since the mail server is down.

  15. 315
    seb Says:

    @bob You have problem with your private server too. Does private server works well or not? I want to bye one because of the overload on the Blingy cluser. Maybe it’s a bad idea?

  16. 316
    bob cobb Says:

    it has been decent, but when Im paying over $100 a month I think I deserve some better service. having to wait 12 hrs for an answer sucks

  17. 317
    Pat Says:

    I just moved over all my stuff to Verio. It is more expensive, but it is paid by my corp, so it’s a tax write off. I also closed my account with DH. I have a business to run and I have customers. I don’t have time to work around the technical problems at DH and I can’t afford to look like a clown because of DH.

  18. 318
    PF Says:

    I’m working with a group that has the PS as well. We haven’t published our site yet and have the whole domain password blocked, but still have stupid high IO% readings. We aren’t even close to our base CPU or RAM setting, and checking our processes shows that it’s minimal. The owner of the account is unavailable and I’m trying to help him out with technical issues and DH refuses to answer my questions by email when they had previously…

    Does anyone have an idea what exactly is wrong with the cluster? Is it a hardware Issue? I’ve been noticing a lag for a few weeks now, and like I say, there may be two people in total on the site at any given time, but no more.

  19. 319
    changed my host to hostgator Says:

    Hey thanks sam, I followed your advice and I got a account with Hostgator. This host looks good. Its fast and there is a live support also available.

  20. 320
    Kat Says:

    I can’t believe they haven’t even bothered to update the status page since Monday.
    My website’s up but slow, but webmail is totally out, and IMAP is only occasionally responding. This is outrageous.
    I’ve only been with DH for three weeks—week one we tried it out, week two we transitioned, week three is this nonsense.

    I wish I hadn’t canceled my LayeredTech account—it was an expensive dedicated machine with its own issues, but *nothing* like this.

  21. 321
    bob cobb Says:

    my support question got moved to the severe problem queue, I wonder if that means it will actually be fixed today

  22. 322
    rhbaby Says:

    @bob cobb: Yeah - mine got moved to the severe queue a day ago. Don’t hold your breath.

  23. 323
    yhs Says:

    THIS IS IT.
    If i dont get moved to a new server in 12 hours from now on, i will demand a refund :@
    damn it. this is getting silly.
    every half an our it says the server has been rebooted and I cannot access my account by any means :@
    keeps saying my files are not on the server, and i have more than 50 emails by now from people complaining :@
    FIX IT DAMN IT!

  24. 324
    I am free from this gutter now Says:

    Finally after 2 months of frustration I gave up. Blingy was a nightmare. Trying out at mediatemple and hostgator. Maybe I will I will stick to hostgator. The service team there was really helpful in shifting all my databases and files to their server. The good thing about them is that they offer 45 days money back guaranty with a minimum contract of just one month. At dreamhost I had to pay for the whole year in advance, and when i claimed my money back, they did not refund me the domain registration charges which they claimed to be “free for life”. Ok there were terms associated, but it was totally misleading I would say. Hostgator doesn’t look so tricky, have a live chat support, and the pages are loading much faster than when I was at dreamhost.

  25. 325
    Butters Says:

    What’s going on! Why is it taking so long to fix this problem, its been 2weeks ! My client is so pissed at me now! He wants money back and terminate the contract with me! LIFE is full of problems! I HATE LIFE! DREAM HOST LIVE up to your NAME or my dream will be shuttered by your dream.

  26. 326
    jidanni Says:

    DH: You might think you have fixed it, but your load averages are still high,
    and your disk is about to fill right back up again:
    $ ssh spyro.dreamhost.com w\;df .
    14:35:54 up 8:50, 3 users, load average: 23.27, 28.57, 28.22
    USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
    Filesystem 1K-blocks Used Available Use% Mounted on
    10.175.4.69:/fcvol1/blingy/yoshie
    13711880192 13522472400 189407792 99% /home/.yoshie

  27. 327
    Adam Says:

    *sigh*. This is pretty embarrassing.

  28. 328
    Al Says:

    Now this is really taking some time…

  29. 329
    Syklop Says:

    Byebye, guys !

    I’m away fron dreamhost. My “DH life” was only 15 days….
    I will wait 2 days for DNS propagation, and ask for my money back…

  30. 330
    Sam | I changed my hosting Says:

    Hey guys, can you please tell me how fast this site loads there ? I just changed over to Hostgators , and I want to see if it is fast just here, or everywhere. Please let me know.

  31. 331
    Simon Says:

    Is there a way to tell which server I’m on? I don’t know if I’m on blingy (by I assume I am guessing that it’s down :) )…

  32. 332
    businessgeeks Says:

    Hi SAM, I also changed over to hostgator, which was my original hosting provider before i tried this sad dreamhost hosting service. And love the way my sites perform in hostgator. here is a quick comparison. I have a static page in dreamhost which loads twice as long compared to my site hostgator (which runs joomla) go figure! i hope dreamhost fixes their issues for their other customers. Me? i love the way hostgator eats the competition! hehehe…

    Cheers!

  33. 333
    David Says:

    LOL. I just went to hostgator, and their site is down :-) They must be using Blingy too! (used the hostgator sponsored link on google)
    Going directly to their site works fine though.

  34. 334
    Arghhhh Says:

    @jidanni - my web is on strider and Ive been getting load times of 150:80:70 with only 5 users online!
    At one stage the loads were around the 200 mark! Thats just CRAP!

    Please move me off of the Blingy Cluster! GRRRRRRRRRRRRRRRRRRR

  35. 335
    Sandusky Chinese Restaurants Says:

    Glad to see that was all fixed up.

  36. 336
    bob cobb Says:

    How about an update??????? I am going nuts here

  37. 337
    Disappointed Says:

    > How about an update??????? I am going nuts here

    Me too. If this situation is not resolved within 24 hours, I will have no choice but to leave. I am appalled at the complete lack of communication from DH’s end. I haven’t had an e-mail response in over 2 days.

  38. 338
    bob cobb Says:

    im currently pricing dedicated servers at liquid web. fuck this

  39. 339
    Me Says:

    That nothing is working is bad enough. But please DH guys…..post an update. Our business hasn’t received email in 3 days. I need to know if I have to start calling people to use a different webbased email address. Just tell us where you guys stand right now with fixing this so we can make a choice about how to move on.

  40. 340
    jidanni Says:

    Hello everybody. I’m just here to tell you that I’ve been following
    the disk usage (df - report file system disk space usage) statistic,
    and hereby solely predict the s*it is about to hit the fan again
    within about an hour or two which is when the last few bytes will be
    used up again:

    $ df
    1K-blocks Used Available Use% Mounted on
    10.175.4.69:/fcvol1/blingy/yin
    13711880192 13625750644 86129548 100% /home/.yin
    10.175.4.69:/fcvol1/blingy/yoda
    13711880192 13625753340 86126852 100% /home/.yoda
    10.175.4.69:/fcvol1/blingy/yoi
    13711880192 13625762148 86118044 100% /home/.yoi
    10.175.4.69:/fcvol1/blingy/yolande
    13711880192 13625765552 86114640 100% /home/.yolande…

  41. 341
    jidanni Says:

    “Solely”? I meant “solemnly”. Anyways, last post it was 86114640. Now
    it is 82286796. So you can compute how long before BOOM by my posts’
    timestamps…

  42. 342
    Me Says:

    to jidanni:
    so you are saying in 2.5 hours the disks are full again…..meaning?

  43. 343
    Disappointed Says:

    Meaning everything will go down again at that time.

    74693408 left.

  44. 344
    Me Says:

    oh well…..my email hasn’t worked in 3 days……I am just wondering if this “BOOM” will result in loss of email….can’t imagine the long term damage if that would happen…..

  45. 345
    Me Says:

    To: Jidanni
    I have to admit….kinda feels like New Year……counting down…..k
    keeps the post going please…..only info we get.

  46. 346
    Me Says:

    ok jidanni…one hour after your first post…where are we now?

  47. 347
    bob cobb Says:

    I dunno, but my site is back down after being up for an hour

  48. 348
    jidanni Says:

    It’s really easy to check when BOOM will hit, like I do here with ssh:
    $ ssh spyro.dreamhost.com df .
    Filesystem 1K-blocks Used Available Use% Mounted on
    10.175.4.69:/fcvol1/blingy/yoshie
    13711880192 13659968936 51911256 100% /home/.yoshie
    At my last post was 8… now it only has 5… Mail to ralph@dreamhost.com and
    blingystatus@dreamhost.com just got form responses. Well I tried to warn them…

  49. 349
    Arghhhh Says:

    well Im just suprised it hasnt gone BOOM already. There must be some steam coming out of those servers.

  50. 350
    Arghhhh Says:

    My request for the refund or to be moved from the Blingy cluster has fallen on deaf ears!
    I received the same reply that one of the other members received 2 days ago!!!! WTF?
