Filer problems with blingy cluster.
We are currently having a problem with a filer which has crashed and is recovering at this time. While this is happening some customers in the blingy cluster will experience problems loading their websites/email. We apologize for the outage and service is expected to return to normal as soon as the filer recovers.
UPDATE 3:01 AM PDT
The filer has finished recovering and all services are back up and running. We are working with the filer vendor to find the source of the crash to prevent any further outages.
Update 24/03/08 10am: We’re working on the file server again to alleviate the load that’s causing problems with web, mail and MySQL services. Sorry about that.
Update 27/03/08: We are doing emergency data moves to stem the problems recently caused by your file server. During these moves, your data may be inaccessible. We are moving as much data off as we can, as fast as possible. Very sorry about the continued inconvenience!
Update 27/03/08: This series of moves has finished. We are going to keep an eye on things to see how much it helped and may have to do more moves tonight and tomorrow morning to get everything working smoothly again. This post will be updated with more information as soon as possible.
Update 29/03/08
We are continuing to move data off of the problematic file server, but it’s a bit of a catch-22 because customers on that machine are continuing to add data at a very high rate. It filled up this morning for a while, causing device-full errors as well as mail problems and issues serving websites (when a file server fills up it causes problems across the board). To explain in more detail: when we move data it does not immediately disappear. A ‘snapshot’ of the old data remains in case there was a problem with the move; that ensures we do not lose customer data, but until the admin team can check that the move went through properly we cannot delete the old data. We just did some of that and have some breathing room again, and of course more moves are still in progress, but we are asking customers on this cluster to help us by holding off on any non-essential uploads of data for the next couple of days. As soon as we have a significant portion of the data removed, the problematic file server will begin to function properly (and additional moves will go much more quickly and smoothly), but right now we’re having trouble moving data more quickly than it’s being added. If everyone could please limit uploads to absolutely essential data until we reach the turning point where everything is working, this will be resolved much more quickly. In other words, if for example you are setting up a repository of large files, you’ll actually be better off waiting a couple of days and getting the all-clear from us on this issue, because you’ll be able to access that data reliably instead of cramming it on there now and slowing the recovery process.
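For anyone who wants to see why moving data does not free space right away, here is a rough sketch in Python of the snapshot behaviour described above. It is only an illustration: the `Volume` class, its method names and every number in it are made up for the example and do not describe the actual file server software.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Toy model of a file server volume (all figures are hypothetical)."""
    capacity_gb: float
    live_gb: float              # data customers are actively using
    snapshot_gb: float = 0.0    # old copies kept until a move is verified

    @property
    def used_pct(self) -> float:
        # Snapshots still occupy disk space, so they count as "used".
        return 100.0 * (self.live_gb + self.snapshot_gb) / self.capacity_gb

    def move_off(self, gb: float) -> None:
        # The moved data leaves the live set but stays behind as a snapshot.
        self.live_gb -= gb
        self.snapshot_gb += gb

    def upload(self, gb: float) -> None:
        # New customer uploads add to the live data.
        self.live_gb += gb

    def delete_snapshots(self) -> None:
        # Only after the move is checked can the old copy be removed.
        self.snapshot_gb = 0.0


if __name__ == "__main__":
    vol = Volume(capacity_gb=1000, live_gb=950)
    vol.move_off(100)
    print("after move:", vol.used_pct)
    vol.upload(20)
    print("after new uploads:", vol.used_pct)
    vol.delete_snapshots()
    print("after snapshot delete:", vol.used_pct)
```

In this toy run the usage figure does not drop when the move finishes, rises with new uploads, and only falls once the snapshot is deleted, which is why uploads during the recovery hurt so much.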
In the meantime we’ll be doing everything we can to safely and quickly move data off and get things back to normal.
Added information: Some of the people recently moved to the new file server are seeing errors because the data did not get set up completely (loading the site will work but just show an empty index). The admin team has been running an rsync that will fully restore all data and should hopefully finish by 9 PM PST - once that is finished all site and email data will be available for those users.
Update 30/03/08
We’re still racing to keep ahead of new data being added so any help we can get on that front is greatly appreciated (we’re still asking for customers to limit uploads as much as possible to speed up the recovery process). Some customers who are being moved are seeing blank directories still but those are due to moves in progress and the data will be fully restored when those complete.
Update April 1, 2008
We seem to be ahead of the curve right now: we are moving data off the primary volume and onto a secondary one faster than new data is being uploaded. The volume hasn’t filled up completely in a few days. We are working closely with the technical support team to see how we can speed up the process further. Thank you for your patience.
Update, April 1, 2008
I apologize for the late update but we’ve been going over our options (while moving data of course). While we’re not seeing any real relief in terms of data uploads, we do have some very large moves that are almost complete. Once those finish we can start deleting the data (for example, one is around half a TB, or around 500 GB, which will be 4-5% of the total, but it’s going to take until around Friday to delete it all, so we’re dealing with a ton of data). Tomorrow’s update should be earlier in the day and hopefully we’ll have some progress to report from the large moves being complete.
Update, April 3, 2008
The data moves to other file servers have been running constantly, but last night and this morning some complications came up with the moves, requiring admin attention. To clear up some space there had to be a short interruption in file serving; this is now finished, space is available and the moves are continuing. The admins are fixing up the last of the web servers which were having issues after file serving was restored. Our apologies again for the continued issues.
Update, April 4, 2008
Today has been a pretty good day of progress. We were able to complete even more moves and free up more space on the file server. Moves have been going quicker and stability is dramatically improving. Monitoring of the servers and email in the blingy cluster today has shown a significant decrease in problems. Issues do still exist but the problem is noticeably getting better. We are also pleased to note that we have more storage coming early next week. We believe that this will go a long way in helping us fix this major problem.
Update, April 4, 2008
Things are continuing to improve today - when I got in I was pleased to see that we had held firm and even gained a percent (the affected file server was down to 95%, which is as low as I have seen it in the last week, and since I have been working it has dropped to 94%). Performance should improve as we gain ground (this will speed up moving data off as well). This progress, along with the added storage space we are expecting early next week, should hopefully allow us to restore service for our customers to normal.
5:45 PM PST: The moves we started just a while ago seem to be causing server problems. We’re looking into it and should have it resolved shortly (they were run just like the ones that had completed, so we have to determine why these specifically caused an issue). Update: this resolved itself before we could detect the cause, but we’re monitoring the situation to ensure that it’s not a recurring issue (we have no indication that it will be).
Update, April 8, 2008
Please see our other posting for details on the work we did on the affected file server:
http://www.dreamhoststatus.com/2008/04/06/30-min-blingy-downtime-tonight/
We are also continuing to offload data and are making good progress (it’s never as fast as we would like it to be of course). There’s excellent detail here in case you missed it:
http://blog.dreamhost.com/2008/04/07/another-anatomy/
which chronicles the situation and fills you in pretty much up to today. We’re seeing the data dip to 90% so we’re hoping to have it down in the 80’s by the end of the week (every percent we gain helps and as performance improves we can speed up the rate of moving but we’re still looking to hit critical mass where you get the proper level of performance).
Update, April 9, 2008
As we had hoped progress is speeding up as we free up more space - while the file server is showing 95% usage, around 11% of that is data that has already been moved and is no longer in use. Due to a software issue we haven’t been able to remove it yet (the admin team is working on the best way to execute that), but once that is gone we should be around 85% usage which is another large step forward.
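As a quick check of the figures above, here is a small Python sketch of the arithmetic. The update does not say whether the 11% is a share of the whole volume or a share of the used 95%, so both readings are shown; the function names are made up for this example.

```python
def usage_if_points(used_pct: float, stale_points: float) -> float:
    """Reading 1 (assumption): the stale data is 11 percentage points of the volume."""
    return used_pct - stale_points


def usage_if_fraction(used_pct: float, stale_fraction: float) -> float:
    """Reading 2 (assumption): the stale data is 11% of the space currently used."""
    return used_pct * (1.0 - stale_fraction)


if __name__ == "__main__":
    # 95% and 11% are the figures quoted in the update above.
    print(round(usage_if_points(95.0, 11.0), 2))
    print(round(usage_if_fraction(95.0, 0.11), 2))
```

Either way the result lands a little under 85%, which matches the "around 85%" estimate.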
In terms of effect, I have already seen improved site function for many customers and greatly increased speed in moving chunks of data off, and we are receiving reports that mail is functioning quite a bit better. That said, this issue remains at the High severity rating and in unresolved status, as we have not reached a normal level of service. I can’t stress enough how sorry I am that our customers have had to put up with this, but I thank those of you who have stuck with us (check the newsletter for details on what we’re doing for Blingy customers) and look forward to providing you with the level of service we strive for at DreamHost.
Update, April 9th, 2008 21:59 PDT
Unfortunately, we need to unmount the volume again to kill these snapshots before they leave us with 0 bytes of free space. In 2 hours (midnight) I will be taking the problematic volume offline to delete the phantom snapshots. Total downtime will be between 10 and 30 minutes. Sorry for the short notice and additional outage!
Update, April 10, 2008
Well, the snapshot mentioned yesterday is gone and we’re actually at 83% used today, which is below the level where we were hoping to see marked improvement (85%). Of course we’re still moving data off (which increases the load on the file server), so that won’t fully translate into improvement for customers, but it should be quite a bit better and keep improving until we stop moving data.
Update, April 14, 2008
Okay, we’re finally getting ready to mark this as resolved; things have seemed pretty much okay for a while now. But, just to be sure, we’re dropping the severity to Medium for now and leaving it as unresolved.
Update, April 17, 2008
We’re still hearing some reports of site slowness - we were able to resolve an issue causing high loads today which should help but we’re not going to consider this resolved until everyone is receiving good service.
| Severity: | Medium | Resolved: | No |
March 24th, 2008 at 11:35 am
Looks like everyone has a case of the Mondays. Hardware fails….it happens…you can always request from them that they move you to a different cluster…I did that with another one of my accounts and I haven’t had issues with them since. I have 3 different accounts and only one is affected by this outage. When did this one start? Luckily I develop on a LAMP virtual machine and I don’t need the live server to develop on… But it does suck since I get commission off web leads…and web don’t werk rite now.
Have faith, they will get it fixed.
March 24th, 2008 at 11:39 am
It happened to me two hours ago.
I can’t access my website or FTP.
After looking around for a problem on my computer, I wanted to contact support because I realized that a friend on DreamHost doesn’t have problems.
And I saw this warning.
I don’t understand this blingy problem. I’m not an engineer, but I’m surprised to see that a 4-day problem can disturb other users later??
It would be very appreciated if you displayed an estimated time needed to solve this problem.
It would also be useful to add how much time was spent when the problem is updated to say it’s OK.
I’m a new user, and it’s my first time using a host. If flaming is natural, it’s only fair to thank you for displaying info about this disruption.
Keep up the good work.
March 24th, 2008 at 11:47 am
Come on guys. It’s big boy time. Get things working. This is ridiculous.
March 24th, 2008 at 11:53 am
I was in midle of online mail sexy exchange. You bastards!
March 24th, 2008 at 11:54 am
I don’t believe it. You guys are not serious. This is fucked up stuff.
March 24th, 2008 at 11:56 am
The funniest thing is that the Dreamhost status site always seems to be working. I wonder if it is hosted by a different company?
March 24th, 2008 at 11:58 am
ok, mine is working….ooh…seems faster too! (maybe they upgraded)
March 24th, 2008 at 12:05 pm
Over on jurupa, we have us some huge server loads that have been going on all day. These loads are NOT reflected in the database. Is this a result of these problems or something else?
March 24th, 2008 at 12:13 pm
Ive been with dreamhost for a few years and my sites have only been down a few times in those years. This is the longest I have been unable to get to them though. I usually dont mind waiting but today I have a lot to do with my sites, I need them back up and running. Fix it in a hurry please.
March 24th, 2008 at 12:24 pm
This is getting ridiculous. I have important emails from the WA State Department of Commerce in my inbox that need handling, yet I can’t check my mail via Thunderbird or WebMail. I’m getting a little tired of this, DreamHost, and want some compensation. Your company looking bad is making mine look bad.
Thomas Albert Marsland II
CEO, MarsTech Computing, LLC
March 24th, 2008 at 12:34 pm
@ 106 - Yes. http://www.servepath.com/ seems to be the one.
March 24th, 2008 at 12:36 pm
Have been with DH for just one month and this is the third problem, and by far, the worst. Tech support has usually been responsive, but not this time. My sites are up, but NO email now for hours. Clients are not happy, so I’m not happy. This is EXACTLY the reason I changed hosts and I am not impressed. Lets get with the program fellas.
March 24th, 2008 at 1:13 pm
What servers are in the blingy cluster? I’m guessing COLBERT must be as my sites are SLOOOOOOOOOWWWWWWWWWWWW
March 24th, 2008 at 1:17 pm
my site has been slow all day and its been down for the past hour or two. This is Peak time!
March 24th, 2008 at 1:19 pm
This is not a ‘Dream’host, this is a NIGHTMAREhost!!!
Come on guys!!! i expect some kind of refund for this month, but even that will not
make up for all the time lost today!! I am on deadlines and i have not been able to read
1 email today!!
March 24th, 2008 at 1:24 pm
I’m leaving. My account is still within their 97 day guarantee.
Up and onwards I say!
March 24th, 2008 at 1:25 pm
I got email and website is up, but no mysql…which runs my shopping cart…which is my business…and we’re in peak season. Average about 10 orders a day this time of year….today NONE. I’m pretty aggravated.
March 24th, 2008 at 1:26 pm
Holy crap, it’s 5:30 PM EST and this is STILL listed as an unresolved issue? Are you f***ing kidding?
March 24th, 2008 at 1:29 pm
I am now reading my email faster than “300 baud modem”-like blingy server can emit it.
Idea: replace faulty hardware or software.
March 24th, 2008 at 1:32 pm
I found this number for DreamHost but nobody is answering: (714) 706-4182. I am not gonna stop calling until I get some information.
March 24th, 2008 at 1:35 pm
mine is back on (about time :@ )
but its cool now, mail still hell slow though.
site seems faster, i agree.
March 24th, 2008 at 1:42 pm
Still errors, still slow. Kind of upset at how long this has been going on!!!
March 24th, 2008 at 1:51 pm
still down here, my email has worked fine though, which I couldnt care less about
March 24th, 2008 at 1:57 pm
Guys, an ETA would be great! 2 clients dead in the water until this gets resolved.
Its hard to reassure my clients if you won’t reassure yours.
March 24th, 2008 at 2:06 pm
Guys, I understand everyone has issues with hardware, but this is a business, I can’t afford to not get my email all day long. A cluster is supposed to prevent this kind of down time, you might as well have it on a stand alone server.
March 24th, 2008 at 2:08 pm
Come on guys. How long does it take to fix this.
I can not send any email on the webmail function and half the time can’t get into my accounts.
This is unsat.
Keith
March 24th, 2008 at 2:09 pm
Neat. 3 minutes to get 2550 bytes. Less than 300 baud, fellas.
06:03 ~$ getmail
getmail version 4.7.8
Copyright (C) 1998-2007 Charles Cazabon. Licensed under the GNU GPL version 2.
SimpleIMAPSSLRetriever:jidanni@jidanni.org@mail.jidanni.org:993:
msg 1/1 (2550 bytes) delivered, deleted
1 messages (2550 bytes) retrieved, 0 skipped
You have new mail in /home/jidanni/Maildir/
06:06 ~$
March 24th, 2008 at 2:09 pm
Any chance of this being sorted soon? I am intermittently getting mail but I can’t move it or anything. I simply can’t continue like this and it’s really making me think about moving my services having been so happy with DreamHost for so long…
March 24th, 2008 at 2:27 pm
Status update please??
March 24th, 2008 at 2:55 pm
Tech Support - some information is better than NONE AT ALL! If you are still working on it, fine, but post it in a time stamped update. An ETA would be even better. You are getting more than a few disappointed customers and it looks like a lot of them (including me) have only been here a short while.
March 24th, 2008 at 3:03 pm
hi,
the problem probably is that they are having issues with their fileserver and that they don’t have a solution for it. if you read the status post carefully, then you notice that they have opened a case with the supplier.
They are probably doing their utmost to reduce the effects of the problem.
March 24th, 2008 at 3:13 pm
hi,
they are probably not responding because they are working on this. From what I can tell, they have decent fileservers.
Have a look at this http://www.bluearc.com/html/customers/internet-services.shtml
There seems to be some problem with some of the fileservers in this cluster, and it is not easy to get this fixed. But I think they are doing their utmost.
March 24th, 2008 at 3:14 pm
Ok, defenders of DH: Keep in mind that those of us in the cluster have been dealing with these issues for more than a month and a half. I did ask to be migrated to another cluster and they refused. Now that I enter the 48 hours that I most need the site up, even the really slow site DH allows me, it’s all screwed.
So, with a group of people with whom I have no track record (new clients) _I_ look like ass.
March 24th, 2008 at 3:18 pm
Oh yes, “They’re working on it.” They’ve been working on it for weeks. They should have been working on isolating the machine and getting it out of production.
March 24th, 2008 at 3:20 pm
Ive said it before and Ill say it again, Id be gone from dreamhost if it wasnt for their panel. I cant find any other service with a panel that is as easy to use as theirs. I spend $80-$200/month for a private server for a big blog site, and all of the other companies out there use confusing panels like plesk
March 24th, 2008 at 3:27 pm
this is inexcusable, especially for such a long period of time. they’ve just lost all customers related to my accounts and any future recommendations.
March 24th, 2008 at 3:47 pm
DREAMHOST RULES!!!
you guys are haters
March 24th, 2008 at 3:51 pm
Does this involve harpo server?
March 24th, 2008 at 3:53 pm
Sites are working better, but Email is so slow as to be unusable.
March 24th, 2008 at 3:54 pm
So ????
March 24th, 2008 at 3:58 pm
I have a couple of my customers which unfortunately are on the blingy cluster…
looking back, the problems with blingy started at the end of dec’07.
now, back in the old telecom world, this would have been more than ample time to FIX, or remove/replace the unit. in this case, I would say that DH needs to bite the bullet and replace the hardware…
March 24th, 2008 at 3:59 pm
Funny thing Randy - I have the opposite: My E-mail is working fine, but my sites are down completely.
March 24th, 2008 at 4:01 pm
same here dan k, never had any issues with email today, and dont care about email
March 24th, 2008 at 4:04 pm
come on guys, give them a break, they’re prolly trying to fix it
March 24th, 2008 at 4:17 pm
Email finally back up (my sites and ftp were not affected) and seems to be ok . . . for the moment.
March 24th, 2008 at 4:41 pm
my sites are down with 404 errors and i can’t access FTP, i guess I am affected?
March 24th, 2008 at 4:42 pm
My sites are back up now.
March 24th, 2008 at 4:42 pm
Oh look. My site’s up. Must be a full moon in Cali tonight. I wonder how long this will last.
March 24th, 2008 at 5:06 pm
consider yourself lucky, my site is STILL down
March 24th, 2008 at 5:13 pm
how the hell do you find out if you’re in the blingy cluster? i’m on barry and my mysql is on duchess. sql and email work fine. html pages and most scripted pages work fine. but one of my scripts is incredibly slow user-side and completely times out admin-side. been like this all day. reported it and it got marked as resolved but of course it’s not. asked about it in a support ticket and no answer. as soon as i find time to switch hosts, i am. DH has sucked for at least a year. it’s ridiculous.