Filer problems with blingy cluster.
We are currently having a problem with a filer which has crashed and is recovering at this time. While this is happening some customers in the blingy cluster will experience problems loading their websites/email. We apologize for the outage and service is expected to return to normal as soon as the filer recovers.
UPDATE 3:01 AM PDT
The filer has finished recovering and all services are back up and running. We are working with the filer vendor to find the source of the crash to prevent any further outages.
Update 24/03/08 10am: We’re working on the file server again to alleviate the load that’s causing problems with web, mail and mysql services. Sorry about that.
Update 27/03/08: We are doing emergency data moves to stem the problems recently caused by your file server. During these moves, your data may be inaccessible. We are moving as much as we can off as fast as possible. Very sorry about the continued inconvenience!
Update 27/03/08 This series of moves has finished. We are going to keep an eye on things to see how much it helped and may have to do more moves tonight and tomorrow morning to get everything working smoothly again. This post will be updated with more information as soon as possible.
Update 29/03/08
We are continuing to move data off of the problematic file server but it’s a bit of a catch-22 because customers on that machine are continuing to add data at a very high rate. It filled up this morning for a while, causing device full errors as well as mail problems and issues serving websites (when these fill up it causes problems across the board).
To explain in more detail, when we move data it does not immediately disappear: a ‘snapshot’ of the old data remains in case there was a problem with the move. That ensures that we do not lose customer data, but until the admin team can check that the move went through properly we cannot delete the old data. We just did some of that and have some breathing room again, and of course more moves are still in progress, but we are asking customers on this cluster to help us by holding off on any non-essential uploads of data for the next couple of days.
As soon as we have a significant portion of the data removed the problematic file server will begin to function properly (and additional moves will go much more quickly and smoothly) but right now we’re having trouble moving data more quickly than it’s being added by people. If everyone could please limit uploads to absolutely essential data until we reach the turning point where everything is working, this will be resolved much more quickly (in other words, if for example you are setting up a repository of large files you’ll actually be better off waiting a couple of days and getting the all clear from us on this issue, because you’ll be able to access that data reliably instead of cramming it on there now and slowing the recovery process).
In the meantime we’ll be doing everything we can to safely and quickly move data off and get things back to normal.
Added information: Some of the people recently moved to the new file server are seeing errors because the data did not get set up completely (loading the site will work but just show an empty index). The admin team has been running an rsync that will fully restore all data and should hopefully finish by 9 PM PST - once that is finished all site and email data will be available for those users.
Update 30/03/08
We’re still racing to keep ahead of new data being added so any help we can get on that front is greatly appreciated (we’re still asking for customers to limit uploads as much as possible to speed up the recovery process). Some customers who are being moved are seeing blank directories still but those are due to moves in progress and the data will be fully restored when those complete.
Update April 1, 2008
We seem to be ahead of the curve right now: we are moving data off the primary volume and on to a secondary one faster than new data is being uploaded. The volume hasn’t filled up completely in a few days. We are working closely with the technical support team to see how we can speed up the process further. Thank you for your patience.
Update, April 1, 2008
I apologize for the late update but we’ve been going over our options (while moving data of course). While we’re not seeing any real relief in terms of data uploads we do have some very large moves that are almost complete. Once those finish we can start deleting the data (for example one is around half a TB, or around 500 GB, which will be 4-5% of the total, but it’s going to take until around Friday to delete it all, so we’re dealing with a ton of data). Tomorrow’s update should be earlier in the day and hopefully we’ll have some progress to report from the large moves being complete.
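For a sense of scale, the figures in this update can be turned around to estimate the total they imply. This is only a rough sketch: the function name `implied_total_gb` is made up for illustration, it treats 1 TB as 1000 GB (matching the "half a TB or around 500 GB" wording), and it assumes "the total" means the total amount of data on the file server, which the update does not spell out.

```python
def implied_total_gb(chunk_gb, share_pct):
    """Total size implied when chunk_gb makes up share_pct percent of it.

    Hypothetical helper for illustration; not part of any DreamHost tooling.
    """
    if share_pct <= 0:
        raise ValueError("share_pct must be positive")
    return chunk_gb * 100 / share_pct


# A 500 GB move described as 4-5% of the total.
low = implied_total_gb(500, 5)   # if 500 GB is 5% of the total
high = implied_total_gb(500, 4)  # if 500 GB is 4% of the total
print(f"Implied total: {low / 1000:.1f} to {high / 1000:.1f} TB")
```

Under those assumptions the total works out to somewhere between 10 and 12.5 TB.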
Update, April 3, 2008
The data moves to other file servers have been running constantly, but last night and this morning some complications arose with the moves, requiring admin attention. To clear up some space there had to be a short interruption in file serving; this is now finished, space is available and the moves are continuing. The admins are fixing up the last of the web servers which were having issues after file serving was restored. Our apologies again for the continued issues.
Update, April 4, 2008
Today has been a pretty good day of progress. We were able to complete even more moves and free up more space on the file server. Moves have been going quicker and stability is dramatically improving. Monitoring of the servers and email in the blingy cluster today has shown a significant decrease in problems. Issues do still exist but the problem is noticeably getting better. We are also pleased to note that we have more storage that will be coming early next week. We believe that this will go a long way in helping us fix this major problem.
Update, April 4, 2008
Things are continuing to improve today - when I got in I was pleased to see that we had held firm and even gained a percent (the affected file server was down to 95% which is as low as I have seen it in the last week and since I have been working it has dropped to 94%). Performance should improve as we gain ground (this will speed up moving data off as well). This progress, along with the added storage space we are expecting early next week, should hopefully allow us to restore service for our customers to normal.
5:45 PM PST: The moves we started just a while ago seem to be causing server problems; we’re looking into it and should have it resolved shortly (they were run just like the ones that had completed so we have to determine why these specifically caused an issue). Update: this resolved itself before we could detect the cause but we’re monitoring the situation to ensure that it’s not a recurring issue (we have no indication that it will be).
Update, April 8, 2008
Please see our other posting for details on the work we did on the affected file server:
http://www.dreamhoststatus.com/2008/04/06/30-min-blingy-downtime-tonight/
We are also continuing to offload data and are making good progress (it’s never as fast as we would like it to be of course). There’s excellent detail here in case you missed it:
http://blog.dreamhost.com/2008/04/07/another-anatomy/
which chronicles the situation and fills you in pretty much up to today. We’re seeing the data dip to 90% so we’re hoping to have it down in the 80’s by the end of the week (every percent we gain helps and as performance improves we can speed up the rate of moving but we’re still looking to hit critical mass where you get the proper level of performance).
Update, April 9, 2008
As we had hoped progress is speeding up as we free up more space - while the file server is showing 95% usage, around 11% of that is data that has already been moved and is no longer in use. Due to a software issue we haven’t been able to remove it yet (the admin team is working on the best way to execute that), but once that is gone we should be around 85% usage which is another large step forward.
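As a rough check of those figures, here is a minimal Python sketch of the arithmetic. The function name `usage_after_cleanup` is made up for illustration, and it assumes the 11% is a share of the whole volume (like the 95% figure) rather than a share of the used space.

```python
def usage_after_cleanup(reported_pct, stale_pct):
    """Percent of the volume still in use once already-moved data is deleted.

    Hypothetical helper for illustration; both arguments are percentages
    of the whole volume.
    """
    if not 0 <= stale_pct <= reported_pct <= 100:
        raise ValueError("expected 0 <= stale_pct <= reported_pct <= 100")
    return reported_pct - stale_pct


# 95% reported usage, of which around 11% has already been moved elsewhere.
print(usage_after_cleanup(95, 11))  # 84
```

That lands at 84%, in line with the "around 85%" estimate, since the 11% itself is approximate.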
In terms of effect, I have already seen improvement in site function for many customers and greatly increased speed in moving chunks of data off, and we are receiving reports that mail is functioning quite a bit better. That said, this issue remains at the High severity rating and in unresolved status as we have not reached a normal level of service. I can’t stress enough how sorry I am that our customers have had to put up with this but I thank those of you who have stuck with us (check the newsletter for details on what we’re doing for Blingy customers) and look forward to providing you with the level of service we strive for at DreamHost.
Update, April 9th, 2008 21:59 PDT
Unfortunately, we need to unmount the volume again to kill these snapshots before they leave us with 0 bytes of free space. In 2 hours (midnight) I will be taking the problematic volume offline to delete the phantom snapshots. Total downtime will be between 10 and 30 minutes. Sorry for the short notice and additional outage!
Update, April 10, 2008
Well the snapshot mentioned yesterday is gone and we’re actually at 83% used today which is below where we were hoping to see marked improvement (85%). Of course we’re still moving data off (which increases the usage on the file server) so that won’t fully translate to customer usage improvement but it should be quite a bit better and keep improving until we stop moving data.
Update, April 14, 2008
Okay, we’re finally getting ready to mark this as resolved. Things have seemed pretty much okay for a while now. But, just to be sure, we’re dropping the severity to Medium for now and leaving it as unresolved.
Update, April 17, 2008
We’re still hearing some reports of site slowness - we were able to resolve an issue causing high loads today which should help but we’re not going to consider this resolved until everyone is receiving good service.
Severity: Medium | Resolved: No
March 25th, 2008 at 7:35 am
@ Jim Sullivan - Toadstool is a single webserver, which is in the Blingy cluster of servers. Since my cluster is in the LAX datacenter which has been having ongoing problems for quite a while too, I know all too well that these situations can be frustrating to say the least, but even so how are you expecting triple redundant systems and consequential loss coverage on discount hosting?
March 25th, 2008 at 7:41 am
Our charity is losing money every minute and DH isn’t even responding to support request!
March 25th, 2008 at 7:42 am
Also I am on snocap and all my websites are down !
March 25th, 2008 at 8:04 am
Come on guys! My sales force can’t get into their damn webmail. WE ARE LOSING MONEY BECAUSE YOU GUYS CAN’T FIX STUFF?!!!!
March 25th, 2008 at 8:07 am
You’re losing money because YOU are incompetent.
Any of you that rely on a single host, instead of investing in a failover setup, have no right to complain about downtime, since every single host has it.
If you can’t put that tiny bit of effort into your site, stop pretending it’s important.
March 25th, 2008 at 8:08 am
Damn it, come on….. I’m in snocap too and my web sites are down. How long do we have to wait??
March 25th, 2008 at 8:12 am
All my sites still down !!!!
March 25th, 2008 at 8:12 am
THIS IS UNACCEPTABLE!!!
How can this be low priority when, so many ppl are not able to connect.
I cannot even connect to my site, as well as none of my images were showing up, when I was able to connect!
DREAMHOST, GET THIS FIXED IMMEDIATELY OR I CAN GUARANTEE THAT EVERYONE IS GOING TO LEAVE!!!!!!!!
March 25th, 2008 at 8:17 am
Does anyone know if this issue is effectively losing mail? I don’t want to login to my inbox after all these issues have been sorted out and find that I have lost all my email for the past 24 hours.
March 25th, 2008 at 8:35 am
Is this still a problem?!? I can’t get my email or load my site. Has this seriously been going on for FIVE DAYS?!?!
March 25th, 2008 at 8:35 am
In past experience (unfortunately I’ve had a lot of experience with this now) the mail simply came through when the service came back on.
Don’t bother asking Dreamhost about that, you’ll get a template e-mail response.
When this current account is done, I’m going. Totally regret the el-cheapo route for hosting and e-mail.
PLEASE GET THIS FIXED.
March 25th, 2008 at 8:36 am
Is this STILL NOT FIXED?!? I can’t get my email or load my site. Has this seriously been going on for FIVE DAYS?!?
March 25th, 2008 at 8:39 am
How come you all moron admins are sitting over there and selling your accounts for 10 fucking years? and we all fucked-up site admins are suffering for your stupid mistakes one after all! We had a major release today and only for you we’ve missed it!
March 25th, 2008 at 8:40 am
Here we go again. Everything was working around 5:00 pm pst yesterday, and all night. Got the long apologetic email from DH and now email is down again this morning. I can’t work. My main client is one of the largest networking equipment companies in the world - they are not amused.
March 25th, 2008 at 8:52 am
Keep up the hard work DH Staff. Don’t let the negativity get you down.
I’m hoping mail service will be returned soon. Thanks!
March 25th, 2008 at 9:02 am
Ironic, I didnt get an apology email. Because my email is down.
March 25th, 2008 at 9:04 am
If you’ve got a gmail account and your Dreamhost email is still belly-up, gmail’s “Get mail from other accounts” option will retrieve Dreamhost handled mail, albeit slowly. Use the setting “Reply from received account” to add the illusion of online-ness to your correspondence. You may have to do the setup a couple of times @ gmail to get past the timeouts, but once that’s done, then you can at least communicate. It’s not a fix, true, but it’s functional.
March 25th, 2008 at 9:09 am
Starts to be annoying..
I have several businesses depending on mail services at the failing cluster.
No mail = losing money = not good indeed.
March 25th, 2008 at 9:14 am
I NEED MY EMAIL SERVICE UP!!!!
March 25th, 2008 at 9:17 am
To expand on rabidg’s comment, I successfully set up forwarding to my ISP via the Mail panel, and it seems to be working.
So you don’t need gmail, per se, just another mail account WITH A COMPANY THAT ACTUALLY GIVES A DAMN ABOUT ITS CUSTOMERS. You can setup forwarding from NightmareHost until you, like myself, find a decent hosting company to switch to.
Fortunately I’ve still got 30 days to get my money back (tho switching is a PITA I didn’t really need right now).
And for those accusing us victims of poor judgement for expecting redundancy for shared hosting, I do indeed expect redundancy…because NMH is aggregating lots of sites, redundancy should be even more important.
Occasional slow page loads or slow mysql, sure…but week-long outages ? NOT ACCEPTABLE.
March 25th, 2008 at 9:19 am
FYI, a DreamHost moderator is now systematically DELETING certain posts here, including mine.
March 25th, 2008 at 9:26 am
Shawn (208) is right. Expecting Dreamhost to provide the service they promise is ridiculous! Their failure to repair the issue in a timely fashion is all my fault! What a jerk I must be!
OK… sarcasm off for a sec… having this problem debilitate several of my client’s day to day business is about as bad a scenario as I could envision. Now I’m living that personal hell.
Good times
Oh yeah… let’s get this thing back up pronto, OK?
March 25th, 2008 at 9:29 am
I feel for you TC, I just wish we could get an appropriate update about this. This problem has remained open too long.
March 25th, 2008 at 9:33 am
Luckily mine has been up long enough to get some things updated (message about hosting server experiencing technical difficulties), but I agree they should provide a little more information on the status….we’re close…we need a new HDD….we’re looking in the yellow pages for a PC repair tech….the message hasn’t been updated….I haven’t moved my mail from the other domain, so it still works, but I noticed webmail displaying beautiful red messages.
March 25th, 2008 at 9:33 am
I too am starting to become very disgruntled regarding the downtime. Access to my email is imperative and I am losing money and clients with the downtime. Like others that have posted, I have a PR to send out, which I can’t do until my systems are functional once again. I can understand downtime, but five days is a little extreme - particularly when we are not receiving regular, frequent updates as to what’s being done to fix the problem along with estimated times when we can expect service to be restored.
March 25th, 2008 at 9:35 am
If you are “born” into a certain cluster are you stuck there forever? I started with Dreamhost at the first of March 2008, am on Blingy/Phil/Ness, and so far performance has been poor at best :/ I’m not doing anything critical, just testing the waters with a couple of wordpress blogs, but this is pathetic, so far I would not trust this service for production or commercial services. Maybe I shouldn’t have paid 3 years out
Thanks.
March 25th, 2008 at 9:42 am
DM offer a 90 day money back refund…
Do what I’m doing… get all your shit off of DH and get a refund pronto before they go tits. I have no doubt that a *lot* of their ‘customers’ are jumping ship.
March 25th, 2008 at 9:44 am
I don’t know whether the outage is a normal occurrence with other hosts as well, but I do know that there’s no real excuse (and it’s just plain unwise) to keep your customers uninformed. We can’t be understanding about what’s happening if you don’t give us the information to understand what’s going on.
March 25th, 2008 at 9:46 am
Ok why the fck can’t I get to half the sites on my dedicated server now?
March 25th, 2008 at 10:02 am
my site was up for the last 12 hours, and now Im down again. Im not even on the blingy cluster
March 25th, 2008 at 10:05 am
DH, really, stop dicking around. This is some grade-A bullshit. I’ve been using DH to set up small businesses and now I look the fool, big time. All because I wanted to support a local outfit (I’m from OC). And for Christ’s sakes, hire a PR firm.
March 25th, 2008 at 10:06 am
There have been a lot of major network problems happening since Monday morning, 3/24. Not only with DreamHost, but with a number of other email providers and networks. NetFlix is one, for example, that was down for 12 hours yesterday. They don’t say why. My company has been having non-DreamHost-related email problems since Monday and some of our clients are experiencing the same. A friend of mine in New Hampshire told me that her company’s network was down yesterday as well, and they also had email issues. She said that she was aware of a few other friends/acquaintances whose networks were either down, or experiencing issues.
If you ask me, I think there is something going on in the bigger picture here, or there is a damn lot of coincidence! My other DreamHost sites are working great, actually seem faster than before Monday. You guys need to remember that networks such as theirs are increasingly complex and I’m sure they’re doing the best they can - it’s in their best interest to resolve issues promptly.
March 25th, 2008 at 10:06 am
I’d be a happier unhappy customer if DH would simply up their communication on this issue. As it is, they aren’t providing their customers with any comforting words to indicate that they are still working on the issue. Are they working on the issue? Who knows? Have any of you received a response from tech support? (not including the bulkmail message from last night)
I’m sure they’re losing customers that they could easily keep, simply because they refuse to communicate. No one likes being kept in the dark.
March 25th, 2008 at 10:08 am
They never do provide a message. They usually are pretty good when you put in a ticket, but, with big issues like this, they are less than pathetic. I would love to know who runs the company. That person owes the people who, you know, pay their friggin salaries an explanation and an apology.
March 25th, 2008 at 10:09 am
I recently received a response through my trouble ticket that said they are working on the issue. Nothing specific but they are working on it.
March 25th, 2008 at 12:20 pm
My clients will be moving to a new host very soon. I loved DH’s panel interface and the options that they have, but the outages have just been unacceptable.
March 25th, 2008 at 12:20 pm
What I appreciate most is the total lack of updates on the status. WTF is dreamhoststatus.com for if it’s not updated regularly? I saw where DH recommends we all link up with Twitter to catch their Twitter feed. No thanks, DH. It’s YOUR job to keep me updated via YOUR sites, not require us to all sign on to YET ANOTHER site to see if you might update that one.
March 25th, 2008 at 12:34 pm
damn it!
Again site is down and I can’t connect. Will it ever be solved?!
March 25th, 2008 at 12:42 pm
*cant even log onto web panel*
No idea if I am on Blinghy but I cant get onto my websites either.
March 25th, 2008 at 1:51 pm
STILL THE SAME TONIGHT.
I CAN’T BELIEVE A PROFESSIONAL HOSTER CAN BE SO BAD…
DREAMHOST SUCKS
March 25th, 2008 at 1:59 pm
Does it seem ironic that the only way to have technical service is via email when you cannot get email? We are a PR agency and not having email is near sudden death.
March 25th, 2008 at 2:00 pm
Seems like they patch stuff up just before they cut out for the evening and then start ripping it up again the next morning. Nothing like experimenting with your production environment during the workday…
March 25th, 2008 at 2:03 pm
Awww… “they’re doing the best they can…” What a load. They could have started moving people off of this piece of crap weeks ago rather than in the midst of all of their server moves.
March 25th, 2008 at 2:14 pm
All my sites are down. I’m on Cheezit. Dreamhost’s control panel is dead too.
March 25th, 2008 at 2:14 pm
Oh, come on, at least get the control panel working so I can change the MX records! I can’t even get clients AWAY from you, now. Aaaaagggggghhhhhhh!!!!!!!! :”(
March 25th, 2008 at 2:15 pm
I cannot imagine why is so hard to repair or to replace that filer. If I were your boss, I take my “pal” with six bullets and fire your sysadmins. The remaining crew should work better and with more knowledge.
March 25th, 2008 at 2:22 pm
It may be a filer malfunction, but I just tried to backup MySQL so I could move the site and I had a “No space left on device” warning. I checked the filesystem mounts and every blingy mount is 100% full.
Yay!
March 25th, 2008 at 2:28 pm
@matt
yeah the filer has been having a problem of running out of space for the past month, it’s run out at least 3-4 times this month.
March 25th, 2008 at 2:31 pm
I have hosted with dh for 5 years, and though they go down, they are cheap and that’s what you get.
however, I have never been down for 5 days. and I have never been down for 5 days without information. that’s inexcusable.
Dreamhost, is there any way you can get us more information on what is happening here?! All we want is an estimate of the size of the problem and perhaps an organized answer to some basic questions
1. does file server down mean that our emails are in queue, or are being bounced?
2. does file server down mean that once you are up, we will have our databases exactly as we left them?
3. is there any way to request being moved to a different server than blingy?
it is embarrassing to have to use gmail as a backup to mail from a paid server. but that’s what I’ve had to do. oh well…cheap is cheap.
March 25th, 2008 at 2:35 pm
I’ve already signed up on hostmonster. The whole reason I have a Dreamhost account is for one day: today. Boy did this suck. Anyway, drop kicking this crappy account is the first thing I’m doing tomorrow morning.