Friday 29th May 2020 was a disaster for us and would have been for anyone. It’s a story that should serve as a warning to everyone who values their data.

What happened

In summary:

  • Our host pulled the power on our busy Database (DB) server 
  • We didn’t have an up-to-date backup
  • We seriously needed a miracle (in fact, we got several)
  • We have permanently lost 10 days of statistical data

I woke up naturally at my normal time between 4 am and 5 am, kicked my laptop into action, hit <return> on a command to execute, and went to make a coffee. At that time, I didn’t know that our server with our primary database was no longer available.

We use MySQL as a primary database, and we had a slave to that server which operated as a backup device (performing daily backups) as well as a failover option (should the main database server fail). A couple of weeks ago our slave DB server died and the hardware was unrecoverable. We made the decision that we would work towards a better solution, and we have been working on it for the last couple of weeks. The day our systems failed was supposed to be the first day of use for part of the new backup system. Oh, the perfectly timed failure. Thank you universe for your sense of humour.

In the meantime, I have been backing up the database every day using a temporary solution. Some days earlier, our hosting provider informed us that our database server was going to be moved. So, with the new backup system ready for primetime, our plan was to switch off the sites, back up the DB, and allow the server to move. All being well, the server move would go without a hitch and no-one would really notice. But, if something should go wrong, we would have a backup we could restore that would have 100% of our data back again in just minutes.

I misread their email and thought the move was scheduled for a day later, so their switch-off of our database server was unexpected. I didn’t get a chance to do a backup prior to the move, nor to check things or even to shut down any of our services, and that included a busy database. Of course, the move resulted in a catastrophic failure and the machine cannot be recovered. The odds of this perfect storm of events are quite remote, but here it is, and it gets worse.

Our monitoring system didn’t alert us, so when I woke I had no idea things were failing. My morning routine (wake up, start a backup, make coffee, meditate, etc.) happened as usual. But because the database server was no longer there, the ad-hoc backup ran anyway and overwrote our ONLY recent database copy. The latest complete data that we had was from 20th May 2020. This means 8 days of data would have been erased. Even then, that copy hadn’t been properly battle-tested yet, so we had no idea how ‘complete’ it might be.
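To illustrate how a backup run can avoid destroying the previous copy, here is a minimal sketch in Python. It is not the tool we used. The function name safe_backup, the BackupFailed exception, the file naming, and the example mysqldump arguments are all assumptions made for illustration. The idea is to write the dump to a temporary file, keep it only if the dump command succeeded and produced output, and give every good backup its own timestamped name so nothing is ever overwritten.

```python
import os
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class BackupFailed(Exception):
    """Raised when a dump cannot be trusted; existing backups are left alone."""


def safe_backup(dump_cmd, backup_dir, min_bytes=1):
    """Run dump_cmd and keep its output only if the command succeeded.

    dump_cmd is an argument list whose stdout is the dump (hypothetical
    example: ["mysqldump", "--single-transaction", "mydb"]).
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file first, so a failed run never touches
    # a backup that already exists.
    fd, tmp_name = tempfile.mkstemp(dir=str(backup_dir), suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as tmp:
            result = subprocess.run(dump_cmd, stdout=tmp, stderr=subprocess.PIPE)
        size = os.path.getsize(tmp_name)
        if result.returncode != 0 or size < min_bytes:
            raise BackupFailed(
                f"dump exited with {result.returncode}, wrote {size} bytes"
            )
        # Each good backup gets a unique timestamped name (UTC).
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        final = backup_dir / f"db-{stamp}.sql"
        os.replace(tmp_name, final)  # atomic rename within one filesystem
        return final
    finally:
        # Remove the partial file if it was not promoted to a real backup.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
```

With something like this, a run against a missing server would raise BackupFailed and leave the earlier files in the backup directory untouched. Old files would still need pruning separately, which this sketch does not cover.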

Our hosting provider has been truly brilliant. I have some issues with them relating to how things have gone (which I’ll address with them at a later date, once we have fully recovered and are more robust), but some of the folks on their team did stuff they didn’t have to and were deeply sympathetic to our cause. We could not have recovered so smoothly without them. Or worse, perhaps not at all.

With the machine dead they pulled the hard drives from it and put them in caddies, and connected the drives to a new temporary server. One of their team set about recovering data from the drives, which took a lot of time. Eventually, at around 3am on Saturday 30th May 2020, we had data files, but no idea whether they were okay or how badly corrupted they might be. We have been testing and configuring things ever since.

Now, we are alive again. There are a few minor victims of the failure, but considering the severity, I think we got lucky. Really lucky.

Mistakes made

  1. I misread the email from our hosting provider, so we were unprepared for them to move our server that particular day. We will take much more care from now on (grrr).
  2. We didn’t replace the slave DB fast enough. If the slave DB server had been alive, we would have been down for minutes rather than days. Its purpose was to prevent this exact situation from happening.
  3. I should have developed a more robust backup, even if it was temporary.
  4. Our monitoring system didn’t work properly which meant we weren’t alerted to the server being ‘gone’. Had it alerted us properly, I wouldn’t have erased our latest backup.
  5. I accidentally overwrote our most recent backup in such a way that it was unrecoverable. Our temporary backup solution could have been hacked into a better tool that would have prevented this, but that didn’t happen.
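Mistakes 4 and 5 are linked: a routine that checks whether the server is there before doing anything destructive would have stopped the morning backup. The following is a rough sketch of such a pre-flight check in Python, not our actual monitoring. The function names, the default port 3306 (the usual MySQL port), and the alert callback are assumptions for illustration. A successful TCP connection only shows that something is listening, not that the database is healthy, so treat it as a bare minimum.

```python
import socket


def server_reachable(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def run_backup_if_up(host, port, backup, alert):
    """Call backup() only when the server answers; otherwise call alert()."""
    if not server_reachable(host, port):
        alert(f"{host}:{port} is unreachable; backup skipped")
        return False
    backup()
    return True
```

Here backup and alert are whatever callables you already have, for example a function that starts the dump and one that sends an email or text message.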

I take full responsibility for our mistakes above. Reading back, part of me thinks that it’s just a string of bad luck. But since bad luck can come along anytime, I think that’s just ‘excusifying’. We really need to up our game and ensure we are better prepared to handle things like this without disrupting our service or the enjoyment of it by the PurplePort community.

For me, the biggest mistakes made were by our hosting company. Whilst we were unprepared for this event (our responsibility), it should never have happened (their responsibility). I hope to work with them to make some policy and procedure changes that will better protect all of their customers, and also provide more confidence for their staff when doing things like this. I can’t imagine how awful their team felt about it.

Data that is gone and never coming back

Sadly, we were forced to accept the loss of some statistics data (e.g. profile views, image views, article views, etc. - essentially data showing statistics for anything) as it was corrupt beyond use. We have data for this up to 20th May 2020, but everything between 20th May 2020 and 31st May 2020 is gone. The data is transient and only has a lifespan of about a year, so eventually it would have been gone anyway. But its loss highlights how much could be lost in this sort of event.

Again, we got lucky.

Missing VIP subscriptions 

Any subscriptions that went through during the outage are missing from our records. But, over the next few days, we will enter all missing transactions and get VIP subscription dates updated.

Our plan to ensure this doesn’t happen again

Our new backup is live and so the potential for a tragic comedy in my morning routine is gone. We are reviewing our backup and overall systems approach because, in all honesty, it’s time we grew up a bit more. Our new backup is good for now, but we need something more. 

As we continue to grow in terms of the community, we also need to grow in terms of infrastructure. Sometimes growth comes smoothly, but it’s through pain that we truly grow. This episode has forced us to grow - not smoothly, but by lurching ahead.

We will also develop better ways for our team to handle such failures, especially relating to how we communicate what’s happening (how long we think it might be for, etc). It’s been clear that our communications have room for much improvement.

Thank you for your patience

As a way of apologising, and to show our gratitude for being so supportive and patient with us during this difficult time, we have given everyone 1 month free VIP membership (just in time to get those post-lockdown shoots booked!).

We are so grateful for your patience, and the positive comments of support we have received. It’s been really heartwarming for our whole team and helped us push through this painful episode.

Below are just some of the nicest and most amusing comments we received during the outage :-)

"Get well soon Purplebot! "

"Hope all ok a day without pp is horrible the team know what they doing so hope be back on soon"

"Hope everything is going okay, purpleport hasn't been down for this long In a while."

"Hope that the team are able to fix this, got to be a bad outage issue as it has been down for so long. Good luck fixing the issue."

"Thanks for the update, hope you get it fixed soon."

"Feeling some sympathy for the PP Technical Team, this must be a major issue."

"Thank you for keeping us updated :)"

"Thanks for the update"

"Have you tried turning it off and on again ? "

"Support to the IT team at this time. Better back properly more slowly than back quicker but still sick. Of course we miss the site, but can wait. We do appreciate the updates, even if they can't give definitive information."

"No worries, wash your hands and social distancing will sort it out......"

"Thank you for the update! It's obviously something very serious - I've not known PP be offline for so long before! I hope you are able to sort it - get well soon Purple Bot!"

"Is it because my subscription just expired! I'll stick a tenner in the Leccy meter if you need."

"Well done guys good luck getting things sorted"

"Update is greatly appreciated."

"If we stand outside clapping them will it help? The techie peeps are using a keyboard, so technically key "board" workers "

"But but but *insert rant of your choosing here*. I blame purplebot, he's a wrongun."

"Thanks for the updates. As an ex-IT manager I know what you are going through!"

"An annoyance for members, but not anything more, given the world crisis... It must be appalling for everyone at Purple Towers: my sympathy, thanks, and best wishes to all of you."

"No problem guys, hope you can enjoy some of the weekend"

"Is PurpleBot self isolating?  Thanks for the update - I started to think it was me! "

"Thank you for letting us know what's happening. You have been missed!"

"Good luck guys I know you're doing you're best."

"It's been furloughed.........."

"I think we should all go outside and clap for the PP IT team at 8pm tonight."

 

What is PurplePort

PurplePort is a modelling website that brings models, photographers and other creatives together with one fantastic service. We provide the tools and help you need to get together and create amazing photos.

Established in 2010, PurplePort has grown from strength to strength and now has 40,000+ active members worldwide. With features such as integrated messaging, calendar, shoot plans, image albums, references, credited photos, busy member forums, hundreds of articles, and dedicated full-time staff to help you, it's easy finding the perfect creatives you need for your next photography project. It's a fresh, fast, and feature-rich alternative to ModelMayhem.

Join the fastest growing, most feature-rich service of its kind and start making magical photos a reality!
