Hi everyone, many of you will have noticed a couple of site outages today which our developers have been investigating. We sincerely apologize for the disruption today and for the inconvenience this may have caused. This was not a planned outage and our developers have been diligently investigating the issue and getting our Marketplaces back online as quickly as possible.
There were initially some reports of missing item comments and missing sales, and we’ll have an official update on that investigation soon right here in this thread. If you are still experiencing any issues, please hold off on reporting them until an official announcement has been made. We’d prefer to resolve and recover everything in one fell swoop if possible, which would save you the time of reporting anything to support. Just keep a note of any issues for now. Likewise, please do not open a forum thread about it. If there’s a need to report anything, this can and should be done through a support ticket only, but again, I would strongly recommend holding off until further notice.
We will have someone drop by and provide a full update of what happened as soon as possible. In the meantime, we’ll lock all existing (and any subsequent) duplicate forum threads about this issue so we can channel all official comms here without any important info being lost or overlooked.
Stand by for another update as soon as possible. Thanks again for everyone’s patience, and we apologize again for any inconvenience.
I’m John Viner, the Software Development Manager at Envato. Thanks so much for your patience during the two periods of downtime today, and again, we’re very sorry for any inconvenience this has caused.
Our initial investigation has indicated that between 6:30AM and 7:00AM AEST, after the site came up from the initial downtime, some data was written to one of our non-primary databases instead of the primary database. This explains why some information, such as a small number of sales, appears to be missing. Nothing is actually missing, however: we have all the data. We are now working on merging this data back into our primary database so everything aligns correctly.
We’ve got the whole development team focused on this issue and I’ll continue to keep you informed on this thread as we make progress. Thank you again for your continued patience.
We’ve started the data restoration process and have contacted individuals who were affected. We expect the first updates to come through in the next half hour.
We’ve still got a number of small data inconsistencies that we’ll be continuing to work on. I’ll keep you posted on this thread, and again I appreciate your patience.
The dev team has finished identifying all users affected by the data inconsistencies that occurred between 6:30AM and 7:00AM AEST. We’ve sent direct communication to those users and have restored critical data. Again, I apologise for the inconvenience. We’ve made a number of immediate changes to our infrastructure configuration to prevent this happening again and will be undertaking a major review of the incident.
Since my update late last week, we’re continuing to keep in touch with individuals that have been affected. We’ve also been working hard to identify any additional data inconsistencies and rectify them. In the meantime we’ve undertaken a detailed technical review of the incident and have started to put in place recommendations from this review.
We’ve completed all the investigation and communication with affected community members. If you think there are outstanding issues with your sales, purchases or data and you have not been contacted, please put a ticket into support via support.envato.com. Our internal review of the outage and data issues revealed that:
- we were hit by a spike in traffic which caused a full site outage
- when we recovered from this outage, our application incorrectly connected to one of our secondary databases, for a period of 30 minutes. This was due to a network mis-configuration of that secondary database, which caused it to behave as if it was the primary database
- when we realised the incorrect state of the database we brought the site into maintenance mode, effectively disconnected this secondary database and connected the application to the correct primary database
- during this 30-minute window, transactions were being written to the secondary database, which was meant to be configured in read-only mode
- we didn’t lose any data in this process; however, we had production data on two databases, which is very complicated to merge
- we immediately began work on identifying and reconstructing critical data transactions written to the secondary database. This included performing manual adjustments to users’ accounts as well as replaying sales
- we identified all users that had performed any actions that would have written data to the secondary database and contacted them directly. For the transactions that we did not automatically reconstruct, we recommended that the user perform the action again (for example, rating an item or commenting on an item)
- we have rolled out immediate changes to the database network configuration, along with monitoring that would alert us in future to both mis-configurations (the network mis-configuration and the secondary database accepting writes when it should be read-only)
- we have a plan to improve our automated testing tools to incorporate checks for these kinds of configuration issues in a pre-production environment.
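
For anyone curious what a check for these kinds of configuration issues might look like, here is a minimal sketch in Python. It is not our actual monitoring or testing code. The `NodeStatus` fields, the `find_misconfigurations` function and the `db-1`/`db-2` node names are all hypothetical, and a real check would read each node's state from the database itself instead of hard-coded values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeStatus:
    """Observed state of one database node (hypothetical fields)."""
    name: str
    intended_role: str  # "primary" or "secondary"
    read_only: bool     # True if the node currently rejects writes


def find_misconfigurations(nodes: List[NodeStatus], app_write_target: str) -> List[str]:
    """Return a list of problems; an empty list means no issue was detected."""
    problems: List[str] = []

    # There should be exactly one node intended to act as the primary.
    primaries = [n.name for n in nodes if n.intended_role == "primary"]
    if len(primaries) != 1:
        problems.append(f"expected exactly one primary, found {len(primaries)}")

    # A secondary that accepts writes can end up holding production data.
    for node in nodes:
        if node.intended_role == "secondary" and not node.read_only:
            problems.append(f"{node.name}: secondary accepts writes")

    # The application must send its writes to the primary only.
    if app_write_target not in primaries:
        problems.append(f"application writes go to {app_write_target}, not the primary")

    return problems


# Hypothetical topologies: a healthy one, and one resembling the incident.
healthy = [NodeStatus("db-1", "primary", False), NodeStatus("db-2", "secondary", True)]
incident = [NodeStatus("db-1", "primary", False), NodeStatus("db-2", "secondary", False)]

print(find_misconfigurations(healthy, "db-1"))
print(find_misconfigurations(incident, "db-2"))
```

Run against a pre-production environment or on a schedule in production, a non-empty result from a check like this could fail a deployment or raise an alert before any writes reach the wrong database.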
Once again, thank you for bearing with us as we got to the bottom of this incident. We’ll be working hard to make sure this doesn’t happen again.