In May 2023, the Google Cloud VMware Engine (GCVE) team was enjoying a routine day when they received a high-priority ticket from UniSuper, a major Australian superannuation fund managing approximately AUD 135 billion for over 660,000 members. UniSuper required a private cloud with specific capacity that couldn’t be provisioned using Google Cloud’s standard public interface. To meet these needs, Google engineers deployed the cloud manually using an internal tool designed to handle such requests.

The deployment seemed successful. UniSuper’s private cloud was up and running, integrated with their existing VMware tools, and everything appeared to be functioning smoothly. However, exactly one year later, on May 1, 2024, UniSuper’s development team noticed something alarming: their entire cloud environment had disappeared, taking critical financial services and data with it. The audit logs showed no record of any engineer deleting the environment, which added to the confusion and made finding the cause more urgent.

Google’s team quickly began investigating and discovered that the cloud had been deleted automatically because its one-year fixed term had expired. This auto-deletion was highly unusual, as it wasn’t a standard feature available to customers. Further investigation revealed that when Google engineers manually created UniSuper’s cloud a year earlier, they inadvertently left a crucial parameter unset. As a result, the cloud defaulted to a test setting that included automatic deletion after one year, a setting intended for internal testing environments rather than customer production use.

The immediate task was to restore the infrastructure and get UniSuper’s services back online. Reprovisioning the infrastructure was straightforward, but recovering the lost data required a mix of manual effort and newly developed automation tools. Fortunately, UniSuper had been diligent about their backups, storing copies of their data both in Google Cloud Storage and with another provider. This foresight prevented a complete loss of financial data, which could have been catastrophic given the scale of the funds managed.

As UniSuper and Google worked to restore services, customers began to express concerns, especially since they could no longer access or view their accounts. Fearing a potential cyberattack, users demanded answers. UniSuper issued a public statement, reassuring customers that the issue stemmed from a third-party service provider and was not the result of malicious activity. They later updated the statement to name Google Cloud as the responsible party.

Despite initial skepticism, the situation was clarified on May 8th, when Google Cloud’s CEO confirmed that Google was at fault. By then, UniSuper’s services had been partially restored, and customers could once again access their accounts. Full restoration was achieved by May 15th, two weeks after the accidental deletion.

This incident underscores several critical lessons in cloud management. For Google, it highlighted the risks of using internal tools with default behaviors that are not suitable for production environments. The fact that a single missing parameter could lead to such a severe outcome raised questions about the robustness of their deployment processes.
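One way to guard against this class of failure is to make a provisioning call refuse to guess a lifetime when none is supplied. The sketch below is purely illustrative and is not Google’s internal tool: the names `Term`, `PrivateCloud`, and `provision` are hypothetical, and the 365-day term is an assumption standing in for the one-year fixed term described above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Term(Enum):
    # Hypothetical term types, invented for this illustration.
    FIXED_ONE_YEAR = "fixed_one_year"  # test-style environment, deleted after a year
    INDEFINITE = "indefinite"          # production environment, never auto-deleted


@dataclass(frozen=True)
class PrivateCloud:
    name: str
    term: Term
    delete_on: Optional[date]  # None means no deletion is scheduled


def provision(name: str, created: date, term: Optional[Term] = None) -> PrivateCloud:
    """Create a cloud record, failing loudly if the term was left blank."""
    if term is None:
        # The risky alternative would be to fall back silently to
        # Term.FIXED_ONE_YEAR here, which schedules a deletion nobody asked for.
        raise ValueError("term must be set explicitly; no default is applied")
    if term is Term.FIXED_ONE_YEAR:
        delete_on = created + timedelta(days=365)
    else:
        delete_on = None
    return PrivateCloud(name=name, term=term, delete_on=delete_on)


if __name__ == "__main__":
    cloud = provision("customer-prod", date(2023, 5, 1), Term.INDEFINITE)
    print(cloud.delete_on)
```

The design choice is simply that an omitted parameter becomes an immediate error at creation time rather than a scheduled deletion that surfaces a year later.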

For UniSuper, the situation reinforced the importance of comprehensive backup strategies. Even though they had done everything right in terms of data protection, this event showed that relying on a single cloud provider, no matter how reputable, carries inherent risks. The 3-2-1 backup rule—having three copies of your data on two different types of media, with one stored off-site—proved to be essential in recovering from this unexpected disaster.
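The 3-2-1 rule is simple enough to check mechanically against a backup inventory. The following is a minimal sketch under stated assumptions: `BackupCopy` and `satisfies_3_2_1` are hypothetical names, the primary copy is counted as one of the three, and the sample inventory is invented for illustration rather than taken from UniSuper’s actual setup.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    location: str  # illustrative label for where the copy lives
    media: str     # illustrative media type, e.g. "object-storage"
    offsite: bool  # True if stored away from the primary environment


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """Return True if the inventory has 3+ copies, 2+ media types, 1+ off-site."""
    inventory = list(copies)
    enough_copies = len(inventory) >= 3
    enough_media = len({c.media for c in inventory}) >= 2
    has_offsite = any(c.offsite for c in inventory)
    return enough_copies and enough_media and has_offsite


if __name__ == "__main__":
    # Invented example inventory: a primary plus two backups.
    inventory = [
        BackupCopy("primary-cloud", "block-storage", offsite=False),
        BackupCopy("same-provider-bucket", "object-storage", offsite=False),
        BackupCopy("second-provider", "object-storage", offsite=True),
    ]
    print(satisfies_3_2_1(inventory))
```

A check like this could run on a schedule so that a dropped backup target is noticed before it is needed.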

Ultimately, while UniSuper’s service was restored, the incident served as a reminder to all organizations about the potential dangers of cloud computing. It’s a powerful technology, but like any tool, it requires careful management and a strong contingency plan to mitigate risks.

This article is based on the insights shared in this YouTube video.
