Minimizing Data Loss by Simplifying Complex Migrations
May 8, 2022
If you’ve ever designed a database schema, you know the pain involved in introducing any changes to it down the road - be it optimizing columns, splitting tables, or defining a new data relation. To top it all off, the fear of breaking compatibility with previous schemas is always breathing down your neck. So, how can we deal with not-so-simple DB migrations and still keep our cool?
### Initial Schema: The All-In-One Table
As a dev env solution, we want to keep track of our users (developers) and their git tokens (to allow cloning code into our clusters), which users provide by authenticating with our app via GitHub/Bitbucket.
And so, initially, we were storing users along with the git token in a single DB row.
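In GORM terms, that all-in-one row might have looked something like this (a sketch - the field names are illustrative, not our actual model):

```go
package main

import "fmt"

// User sketches the original all-in-one table: user identity and the
// git token share a single row. Field names are illustrative.
type User struct {
	ID       uint `gorm:"primaryKey"`
	Name     string
	GitToken string // the token column we would later regret inlining
}

func main() {
	u := User{ID: 1, Name: "dev", GitToken: "ghp_example"}
	fmt.Printf("user %q carries token %q on her own row\n", u.Name, u.GitToken)
}
```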
### Winds of Change: Just a “Tiny” Migration
As part of our public beta preparations, we wanted to switch to a many-to-many relation between users and git tokens, as we’ve learned from the following use cases:
- **A single user interacting with Raftt over different git providers** such as GitHub, Bitbucket, etc.
- **Multiple users interacting with Raftt using the same git token** - most common in small companies where the entire development team uses the same git account.
This migration called for a new schema, in which our beloved all-in-one “users” table was now split in two:
- Data relevant to the user itself, *e.g.* their ID and name, was left intact.
- Token data was extracted to a new table (“tokens”), without a reference to specific user information.
To link these two tables, a third table (“user_tokens”) was introduced, where each row represented a link between a single user and a single token.
*But wait!* Now that we have this design, why not take it further and get a little bit more familiar with our tokens? Introducing an enhanced **Token** schema:
- Store the provider from which the token was requested (GitHub, Bitbucket, ...)
- Use the provider API to associate the token string with its corresponding git account identifier.
Now, the previously simple migration of extracting a table column into a new table becomes far more error-prone due to the 3rd-party API calls involved.
### The Test That Got Away
On the surface, this migration sure looks simple enough. We would use GORM’s auto-migration capability, which would pick up on tables and columns being added or removed, and call the git provider API to fetch the token-relevant data for each of the existing tokens. Obviously, we would be testing this migration on a development database.
Our test data at this point was very straightforward:
- Test a user registered via GitHub;
- Test a user registered via Bitbucket.
*And.... test!* Not surprisingly, this test succeeded, and our two beloved cultural icons had their tokens extracted to our new “tokens” table. Guess we’re ready for production, right?
### DB 101: How to Lose Data Using a Single Command
Confident in our new DB schema and migration, we moved on to pre-prod testing, *i.e.* cloning our production DB data into a dev DB and using it to test our newest migration.
In hindsight, we should have anticipated what would happen next. In the words of Miriam Roth - *“Suddenly.. Boom! Pow! What happened?”* - the test failed and the database was now corrupted! However, this is *not* the end for every balloon, and had we considered the following - that balloon would have lived happily ever after:
1. Our test data was very optimistic, assuming each user somehow registered in our backend using one of the git providers, when in fact our production users were completely different:
    - Many tokens were invalid, which caused the API calls to the git provider to fail.
    - Some users had tokens from both providers. Some had none.
    - Some of the tokens were shared between multiple users.
2. We were using an auto-migration tool based on schema changes, so the user’s *token* column would be dropped regardless of whether the migration itself succeeded (with no rollback available to save the day).
### Handling Complex Migrations: The Raftt Way
Our development team previously relied on auto-migration for all DB migrations, letting the tool apply schema changes automatically. However, that’s not the right way to handle vulnerable schema changes.
Our supposedly single migration actually consisted of several mini-migrations, and could have been broken down into smaller, recoverable ones:
1. Extract the user’s token(s) into the “tokens” table, without removing them from the “users” table just yet. The token provider (GitHub/Bitbucket) can be inferred instantly; however, the user’s account ID (which requires an API call to the git provider) is not relevant at this point - let’s leave it empty for now.
2. For each token in the new table, try to infer its associated provider account ID. Also, try not to panic if this fails for some tokens - this is expected due to missing/invalid tokens.
3. From the “tokens” table *only*, remove invalid tokens (the ones that failed in the previous step).
4. Finally, remove the token columns from the “users” table.
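The four steps above can be sketched as plain functions over in-memory data (a simulation, not our actual GORM code - the `lookup` callback and the `inferProvider` heuristic stand in for the real git provider API):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// In-memory stand-ins for the real tables (illustrative only).
type User struct {
	ID       uint
	Name     string
	GitToken string // legacy column, removed only in the final step
}

type Token struct {
	Value     string
	Provider  string
	AccountID string // filled in step 2; empty means unresolved
}

// Step 1: copy tokens into the new table without touching "users".
// Shared tokens become a single row; users without tokens are skipped.
func extractTokens(users []User) []Token {
	seen := map[string]bool{}
	var tokens []Token
	for _, u := range users {
		if u.GitToken == "" || seen[u.GitToken] {
			continue
		}
		seen[u.GitToken] = true
		tokens = append(tokens, Token{Value: u.GitToken, Provider: inferProvider(u.GitToken)})
	}
	return tokens
}

// inferProvider is a toy heuristic standing in for real provider detection.
func inferProvider(token string) string {
	if strings.HasPrefix(token, "ghp_") {
		return "github"
	}
	return "bitbucket"
}

// Step 2: best-effort account-ID resolution; failures are expected
// and must not abort the migration.
func resolveAccountIDs(tokens []Token, lookup func(string) (string, error)) {
	for i := range tokens {
		if id, err := lookup(tokens[i].Value); err == nil {
			tokens[i].AccountID = id
		}
	}
}

// Step 3: drop only the tokens that failed resolution, from "tokens" alone.
func pruneInvalid(tokens []Token) []Token {
	var valid []Token
	for _, t := range tokens {
		if t.AccountID != "" {
			valid = append(valid, t)
		}
	}
	return valid
}

// Step 4: only now is it safe to drop the legacy column from "users".
func dropTokenColumn(users []User) {
	for i := range users {
		users[i].GitToken = ""
	}
}

func main() {
	users := []User{{1, "alice", "ghp_ok"}, {2, "bob", "stale"}, {3, "carol", "ghp_ok"}}
	lookup := func(tok string) (string, error) {
		if tok == "ghp_ok" {
			return "acct-42", nil
		}
		return "", errors.New("provider rejected token")
	}
	tokens := extractTokens(users)
	resolveAccountIDs(tokens, lookup)
	tokens = pruneInvalid(tokens)
	dropTokenColumn(users)
	fmt.Printf("%d valid token(s) survived the migration\n", len(tokens))
}
```

The key property is that each step is independently recoverable: if step 2 blows up halfway through, the original token column is still sitting untouched in “users”.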
## Lesson Learned: What’s Next?
As painful as our experience was, it was a chance for us to revamp our procedure of introducing database migrations.
### Preventing Errors
- Always ask ourselves - is this really only *one* migration? Can we break it down any further?
- Enabling auto-migration is dangerous when a migration consists of multiple steps - it only takes one failed step to lose data. We should only use such tools *after* the vulnerable, error-prone logic has run successfully.
### Avoiding Downtime
Luckily for us, the data loss never reached our production database. Still, we should make an effort to ensure that even if it did, it would not harm us. That could mean either of the following:
- Add one or more replicated database containers to ensure data availability during migrations.
- Create database snapshots to serve as a recovery point, should any migration fail in the future.
### The *Right* Way to Test Migrations
We can’t always trust ourselves to create perfect test data. As pointed out earlier, ours was missing many use cases that actually existed in our production DB. So:
- If possible, test our migration on a pre-production database containing data cloned from the production database. That way, we don’t have to wonder whether we’ve covered all use cases. If for any reason this is *not* possible, ask a colleague to provide some useful test data - still better than a single developer writing both the migration logic and its test data!
- It’s safe to say that we should *never* use our production database to perform such tests.