How to run Zero-Downtime Migrations on an RDBMS (like MySQL)

Being able to migrate your database is an essential part of the health of any application in production.

Without the ability to refactor or rethink your data model, things will either become slow or technical debt will pile up.

Migrations are essential… but they can also cause the app to break.

Say you want to add a column to a 20-million-row table. In a database like MySQL, that will most likely lock the entire table for the duration of the ALTER command.

Locking tables could mean your clients are losing money, or you’re depleting your own bank account; either way, someone will be angry. Never mind the political consequences of having your app go down randomly throughout the week.

That risk is why zero-downtime migrations are so essential to your business and your development team. This post outlines the different zero-downtime migration methods I’ve found useful.

In general, all of these can be achieved either with plain SQL or by relying on a tool like pt-online-schema-change. The basic idea is the same for all four types of migration: limit the number of ALTER commands run against a live table.

The Basic Zero-Downtime Migration Algorithm

The basic procedure is as follows (this is MySQL-specific, but the same idea applies to Postgres and other databases):

  1. Create a copy of the table schema based on the original: CREATE TABLE _receipts_new LIKE receipts.
  2. Add triggers to catch new writes.
  3. Chunk through all data in the original table and insert it into the new table.
  4. Rename the old table to _receipts_old and the new table to receipts.
  5. Drop the triggers.

1. Create a copy of the table schema

The idea here is that we apply whatever change we want (we’ll get into the specific types in a bit) to the copy: we add, rename, change the type of, or drop the column here, rather than on the live table.

For instance, say we want to remove `url` from receipts (let’s say we’re going to generate these URLs dynamically instead).

CREATE TABLE _receipts_new LIKE receipts;

ALTER TABLE _receipts_new DROP COLUMN `url`;

2. Add triggers to catch new data

Triggers are the bread and butter of zero-downtime migrations. We could easily migrate all the data from the old table into the new table right now, but we would miss any writes that happen in the meantime. The triggers keep the new table up to date as the original table continues to change.

The three triggers below capture every insert, update, and delete on the original table.

CREATE TRIGGER `after_receipts_ins` AFTER INSERT ON receipts FOR EACH ROW REPLACE INTO _receipts_new (id, ...) VALUES (NEW.id, ...);

CREATE TRIGGER `after_receipts_upd` AFTER UPDATE ON receipts FOR EACH ROW REPLACE INTO _receipts_new (id, ...) VALUES (NEW.id, ...);

CREATE TRIGGER `after_receipts_del` AFTER DELETE ON receipts FOR EACH ROW DELETE IGNORE FROM _receipts_new WHERE id = OLD.id;

3. Chunk through all the data and insert into the new table

Unfortunately, plain SQL doesn’t give you a great way to chunk through a table. MySQL, for example, lets you use LIMIT / OFFSET, but it gets slower the deeper you page into the table.

For this part, it’s smart to use a Ruby, Python, or JavaScript script. The idea is effectively this:

Determine a reasonable chunk size (say 1000) and then iterate through the primary keys, incrementing by 1000 each time. Here’s a Rails-style version of this.

id = Receipt.first.id # this will be kind of weird if you have changed the default scope...
max_id = Receipt.last.id

id.step(max_id, 1000) do |i|
  ActiveRecord::Base.connection.execute(<<-SQL)
    REPLACE INTO _receipts_new (id, …)
    SELECT id, … FROM receipts WHERE id BETWEEN #{i} AND #{i + 999}
  SQL
end

This script chunks through all of the available records and copies them into the new table. Once it finishes, the new table should be up to date.

4. Optional: Check the integrity of the table

You can run a quick gut check to make sure everything made it over with something like this. Depending on the size of the table, this query might take a while to run.

SELECT count(1) FROM receipts r LEFT JOIN _receipts_new rn ON rn.id = r.id WHERE rn.id IS NULL;

This should return a count of 0. If it doesn’t, that means you have missed some data.

5. Rename the table and drop the triggers

Now it’s time to finish.

MySQL lets you do both renames in a single atomic statement, so there’s never a moment when the receipts table doesn’t exist:

RENAME TABLE receipts TO _receipts_old, _receipts_new TO receipts;

DROP TRIGGER after_receipts_ins;

DROP TRIGGER after_receipts_upd;

DROP TRIGGER after_receipts_del;

-- After you verify you can drop the old table too

DROP TABLE _receipts_old;

If at this point everything goes to hell, make sure you are prepared to rename _receipts_old back to receipts:

RENAME TABLE receipts TO _receipts_new, _receipts_old TO receipts;


But this shouldn't happen unless the code has a bug.

The 4 Types of Zero-Downtime Migrations

When it comes to columns, there are only four types of schema migration:

  • Add a column
  • Change a column type
  • Rename a column
  • Delete a column

How to Add a Column

Now I’m going to get into the specifics of each type of migration. Each of these four types has its own nuances, mostly related to code. For instance, code that references a column that doesn’t exist will cause production issues.

But let’s start with the easiest one first: adding a column. Unless your code is a mutant, it won’t reference a column that shows up out of nowhere. Adding a column this way is quite simple: you can add the column before any code changes go out, then ship the code that uses it once the column is live.

[Figure: add_a_column.png]

There is one possible downside: adding a column that requires something to be in it. For example, let’s say you add a password field to a table. What then?

Of course, this will depend on the code in question, but you can write code that fills in the field if it exists and ignores it if it doesn’t.

One important detail: if you are deriving the new column’s value from other columns, you will need to populate it as you backfill the table, and also update the triggers to reflect the change.
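
As a concrete sketch of that, here’s what the copy-table step from earlier looks like when adding a column; the password_digest name and type are hypothetical placeholders, not something from the original schema.

CREATE TABLE _receipts_new LIKE receipts;

-- Nullable, so the triggers and the chunked backfill can keep copying rows
-- from receipts unchanged; application code fills the value in later.
ALTER TABLE _receipts_new ADD COLUMN `password_digest` VARCHAR(255) NULL;

Making the new column nullable (or giving it a DEFAULT) is what lets the existing trigger and backfill statements keep working without modification.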

You Can Also Change a Column Type

Changing a column type can have substantial performance payoffs. For instance, replacing things like BLOBs or oversized VARCHARs with smaller types can reduce a table’s memory footprint.

In general, a column type change shouldn’t change the fidelity of the data. As a result, it should be about as simple as adding a column, as above.

For example, changing a signed INT to an unsigned TINYINT doesn’t change anything, as long as every existing value fits in the new range (0–255 for an unsigned TINYINT). The same goes for converting a VARCHAR to an ENUM, provided every existing value is in the enum’s list.

[Figure: change_column_type.png]

When it comes to populating the table, you might need to cast values accordingly. Or you may discover that some of the data doesn’t quite line up with what you thought was there.
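
Here’s a minimal sketch of that, assuming a hypothetical status column being converted from VARCHAR to ENUM (the column name and values are illustrative):

ALTER TABLE _receipts_new MODIFY COLUMN `status` ENUM('pending', 'paid', 'refunded') NOT NULL;

-- The chunked backfill keeps the same shape; MySQL converts the strings on insert.
REPLACE INTO _receipts_new (id, status, …)
SELECT id, status, … FROM receipts WHERE id BETWEEN 1 AND 1000;

If a stored value isn’t in the ENUM list, the backfill will either fail or silently insert an empty string depending on sql_mode, which is exactly the kind of mismatch the paragraph above warns about.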

Maybe you need to rename a column

Renaming a column is where things get a bit trickier. Renaming a column from, say, pdf_identifier to pdf_id can have far-reaching consequences in code. If you change the database and the field disappears, the code won’t know what to do.

That is why it’s important to ship code that accepts both names before migrating the table.

[Figure: rename_column.png]

This connection between data and code adds complexity to the deployment process:

  • Deploy code that allows both pdf_identifier and pdf_id as fields
  • Migrate table from pdf_identifier to pdf_id
  • Ship code that removes references to pdf_identifier

That is three times more complicated than most deploys! So if you can, it’s best to avoid these triple deploys. If a table is small enough (<50k rows), it’s probably safe to just migrate it with a plain ALTER command.
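
For the migration itself, the copy-table recipe only changes in two places. Here’s a sketch using the pdf_identifier -> pdf_id example; the VARCHAR(255) type is assumed for illustration, and the column lists are elided as in the earlier triggers.

ALTER TABLE _receipts_new CHANGE COLUMN `pdf_identifier` `pdf_id` VARCHAR(255);

-- The triggers and backfill must map the old name to the new one explicitly:
CREATE TRIGGER `after_receipts_ins` AFTER INSERT ON receipts FOR EACH ROW REPLACE INTO _receipts_new (id, pdf_id, ...) VALUES (NEW.id, NEW.pdf_identifier, ...);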

Here’s how to delete a column (without screwing up)

Removing a column is by far the hardest to do without downtime or errors in judgment. Whenever you remove a column, remember that the data doesn’t go to a trash bin; it’s gone forever. So when customer data is involved, it’s important to be careful!

Pulling this off effectively requires the following steps (a SQL sketch follows the list):

  • Deploy code that stops referencing the old column
  • Migrate column to a renamed version of the same. For example, pdf_identifier -> pdf_identifier_deleted
  • Wait a week
  • If no errors crop up then remove column through a zero-downtime migration
  • If errors crop up then hotfix code and wait another week

[Figure: delete_a_column.png]
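
Here’s roughly what that looks like with the copy-table method, using the pdf_identifier example; the VARCHAR(255) type is assumed for illustration.

-- First pass: rename instead of dropping, so the data is still recoverable.
ALTER TABLE _receipts_new CHANGE COLUMN `pdf_identifier` `pdf_identifier_deleted` VARCHAR(255);

-- A week later, if nothing has broken, a second zero-downtime migration drops it for good.
ALTER TABLE _receipts_new DROP COLUMN `pdf_identifier_deleted`;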


Conclusion

Zero-downtime migrations can be tricky to pull off by hand in SQL, so whenever possible try to rely on a tool like pt-online-schema-change, which will automate much of this for you.
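
For reference, a typical invocation looks something like this; the mydb database name is a placeholder, and it’s worth running with --dry-run before --execute:

pt-online-schema-change --alter "DROP COLUMN url" D=mydb,t=receipts --dry-run

pt-online-schema-change --alter "DROP COLUMN url" D=mydb,t=receipts --execute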

Even though that tool works well, it can sometimes not behave as expected, which is why this article exists.

These migrations take a long time to execute and are tedious, but they can save you from losing money and, possibly, your customers.

If your app isn’t online, you can easily lose thousands of dollars a minute as your users either go elsewhere or can’t make money themselves... which easily justifies the cost of having someone methodically work through zero-downtime migrations.