We’re building our web framework from scratch at my new company, so I built a migration system that addresses a lot of the pain I had with Django migrations. Code here.
Django migrations
First, Django migrations are fantastic. They’re a massive improvement over using SQL directly. And not in a “SQL is hard and Python is easy” sense; it’s more like asymptotically better. (I’m obviously not talking about cases where Django isn’t powerful enough to do what you need from SQL.) In a pure SQL world, there’s no immediate answer to “what state am I in right now?”. You have to look at your local db or your prod db to determine the current state, or you need to look at your migration history and add it all up. Then you need to answer “how do I get from this state to the state I want to be in?”, which is easy if you’re just adding a column or something, but harder in cases that are more dependent on the current state. And the real problem is that these steps get more difficult the larger your schema is, so you’re only getting slower over time.
Django migrations let you declare your schema in a state-based way instead of an action-based one. You just have to say “my schema should be X”, and it knows how to generate the migrations that get you into that state. You always have an immediate answer to the current state (which is nice even when you’re doing normal work and not changing schemas), and you spend zero time figuring out how to get from your previous state to your desired state.
As with most improvements, Django migrations increased my throughput to the point where new bottlenecks became apparent (as in, maybe they cut out 90% of the wasted time, but now I do 10x as much, so even though I’ve come out massively positive, the friction is just as noticeable).
The main problem is how much time it takes to run migrations from scratch, and how frequently you need to do this (in local/test environments). Running from scratch takes a long time because you have to run your entire migration history. So if you have a table with 5 columns, you’re probably not running “Create this table with 5 columns”, you’re running “Create this table with 2 columns, add a column, rename a column, add 2 more columns, make this column non-null, etc”. Obviously resetting the db is going to take some time, but this path dependence makes the amount of time grow much faster (it doesn’t take long to hit minutes).
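To make the path dependence concrete, here’s a pure-Python toy (no Django, made-up table and column names) that replays a migration history the way a from-scratch rebuild does. Six historical steps get executed to reach a state that a single CREATE TABLE could express:

```python
# Toy model of replaying migration history on a fresh db.
# Each step is (operation, table, *args); names are invented for illustration.
history = [
    ("create_table", "book", ["id", "title"]),
    ("add_column", "book", "author"),
    ("rename_column", "book", "title", "name"),
    ("add_column", "book", "isbn"),
    ("add_column", "book", "published"),
    ("alter_column_not_null", "book", "author"),
]

def replay(steps):
    """Apply every historical step in order, like `migrate` on an empty db."""
    tables = {}
    for op, table, *args in steps:
        if op == "create_table":
            tables[table] = set(args[0])
        elif op == "add_column":
            tables[table].add(args[0])
        elif op == "rename_column":
            old, new = args
            tables[table].remove(old)
            tables[table].add(new)
        elif op == "alter_column_not_null":
            pass  # constraint change only; the column set is unchanged
    return tables

# Six steps of work to reach a state one step could have expressed:
print({t: sorted(cols) for t, cols in replay(history).items()})
# {'book': ['author', 'id', 'isbn', 'name', 'published']}
```

In a real schema each step is a round-trip to the database, so the rebuild cost grows with the length of the history rather than the size of the final schema.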
Ideally you don’t have to run migrations from scratch often. You might have to do it if you run migrations on a branch, push it up for review, and then start a different branch before the first one lands. Another really annoying situation is when you iterate on model changes within a PR. Say you add a column, then realize you want it to have a different name, then realize it needs to be nullable. First, you have to run the command to make migrations after every change (or more realistically you run tests, see them fail, and then run the command). And then when you’re done you have three options. 1) Delete these migrations, make migrations again, and reset the db (which, as described above, takes forever). 2) Figure out the name of your first migration, run the command to reverse migrations to that point, delete the migrations, and make migrations again (no db reset needed). Or 3) land all 3 migrations and accept that you are adding 3x as much to the db-resetting burden as necessary. Django has a migration squashing system that is meant to address this, but it takes a lot of time and it doesn’t work well enough to really solve the problem.
My migrations
After 10 years working with Django migrations, I realized that migration files are the problem and we can simply not have them, so I built a system that determines migrations and runs them on the fly. It takes model definitions and converts them to some representation of what state the db should be in, compares this to a record of what state the db is actually in, runs the migrations that get you into the correct state, and then updates that record to match. This last step is the key difference. In Django, the migration files are both a plan for what you want to do and a record for rebuilding the state you’re in. But if you can accept not having the plan written out, you can store the state directly instead of as a series of steps. Kind of like Terraform for database schemas. I still only support pretty basic use cases, but I have the general structure down, extending it to handle more cases is very easy, and it’s only ~500 lines of code.
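A minimal sketch of the diff-and-apply idea (not the author’s actual code, and far simpler: real schemas also track column types, nullability, indexes, and so on). Compare the schema the models want against the stored record of the schema the db has, and emit only the operations that differ:

```python
# Each schema is {table_name: {column_name, ...}} — a deliberately tiny model.
def diff_schema(current, desired):
    """Return (op, table, detail) tuples that turn `current` into `desired`."""
    ops = []
    for table, cols in desired.items():
        if table not in current:
            ops.append(("create_table", table, sorted(cols)))
        else:
            for col in sorted(cols - current[table]):
                ops.append(("add_column", table, col))
            for col in sorted(current[table] - cols):
                ops.append(("drop_column", table, col))
    for table in current.keys() - desired.keys():
        ops.append(("drop_table", table, None))
    return ops

# Stored record of what the db looks like vs. what the models now declare:
current = {"book": {"id", "title"}}
desired = {"book": {"id", "title", "author"}, "review": {"id", "text"}}

for op in diff_schema(current, desired):
    print(op)
# ('add_column', 'book', 'author')
# ('create_table', 'review', ['id', 'text'])
```

After applying the ops, you overwrite the stored state record with `desired` — there are no migration files anywhere, so there is nothing to squash and nothing to replay.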
This removes pretty much all friction around migrations. No generating migration files, no squashing, and no path-dependence when rebuilding db from scratch (which rarely has to happen now anyway). You can make 100 small changes to models as part of a PR, and each time you run tests the migration step just gets you into the right schema. You can abandon a PR with model changes and go back to main and you’ll just get migrated back into main’s schema the first time you run anything. These have made a huge difference in how quickly I can move on code that involves model changes.
The lack of explicit migration files feels a bit scary, but to be fair it’s not like anyone actually reviews them carefully anyway. But there are two things about this system that still feel a little sketchy.
First, the migrations you run in prod aren’t the same as those that run in dev or locally. If you add a nullable column and deploy to dev, then make the column non-null and deploy to dev and prod, the two environments get to the final state in two different ways.
Second, you decide on a set of migrations at deploy time instead of at development time. This was the whole point, but it also adds some potentially disastrous failure modes. While I trust that my system isn’t going to make random mistakes, a borked deploy (e.g. one that leaves us thinking we have no models for some reason) would have us delete all our models.
My general approach, when eliminating friction makes things unacceptably dangerous, is to proceed anyway and find a way to address the danger directly, and I usually come out positive. In this case, my plan is to build a review system around prod deploys. An attempted deploy will compare the new schema with what’s actually in the prod db, and I’ll have to confirm that I’m okay with the migrations. Dangerous operations like dropping a column or adding a column without a default can be flagged extra clearly. Approving the deploy will whitelist that set of operations for an hour or so, and if the actual deploy somehow ends up with a different set of operations it’ll just bail. To be extra safe, I’ll also make column/table drops just be renames, with a cleanup job that drops them for real a few days later.
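The approval flow above could be sketched like this (a hypothetical implementation, not the real one: the function names, the in-memory approval store, and the exact one-hour window are all assumptions). Hash the proposed operation set when the operator approves it, and let the deploy proceed only if it computes the same hash within the window:

```python
import hashlib
import json
import time

APPROVAL_TTL = 3600  # seconds; "an hour or so"
DANGEROUS = {"drop_table", "drop_column"}  # flagged extra clearly
_approvals = {}  # plan_hash -> time of approval (in-memory for the sketch)

def plan_hash(ops):
    """Fingerprint an operation set, order-independently."""
    return hashlib.sha256(json.dumps(sorted(ops)).encode()).hexdigest()

def approve(ops):
    """Operator reviews the plan; dangerous ops are called out loudly."""
    for op in ops:
        if op[0] in DANGEROUS:
            print(f"DANGEROUS: {op}")
    _approvals[plan_hash(ops)] = time.time()

def may_deploy(ops):
    """True only if this exact op set was approved within the TTL."""
    approved_at = _approvals.get(plan_hash(ops))
    return approved_at is not None and time.time() - approved_at < APPROVAL_TTL

ops = [["drop_column", "book", "isbn"], ["add_column", "book", "subtitle"]]
approve(ops)
assert may_deploy(ops)
# A deploy whose computed ops differ from the approved set just bails:
assert not may_deploy(ops + [["drop_table", "review"]])
```

Hashing the whole plan (rather than approving “deploys” generically) is what makes the race safe: if the schema drifts between approval and deploy, the computed operations change, the hash no longer matches, and the deploy refuses to run.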
My code is here. It’s not directly usable or anything, and it’s pretty chaotic, but feel free to copy it or take ideas from it.