Aug 26, 2020

Large Data Migrations in Django

I’ve been working with a Django project with a large dataset and wanted to insert a bunch of inter-related data in a migration. While Django has support for data migrations via the migration framework, it doesn’t do well with large datasets. This page has a really good method of inserting large amounts of data, but it doesn’t really work when the data is inter-related – i.e. spread across different tables (or objects) with foreign keys between them.

I’m using this CSV file of US states and territories, counties, and cities. The models for them are pretty straightforward:

from django.db import models


class State(models.Model):
    name = models.CharField(max_length=50)
    symbol = models.CharField(max_length=2)

    def __str__(self):
        return self.name


class County(models.Model):
    name = models.CharField(max_length=100)
    state = models.ForeignKey(State, on_delete=models.CASCADE)

    def __str__(self):
        return self.name


class City(models.Model):
    name = models.CharField(max_length=100)
    county = models.ForeignKey(County, on_delete=models.CASCADE)

    def __str__(self):
        return self.name

The first thing I did was make an empty migration, via:

./manage.py makemigrations --empty <app>

It will give you the name of the file it generated – open it up in your favorite editor. We are going to set it up similar to the data migration described in the Django documentation, but we will have to change a few things.
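For reference, the file Django generates is just an empty skeleton; the data-loading function (which we’ll write below – I’m calling it load_places here, but the name is up to you) gets wired in with a RunPython operation, just like in the documentation example. The app label and dependency below are placeholders for whatever Django generated for your project:

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ("myapp", "0001_initial"),  # placeholder – keep whatever Django generated
    ]

    operations = [
        # Run our data-loading function (defined below) as part of the migration
        migrations.RunPython(load_places),
    ]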

The first change is that Django runs its migrations in one transaction, which is usually fine, but we are going to be inserting tens of thousands of records… one transaction would be dangerous. So we need to turn off the atomic transaction – easily done by adding a line to our migration class:

class Migration(migrations.Migration):
    atomic = False

Then I created a function in my migration file that parses the file and inserts the data. It’s a basic CSV-parsing function… but I wanted to handle the transactions myself. Django seems to think that this is dangerous and glosses over how to do it. But it’s easy to do, and if you code it carefully – for example, using get_or_create so a re-run doesn’t duplicate rows – it’s not even that dangerous.
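The overall shape of that function is something like the sketch below. The apps and schema_editor arguments are what Django passes to any RunPython function; the app label, CSV path, and function name are placeholders, and apps.get_model gives you the historical versions of the models, which is the usual recommendation for data migrations:

import csv


def load_places(apps, schema_editor):
    # Historical versions of the models for this migration state
    State = apps.get_model("myapp", "State")    # "myapp" is a placeholder app label
    County = apps.get_model("myapp", "County")
    City = apps.get_model("myapp", "City")

    obj_count = 0
    with open("us_cities_states_counties.csv") as csvfile:  # placeholder path
        for row in csv.DictReader(csvfile):
            # transaction handling and get_or_create calls go here – see below
            pass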

First, turn off auto-commit (auto-committing each statement is what Django does by default). This uses the transaction module, so add from django.db import transaction to the top of the migration file:

    transaction.set_autocommit(False)

And then keep count of how many objects you have created. I use the Django ORM get_or_create method, which makes this pretty easy.

state, created = State.objects.get_or_create(
    name=row["State full"], symbol=row["State short"],
)
if created:
    print(state.name)
    obj_count += 1
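The counties and cities follow the same pattern, and this is where the inter-related part comes in: each get_or_create passes the parent object in as the foreign key. (The "County" and "City" column names below are my guesses at the CSV headers – adjust them to match your file.)

county, created = County.objects.get_or_create(
    name=row["County"], state=state,
)
if created:
    obj_count += 1

city, created = City.objects.get_or_create(
    name=row["City"], county=county,
)
if created:
    obj_count += 1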

Once you have created enough objects (not just states, but counties and cities too), do a manual commit:

if obj_count >= 200:
    transaction.commit()
    obj_count = 0

And at the end of your function, make sure you do one last commit so the last few items are inserted:

transaction.commit()

I’m not going to step you through it line by line, but you can see my gist with my final migration file. Finally, when you are done, just run ./manage.py migrate, go get some coffee, and your data will be inserted.

About the Author


Mike Hostetler

Principal Technologist

Mike has almost 20 years of experience in technology. He started in networking and Unix administration, and grew into technical support and QA testing. But he has always done some development on the side and decided a few years ago to pursue it full-time. His history of working with users gives Mike a unique perspective on writing software.

