Testing Django project infrastructure failure tolerance

Unit testing and functional testing only get us so far when trying to predict errors in our code. Tests check whether our code behaves as it was designed to, but their domain ends at the application level: they tell us nothing about how our code will behave when an unexpected infrastructure problem arises after the project has already been deployed.

I stumbled into this exact scenario while debugging an issue in Mayan EDMS reported by a user, in which the bulk upload of documents compressed in a zip file was failing silently. The problem was with the database, not with the application. After uncompressing the uploaded files, Mayan was launching many background tasks, one for each document contained in the compressed file. All the background processes doing introspection, OCR, and indexing of the documents were causing the database manager to hit concurrency limits. Those concurrency limits in turn caused database locking issues, surfacing as OperationalError in Django's ORM, which is usually fatal. We knew the reason for the errors but couldn't fix them, because they were load-dependent.

This is one of the worst spots for a programmer to be in: non-repeatable errors. The first step to fixing an error is understanding it, and to understand it you must be able to provoke it in a repeatable way. Infrastructure errors are not repeatable at the application level, so during testing I found myself trying to saturate the database to provoke locking issues and trigger OperationalError. After failing to saturate the database, I started sprinkling the project with code that randomly raised OperationalError at critical paths. This worked, but leaving error-triggering test code in production code is not acceptable. This is how Django Sabot was born.

Django Sabot logo

Django Sabot monkey patches Django's internals to insert code that raises exceptions at predetermined times or under predetermined conditions. The code is divided into two main components: patchers and error producers. A patcher monkey patches a selected method of a Django internal class, inserting an error producer. The error producer instances raise the specified errors based on a specific trigger logic.
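As a rough sketch of the idea (illustrative only, not Django Sabot's actual source), a cursor patcher could look something like this:

# Illustrative only: a patcher wraps a method on a Django internals class so
# that an error producer gets a chance to raise before the real call runs.
from django.db.backends import BaseDatabaseWrapper  # moved to django.db.backends.base.base in newer Django releases

def patch_cursor(error_producer):
    original_cursor = BaseDatabaseWrapper.cursor

    def patched_cursor(self, *args, **kwargs):
        error_producer.check()  # raises the configured exception when its trigger condition is met
        return original_cursor(self, *args, **kwargs)

    BaseDatabaseWrapper.cursor = patched_cursor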

Right now two patchers are included: ConnectPatcher and CursorPatcher, which patch the connect and cursor methods of django.db.backends.BaseDatabaseWrapper respectively. To provoke errors, three classes are included: CountErrorProducer, RandomErrorProducer, and TimeDeltaErrorProducer. CountErrorProducer raises the passed exception after its check method has been called a specific number of times. RandomErrorProducer raises the passed exception when a random number within a given range is returned. TimeDeltaErrorProducer raises the passed exception once the specified amount of time has elapsed since it was instantiated. CountErrorProducer and TimeDeltaErrorProducer take an extra reset argument that, when set to True, resets them after raising the exception, allowing them to raise more exceptions with the same periodicity.
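The counting behavior, for instance, can be pictured roughly like this (a sketch under assumed argument names, not the library's exact signatures):

# Sketch of a count-based error producer; the argument names are assumptions.
class CountingProducerSketch(object):
    def __init__(self, exception, count, reset=False):
        self.exception = exception
        self.count = count
        self.reset = reset
        self.calls = 0

    def check(self):
        self.calls += 1
        if self.calls >= self.count:
            if self.reset:
                self.calls = 0  # start over so the error repeats with the same periodicity
            raise self.exception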

Django Sabot is configured via the settings.py file so that it can be used from diverse test rigs. To test your code for OperationalError when reading from the database at random, with a chance of 1 in 10, enter the following in the settings file you are using to run your tests:

from django.db import OperationalError
from sabot import *

SABOT_PATCHES = (
    CursorPatcher(
        error_producer=RandomErrorProducer,
        kwargs={
            'exception': OperationalError,
            'low': 1, 'high': 10
        }
    ),
)

Django also provides a few ways to override settings at the test level. Use any of these in conjunction with Django Sabot to simulate predictable infrastructure failures and test each of your project's critical code paths, adding remediation code to retry or fail elegantly.
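For instance, with Django's override_settings decorator the failure injection can be scoped to a single test case. The sketch below reuses the configuration shown above; the sabot import path, and whether the setting is re-read per test, are assumptions to adjust to your setup:

from django.db import OperationalError
from django.test import TestCase, override_settings

from sabot import CursorPatcher, RandomErrorProducer  # import path assumed

@override_settings(
    SABOT_PATCHES=(
        CursorPatcher(
            error_producer=RandomErrorProducer,
            kwargs={
                'exception': OperationalError,
                'low': 1, 'high': 10
            }
        ),
    )
)
class BulkUploadFailureTestCase(TestCase):
    def test_bulk_upload_survives_database_errors(self):
        # Exercise the critical code path here and assert that it retries or
        # fails gracefully when the injected OperationalError fires.
        ...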

Hopefully you will now be able to test your project's most infrastructure-critical code paths before it is ever deployed to testing, staging, or production environments.