Copilot Induced Crash

Klaas van Schelven; January 14 - 5 min read

A programmer and his copilot, about to crash

AI-generated image for an AI-generated bug; as with code, errors are typically different from human ones.

While everyone is talking about how “AI” can help solve bugs, let me share how LLM-assisted coding gave me 2024’s hardest-to-find bug.

Rather than take you along on my “exciting” debugging journey, I’ll cut to the chase. Here’s the bug that Microsoft Copilot introduced for me while I was working on my import statements:

from django.test import TestCase as TransactionTestCase

Python’s “import as”

What are we looking at? For those unfamiliar with Python, the as keyword in an import lets you give an imported entity a different name. It can be used to avoid naming conflicts, or for brevity.

Here are some sensible uses:

# for brevity / idiomatic use:
import numpy as np

# to avoid naming conflicts / introduce clarity:
from django.test import TestCase as DjangoTestCase
from unittest import TestCase as RegularTestCase

The bug in the above is not one of those sensible uses, however. It is in fact the evilest possible use of as.

The problem? The django.test module contains multiple different test classes, including TestCase and TransactionTestCase, with subtly different semantics. The line above imports one of those under the name of the other.
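To make the mechanics concrete, here is a minimal sketch of what such an alias does to a class. It deliberately uses the standard library's unittest as a stand-in (an assumption made so that the snippet runs without Django installed); the aliasing behaves the same way for any module. The alias changes only the local name, so the class object still reports what it really is:

```python
# Stand-in for the Django import: unittest is used here only so the snippet
# runs without Django. The aliasing mechanics are the same for any module.
from unittest import TestCase as TransactionTestCase


class IngestViewTestCase(TransactionTestCase):
    """Reads like a TransactionTestCase subclass, but is not one."""


# The alias only changes the local name; the class object keeps its identity.
real_name = TransactionTestCase.__name__
real_module = TransactionTestCase.__module__
base_names = [cls.__name__ for cls in IngestViewTestCase.__mro__]

print(real_name)    # TestCase
print(real_module)  # unittest.case
print(base_names)   # ['IngestViewTestCase', 'TestCase', 'object']
```

Printing `__name__` or the MRO of a suspicious base class is a cheap sanity check when a class does not behave the way its name promises.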

The actual bug

In this particular case, the two TestCases have (as the name of one of them suggests) slightly different semantics with respect to database transactions.

The TestCase class wraps each test in a transaction and rolls back that transaction after each test, providing test isolation.
The TransactionTestCase class has (somewhat surprisingly, depending on how you read that name) no implicit transaction management, which makes it ideal for tests that depend on, or test some part of, your application’s DB transaction management.

The bug, then, is that if you depend on the semantics of TransactionTestCase, but actually are running Django’s default TestCase (because of the weird import), you will end up with tests that fail all of a sudden. This is what occurred in my case.
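The difference can be sketched with a toy model. This is not Django's implementation: ToyConnection, code_under_test and the two run_* helpers are hypothetical names invented for illustration. The sketch only mimics the one behavior that matters here, using the on_commit() callbacks from the Django documentation quoted later in this article: callbacks queued for commit never run if the surrounding transaction is rolled back.

```python
class ToyConnection:
    """Toy stand-in for a database connection; not Django's implementation."""

    def __init__(self):
        self._pending = []

    def on_commit(self, callback):
        # Queue the callback until the surrounding transaction commits.
        self._pending.append(callback)

    def commit(self):
        # Committing runs every queued callback exactly once.
        pending, self._pending = self._pending, []
        for callback in pending:
            callback()

    def rollback(self):
        # Rolling back discards queued callbacks without running them.
        self._pending = []


def code_under_test(conn, log):
    # Hypothetical application code that relies on a real commit happening.
    conn.on_commit(lambda: log.append("callback ran"))


def run_wrapped_in_rollback(log):
    """Mimics TestCase: the test's transaction is always rolled back."""
    conn = ToyConnection()
    code_under_test(conn, log)
    conn.rollback()


def run_with_real_commit(log):
    """Mimics TransactionTestCase: the transaction is actually committed."""
    conn = ToyConnection()
    code_under_test(conn, log)
    conn.commit()


wrapped_log, unwrapped_log = [], []
run_wrapped_in_rollback(wrapped_log)
run_with_real_commit(unwrapped_log)

print(wrapped_log)    # []
print(unwrapped_log)  # ['callback ran']
```

A test that asserts on the effect of such a callback passes in the second setup and fails in the first, even though the test code itself is identical.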

Two hours of my life

I won’t make you suffer through the same series of surprises I experienced in those two hours of debugging, the exact test that blew up for me, or the steps I took not to fall into this trap again.

The short of it is: after establishing that my tests were failing because the database transactions weren’t behaving as they should, I first looked for problems in my own code, then suspected a bug in Django, and only finally spotted the problem detailed above.

Why did I start suspecting Django? Well… because I was sure I was using a TransactionTestCase, but from the behavior of the tests it was clear that the TransactionTestCase was not behaving as promised in the documentation. This led me to suspect some kind of subtle bug in Django, and to much stepping through Django’s source code.

Why was this so hard to spot?

You might be tempted to think that the problem is easy to spot, because I’ve already given you the answer in the first lines of this article. Trust me: in practice, it was not. Let’s look at why.

First, please understand that although I did run my tests before committing, I did not run them straight after Copilot introduced this line. So when I finally had a failing test on my hands, I had approximately two full screens of diff text to look at.

Second, let’s look at the usage location of the alias. Note that it simply reads TransactionTestCase here, and how the carefully written comment now serves as a way to further misdirect you into believing that this is what you’re looking at.

class IngestViewTestCase(TransactionTestCase):
    # We use TransactionTestCase because of the following:
    #
    # > Django’s TestCase class wraps each test in a transaction and rolls
    # > back that transaction after each test, in order to provide test
    # > isolation. This means that no transaction is ever actually committed,
    # > thus your on_commit() callbacks will never be run.
    # > [..]
    # > Another way to overcome the limitation is to use TransactionTestCase
    # > instead of TestCase. This will mean your transactions are committed,
    # > and the callbacks will run. However [..] significantly slower [..]

The alias misled me into thinking TransactionTestCase was being correctly used. Combined with the detailed comment explaining the use of TransactionTestCase, I wasted time diving deep into Django internals instead of suspecting the import.

An Unhuman Error

However, the most important factor driving up the cost of this bug was the fact that the error was simply so weird.

Note that it took me about two hours to debug this, despite the problem being freshly introduced. (Because I hadn’t committed yet, and had established that the previous commit was fine, I could have just run git diff to see what had changed.)

In fact, I did run git diff and git diff --staged multiple times. But who would think to look at the import statements? The import statement is the last place you’d expect a bug to be introduced. It’s a place where you’d expect to find only the most boring, uninteresting, and unchanging code.
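One way to make renaming imports stand out, rather than relying on the eye to catch them in a diff, is to list them mechanically. The following is a minimal sketch using Python's standard ast module; find_renamed_imports is a hypothetical helper written for this illustration, not something I used at the time. It reports every `from ... import X as Y` in a piece of source code, and because the source is only parsed, never executed, none of the modules involved need to be installed.

```python
import ast


def find_renamed_imports(source):
    """Return (line, module, original_name, alias) for each `from ... import X as Y`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                # asname is None when the import has no `as` clause.
                if alias.asname and alias.asname != alias.name:
                    found.append((node.lineno, node.module, alias.name, alias.asname))
    return sorted(found)


example = (
    "import numpy as np\n"
    "from django.test import TestCase as TransactionTestCase\n"
    "from unittest import TestCase\n"
)

renamed = find_renamed_imports(example)
print(renamed)  # [(2, 'django.test', 'TestCase', 'TransactionTestCase')]
```

The sketch ignores plain `import numpy as np` style aliases on purpose, since those rename a module rather than one of several similarly named classes inside it. A stricter check could additionally flag only aliases that match another name exported by the same module, at the cost of having to import that module.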

Debugging is based on building an understanding, and any understanding is based on assumptions. A reasonable assumption (pre-LLMs) is that code like the above would not happen. Because who would write such a thing?

Are you sure it was Copilot?

Yes…

Well… unfortunately I don’t have video evidence or a MITM log of the requests to Copilot to prove it. But 8 months later I can still reproduce this under some conditions:

from django.test import Te... # copilot autocomplete finishes this as:
from django.test import TestCase as TransactionTestCase

Knowing that the code below this import statement contains some uses of TransactionTestCase, and no uses of TestCase, I can see how a machine that was trained on filling in blanks might come up with this line. That is, it’s reasonable for some definition of reasonable.

But there is just no reasonable path for a human to come up with this line. It’s not idiomatic, it’s not a common pattern, and it’s not a good idea. Which leaves copilot as the only reasonable suspect.

Copilot induced crash

AI-assisted code introduces new types of errors.

Experienced developers understand their own failure modes, as well as those of others (like juniors). But AI adds a new flavor of failure to the mix. It confidently produces mistakes we’d never expect – like the import statement above.

When relying on AI assistance, the bugs we encounter aren’t always the ones we’d naturally anticipate. Instead, they reflect the AI’s quirks – introducing new layers of unpredictability to our workflows. For me personally, the balance is still positive, but it’s important to be aware of the new types of bugs that AI can introduce.

So what’s with “copilot induced crash” in the title? Well, it’s a bit of a joke. The bug was introduced by Copilot, but there was no actual crash here (I never committed this code). But given “copilot” it was just too tempting to continue the metaphor of a plane crash.