Resurrecting a Dead Library: Part Two - Stabilization

August 6, 2018

10 minute read

In this post, I demonstrate how to retrofit automated tests onto an untested legacy library.

This is part two of a three-part series about how I resurrected ingredient-phrase-tagger, a library that uses machine learning to parse raw recipe ingredients (e.g., “2 cups milk”) into structured data. Read part one for the full context, but the short version is that I discovered an abandoned library and brought it back to life so that it could power my SaaS business:

  • Part One: Resuscitation - In which I nurse the code back to health so that it runs on any modern system
  • Part Two: Stabilization (this post) - In which I prevent functionality from regressing while I restore the code
  • Part Three: Rehabilitation (coming soon) - In which I fix the code’s most egregious bugs and begin refactoring

Beavers stabilizing shaky house

Running it in continuous integration

At the end of part one, I created a Docker image that allowed the library to run on any system. The next step was to run the library in continuous integration.

Travis CI logo

Continuous integration is the practice of using an independent, controlled environment to test software on each change to the code. My preferred continuous integration solution is Travis. Their configuration files are intuitive, and they offer unlimited free builds for open-source projects.

To integrate with Travis, I added my fork of ingredient-phrase-tagger on Travis’ configuration page and then enabled builds:

Screenshot of enabling Travis
Enabling Travis builds for ingredient-phrase-tagger library

Then, I created a file called .travis.yml, which told Travis how to build the library:

sudo: required
services: docker
script: docker build .

I pushed my commit to GitHub, created a pull request, and Travis built it successfully:

Screenshot of first successful build on Travis CI
First successful build on Travis

Adding an end-to-end test

Travis was building my Docker image, but the build wasn’t meaningful yet. It only built the library’s dependencies — it didn’t exercise any of its behavior. I wanted a build that could alert me when I broke the library’s functionality. To do that, I needed an end-to-end test.

An end-to-end test verifies that a complete, real-world scenario works as expected. It generally matches the following structure:

  1. Supply pre-generated input and its expected output (also known as the “golden output”).
  2. Use automation tools to feed the input to the library.
  3. Compare the library’s output to the golden output.
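
The three steps above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the library's actual test: run_pipeline is a hypothetical stand-in for the code under test, and the golden output is passed in as a plain string rather than read from a file.

```python
import difflib


def diff_against_golden(golden_text, actual_text):
    """Return unified-diff lines; an empty list means the texts match."""
    return list(
        difflib.unified_diff(
            golden_text.splitlines(),
            actual_text.splitlines(),
            fromfile="golden",
            tofile="actual",
            lineterm="",
        )
    )


def run_pipeline(raw_input):
    """Hypothetical stand-in for the code under test: uppercases its input."""
    return raw_input.upper()


if __name__ == "__main__":
    golden = "2 CUPS MILK"  # step 1: pre-generated golden output
    actual = run_pipeline("2 cups milk")  # step 2: feed input to the code
    differences = diff_against_golden(golden, actual)  # step 3: compare
    if differences:
        print("\n".join(differences))
    else:
        print("output matches golden")
```

In a real build, a non-empty diff would translate into a failing exit code so that continuous integration marks the build as broken.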

The original repository contained a script that resembled an end-to-end test. It provided pre-generated input to the library, used a portion of the input to train a new machine learning model, then used that model to parse other portions of the input. The only piece missing was that it never compared results to a known-good output.

A basic end-to-end test

In part one, I showed that the script’s final result was a set of summary statistics about the model’s performance:

Sentence-Level Stats:
        correct:  1487
        total:  1999
        % correct:  74.3871935968

Word-Level Stats:
        correct: 10391
        total: 11450
        % correct: 90.7510917031

That gave me enough for a simple end-to-end test. I re-ran the final step of the script, but redirected the console output to a new file called tests/golden/eval_output and added it to source control:

python bin/ tmp/test_output > tests/golden/eval_output

Now that I had known-good output, I modified the end of the script so that it would compare all future outputs against this saved output:

python bin/ tmp/test_output > tmp/eval_output
diff tests/golden/eval_output tmp/eval_output

Does my test know when code breaks?

An end-to-end test is only useful if it catches bugs, so my next step was to simulate a breaking change and check if my end-to-end test caught it.

In the library's code, there was a regular expression that matched sequences of numbers (e.g., "83625"):

m3 = re.match('^\d+$', ss)

As an experiment, I tweaked the regular expression so that it would fail to recognize any number that included a 9:

m3 = re.match('^[0-8]+$', ss)

I then re-ran my modified script:

<       correct:  1487
>       correct:  1486
<       % correct:  74.3871935968
>       % correct:  74.33716858429

It worked!

When I told the code that 9 was no longer considered a number, the library’s accuracy fell, and the script terminated with a failing exit code.
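
To see the effect of that tweak in isolation, here is a small sketch that applies both patterns to a few sample tokens. The helper name is_number and the sample tokens are my own choices for illustration; only the two regular expressions come from the experiment above.

```python
import re

# The library's original pattern and the deliberately broken variant.
ORIGINAL_PATTERN = re.compile(r"^\d+$")
BROKEN_PATTERN = re.compile(r"^[0-8]+$")


def is_number(pattern, token):
    """Return True if the whole token matches the given pattern."""
    return pattern.match(token) is not None


if __name__ == "__main__":
    for token in ["83625", "1995", "milk"]:
        print(
            token,
            is_number(ORIGINAL_PATTERN, token),
            is_number(BROKEN_PATTERN, token),
        )
```

A token such as "83625" still matches both patterns because it contains no 9, while "1995" matches only the original. That is why the breakage affects just a small slice of the inputs and shifts the accuracy numbers slightly rather than dramatically.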

Want to parse ingredients without all this work?

I went through all of these steps, so you don’t have to. Check out Zestful, my managed service for ingredient parsing.

Expanding the end-to-end test

The basic end-to-end test above was useful, but the script executes a data pipeline with several stages, and a single diff of the final statistics can't say which stage broke. It would be convenient to know that, so I looked for more outputs to include in the end-to-end test.

In addition to printing output to the console, the script also wrote files to a subdirectory called tmp/:

$ file tmp/*
tmp/model_file:  data
tmp/output.html: HTML document, ASCII text, with very long lines
tmp/test_file:   ASCII text
tmp/test_output: ASCII text
tmp/train_file:  ASCII text

test_file, test_output, and train_file were all plaintext files that looked a bit like this:

$ head -n 16 tmp/test_file
1       I1      L12     NoCAP   NoPAREN B-QTY
boneless        I2      L12     NoCAP   NoPAREN I-COMMENT
pork    I3      L12     NoCAP   NoPAREN B-NAME
tenderloin      I4      L12     NoCAP   NoPAREN I-NAME
,       I5      L12     NoCAP   NoPAREN B-COMMENT
about   I6      L12     NoCAP   NoPAREN I-COMMENT
1       I7      L12     NoCAP   NoPAREN B-QTY
pound   I8      L12     NoCAP   NoPAREN I-COMMENT

Salt    I1      L8      YesCAP  NoPAREN B-NAME
and     I2      L8      NoCAP   NoPAREN I-NAME
freshly I3      L8      NoCAP   NoPAREN B-COMMENT
ground  I4      L8      NoCAP   NoPAREN I-COMMENT
black   I5      L8      NoCAP   NoPAREN B-NAME
pepper  I6      L8      NoCAP   NoPAREN I-NAME

I didn’t understand the file format yet, but I didn’t have to. All I needed was a way to detect when the files changed.

After copying these files to tests/golden, I saved them to source control as additional golden outputs. Then, I added diffs to my build script to detect when these output files changed.
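
My build script used plain diff commands for this, but the same check can be sketched in Python with the standard library's filecmp module. The function name changed_outputs and the demo file contents below are made up for illustration.

```python
import filecmp
import pathlib
import tempfile


def changed_outputs(golden_dir, output_dir, filenames):
    """Return names of files that differ from (or lack) their golden copy."""
    # shallow=False compares file contents rather than only size and mtime.
    _, mismatched, errors = filecmp.cmpfiles(
        golden_dir, output_dir, filenames, shallow=False
    )
    return sorted(mismatched + errors)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as golden:
        with tempfile.TemporaryDirectory() as actual:
            # Made-up file contents, purely for demonstration.
            pathlib.Path(golden, "train_file").write_text("pork\tB-NAME\n")
            pathlib.Path(actual, "train_file").write_text("pork\tB-NAME\n")
            pathlib.Path(golden, "test_file").write_text("corn\tB-NAME\n")
            pathlib.Path(actual, "test_file").write_text("corn\tI-NAME\n")
            print(changed_outputs(golden, actual, ["train_file", "test_file"]))
```

Checking each intermediate file separately means a failure points at the first pipeline stage whose output changed, not just at the final statistics.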

The complete build script

After all my modifications to the script, I saved it as a new file, which looked like this:


# Exit build script on first failure
set -e
# Echo commands to stdout.
set -x


OUTPUT_DIR=$(mktemp -d)

bin/generate_data \
  --data-path=nyt-ingredients-snapshot-2015.csv \
  --count=$COUNT_TRAIN \
  --offset=0 > "$ACTUAL_CRF_TRAINING_FILE"
bin/generate_data \
  --data-path=nyt-ingredients-snapshot-2015.csv \
  --count=$COUNT_TEST \

crf_learn \

crf_test \


# Check against golden output.


I then added a simple wrapper around that script called docker_build that ran the end-to-end test within the library’s custom Docker container:


# Exit on first failing command.
set -e
# Echo commands to console.
set -x


docker build \
  --tag "$IMAGE_NAME" \

docker run \
  --tty \
  --detach \
  --name "$CONTAINER_NAME" \

docker exec "$CONTAINER_NAME" ./

With the docker_build script, my end-to-end test could run on any system that supported Docker. Naturally, I wanted to run it in my continuous integration environment.

Running my end-to-end tests in continuous integration

My earlier Travis configuration built the Docker image but didn’t exercise the library. Now that I had a thorough test script, I updated my .travis.yml file to run it:

sudo: required
services: docker
-script: docker build .
+script: ./docker_build

I pushed my changes, ready to witness the splendor of my brilliant test that could run consistently anywhere. Instead, it failed:

End-to-end test failing on Travis
End-to-end test fails on Travis after passing on my local machine

I wasn’t happy to see a build break, but I was glad that my end-to-end test caught something. I just had to figure out what it was.

Debugging the discrepancy

The whole point of a Docker container is that the program should behave the same anywhere, so how could I run the same container in two places and see different outputs?

The Travis build log showed that the test failed on the diff of the testing_output file:

+ diff --context=2 tests/golden/testing_output /tmp/tmp.W5S3C5T4if/testing_output
*** tests/golden/testing_output  Fri Jul 27 02:44:20 2018
--- /tmp/tmp.W5S3C5T4if/testing_output  Fri Jul 27 03:03:56 2018
*** 173,178 ****
  1  I1  L8  NoCAP  NoPAREN  B-QTY  B-QTY
  tablespoon  I2  L8  NoCAP  NoPAREN  B-UNIT  B-UNIT
! corn  I4  L8  NoCAP  NoPAREN  B-NAME  B-NAME
  syrup  I5  L8  NoCAP  NoPAREN  I-NAME  I-NAME

--- 173,178 ----
  1  I1  L8  NoCAP  NoPAREN  B-QTY  B-QTY
  tablespoon  I2  L8  NoCAP  NoPAREN  B-UNIT  B-UNIT
! corn  I4  L8  NoCAP  NoPAREN  B-NAME  I-NAME
  syrup  I5  L8  NoCAP  NoPAREN  I-NAME  I-NAME

The testing_output file was the result of these two lines in my script:

crf_learn \

crf_test \

crf_learn and crf_test were both command-line utilities for CRF++, the engine that powered ingredient-phrase-tagger’s machine learning logic. Without knowing much about these utilities, I could deduce from the syntax that crf_learn created a machine learning model and crf_test used that model to classify data.

The end-to-end test had verified that the contents of $ACTUAL_CRF_TRAINING_FILE and $ACTUAL_CRF_TESTING_FILE matched my golden versions. This meant that crf_learn and crf_test took in inputs that were identical on my local system as well as in continuous integration, but they produced different outputs depending on the environment.
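
The diffs in the build already established that the inputs matched. As an extra sanity check, one could also print a checksum of each input file in both environments and compare the values by eye. The sketch below is one possible way to do that, not something my build script actually did; the function name sha256_of_file is hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as input_file:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: input_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digests match across machines while the outputs of crf_learn and crf_test differ, the discrepancy has to come from the tools or their environment rather than from the data.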

A deeper dive into CRF++

Was CRF++ non-deterministic? I tried running the test again locally. It passed. I re-ran the test on Travis, and it failed in the same way. This told me that CRF++ was consistent across executions in the same environment, but was inconsistent across environments.

I didn’t like where this was pointing. It suggested that CRF++’s behavior depended on the system’s underlying hardware. Maybe an Intel CPU yielded different results than an AMD CPU. That would be a pain because Travis doesn’t guarantee anything about its hardware environment. Furthermore, if different hardware yielded different results, that would defeat the purpose of a Docker container.

In desperation, I checked CRF++’s command-line documentation to look for anything that might hint about hardware dependencies:

$ crf_learn --help
 -p, --thread=INT            number of threads (default auto-detect)

The --thread flag looked interesting. I checked the full documentation for more details:

-p NUM:

If the PC has multiple CPUs, you can make the training faster by using multi-threading. NUM is the number of threads.

This sounded promising.

I compared the CRF++ output on Travis to the same output lines in my local environment:

Difference in crf_learn thread count between local machine and Travis
crf_learn runs with two threads on Travis, but eight in my local environment

Ah ha!

Because I omitted the --thread flag, CRF++ set it automatically based on the number of CPU cores available. My Travis environment had two CPU cores, while my local machine had eight.

I tweaked my script to set the thread count explicitly:

+crf_learn \
+  --thread=2 \
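
The idea of pinning the thread count, instead of letting it follow the host's core count, can be sketched as follows. This is not the elided portion of my script: the file paths are placeholders, the helper crf_learn_command is hypothetical, and the positional argument order (template, training data, model) reflects my understanding of CRF++'s usage rather than anything shown above. The call to os.cpu_count() is only an analogy for the auto-detection, not CRF++'s own logic.

```python
import os


def crf_learn_command(template_path, train_path, model_path, threads=2):
    """Build a crf_learn argument list with the thread count pinned."""
    return [
        "crf_learn",
        "--thread={}".format(threads),  # explicit, so it no longer varies by host
        template_path,
        train_path,
        model_path,
    ]


if __name__ == "__main__":
    # An auto-detected value changes from machine to machine.
    print("CPU cores on this machine:", os.cpu_count())
    # Placeholder paths; the command is only printed here, not executed.
    print(crf_learn_command("template_file", "train_file", "model_file"))
```

Any fixed value would make the runs reproducible. Choosing 2 simply matched what Travis already used, which kept the build from depending on how many cores a given machine had.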

Then, I saved the newly generated output files as my golden copies. I pushed the changes to GitHub and was greeted with a pleasant sight: my end-to-end test passed.

Success after fixing end-to-end test
End-to-end test passing on Travis

The value of good tests

The end-to-end test proved its value very quickly. While it was tedious to dive into the documentation for one of the library’s dependencies, the test exposed that the library produced inconsistent results depending on its environment. This is something the library’s original authors likely never realized.

With the end-to-end test in place and continuous integration running, I had an authoritative environment that demonstrated the library’s expected functionality. The test provided a valuable safeguard in case I made any changes that unintentionally changed the library’s behavior.

What’s next?

With the confidence from my test, it was time for my favorite part of a software project: refactoring. I was free to make large-scale changes to the code because I knew the build would break loudly if I did anything too stupid.

Stay tuned for part three of this series, where I will describe how I:

  • reduced the build time from 20 minutes to 90 seconds
  • applied style conventions to the code automatically
  • added static analysis to the build

Cover illustration by Loraine Yow. My fork of the ingredient-phrase-tagger library is available on GitHub. I offer a managed service based on this library called Zestful.
