Optimizing DwarfBot


Dwarf Fortress is an ASCII-based game, often called the world's most computationally heavy program. It simulates an entire world down to the teeth and nerves of each creature. The dev logs and associated bugs are also very humorous to read at times, like the one time babies were born wielding daggers, or the one time animals wouldn't breed if they were unwilling to marry.

Anyway, one of the key features of Dwarf Fortress is that the 'ASCII' graphics can be rendered in multiple fonts, called tilesets. For example, see the image below. As a small tour: in the upper left there is a weaving station, and on the right half there is a small militia, some dogs, and a cat, training to defend the fortress.

A screenshot of Dwarf Fortress.

As a challenge, I set out to create a fast Python program that converts an in-game screenshot from one tileset to another, without any other input. This program is named DwarfBot. Below is a demo - try sliding the white bar left and right:

This problem is difficult due to several factors:

  • The rendering algorithm is not public.
    • Reverse engineering introduces error.
  • Each screenshot can use any of ~180 possible tilesets, each with 256 different tiles.
    • Some tilesets share tile images.
  • The font colors can be anything.
  • Screenshots can introduce noise, and can be cropped poorly too.

Most importantly, this becomes a huge brute force problem. A naive implementation of this algorithm is estimated to take 5 hours to run. However, with a lot of tricks, this was boiled down to 15 seconds per image, plus ~1 minute of precomputation. It took several tricks to get there, in Python no less, so let's discuss them below.

Heuristics

One part of the algorithm picks a location on the image, and then guesses which tile is there. This means testing every single tileset tile against that location. Each tile is given a confidence for how likely it is to be there.

This part happens to be very slow, as it requires both guessing the foreground / background colors of a tile and rendering the tile. Done naively, this happens hundreds of millions of times each run.

A flow chart for guessing a tile.
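In rough Python pseudocode, the loop in the flow chart looks something like the sketch below. The helper names and the confidence metric are illustrative, not the actual DwarfBot code.

```python
# Illustrative sketch of the naive tile-guessing loop, not the real DwarfBot code.
def guess_tile(screenshot_patch, tilesets):
    best = None
    for tileset in tilesets:                               # ~180 tilesets
        for tile in tileset.tiles:                         # 256 tiles each
            fg, bg = guess_colors(screenshot_patch, tile)  # hypothetical helper
            rendered = render(tile, fg, bg)                # hypothetical helper
            confidence = compare(rendered, screenshot_patch)  # e.g. pixel similarity
            if best is None or confidence > best[0]:
                best = (confidence, tileset, tile, fg, bg)
    return best
```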

What if, instead of testing every possibility, the majority of possibilities could be filtered out very quickly?

Introducing the first speed trick. We want a heuristic that allows us to quickly check if two tiles might be the same. This will allow us to compare part of a screenshot to a tileset tile to see if they match. The heuristic must be hue invariant, brightness invariant, and noise resistant.

The original heuristic used was entropy. Each color in a tile was assigned a unique id, and the entropy of the ids was calculated. This heuristic, while color invariant, does not handle noise well: noise introduces or removes colors, and hence skews the entropy.

Meanwhile, the current heuristic calculates the difference between consecutive pixels, then zeros and scales the resultant vector. The result is called the tile summary. Two tiles are compared by taking the dot product of their summaries, so similar tiles produce a higher result. This heuristic fulfills the original requirements and provides an easily interpretable comparison, which also lets us control the leniency of the filter.
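A minimal sketch of what such a summary could look like is below. The exact differencing, zeroing, and scaling in DwarfBot may differ; this just illustrates the idea that normalized differences between consecutive pixels give a brightness- and hue-tolerant fingerprint.

```python
import numpy as np

def tile_summary(tile):
    """Sketch of a tile summary (assumed details): difference consecutive
    pixels of a flattened grayscale tile, zero the mean, scale to unit length."""
    flat = tile.ravel().astype(np.float64)
    diff = np.diff(flat)             # differences between consecutive pixels
    diff -= diff.mean()              # zero the vector
    norm = np.linalg.norm(diff)
    return diff / norm if norm > 0 else diff

def summary_similarity(a, b):
    """Dot product of two summaries; similar tiles score closer to 1."""
    return float(a @ b)
```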

The filter is so effective that entire tilesets are often discarded at a time. It provided the largest speedup of any trick used, roughly a ~120x improvement. Looking forward, it might be possible to add a second filtering step for even more speed.

Optimizing NumPy

With the above additions, profiling identified a few bottlenecks spread throughout the code. The major pain points were generating renderings of tiles, as well as computing and comparing tile summaries.

Now this is obvious in hindsight, but the tile summary code was very repetitive: it was computing the dot product between many vectors and a single vector via a for loop. With vectorization, I was able to squash this into a single NumPy dot(), taking advantage of its bulk optimizations. The function's share of run time fell from ~35% to ~10%.
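As a sketch (the array names and sizes are illustrative), the change amounts to replacing a per-summary loop with one matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
tileset_summaries = [rng.standard_normal(64) for _ in range(256)]  # dummy summaries
patch_summary = rng.standard_normal(64)                            # dummy screenshot summary

# Before: one dot product per summary, in a Python-level loop.
scores_loop = [summary @ patch_summary for summary in tileset_summaries]

# After: stack the summaries once and do a single matrix-vector product.
stacked = np.stack(tileset_summaries)   # shape (256, 64)
scores = stacked @ patch_summary        # shape (256,)
```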

Looking to optimize the tile rendering code next, I wanted to vectorize it as well, but was skeptical because of two if statements. However, one if was easy to replicate using a boolean mask, and the other was true or false for the entire method call. Vectorization turned out not to be as difficult as expected.
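For example, a per-pixel "foreground or background" branch can be expressed with a boolean mask. This is a hypothetical sketch of the idea, not the actual rendering code.

```python
import numpy as np

def render_tile(shape_mask, fg, bg):
    """Hypothetical tile rendering: where the tile's shape mask is set,
    paint the foreground color, otherwise the background color.

    shape_mask: (H, W) boolean array
    fg, bg:     length-3 RGB arrays
    """
    # np.where applies the per-pixel branch across the whole array at once.
    return np.where(shape_mask[..., None], fg, bg)
```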

As a general reminder: if your code is performance critical, make sure to vectorize or parallelize it. Make use of all the compute resources you have available so that they aren't sitting idle. As an example, machine learning systems become much faster the more they are vectorized.

One other optimization worth noting is that NumPy operations sometimes copy the arrays involved. It is good to avoid unnecessary copies whenever possible. This can be done both by choosing certain operations (ex: ravel vs flatten) and by specifying the out parameter. These optimizations improved program performance slightly.
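Both techniques look roughly like this small sketch:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# flatten() always returns a copy; ravel() returns a view when it can.
view = a.ravel()
copy = a.flatten()

# The out parameter writes into an existing buffer instead of allocating a new array.
buffer = np.empty_like(a)
np.multiply(a, 2, out=buffer)
```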

Precomputation

After making the existing algorithm quite efficient, the next step is to cache calculations, or even to precompute them at the beginning.

The main challenge here was to develop a data structure capable of cleanly passing lots of data around the code. This is a public project, and should be as neat as possible for others.
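For instance, a small dataclass can bundle everything known about one tileset into a single object; the fields here are hypothetical, not DwarfBot's actual structure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TilesetData:
    """Hypothetical container for a tileset's precomputed values, so they can
    be passed around as one object instead of as loose arrays."""
    name: str
    tile_width: int
    tile_height: int
    tile_images: np.ndarray      # e.g. shape (256, tile_height, tile_width, channels)
    tile_summaries: np.ndarray   # one summary vector per tile
```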

To take things a step further, many of the precomputations are saved to file, so as to avoid recomputing them on every single program run. Not all precomputations are saved to file, as NumPy data types don't jsonify well, but many are.
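One way around that (a sketch of the general approach, not necessarily how DwarfBot does it) is converting arrays to plain lists before dumping to JSON, and back to arrays on load:

```python
import json
import numpy as np

# Hypothetical precomputed data: one summary matrix per tileset.
summaries = {"example_tileset.png": np.random.rand(256, 64)}

# NumPy arrays aren't JSON serializable, so convert to plain lists first.
with open("precomputed_summaries.json", "w") as f:
    json.dump({name: arr.tolist() for name, arr in summaries.items()}, f)

# Convert back to arrays when loading.
with open("precomputed_summaries.json") as f:
    loaded = {name: np.array(vals) for name, vals in json.load(f).items()}
```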

In the future, it would be nice to cache more. Several tilesets sample the same data from the screenshot, so a lot of calculations are repeated. Caching them would require even more code infrastructure to store data nicely, but is guesstimated to save ~30% of the runtime.