The Black Art of Optimising

Optimising program code for maximum performance is often considered to be some sort of black art. Also, quite a number of programmers believe in micro-optimisations that rarely ever improve program execution time, but nearly always introduce subtle bugs or obfuscate the code, or both.

In the following we will have a look at optimisation possibilities that help cut down execution time considerably while still keeping the code clear and easily maintainable.

Note that we won't use profiling for the course of this article. We'll pick a problem of the highest simplicity and exclusively look at program execution time. In the real world, optimising code does hardly ever work like that, because in nearly 100% of all cases you simply HAVE to profile before optimising anything -- both because it's almost impossible to guess where CPU time is spent and because programs usually tend to do way more complex things than the one we will look at here, so a simple test run using the "time" command line tool is not possible.

The programming language we will use is plain C. Most of what we are going to do, however, applies to other high level languages as well. It should not be a problem to convert the example code to Java, for example. I also assume a UNIX-like shell and gcc as the compiler, but again, if you have a different setup, but same functionality, you will be fine.

The test system

I'm developing, building and running all this on a rather slow system, a P3 Coppermine 700, running Gentoo. It does not make much sense to use the fastest system available when optimising by measuring program execution time, since optimising oftentimes means making tiny changes to the code and testing if the result runs faster or slower than before. Obviously, the difference may be small in some cases, maybe too small to notice on a fast system.

Of course, one can argue that an optimisation leading to a difference in execution time that isn't even measurable on, say, a P4 with 3 GHz is not much of an optimisation anyway, and most of the time in real world scenarios this might be true. Still, measuring execution time on a slower system is more reliable -- it is however a lot more tedious as well.

The problem

The program we are trying to optimise is a little math calculation to solve the following puzzle, taken from the June 4th, 2005 edition of the British magazine "New Scientist": "Find the largest number whose digits are all different (and non-zero) and which is divisible by each of its digits." I am not a reader of this magazine, but stumbled across the problem in Charles Petzold's interesting essay "Does Visual Studio rot the mind?". Petzold, famous for his books on Windows Programming, has his own solution to the problem on his web page.

Let's think about the problem for a short moment: What do the different parts of the problem description tell us? "The largest number" is rather unhelpful: It doesn't really narrow down the problem's scale in any way, quite the contrary: The solution might be between 0 (we assume the solution is positive) and infinity. So we proceed to "whose digits are all different". That's a lot better! Without a doubt, the largest number with all different digits is 9,876,543,210. Obviously, the solution is somewhere between 0 and this number. What's next? "(and non-zero)". Great, so the maximum value comes down to 987,654,321. The last part of the problem description "and which is divisible by each of its digits." doesn't narrow down the potential solution anymore, unfortunately: It's not a clue, it is what we are supposed to do.

Obviously, the problem can be solved with a simple brute-force loop, going through all possible numbers, starting at 987,654,321 and counting downwards, checking each one until a solution is found. We implement this in a completely straightforward manner first -- remember: "Make it work, make it right, make it fast." This is the first version I came up with:

#include <stdio.h>
#include <string.h>

/* the number we are looking for cannot possibly be larger than this. */
#define MAX_VALUE 987654321

int main(void)
{
    /* the loop counter for the number we are currently trying */

    unsigned int current_number;

    for (current_number = MAX_VALUE; current_number > 0; current_number--)
    {
        /* a copy of the current number: we need to modify it */
        unsigned int n = current_number;

        /* array to take note which digits we have already seen */
        char digits[10];

        /* set index 0 to 1, indicating we never want a 0 */
        digits[0] = 1;

        /* clear the rest of the array */

        memset(digits + 1, 0, sizeof(digits) - 1);

        /* now go through the digits of n */
        while (n > 0)
        {
            /* if we have seen this digit before, get out of the while() */

            if (digits[n % 10] == 1)
                break;

            /* if the number is not divisible by the digit, get out, too */
            if (current_number % (n % 10) != 0)
                break;

            /* remember that we have seen this digit */
            digits[n % 10] = 1;

            /* divide by 10 so we can get at the next digit */
            n /= 10;
        }

        /* if n is 0 here, we found the solution */
        if (n == 0)
        {
            printf("Solution: %d", current_number);
            break;
        }
    }

    return 0;
}

Let's build it with

$ gcc -W -Wall -ansi -pedantic -O3 optimising1.c -o optimising1

We are building the program with the option -O3 to turn on most of gcc's own optimisations that are safe to use. It wouldn't make much sense to squeeze time out of parts of a program that the compiler itself can optimise for us at no cost at all.

So let us see how fast it runs on the test system:

$ time ./optimising1

Solution: 9867312

real    2m43.078s
user    2m42.979s
sys     0m0.023s

Great! We found a solution and it seems to make sense: All the numbers are different and it doesn't have a 0. Not so great is the program's execution time: Surely a painfully slow matter so far.

Where to start?

How do we tackle this now? Where do we begin? And, maybe even more important than that: Where do we stop? You might be tempted to say "Well, we'll stop once we don't find anything else to optimise", and that is more or less what we're going to do here for the sake of demonstration, but in a real world scenario, one would be ill-advised to do just that: Optimising code, though certainly an art, cannot and must not be an end in itself. One always has to have a clear and precise idea of what optimisation is to achieve. To simulate this, we're setting goals for ourselves here. The first one is: Get the program to run in one minute. In other words: We need to cut execution time by nearly two thirds.

Now that we have a goal to achieve, back to the first question: Where to start? When optimising code, anything one does falls in either of the following two categories:

1) Make the program do what it does more quickly.
2) Make the program do less.

The first point is obvious: Many a programmer thinks of optimising as finding a faster way for his code to do what it does already. Unfortunately, the risk this programmer runs is that he will only make code faster that should not have been what or where it is (or even not there at all) in the first place.

And maybe not even that, because some so-called optimisations in this first category are rather pointless: Forget about replacing all occurances of postfix increments or decrements in your program with their prefix counterparts, just because the prefix operator "should" be faster in theory. In close to one hundred percent of all cases, you won't achieve a thing with measures like that, because your compiler (or maybe even a smart interpreter) will most likely optimise this for you anyway. Of course, not all possible optimisations in the first category are like that. We will see quite a number of them until we're done.

The second category, however, is a lot more interesting: It is pretty obvious that our program will run faster if it has to do less work. Naturally, we need to find ways to have our program do less work without losing any of its required functionality.

Where does our little program do things it does not have to do? And what are the most time consuming tasks for the program? Maybe answering the second question makes answering the first one easier. Clearly, the program will spend most of its execution time with divisions, either directly (n /= 10) or indirectly for the modulo operations. If we could find a way to reduce the number of divisions the program has to perform, we would certainly speed it up quite a bit.

A first attempt might be to move the three occurances of "n % 10" to the top of the while-loop and store the result in a temporary variable. Unfortunately, gcc is as smart as we are in this regard and has already performed this optimisation for us, because the result of "n % 10" is clearly constant during the execution of the while-loop. So we would only complicate the code if we did this ourselves, gaining no speed improvement. We need to look elsewhere.

There are two checks we perform in the while-loop: We find out if we have seen the digit we are currently looking at before, and we check if the current number is divisible by the current digit. But what if we check a number where a digit is found twice, but at the first and last position? In the worst case, we would have to calculate quite a number of divisions of this number by its individual digits only to give up once we reach the last digit. In other words: We are doing a lot of unnecessary work in that case. The only way to avoid that unnecessary work is to split the while-loop in two: A first loop to check if all the digits are distinct, a second one to perform the divisions. Naturally, if the first loop is not successful, we can skip the second.

The result might look like this:

#include <stdio.h>
#include <string.h>
 
#define MAX_VALUE 987654321
 
int main(void)
{
    unsigned int current_number;
 
    for (current_number = MAX_VALUE; current_number > 0; current_number--)
    {
        unsigned int n = current_number;
        char digits[10];
        digits[0] = 1;
 
        memset(digits + 1, 0, sizeof(digits) - 1);
 
        while (n > 0)
        {
            if (digits[n % 10] == 1)
                break;
 
            digits[n % 10] = 1;
 
            n /= 10;
        }
 
        if (n != 0)
            continue;
 
        n = current_number;
 
        while (n > 0)
        {
            if (current_number % (n % 10) != 0)
                break;
 
            n /= 10;
        }
 
        if (n == 0)
        {
            printf("Solution: %d ", current_number);
            break;
        }
    }
 
    return 0;
}

As you can see, not much has changed: We now have two while-loops, one checking if the digits are all distinct and the other if the number is divisible. In between we check if the first loop was successful and reset the temporary variable to current_number. So how much did we gain by this, if anything?

$ time ./optimising2
Solution: 9867312

real    2m0.679s
user    2m0.594s
sys     0m0.011s

Hmm, well. Not too bad. We got the execution time down by 27%. However, we're still far away from our self-set goal to get the execution time down to a minute. To achieve this, we will have to cut the work done in half somehow.

We have to take a step back and rethink the problem and the algorithm we've chosen to solve it. Remember what we are trying to do: "Find the largest number whose digits are all different (and non-zero) and which is divisible by each of its digits."

We already found out that the highest possible solution is 987,654,321. But it's not the true solution, and it cannot be for various reasons. One of them is it isn't an even number, while at the same time there are several even digits in it. For a number to be divisible by an even number, the number itself has to be even. And if the digits in our solution have to be distinct, no number above 98765 that is odd can be the solution, because any number with more than five different digits would have to have at least one even digit in it. Why? Because there are only five odd digits: 1, 3, 5, 7, and 9. So to have more than five distinct digits, we would have at least one even one.

How will this realisation help us optimise the program? Simple enough: If a number higher than 98765 cannot be the solution we're looking for if it is not even, we need not even try all the things we do in our while-loops with it, we might just as well skip it right from the start. And checking if a number is higher than 98765 and odd or even is a lot less expensive in calculation time than the divisions in our loops.

Let's incorporate this in our program. Here's how it looks now:

#include <stdio.h>
#include <string.h>
 
#define MAX_VALUE 987654321
 
int main(void)
{
    unsigned int current_number;
 
    for (current_number = MAX_VALUE; current_number > 0; current_number--)
    {
        unsigned int n = current_number;
        char digits[10];
 
        if (current_number > 98765 && (current_number & 1) == 1)
            continue;
 
        digits[0] = 1;
 
        memset(digits + 1, 0, sizeof(digits) - 1);
 
        while (n > 0)
        {
            if (digits[n % 10] == 1)
                break;
 
            digits[n % 10] = 1;
 
            n /= 10;
        }
 
        if (n != 0)
            continue;
 
        n = current_number;
 
        while (n > 0)
        {
            if (current_number % (n % 10) != 0)
                break;
 
            n /= 10;
        }
 
        if (n == 0)
        {
            printf("Solution: %d ", current_number);
            break;
        }
    }
 
    return 0;
}

Again we build and run the program while measuring execution time:

$ time ./optimising3
Solution: 9867312
 
real    1m3.927s
user    1m3.918s
sys     0m0.002s

That's quite a breakthrough! Without sacrificing any functionality, we managed to cut down execution time to nearly a minute, thereby achieving our first goal.

Let's look again at what we did so far. We performed two optimisations that got down execution time from 163 seconds to 121 seconds to 63 seconds. In relative numbers, we cut down execution time by 62%. And how did we achieve this? By two very distinct measures, but both of them were rather in the second optimisation category ("make the program do less") than in the first ("make the program do what it does more quickly"). That's an important lesson to learn, already. But we're not finished, yet.

Look at those numbers!

The next goal: Have the program do what it does in less than 10 seconds. Impossible, you think? Not at all. To help us in finding another potential optimisation, we let the program output each number that we check before we enter the first while loop, that is, after we've skipped all odd numbers above 98765:

...
987611120
987611118
987611116
987611114
987611112
987611110
987611108
987611106
987611104
987611102
987611100
987611098
987611096
987611094
987611092
987611090
987611088
987611086
987611084
987611082
987611080
987611078
987611076
987611074
987611072
987611070
987611068
987611066
987611064
987611062
987611060
987611058
...

Can you see a potential optimisation here? We only have even numbers, and that's fine, but we see a lot of numbers where we know just by looking at them that they're not going to work: 987611120 has three 1 digits, for example. And the moment we find out about that, we should be able to tell that the next one we're going to try, 987611118, won't work either, because the three 1 digits are still there. In other words: We must find a way to skip a whole range of numbers once we find a problem with the current one that we, somehow, know won't be gone for the next sequence of numbers. But how can we do that?

Let us look at 987611120 again. The way our program works right now, it will start checking that number by looking at the last digit. In this case, the last digit is 0, so the number will be skipped. Next one is 987611118. Last digit is 8, that's fine. Next is 1, that's fine as well. Now we find another 1 and skip the number, only to try 987611116. Not a brilliant idea, because we (but right now, not our program, the way it's written) already know we will also find at least two 1's again.

The fundamental problem is that we are currently checking digits starting at the right end of the number to try. As long as we keep it that way, we cannot react to duplicate digits or zeros the way we should. But, after all, how exactly should we react? Consider this: In our first while-loop, we could start from the beginning of the number instead, and first look at 9, then at 8, then at 7, then at 6, then at the first 1. All would be fine until then. In the next iteration, we'd encounter the second 1. At that point, how could we make sure we're not just skipping the current number, but all of the following numbers until the second of those two 1's is gone?

Well, that's easy: We just need to subtract 1120 (the remaining digits we haven't yet checked) from the number, resulting in 987610000. For the next iteration of the outer for loop, the current number would again be decremented by 1, so the next number to be checked would be 987609999. Of course, that one also would not work, but still: We'd have skipped 1120 / 2 (the odd ones wouldn't have been checked anyway) numbers.

A different approach

After having discovered this potential optimisation, let us see how we can incorporate it into our code. Sadly, it's not at all trivial: The way our two while loops are written, we depend on checking digits starting at the right end of the number, because we pick the rightmost number with a modulo and then divide by 10 to get the next number to the left of it. That has been all fine and well until now, but for the next step we need to pick a different approach: We write the current number to a temporary character array and go through this array from left to right. As soon as we find a digit that is impossible, because we have seen this digit before or because it is a 0, we convert the part of the character array we haven't yet looked at back to an integer and substract the result from current_number.

Here's how this approach might look in code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
 
#define MAX_VALUE 987654321
 
int main(void)
{
    unsigned int current_number;
    unsigned int n;
    unsigned int len;
    char s[10];
 
    for (current_number = MAX_VALUE; current_number > 0; current_number--)
    {
        char digits[10];
 
        if (current_number > 98765 && (current_number & 1) == 1)
            continue;
 
        digits[0] = 1;
 
        memset(digits + 1, 0, sizeof(digits) - 1);
 
        len = sprintf(s, "%d", current_number);
 
        for (n = 0; n < len; n++)
        {
            if (digits[(s[n] - '0')] == 1)
            {
                current_number -= atoi(s + n + 1);
                break;
            }
 
            digits[(s[n] - '0')] = 1;
        }
 
        if (n != len)
            continue;
 
        for (n = 0; n < len; n++)
            if (current_number % (s[n] - '0') != 0)
                break;
 
        if (n == len)
        {
            printf("Solution: %d ", current_number);
            break;
        }
    }
 
    return 0;
}

Let's see if this gains us anything:

$ time ./optimising4
Solution: 9867312
 
real    0m6.683s
user    0m6.680s
sys     0m0.002s

Not bad. Not bad at all. We're down from more than a minute of execution time to about 7 seconds. That's impressive. And we achieved this by picking a different approach, rewriting most of our code. That's another important lesson to learn: Sometimes, optimising means rewriting major parts of code from scratch.

How does this new approach, writing the number to a character array via sprintf() and checking the individual array entries, fare without the optimisation? Remove or comment out the "current_number -= atoi(s + n + 1);" and see for yourself:

$ time ./optimising4
Solution: 9867312
 
real    8m43.623s
user    8m43.435s
sys     0m0.015s

Ouch! That's terrible. And what exactly does this tell us? The approach itself, converting the number to a character array before testing the digits, is a lot worse than using modulo and division to get at the digits. But since it offers more room for optimisation, it is still a superior approach!

Final goal: Under one second

We're not satisfied yet. Maybe we can get execution time to under one second? Perhaps it's time to look at the numbers again. And think about math for a little longer. What do you think of, say, 987563420? Is it a possible solution? "No way", you will say. "There's a zero at the end!" And you'd be completely right. But there's another reason that has nothing to do with the zero and we're able to tell much earlier: There's a five in that number, at fourth position. So the number would have to be divisible by five, right? What do all numbers divisible by five have in common? Look at this sequence of numbers divisible by five: 5, 10, 15, 20, 25, 30, 35, 40, ... 12345, 12350, 12355, 12360 ... Do you see the pattern? All numbers divisible by five either end in a zero (and thus cannot be our solution anyway) or a five. So, in other words: As soon as we find a five in the number we check and it is not at the end of the number, it cannot be the solution, because for the number to be divisible by five, there either would have be another five at the end (so we had more than one five and the number is out) or a zero (and zeros are, we know it by now, forbidden).

And of course we can combine this with the skipping of whole number ranges we discussed earlier. Not only cannot 987563420 possibly be our solution, but neither can 987563418 nor 987563416 and so on: As long as the five is there at position four in the number, no number will work.

It's ridiculously easy to incorporate this potential optimisation into our program:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
 
#define MAX_VALUE 987654321
 
int main(void)
{
    unsigned int current_number;
    unsigned int n;
    unsigned int len;
    char s[10];
 
    for (current_number = MAX_VALUE; current_number > 0; current_number--)
    {
        char digits[10];
 
        if (current_number > 98765 && (current_number & 1) == 1)
            continue;
 
        digits[0] = 1;
 
        memset(digits + 1, 0, sizeof(digits) - 1);
 
        len = sprintf(s, "%d", current_number);
 
        for (n = 0; n < len; n++)
        {
            if (digits[(s[n] - '0')] == 1 || (s[n] == '5' && n < len - 1))
            {
                current_number -= atoi(s + n + 1);
                break;
            }
 
            digits[(s[n] - '0')] = 1;
        }
 
        if (n != len)
            continue;
 
        for (n = 0; n < len; n++)
            if (current_number % (s[n] - '0') != 0)
                break;
 
        if (n == len)
        {
            printf("Solution: %d ", current_number);
            break;
        }
    }
 
    return 0;
}

So what did we gain?

$ time ./optimising5
Solution: 9867312
 
real    0m1.588s
user    0m1.585s
sys     0m0.002s

That's a massive improvement again: We're alrea