A polynomial benchmark in D

The benchmark code

Initially I had intended to post pseudocode for the polynomial benchmark. However, thanks to the additional work we did in the previous post, the D code itself is easy enough to follow that I can give it here as self-commenting code:

import std.stdio;
import bignum;
import gmp;
import polynomial;

void main()
{
   alias poly!(ZZ, "x") R;  // R = ZZ[x]
   alias poly!(R, "y") S;   // S = ZZ[x][y]

   auto x = R.gen();
   auto y = S.gen();

   auto A = 4*x^^2 + 3*x + 2;
   auto B = 5*x^^2 + 6*x + 7;
   auto C = 8*x^^2 + 9*x + 10;

   auto D = A*y^^2 + B*y + C;
   auto E = D;

   // deliberately E = E + D rather than E += D; see the notes below
   for (long i = 0; i < 10_000_000; i++)
      E = E + D;

   writeln(E);
}

The reasons that I chose multivariate polynomial addition are as follows:

  • Addition does not tax the bignum library: we want to benchmark the language itself, not the external GMP library.
  • The polynomials are kept short to maximise per-operation overhead and really tax the compiler.
  • I chose E = E + D instead of E += D so that it was possible to examine a worst case with regard to memory allocation and garbage collection (a sketch of the difference follows this list).
  • Multiple levels of polynomial rings really tax the type system. Few languages can handle the kind of abstraction required for this benchmark.
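
To make the allocation point concrete, here is a minimal sketch of the difference between the two forms. The Poly class below is purely illustrative (long coefficients, dense storage); it is not the API of the polynomial module from the previous post:

import std.stdio;

final class Poly
{
   long[] coeffs;  // dense coefficients, constant term first (illustrative only)

   this(long[] c) { coeffs = c; }

   // E = E + D: a brand-new array and object are allocated on every call
   Poly opBinary(string op : "+")(Poly rhs)
   {
      auto n = coeffs.length > rhs.coeffs.length ? coeffs.length
                                                 : rhs.coeffs.length;
      auto r = new long[n];
      foreach (i, c; coeffs) r[i] += c;
      foreach (i, c; rhs.coeffs) r[i] += c;
      return new Poly(r);
   }

   // E += D: the existing buffer is reused, so the loop allocates nothing
   Poly opOpAssign(string op : "+")(Poly rhs)
   {
      if (rhs.coeffs.length > coeffs.length)
         coeffs.length = rhs.coeffs.length;  // grow if needed, at most once here
      foreach (i, c; rhs.coeffs) coeffs[i] += c;
      return this;
   }
}

void main()
{
   auto d = new Poly([2, 3, 4]);
   auto e = new Poly([2, 3, 4]);
   foreach (i; 0 .. 1_000_000)
      e += d;             // no per-iteration allocation
   writeln(e.coeffs);     // [2000002, 3000003, 4000004]
}

Each pass through E = E + D hands the garbage collector a fresh array and a fresh object to track, whereas += touches no allocator at all once the buffer is large enough.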

Performance Results

I ran the above benchmark on a number of systems:

  • Using Sympy’s polynomial module.
  • Using the Sage computer algebra system.
  • A hand-coded recursive rings implementation in flint; in particular, the generics module of the bland branch implemented by Fredrik Johansson.
  • A Pari/GP script that directly implements the above program.
  • The D code described in this blog (compiled using dmd, ldc2 and gdc).

The results on a 2.2GHz AMD Magny-Cours Opteron server are as follows:

  • Sympy : 560s (pure Python)
  • Sage : 21s (uses Cython)
  • flint : 2.5s
  • Pari/GP : 11s
  • dmd -O : 41s
  • ldc2 -O2 : 54s
  • gdc-4.6.3 : failed to compile

Summary

  • D is a pretty amazing language with really great features.
  • The compilers are pretty stable, though I didn’t understand some of their complaints.
  • I don’t know why gdc fails to compile my code; I was unable to figure out how to install gdc-4.7.2, and upgrading may be all that is needed.
  • D is much slower than I had expected, though it is not clear how much of this is due to the way I am using classes, especially the fact that a new object is created every time an operation is performed.
  • D still fails the BOTCH test. (I don’t know of any language that passes it. Julia comes pretty close, but you have to abuse the compiler in some pretty unusual ways.)

Future Experiments

  • It would be extremely interesting to implement a similar test in C++, which is D’s direct competitor. I suspect that D may not look so slow when compared against a similar class-based implementation in C++, though some template magic in C++ can make the extra creation of objects go away. I am not sure whether the same is possible in D.
  • It would be interesting to see whether making GMP use Boehm’s GC, and switching to a struct-based implementation instead of a class-based one, would speed up the D code (see the sketch after this list).
  • It would be very interesting to see how much of a difference the += operator would make, both for bignums and polynomials.
  • It would be great to see more languages implement this test. It really explores dark corners of the compiler and the language, and is a great way to find out something about a language’s performance for generic programming.
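
As a starting point for the Boehm GC experiment, here is a minimal sketch of how GMP's allocator could be redirected through its documented mp_set_memory_functions hook. The extern(C) declarations for libgc below are my own assumptions about a plausible D binding; they are not part of the gmp module used above:

extern (C)
{
   // from libgmp
   void mp_set_memory_functions(void* function(size_t),
                                void* function(void*, size_t, size_t),
                                void function(void*, size_t));

   // from libgc (Boehm)
   void* GC_malloc(size_t n);
   void* GC_realloc(void* p, size_t n);
}

extern (C) void* boehmAlloc(size_t n)
{
   return GC_malloc(n);
}

extern (C) void* boehmRealloc(void* p, size_t oldSize, size_t newSize)
{
   return GC_realloc(p, newSize);  // Boehm does not need the old size
}

extern (C) void boehmFree(void* p, size_t size)
{
   // nothing to do: the collector reclaims the block once it is unreachable
}

void main()
{
   mp_set_memory_functions(&boehmAlloc, &boehmRealloc, &boehmFree);
   // every GMP limb array allocated from here on lives in the Boehm heap
}

This would need to be linked against both -lgmp and -lgc; whether it actually speeds up the benchmark is precisely the open question.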

2 thoughts on “A polynomial benchmark in D”

  1. You are literally benchmarking the involved memory allocators here: From a quick glance at the profiler output for the DMD-produced executable, it looks like the program spends 55% (!) of the time in one of the several million calls to D’s GC.malloc function, and another ~20% in the GMP alloc/realloc functions. You are not really »taxing the compiler« as much as taxing your own implementation mistakes.

    This is probably also the reason (assuming that you used the 0.10.0 release) why LDC performs worse than DMD, which shouldn’t normally happen: the optimizer contained a bug which forced us to compile the GC code in a way that led to weaker optimizations being applied than in DMD, even though it should usually be the other way round.

    Unfortunately, if you are using tools on a level that exposes the low-level details, in this case while building a higher-level abstraction, there will always be the chance to shoot yourself in the foot if you don’t keep the general »machine-level« concepts in mind all the time.

    However, I’m reasonably confident that it is possible to implement the types used in the benchmark code much more efficiently with a comparable amount of effort, yielding at least a 4-5x speedup. I’d encourage you to ask around on the D forums (http://forum.dlang.org/); many people there like working on a little optimization challenge from time to time.

    Oh, and using += will almost certainly help, as it’s not only the allocations on the D side, but also the ones on the GMP side that have a noticeable performance impact here.
