Resources

Templates across API boundaries: Powerful but problematic

C++ is controversial. For every discussion about it, a certain percentage of people will love it and another substantial group will hate it.

Above all, one feature highlights this divide: templates. Templates have many problems: they slow compile times, force implementations to be specified in headers and can bloat generated code. However they’re also a powerful mechanism for programming against generic types.

Undo software engineer, Stephen Cross, explores various methods of using templates effectively in this article for Embedded.com. See the original article or read on for more.

There can be no doubt that templates are a key part of the C++ language; the standard library relies on them heavily and they’ve been adopted throughout libraries such as Boost. This series of articles explains the problems with C++ templates and outlines new and exciting approaches for enhancing their capabilities and eliminating their drawbacks. In this first part, I define templates and discuss problems engineers have faced in dealing with templates across API boundaries. 


What are templates?

Here’s a very simple example: 

template <typename T>
T divide(T a, T b) {
   return a / b;
}

This is a template for a function which works for any type ‘T’, which you can instantiate with ‘T=int’, ‘T=float’ etc. For example:

assert(divide<int>(5, 2) == 2);
assert(divide<float>(5.0f, 2.0f) == 2.5f);

Of course, the utility of templates is clearer when you’re building containers, such as resizable arrays, lists, etc, all of which are in the C++ standard library.

The key point to remember is that there is no such function ‘divide’; it’s a template for creating functions. The real functions appear when the compiler performs substitutions; a function is generated for each instantiation.


Templates in APIs

Consider the following:

a.cpp:

template <typename T>
T divide(T a, T b);
void f() {
   int a = divide<int>(b, c);
   // ...
}

b.cpp:

template <typename T>
T divide(T a, T b) {
    return a / b;
}

Is this going to work?

We’ve declared the template in a.cpp and implemented it in b.cpp. For a normal function, this would work splendidly.

But it won’t work for templates. Remember, they’re templates of a function, not real functions themselves. The compiler needs to generate the real functions, and to do that it needs to see all the template implementation.


The export keyword

No discussion of this topic would be complete without mentioning the mysterious export keyword.

It turns out that C++ developers have known about these problems for a long time. Hence the ‘export’ keyword was invented in a C++ standard to be a solution to this problem. Unfortunately it turns out to only make things worse and has only ever been implemented by the Edison Design Group.

Consider the same example as above, with the ‘export’ keyword:

a.cpp: 

export template <typename T>
T divide(T a, T b);
void f() {
    int a = divide<int>(b, c); 
    // ...
}

b.cpp:

template <typename T>
T divide(T a, T b) {
    return a / b;
}

If you happen to be using the one compiler that supports ‘export’, this will work. But how?

Unfortunately the answer is quite unnerving. The compiler will, as normal, run on each of the translation units but it will produce an intermediate representation of the C++ code. Then the linker will be invoked, which will then invoke the compiler.

It really says something that EDG themselves pushed for ‘export’ to be removed from the C++ standard. For more information, see Herb Sutter’s analysis.


Re-establishing the boundary

So there are some serious issues with C++ templates, but this doesn’t stop them being useful. Most C++ developers heavily use std::vector<>, std::list<>, etc., all of which rely on templates.

The question is how to get the good of templates without the bad, or the ugly.

When I looked at this, there seemed to be unavoidable conclusion: separate compilation must work.

In other words, this must work:

a.cpp:

template <typename T>
T divide(T a, T b);
void f() {
    int a = divide<int>(b, c);
    // ...
}

b.cpp:

template <typename T>
T divide(T a, T b) {
    return a / b;
}

Even if a.cpp and b.cpp are compiled straight down to machine code on different sides of the planet, this has to work. We want fast compile times and we don’t want to have to use new object file formats. We want template functions to be real functions. And we still want the generated code to be fast.

Can we do this?

It turns out, we can.

The following describes how the Loci programming language resolves these issues, but the approach could be easily generalised to any low-level programming language.


Predicates

Before we even get down to code generation difficulties, we need a way to detect when the templates are being incorrectly. C++ achieves this by substituting the types into the templated code and then checking if errors occur.

Clearly, this approach doesn’t work for separate compilation because either we don’t know what the templated code is going to be or we don’t know the arguments we’ll be substituting into it. Furthermore, this approach leads to the well known terrible C++ compiler errors.

Loci resolves this issue using predicates, which are boolean expressions that specify what properties a type must have. For example:

template <typename T>
class array_type {
    T copy() require(is_copyable<T>);
};

So this says the array is only copyable if each element is copyable. The compiler checks that the element type is copyable when we call the copy() method. It also separately checks that the implementation of array_type::copy() doesn’t require any extra functionality of the element type that isn’t specified.

This functionality has been implemented and used in Loci very successfully. Similar approaches have been planned for C++, in the form of concepts, but they’ve largely stalled. More recently a very interesting proposal for ‘concepts-lite’ appears to be gaining steam and that also uses predicates.


Run-time Templates

One of the biggest steps taken in Loci is to switch templates from a compile-time feature to a run-time feature (i.e. each function computes its template arguments at run-time). This completely removes the need for the compiler to see template function implementations when it’s compiling their uses. Now instantiation occurs (in principle) at run-time, the compiler just has to emit a call with an extra parameter.

At this point I usually have to explain how run-time templates can be as fast as compile-time templates. Clearly, we’re not just always evaluating templates at run-time because that would have to be slower.

This is when optimisations step into the picture. Loci compiler’s code generation will emit LLVM IR. Then LLVM’s optimisations will start working and, where it can, you’ll see the compile-time template instantiations effectively happening as the optimisation passes inline the templated functions. Importantly, the optimisations will only do this when it benefits performance, so you won’t be paying for unnecessary code bloat or hitting unnecessary cache misses.

And that’s the point: by switching templates from a forced compile-time mechanism to an allowed-to-be-at-run-time mechanism we give the compiler more flexibility. This means that if you want to do a fast incremental build the compiler can omit most optimisations and give you a very speedy compile time. However if you want to sacrifice compile-time to get better run-time performance, we can do even better here as well.


But what about genuine separate compilation?

This is, as you may remember, exactly the case C++ can’t handle. So you’re covering new ground here!

Having said that, we do want this case to be fast. Let’s assume you can’t use link time optimisation and you have an API boundary that does use templates. How fast will it be?

It’s time to step into the implementation.


Implementation

I arrived at the design stage of the implementation of templates with three criteria:

  • No heap allocation - For a systems level language we don’t want to be adding any overhead. Requiring heap allocation for a key language feature is unacceptable.
  • Minimal overhead - No linear searches of big tables, we need well-defined and contained worst case, etc.
  • It must work with other language features.


Naive solution: Template argument arrays

The most obvious approach is to pass an array to each function with its template arguments. So if a function wants to get its Nth argument, it just pulls that out of slot N in the array.

For example:

void f() {
    g<int>();
}
template <typename T>
void g() {
    h<T, float>();
}
template <typename A, typename B>
void h() { }

When we call h() here, it’ll get [int, float] as its template arguments. Here’s some C code to demonstrate how this works:

void f() {
    type g_args[] = { int_vtable };
    g(&g_args[0]);
}
void g(type* args) {
    type h_args[] = { args[0], float_vtable };
    h(&h_args[0]);
}
void h(type* args) {
    // ...
}

Hopefully it’s fairly obvious what’s going on here. We’re using the stack to hold the template arguments, building a new array of template arguments for each function call.

Looking back at our criteria, it doesn’t need heap allocation, it’s definitely fast (optimisations will eat this up) and the compiler shouldn’t have trouble compressing stack usage. However there is a subtle problem which means it fails the last criterion (“It must work with other language features”): it breaks polymorphic references to templated classes.

To make this clearer, here’s an example where this wouldn’t work in Loci:

template <typename T>
class Example {
    void do_thing();
}
interface DoThingable {
    void do_thing();
}
void f(Example<int>& object) {
    g(object);
}
void g(DoThingable& object) {
    object.do_thing();
}

Did you see it?

Loci supports structural typing, so the fact that Example contains all the methods that DoThingable requires is enough to make the cast valid. But the cast in f() from ‘Example<int>’ -> ‘DoThingable’ obscures the template arguments of the ‘Example’ class when it’s used in g(), and it’s possible that the implementation of Example::do_thing() relies on those template arguments.

Even in a language without structural typing, any polymorphic cast could be made from a templated type to a non-templated interface or virtual base class.


So how is Example::do_thing() going to get its template arguments?

The problem here is getting the template arguments to Example::do_thing() when g() calls it, even though g() doesn’t know these template arguments itself (or even that it’s calling into a templated class).

One approach is to store the template arguments in the memory of the ‘Example’ object. But since via structural typing any object can be cast to an interface, we’ve just added that extra overhead to every object. If objects are allocated on the heap then optimisations will be largely thwarted from eliminating the template arguments. This isn’t acceptable for Loci and most other languages would find it introduces a large unavoidable space cost.

Maybe we can store the template arguments with the reference. Unfortunately there could be any number of template arguments, and even for a tight upper limit of the number of arguments reference types would suddenly be huge (e.g. 8 times sizeof(pointer)). Worse, each template argument can be a class type with its own list of template arguments (e.g. A<B, C<D, E<F>>, G>). This isn’t acceptable either.

Why don’t we just have a pointer to the template arguments on the stack? For this case it might just work. But in general the developer could store the interface reference somewhere and then call it later, after our stack frames have unwound. Now we have a use-after-free!


Next time

All these approaches have serious problems, but we’re heading in the right direction. In the next part I’ll explain a much more viable approach that uses trees of compiler-generated functions. This unlocks what we’ve been looking for: templates across API boundaries.