Using virtual dispatch can be relatively expensive in terms of clock cycles due to multiple levels of indirection, including indirect branching as well as this pointer adjustment. Wise programmers do not use virtual dispatch without a good reason, but oftentimes it is required, either by design or when creating non-template reusable components/libraries where the final implementation of some parts of the program is not known.

Optimizing compilers do a great deal of work trying to optimize virtual dispatch as much as possible, sometimes even eliminating it from the final program entirely. Let’s take a quick look at how compilers implement virtual dispatch, how it can be optimized away, and go over a few tricks that programmers can use to achieve better runtime performance.

Virtual Dispatch Mechanics

The implementation of virtual dispatch in C++ is not covered by the language standard. It is up to the compiler to decide how to implement it. All of the compilers that I’ve worked with implement it using a virtual method table concept. The details of an exact implementation tend to get very complex once multiple inheritance, virtual inheritance and optimizations get involved. For the simplest cases, one can think of a straightforward implementation where a base class has a pointer to an array of function pointers. To demonstrate this, let’s write a functional equivalent of the below C++ program in C.

Simple C++ Example Using Virtual Dispatch
struct A {
    virtual ~A() {}
    virtual int foo() { return 0; }
};

int main() {
    A a;
    return a.foo();
}

Assuming that no optimization is performed, the above code can be described in C like this:

Simple Virtual Table Example in C
struct A;

/*
 * A virtual table for class A implicitly generated by C++ compiler.
 * It consists of two "methods" - a pointer to a destructor of class A
 * and a pointer to A's "foo" virtual method implementation. In both cases
 * "this" pointer is a pointer to base class A which is passed to methods
 * implicitly.
 */
struct A_vtable {
        void (*destructor)(struct A *this);
        int (*foo)(struct A *this);
};

struct A {
        /* A class implicitly gets a "vtable" pointer as its first field */
        const struct A_vtable *vtable;
};

/* Implementation of A's destructor */
void A_destructor(struct A *this) {
}

/* Implementation of A's foo method. */
int A_foo(struct A *this) {
        return 0;
}

/* Virtual table for class A generated by the compiler */
static const struct A_vtable A_vtable_instance = {
        &A_destructor,
        &A_foo
};

int main() {
        int ret;
        struct A a;

        /*
         * Initialize class A by setting its virtual table pointer.
         * This pointer can be set to a different "table" instance
         * in case of derived classes. This happens implicitly during
         * class construction.
         */
        a.vtable = &A_vtable_instance;

        /* Invoke A's foo method through the virtual table. */
        ret = (a.vtable->foo)(&a);

        /*
         * Call A's destructor. In our case this happens implicitly
         * because of RAII.
         * http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization
         */
        (a.vtable->destructor)(&a);

        /* Return from main */
        return ret;
}

By looking at how virtual functions are called, it becomes apparent that invoking a virtual function has some overhead. Basically, three steps need to be performed to call a virtual function:

  1. An offset must be applied to the virtual table to obtain the address of the entry for the function to call.
  2. That entry must be dereferenced to read the address of the function.
  3. An indirect call must be made to the function by the address stored in a register.

x86_64 Machine Code

To demonstrate this, let’s look at a disassembled function that takes an instance of class A by reference and calls its virtual foo() method:

Simple Invocation of Virtual Method in C++
int call(A& obj) {
    return obj.foo();
}

GCC would generate the following code for x86_64 with all optimizations turned off:

Equivalent code in x86_64 assembler (unoptimised)
push   %rbp
mov    %rsp,%rbp
sub    $0x10,%rsp
mov    %rdi,-0x8(%rbp)
mov    -0x8(%rbp),%rax
mov    (%rax),%rax
add    $0x10,%rax
mov    (%rax),%rax
mov    -0x8(%rbp),%rdx
mov    %rdx,%rdi
callq  *%rax
leaveq
retq

Putting irrelevant parts aside, there are four instructions involved in the virtual function call:

  1. Load vtable base address: mov (%rax),%rax
  2. Apply offset to get address of the function to call: add $0x10,%rax
  3. Read the address of the function: mov (%rax),%rax
  4. Perform indirect call: callq *%rax

With optimization turned on, GCC omits the frame pointer, gets rid of one extra memory load, and does not move parameters around on the stack. The code is much better, yet the most expensive operations are still there:

Equivalent code in x86_64 assembler (optimised)
   mov    (%rdi),%rax
   mov    0x10(%rax),%rax
   jmpq   *%rax

In more complex scenarios, such as multiple inheritance, the compiler might also need to adjust the value of the this pointer that is implicitly passed to the virtual function so that it points to the correct base class in the inheritance hierarchy. One more level of indirection might also be needed in the case of virtual inheritance.

Optimizations

There are a few optimizations that compilers can do to improve the runtime performance when it comes to virtual dispatch.

Extracting the Function Pointer

Calling a virtual function involves multiple steps. When a virtual function is called multiple times, such as calling the function in a loop or subsequently calling the same function on the same object, the compiler optimizes the code by performing the following steps just once:

  1. Loading the member pointer from the virtual table. It consists of a function pointer and the information needed to adjust the this pointer.
  2. Adjusting the this pointer.
  3. Extracting the function pointer from the member pointer.

After that, each subsequent call simply invokes the function through the extracted function pointer with the already-adjusted this pointer.

GCC Extension

Function pointer extraction is “unsafe” and “dangerous”, and is normally done by the compiler automatically, behind the scenes. GCC, unlike other compilers, provides a non-standard extension for brave developers who know what they are doing, so that they can do this manually. It is part of the pointer-to-member function (PMF) conversion functionality and is described in §7.6 of the GCC manual:

7.6 Extracting the function pointer from a bound pointer to member function

In C++, pointer to member functions (PMFs) are implemented using a wide pointer of sorts to handle all the possible call mechanisms; the PMF needs to store information about how to adjust the this pointer, and if the function pointed to is virtual, where to find the vtable, and where in the vtable to look for the member function. If you are using PMFs in an inner loop, you should really reconsider that decision. If that is not an option, you can extract the pointer to the function that would be called for a given object/PMF pair and call it directly inside the inner loop, to save a bit of time.

Note that you still pay the penalty for the call through a function pointer; on most modern architectures, such a call defeats the branch prediction features of the CPU. This is also true of normal virtual function calls.

The syntax for this extension is

extern A a;
extern int (A::*fp)();
typedef int (*fptr)(A *);

fptr p = (fptr)(a.*fp);

For PMF constants (i.e. expressions of the form &Klasse::Member), no object is needed to obtain the address of the function. They can be converted to function pointers directly:

fptr p1 = (fptr)(&A::foo);

You must specify -Wno-pmf-conversions to use this extension.

Manual PMF

This is also possible to do with other compilers, though in a far more dangerous manner. For example, given a member pointer and knowledge of the ABI of a given platform (for example, the Itanium C++ ABI describes member pointer layout in §2.3 Member Pointers), programmers can “extract” the this adjustment information as well as the raw function pointer, essentially performing the PMF conversion manually.

Devirtualization

In order to reduce the overhead and improve runtime performance, optimizing compilers attempt to convert calls to virtual functions to direct calls. This is done both within a procedure and interprocedurally as part of indirect inlining and interprocedural constant propagation.

This optimization technique is implemented by most production-grade C++ compilers; for example, both GCC and clang perform it. To demonstrate what it does, let’s say we have a piece of code written by a third-party company that does a lot of great things and looks like this:

A great API
struct A {
    virtual ~A() {}
    virtual int foo() { return 0; }
};

inline int do_something(A& obj) {
    return obj.foo();
}

At this point, there isn’t much that can be done because the end user’s use case is still unknown. But let’s say the user of the above API writes the following program:

A great API end-usage
struct B : A {
    virtual ~B() {}
    virtual int foo() { return 1; }
};

int main() {
    B b;
    return do_something(b);
}

By looking at the code, it is obvious that class B is the final class in the inheritance hierarchy. Therefore, when the do_something() function is called, calling B’s foo() method directly would be functionally equivalent to calling it through the virtual table.

When compiling the above program with optimization turned on, the compiler takes that information into account. Since it has the definition of the do_something() function body, a wise compiler chooses to perform a direct call or even inline the body of the derived class’s virtual function, which in the above case is B::foo().

Interestingly enough, since all of the code is known to the compiler, it optimizes the program further, eventually reducing it to just a few CPU instructions:

_main:
  mov    eax,0x1
  ret

From the above observation we can conclude that when the compiler has all of the information it needs, it optimizes away the virtual dispatch.

Link Time Optimization

Needless to say, the above example is very simple and naive. In real life it is unlikely that the do_something() function body will be known. Because virtual dispatch was chosen over template meta-programming, the body of the function is likely complex and resides in a compiled object file. In that case the compiler is not able to deduce the information it needs to perform devirtualization, which makes this optimization unlikely to occur in real life.

There is a way to make it work though. Both GCC and clang provide a feature called Link Time Optimization (aka LTO). When this feature is enabled, compilers generate object files in an intermediate language: GCC uses GIMPLE for this purpose, while Clang uses LLVM bitcode. Such “intermediate” object files can later be used by the compiler to perform intermodular optimizations. At the same time, the source code does not have to be distributed, which makes it great for paranoid software vendors. Intermodular optimizations, which in the case of LTO are performed at the linking stage, allow the compiler to inline functions defined in separate object files or, most importantly in our case, perform devirtualization across different translation units.

Getting back to our simple example, given that the user program is stored in file1.cc and the do_something() function is defined in file2.cc, the compiler does not perform devirtualization. However, when both files are compiled with LTO enabled, the resulting program is optimized down to two CPU instructions. Here is an example of compiling and linking the two files with LTO using GCC:

$ g++ -Wall -O3 -flto -c ./file1.cc
$ g++ -Wall -O3 -flto -c ./file2.cc
$ g++ -Wall -O3 -flto ./file1.o ./file2.o -o prog

The only time this won’t work is when the code is provided in the form of a shared object, because LTO does not work across shared object boundaries.

So if you plan on providing a library that must use virtual dispatch and is performance-critical to the point where the tradeoffs of virtual dispatch start to matter, consider distributing the code in the form of a pre-compiled static library built with LTO support.

C++11 Final Keyword

Sometimes compilers are not able to determine that a class is the final class in the inheritance hierarchy. Consider the following code:

struct A {
    virtual ~A() {}
    virtual int foo() { return 0; }
};

struct B : A {
    virtual ~B() {}
    virtual int foo() { return 1; }
};

int do_something(B& b) {
    return b.foo();
}

In the above example, the compiler does not know whether some other class could inherit from class B. Therefore, it will not devirtualize the invocation of b.foo() when generating the code for the do_something() function body.

If, however, nothing else inherits from B, then programmers might want to consider using the final keyword that was introduced in C++11. Once class B is marked as final, the compiler knows that B is the most derived class in the inheritance tree and can immediately optimize away such indirect calls.

struct B final : A {
    virtual ~B() {}
    virtual int foo() { return 1; }
};

Devirtualization in C

Even though the C language does not have the concept of virtual dispatch built in, compilers do recognize the pattern and can still optimize away the indirect calls. Both GCC and clang compiled the simple C virtual dispatch example program from earlier into, well, almost nothing. Here is what Clang generated:

00000000004004b0 <main>:
  4004b0:       31 c0                   xor    eax,eax
  4004b2:       c3                      ret
  4004b3:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  4004ba:       00 00 00
  4004bd:       0f 1f 00                nop    DWORD PTR [rax]

And here is what GCC has done:

0000000000400400 <main>:
  400400:       31 c0                   xor    eax,eax
  400402:       c3                      ret
  400403:       90                      nop
Thanks for reading!
_the_end:
  as always
  thanks for
  reading
  nop
  nop

See Also