performance interview questions

Top 15 performance interview questions

20411 Jobs openings for performance

Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang and Haskell. In order to get some higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.

The result is the following:


lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin

real    0m11.074s
user    0m11.070s
sys 0m0.000s


lorenzo@enzo:~/erlang$ time ./euler12.py 

real    1m16.632s
user    1m16.370s
sys 0m0.250s

python with pypy:

lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 

real    0m13.082s
user    0m13.050s
sys 0m0.020s


lorenzo@enzo:~/erlang$ erlc euler12.erl 
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys 0m0.020s


lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx 

real    2m37.326s
user    2m37.240s
sys 0m0.080s


  • C: 100%
  • python: 692% (118% with pypy)
  • erlang: 436% (135% thanks to RichardC)
  • haskell: 1421%

I suppose that C has a big advantage as it uses long for the calculations and not arbitrary length integers as the other three. Also it doesn't need to load a runtime first (Do the others?).

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.


Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?

I really tried to implement the same algorithm as similar as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.

Source codes used:

#include <stdio.h>
#include <math.h>

int factorCount (long n)
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;

int main ()
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
        index ++;
        triangle += index;
    printf ("%ld\n", triangle);

#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)

-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)  

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

Source: (StackOverflow)

StringBuilder vs String concatenation in toString() in Java

Given the 2 toString() implementations below, which one is prefered:

public String toString(){
    return "{a:"+ a + ", b:" + b + ", c: " + c +"}";


public String toString(){
    StringBuilder sb = new StringBuilder(100);
    return sb.append("{a:").append(a)
          .append(", b:").append(b)
          .append(", c:").append(c)


More importantly - given we have only 3 properties it might not make a difference, but at what point would you switch from + concat to StringBuilder?

Source: (StackOverflow)

When should I use Cross Apply over Inner Join?

What is the main purpose of using CROSS APPLY?

I have read (vaguely, through posts on the Internet) that cross apply can be more efficient when selecting over large data sets if you are partitioning. (Paging comes to mind)

I also know that CROSS APPLY doesn't require a UDF as the right-table.

In most INNER JOIN queries (one-to-many relationships), I could rewrite them to use CROSS APPLY, but they always give me equivalent execution plans.

Can anyone give me a good example of when CROSS APPLY makes a difference in those cases where INNER JOIN will work as well?


Here's a trivial example, where the execution plans are exactly the same. (Show me one where they differ and where cross apply is faster/more efficient)

create table Company (
    companyId int identity(1,1)
,   companyName varchar(100)
,   zipcode varchar(10) 
,   constraint PK_Company primary key (companyId)

create table Person (
    personId int identity(1,1)
,   personName varchar(100)
,   companyId int
,   constraint FK_Person_CompanyId foreign key (companyId) references dbo.Company(companyId)
,   constraint PK_Person primary key (personId)

insert Company
select 'ABC Company', '19808' union
select 'XYZ Company', '08534' union
select '123 Company', '10016'

insert Person
select 'Alan', 1 union
select 'Bobby', 1 union
select 'Chris', 1 union
select 'Xavier', 2 union
select 'Yoshi', 2 union
select 'Zambrano', 2 union
select 'Player 1', 3 union
select 'Player 2', 3 union
select 'Player 3', 3 

/* using CROSS APPLY */
select *
from Person p
cross apply (
    select *
    from Company c
    where p.companyid = c.companyId
) Czip

/* the equivalent query using INNER JOIN */
select *
from Person p
inner join Company c on p.companyid = c.companyId

Source: (StackOverflow)

Big O, how do you calculate/approximate it?

Most people with a degree in CS will certainly know what Big O stands for. It helps us to measure how (in)efficient an algorithm really is and if you know in what category the problem you are trying to solve lays in you can figure out if it is still possible to squeeze out that little extra performance.1

But I'm curious, how do you calculate or approximate the complexity of your algorithms?

1 but as they say, don't overdo it, premature optimization is the root of all evil, and optimization without a justified cause should deserve that name as well.

Source: (StackOverflow)

Why is my program slow when looping over exactly 8192 elements?

Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE, and is initialized at:

img[j][i] = 2 * j + i

Then, you make a matrix res[][], and each field in here is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.

    for(j=1;j<SIZE-1;j++) {
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;

That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC. From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.

I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.

Source: (StackOverflow)

Swift performance: sorting arrays

I was implementing an algorithm in Swift and noticed that the performance was very poor. After digging deeper I realised that one of the bottlenecks was something as simple as sorting arrays. The relevant part is here:

let n = 1000000
let x = Int[](count: n, repeatedValue: 0)
for i in 0..n {
    x[i] = random()
// start clock here
let y = sort(x)
// stop clock here

In C++, a similar operation takes 0.06 s on my computer.

In Python it takes 0.6 s (no tricks, just y = sorted(x) for a list of integers).

In Swift it takes 6 s if I compile it with the following command:

xcrun swift -O3 -sdk `xcrun --show-sdk-path --sdk macosx`

And it takes as much as 88 s if I compile it with the following command:

xcrun swift -O0 -sdk `xcrun --show-sdk-path --sdk macosx`

Timings in Xcode with "Release" vs. "Debug" builds are similar.

What is wrong here? I could understand some performance loss in comparison with C++, but not a 10-fold slowdown in comparison with pure Python.

Edit: mweathers noticed that changing -O3 to -Ofast makes this code run almost as fast as the C++ version! However, -Ofast changes the semantics of the language a lot — in my testing, it disabled the checks for integer overflows and array indexing overflows. For example, with -Ofast the following Swift code runs silently without crashing (and prints out some garbage):

let n = 10000000
let x = Int[](count: n, repeatedValue: 10)

So -Ofast is not what we want; the whole point of Swift is that we have the safety nets in place. Of course the safety nets have some impact on the performance, but they should not make the programs 100 times slower. Remember that Java already checks for array bounds, and in typical cases the slowdown is by a factor much less than 2. And in Clang and GCC we have got -ftrapv for checking (signed) integer overflows, and it is not that slow, either.

Hence the question: how can we get a reasonable performance in Swift without losing the safety nets?

Edit 2: I did some more benchmarking, with very simple loops along the lines of

for i in 0..n {
    x[i] = x[i] ^ 12345678

(Here the xor operation is there just so that I can more easily find the relevant loop in the assembly code. I tried to pick an operation that is easy to spot but also "harmless" in the sense that it should not require any checks related to integer overflows.)

Again, there was a huge difference in the performance between -O3 and -Ofast. So I had a look at the assembly code:

  • With -Ofast I get pretty much what I would expect. The relevant part is a loop with 5 machine language instructions.

  • With -O3 I get something that was beyond my wildest imagination. The inner loop spans 88 lines of assembly code. I did not try to understand all of it, but the most suspicious parts are 13 invocations of "callq _swift_retain" and another 13 invocations of "callq _swift_release". That is, 26 subroutine calls in the inner loop!

Edit 3: In comments, Ferruccio asked for benchmarks that are fair in the sense that they do not rely on built-in functions (e.g. sort). I think the following program is a fairly good example:

let n = 10000
let x = Int[](count: n, repeatedValue: 1)
for i in 0..n {
    for j in 0..n {
        x[i] = x[j]

There is no arithmetic, so we do not need to worry about integer overflows. The only thing that we do is just lots of array references. And the results are here—Swift -O3 loses by factor almost 500 in comparison with -Ofast:

  • C++ -O3: 0.05 s
  • C++ -O0: 0.4 s
  • Java: 0.2 s
  • Python with PyPy: 0.5 s
  • Python: 12 s
  • Swift -Ofast: 0.05 s
  • Swift -O3: 23 s
  • Swift -O0: 443 s

(If you are concerned that the compiler might optimise out the pointless loops entirely, you can change it to e.g. x[i] ^= x[j], and add a print statement that outputs x[0]. This does not change anything; the timings will be very similar.)

And yes, here the Python implementation was a stupid pure Python implementation with a list of ints and nested for loops. It should be much slower than unoptimised Swift. Something seems to be seriously broken with Swift and array indexing.

Edit 4: These issues (as well as some other performance issues) seems to have been fixed in Xcode 6 beta 5.

For sorting, I now have the following timings:

  • clang++ -O3: 0.06 s
  • swiftc -Ofast: 0.1 s
  • swiftc -O: 0.1 s
  • swiftc: 4 s

For nested loops:

  • clang++ -O3: 0.06 s
  • swiftc -Ofast: 0.3 s
  • swiftc -O: 0.4 s
  • swiftc: 540 s

It seems that there is no reason anymore to use the unsafe -Ofast (a.k.a. -Ounchecked); plain -O produces equally good code.

Source: (StackOverflow)

Why does Python code run faster in a function?

def main():
    for i in xrange(10**8):

This piece of code in Python runs in

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn't placed within a function,

for i in xrange(10**8):

then it runs for a much longer time:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this?

Note: The timing is done with the time function in BASH in Linux.

Source: (StackOverflow)

How can you profile a Python script?

Project Euler and other coding contests often have a maximum time to run or people boast of how fast their particular solution runs. With python, sometimes the approaches are somewhat kludgey - i.e., adding timing code to __main__.

What is a good way to profile how long a python program takes to run?

Source: (StackOverflow)

Why is 1000000000000000 in range(1000000000000001) so fast in Python 3?

It is my understanding that the range() function, which is actually an object type in Python 3, generates its contents on the fly, similar to a generator.

This being the case, I would have expected the following line to take an inordinate amount of time, because in order to determine whether 1 quadrillion is in the range, a quadrillion values would have to be generated:

1000000000000000 in range(1000000000000001)

Furthermore: it seems that no matter how many zeroes I add on, the calculation more or less takes the same amount of time (basically instantaneous).

I have also tried things like this, but the calculation is still almost instant:

1000000000000000000000 in range(0,1000000000000000000001,10) # count by tens

If I try to implement my own range function, the result is not so nice!!

def my_crappy_range(N):
    i = 0
    while i < N:
        yield i
        i += 1

What is the range() object doing under the hood that makes it so fast?

EDIT: This has turned out to be a much more nuanced topic than I anticipated - there seems to be a bit of history behind the optimization of range().

Martijn Pieters' answer was chosen for its completeness, but also see abarnert's first answer for a good discussion of what it means for range to be a full-fledged sequence in Python 3, and some information/warning regarding potential inconsistency for __contains__ function optimization across Python implementations. abarnert's other answer goes into some more detail and provides links for those interested in the history behind the optimization in Python 3 (and lack of optimization of xrange in Python 2). Answers by poke and by wim provide the relevant C source code and explanations for those who are interested.

Source: (StackOverflow)

MyISAM versus InnoDB

I'm working on a projects which involves a lot of database writes, I'd say (70% inserts and 30% reads). This ratio would also include updates which I consider to be one read and one write. The reads can be dirty (e.g. I don't need 100% accurate information at the time of read).
The task in question will be doing over 1 million database transactions an hour.

I've read a bunch of stuff on the web about the differences between MyISAM and InnoDB, and MyISAM seems like the obvious choice to me for the particular database/tables that I'll be using for this task. From what I seem to be reading, InnoDB is good if transactions are needed since row level locking is supported.

Does anybody have any experience with this type of load (or higher)? Is MyISAM the way to go?

Source: (StackOverflow)

Python string formatting: % vs. .format

Python 2.6 introduced the str.format() method with a slightly different syntax from the existing % operator. Which is better and for what situations?

  1. The following uses each method and has the same outcome, so what is the difference?

    sub1 = "python string!"
    sub2 = "an arg"
    a = "i am a %s" % sub1
    b = "i am a {0}".format(sub1)
    c = "with %(kwarg)s!" % {'kwarg':sub2}
    d = "with {kwarg}!".format(kwarg=sub2)
    print a    # "i am a python string!"
    print b    # "i am a python string!"
    print c    # "with an arg!"
    print d    # "with an arg!"
  2. Furthermore when does string formatting occur in Python? For example, if my logging level is set to HIGH will I still take a hit for performing the following % operation? And if so, is there a way to avoid this?

    log.debug("some debug info: %s" % some_info)

Source: (StackOverflow)

\d is less efficient than [0-9]

I made a comment yesterday on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or digit specifier than a character set.

I decided to test that out today and found out to my surprise that (in the C# regex engine at least) \d appears to be less efficient than either of the other two which don't seem to differ much. Here is my test output over 10000 random strings of 1000 random characters with 5077 actually containing a digit:

Regular expression \d           took 00:00:00.2141226 result: 5077/10000
Regular expression [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

It's a surprise to me for two reasons:

  1. I would have thought the range would be implemented much more efficiently than the set.
  2. I can't understand why \d is worse than [0-9]. Is there more to \d than simply shorthand for [0-9]?

Here is the test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
    class Program
        static void Main(string[] args)
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
                //Generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                    //Add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                //In roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                    //Replace one character with a digit, 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));

            var baseTime = testPerfomance(strings, @"\d");
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);

        private static TimeSpan testPerfomance(List<string> strings, string regex)
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            foreach (var str in strings)
                if (rex.Match(str).Success)

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;

Source: (StackOverflow)

Is the recommendation to include CSS before JavaScript invalid?

In countless places online I have seen the recommendation to include CSS prior to JavaScript. The reasoning is generally, of this form:

When it comes to ordering your CSS and JavaScript, you want your CSS to come first. The reason is that the rendering thread has all the style information it needs to render the page. If the JavaScript includes come first, the JavaScript engine has to parse it all before continuing on to the next set of resources. This means the rendering thread can't completely show the page, since it doesn't have all the styles it needs.

My actual testing reveals something quite different:

My test harness

I use the following Ruby script to generate specific delays for various resources:

require 'rubygems'
require 'eventmachine'
require 'evma_httpserver'
require 'date'

class Handler  < EventMachine::Connection
  include EventMachine::HttpServer

  def process_http_request
    resp = EventMachine::DelegatedHttpResponse.new( self )

    return unless @http_query_string

    path = @http_path_info
    array = @http_query_string.split("&").map{|s| s.split("=")}.flatten
    parsed = Hash[*array]

    delay = parsed["delay"].to_i / 1000.0
    jsdelay = parsed["jsdelay"].to_i

    delay = 5 if (delay > 5)
    jsdelay = 5000 if (jsdelay > 5000)

    delay = 0 if (delay < 0) 
    jsdelay = 0 if (jsdelay < 0)

    # Block which fulfills the request
    operation = proc do
      sleep delay 

      if path.match(/.js$/)
        resp.status = 200
        resp.headers["Content-Type"] = "text/javascript"
        resp.content = "(function(){
            var start = new Date();
            while(new Date() - start < #{jsdelay}){}
      if path.match(/.css$/)
        resp.status = 200
        resp.headers["Content-Type"] = "text/css"
        resp.content = "body {font-size: 50px;}"

    # Callback block to execute once the request is fulfilled
    callback = proc do |res|

    # Let the thread pool (20 Ruby threads) handle request
    EM.defer(operation, callback)

EventMachine::run {
  EventMachine::start_server("", 8081, Handler)
  puts "Listening..."

The above mini server allows me to set arbitrary delays for JavaScript files (both server and client) and arbitrary CSS delays. For example, gives me a 500 ms delay transferring the CSS.

I use the following page to test.

<!DOCTYPE html>
      <script type='text/javascript'>
          var startTime = new Date();
      <link rel='nofollow' href="" type="text/css" rel="stylesheet">
      <script type="text/javascript" src=";jsdelay=1000"></script> 
      Elapsed time is: 
      <script type='text/javascript'>
        document.write(new Date() - startTime);

When I include the CSS first, the page takes 1.5 seconds to render:

CSS first

When I include the JavaScript first, the page takes 1.4 seconds to render:

JavaScript first

I get similar results in Chrome, Firefox and Internet Explorer. In Opera however, the ordering simply does not matter.

What appears to be happening is that the JavaScript interpreter refuses to start until all the CSS is downloaded. So, it seems that having JavaScript includes first is more efficient as the JavaScript thread gets more run time.

Am I missing something, is the recommendation to place CSS includes prior to JavaScript includes not correct?

It is clear that we could add async or use setTimeout to free up the render thread or put the JavaScript code in the footer, or use a JavaScript loader. The point here is about ordering of essential JavaScript bits and CSS bits in the head.

Source: (StackOverflow)

How to efficiently count the number of keys/properties of an object in JavaScript?

What's the fastest way to count the number of keys/properties of an object? It it possible to do this without iterating over the object? i.e. without doing

var count = 0;
for (k in myobj) if (myobj.hasOwnProperty(k)) count++;

(Firefox did provide a magic __count__ property, but this was removed somewhere around version 4.)

Source: (StackOverflow)

Replacing a 32-bit loop count variable with 64-bit introduces crazy performance deviations

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark

#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
        startP = chrono::system_clock::now();
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;


As you see, we create a buffer of random data, with the size being x megabytes where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times. We measure the times for the popcount. In the upper case, the inner loop variable is unsigned, in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you see, the throughput of the uint64_t version is only half the one of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 teest.cpp -o test

Result: test 1

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

uint64_t size = atol(argv[1]) << 20;


uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Now, both versions are equally fast. However, the unsigned got even slower! It dropped from 26 to 20 GB/s, thus replacing a non-constant by a constant value lead to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even lead to slow code in both cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

Here are my results in g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my collegue's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++, clang++ does not seem to care about static.

My question

Can you explain these results? Especially:

  • How can there be such a difference between u32 and u64?
  • How can replacing a non-constant by a constant buffer size trigger less optimal code?
  • How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my collegue's computer!

I know that optimization is a tricky territory, however, I never thought that such small changes can lead to a 100% difference in execution time and that small factors like a constant buffer size can again mix results totally. Of course, I always want to have the version that is able to popcount 26 GB/s. The only reliable way I can think of is copy paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with most performance?

The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize:

popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize:

popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version looks even very short and good. Can anyone explain this?

Lessons learned

No matter what the answer to this question will be; I have learned that in really hot loops every detail can matter, even details that do not seem to have any association to the hot code. I have never thought about what type to use for a loop variable, but as you see such a minor change can make a 100% difference! Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.

Source: (StackOverflow)