Baby steps in SIMD (SSE/AVX)

In case you have never used SIMD instructions, this post explores the real basics. For example: what is SIMD? SIMD stands for “Single instruction, multiple data”. We’re computing more than one “math problem” with a single instruction. CPUs have had instructions to do this for a long time. If you remember the “Pentium MMX” hype – that was the first time SIMD instructions came to the x86 architecture.

However, with some trickery, you can do some limited SIMD without actually using these instructions. Let’s say we want to add 1 to two values at the same time. If we put these two values right next to each other in memory, we can interpret them as a single larger datatype. That’s not all that straightforward to understand, so here’s an example: you can interpret two 8-bit values right next to each other as one 16-bit value, right? To increment both values at the same time, you do value + 0x0101, which is just one assembly instruction. So with no special instructions at all, on a 64-bit platform you can increment eight 8-bit values at the same time by adding 0x0101010101010101.

Okay, that feels pretty hacky and unreliable. Once you’ve incremented a value 256 times, you’ll have spilt into the neighboring value! That’s pretty bad.

So SSE provides 128-bit registers that allow you to comfortably work on e.g. four 32-bit floats at the same time, without any spilling. AVX provides 256-bit registers, and AVX512 provides 512-bit registers. Woo! Unfortunately AVX512 isn’t widely available yet.

So how do you use this? Let’s start with SSE, though you’ll see that updating code to use AVX or AVX512 instead is pretty easy. We’ll look at some very basic example code to add two vectors together.

#include <xmmintrin.h> // Need this in order to be able to use the SSE "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[4], b[4], result[4]; // a and b: input, result: output
    __m128 va, vb, vresult; // these vars will "point" to SIMD registers

    // initialize arrays (just {0,1,2,3})
    for (int i = 0; i < 4; i++) {
        a[i] = (float)i;
        b[i] = (float)i;
    // load arrays into SIMD registers
    va = _mm_loadu_ps(a); //
    vb = _mm_loadu_ps(b); // same

    // add them together
    vresult = _mm_add_ps(va, vb);

    // store contents of SIMD register into memory
    _mm_storeu_ps(result, vresult); //

    // print out result
    for (int i = 0; i < 4; i++) {
        printf("%f\n", result[i]);

That doesn’t seem so hard, does it? To access SIMD instructions without writing assembly code, we use something called “intrinsics”, which make the SIMD instructions look like regular C functions. Don’t worry though, these functions are inline and mostly just consist of the assembly instruction itself, so you probably won’t see any difference in performance.

In this example, we’re using three intrinsics, _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps. _mm_loadu_ps copies four float values from memory into the SSE register. We do this twice and are thus using two SSE registers. (We have 16 SSE registers available on 64-bit CPUs.) Then, we use _mm_add_ps to, in a single instruction, add the four floats in one register to the corresponding floats in the other register. (So we get a[0]+b[0], a[1]+b[1], a[2]+b[2], a[3]+b[3].) This is stored in a third SSE register. Using _mm_storeu_ps, we put the contents of this result register into the result float array.

We can compile and run this without any extra linking:

$ gcc -Wall -o sse_test sse_test.c 
$ ./sse_test

Wow, it worked!

_mm_loadu_ps/_mm_storeu_ps have sister functions without the ‘u’. These functions require memory alignment, which just means that the memory has to start at an address that is cleanly divisible by a certain number, which mostly increases performance (unless something unfortunate happens in the CPU caching department).

To get the alignment, we just declare the arrays like this:

    float a[4] __attribute__ ((aligned (16)));
    float b[4] __attribute__ ((aligned (16)));
    float result[4]  __attribute__ ((aligned (16)));

And then change all instances of _mm_loadu_ps/_mm_storeu_ps to _mm_load_ps/_mm_store_ps.  Intel’s documentation states that we need 16-byte alignment. And GCC’s syntax just looks a bit obscure. It’s described here:

Cool, that’s SSE. What about AVX? Well, it turns out that we just need to change the included header file, the array sizes and the names of the intrinsics! (Note that you can include all intrinsics available by doing #include <x86intrin.h> instead.)

So here’s the same thing using AVX, and with aligned memory accesses:

#include <immintrin.h> // Need this in order to be able to use the AVX "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[8] __attribute__ ((aligned (32))); // Intel documentation states that we need 32-byte alignment to use _mm256_load_ps/_mm256_store_ps
    float b[8]  __attribute__ ((aligned (32))); // GCC's syntax makes this look harder than it is:
    float result[8]  __attribute__ ((aligned (32)));
    __m256 va, vb, vresult; // __m256 is a 256-bit datatype, so it can hold 8 32-bit floats

    // initialize arrays (just {0,1,2,3,4,5,6,7})
    for (int i = 0; i < 8; i++) {
        a[i] = (float)i;
        b[i] = (float)i;

    // load arrays into SIMD registers
    va = _mm256_load_ps(a); //
    vb = _mm256_load_ps(b); // same

    // add them together
    vresult = _mm256_add_ps(va, vb); //

    // store contents of SIMD register into memory
    _mm256_store_ps(result, vresult); //

    // print out result
    for (int i = 0; i < 8; i++) {
        printf("%f\n", result[i]);
    return 0;

So let’s compile that:

gcc -Wall -o avx256_test_aligned avx256_test_aligned.c 
avx256_test_aligned.c: In function ‘main’:
avx256_test_aligned.c:15:8: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
     va = _mm256_load_ps(a); //
In file included from /usr/lib/gcc/x86_64-linux-gnu/6/include/immintrin.h:41:0,
                 from avx256_test_aligned.c:1:
/usr/lib/gcc/x86_64-linux-gnu/6/include/avxintrin.h:852:1: error: inlining failed in call to always_inline ‘_mm256_store_ps’: target specific option mismatch
 _mm256_store_ps (float *__P, __m256 __A)
avx256_test_aligned.c:18:5: note: called from here

Oh no, what happened? It didn’t complain when we used SSE instructions (perhaps because all CPUs of the implicitly selected architecture (x86_64) support SSE, which was first introduced a very long time ago), but it’s complaining that our use of AVX instructions is causing a “target-specific option mismatch”. That’s a bit cryptic, but it means that our target (“vanilla” x86_64) does not support AVX instructions. To make this work, we need to supply the -mavx option:

$ gcc -Wall -mavx -o avx256_test_aligned avx256_test_aligned.c 
$ ./avx256_test_aligned 

Nice! BTW, for AVX512, we just need to change the 256s to 512s and the array index 8s to 16s, and supply -mavx512f to gcc.

Addendum: if you execute the AVX512 code on a CPU that doesn’t support it, you get this:

gcc -mavx512f -Wall -o avx_test_aligned avx_test_aligned.c 
Illegal instruction

Second addendum: if you use the aligned instructions without actually aligning your arrays, you get this:

$ ./avx_with_bad_alignment
Segmentation fault

Let me know if you have any questions.

How to find out if an executable uses (e.g.) SIMD instructions (includes jq mini-tutorial!)

“Embarrassingly parallel” algorithms can often make use of SIMD instructions like those that came with the SSE and AVX extensions. In the Python world, numpy is a very popular package to work with arrays. One of the first things I wondered when I started using numpy was, “How optimized is numpy?” Some quick investigation shows that it’s multi-threaded, and some googling shows that it uses SIMD instructions:

Now, it’s a bit tedious to grep for strings like VADDPD in the disassembly, so this post develops a nicer method.

For the impatient, here’s an unorthodox dirty one-liner (it creates a temporary file) that does this for you. It requires jq and internet access to download a database.

tempfile=`mktemp`; curl | cpp | sed -n '/^{/,/^}/ { p }' | jq '[ .instructions | .[] | { (.[0]): .[4] } ] | add' > $tempfile; objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | awk '{print $2}' | grep -v : | sort | uniq | while read line; do echo -n "$line  "; output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" $tempfile); if [ -z "$output" ]; then echo; else echo $output; fi; done > output_test; rm $tempfile

Note that it is not able to distinguish between e.g. AVX and AVX512. It always prints out the most advanced extension possible, so it will print out AVX512 if any AVX is used. If you want something better, check out the Node.js version at the bottom of this post.

And around this point we start the explanation for the less impatient readers: first of all, we need a database of CPU instructions, and a simple Google query brings up this: (The following discussion is based on commit 488b6d986964627f0b130b5265722dde8d93f11d.)

This project is in JavaScript, and the data file isn’t quite in JSON, so let’s do some minor preprocessing first to make our database easier to use:

cpp x86data.js | sed -n '/^{/,/^}/ { p }' > json

cpp is the C preprocessor to remove comments (there are comments and even multi-line comments in the actual data). The sed bit looks for a line starting with a { and after that a line starting with a }, all the while printing out this whole block.

Next, we need to get a disassembly. Here’s an example for numpy’s .so files:

objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | grep -P "^ +[0-9a-z]+:" | awk '{print $2}' | sort | uniq > numpy_instructions

This will get us all instruction mnemonics used. We get a file like this:


Let’s go back to our data. Today, we’ll use jq as our main tool to get the job done (though it will be many times slower than if we wrote a simple script that loads the hash once and re-uses it for every input instruction). If we just want the instructions block, we could do this:

jq '.instructions' json > instructions

However, this tool is a real Swiss army knife. We can use the familiar concept of piping, and we can wrap things in arrays or hashes just by enclosing expressions in [] or {}. Here’s an entire command to get an array of hashes containing only the instruction and the corresponding extension from the json file:

jq '[ .instructions | .[] | {instruction: .[0], extension: .[4] } ]' json

.[] iterates over the array inside the instructions key. Every item in the array is piped to a bit of jq code that creates a hash with an instruction and an extension key, which correspond to array element 0 and 4 in the input data. So we get output like this:

    "instruction": "aaa",
    "extension": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
    "instruction": "aas",
    "extension": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"

Now we’re going to do something slightly naughty. The extension field isn’t the same for all instructions with the same mnemonic, as different opcodes with the same mnemonics have been added to the instruction set over time. However, we don’t need to be that precise IMO, so we’re just going to merge everything into an object like {“mnemonic”: “extension info”}. First, let’s get an array of hashes:

jq '[ .instructions | .[] | { (.[0]): .[4] } ]' json | head
    "aaa": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
    "aas": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
    "aad": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U"

Now we just need to pipe this into the add filter to merge this array of hashes/objects into a single hash/object:

jq '[ .instructions | .[] | { (.[0]): .[4] } ] | add' json > mnem2ext.json

And the result is:

  "aaa": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W",
  "aas": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W",
  "aad": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U",
  "aam": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U",
  "adc": "X64              OF=W SF=W ZF=W AF=W PF=W CF=X",
  "add": "X64              OF=W SF=W ZF=W AF=W PF=W CF=W",
  "and": "X64              OF=0 SF=W ZF=W AF=U PF=W CF=0",
  "arpl": "X86 ZF=W",
  "bndcl": "MPX X64",

Wee! But how do we access the information in this file? Well, with jq of course (not efficient though):

while read line; do echo -n "$line  "; jq ".$line" min.json; done < numpy_instructions

Here’s an extract from the output:

cvttpd2dq  "SSE2"
cvttps2dq  "SSE2"
cvttsd2si  "SSE2 X64"
cvttss2si  "SSE X64"
cwde  "ANY"
div  "X64              OF=U SF=U ZF=U AF=U PF=U CF=U"
divpd  "SSE2"
divps  "SSE"
divsd  "SSE2"
divss  "SSE"
fabs  "FPU              C0=U C1=0 C2=U C3=U"
fadd  "FPU              C0=U C1=W C2=U C3=U"

Such a nice mix of instructions. <3 We have a few problems though. Here are some instructions that couldn’t resolved:

cmova  null
cmpneqss  null
ja  null
rep  null
seta  null

A closer look at our database reveals that some instructions have slashes in them, like “cmova/cmovnbe”. These are aliases, so we should be able to detect these as well. jq sort of allows to search for keys using regex, though the syntax isn’t easy, and the bash escaping makes things a bit worse:

while read line; do echo -n "$line  "; jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\")))" min.json; done < numpy_instructions > output

Things have gotten a bit slower again, and the rest of our output looks a bit different too:

xor  {
  "xor": "X64              OF=0 SF=W ZF=W AF=U PF=W CF=0"
xorpd  {
  "xorpd": "SSE2"
xorps  {
  "xorps": "SSE"

We can’t get rid of the echo, otherwise we’ll have no way to tell if jq is finding the mnemonic or not. So we’ll use jq to fix the format. Here’s an easy example:

echo '{ "b": "c" }' | jq 'to_entries[]'
    "key": "b",
    "value": "c"
echo '{ "b": "c" }' | jq 'to_entries | .[] | .value'

Here, we’re just converting the hash into an array (as we did above with with_entries), and only select the .values. We can just pipe this within jq:

while read line; do echo -n "$line  "; jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" min.json; done < numpy_instructions > output

However, we don’t get a newline when we didn’t find an instruction, so we work around this in bash:

while read line; do echo -n "$line  "; output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" min.json); if [ -z "$output" ]; then echo; else echo $output; fi; done < numpy_instructions > output

That leaves mostly pseudo-instructions. The following pseudo-instructions are not included in this database but would indicate SSE2: CMPEQPD, CMPLTPD, CMPLEPD, CMPUNORDPD, CMPNEQPD, CMPNLTPD, CMPNLEPD, CMPORDPD. These all belong to the CMPPD instruction introduced in SSE2, as far as I can tell. ( It would make sense to have them in the database in this case, but I think I’ll leave well enough alone for now though.

Anyway, doing something like awk ‘{print $2}’ output | sed s/\”//g | sort | uniq shows that my numpy version may use instructions from the following sets:


Well, that’s great. Let’s package this up into a shell script so it’s a bit easier to use. Just stick it in a directory that has cpu_extensions.min.json in it and it’ll work.


json_file=$(dirname $0)/cpu_extensions.min.json
objdump --no-show-raw-insn -M intel -d $* | grep -P "^ [0-9a-z]+:" | awk '{print $2}' | sort | uniq | while read line; do
    echo -n "$line  "
    output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" $json_file);
    if [ -z "$output" ];
        then echo;
        echo $output | sed -e 's/"//g' -e 's/ .*//g'

Also, here’s a more efficient (O(n)) implementation in Node.js. It gets away with much less pre-processing, all you have to do is:

sed -n '/^{/,/^}/ { p }' x86data.js > cpu_extensions.json

However, it doesn’t execute objdump for you, so you have to call it like this:

show_cpu_extensions.js <(objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | grep -P "^ +[0-9a-z]+:" | awk '{print $2}' | sort | uniq)

I’ve also made it display all possible extensions.


var database_file;
var disassembly_file;

if (process.argv.length == 3) {
    // Use default database
    database_file = __dirname + "/cpu_extensions.json";
    disassembly_file = process.argv[2];
} else if (process.argv.length == 4) {
    database_file = process.argv[2];
    disassembly_file = process.argv[3];
} else {
    console.log("Usage: " + process.argv[1] + " [database] disassembly");

var fs = require("fs");
var readline = require("readline"); 
var mnem2ext = {};

var obj = JSON.parse(fs.readFileSync(database_file, "utf8"));
obj["instructions"].map(function(v, i) {
    var ext = v[4].replace(/ +[A-Z]+=.*/, "").replace(/  +.*/, "");

    if (v[0].match(/\//)) {
        v[0].split("/").forEach(function(v, i) {
            if (!mnem2ext[v]) {
                mnem2ext[v] = {};
            mnem2ext[v][ext] = true;
    } else {
        if (!mnem2ext[v[0]]) {
            mnem2ext[v[0]] = {};
        mnem2ext[v[0]][ext] = true;

var lineReader = require("readline").createInterface({input: fs.createReadStream(disassembly_file)});
lineReader.on("line", function(line) {
    console.log(line + ": " + (mnem2ext[line] ? Object.keys(mnem2ext[line]).join(", ") : undefined));