Baby steps in SIMD (SSE/AVX)

In case you have never used SIMD instructions, this post explores the real basics. For example: what is SIMD? SIMD stands for “Single instruction, multiple data”. We’re computing more than one “math problem” with a single instruction. CPUs have had instructions to do this for a long time. If you remember the “Pentium MMX” hype – that was the first time SIMD instructions came to the x86 architecture.

However, with some trickery, you can do some limited SIMD without actually using these instructions. Let’s say we want to add 1 to two values at the same time. If we put these two values right next to each other in memory, we can interpret them as a single larger datatype. That’s not all that straightforward to understand, so here’s an example: you can interpret two 8-bit values right next to each other as one 16-bit value, right? To increment both values at the same time, you do value + 0x0101, which is just one assembly instruction. So with no special instructions at all, on a 64-bit platform you can increment eight 8-bit values at the same time by adding 0x0101010101010101.

Okay, that feels pretty hacky and unreliable. Once you’ve incremented a value 256 times, you’ll have spilt into the neighboring value! That’s pretty bad.

So SSE provides 128-bit registers that allow you to comfortably work on e.g. four 32-bit floats at the same time, without any spilling. AVX provides 256-bit registers, and AVX512 provides 512-bit registers. Woo! Unfortunately AVX512 isn’t widely available yet.

So how do you use this? Let’s start with SSE, though you’ll see that updating code to use AVX or AVX512 instead is pretty easy. We’ll look at some very basic example code to add two vectors together.

#include <xmmintrin.h> // Need this in order to be able to use the SSE "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[4], b[4], result[4]; // a and b: input, result: output
    __m128 va, vb, vresult; // these vars will "point" to SIMD registers

    // initialize arrays (just {0,1,2,3})
    for (int i = 0; i < 4; i++) {
        a[i] = (float)i;
        b[i] = (float)i;
    // load arrays into SIMD registers
    va = _mm_loadu_ps(a); //
    vb = _mm_loadu_ps(b); // same

    // add them together
    vresult = _mm_add_ps(va, vb);

    // store contents of SIMD register into memory
    _mm_storeu_ps(result, vresult); //

    // print out result
    for (int i = 0; i < 4; i++) {
        printf("%f\n", result[i]);

That doesn’t seem so hard, does it? To access SIMD instructions without writing assembly code, we use something called “intrinsics”, which make the SIMD instructions look like regular C functions. Don’t worry though, these functions are inline and mostly just consist of the assembly instruction itself, so you probably won’t see any difference in performance.

In this example, we’re using three intrinsics, _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps. _mm_loadu_ps copies four float values from memory into the SSE register. We do this twice and are thus using two SSE registers. (We have 16 SSE registers available on 64-bit CPUs.) Then, we use _mm_add_ps to, in a single instruction, add the four floats in one register to the corresponding floats in the other register. (So we get a[0]+b[0], a[1]+b[1], a[2]+b[2], a[3]+b[3].) This is stored in a third SSE register. Using _mm_storeu_ps, we put the contents of this result register into the result float array.

We can compile and run this without any extra linking:

$ gcc -Wall -o sse_test sse_test.c 
$ ./sse_test

Wow, it worked!

_mm_loadu_ps/_mm_storeu_ps have sister functions without the ‘u’. These functions require memory alignment, which just means that the memory has to start at an address that is cleanly divisible by a certain number, which mostly increases performance (unless something unfortunate happens in the CPU caching department).

To get the alignment, we just declare the arrays like this:

    float a[4] __attribute__ ((aligned (16)));
    float b[4] __attribute__ ((aligned (16)));
    float result[4]  __attribute__ ((aligned (16)));

And then change all instances of _mm_loadu_ps/_mm_storeu_ps to _mm_load_ps/_mm_store_ps.  Intel’s documentation states that we need 16-byte alignment. And GCC’s syntax just looks a bit obscure. It’s described here:

Cool, that’s SSE. What about AVX? Well, it turns out that we just need to change the included header file, the array sizes and the names of the intrinsics! (Note that you can include all intrinsics available by doing #include <x86intrin.h> instead.)

So here’s the same thing using AVX, and with aligned memory accesses:

#include <immintrin.h> // Need this in order to be able to use the AVX "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[8] __attribute__ ((aligned (32))); // Intel documentation states that we need 32-byte alignment to use _mm256_load_ps/_mm256_store_ps
    float b[8]  __attribute__ ((aligned (32))); // GCC's syntax makes this look harder than it is:
    float result[8]  __attribute__ ((aligned (32)));
    __m256 va, vb, vresult; // __m256 is a 256-bit datatype, so it can hold 8 32-bit floats

    // initialize arrays (just {0,1,2,3,4,5,6,7})
    for (int i = 0; i < 8; i++) {
        a[i] = (float)i;
        b[i] = (float)i;

    // load arrays into SIMD registers
    va = _mm256_load_ps(a); //
    vb = _mm256_load_ps(b); // same

    // add them together
    vresult = _mm256_add_ps(va, vb); //

    // store contents of SIMD register into memory
    _mm256_store_ps(result, vresult); //

    // print out result
    for (int i = 0; i < 8; i++) {
        printf("%f\n", result[i]);
    return 0;

So let’s compile that:

gcc -Wall -o avx256_test_aligned avx256_test_aligned.c 
avx256_test_aligned.c: In function ‘main’:
avx256_test_aligned.c:15:8: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
     va = _mm256_load_ps(a); //
In file included from /usr/lib/gcc/x86_64-linux-gnu/6/include/immintrin.h:41:0,
                 from avx256_test_aligned.c:1:
/usr/lib/gcc/x86_64-linux-gnu/6/include/avxintrin.h:852:1: error: inlining failed in call to always_inline ‘_mm256_store_ps’: target specific option mismatch
 _mm256_store_ps (float *__P, __m256 __A)
avx256_test_aligned.c:18:5: note: called from here

Oh no, what happened? It didn’t complain when we used SSE instructions (perhaps because all CPUs of the implicitly selected architecture (x86_64) support SSE, which was first introduced a very long time ago), but it’s complaining that our use of AVX instructions is causing a “target-specific option mismatch”. That’s a bit cryptic, but it means that our target (“vanilla” x86_64) does not support AVX instructions. To make this work, we need to supply the -mavx option:

$ gcc -Wall -mavx -o avx256_test_aligned avx256_test_aligned.c 
$ ./avx256_test_aligned 

Nice! BTW, for AVX512, we just need to change the 256s to 512s and the array index 8s to 16s, and supply -mavx512f to gcc.

Addendum: if you execute the AVX512 code on a CPU that doesn’t support it, you get this:

gcc -mavx512f -Wall -o avx_test_aligned avx_test_aligned.c 
Illegal instruction

Second addendum: if you use the aligned instructions without actually aligning your arrays, you get this:

$ ./avx_with_bad_alignment
Segmentation fault

Let me know if you have any questions.

How to find out if an executable uses (e.g.) SIMD instructions (includes jq mini-tutorial!)

“Embarrassingly parallel” algorithms can often make use of SIMD instructions like those that came with the SSE and AVX extensions. In the Python world, numpy is a very popular package to work with arrays. One of the first things I wondered when I started using numpy was, “How optimized is numpy?” Some quick investigation shows that it’s multi-threaded, and some googling shows that it uses SIMD instructions:

Now, it’s a bit tedious to grep for strings like VADDPD in the disassembly, so this post develops a nicer method.

For the impatient, here’s an unorthodox dirty one-liner (it creates a temporary file) that does this for you. It requires jq and internet access to download a database.

tempfile=`mktemp`; curl | cpp | sed -n '/^{/,/^}/ { p }' | jq '[ .instructions | .[] | { (.[0]): .[4] } ] | add' > $tempfile; objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | awk '{print $2}' | grep -v : | sort | uniq | while read line; do echo -n "$line  "; output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" $tempfile); if [ -z "$output" ]; then echo; else echo $output; fi; done > output_test; rm $tempfile

Note that it is not able to distinguish between e.g. AVX and AVX512. It always prints out the most advanced extension possible, so it will print out AVX512 if any AVX is used. If you want something better, check out the Node.js version at the bottom of this post.

And around this point we start the explanation for the less impatient readers: first of all, we need a database of CPU instructions, and a simple Google query brings up this: (The following discussion is based on commit 488b6d986964627f0b130b5265722dde8d93f11d.)

This project is in JavaScript, and the data file isn’t quite in JSON, so let’s do some minor preprocessing first to make our database easier to use:

cpp x86data.js | sed -n '/^{/,/^}/ { p }' > json

cpp is the C preprocessor to remove comments (there are comments and even multi-line comments in the actual data). The sed bit looks for a line starting with a { and after that a line starting with a }, all the while printing out this whole block.

Next, we need to get a disassembly. Here’s an example for numpy’s .so files:

objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | grep -P "^ +[0-9a-z]+:" | awk '{print $2}' | sort | uniq > numpy_instructions

This will get us all instruction mnemonics used. We get a file like this:


Let’s go back to our data. Today, we’ll use jq as our main tool to get the job done (though it will be many times slower than if we wrote a simple script that loads the hash once and re-uses it for every input instruction). If we just want the instructions block, we could do this:

jq '.instructions' json > instructions

However, this tool is a real Swiss army knife. We can use the familiar concept of piping, and we can wrap things in arrays or hashes just by enclosing expressions in [] or {}. Here’s an entire command to get an array of hashes containing only the instruction and the corresponding extension from the json file:

jq '[ .instructions | .[] | {instruction: .[0], extension: .[4] } ]' json

.[] iterates over the array inside the instructions key. Every item in the array is piped to a bit of jq code that creates a hash with an instruction and an extension key, which correspond to array element 0 and 4 in the input data. So we get output like this:

    "instruction": "aaa",
    "extension": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
    "instruction": "aas",
    "extension": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"

Now we’re going to do something slightly naughty. The extension field isn’t the same for all instructions with the same mnemonic, as different opcodes with the same mnemonics have been added to the instruction set over time. However, we don’t need to be that precise IMO, so we’re just going to merge everything into an object like {“mnemonic”: “extension info”}. First, let’s get an array of hashes:

jq '[ .instructions | .[] | { (.[0]): .[4] } ]' json | head
    "aaa": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
    "aas": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
    "aad": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U"

Now we just need to pipe this into the add filter to merge this array of hashes/objects into a single hash/object:

jq '[ .instructions | .[] | { (.[0]): .[4] } ] | add' json > mnem2ext.json

And the result is:

  "aaa": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W",
  "aas": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W",
  "aad": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U",
  "aam": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U",
  "adc": "X64              OF=W SF=W ZF=W AF=W PF=W CF=X",
  "add": "X64              OF=W SF=W ZF=W AF=W PF=W CF=W",
  "and": "X64              OF=0 SF=W ZF=W AF=U PF=W CF=0",
  "arpl": "X86 ZF=W",
  "bndcl": "MPX X64",

Wee! But how do we access the information in this file? Well, with jq of course (not efficient though):

while read line; do echo -n "$line  "; jq ".$line" min.json; done < numpy_instructions

Here’s an extract from the output:

cvttpd2dq  "SSE2"
cvttps2dq  "SSE2"
cvttsd2si  "SSE2 X64"
cvttss2si  "SSE X64"
cwde  "ANY"
div  "X64              OF=U SF=U ZF=U AF=U PF=U CF=U"
divpd  "SSE2"
divps  "SSE"
divsd  "SSE2"
divss  "SSE"
fabs  "FPU              C0=U C1=0 C2=U C3=U"
fadd  "FPU              C0=U C1=W C2=U C3=U"

Such a nice mix of instructions. <3 We have a few problems though. Here are some instructions that couldn’t resolved:

cmova  null
cmpneqss  null
ja  null
rep  null
seta  null

A closer look at our database reveals that some instructions have slashes in them, like “cmova/cmovnbe”. These are aliases, so we should be able to detect these as well. jq sort of allows to search for keys using regex, though the syntax isn’t easy, and the bash escaping makes things a bit worse:

while read line; do echo -n "$line  "; jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\")))" min.json; done < numpy_instructions > output

Things have gotten a bit slower again, and the rest of our output looks a bit different too:

xor  {
  "xor": "X64              OF=0 SF=W ZF=W AF=U PF=W CF=0"
xorpd  {
  "xorpd": "SSE2"
xorps  {
  "xorps": "SSE"

We can’t get rid of the echo, otherwise we’ll have no way to tell if jq is finding the mnemonic or not. So we’ll use jq to fix the format. Here’s an easy example:

echo '{ "b": "c" }' | jq 'to_entries[]'
    "key": "b",
    "value": "c"
echo '{ "b": "c" }' | jq 'to_entries | .[] | .value'

Here, we’re just converting the hash into an array (as we did above with with_entries), and only select the .values. We can just pipe this within jq:

while read line; do echo -n "$line  "; jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" min.json; done < numpy_instructions > output

However, we don’t get a newline when we didn’t find an instruction, so we work around this in bash:

while read line; do echo -n "$line  "; output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" min.json); if [ -z "$output" ]; then echo; else echo $output; fi; done < numpy_instructions > output

That leaves mostly pseudo-instructions. The following pseudo-instructions are not included in this database but would indicate SSE2: CMPEQPD, CMPLTPD, CMPLEPD, CMPUNORDPD, CMPNEQPD, CMPNLTPD, CMPNLEPD, CMPORDPD. These all belong to the CMPPD instruction introduced in SSE2, as far as I can tell. ( It would make sense to have them in the database in this case, but I think I’ll leave well enough alone for now though.

Anyway, doing something like awk ‘{print $2}’ output | sed s/\”//g | sort | uniq shows that my numpy version may use instructions from the following sets:


Well, that’s great. Let’s package this up into a shell script so it’s a bit easier to use. Just stick it in a directory that has cpu_extensions.min.json in it and it’ll work.


json_file=$(dirname $0)/cpu_extensions.min.json
objdump --no-show-raw-insn -M intel -d $* | grep -P "^ [0-9a-z]+:" | awk '{print $2}' | sort | uniq | while read line; do
    echo -n "$line  "
    output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" $json_file);
    if [ -z "$output" ];
        then echo;
        echo $output | sed -e 's/"//g' -e 's/ .*//g'

Also, here’s a more efficient (O(n)) implementation in Node.js. It gets away with much less pre-processing, all you have to do is:

sed -n '/^{/,/^}/ { p }' x86data.js > cpu_extensions.json

However, it doesn’t execute objdump for you, so you have to call it like this:

show_cpu_extensions.js <(objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | grep -P "^ +[0-9a-z]+:" | awk '{print $2}' | sort | uniq)

I’ve also made it display all possible extensions.


var database_file;
var disassembly_file;

if (process.argv.length == 3) {
    // Use default database
    database_file = __dirname + "/cpu_extensions.json";
    disassembly_file = process.argv[2];
} else if (process.argv.length == 4) {
    database_file = process.argv[2];
    disassembly_file = process.argv[3];
} else {
    console.log("Usage: " + process.argv[1] + " [database] disassembly");

var fs = require("fs");
var readline = require("readline"); 
var mnem2ext = {};

var obj = JSON.parse(fs.readFileSync(database_file, "utf8"));
obj["instructions"].map(function(v, i) {
    var ext = v[4].replace(/ +[A-Z]+=.*/, "").replace(/  +.*/, "");

    if (v[0].match(/\//)) {
        v[0].split("/").forEach(function(v, i) {
            if (!mnem2ext[v]) {
                mnem2ext[v] = {};
            mnem2ext[v][ext] = true;
    } else {
        if (!mnem2ext[v[0]]) {
            mnem2ext[v[0]] = {};
        mnem2ext[v[0]][ext] = true;

var lineReader = require("readline").createInterface({input: fs.createReadStream(disassembly_file)});
lineReader.on("line", function(line) {
    console.log(line + ": " + (mnem2ext[line] ? Object.keys(mnem2ext[line]).join(", ") : undefined));

“Wrap marker” Thunderbird Extension

Yay, time for a new Thunderbird extension. Wrap Marker.

The code is up on GitHub.

This Thunderbird extension adds a word wrap marker (also called “ruler”, depending on what editor you’re using) to the text area in the compose window when you’re editing plain text emails. In effect, a vertical line indicating that you’re close to the 72/76/80-character mark. (You can change the position in about:config. The default is 76.)

It works by changing the entire editor’s (think “iframe”) designMode from “on” to “off”, and adding a div with contenteditable=”true” instead. If this changes how your compose text area behaves, I’d consider that a bug, so please let me know.

At the time of this writing (February 26, 2018), this extension is still kind of beta and not exactly “thoroughly tested”. It will be submitted to Thunderbird’s extension page once it’s been tested some more and maybe once it’s gotten some of the known bugs fixed. These include:

  • Quoted text in a reply isn’t blue.
  • Your cursor position preference isn’t honored. The cursor will always be in the upper left corner when you start a new reply.
  • This feature is disabled for HTML emails. I don’t think it’ll ever work for HTML emails.
  • You get scrollbars all the time (This is probably fixable. Forgot to fix.)

Backporting security fixes to old versions of the Linux kernel (Meltdown to 2.6.18) (Part 1)

In this post, I’ll give a quick overview over what it takes to backport a large patch (the KAISER patch to protect against Meltdown) to the Linux kernel to a version of the Linux kernel from around ten years ago. Note that this post only covers the main technique and the assembly portion of the patch.

First of all, one should think hard about whether this necessary. Couldn’t you just run a newer kernel with older user space? The answer is, in most cases, yes, you could. As evidenced by our ability to run old Docker images with 10-year-old userland on modern kernels (perhaps adding vsyscall=emulate to the kernel command line), things often work just fine. However, you may run into problems if you’re running on bare metal. I’ve heard of people running a maintained 3.10 kernel on 10-year-old userland without much fuss. I’ve personally run a 64-bit kernel with 100% 32-bit userland (same kernel version, without X11).

However, some people may not be able to afford to re-test their whole setup with different kernel versions all the time, and that is why distributions usually backport pure security fixes from newer kernels to older kernels. The Linux kernel is constantly improved, and over time, the code base of the kernel version included in a specific stable version of a distribution, which may only get security fixes, tends to look pretty different from the current Linux kernel.

Now let’s pretend we have to backport a fix for the Meltdown vulnerability to Linux 2.6.18. First of all, we try very hard to come up with alternative ways to thwart this vulnerability. For 2.6.18, we come up empty-handed, but for earlier kernels, we may find the so-called 4G/4G patch.

This 4G/4G patch unfortunately never made it into the mainline kernel, but was adopted by Red Hat for inclusion in Red Hat Enterprise Linux up to version 4. So we could get our hands on a version of this patch for Linux 2.6.9, and perhaps forward-port this to 2.6.18. The patch at weighs in at around 4500 lines, and our foremost priority should be to find a patch with as few lines as possible.

The patch referenced in the original Meltdown paper weighs in at only 1000 lines, and is almost guaranteed to be very barebones. I’d say it would therefore make sense to attempt to backport this patch, and if we manage to do that, perhaps look at what the various distributors decided to do differently from what’s in this patch.

Before we start, it would probably make sense to find a couple sentences that describe what the patch is supposed to do. It’s more than likely that we came across various descriptions of the patch when we were looking for a barebones patch to base our work off from. LWN has a good introduction.


We need the source tree of the target kernel version and the source kernel version extracted somewhere. The source kernel version can be had by doing:

$ git clone
$ # cd / mv / etc.
$ git checkout v4.10-rc6

The target version in our case is over here: We need to extract this and apply all of the existing patches. I use a current version of Debian, and rpmbuild operates in ~/rpmbuild. So create this directory, and the directories, SRPMS, RPMS, SPECS, SOURCES, BUILD, and BUILDROOT below it. Move the .src.rpm into the SOURCES directory, and issue the following commands.

$ cd ~/rpmbuild/SOURCES
$ rpm2cpio * | cpio -idmv
$ mv kernel.spec ../SPEC
$ cd ../SPEC
$ rpmbuild --nodeps -bp kernel.spec

Make sure you didn’t get any errors in the last step. Your patched kernel, ready to build from, is now inside ~/rpmbuild/BUILD/.

We’ll be making a lot of use of grep and git blame to backport patches. I usually use less to browse code quickly, or open it in an editor (usually kate and/or sublime) when I think I’ll need the file for a longer time. I have two monitors, but having more would help. I also have a bunch of paper to scribble stuff on. When you have a lot of terminal windows open just for the grepping, compiling and other things, you’ll probably find that giving the editor

You’ll find that you’ll have to read up on four-level page tables while creating the patch. Depending on the way you work, you might as well do that before you dig in.

Here are a few more less tips:

  • You likely already know that you can search files by hitting ‘/’
    • You can use the arrow keys to browse through your search history
    • You can disable regex search by hitting Ctrl-R
    • You can type -N followed by return to display line numbers

For debugging, I use the venerable Bochs.

Digging in

arch/x86/entry/entry_64.S and arch/x86/entry/entry_64_compat.S

We have something in arch/x86/entry/entry_64.S and arch/x86/entry/entry_64_compat.S. Okay, we’re adding a few macros (SWITCH_KERNEL_CR3_NO_STACK, SWITCH_USER_CR3, SWITCH_KERNEL_CR3). These macros all seem to be close to a macro called SWAPGS or SWAPGS_UNSAFE_STACK. The presence of “UNSAFE_STACK” also dictates which SWITCH_CR3 macro we’re using. Though nothing may make sense yet, these are all important observations.

On the old kernel, this path doesn’t exist at all, but we have a promising-sounding arch/x86_64/ path.

~/src/kernel/el5/linux-$ find arch/x86_64/ -name *entry*

Opening arch/x86_64/kernel/entry.S, we see code that looks similar on the whole. SWAPGS doesn’t exist, but swapgs (as a pure assembly instruction) does. So let’s figure out what SWAPGS is about:

~/src/kernel/git$ grep -rn SWAPGS
arch/x86/include/asm/irqflags.h:122:#define SWAPGS      swapgs
arch/x86/include/asm/paravirt.h:908:#define SWAPGS                                                              \
        PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE,     \
                  call PARA_INDIRECT(pv_cpu_ops+PV_CPU_swapgs)          \

At this point, we might have a hunch that SWAPGS was introduced with the intention to make the same entry code work for both real hardware/real virtualization and paravirtualization, and this is sufficiently confirmed when we git blame the file a bit:

$ git blame arch/x86/entry/entry_64.S
72fe485854429 arch/x86/kernel/entry_64.S (Glauber de Oliveira Costa 2008-01-30 13:32:08 +0100  143)     SWAPGS_UNSAFE_STACK
$ git show 72fe485854429
commit 72fe4858544292ad64600765cb78bc02298c6b1c
Author: Glauber de Oliveira Costa <>
Date:   Wed Jan 30 13:32:08 2008 +0100

    x86: replace privileged instructions with paravirt macros
    The assembly code in entry_64.S issues a bunch of privileged instructions,
    like cli, sti, swapgs, and others. Paravirt guests are forbidden to do so,
    and we then replace them with macros that will do the right thing.

When looking at the above git blame, there are a lot of lines affecting SWAPGS with different commit hashes, but this one is the oldest. We should be able to transfer the macro calls to the lines adjacent to the swapgs instructions. Fortunately, the number of swapgs instructions and the number of SWAPGS macro calls are almost the same in both kernels. With just the names (SWITCH_KERNEL_CR3) of the macros we don’t really know if this switches the kernel CR3 to the user CR3 or the other way round, and when you look at code that was accepted upstream or in distributions, you might see that the macro names have become easier to understand. So let’s dig into the macros, which are declared in the newly #included asm/kaiser.h.


asm/kaiser.h consists of assembly code (#ifdef __ASSEMBLY__) and C code (#else).  Assembly code in the Linux kernel uses AT&T syntax, which means that the first operands are the sources and the second operands the destinations. The macros look pretty clean (i.e., they are mostly pure assembly code), except for the use of something called PER_CPU_VAR. Modern processors have more than one core, and these cores operate independently. One core might be executing user land, and another core might be in the kernel or about to do the entry into the kernel.

Unfortunately, when we grep for PER_CPU_VAR in the old kernel code, we come up empty-handed:

src/kernel/el5/linux-$ grep -r PER_CPU_VAR .

Note that a case-insensitive grep comes up with ia64-specific (as in Itanium) code. grepping for PER_CPU, on the other hand, yields a lot of results. Even the KAISER patch itself contains DECLARE_PER_CPU and DEFINE_PER_CPU statements. However, the older kernel doesn’t have DECLARE_PER_CPU_SECTION or DEFINE_PER_CPU_SECTION.

~/src/kernel/git$ grep -r PER_CPU_SECTION . | grep define
./include/linux/percpu-defs.h:#define DECLARE_PER_CPU_SECTION(type, name, sec)                  \
... (More matches in the same file)

Now, we do a chain of git blames until we find something that we consider useful:

git blame include/linux/percpu-defs.h
git show 7c756e6e19e71
git blame 7c756e6e19e71^ -- include/linux/percpu-defs.h # start blaming from one before 7c756e6e19e71; don't forget the '--'
git show 5028eaa97dd1d
# Looks like 5028eaa97dd1d creates the file for the first time, and the definitions used to be in include/asm-generic/percpu.h
git blame 5028eaa97dd1d^ -- include/asm-generic/percpu.h
git show 9b8de7479d0db
git blame 9b8de7479d0db^ -- include/linux/percpu.h
git show 0bd74fa8e29dc

At this point, we finally found the commit that first introduced DEFINE_PER_CPU_SECTION, but this still depends on DEFINE_PER_CPU_PAGE_ALIGNED, which isn’t available yet in 2.6.18. So the search continues:

git blame 0bd74fa8e29dc^ -- include/linux/percpu.h
git show 63cc8c7515646

This commit indicates that DEFINE_PER_CPU_PAGE_ALIGNED was introduced to avoid wasting memory. I don’t believe we really need to care about this. Let’s trace PER_CPU_VAR next:

grep -r PER_CPU_VAR . | grep define
git blame ./arch/x86/include/asm/percpu.h
git show dd17c8f72993f
git blame dd17c8f72993f^ -- arch/x86/include/asm/percpu.h
git show 3334052a321ac

This commit unifies the percpu_32.h and percpu_64.h files into a single header file, and indicates that PER_CPU_VAR only existed in the 32-bit code paths. Instead, the 64-bit code had this, which we grep straight away:

DECLARE_PER_CPU(struct x8664_pda, pda);

~/src/kernel/el5/linux-$ grep -r x8664_pda
include/asm-x86_64/pda.h:11:struct x8664_pda {
~/src/kernel/el5/linux-$ less -N include/asm-x86_64/pda.h
     10 /* Per processor datastructure. %gs points to it while the kernel runs */ 
     11 struct x8664_pda {
     12         struct task_struct *pcurrent;   /* Current process */
     13         unsigned long data_offset;      /* Per cpu data offset from linker address */
     14         unsigned long kernelstack;  /* top of kernel stack for current */ 
     15         unsigned long oldrsp;       /* user rsp for system call */
     17         unsigned long debugstack;   /* #DB/#BP stack. */
     18 #endif
     19         int irqcount;               /* Irq nesting counter. Starts with -1 */   
     20         int cpunumber;              /* Logical CPU number */
     21         char *irqstackptr;      /* top of irqstack */
     22         int nodenumber;             /* number of current node */
     23         unsigned int __softirq_pending;
     24         unsigned int __nmi_count;       /* number of NMI on this CPUs */
     25         int mmu_state;     
     26         struct mm_struct *active_mm;
     27         unsigned apic_timer_irqs;
     28 } ____cacheline_aligned_in_smp;

Interesting, this is a per-processor data structure? pda.h doesn’t exist in modern kernels anymore, but some additional googling confirms that, yes, we should be able to use this. I ended up adding unsafe_stack_register_backup to this struct. Through some additional code searching we can find out how to access members of the PDA structure (for assembly, there’s a hint at the top: %gs points to the structure when we’re in kernel space).

The rest of asm/kaiser.h consists entirely of C function prototypes, which we can just copy over. At this point, we have successfully backported about 37% of the entire patch. I used this git blame technique to backport the entire patch. It’s a lot of work, and if you do not include the time it takes to read through the Meltdown papers and the news to get a good overview of what needs to be done, it took me about two to three weeks to get a still-broken patch that causes the system to panic around PID number 370, which is still long before you get to log in to the console. It still took well over a dozen rebuilds to get there.

KDE: Windows freeze or flicker but application doesn’t crash

I’m running KDE on two different systems, and one of them exhibits the following problem very often, and the other just did for the first time:

Windows stop updating their content, and perhaps flicker a bit. Switching to a different window and back causes the window contents to be updated, but still frozen. Which means that the application itself is not crashed.

The following command fixes this:

kwin --replace

You can run this from the run command prompt (Alt+F2) (also called Plasma search or krunner), or you could run it in a terminal. (You’d have to make sure the process doesn’t exit when you close the terminal though.)

If everything appears to be frozen and you can’t get to the run command prompt, you could still switch to a console, log in, and try running the following:

DISPLAY=:0 kwin --replace

Both systems have internal Intel graphics (quite different chipsets though) and KDE5.

The above commands will fix the problem for that time. Your open applications should not be affected by the change. I haven’t looked much into permanent fixes, but changing the rendering backend (System Settings → Display and Monitor → Compositor) may change the frequency the problem is triggered or maybe even get rid of it altogether. (I felt that OpenGL 2.0 probably triggered the problem fewer times than OpenGL 3.1.)

I’ve noticed a fair amount of traffic to my KDE-related posts. If you run into any weird KDE problems that you don’t know how to fix, feel free to leave a comment and ask.

Meltdown / Spectre Kernel Patch Benchmarks on Older Systems

The Meltdown patch for the Linux kernel makes use of the relatively new PCID instruction. I still sometimes use my old laptop, which contains a Core 2 Duo Penryn CPU (T7250), and does not support the PCID instruction, so I did a quick UnixBench run to see what kind of difference the absence of the PCID instruction would make. At the end of this article, I have a bonus “benchmark” for an alternative way to mitigate Meltdown: disabling the CPU’s caches. All my tests were performed on Debian Wheezy (currently oldstable) using kernel version 3.16.0-5-amd64.

First of all, here are another person’s results for a CPU that supports PCID. And since that’s in Japanese, here’s the important bit:

Test Before After Change (positive is better)
System Call Overhead 5391.9 4009.7 -25.63%

Now, my tests on the Penryn CPU:

Test Before After Change (positive is better)
Dhrystone 2 using register variables 3360.4 3414.1 +1.60%
Double-Precision Whetstone 724.1 724 -0.01%
Execl Throughput 1351.7 1222.9 -9.53%
File Copy 1024 bufsize 2000 maxblocks 1582 1244 -21.37%
File Copy 256 bufsize 500 maxblocks 1255.9 922.1 -26.58%
File Copy 4096 bufsize 8000 maxblocks 1982.4 1810.6 -8.67%
Pipe Throughput 1672.8 765.4 -54.24%
Pipe-based Context Switching 1108.3 671 -39.46%
Process Creation 1150 1025.3 -10.84%
Shell Scripts (1 concurrent) 1995.7 1909 -4.34%
Shell Scripts (8 concurrent) 1831.8 1743.3 -4.83%
System Call Overhead 1705.6 544.9 -68.05%
System Benchmarks Index Score 1535.8 1160.9 -24.41%

And the raw data in case you are interested:

Before updating:

Test Score Unit Time Iters. Baseline Index
Dhrystone 2 using register variables 39215974.0 lps 10.0 s 7 116700.0 3360.4
Double-Precision Whetstone 3982.6 MWIPS 9.9 s 7 55.0 724.1
Execl Throughput 5812.4 lps 29.2 s 2 43.0 1351.7
File Copy 1024 bufsize 2000 maxblocks 626453.0 KBps 30.0 s 2 3960.0 1582.0
File Copy 256 bufsize 500 maxblocks 207854.8 KBps 30.0 s 2 1655.0 1255.9
File Copy 4096 bufsize 8000 maxblocks 1149781.6 KBps 30.0 s 2 5800.0 1982.4
Pipe Throughput 2080979.1 lps 10.0 s 7 12440.0 1672.8
Pipe-based Context Switching 443337.7 lps 10.0 s 7 4000.0 1108.3
Process Creation 14490.3 lps 30.0 s 2 126.0 1150.0
Shell Scripts (1 concurrent) 8461.7 lpm 60.0 s 2 42.4 1995.7
Shell Scripts (8 concurrent) 1099.1 lpm 60.1 s 2 6.0 1831.8
System Call Overhead 2558469.9 lps 10.0 s 7 15000.0 1705.6
System Benchmarks Index Score: 1535.8

After updating:

Test Score Unit Time Iters. Baseline Index
Dhrystone 2 using register variables 39842314.8 lps 10.0 s 7 116700.0 3414.1
Double-Precision Whetstone 3982.0 MWIPS 9.8 s 7 55.0 724.0
Execl Throughput 5258.5 lps 30.0 s 2 43.0 1222.9
File Copy 1024 bufsize 2000 maxblocks 492638.1 KBps 30.0 s 2 3960.0 1244.0
File Copy 256 bufsize 500 maxblocks 152610.9 KBps 30.0 s 2 1655.0 922.1
File Copy 4096 bufsize 8000 maxblocks 1050156.7 KBps 30.0 s 2 5800.0 1810.6
Pipe Throughput 952188.4 lps 10.0 s 7 12440.0 765.4
Pipe-based Context Switching 268401.0 lps 10.0 s 7 4000.0 671.0
Process Creation 12918.3 lps 30.0 s 2 126.0 1025.3
Shell Scripts (1 concurrent) 8094.2 lpm 60.0 s 2 42.4 1909.0
Shell Scripts (8 concurrent) 1046.0 lpm 60.1 s 2 6.0 1743.3
System Call Overhead 817288.1 lps 10.0 s 7 15000.0 544.9
System Benchmarks Index Score: 1160.9

Now, Mitigating Meltdown by switching off CPU caches:

You wouldn’t even want to run UnixBench without CPU caches. Here’s a “simpler” benchmark that tells you why:

# time perl -e 'for (1..1000000) {}'

real 0m0.056s
user 0m0.052s
sys 0m0.000s
# insmod disable_cache.ko
# time perl -e 'for (1..1000000) {}' 

real 0m44.689s
user 0m40.044s
sys 0m0.520s
# rmmod disable_cache

Unless you enjoy working on a system that is some 800 times slower. (Don’t try to do this in a GUI setting.)

Nonetheless, here’s some code to disable the CPU caches. (Modified from

#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp.h>


void _disable_cache(void *p) {
 printk(KERN_ALERT "Disabling L1 and L2 caches on processor %d.\n", smp_processor_id());
 __asm__(".intel_syntax noprefix\n\t"
 "mov rax,cr0\n\t"
 "or rax,(1 << 30)\n\t"
 "mov cr0,rax\n\t"
 ".att_syntax noprefix\n\t"
 : : : "rax" );
void _enable_cache(void *p) {
 printk(KERN_ALERT "Enabling L1 and L2 caches on processor %d.\n", smp_processor_id());
 __asm__(".intel_syntax noprefix\n\t"
 "mov rax,cr0\n\t"
 "and rax,~(1 << 30)\n\t"
 "mov cr0,rax\n\t"
 ".att_syntax noprefix\n\t"
 : : : "rax" );

static int disable_cache_init(void)
 on_each_cpu(_disable_cache, NULL, 1);
 return 0;
static void disable_cache_exit(void)
 on_each_cpu(_enable_cache, NULL, 1);



obj-m += disable_cache.o

	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

Note that you need to indent using tabs in Makefile. CR0 can only be read from Ring 0, and thus a kernel module is needed.

Here’s some example code to just read the CR0 registers on all CPUs:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp.h>


void cache_status(void *p) {
 long int cr0_30 = 0;
 __asm__(".intel_syntax noprefix\n\t"
 "mov %0, cr0\n\t"
 "and %0, (1 << 30)\n\t"
 "shr %0, 30\n\t"
 ".att_syntax noprefix\n\t"
 : "=r" (cr0_30));
 printk(KERN_INFO "Processor %d: %ld\n", smp_processor_id(), cr0_30&(1<<30)>>30);

static int cache_status_init(void) {
 on_each_cpu(cache_status, NULL, 1);
 return 0;
static void cache_status_exit(void) {
 on_each_cpu(cache_status, NULL, 1);


And the corresponding Makefile:

obj-m += cache_status.o

	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules


KDE: The Window Switcher installation is broken, resources are missing.

So I was highly displeased with the standard Breeze task switcher, and thought I’d get a few new ones by clicking the star icon next to the drop-down menu where you select the task switcher. My recommendation is “Grid”. Trying to use Grid, all I get is this error message:

The Window Switcher installation is broken, resources are missing.
Contact your distribution about this.

Hrm. So then I Google and look at code, waste time trying silly things, just to postpone this problem for another weekend. Well, it’s the next weekend now, and just when I’m about to dive back into the code… I restart X (i.e. re-login), and when I try to bring the message up one more time… It doesn’t appear anymore, and Grid and all the others are working! I try installing one more, and sure enough, it didn’t work, but one more re-login and tada. So the answer to this problem might be: restart your KDE session.

It’s still a bug though. But unfortunately I’m no longer interested in looking into this bug now. :(

KDE input problems (KDE applications don’t accept input)

On KDE 5.28.0, which is currently the version of KDE included in Debian Stretch, you may run into the following problem: if you run IBus (to work with an IME like Mozc or Anthy to type Japanese), you might find that you sometimes lose the ability to input text in KDE/Qt applications (like konsole, or the run bar (krunner) when you press Alt+F2). It looks like you can fix this by running:

ibus-daemon -r -x -d

Since you can still type stuff in applications that don’t use Qt, such as Firefox, typing the above command into a text area in Firefox, copying and pasting (right click, paste) that text into krunner, then using your mouse to select “Command line: ibus-daemon -r -x -d”, you should be able to get input to work again. If you do not have any non-Qt applications, you could switch to the console and instead do:

DISPLAY=:0 ibus-daemon -r -x d

However, you may need a second hack to get Japanese input to work again: open xterm (not konsole) and activate your Japanese IME (which should work just fine). This seems to cause Japanese input to work again system-wide.


Debian Stretch に入っている KDE 5.28.0 に、(IBus が稼働している環境では)KDE/Qt 系のアプリケーションに、時々、何も入力できなくなる不具合があるようです。IBus のデーモンを再起動すると直るので、以下のコマンドを実行してみてください。

ibus-daemon -r -x -d

キーボードが使えないのに、コマンドを実行するのにはどうすればいいですかというと、Qt を使用しているアプリケーション以外の入力はできるはずなので、例えば Firefox のテキストボックスにコマンドを入力、コピーして、Alt+F2 で開ける実行メニュー (krunner) を開いて、コマンドを右クリックで貼り付けて、マウスで実行するように選択すれば、キーボードをほとんど使わずに実行できます。(ショートカットキーは使えるはずです。)

DISPLAY=:0 ibus-daemon -r -x -d

ただし、これだけでは日本語入力が直らない(場合があります?)。直らない場合は、xterm (konsole などではなく、xterm)を開いて、一回日本語入力に切り替えて、使えるかどうか試してみてください。(使えるはずです。)

The Boring Game

I wrote a game! In 2.5 hours, even. It’s a console (as in Linux terminal) game, and written in Perl (5). I’ll call it “The Boring Game”.

You’re the pilot of a sophisticated airplane that does not crash into mountains, but bores tunnels through them. Flying costs money (because you use up fuel). Operating the boring machine attached to your airplane is extremely energy-intensive, and costs a fortune. Boring horizontally is expensive, but wait till you see how much you have to pay to bore up. However, (completed) tunnels are very useful infrastructure, so you get a nice reward every time you make it through the mountain.

The game’s settings are global constants at the top of the source file:

my $USLEEP = 80000;
my @STATE_CHANGE_PROBABILITY = (0.1, 0.4, 0.1);
my $MAX_ALTITUDE = 120; # (1 == one column). Need a few more columns to display the current funds
my $MOUNTAIN_CHAR = '.';
my $PLAYER_CHAR = '@';
my $UP_CMD = 'k';
my $DOWN_CMD = 'j';
my $QUIT_CMD = 'q';

Here’s a YouTube video of the game in action:

You can get the source code at
To play the game, put in any directory you want, and issue the following command:


Beach Cleaning in Matsue, Japan

Ocean trash
Ocean trash

Now that I’m living in Matsue, I often find myself not having much to do. Which means I’m usually sitting at my computer or sitting on my bicycle. One of my first destinations was the sea, which is about 10 km north from where I live.

The spot marked “須々海海岸” on Google Maps is overwhelmingly beautiful and saddening at the same time. While the pictures shared on Google Maps may show you that this is indeed a very beautiful spot, most of these pictures do not show that there is a lot of plastic trash on the beach.

So not having much to do, being somewhat young (28 back then) and being reasonably environmentally minded, I one day decided to see if I could maybe help clean this place up. Unfortunately, my Google queries for beach cleanup activities in Matsue didn’t yield any results, so I just decided to buy a pair of (gardening) gloves and a pack of large trash bags and get some cleaning done.

It turned out to be a great way to pass the time (in late spring, when it isn’t crazy hot and mostly not raining), so I kept coming back, and decided to continue until the tsuyu (rainy season) would kick in.

Ocean garbage collection application form
Ocean garbage collection application form

Just gathering the trash is of course not quite enough. You need to get it to the waste processing facilities. So I just went to the town hall and asked the person at the entrance what to do. I was told to go to the “volunteer” department at the 松江市環境センター (Matsue Environmental Office), where I had to fill out a form (pictured) with the following information: personal information, pick-up address (no house means no address, so this is a bit hard, but the guy at the counter really knew his way around town, and showing him the place on Google StreetView helped a bit too), the number of trash bags, cleanup date, next date in case rains gets in the way. After filling out the form, I got the number of Matsue-branded trash bags that I’d put on the form, at which point I had to explain that I’d actually already started cleaning, unfortunately using regular unlabeled trash bags. That was fine, but he told me to use the right bags next time. The form allows you to tick 自己搬入 (bringing in the trash yourself), but you’d probably have to explain yourself if you want to do that. I opted to have a truck pick up the trash I’d pile up at the side of the road, which usually takes place within one week after your cleanup date.

Some collected ocean trash
Some collected ocean trash

Not knowing much about the  recycling facilities here, I generally sorted the trash by type: plastic bottles (of which there are many, many), plastic bottle labels, plastic bottle caps, styrofoam, hard plastic (probably mostly originating from buoys), soft plastic (think polyester), random other plastic. I had no intention of taking care of tree branches/logs, and was in fact told not to pick those up, as they wouldn’t fit into the plastic bags anyway.

Random thoughts

  • Japanese beaches may have a lot of ugly フナムシ (sea roaches). They will definitely crawl all over your bags, so make sure to close them properly. :p
  • This place doesn’t have a lot of people come by, but the people I did meet were quite eager to talk. Mostly older guys who have come out to do some fishing.
  • Carrying the trash bags from the beach up to the road (which is probably a 20 m altitude difference) was pretty tough, but very, um, good exercise. Combined with the 10 km (very much non-flat) ride on the bicycle… it was pretty intense. :p
  • I’ll probably re-commence my cleaning activities when it gets a bit cooler, perhaps in September. Hoping the place won’t be infested by spiders.
  • One day, I found that someone had helped during my absence \o/