Forwarding DNS requests using netcat, without dnsmasq/bind/other DNS software

I’ve sometimes found that it would be useful to be able to forward DNS requests from one network into another.

In this article, the examples are for forwarding Docker’s internal DNS. My (potential) use case is to (hopefully) work around Softether VPN’s internal DNS server not being able to resolve the names of other Docker containers on the Docker network (when running Softether VPN in a Docker container).

I tried the following on a CentOS 7.5 machine, but this only worked for the first request.

mkfifo fifo0
mkfifo fifo1
nc -l -u 172.19.0.3 53 < fifo1 > fifo0 & nc -u 127.0.0.11 53 < fifo0 > fifo1

Checking netstat -lnp after sending the first request, we see that nc is no longer listening. The problem is that the -k option is missing, but adding the -k option gets us this message:

Ncat: UDP mode does not support the -k or --keep-open options, except with --exec or --sh-exec. QUITTING.

Wait, "--exec"? "--sh-exec"? What, we don't have to do this whole mkfifo stuff at all?!

/root/nc.sh:

#!/bin/sh
nc -u 127.0.0.11 53

Command:

nc -k -l -u 172.19.0.3 53 -e /root/nc.sh

Note that -e is short for --exec. This appears to work just fine. (The corresponding --sh-exec (-c) option didn't work right away, and I didn't feel like spending too much time on it.)
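To check that the forwarding actually works, you can query the listening address with dig from a machine (or container) that can reach it. Here, db stands for a hypothetical name of another container on the Docker network:

dig @172.19.0.3 db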

Here’s a dnsmasq command to do something similar:

dnsmasq -u root -i eth0 --no-dhcp-interface=eth0 --port=5353

This will also allow you to resolve things using /etc/hosts on the container running dnsmasq, while disabling dnsmasq's internal DHCP. (If you change the listening interface given in -i, you'll also have to change the interface given in --no-dhcp-interface.)
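Testing works the same way, just on the non-standard port. Here, 172.19.0.2 is a placeholder for the address dnsmasq is listening on, and web is again a hypothetical container (or /etc/hosts) name:

dig -p 5353 @172.19.0.2 web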

Spreadsheets vs. Command Line Utilities vs. SQL (for Pivot Tables)

When processing text files on Linux, you have a lot of choice. Sed, Awk, Perl, or just coreutils, or perhaps a spreadsheet application? I’m a reasonably educated spreadsheet application user, but I’m also a reasonably educated command line user and a reasonably educated SQL user. This article pits the three against each other.

In this article, I’ll show different ways to process a large CSV file: one solution using a spreadsheet application, one solution using standard CLI utilities (GNU coreutils and GNU datamash), and one solution using q (http://harelba.github.io/q – Run SQL directly on CSV files) (and one solution using sqlite3, which is almost the same).

Conclusions

Yes, I’m putting my conclusions first. If you need to create a Pivot Table from CSV files, I believe SQL is the best solution. The q utility makes using SQL very comfortable.

The data

The dataset used in this article describes export statistics, i.e., trade from Japan to other countries. We would like to do a simple Pivot Table-like task that would be really easy in Excel: find the total export volume (in JPY) from Japan to a specific country for every HS “section”. Here are some examples of HS sections and their corresponding HS chapters:

Chapters 01-05: LIVE ANIMALS; ANIMAL PRODUCTS
Chapters 06-14: VEGETABLE PRODUCTS
Chapter 15: ANIMAL OR VEGETABLE FATS AND OILS AND THEIR CLEAVAGE PRODUCTS; PREPARED EDIBLE FATS; ANIMAL OR VEGETABLE WAXES

The following links are for imports, but I don't think that matters in our case. Here’s the whole table: http://www.customs.go.jp/english/tariff/2018_4/index.htm This table contains links to tables further describing the HS codes in each HS chapter. For example, here’s the table for section I, “LIVE ANIMALS; ANIMAL PRODUCTS”, chapter 01: “Live animals”: http://www.customs.go.jp/english/tariff/2018_4/data/e_01.htm.

The HS codes in our dataset look like this: ‘010121000’; the first two digits correspond to the HS chapter, which is all we are going to look at for now. We have to group by these two digits.

The files

I downloaded all the CSV files on this page: https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00350300&tstat=000001013141&cycle=1&year=20170&month=24101212&tclass1=000001013180&tclass2=000001013181&result_back=1 (English) and merged them into a single file, data.csv like this:

head -n 1 ik-100h2017e001.csv > header
tail -q -n +2 ik-100h2017e0*csv >> header
mv header data.csv

The HS chapters/sections are described here: http://www.customs.go.jp/english/tariff/2018_4/index.htm (English. A Japanese page is available too, of course.)

The country codes are listed here: http://www.customs.go.jp/toukei/sankou/code/country_e.htm (English. Japanese is available.)

data.csv.gz
countries.csv
hs_sections.csv
hs_chapters_to_sections.csv
hs_sections_no_to_descriptions.csv

The spreadsheet solution

I won’t go into much detail here. First of all, we add worksheets for all of the above files (or reference external files). Then we add a column to compute the first two digits in the HS codes, using a function like MID(C2,2,2). We use VLOOKUP() to look up the HS section. (Perhaps we use another VLOOKUP() for the country codes.) Then we create a pivot table. (It would be more efficient to VLOOKUP() from the pivot table, but while I believe that to be possible in Excel, I’m not sure it’s possible in OpenOffice/LibreOffice.)

Anyway, using spreadsheets is rather user-friendly, but large files take quite a while to process. Adding extra columns to the original data is very inconvenient too. (Using calculated fields in Excel may help with this.)

The CLI/GNU(?) solution

We are going to make use of GNU datamash here. GNU datamash is capable of grouping and summing, which is already halfway there. For the lookups, we use the join command(!), which is part of coreutils.

We need to do some minor pre-processing, as we do not want the header rows in this solution:

tail -n +2 data.csv > data_nh.csv
tail -n +2 countries.csv > countries_nh.csv
tail -n +2 hs_sections.csv > hs_sections_nh.csv

The other files do not have any headers. So far so good, but using common CLI tools gets a bit awkward in the next step, cutting off characters in the middle of the HS code field. Let’s isolate that field:

$ cut -d, -f3 data_nh.csv | head -n3
'010121000'
'010121000'
'010121000'

Then we cut off the unneeded characters:

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | head -n3
01
01
01

Then we need to re-add the other columns. This is one of the slightly awkward steps when doing this using CLI tools. Let’s isolate the other relevant columns first though:

$ cut -d, -f4,9 data_nh.csv | head -n3
103,2100
105,1800
205,84220

To paste these two columns back onto the first isolated column, we use the aptly(?) named paste command. The -d option lets us combine fields using a comma as the delimiter (the default is a tab). We'll pass the HS chapter digits as standard input, and the other two relevant columns using bash's <().

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | paste -d, - <(cut -d, -f4,9 data_nh.csv) | head -n3
01,103,2100
01,105,1800
01,205,84220

What we have now is a trimmed CSV that goes "HS chapter","Country Code","Amount".

The "VLOOKUP" part is slightly tricky. We are going to use the little-known join command, which is included in coreutils. Some HS sections and some country names have commas in them, which is a bit inconvenient, but not a huge problem, as the result of the "VLOOKUP" is attached to the right of the entire original data.

Here’s a quick demonstration of the join command. (Note: countries_nh.csv is pre-sorted. Everything passed to join must be sorted.)

$ echo 222 | join -t, - countries_nh.csv
222,"Finland"

In Excel, we are able to safely group by cells that may contain commas, but not so in datamash. I left out something above: we've got the HS chapter code, but from this chapter code we actually wanted to look up the HS section and group by that section. So let's go back one step and use join to get the HS section number from the HS chapter number. Note that all join input must be sorted, so before the join we add a pipe that sorts on the first field (the chapter code):

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | paste -d, - <(cut -d, -f4,9 data_nh.csv) | sort -n -t, -k1,1 | join -t, - hs_chapters_to_sections.csv | head -n3
00,103,276850736,0
00,105,721488020,0
00,106,258320777,0

Getting the sort command to sort correctly by a single field isn't very easy. If it weren't for the --debug option, that is! In this case we want to sort by the first field, so the command becomes 'sort -n -t, -k1,1'. (Start field == end field == 1, so -k1,1.) Debug output looks like this:

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | paste -d, - <(cut -d, -f4,9 data_nh.csv) | sort -n -t, -k1,1 --debug | head
sort: using ‘en_US.UTF-8’ sorting rules
00,103,276850736
__
________________
00,105,721488020
__
________________
00,106,258320777
__
________________

The field that has been sorted gets underlined. Great! Now let’s do the pivoting part using datamash:

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | paste -d, - <(cut -d, -f4,9 data_nh.csv) | sort -n -t, -k1,1 | join -t, - hs_chapters_to_sections.csv | datamash -s -t, groupby 2,1,4 sum 3
103,00,276850736
103,01,418085
103,03,14476769

The -s option sorts, which is required when using groupby. The -t option selects ‘,’ as the delimiter. This command groups by column 2 (country), and then by column 4 (our HS section number), and computes a sum of column 3 for this grouping. So this is it! If we know the country codes and HS sections by heart, that is.

Well, our above output from datamash starts with the country code, and things are nice and sorted, so we just have to join again:

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | paste -d, - <(cut -d, -f4,9 data_nh.csv) | sort -n -t, -k1,1 | join -t, - hs_chapters_to_sections.csv | datamash -s -t, groupby 2,4 sum 3 | join -t, - countries_nh.csv | head -n3
103,0,276850736,"Republic of Korea"
103,1,15325799,"Republic of Korea"
103,10,50079044,"Republic of Korea"

Next we would like to look up the HS section number to get the HS section description. In the above commands, we joined on the first field, but fortunately join supports joining on different fields. We need to sort on the second field and then tell join to join on the second field, which can be accomplished by using the -1 option and specifying 2. (So -1 2, or simply -12, though that may look confusing.)

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | paste -d, - <(cut -d, -f4,9 data_nh.csv) | sort -n -t, -k1,1 | join -t, - hs_chapters_to_sections.csv | datamash -s -t, groupby 2,4 sum 3 | join -t, - countries_nh.csv | sort -n -t, -k2,2 | join -12 -t, - hs_sections_no_to_descriptions.csv | head -n3
00,103,276850736,"Republic of Korea","Unknown"
00,105,721488020,"People's Republic of China","Unknown"
00,106,258320777,"Taiwan","Unknown"

That’s it! To get e.g. Finland, we’ll make it easy for ourselves and just grep for Finland:

$ cut -d, -f3 data_nh.csv | cut -c 2-3 | paste -d, - <(cut -d, -f4,9 data_nh.csv) | sort -n -t, -k1,1 | join -t, - hs_chapters_to_sections.csv | datamash -s -t, groupby 2,4 sum 3 | join -t, - countries_nh.csv | sort -n -t, -k2,2 | join -12 -t, - hs_sections_no_to_descriptions.csv | grep Finland
0,222,1758315,"Finland","Unknown"
2,222,10613,"Finland","VEGETABLE PRODUCTS"
3,222,654,"Finland","ANIMAL OR VEGETABLE FATS AND OILS AND THEIR CLEAVAGE PRODUCTS; PREPARED EDIBLE FATS; ANIMAL OR VEGETABLE WAXES"
4,222,45021,"Finland","PREPARED FOODSTUFFS; BEVERAGES, SPIRITS AND VINEGAR; TOBACCO AND MANUFACTURED TOBACCO SUBSTITUTES"
5,222,33611,"Finland","MINERAL PRODUCTS"
6,222,2624353,"Finland","PRODUCTS OF THE CHEMICAL OR ALLIED INDUSTRIES"
7,222,4880410,"Finland","PLASTICS AND ARTICLES THEREOF; RUBBER AND ARTICLES THEREOF"
8,222,12557,"Finland","RAW HIDES AND SKINS, LEATHER, FURSKINS AND ARTICLES THEREOF; ADDLERY AND HARNESS; TRAVEL GOODS, HANDBAGS AND SIMILAR CONTAINERS; ARTICLES OF ANIMAL GUT (OTHER THAN SILK-WORM GUT)"
9,222,3766,"Finland","WOOD AND ARTICLES OF WOOD; WOOD CHARCOAL; CORK AND ARTICLES OF CORK; MANUFACTURES OF STRAW, OF ESPARTO OR OF OTHER PLAITING MATERIALS; BASKETWARE AND WICKERWORK"
10,222,38476,"Finland","PULP OF WOOD OR OF OTHER FIBROUS CELLULOSIC MATERIAL; RECOVERED (WASTE AND SCRAP) PAPER OR PAPERBOARD; PAPER AND PAPERBOARD AND ARTICLES THEREOF"
11,222,527084,"Finland","TEXTILES AND TEXTILE ARTICLES"
12,222,1541,"Finland","FOOTWEAR, HEADGEAR, UMBRELLAS, SUN UMBRELLAS, WALKING-STICKS, SEAT-STICKS, WHIPS, RIDING-CROPS AND PARTS THEREOF; PREPARED FEATHERS AND ARTICLES MADE THEREWITH; ARTIFICIAL FLOWERS; ARTICLES OF HUMAN HAIR"
13,222,991508,"Finland","ARTICLES OF STONE, PLASTER, CEMENT, ASBESTOS, MICA OR SIMILAR MATERIALS; CERAMIC PRODUCTS; GLASS AND GLASSWARE"
14,222,5757,"Finland","NATURAL OR CULTURED PEARLS, PRECIOUS OR SEMI-PRECIOUS STONES, PRECIOUS METALS, METALS CLAD WITH PRECIOUS METAL AND ARTICLES THEREOF; IMITATION JEWELLERY; COIN"
15,222,971561,"Finland","BASE METALS AND ARTICLES OF BASE METAL"
16,222,14614308,"Finland","MACHINERY AND MECHANICAL APPLIANCES; ELECTRICAL EQUIPMENT; PARTS THEREOF; SOUND RECORDERS AND REPRODUCERS, TELEVISION IMAGE AND SOUND RECORDERS AND REPRODUCERS, AND PARTS AND ACCESSORIES OF SUCH ARTICLES"
17,222,13427653,"Finland","VEHICLES, AIRCRAFT, VESSELS AND ASSOCIATED TRANSPORT EQUIPMENT"
18,222,4062385,"Finland","OPTICAL, PHOTOGRAPHIC, CINEMATOGRAPHIC, MEASURING, CHECKING, PRECISION, MEDICAL OR SURGICAL INSTRUMENTS AND APPARATUS; CLOCKS AND WATCHES; MUSICAL INSTRUMENTS; PARTS AND ACCESSORIES THEREOF"
19,222,4550,"Finland","ARMS AND AMMUNITION; PARTS AND ACCESSORIES THEREOF"
20,222,399367,"Finland","MISCELLANEOUS MANUFACTURED ARTICLES"
21,222,20539,"Finland","WORKS OF ART, COLLECTORS' PIECES AND ANTIQUES"

If you need to match the country field exactly, you could replace the above grep with something like this:

... | grep -P '^.*?,.*?,.*?,"Finland"'

Though personally I’d maybe use awk for that. On my machine, executing this takes 0.675 seconds. That’s pretty fast!
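For reference, an awk version of that stricter match (assuming the quoted country name is always the fourth comma-separated field, as in the output above) could look like this:

... | awk -F, '$4 == "\"Finland\""'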

The q solution

q is a tool that allows you to perform SQL queries on CSV files from the comfort of the command line. If you know SQL, that should be pretty cool!

We’ve got some issues with digits and hyphens in the column names, so we first pre-process to get rid of those:

$ head -n 1 data.csv | tr '1-9' 'A-I' | sed 's/-//g'; tail -n +2 data.csv

This uses tr to replace the digits 1-9 with corresponding letters and sed to get rid of hyphens. We pipe this into q. The q command itself looks like this:

$ q -d, -H 'select Description, sum(ValueYear) from - JOIN hs_sections.csv ON substr(HS,2,2)=Number JOIN countries.csv ON Country=CountryID where CountryName="Finland" group by Description'

-d specifies the delimiter, -H specifies the presence of a header row (and we can use these header names in the query!) and the rest is just SQL.

$ time (head -n 1 data.csv | tr '1-9' 'A-I' | sed 's/-//g'; tail -n +2 data.csv) | q -d, -H 'select Description, sum(ValueYear) from - JOIN hs_sections.csv ON substr(HS,2,2)=Number JOIN countries.csv ON Country=CountryID where CountryName="Finland" group by Description'
ANIMAL OR VEGETABLE FATS AND OILS AND THEIR CLEAVAGE PRODUCTS; PREPARED EDIBLE FATS; ANIMAL OR VEGETABLE WAXES,654
ARMS AND AMMUNITION; PARTS AND ACCESSORIES THEREOF,4550
"ARTICLES OF STONE, PLASTER, CEMENT, ASBESTOS, MICA OR SIMILAR MATERIALS; CERAMIC PRODUCTS; GLASS AND GLASSWARE",991508
...
real    0m12.590s
user    0m12.364s
sys     0m0.236s

Wow, this was pretty pleasant, but it took my machine 12.59 seconds to get here. q uses sqlite, and we can use the -S option to save the resulting sqlite database to a file. Here's an sqlite3 command that runs an equivalent query on a saved database (it sums the quantity column instead of the value column, which is why the numbers differ from the q output above):

time sqlite3 data.sqlite 'select Description, sum(QuantityAYear) from `-` JOIN `hs_sections.csv` ON substr(HS,2,2)=Number JOIN `countries.csv` ON Country=CountryID where CountryName="Finland" group by Description;' 
ANIMAL OR VEGETABLE FATS AND OILS AND THEIR CLEAVAGE PRODUCTS; PREPARED EDIBLE FATS; ANIMAL OR VEGETABLE WAXES|0
ARMS AND AMMUNITION; PARTS AND ACCESSORIES THEREOF|0
ARTICLES OF STONE, PLASTER, CEMENT, ASBESTOS, MICA OR SIMILAR MATERIALS; CERAMIC PRODUCTS; GLASS AND GLASSWARE|7436
...
real    0m0.218s
user    0m0.192s
sys     0m0.024s

As you can see, that is pretty fast, so q evidently spends most of its time importing the CSV file. Well, as it turns out, sqlite3 supports importing CSV files as well. Here's a command that creates a database in test.sqlite, imports the three required CSV files, and runs the query:

$ time (sqlite3 -csv test.sqlite '.import data.csv data'; sqlite3 -csv test.sqlite '.import hs_sections.csv hs_sections'; sqlite3 -csv test.sqlite '.import countries.csv countries'; sqlite3 -csv test.sqlite 'select Description, sum(`Quantity1-Year`) from data JOIN hs_sections ON substr(HS,2,2)=Number JOIN countries ON Country=CountryID where CountryName="Finland" group by Description; ')
"ANIMAL OR VEGETABLE FATS AND OILS AND THEIR CLEAVAGE PRODUCTS; PREPARED EDIBLE FATS; ANIMAL OR VEGETABLE WAXES",0
"ARMS AND AMMUNITION; PARTS AND ACCESSORIES THEREOF",0
"ARTICLES OF STONE, PLASTER, CEMENT, ASBESTOS, MICA OR SIMILAR MATERIALS; CERAMIC PRODUCTS; GLASS AND GLASSWARE",267696
...
real    0m2.581s
user    0m2.372s
sys     0m0.120s

This is much faster than using q, but may have various limitations. Note that if you feed sqlite3 commands through standard input, you can do all this in a single sqlite3 session, and the database can be entirely in-memory.
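For example, a rough sketch of such a single in-memory session (using the same file, table, and column names as above) might look like this:

sqlite3 :memory: <<'EOF'
.mode csv
.import data.csv data
.import hs_sections.csv hs_sections
.import countries.csv countries
select Description, sum(`Quantity1-Year`) from data JOIN hs_sections ON substr(HS,2,2)=Number JOIN countries ON Country=CountryID where CountryName='Finland' group by Description;
EOF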

How to buy music on Amazon.co.jp with a credit card issued outside Japan

Getting a Japanese credit card is quite difficult, so I imagine there are quite a few people who just keep using their foreign credit card indefinitely. I'm certainly one of them. There are plenty of inconveniences, such as having to pay the bill by international money transfer.

This time, when I tried to buy Kana Nishino's song "abracadabra" in the Amazon.co.jp Digital Music Store, I got the error message "We could not process your purchase with the currently registered payment information. Please use a credit card issued in Japan." and the purchase failed. (Incidentally, the music I usually listen to is quite different, but the groove of this song and the piano in the background are pretty nice.)

But as it turns out, you can also pay with a gift card. And you don't even have to walk to a convenience store to buy one, because Amazon gift cards can be bought directly on Amazon, with the gift code delivered by email: "Amazon Gift Card (Email type)". Enter your own email address as the recipient and pay with your foreign credit card.

Back in the Digital Music Store, I tried the purchase again and... huh, the same error message. There's no point in dragging the story out: after deleting the credit card information in Amazon's payment settings, the payment went through just fine using the gift card balance I had just bought.

(I also like Kana Nishino's song "Darling", as it happens.)

Can’t connect to (e.g.) GitLab, failing with “no hostkey alg”

If you get the following error message when connecting to a server (in my case, it was a GitLab instance running on Docker using something at least inspired by the official Docker image), you may be using an older SSH client, such as the one in RHEL/CentOS 6.

no hostkey alg

Some cursory web searching didn’t give me a satisfactory solution, so here goes: It seems likely that you’re using an older ssh client (for example, the one in CentOS 6.x). This client unfortunately doesn’t support the -Q option to list supported host keys, but we can figure out that information by doing the following:

ssh -vvvv 127.0.0.1

On a more modern system, you might get something like this:

debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa

(Which is the same as ssh -Q key, but harder to read.)

On older systems, you won’t get the helpful “host key algorithms:” label, but you’ll still get the information. So perhaps look out for a line that contains “ssh-rsa”.

Then, try the same ssh -vvvv 12.34.56.78 (replace 12.34.56.78 with the target server’s name or address), and look at the equivalent line. (Or if you have access, log into the server and try ssh -Q key.)

In my case, the client only had ssh-rsa and ssh-dsa, and the target server only listed ecdsa-sha2-nistp256. This could be solved entirely on the client side. All we have to do is add an option to the command line and create a key if it doesn't exist yet:

ssh-keygen -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key
ssh -o HostKeyAlgorithms=ecdsa-sha2-nistp256,ssh-rsa 12.34.56.78

To avoid adding this option every time, you can add the following into your ~/.ssh/config:

Host 12.34.56.78
        HostKeyAlgorithms ecdsa-sha2-nistp256,ssh-rsa

(Or if you want this on all hosts: Host *)
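That is, the same option in a catch-all Host block:

Host *
        HostKeyAlgorithms ecdsa-sha2-nistp256,ssh-rsa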

Hope this helps.

Factors of the first 1,000,000 integer numbers

A quick Google search only brought up lists up to e.g. 1,000, so here’s a CSV with the factors for the first 1,000,000 integers. (Note that from a programming perspective, you’re most likely better off calculating the factors yourself rather than parsing a CSV file. So it’s not quite clear if this list is actually useful.) Feel free to do with it whatever you want.

1000000_factors.gz (23 MB)

Here’s an extract:

1,1
2,1
3,1
4,1,2
5,1
6,1,2,3
7,1
8,1,2,4
9,1,3
10,1,2,5
11,1
12,1,2,3,4,6
13,1
14,1,2,7
15,1,3,5
16,1,2,4,8
17,1
18,1,2,3,6,9
19,1
20,1,2,4,5,10
21,1,3,7
22,1,2,11
23,1
24,1,2,3,4,6,8,12
25,1,5
26,1,2,13
27,1,3,9
28,1,2,4,7,14
29,1
30,1,2,3,5,6,10,15
31,1
32,1,2,4,8,16
33,1,3,11
34,1,2,17
35,1,5,7
36,1,2,3,4,6,9,12,18
37,1
38,1,2,19
39,1,3,13
40,1,2,4,5,8,10,20
41,1
42,1,2,3,6,7,14,21
43,1
44,1,2,4,11,22
45,1,3,5,9,15
46,1,2,23
47,1
48,1,2,3,4,6,8,12,16,24
49,1,7
50,1,2,5,10,25
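If you would rather generate such a list yourself, here is a rough, unoptimized C sketch (not the program that produced the file above) that prints proper divisors in the same format:

#include <stdio.h>
#include <stdlib.h>

// Sort helper for qsort: ascending ints.
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int divisors[1024]; // numbers below 1,000,000 have at most a few hundred divisors
    for (int n = 1; n <= 1000000; n++) {
        int count = 0;
        // Collect divisor pairs (d, n/d); skip n itself so only proper divisors remain.
        for (int d = 1; (long long)d * d <= n; d++) {
            if (n % d == 0) {
                divisors[count++] = d;
                if (d != n / d && n / d != n)
                    divisors[count++] = n / d;
            }
        }
        qsort(divisors, count, sizeof divisors[0], cmp_int);
        printf("%d", n);
        for (int i = 0; i < count; i++)
            printf(",%d", divisors[i]);
        printf("\n");
    }
    return 0;
}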

Matrix multiplication using the FMA instruction

In my previous post, we did matrix multiplication using regular SSE/AVX instructions. In this post, we’ll implement matrix multiplication using the FMA (fused multiply-add) instruction, which takes three arguments and is able to multiply and add at the same time. (Think c = a*b + c.)

If you got here directly without reading the previous posts, note that this is just a somewhat naive implementation using transposed matrices (but without doing the transposition ourselves). (The intent of this article series is to show how to use SIMD instructions.) My previous post had a few links for people who need something really optimized.

Note that we’re using “FMA3” instructions, rather than FMA4 instructions, which only seem to be supported on some AMD processors. (The number indicates the number of arguments passed to the instruction. In the case of FMA4, the formula would be a = b*c + d.)

The first table below shows the operation of the first FMA instruction, where a is still 0 (as there is nothing to add on the very first instruction), and the second table the following FMA instruction, where the “addend” is the result in the first table.

Addend (a) 0 0 0 0
Factor 1 (b) 0.1 0.1 0.1 0.1
Factor 2 (c) 0.1 0.1 0.1 0.1
Result (a) 0.01 0.01 0.01 0.01

Addend (a) 0.01 0.01 0.01 0.01
Factor 1 (b) 0.1 0.1 0.1 0.1
Factor 2 (c) 0.1 0.1 0.1 0.1
Result (a) 0.02 0.02 0.02 0.02

Here’s the code. I changed the square matrix size to 2048 to make measuring a bit easier.

#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

#define N 2048

float *matrix_a;
float *matrix_b;
float result[N][N];

void chunked_mm(int chunk, int n_chunks) {
    __m256 va, vb, vc;
    for (int i = chunk*(N/n_chunks); i < (chunk+1)*(N/n_chunks); i++) {
        for (int j = 0; j < N; j++) {
            float buffer[8] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
            vc = _mm256_loadu_ps(buffer);
            for (int k = 0; k < N; k += 8) {
                // load
                va = _mm256_loadu_ps(matrix_a+(i*N)+k); // matrix_a[i][k]
                vb = _mm256_loadu_ps(matrix_b+(j*N)+k); // matrix_b[j][k]

                // fused multiply and add
                vc = _mm256_fmadd_ps(va, vb, vc);
            }
            //vc = _mm256_hadd_ps(vc, vc);
            _mm256_storeu_ps(buffer, vc);
            result[i][j] = buffer[0] + buffer[1] + buffer[2] + buffer[3] + buffer[4] + buffer[5] + buffer[6] + buffer[7];
            //result[i][j] = buffer[0] + buffer[2] + buffer[4] + buffer[6];
        }
    }
}

int main(int argc, char **argv) {
    // initialize matrix_a and matrix_b
    matrix_a = malloc(N*N*sizeof(float));
    matrix_b = malloc(N*N*sizeof(float));

    for (int i = 0; i < N*N; i++) {
        *(matrix_a+i) = 0.1f;
        *(matrix_b+i) = 0.2f;
    }
    // initialize result matrix
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            result[i][j] = 0.0f;
        }
    }

    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        chunked_mm(i, 4);
    }
    
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            //printf("%f ", result[i][j]);
            printf("%x ", *(unsigned int*)&result[i][j]);
        }
        printf("\n");
    }
    
    return 0;
}

Performance

Since the CPU used in the previous articles doesn’t support FMA (and we changed N), I’m re-benchmarking the AVX256 version on the new processor.

AVX256: 1.25 seconds
FMA: 1 second

Unfortunately this is borrowed hardware so I can’t play around with this too much, but the above result is pretty consistent.

Matrix multiplication using SIMD instructions

In my previous post, I tried various things to improve the performance of a matrix multiplication using compiler features.

# 20 seconds
gcc -Wall -o mm mm.c

# 1.182 seconds
gcc -g -O4 -fopenmp -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm_autovectorized_openmp mm_autovectorized_openmp.c

However, -O4 -fopenmp using transposed matrices turned out faster (0.882 seconds) than -O4 -fopenmp and auto-vectorization using untransposed matrices. I couldn’t get auto-vectorization to work with the transposed matrices.

In this post, we'll use simple SIMD instructions to optimize this further. It builds on my post from two days ago, where I explain how to use SIMD instructions for a very simple and synthetic example.

Note that much more can be done to optimize matrix multiplication than is described in this post. This post just explains the very basics. If you need more advanced algorithms, maybe look through these three links:

https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0 High Performance Matrix Multiplication

https://news.ycombinator.com/item?id=17164737 Hacker News discussion of above post

https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf Anatomy of High-Performance Matrix Multiplication (academic paper)

Here’s an interesting article that implements high-performance matrix multiplication in just 100 lines, using FMA3: https://cs.stanford.edu/people/shadjis/blas.html

Using transposed matrices makes vectorizing matrix multiplication quite easy. Why? Well, remember that in our simple example, there were three steps. The first step requires that the data to be loaded is laid out sequentially in memory.

  1. Loading data into SIMD registers
  2. Performing operations on corresponding operands in two SIMD registers
  3. Storing the result

Step 1: Loading data

Remember that the data load wanted a memory address where the four (or eight) float values were stored sequentially. Well, if we just transpose the matrix before we start doing stuff, we can just load the matrix B floats sequentially. So the code looks almost the same as in the baby steps post. To make things a bit easier, we will be using SSE for now.

va = _mm_loadu_ps(&(matrix_a[i][k]));
vb = _mm_loadu_ps(&(matrix_b[j][k]));

Step 2: Doing the calculations

All right. We have our floats loaded into two registers. In SSE, we have four floats per register:

Register 1 (va) 0.1 0.1 0.1 0.1
Register 2 (vb) 0.2 0.2 0.2 0.2

The first step is to multiply. In the baby steps post, we used _mm_add_ps to perform addition. Well, multiplication uses an intrinsic with a similar name: _mm_mul_ps. (The AVX version is _mm256_mul_ps.) So if we do:

 vresult = _mm_mul_ps(va, vb)

And we get:

vresult 0.02 0.02 0.02 0.02

Great! Now we just need to add the contents of vresult together! Unfortunately, there is no SIMD instruction that would add every component together to give us 0.08 as the output, given the above vresult as its only input.

From SSE3, there exists _mm_hadd_ps however, the “horizontal add” instruction (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_hadd_ps&expand=2777), which takes two registers as input (you can use the same registers), and computes:

dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]

Here’s an example:

va 0.1 0.2 0.3 0.4
vb 0.5 0.6 0.7 0.8
vresult 0.3 0.7 1.1 1.5

Maybe you can already see that this is a bit odd – why does it want two registers as input, for starters? We wanted 0.1+0.2+0.3+0.4, which should be 1. Well, let's see what happens when we use the same register for both inputs, and perform this operation twice!

va 0.1 0.2 0.3 0.4
va 0.1 0.2 0.3 0.4
vresult 0.3 0.7 0.3 0.7

vresult 0.3 0.7 0.3 0.7
vresult 0.3 0.7 0.3 0.7
vresult (new) 1 1 1 1

Yay, we did it! We got 1, which is the result of 0.1+0.2+0.3+0.4. (This works for SSE. We will talk about AVX later.) Here’s the code:

vresult = _mm_hadd_ps(vresult, vresult);
vresult = _mm_hadd_ps(vresult, vresult);

Step 3: Storing the result

Step 3 involves storing the result. We can of course just store the four floats into an array as before, but as they're all the same, we're really only interested in one of them. We could use _mm_extract_ps, which is capable of extracting any of the four floats. But we can do slightly better: we can just cast, which will get us the lowest float in the 128-bit register. There is an intrinsic for this type of cast, _mm_cvtss_f32, so we can just write:

result[i][j] += _mm_cvtss_f32(vresult);

And that’s (assuming SSE3) four sub-operations of the matrix multiplication done in one go! Because we’re doing four ks at once, we have to change the inner loop to reflect that:

for (int k = 0; k < 1024; k += 4) {
    ...
}

So let’s see the code. In this example I’ve also decided to use malloc instead of stack arrays (except for result), so matrix_a[i][k] turns into matrix_a+(i*1024)+k.

#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    float *matrix_a = malloc(1024*1024*sizeof(float));
    float *matrix_b = malloc(1024*1024*sizeof(float));
    float result[1024][1024];
    __m128 va, vb, vresult;

    // initialize matrix_a and matrix_b
    for (int i = 0; i < 1048576; i++) {
        *(matrix_a+i) = 0.1f;
        *(matrix_b+i) = 0.2f;
    }
    // initialize result matrix
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            result[i][j] = 0;
        }
    }

    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            for (int k = 0; k < 1024; k += 4) {
                // load
                va = _mm_loadu_ps(matrix_a+(i*1024)+k); // matrix_a[i][k]
                vb = _mm_loadu_ps(matrix_b+(j*1024)+k); // matrix_b[j][k]

                // multiply
                vresult = _mm_mul_ps(va, vb);

                // add
                vresult = _mm_hadd_ps(vresult, vresult);
                vresult = _mm_hadd_ps(vresult, vresult);

                // store
                result[i][j] += _mm_cvtss_f32(vresult);
            }
        }
    }
    
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            printf("%f ", result[i][j]);
        }
        printf("\n");
    }
    
    return 0;
}
gcc -O4 -fopt-info-optall-optimized -msse3 -o sse_mm_unaligned sse_mm_unaligned.c
time ./sse_mm_unaligned > /dev/null

real    0m1.054s
user    0m1.044s
sys     0m0.008s

And the run time is about 1.054 seconds using a single thread. Note that we have to pass -msse3 to gcc, as vanilla SSE does not support the horizontal add instruction.

AVX

As mentioned earlier, the double-hadd method does not work for the AVX _mm256_hadd_ps intrinsic (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_hadd_ps&expand=2778), which works like this:

dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
dst[159:128] := a[191:160] + a[159:128]
dst[191:160] := a[255:224] + a[223:192]
dst[223:192] := b[191:160] + b[159:128]
dst[255:224] := b[255:224] + b[223:192]

Here’s a va-vb-table that shows what happens with AVX:

va 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
vb 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6
vresult 0.3 0.7 1.9 2.3 1.1 1.5 2.7 3.1

Here’s the first va-va table of the double-hadd method:

va 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
va 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
vresult 0.3 0.7 0.3 0.7 1.1 1.5 1.1 1.5

And the second vresult-vresult table:

vresult 0.3 0.7 0.3 0.7 1.1 1.5 1.1 1.5
vresult 0.3 0.7 0.3 0.7 1.1 1.5 1.1 1.5
vresult (new) 1 1 1 1 2.6 2.6 2.6 2.6

As you can see, we do not reach our expected result of 3.6 (0.1+0.2+…+0.8). (It’s just like it’s doing two SSE hadds completely independent from each other.) There are various ways to get out of this problem, e.g. extract the two 128-bit halves from the 256-bit register, and then use SSE instructions. This is how you extract:

vlow = _mm256_extractf128_ps(va, 0);
vhigh = _mm256_extractf128_ps(va, 1);

The second argument indicates which half you want.

As an aside: instead of extracting the lower 128 bits and putting them in a register, we can also use a cast, _mm256_castps256_ps128 (https://software.intel.com/en-us/node/524181).

The lower 128-bits of the source vector are passed unchanged to the result. This intrinsic does not introduce extra moves to the generated code.

Anyway, let’s go with the extracted values first. So we have the following situation:

vlow 0.1 0.2 0.3 0.4
vhigh 0.5 0.6 0.7 0.8

And we want to add all these eight values together. So why don't we just simply use our trusty _mm_add_ps(vlow, vhigh) first? This way we can do four additions at once, leaving us with the following 128-bit register:

vresult 0.6 0.8 1 1.2

And now we want to add up horizontally, so we use the double-_mm_hadd_ps method described above:

vresult 0.6 0.8 1 1.2
vresult 0.6 0.8 1 1.2
vresult 1.4 2.2 1.4 2.2

vresult 1.4 2.2 1.4 2.2
vresult 1.4 2.2 1.4 2.2
vresult 3.6 3.6 3.6 3.6

Here's the full code, AVX version:

#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    float *matrix_a = malloc(1024*1024*sizeof(float));
    float *matrix_b = malloc(1024*1024*sizeof(float));
    float result[1024][1024];
    __m256 va, vb, vtemp;
    __m128 vlow, vhigh, vresult;

    // initialize matrix_a and matrix_b
    for (int i = 0; i < 1048576; i++) {
        *(matrix_a+i) = 0.1f;
        *(matrix_b+i) = 0.2f;
    }
    // initialize result matrix
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            result[i][j] = 0;
        }
    }

    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            for (int k = 0; k < 1024; k += 8) {
                // load
                va = _mm256_loadu_ps(matrix_a+(i*1024)+k); // matrix_a[i][k]
                vb = _mm256_loadu_ps(matrix_b+(j*1024)+k); // matrix_b[j][k]

                // multiply
                vtemp = _mm256_mul_ps(va, vb);

                // add
                // extract higher four floats
                vhigh = _mm256_extractf128_ps(vtemp, 1); // high 128
                // add higher four floats to lower floats
                vresult = _mm_add_ps(_mm256_castps256_ps128(vtemp), vhigh);
                // horizontal add of that result
                vresult = _mm_hadd_ps(vresult, vresult);
                // another horizontal add of that result
                vresult = _mm_hadd_ps(vresult, vresult);

                // store
                result[i][j] += _mm_cvtss_f32(vresult);
            }
        }
    }
    
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            printf("%f ", result[i][j]);
        }
        printf("\n");
    }
    
    return 0;
}
$ gcc -O4 -fopt-info-optall-optimized -mavx -o avx256_mm_unaligned avx256_mm_unaligned.c 
$ time ./avx256_mm_unaligned > /dev/null

real    0m0.912s
user    0m0.904s
sys     0m0.004s

That is... a tiny bit faster. (Note that I'm running everything multiple times to make sure the difference isn't just due to chance.) However, with AVX we are supposed to get twice the FLOPs, right? We'll look at other optimizations of the vectorization in a later post. Before that, let's add OpenMP into the mix.

OpenMP

Unfortunately, OpenMP’s #pragma omp parallel for sometimes doesn’t appear to do what you need it to do. Sticking this in front of the outer (i) loop reduces performance by half! However, we can be sure that this isn’t the processor “oversubscribing” the SIMD units, because if we run two instances of our program at the same time, both finish with almost the same run time we see with just a single instance:

$ time (./avx256_mm_unaligned & ./avx256_mm_unaligned; wait) > /dev/null
real    0m1.001s
user    0m0.988s
sys     0m0.008s

So we’ll use the same chunking trick that we used last time, and our result gets a little better: 0.753 seconds:

#include <x86intrin.h> // Need this in order to be able to use the AVX "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>
#include <stdlib.h>

float *matrix_a;
float *matrix_b;
float result[1024][1024];

void chunked_mm(int chunk, int n_chunks) {
    __m256 va, vb, vtemp;
    __m128 vlow, vhigh, vresult;
    for (int i = chunk*(1024/n_chunks); i < (chunk+1)*(1024/n_chunks); i++) {
        for (int j = 0; j < 1024; j++) {
            for (int k = 0; k < 1024; k += 8) {
                // load
                va = _mm256_loadu_ps(matrix_a+(i*1024)+k); // matrix_a[i][k]
                vb = _mm256_loadu_ps(matrix_b+(j*1024)+k); // matrix_b[j][k]

                // multiply
                vtemp = _mm256_mul_ps(va, vb);

                // add
                // extract higher four floats
                vhigh = _mm256_extractf128_ps(vtemp, 1); // high 128
                // add higher four floats to lower floats
                vresult = _mm_add_ps(_mm256_castps256_ps128(vtemp), vhigh);
                // horizontal add of that result
                vresult = _mm_hadd_ps(vresult, vresult);
                // another horizontal add of that result
                vresult = _mm_hadd_ps(vresult, vresult);

                // store
                result[i][j] += _mm_cvtss_f32(vresult);
            }
        }
    }
}

int main(int argc, char **argv) {
    // initialize matrix_a and matrix_b
    matrix_a = malloc(1024*1024*sizeof(float));
    matrix_b = malloc(1024*1024*sizeof(float));
    for (int i = 0; i < 1048576; i++) {
        *(matrix_a+i) = 0.1f;
        *(matrix_b+i) = 0.2f;
    }
    // initialize result matrix
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            result[i][j] = 0;
        }
    }

    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; i++) {
        chunked_mm(i, 4);
    }
    
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            printf("%f ", result[i][j]);
        }
        printf("\n");
    }
    
    return 0;
}
$ gcc -fopenmp -O4 -mavx -o avx256_mm_unaligned_openmp avx256_mm_unaligned_openmp.c
$ time ./avx256_mm_unaligned_openmp > /dev/null 

real    0m0.753s
user    0m1.332s
sys     0m0.008s

To be honest, with a 2 core/4 thread system, I would have expected better. Running multiple instances doesn’t increase the run time, and the previous version took only 1.27 times as long as this.

Re-evaluating our performance measurements

Array initialization will always take the same small amount of time, but printf(“%f”, …) takes a non-constant amount of time and depends on the values. Let’s see what kind of timing we get when we change this to an %x format string.

printf("%x ", *(unsigned int*)&result[i][j]);
time ./avx256_mm_unaligned > /dev/null

real    0m0.488s
user    0m0.480s
sys     0m0.004s

time ./avx256_mm_unaligned_openmp > /dev/null

real    0m0.277s
user    0m0.832s
sys     0m0.008s

That sounds much better, both in absolute terms and in OpenMP terms. By the way, if we remove the matrix multiplication and only leave initialization and output, we still get an execution time of about 0.111 seconds. So it’s reasonably safe to say that our matrix multiplication takes about 0.377 seconds on a single thread. (I feel like I shot myself in the foot for measuring this using shell’s time, rather than embedding the measurement in the code itself…)
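By the way, embedding the measurement is easy enough. Here's a minimal, self-contained sketch of the idea using clock_gettime, with a dummy workload standing in for the parallel chunked_mm loop (on older glibc you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

// Dummy workload standing in for the part we actually want to time.
static double do_work(void) {
    double x = 0;
    for (int i = 0; i < 100000000; i++)
        x += 0.1;
    return x;
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double x = do_work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    fprintf(stderr, "work took %.3f s (result %f)\n", elapsed, x);
    return 0;
}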

Aligned accesses

To allow the use of the aligned _mm256_load_ps, define ALIGNMENT as 32 (AVX wants 32-byte alignment) and allocate your memory like this:

    matrix_a = aligned_alloc(ALIGNMENT, 1024*1024*sizeof(float));
    matrix_b = aligned_alloc(ALIGNMENT, 1024*1024*sizeof(float));

Unfortunately, I didn’t notice a significant difference. (You may be able to shave off a few percent.)

Results

Here are the results, again:

                    AVX, no OpenMP   AVX, OpenMP   SSE, no OpenMP
Run time            0.488            0.277         0.59
Minus init/output   0.377            0.166         0.479

Matrix multiplication using gcc’s auto-vectorization

In my previous post, I tried to explain how to use SIMD instructions for a really simple (and artificial) example: just adding numbers in two vectors together. In this post, I’d like to take this just a little bit further and talk about matrix multiplication. In this post, we’re using gcc’s auto-vectorization. We’ll vectorize this ourselves in my next post.

If you’re here, you probably know what matrix multiplication is. It’s got a lot of uses, including graphics and neural networks.

We'll keep our implementation simple by only supporting square matrices with n divisible by 16 (in the case of AVX). Our example will use n=1024. So before we do the vectorized implementation, let's look at a general ("naive") example:

#include <stdio.h>

int main(int argc, char **argv) {
    float matrix_a[1024][1024];
    float matrix_b[1024][1024];
    float result_matrix[1024][1024];
    
    // initialize arrays
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            matrix_a[i][j] = 0.1f;
            matrix_b[i][j] = 0.2f;
            result_matrix[i][j] = 0.0f;
        }
    }

    for (int i = 0; i < 1024; i++) { // iterate over rows of matrix A/result matrix
        for (int j = 0; j < 1024; j++) { // iterate over columns of matrix B/result matrix
            for (int k = 0; k < 1024; k++) { // iterate over columns of matrix A and rows of matrix B
                result_matrix[i][j] += matrix_a[i][k]*matrix_b[k][j];
            }
        }
    }

    // output
    for (int i = 0; i < 1024; i++) {
        for (int j = 0; j < 1024; j++) {
            printf("%f ", result_matrix[i][j]);
        }
        printf("\n");
    }
}

To compile and run, execute the following commands:

$ gcc -Wall -o mm mm.c
$ ulimit -s 16384
$ time ./mm > mm_output

real    0m20.189s
user    0m20.016s
sys     0m0.072s

(Note that we are allocating the arrays on the stack rather than using malloc, so we need to raise the stack size a bit, otherwise we get an immediate segmentation fault.)

The reason matrix multiplication code can look a bit mysterious is that there are a lot of things that can be optimized. However, there is only one optimization that is required to get vectorization to work at all.

As you can see, when we access matrix_a, we access matrix_a[i][0], then matrix_a[i][1], matrix_a[i][2], matrix_a[i][3], and so on until we have hit the end. This is nice and sequential memory access, and is much faster than haphazard ("random") accesses. In matrix_b, we have somewhat haphazard accesses. The first access is matrix_b[0][j], the second access is (in our example, with 1024 four-byte floats per row) 4096 bytes away from the first, matrix_b[1][j], then another 4096 bytes away at matrix_b[2][j], etc. There is a 4096-byte gap between every access. This kind of access is slow. It ruins the CPU's caching system. This is why matrix_b will often be transposed in matrix multiplication code. If you transpose the matrix, the rows will be the columns and the columns the rows, thus you get nice and sequential access to matrix_b. (In our demonstration code, we are using square matrices with the same values everywhere, so we don't actually have to do any copying work, as matrix_b is the same transposed or not. So all we have to do is swap the indices.)

            result_matrix[i][j] += matrix_a[i][k]*matrix_b[j][k];

So what kind of speed-up does this get us? The naive implementation takes 19-21 seconds on my system. The implementation with the transposed matrix takes 4 seconds! That’s a 5x speed-up!

Next, we’ll try to parallelize the outer for-loop using OpenMP. With OpenMP we just have to add #pragma omp parallel for in front of the loop, like this:

    #pragma omp parallel for
    for (int i = 0; i < 1024; i++) {

And then compile and run like this:

$ gcc -fopenmp -Wall -o mmT mmT.c
$ ulimit -s 16384
$ time ./mmT > /dev/null

real    0m2.939s
user    0m9.984s
sys     0m0.016s

Next, we’ll ask gcc to auto-vectorize! Curiously enough, gcc didn’t autovectorize the version with the transposed loop, so I’ve gathered results for -O4 without autovectorization for non-transposed, -O4 with autovectorization for non-transposed, and -O4 transposed:

            -O4 with SSE autovectorization   -O4 with AVX autovectorization   -O4 without autovectorization
Straight    2.99                             1.527                            8.921
Transposed  n/a                              n/a                              1.565

And here are the commands and some example output:

$ # -O4, no auto-vectorization, straight
$ gcc -g -O4 -fopt-info-optall-optimized -fno-tree-vectorize -o mm mm.c
mm.c:9:9: note: Loop 7 distributed: split to 1 loops and 1 library calls.
$ time ./mm > /dev/null

real    0m8.921s
user    0m8.912s
sys     0m0.004s
$ # -O4, SSE auto-vectorization, straight
$ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -o mm mm.c
mm.c:9:9: note: Loop 7 distributed: split to 1 loops and 1 library calls.
mm.c:18:9: note: loop vectorized
mm.c:9:9: note: loop vectorized
$ # -O4, AVX auto-vectorization, straight
$ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm mm.c
$ # -O4, no auto-vectorization, transposed
$ gcc -g -O4 -fopt-info-optall-optimized -ftree-vectorize -o mmT mmT.c

Let’s add OpenMP into the mix:

            -O4 with AVX autovectorization and OpenMP   -O4 with OpenMP
Straight    1.18                                        5.568
Transposed  n/a                                         0.882

Just asking OpenMP to parallelize the i-loop makes the auto-vectorization break, but we can work around that by manually splitting the matrix multiplication into chunks. This is the full code:

#include <stdio.h>

#define N 1024

float matrix_a[N][N];
float matrix_b[N][N];
float result_matrix[N][N];

void chunked_mm(int chunk, int n_chunks) {
    for (int i = chunk*(N/n_chunks); i < (chunk+1)*(N/n_chunks); i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                result_matrix[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
        }
    }
}

int main(int argc, char **argv) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            matrix_a[i][j] = 0.1f;
            matrix_b[i][j] = 0.2f;
            result_matrix[i][j] = 0.0f;
        }
    }
    #pragma omp parallel for
    for (int chunk = 0; chunk < 4; chunk++) {
        chunked_mm(chunk, 4);
    }
 
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%f ", result_matrix[i][j]);
        }
        printf("\n");
    }
}

Compile and run:

$ gcc -g -O4 -fopenmp -fopt-info-optall-optimized -ftree-vectorize -mavx -o mm_autovectorized_openmp mm_autovectorized_openmp.c 
mm_autovectorized_openmp.c:11:9: note: loop vectorized
mm_autovectorized_openmp.c:11:9: note: loop vectorized
mm_autovectorized_openmp.c:21:9: note: Loop 4 distributed: split to 1 loops and 1 library calls.
mm_autovectorized_openmp.c:21:9: note: loop vectorized
mm_autovectorized_openmp.c:21:9: note: loop peeled for vectorization to enhance alignment
mm_autovectorized_openmp.c:21:9: note: loop turned into non-loop; it never loops.
mm_autovectorized_openmp.c:21:9: note: loop with 7 iterations completely unrolled
mm_autovectorized_openmp.c:19:5: note: loop turned into non-loop; it never loops.
mm_autovectorized_openmp.c:19:5: note: loop with 7 iterations completely unrolled
$ time ./mm_autovectorized_openmp > /dev/null

real    0m1.182s
user    0m3.036s
sys     0m0.012s

From the user time being larger than the real time, we can tell that this was indeed running in multiple threads. Enclosing the parallel loop with something like:

    for (int loop = 0; loop < 10; loop++) {
        #pragma omp parallel for
        for (int chunk = 0; chunk < 8; chunk++) {
            chunked_mm(chunk, 8);
        }
    }

allows us a better measurement of how much improvement we get.

time ./mm_autovectorized_openmp > /dev/null

real    0m6.649s
user    0m23.572s
sys     0m0.012s

Anyway, the transpose still beats gcc’s auto-vectorization of the non-transposed code. I wish I could get gcc to auto-vectorize the transposed code, but alas.

In the next post we’ll vectorize this ourselves!

Baby steps in SIMD (SSE/AVX)

In case you have never used SIMD instructions, this post explores the real basics. For example: what is SIMD? SIMD stands for “Single instruction, multiple data”. We’re computing more than one “math problem” with a single instruction. CPUs have had instructions to do this for a long time. If you remember the “Pentium MMX” hype – that was the first time SIMD instructions came to the x86 architecture.

However, with some trickery, you can do some limited SIMD without actually using these instructions. Let’s say we want to add 1 to two values at the same time. If we put these two values right next to each other in memory, we can interpret them as a single larger datatype. That’s not all that straightforward to understand, so here’s an example: you can interpret two 8-bit values right next to each other as one 16-bit value, right? To increment both values at the same time, you do value + 0x0101, which is just one assembly instruction. So with no special instructions at all, on a 64-bit platform you can increment eight 8-bit values at the same time by adding 0x0101010101010101.
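Here's a tiny C illustration of that trick (just to demonstrate the idea; the rest of this post uses proper SIMD registers instead):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    // Eight 8-bit counters packed into a single 64-bit integer.
    uint64_t packed = 0x0807060504030201ULL;
    // A single 64-bit addition increments all eight "lanes" at once.
    packed += 0x0101010101010101ULL;
    for (int i = 0; i < 8; i++)
        printf("lane %d: %llu\n", i, (unsigned long long)((packed >> (8 * i)) & 0xff));
    return 0;
}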

Okay, that feels pretty hacky and unreliable. Once you've incremented a value 256 times, it will have spilled over into the neighboring value! That's pretty bad.

So SSE provides 128-bit registers that allow you to comfortably work on e.g. four 32-bit floats at the same time, without any spilling. AVX provides 256-bit registers, and AVX512 provides 512-bit registers. Woo! Unfortunately AVX512 isn’t widely available yet.

So how do you use this? Let’s start with SSE, though you’ll see that updating code to use AVX or AVX512 instead is pretty easy. We’ll look at some very basic example code to add two vectors together.

#include <xmmintrin.h> // Need this in order to be able to use the SSE "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[4], b[4], result[4]; // a and b: input, result: output
    __m128 va, vb, vresult; // these vars will "point" to SIMD registers

    // initialize arrays (just {0,1,2,3})
    for (int i = 0; i < 4; i++) {
        a[i] = (float)i;
        b[i] = (float)i;
    }
    
    // load arrays into SIMD registers
    va = _mm_loadu_ps(a); // https://software.intel.com/en-us/node/524260
    vb = _mm_loadu_ps(b); // same

    // add them together
    vresult = _mm_add_ps(va, vb);

    // store contents of SIMD register into memory
    _mm_storeu_ps(result, vresult); // https://software.intel.com/en-us/node/524262

    // print out result
    for (int i = 0; i < 4; i++) {
        printf("%f\n", result[i]);
    }
}

That doesn’t seem so hard, does it? To access SIMD instructions without writing assembly code, we use something called “intrinsics”, which make the SIMD instructions look like regular C functions. Don’t worry though, these functions are inline and mostly just consist of the assembly instruction itself, so you probably won’t see any difference in performance.

In this example, we’re using three intrinsics, _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps. _mm_loadu_ps copies four float values from memory into the SSE register. We do this twice and are thus using two SSE registers. (We have 16 SSE registers available on 64-bit CPUs.) Then, we use _mm_add_ps to, in a single instruction, add the four floats in one register to the corresponding floats in the other register. (So we get a[0]+b[0], a[1]+b[1], a[2]+b[2], a[3]+b[3].) This is stored in a third SSE register. Using _mm_storeu_ps, we put the contents of this result register into the result float array.

We can compile and run this without any extra linking:

$ gcc -Wall -o sse_test sse_test.c 
$ ./sse_test
0.000000
2.000000
4.000000
6.000000

Wow, it worked!

_mm_loadu_ps/_mm_storeu_ps have sister functions without the ‘u’. These functions require memory alignment, which just means that the memory has to start at an address that is cleanly divisible by a certain number, which mostly increases performance (unless something unfortunate happens in the CPU caching department).

To get the alignment, we just declare the arrays like this:

    float a[4] __attribute__ ((aligned (16)));
    float b[4] __attribute__ ((aligned (16)));
    float result[4]  __attribute__ ((aligned (16)));

And then change all instances of _mm_loadu_ps/_mm_storeu_ps to _mm_load_ps/_mm_store_ps.  Intel’s documentation states that we need 16-byte alignment. And GCC’s syntax just looks a bit obscure. It’s described here: https://gcc.gnu.org/onlinedocs/gcc-6.4.0/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes

Cool, that’s SSE. What about AVX? Well, it turns out that we just need to change the included header file, the array sizes and the names of the intrinsics! (Note that you can include all intrinsics available by doing #include <x86intrin.h> instead.)

So here’s the same thing using AVX, and with aligned memory accesses:

#include <immintrin.h> // Need this in order to be able to use the AVX "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[8] __attribute__ ((aligned (32))); // Intel documentation states that we need 32-byte alignment to use _mm256_load_ps/_mm256_store_ps
    float b[8]  __attribute__ ((aligned (32))); // GCC's syntax makes this look harder than it is: https://gcc.gnu.org/onlinedocs/gcc-6.4.0/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes
    float result[8]  __attribute__ ((aligned (32)));
    __m256 va, vb, vresult; // __m256 is a 256-bit datatype, so it can hold 8 32-bit floats

    // initialize arrays (just {0,1,2,3,4,5,6,7})
    for (int i = 0; i < 8; i++) {
        a[i] = (float)i;
        b[i] = (float)i;
    }

    // load arrays into SIMD registers
    va = _mm256_load_ps(a); // https://software.intel.com/en-us/node/694474
    vb = _mm256_load_ps(b); // same

    // add them together
    vresult = _mm256_add_ps(va, vb); // https://software.intel.com/en-us/node/523406

    // store contents of SIMD register into memory
    _mm256_store_ps(result, vresult); // https://software.intel.com/en-us/node/694665

    // print out result
    for (int i = 0; i < 8; i++) {
        printf("%f\n", result[i]);
    }
    
    return 0;
}

So let’s compile that:

gcc -Wall -o avx256_test_aligned avx256_test_aligned.c 
avx256_test_aligned.c: In function ‘main’:
avx256_test_aligned.c:15:8: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
     va = _mm256_load_ps(a); // https://software.intel.com/en-us/node/694474
     ~~~^~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-linux-gnu/6/include/immintrin.h:41:0,
                 from avx256_test_aligned.c:1:
/usr/lib/gcc/x86_64-linux-gnu/6/include/avxintrin.h:852:1: error: inlining failed in call to always_inline ‘_mm256_store_ps’: target specific option mismatch
 _mm256_store_ps (float *__P, __m256 __A)
 ^~~~~~~~~~~~~~~
avx256_test_aligned.c:18:5: note: called from here
...

Oh no, what happened? It didn’t complain when we used SSE instructions (perhaps because all CPUs of the implicitly selected architecture (x86_64) support SSE, which was first introduced a very long time ago), but it’s complaining that our use of AVX instructions is causing a “target-specific option mismatch”. That’s a bit cryptic, but it means that our target (“vanilla” x86_64) does not support AVX instructions. To make this work, we need to supply the -mavx option:

$ gcc -Wall -mavx -o avx256_test_aligned avx256_test_aligned.c 
$ ./avx256_test_aligned 
0.000000
2.000000
4.000000
6.000000
8.000000
10.000000
12.000000
14.000000

Nice! BTW, for AVX512, we just need to change the 256s to 512s and the array index 8s to 16s, and supply -mavx512f to gcc.

Addendum: if you execute the AVX512 code on a CPU that doesn’t support it, you get this:

gcc -mavx512f -Wall -o avx_test_aligned avx_test_aligned.c 
./avx_test_aligned
Illegal instruction

Second addendum: if you use the aligned instructions without actually aligning your arrays, you get this:

$ ./avx_with_bad_alignment
Segmentation fault
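
A small aside: if you can’t (or don’t want to) guarantee alignment, the unaligned variants avoid the crash, at a possible performance cost. A tiny sketch:

float a[8]; // no alignment attribute needed here
__m256 va = _mm256_loadu_ps(a); // unaligned load: works for any address
_mm256_storeu_ps(a, va);        // unaligned store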

Let me know if you have any questions.

How to find out if an executable uses (e.g.) SIMD instructions (includes jq mini-tutorial!)

“Embarrassingly parallel” algorithms can often make use of SIMD instructions like those that came with the SSE and AVX extensions. In the Python world, numpy is a very popular package to work with arrays. One of the first things I wondered when I started using numpy was, “How optimized is numpy?” Some quick investigation shows that it’s multi-threaded, and some googling shows that it uses SIMD instructions: https://stackoverflow.com/questions/17109410/how-can-i-check-if-my-installed-numpy-is-compiled-with-sse-sse2-instruction-set

Now, it’s a bit tedious to grep for strings like VADDPD in the disassembly, so this post develops a nicer method.

For the impatient, here’s an unorthodox dirty one-liner (it creates a temporary file) that does this for you. It requires jq and internet access to download a database.

tempfile=`mktemp`; curl https://raw.githubusercontent.com/asmjit/asmdb/488b6d986964627f0b130b5265722dde8d93f11d/x86data.js | cpp | sed -n '/^{/,/^}/ { p }' | jq '[ .instructions | .[] | { (.[0]): .[4] } ] | add' > $tempfile; objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | awk '{print $2}' | grep -v : | sort | uniq | while read line; do echo -n "$line  "; output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" $tempfile); if [ -z "$output" ]; then echo; else echo $output; fi; done > output_test; rm $tempfile

Note that it is not able to distinguish between e.g. AVX and AVX512: when a mnemonic appears in several extensions, the merge keeps only one entry, so it always prints out the most advanced extension possible (e.g. AVX512 even if only the plain AVX form of an instruction is used). If you want something better, check out the Node.js version at the bottom of this post.

And around this point we start the explanation for the less impatient readers: first of all, we need a database of CPU instructions, and a simple Google query brings up this: https://github.com/asmjit/asmdb (The following discussion is based on commit 488b6d986964627f0b130b5265722dde8d93f11d.)

This project is in JavaScript, and the data file isn’t quite in JSON, so let’s do some minor preprocessing first to make our database easier to use:

cpp x86data.js | sed -n '/^{/,/^}/ { p }' > json

cpp is the C preprocessor; we only use it here to strip comments (the data contains both single-line and multi-line comments). The sed bit prints everything from the first line starting with a { up to the next line starting with a }, i.e. the whole data block.

Next, we need to get a disassembly. Here’s an example for numpy’s .so files:

objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | grep -P "^ +[0-9a-z]+:" | awk '{print $2}' | sort | uniq > numpy_instructions

This will get us all instruction mnemonics used. We get a file like this:

adc
add
addpd
addps
addsd
addss
and
andnpd
andnps

Let’s go back to our data. Today, we’ll use jq as our main tool to get the job done (though it will be many times slower than if we wrote a simple script that loads the hash once and re-uses it for every input instruction). If we just want the instructions block, we could do this:

jq '.instructions' json > instructions

However, this tool is a real Swiss army knife. We can use the familiar concept of piping, and we can wrap things in arrays or hashes just by enclosing expressions in [] or {}. Here’s an entire command to get an array of hashes containing only the instruction and the corresponding extension from the json file:

jq '[ .instructions | .[] | {instruction: .[0], extension: .[4] } ]' json

.[] iterates over the array inside the instructions key. Every item in the array is piped to a bit of jq code that creates a hash with an instruction and an extension key, which correspond to array elements 0 and 4 in the input data. So we get output like this:

[
  {
    "instruction": "aaa",
    "extension": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
  },
  {
    "instruction": "aas",
    "extension": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
  },
  .
  .
  .
]

Now we’re going to do something slightly naughty. The extension field isn’t the same for all instructions with the same mnemonic, as different opcodes with the same mnemonics have been added to the instruction set over time. However, we don’t need to be that precise IMO, so we’re just going to merge everything into an object like {“mnemonic”: “extension info”}. First, let’s get an array of hashes:

jq '[ .instructions | .[] | { (.[0]): .[4] } ]' json | head
[
  {
    "aaa": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
  },
  {
    "aas": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W"
  },
  {
    "aad": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U"
  },
  .
  .
  .
]

Now we just need to pipe this into the add filter to merge this array of hashes/objects into a single hash/object:

jq '[ .instructions | .[] | { (.[0]): .[4] } ] | add' json > mnem2ext.json

And the result is:

{
  "aaa": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W",
  "aas": "X86 Deprecated   OF=U SF=U ZF=U AF=W PF=U CF=W",
  "aad": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U",
  "aam": "X86 Deprecated   OF=U SF=W ZF=W AF=U PF=W CF=U",
  "adc": "X64              OF=W SF=W ZF=W AF=W PF=W CF=X",
  "add": "X64              OF=W SF=W ZF=W AF=W PF=W CF=W",
  "and": "X64              OF=0 SF=W ZF=W AF=U PF=W CF=0",
  "arpl": "X86 ZF=W",
  "bndcl": "MPX X64",
  ...
}

Wee! But how do we access the information in this file? Well, with jq of course (not efficient though):

while read line; do echo -n "$line  "; jq ".$line" mnem2ext.json; done < numpy_instructions

Here’s an extract from the output:

cvttpd2dq  "SSE2"
cvttps2dq  "SSE2"
cvttsd2si  "SSE2 X64"
cvttss2si  "SSE X64"
cwde  "ANY"
div  "X64              OF=U SF=U ZF=U AF=U PF=U CF=U"
divpd  "SSE2"
divps  "SSE"
divsd  "SSE2"
divss  "SSE"
fabs  "FPU              C0=U C1=0 C2=U C3=U"
fadd  "FPU              C0=U C1=W C2=U C3=U"

Such a nice mix of instructions. <3 We have a few problems though. Here are some instructions that couldn’t be resolved:

cmova  null
cmpneqss  null
ja  null
rep  null
seta  null
vcmplepd

A closer look at our database reveals that some instructions have slashes in them, like “cmova/cmovnbe”. These are aliases, so we should be able to detect them as well. jq sort of allows searching for keys using a regex, though the syntax isn’t easy, and the bash escaping makes things a bit worse:

while read line; do echo -n "$line  "; jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\")))" mnem2ext.json; done < numpy_instructions > output

Things have gotten a bit slower again, and the rest of our output looks a bit different too:

xor  {
  "xor": "X64              OF=0 SF=W ZF=W AF=U PF=W CF=0"
}
xorpd  {
  "xorpd": "SSE2"
}
xorps  {
  "xorps": "SSE"
}
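
In case the regex gymnastics are hard to follow, here’s what with_entries plus match do on a tiny, made-up object (not from the real database):

echo '{ "cmova/cmovnbe": "CMOV", "add": "X64" }' | jq 'with_entries(select(.key | match("(^cmova/|/cmova$)")))'
{
  "cmova/cmovnbe": "CMOV"
}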

We can’t get rid of the echo, otherwise we’ll have no way to tell if jq is finding the mnemonic or not. So we’ll use jq to fix the format. Here’s an easy example:

echo '{ "b": "c" }' | jq 'to_entries[]'
[
  {
    "key": "b",
    "value": "c"
  }
]
echo '{ "b": "c" }' | jq 'to_entries | .[] | .value'
"c"

Here, we’re just converting the hash into an array of entries (similar to what with_entries does internally), and selecting only the .value of each entry. We can just pipe this within jq:

while read line; do echo -n "$line  "; jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" mnem2ext.json; done < numpy_instructions > output

However, we don’t get a newline when we didn’t find an instruction, so we work around this in bash:

while read line; do echo -n "$line  "; output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" mnem2ext.json); if [ -z "$output" ]; then echo; else echo $output; fi; done < numpy_instructions > output

That leaves mostly pseudo-instructions. The following pseudo-instructions are not included in this database but would indicate SSE2: CMPEQPD, CMPLTPD, CMPLEPD, CMPUNORDPD, CMPNEQPD, CMPNLTPD, CMPNLEPD, CMPORDPD. These all belong to the CMPPD instruction introduced in SSE2, as far as I can tell. (https://www.felixcloutier.com/x86/CMPPD.html#tbl-3-2) It would make sense to have them in the database, but I think I’ll leave well enough alone for now.
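
If you did want to resolve these pseudo-ops anyway, one option is to normalize them to their base mnemonic before the lookup. Here’s a sketch that assumes the CMPPS/CMPPD/CMPSS/CMPSD families all follow the cmp<condition><suffix> naming (the AVX vcmp* pseudo-ops like vcmplepd above would need a similar, longer condition list):

sed -E 's/^cmp(eq|lt|le|unord|neq|nlt|nle|ord)(pd|ps|sd|ss)$/cmp\2/' numpy_instructions > numpy_instructions_normalized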

Anyway, doing something like awk '{print $2}' output | sed s/\"//g | sort | uniq shows that my numpy version may use instructions from the following sets:

ANY
AVX
AVX2
AVX512_BW
AVX512_DQ
AVX512_F
CMOV
FPU
FPU_POP
FPU_PUSH
I486
MMX2
SSE
SSE2
SSE4_1
X64

Well, that’s great. Let’s package this up into a shell script so it’s a bit easier to use. Just put it in the same directory as the merged JSON we built above (saved as cpu_extensions.min.json) and it’ll work.

#!/bin/bash

json_file="$(dirname "$0")/cpu_extensions.min.json"
objdump --no-show-raw-insn -M intel -d "$@" | grep -P "^ +[0-9a-z]+:" | awk '{print $2}' | sort | uniq | while read line; do
    echo -n "$line  "
    output=$(jq "with_entries(select(.key | match(\"(^$line\\/|\\/$line\$|$line\\/|^$line\$)\"))) | to_entries | .[] | .value" $json_file);
    if [ -z "$output" ];
        then echo;
    else
        echo $output | sed -e 's/"//g' -e 's/ .*//g'
    fi
done
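
Assuming you saved it as, say, show_cpu_extensions.sh (the name here is made up) next to cpu_extensions.min.json, usage looks like this:

chmod +x show_cpu_extensions.sh
./show_cpu_extensions.sh /usr/lib/python2.7/dist-packages/numpy/core/*.so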

Also, here’s a more efficient (O(n)) implementation in Node.js. It gets away with much less pre-processing; all you have to do is:

sed -n '/^{/,/^}/ { p }' x86data.js > cpu_extensions.json

However, it doesn’t execute objdump for you, so you have to call it like this:

show_cpu_extensions.js <(objdump --no-show-raw-insn -M intel -d /usr/lib/python2.7/dist-packages/numpy/core/*.so | grep -P "^ +[0-9a-z]+:" | awk '{print $2}' | sort | uniq)

I’ve also made it display all possible extensions.

#!/usr/bin/nodejs

var database_file;
var disassembly_file;

if (process.argv.length == 3) {
    // Use default database
    database_file = __dirname + "/cpu_extensions.json";
    disassembly_file = process.argv[2];
} else if (process.argv.length == 4) {
    database_file = process.argv[2];
    disassembly_file = process.argv[3];
} else {
    console.log("Usage: " + process.argv[1] + " [database] disassembly");
    console.log(process.argv);
    process.exit(1);
}

var fs = require("fs");
var readline = require("readline"); 
var mnem2ext = {};

var obj = JSON.parse(fs.readFileSync(database_file, "utf8"));
obj["instructions"].map(function(v, i) {
    var ext = v[4].replace(/ +[A-Z]+=.*/, "").replace(/  +.*/, ""); // strip the flags part (e.g. "OF=W SF=W ...") and anything after runs of spaces, keeping just the extension name(s)

    if (v[0].match(/\//)) {
        v[0].split("/").forEach(function(v, i) {
            if (!mnem2ext[v]) {
                mnem2ext[v] = {};
            }
            mnem2ext[v][ext] = true;
        });
    } else {
        if (!mnem2ext[v[0]]) {
            mnem2ext[v[0]] = {};
        }
        mnem2ext[v[0]][ext] = true;
    }
});

var lineReader = readline.createInterface({input: fs.createReadStream(disassembly_file)});
lineReader.on("line", function(line) {
    console.log(line + ": " + (mnem2ext[line] ? Object.keys(mnem2ext[line]).join(", ") : undefined));
});