matrix multiplication speed depends on silly things


matrix multiplication speed depends on silly things



I wrote a program for fast matrix multiplication. To use CPU cache to the maximal extent, it divides matrix to 10*10 cells, and multiplies each cell separately (increasing cell size to 20*20 or 50*50 does not significantly change time).



It turns out that the speed significantly depends on whether matrix size and cell size is know in advance or not.



The program is:


#include <cmath>
#include <cstdlib>
#include <iostream>

using namespace std;

#define forall(i,n) for(int i=0; i<(int)(n); i++)

inline void Load(int N, int N2, float* x2, float* x, int iStart, int jStart) {
int start = iStart * N + jStart;
forall (i, N2)
forall (j, N2)
x2[i * N2 + j] = x[start + i * N + j];
}

inline void Add(int N, int N2, float* x, float* x2, int iStart, int jStart) {
int start = iStart * N + jStart;
forall (i, N2)
forall (j, N2)
x[start + i * N + j] += x2[i * N2 + j];
}

inline void Mul(int N, float* z, float* x, float* y) {
forall (i, N)
forall (j, N) {
double sum = 0;
forall (k, N)
sum += x[i*N+k] * y[k*N+j];
z[i*N+j] = sum;
}
}

inline double RandReal() {return random()/((double)RAND_MAX+1);}

int main(int argc, char** argv) {
#if VAR==3
const int N = atoi(argv[1]), N2 = atoi(argv[2]);
#elif VAR==2
const int N = 2000, N2 = atoi(argv[2]);
#elif VAR==1
const int N = atoi(argv[1]), N2 = 10;
#elif VAR==0
const int N = 2000, N2 = 10;
#else
#error "Bad VAR"
#endif
cout << "VAR=" << VAR << " N=" << N << " N2=" << N2 << endl;
float x[N*N], y[N*N];
forall (i, N)
forall (j, N) {
x[i*N+j] = RandReal();
y[i*N+j] = RandReal();
}
float z[N*N];
forall (i, N)
forall (j, N)
z[i*N+j] = 0;
for (int i1 = 0; i1 < N; i1 += N2) {
float x2[N2*N2], y2[N2*N2], z2[N2*N2];
for (int j1 = 0; j1 < N; j1 += N2) {
Load(N, N2, x2, x, i1, j1);
for (int k1 = 0; k1 < N; k1 += N2) {
Load(N, N2, y2, y, j1, k1);
Mul(N2, z2, x2, y2);
Add(N, N2, z, z2, i1, k1);
}
}
}

double val = 0, val2 = 0;
forall (i, N)
forall (j, N)
val += z[i*N+j], val2 += z[i*N+j]*(i+j);
cout << "val=" << val << " val2=" << val2 << endl;
}



Now the execution times:


$ for a in 0 1 2 3; do g++ -DVAR=$a -O3 -Wall -o mat mat.cpp; time ./mat 2000 10; done
VAR=0 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real 0m8.127s
user 0m8.108s
sys 0m0.020s
VAR=1 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real 0m3.304s
user 0m3.292s
sys 0m0.012s
VAR=2 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real 0m25.395s
user 0m25.388s
sys 0m0.008s
VAR=3 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real 0m25.515s
user 0m25.495s
sys 0m0.016s



In simple terms:



Why is it? I use g++ 5.4.0 .



inline does not play any role, if we remove it, the results will be the same.


inline





Have you looked at the difference in the generated assembly for each case?
– Retired Ninja
Jun 30 at 5:37





@RetiredNinja - I don't know assembler
– user31264
Jun 30 at 5:47





If things are known ahead of time, the compiler can do really cool things to the loops, possibly even compiling the loop out. Also worth noting that float z[N*N]; with N=2000. will require 16,000,000 bytes of stack. Something you probably won't have if you haven't increased the stack size.
– user4581301
Jun 30 at 6:10


float z[N*N];


N





I cannot reproduce all of your results. Here, VAR=0/1 runs in 2.7sec, VAR=2/3 runs in ~7sec, so there is no difference whether N is known/unknown.
– geza
Jun 30 at 8:52





the main difference between N2 is known/unknown is that with N2 known, the compiler puts highly unrolled loops, so that kinda explains the difference
– geza
Jun 30 at 8:58




1 Answer
1



Introductory note: Much of this post has been re-written, so some of the comments below won't make too much sense anymore. Please look behind the edit for details, if you care. So...



tl;dr



I agree with @user4581301 - the more the compiler knows ahead of time, the more it can do for you in terms of optimising your code.



But you need to profile this code - wall clock times will only take you so far. I don't know a whole lot about profilers for gcc (I have a good one for MSVC) but you could try your luck here.



It also pays (as @RetiredNinja said right off the bat) to try to learn some assembler, using Godbolt as a tool, especially if you want to understand a slowdown as dramatic as this.



Now having said all that, your timings make no sense to me so something strange is going on on your system. So I ran some tests of my own, and my results differ markedly from yours. I ran some of these tests on MSVC (because I have such wonderful profiling tools there) and some on gcc on the Mac (although I think that is actually secretly clang under the hood). I have no linux box, sorry.



Firstly, let's deal with the issue of allocating such large objects on the stack. This is obviously not wise, and I can't do it at all on MSVC since it doesn't support variable length arrays, but my tests on the Mac showed that this made no difference to the timings once I had increased ulimit to get it to work at all (see here). So I think we can discount that as a factor, as you yourself say in the comments.


ulimit



So now let's look at the timings I got on the Mac:


VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
user 0m10.813s

VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known)
user 0m11.008s

VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown)
user 0m12.714s

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
user 0m12.778s

VAR=0 USE_STACK=1 N=2000 (known) N2=10 (known)
user 0m10.617s

VAR=1 USE_STACK=1 N=2000 (unknown) N2=10 (known)
user 0m10.987s

VAR=2 USE_STACK=1 N=2000 (known) N2=10 (unknown)
user 0m12.653s

VAR=3 USE_STACK=1 N=2000 (unknown) N2=10 (unknown)
user 0m12.673s



Not much to see there; let's move on to what I observed on MSVC (where I can profile):


VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
Elapsed: 0:00:06.89

VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known)
Elapsed: 0:00:06.86

VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown)
Elapsed: 0:00:10.24

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed: 0:00:10.39



Now we have something we can get our teeth into. Like @geza observed, the code takes longer to run when N2 isn't known, which is entirely in line with what one might expect since this is where the hot loops will be, and it's much more likely that the compiler will unroll such a loop when it knows that it's actually comprised of a small, known number of iterations.


N2



So let's get some information from the profiler. It tells me that the hot loop (by quite a long way) is the inner loop in Mul():


Mul()


inline void Mul(int N, float* z, float* x, float* y) {
forall (i, N)
forall (j, N) {
double sum = 0;



=> forall (k, N)
=> sum += x[i*N+k] * y[kN+j];
z[i
N+j] = sum;
}
}



Again, I can't say this surprises me much, and when I look at the code I can see that the loop is not unrolled at all (loop setup code omitted for brevity):


$1:
movss xmm0,dword ptr [r9+rsi*4]
mulss xmm0,dword ptr [r8+4]
movss xmm1,dword ptr [r9+r15*4]
mulss xmm1,dword ptr [r8]
cvtps2pd xmm2,xmm0
cvtps2pd xmm0,xmm1
movss xmm1,dword ptr [r8+8]
mulss xmm1,dword ptr [r9]
addsd xmm0,xmm3
addsd xmm2,xmm0
cvtps2pd xmm0,xmm1
movss xmm1,dword ptr [r9+r14*4]
movaps xmm3,xmm2
mulss xmm1,dword ptr [r8+0Ch]
add r9,rbp
add r8,10h
addsd xmm3,xmm0
cvtps2pd xmm0,xmm1
addsd xmm3,xmm0
sub rcx,1
jne $1



Now it doesn't look that there would be any savings to be made there by unrolling that loop since looping around is going to be cheap compared to executing all the rest of the code in there, but if you look at the disassembly of the same loop when N2 is known, you get a surprise:


N2


movss xmm0,dword ptr [rax-8]
mulss xmm0,dword ptr [rcx-50h]
cvtps2pd xmm2,xmm0
movss xmm0,dword ptr [rcx-28h]
mulss xmm0,dword ptr [rax-4]
addsd xmm2,xmm7
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx]
mulss xmm0,dword ptr [rax]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+28h]
mulss xmm0,dword ptr [rax+4]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+50h]
mulss xmm0,dword ptr [rax+8]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+78h]
mulss xmm0,dword ptr [rax+0Ch]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+0A0h]
mulss xmm0,dword ptr [rax+10h]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+0C8h]
mulss xmm0,dword ptr [rax+14h]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+0F0h]
mulss xmm0,dword ptr [rax+18h]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+118h]
mulss xmm0,dword ptr [rax+1Ch]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
addsd xmm2,xmm1
cvtpd2ps xmm0,xmm2
movss dword ptr [rdx+rcx],xmm0



Now there is no loop, and the number of instructions that will be executed overall is clearly reduced. Perhaps MS are not such a bunch of dumb clucks after all.



So finally, as an exercise, let's just quickly unroll that loop manually and see what timings we get (I didn't look at the generated code):


#define UNROLL_LOOP 1

inline void Mul(int N, float* z, float* x, float* y) {
forall (i, N)
forall (j, N) {
double sum = 0;
#if UNROLL_LOOP
assert (N == 10);
sum += x[i*N] * y[0*N+j];
sum += x[i*N+1] * y[1*N+j];
sum += x[i*N+2] * y[2*N+j];
sum += x[i*N+3] * y[3*N+j];
sum += x[i*N+4] * y[4*N+j];
sum += x[i*N+5] * y[5*N+j];
sum += x[i*N+6] * y[6*N+j];
sum += x[i*N+7] * y[7*N+j];
sum += x[i*N+8] * y[8*N+j];
sum += x[i*N+9] * y[9*N+j];
#else
forall (k, N)
sum += x[i*N+k] * y[k*N+j];
#endif
z[i*N+j] = sum;
}
}



And when I did that, I got:


VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed: 0:00:07.48 (compared with 10.39 / 6.86, not bad, more may be possible).



So that's the process you need to go through to analyse performance issues like this, and you need good tools. I don't know what's happening in your case, because I can't reproduce the issue, but loop unrolling is (as expected) the primary factor on MSVC when (small) loop counts are unknown.



The test code I used is here in case anyone wants to refer to it. I think you owe me a vote, OP.



Edit:



Played around a little bit at Wandbox with gcc 9.0.0. Timings (these are rather slower and a bit more imprecise since we are running on a shared box, or, more likely, in a VM):



VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
Elapsed time = ~8sec



VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed time = ~15.5sec



VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown), loop unrolled
Elapsed time = ~ 13.5sec



So that needs a bit more investigation - with a profiler and by looking at the generated code - and is still a million miles away from what the OP is getting.



Live demo - you can play around with that yourself if you want to try different things, OP.





I knew it! I knew from experience that somebody who doesn't know what he is talking about, will still try to answer the question. "For all I know, you're spending half your time in RandReal()." - then you know NOTHING, I repeat NOTHING. All the RandReal() stuff is O(N^2), while the multiplication is O(N^3). 8000000 calls to RandReal(), each one takes 5-20 nanoseconds. To be absolutely sure (maybe you know something which I don't?), I replaced calls to RandReal() by just 0.5 , and it did not change the time significantly (still 8.1, 3.3, and 25.4 seconds). Downvote.
– user31264
Jun 30 at 7:29






As for suggestions to do manual loop unrolling, (1) g++ can do loop unrolling itself (2) it misses the point. Before doing manual optimization, I want to know what is going on, that's was the point of my question, not how could I improve the speed by blind trials and errors.
– user31264
Jun 30 at 7:32






Wow, where did all that come from? I think you rather missed the point there. Profile the code. And maybe take a break / get some sleep.
– Paul Sanders
Jun 30 at 7:32






Oh, yes, I nearly forgot in all the excitement. If you really want know what the compiler is doing with your code, gaining some rudimentary knowledge of assembler would be very valuable. Then - after profiling - you can easily check if gcc has unrolled your most critical loop(s) or not (OK, well, clearly here your timings show it hasn't) and then decide for yourself what to do. No blind trial and error there. You can inspect code generated by the compiler quickly and easily at Godbolt. Still think I don't know what I'm doing? Honestly?
– Paul Sanders
Jun 30 at 7:42





[...Time passes...] Edited my answer, I do accept it could have used a little improvement. You should consider voting back up - you have made yourself look a little foolish here. ... Oh you did. Cool, thank you. Let's hope we can get over getting off on the wrong foot. I hate to be at odds with people. Could even vote up :)
– Paul Sanders
Jun 30 at 8:10







By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Extract Id from Twitch Clip URL

Why are these constructs (using ++) undefined behavior in C?

I'm Still Waiting (Diana Ross song)