matrix multiplication speed depends on silly things

I wrote a program for fast matrix multiplication. To use CPU cache to the maximal extent, it divides matrix to 10*10 cells, and multiplies each cell separately (increasing cell size to 20*20 or 50*50 does not significantly change time).

It turns out that the speed significantly depends on whether matrix size and cell size is know in advance or not.

The program is:

#include <cmath> #include <cstdlib> #include <iostream> using namespace std; #define forall(i,n) for(int i=0; i<(int)(n); i++) inline void Load(int N, int N2, float* x2, float* x, int iStart, int jStart) { int start = iStart * N + jStart; forall (i, N2) forall (j, N2) x2[i * N2 + j] = x[start + i * N + j]; } inline void Add(int N, int N2, float* x, float* x2, int iStart, int jStart) { int start = iStart * N + jStart; forall (i, N2) forall (j, N2) x[start + i * N + j] += x2[i * N2 + j]; } inline void Mul(int N, float* z, float* x, float* y) { forall (i, N) forall (j, N) { double sum = 0; forall (k, N) sum += x[i*N+k] * y[k*N+j]; z[i*N+j] = sum; } } inline double RandReal() {return random()/((double)RAND_MAX+1);} int main(int argc, char** argv) { #if VAR==3 const int N = atoi(argv[1]), N2 = atoi(argv[2]); #elif VAR==2 const int N = 2000, N2 = atoi(argv[2]); #elif VAR==1 const int N = atoi(argv[1]), N2 = 10; #elif VAR==0 const int N = 2000, N2 = 10; #else #error "Bad VAR" #endif cout << "VAR=" << VAR << " N=" << N << " N2=" << N2 << endl; float x[N*N], y[N*N]; forall (i, N) forall (j, N) { x[i*N+j] = RandReal(); y[i*N+j] = RandReal(); } float z[N*N]; forall (i, N) forall (j, N) z[i*N+j] = 0; for (int i1 = 0; i1 < N; i1 += N2) { float x2[N2*N2], y2[N2*N2], z2[N2*N2]; for (int j1 = 0; j1 < N; j1 += N2) { Load(N, N2, x2, x, i1, j1); for (int k1 = 0; k1 < N; k1 += N2) { Load(N, N2, y2, y, j1, k1); Mul(N2, z2, x2, y2); Add(N, N2, z, z2, i1, k1); } } } double val = 0, val2 = 0; forall (i, N) forall (j, N) val += z[i*N+j], val2 += z[i*N+j]*(i+j); cout << "val=" << val << " val2=" << val2 << endl; }

Now the execution times:

$ for a in 0 1 2 3; do g++ -DVAR=$a -O3 -Wall -o mat mat.cpp; time ./mat 2000 10; done VAR=0 N=2000 N2=10 val=2.00039e+09 val2=3.99867e+12 real 0m8.127s user 0m8.108s sys 0m0.020s VAR=1 N=2000 N2=10 val=2.00039e+09 val2=3.99867e+12 real 0m3.304s user 0m3.292s sys 0m0.012s VAR=2 N=2000 N2=10 val=2.00039e+09 val2=3.99867e+12 real 0m25.395s user 0m25.388s sys 0m0.008s VAR=3 N=2000 N2=10 val=2.00039e+09 val2=3.99867e+12 real 0m25.515s user 0m25.495s sys 0m0.016s

In simple terms:

Why is it? I use g++ 5.4.0 .

inline does not play any role, if we remove it, the results will be the same.

inline

Have you looked at the difference in the generated assembly for each case?
– Retired Ninja
Jun 30 at 5:37

@RetiredNinja - I don't know assembler
– user31264
Jun 30 at 5:47

If things are known ahead of time, the compiler can do really cool things to the loops, possibly even compiling the loop out. Also worth noting that float z[N*N]; with N=2000. will require 16,000,000 bytes of stack. Something you probably won't have if you haven't increased the stack size.
– user4581301
Jun 30 at 6:10

float z[N*N];

N

I cannot reproduce all of your results. Here, VAR=0/1 runs in 2.7sec, VAR=2/3 runs in ~7sec, so there is no difference whether N is known/unknown.
– geza
Jun 30 at 8:52

the main difference between N2 is known/unknown is that with N2 known, the compiler puts highly unrolled loops, so that kinda explains the difference
– geza
Jun 30 at 8:58

1 Answer
1

Introductory note: Much of this post has been re-written, so some of the comments below won't make too much sense anymore. Please look behind the edit for details, if you care. So...

tl;dr

I agree with @user4581301 - the more the compiler knows ahead of time, the more it can do for you in terms of optimising your code.

But you need to profile this code - wall clock times will only take you so far. I don't know a whole lot about profilers for gcc (I have a good one for MSVC) but you could try your luck here.

It also pays (as @RetiredNinja said right off the bat) to try to learn some assembler, using Godbolt as a tool, especially if you want to understand a slowdown as dramatic as this.

Now having said all that, your timings make no sense to me so something strange is going on on your system. So I ran some tests of my own, and my results differ markedly from yours. I ran some of these tests on MSVC (because I have such wonderful profiling tools there) and some on gcc on the Mac (although I think that is actually secretly clang under the hood). I have no linux box, sorry.

Firstly, let's deal with the issue of allocating such large objects on the stack. This is obviously not wise, and I can't do it at all on MSVC since it doesn't support variable length arrays, but my tests on the Mac showed that this made no difference to the timings once I had increased ulimit to get it to work at all (see here). So I think we can discount that as a factor, as you yourself say in the comments.

ulimit

So now let's look at the timings I got on the Mac:

VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known) user 0m10.813s VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known) user 0m11.008s VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown) user 0m12.714s VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown) user 0m12.778s VAR=0 USE_STACK=1 N=2000 (known) N2=10 (known) user 0m10.617s VAR=1 USE_STACK=1 N=2000 (unknown) N2=10 (known) user 0m10.987s VAR=2 USE_STACK=1 N=2000 (known) N2=10 (unknown) user 0m12.653s VAR=3 USE_STACK=1 N=2000 (unknown) N2=10 (unknown) user 0m12.673s

Not much to see there; let's move on to what I observed on MSVC (where I can profile):

VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known) Elapsed: 0:00:06.89 VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known) Elapsed: 0:00:06.86 VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown) Elapsed: 0:00:10.24 VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown) Elapsed: 0:00:10.39

Now we have something we can get our teeth into. Like @geza observed, the code takes longer to run when N2 isn't known, which is entirely in line with what one might expect since this is where the hot loops will be, and it's much more likely that the compiler will unroll such a loop when it knows that it's actually comprised of a small, known number of iterations.

N2

So let's get some information from the profiler. It tells me that the hot loop (by quite a long way) is the inner loop in Mul():

Mul()

inline void Mul(int N, float* z, float* x, float* y) { forall (i, N) forall (j, N) { double sum = 0;

=> forall (k, N)
=> sum += x[i*N+k] * y[kN+j];
z[iN+j] = sum;
}
}

Again, I can't say this surprises me much, and when I look at the code I can see that the loop is not unrolled at all (loop setup code omitted for brevity):

$1: movss xmm0,dword ptr [r9+rsi*4] mulss xmm0,dword ptr [r8+4] movss xmm1,dword ptr [r9+r15*4] mulss xmm1,dword ptr [r8] cvtps2pd xmm2,xmm0 cvtps2pd xmm0,xmm1 movss xmm1,dword ptr [r8+8] mulss xmm1,dword ptr [r9] addsd xmm0,xmm3 addsd xmm2,xmm0 cvtps2pd xmm0,xmm1 movss xmm1,dword ptr [r9+r14*4] movaps xmm3,xmm2 mulss xmm1,dword ptr [r8+0Ch] add r9,rbp add r8,10h addsd xmm3,xmm0 cvtps2pd xmm0,xmm1 addsd xmm3,xmm0 sub rcx,1 jne $1

Now it doesn't look that there would be any savings to be made there by unrolling that loop since looping around is going to be cheap compared to executing all the rest of the code in there, but if you look at the disassembly of the same loop when N2 is known, you get a surprise:

N2

movss xmm0,dword ptr [rax-8] mulss xmm0,dword ptr [rcx-50h] cvtps2pd xmm2,xmm0 movss xmm0,dword ptr [rcx-28h] mulss xmm0,dword ptr [rax-4] addsd xmm2,xmm7 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx] mulss xmm0,dword ptr [rax] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx+28h] mulss xmm0,dword ptr [rax+4] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx+50h] mulss xmm0,dword ptr [rax+8] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx+78h] mulss xmm0,dword ptr [rax+0Ch] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx+0A0h] mulss xmm0,dword ptr [rax+10h] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx+0C8h] mulss xmm0,dword ptr [rax+14h] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx+0F0h] mulss xmm0,dword ptr [rax+18h] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 movss xmm0,dword ptr [rcx+118h] mulss xmm0,dword ptr [rax+1Ch] addsd xmm2,xmm1 cvtps2pd xmm1,xmm0 addsd xmm2,xmm1 cvtpd2ps xmm0,xmm2 movss dword ptr [rdx+rcx],xmm0

Now there is no loop, and the number of instructions that will be executed overall is clearly reduced. Perhaps MS are not such a bunch of dumb clucks after all.

So finally, as an exercise, let's just quickly unroll that loop manually and see what timings we get (I didn't look at the generated code):

#define UNROLL_LOOP 1 inline void Mul(int N, float* z, float* x, float* y) { forall (i, N) forall (j, N) { double sum = 0; #if UNROLL_LOOP assert (N == 10); sum += x[i*N] * y[0*N+j]; sum += x[i*N+1] * y[1*N+j]; sum += x[i*N+2] * y[2*N+j]; sum += x[i*N+3] * y[3*N+j]; sum += x[i*N+4] * y[4*N+j]; sum += x[i*N+5] * y[5*N+j]; sum += x[i*N+6] * y[6*N+j]; sum += x[i*N+7] * y[7*N+j]; sum += x[i*N+8] * y[8*N+j]; sum += x[i*N+9] * y[9*N+j]; #else forall (k, N) sum += x[i*N+k] * y[k*N+j]; #endif z[i*N+j] = sum; } }

And when I did that, I got:

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown) Elapsed: 0:00:07.48 (compared with 10.39 / 6.86, not bad, more may be possible).

So that's the process you need to go through to analyse performance issues like this, and you need good tools. I don't know what's happening in your case, because I can't reproduce the issue, but loop unrolling is (as expected) the primary factor on MSVC when (small) loop counts are unknown.

The test code I used is here in case anyone wants to refer to it. I think you owe me a vote, OP.

Edit:

Played around a little bit at Wandbox with gcc 9.0.0. Timings (these are rather slower and a bit more imprecise since we are running on a shared box, or, more likely, in a VM):

VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
Elapsed time = ~8sec

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed time = ~15.5sec

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown), loop unrolled
Elapsed time = ~ 13.5sec

So that needs a bit more investigation - with a profiler and by looking at the generated code - and is still a million miles away from what the OP is getting.

Live demo - you can play around with that yourself if you want to try different things, OP.

I knew it! I knew from experience that somebody who doesn't know what he is talking about, will still try to answer the question. "For all I know, you're spending half your time in RandReal()." - then you know NOTHING, I repeat NOTHING. All the RandReal() stuff is O(N^2), while the multiplication is O(N^3). 8000000 calls to RandReal(), each one takes 5-20 nanoseconds. To be absolutely sure (maybe you know something which I don't?), I replaced calls to RandReal() by just 0.5 , and it did not change the time significantly (still 8.1, 3.3, and 25.4 seconds). Downvote.
– user31264
Jun 30 at 7:29

As for suggestions to do manual loop unrolling, (1) g++ can do loop unrolling itself (2) it misses the point. Before doing manual optimization, I want to know what is going on, that's was the point of my question, not how could I improve the speed by blind trials and errors.
– user31264
Jun 30 at 7:32

Wow, where did all that come from? I think you rather missed the point there. Profile the code. And maybe take a break / get some sleep.
– Paul Sanders
Jun 30 at 7:32

Oh, yes, I nearly forgot in all the excitement. If you really want know what the compiler is doing with your code, gaining some rudimentary knowledge of assembler would be very valuable. Then - after profiling - you can easily check if gcc has unrolled your most critical loop(s) or not (OK, well, clearly here your timings show it hasn't) and then decide for yourself what to do. No blind trial and error there. You can inspect code generated by the compiler quickly and easily at Godbolt. Still think I don't know what I'm doing? Honestly?
– Paul Sanders
Jun 30 at 7:42

[...Time passes...] Edited my answer, I do accept it could have used a little improvement. You should consider voting back up - you have made yourself look a little foolish here. ... Oh you did. Cool, thank you. Let's hope we can get over getting off on the wrong foot. I hate to be at odds with people. Could even vote up :)
– Paul Sanders
Jun 30 at 8:10

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Rtyjn