I am doing a comparison between the Fortran and C programming languages for matrix operations.
This time I have written two files (matmul.c and matmul.f90) that both do the same thing, i.e. multiply two square matrices A and B of size N by N.
In both files I have implemented loop blocking and paid attention to efficient memory access for better performance. However, Fortran is significantly faster at multiplying A and B. When N is 1024, Fortran is about three times faster: C takes about 0.57 seconds to complete the multiplication, while Fortran takes about 0.18 seconds.
Can anyone take a look at my code and tell me what can be done to reduce this difference?
I am running Windows 10, with an Intel i7-6700 CPU and 16 GB of RAM. For compilation I am using MinGW 14.2.
To compile matmul.f90 I used this command in cmd:
gfortran -O3 -march=native -funroll-all-loops matmul.f90
To compile matmul.c I used this command in cmd:
gcc -O3 -march=native -funroll-all-loops matmul.c
Here is the Fortran code:
program matmul_fortran
    implicit none
    integer, parameter :: N = 1024          ! matrix size
    integer, parameter :: BLOCK_SIZE = 32   ! block size
    real(8), dimension(N, N) :: A, B, C
    integer :: i, j, k, i_block, j_block, k_block
    real(8) :: start, finish, temp

    ! initialize matrices A and B with random values
    call random_seed()
    call random_number(A)
    call random_number(B)
    C = 0.0 ! set matrix C to zero values

    call cpu_time(start)
    ! multiplication
    do i_block = 1, N, BLOCK_SIZE
        do j_block = 1, N, BLOCK_SIZE
            do k_block = 1, N, BLOCK_SIZE
                do i = i_block, min(i_block + BLOCK_SIZE - 1, N)
                    do j = j_block, min(j_block + BLOCK_SIZE - 1, N)
                        do k = k_block, min(k_block + BLOCK_SIZE - 1, N)
                            C(k, i) = C(k, i) + A(k, j)*B(j, i)
                        end do
                    end do
                end do
            end do
        end do
    end do
    call cpu_time(finish)

    print *, "Fortran Matrix Multiplication Time: ", finish - start, " seconds"
end program matmul_fortran
Here is the C code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024          // matrix size
#define BLOCK_SIZE 32   // block size

// function to initialize matrices with random values
void initialize_matrix(double *matrix) {
    for (int i = 0; i < N * N; i++) {
        matrix[i] = (double)rand() / RAND_MAX; // Random values between 0 and 1
    }
}

int main() {
    double *A, *B, *C;
    clock_t start, end;

    A = (double *)malloc(N * N * sizeof(double));
    B = (double *)malloc(N * N * sizeof(double));
    C = (double *)malloc(N * N * sizeof(double));

    // set matrix C to zero values
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i * N + j] = 0.0;

    // Initialize matrices
    srand(time(NULL));
    initialize_matrix(A);
    initialize_matrix(B);

    start = clock();
    // multiplication
    for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
        for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
            for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                for (int i = i_block; i < i_block + BLOCK_SIZE && i < N; i++) {
                    for (int j = j_block; j < j_block + BLOCK_SIZE && j < N; j++) {
                        for (int k = k_block; k < k_block + BLOCK_SIZE && k < N; k++) {
                            C[i*N + k] += A[j*N + k]*B[i*N + j];
                        }
                    }
                }
            }
        }
    }
    end = clock();

    printf("C Matrix Multiplication Time: %.6f seconds\n", ((double)(end - start)) / CLOCKS_PER_SEC);

    free(A);
    free(B);
    free(C);
    return 0;
}
3 Answers
@Adrian McCarthy had the right observation. In the C version, the condition that terminates each blocked loop is a compound expression involving an addition and the logical && operator. Fortran, on the other hand, just calls the min function once to determine the upper bound. Some compilers might be able to optimize this away, but others may not.
When changing the loop nest to
for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
    for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
        for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
            int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
            for (int i = i_block; i < maxi; i++) {
                int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
                for (int j = j_block; j < maxj; j++) {
                    int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
                    for (int k = k_block; k < maxk; k++) {
                        C[i*N + k] += A[j*N + k]*B[i*N + j];
                    }
                }
            }
        }
    }
}
the performance for the GCC compiler improves dramatically. In my case it went from 0.68 s to 0.28 s. Fortran still remains slightly faster at 0.25 s, but the speeds are much closer now. Intel is still better on my machine at 0.18-0.19 s with -O3 -xHost.
The new version enables GCC to do more vectorization. Although I did not examine the assembly, only the intermediate GIMPLE, I can see vector operations such as
<bb 7> [local count: 10406813]:
# ivtmp.131_127 = PHI <ivtmp.178_547(6), ivtmp.131_217(9)>
# ivtmp.139_438 = PHI <_22(6), ivtmp.139_172(9)>
_20 = MEM[(double *)_365 + ivtmp.131_127 * 8];
_390 = ivtmp.168_88 + _367;
_391 = _390 * 8;
vectp.57_387 = C_44(D) + _391;
vectp.60_395 = (double *) ivtmp.139_438;
vect_cst__404 = {_20, _20, _20, _20};
_352 = ivtmp.139_438 + 8192;
vectp.67_416 = (double *) _352;
D__lsm0.98_9 = MEM[(double *)_365 + 8B + ivtmp.131_127 * 8];
<bb 8> [local count: 56196792]:
# ivtmp.122_502 = PHI <0(7), ivtmp.122_512(8)>
vect__9.58_394 = MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1];
vect__15.61_403 = MEM <vector(4) double> [(double *)vectp.60_395 + ivtmp.122_502 * 1];
vect__21.62_405 = vect__15.61_403 * vect_cst__404;
vect__22.63_406 = vect__9.58_394 + vect__21.62_405;
vect_cst__415 = {D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9};
vect__122.68_425 = MEM <vector(4) double> [(double *)vectp.67_416 + ivtmp.122_502 * 1];
vect__123.69_426 = vect_cst__415 * vect__122.68_425;
vect__124.70_427 = vect__22.63_406 + vect__123.69_426;
MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1] = vect__124.70_427;
ivtmp.122_512 = ivtmp.122_502 + 32;
if (ivtmp.122_512 != 256)
goto <bb 8>; [83.33%]
else
goto <bb 9>; [16.67%]
which were not there before.
From my admittedly uneducated examination of this GIMPLE and of the GIMPLE of the Fortran version, it seems to me that gfortran does more unrolling than gcc does for the C code, which might explain the remaining difference:
<bb 7> [local count: 267178624]:
# ivtmp.52_174 = PHI <ivtmp.79_246(6), ivtmp.52_196(7)>
# ivtmp.54_142 = PHI <_214(6), ivtmp.54_24(7)>
_126 = (void *) ivtmp.52_174;
_14 = MEM[(real(kind=8) *)_126 + -8200B];
_135 = MEM[(real(kind=8) *)_126 + -8192B];
vect_cst__93 = {_14, _14, _14, _14};
vect_cst__70 = {_135, _135, _135, _135};
_216 = (void *) ivtmp.67_176;
vect__6.26_146 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216];
_75 = (void *) ivtmp.54_142;
vect__11.29_8 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8200B];
vect__15.30_138 = vect__11.29_8 * vect_cst__93;
vect__16.31_43 = vect__15.30_138 + vect__6.26_146;
vect__126.36_3 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8B];
vect__125.37_4 = vect__126.36_3 * vect_cst__70;
vect__124.38_5 = vect__125.37_4 + vect__16.31_43;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216] = vect__124.38_5;
vect__6.26_114 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B];
vect__11.29_108 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8168B];
vect__15.30_49 = vect_cst__93 * vect__11.29_108;
vect__16.31_51 = vect__15.30_49 + vect__6.26_114;
vect__126.36_89 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 24B];
vect__125.37_88 = vect_cst__70 * vect__126.36_89;
vect__124.38_87 = vect__16.31_51 + vect__125.37_88;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B] = vect__124.38_87;
vect__6.26_59 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B];
vect__11.29_67 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8136B];
vect__15.30_123 = vect__11.29_67 * vect_cst__93;
vect__16.31_42 = vect__6.26_59 + vect__15.30_123;
vect__126.36_31 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 56B];
vect__125.37_30 = vect__126.36_31 * vect_cst__70;
vect__124.38_29 = vect__125.37_30 + vect__16.31_42;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B] = vect__124.38_29;
vect__6.26_159 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B];
vect__11.29_160 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8104B];
vect__15.30_161 = vect_cst__93 * vect__11.29_160;
vect__16.31_162 = vect__6.26_159 + vect__15.30_161;
vect__126.36_164 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 88B];
vect__125.37_165 = vect_cst__70 * vect__126.36_164;
vect__124.38_166 = vect__16.31_162 + vect__125.37_165;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B] = vect__124.38_166;
vect__6.26_181 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B];
vect__11.29_182 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8072B];
vect__15.30_183 = vect_cst__93 * vect__11.29_182;
vect__16.31_184 = vect__6.26_181 + vect__15.30_183;
vect__126.36_186 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 120B];
vect__125.37_187 = vect_cst__70 * vect__126.36_186;
vect__124.38_188 = vect__16.31_184 + vect__125.37_187;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B] = vect__124.38_188;
vect__6.26_203 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B];
vect__11.29_204 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8040B];
vect__15.30_205 = vect_cst__93 * vect__11.29_204;
vect__16.31_206 = vect__6.26_203 + vect__15.30_205;
vect__126.36_208 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 152B];
vect__125.37_209 = vect_cst__70 * vect__126.36_208;
vect__124.38_210 = vect__16.31_206 + vect__125.37_209;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B] = vect__124.38_210;
vect__6.26_225 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B];
vect__11.29_226 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8008B];
vect__15.30_227 = vect_cst__93 * vect__11.29_226;
vect__16.31_228 = vect__6.26_225 + vect__15.30_227;
vect__126.36_230 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 184B];
vect__125.37_231 = vect_cst__70 * vect__126.36_230;
vect__124.38_232 = vect__16.31_228 + vect__125.37_231;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B] = vect__124.38_232;
vect__6.26_22 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B];
vect__11.29_94 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -7976B];
vect__15.30_92 = vect_cst__93 * vect__11.29_94;
vect__16.31_91 = vect__6.26_22 + vect__15.30_92;
vect__126.36_71 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 216B];
vect__125.37_69 = vect_cst__70 * vect__126.36_71;
vect__124.38_68 = vect__125.37_69 + vect__16.31_91;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B] = vect__124.38_68;
ivtmp.52_196 = ivtmp.52_174 + 16;
ivtmp.54_24 = ivtmp.54_142 + 16384;
if (ivtmp.54_24 == ivtmp.69_195)
goto <bb 8>; [6.25%]
else
goto <bb 7>; [93.75%]
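As a side note, dumps like the ones above can be obtained by adding -fdump-tree-optimized (the flag Vladimir F mentions in the comments) to the compile command, e.g. gcc -O3 -march=native -funroll-all-loops -fdump-tree-optimized matmul.c.
If the remaining gap really does come from unrolling, one thing that might be worth trying (a sketch on my part, not something I have benchmarked for this answer) is asking GCC explicitly to unroll the innermost k loop of the hoisted-bounds version above, using the GCC-specific unroll pragma available since GCC 8:

int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
#pragma GCC unroll 8   /* request 8-way unrolling of the k loop (GCC >= 8) */
for (int k = k_block; k < maxk; k++) {
    C[i*N + k] += A[j*N + k]*B[i*N + j];
}

Whether this actually closes the gap will depend on the compiler version and target, so treat it as an experiment to try rather than a guaranteed fix.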
I am going to answer my own question, at least partially.
I have tried compiling and running the matrix multiplication codes on my laptop, which has an AMD Ryzen 8845HS processor and runs Ubuntu 22.04.5.
The compilation commands are the same, i.e.
Fortran: gfortran -O3 -march=native -funroll-all-loops matmul.f90
C: gcc -O3 -march=native -funroll-all-loops matmul.c
The situation on Ubuntu is different: Fortran takes 0.13 seconds, while C takes 0.24 seconds. The difference is still quite significant.
However, when I use the Intel Fortran and Intel C compilers (installed on my Windows 10 PC with the i7-6700 processor), things are completely different. For 1024 by 1024 matrices, the Fortran-compiled code takes 0.079 seconds, while the C code takes 0.077 seconds, so the C code is actually slightly faster.
The compilation options for the Intel Fortran compiler that I have used are:
ifx -fast matmul.f90 /heap-arrays
The compilation options for the Intel C compiler that I have used are:
ifc -fast matmul.c
It looks like gcc is not as great as I initially thought. This is strange, because I believed it was the best C compiler out there.
In my opinion it is simpler than it seems. Given the six nested loops, essentially all the work is done by the single line that updates each cell. Comparing the two lines and counting the operations written in the source, you have:
For C:
C[i*N + k] += A[j*N + k]*B[i*N + j]  =>  4 additions and 4 multiplications
For Fortran:
C(k, i) = C(k, i) + A(k, j)*B(j, i)  =>  1 addition and 1 multiplication
then:
Complexity(C) = (4-1)/1 * Complexity(Fortran)
The Fortran language steers the compiler toward a more efficient expression of this calculation than C does. A simpler C compiler like gcc gives the programmer more freedom, but also more responsibility for the complexity of the written expressions; a more aggressive optimizer, such as Intel's, may recover these issues for simple expressions like these.
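To illustrate this point with a sketch of my own (assuming the flat row-major arrays and N macro from the question and the hoisted maxi/maxj/maxk bounds from the first answer; I have not benchmarked this variant), the index arithmetic and the B element can be hoisted out of the inner loop in C as well, so that the loop body performs just one floating-point multiply and one add, as the Fortran source does:

for (int i = i_block; i < maxi; i++) {
    double *Ci = &C[i*N];               /* row i of C, address computed once */
    for (int j = j_block; j < maxj; j++) {
        const double *Aj = &A[j*N];     /* row j of A, address computed once */
        const double b_ij = B[i*N + j]; /* scalar reused across the whole k loop */
        for (int k = k_block; k < maxk; k++) {
            Ci[k] += Aj[k] * b_ij;      /* one multiply and one add per iteration */
        }
    }
}

An optimizing compiler can usually do this hoisting and strength reduction by itself, so the source-level operation count does not necessarily translate into a 3x difference in runtime, but writing it explicitly makes the two programs express the same amount of work per iteration.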
Comments on the question:
… temp for B[i*N + j] like you do in Fortran (to avoid the matrix access in every iteration of the inner loop). – wohlstad, Mar 16 at 15:23
… matmul() as well. – John Bollinger, Mar 16 at 15:49
real(8) does not necessarily mean an 8-byte floating-point number. Size and representation are implementation dependent. It is safer to assume that Fortran double precision corresponds to C double, but if you want to be positive (probably unnecessary here) then you should use Fortran's C interop features to get the real kind number that yields the same type as the system C's double. – John Bollinger, Mar 16 at 15:53
… restrict in C. Unless you do overlap pointers, it is a no-brainer. Also, given that you use GCC, you can just examine to what intermediate representations the two frontends compile your code. Use -fdump-tree-optimized. – Vladimir F Героям слава, Mar 16 at 15:54
… main(). It is not what you normally want to do, but it is what you did in Fortran... You should keep the codes as similar as possible. If it is the restrict issue, though, it will remain when you move the Fortran loops into a subroutine. – Vladimir F Героям слава, Mar 16 at 15:59
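Picking up Vladimir F's restrict suggestion, here is a minimal sketch (my own, not benchmarked here) of how the blocked kernel could be moved out of main into a function whose pointer parameters are declared restrict, so the compiler may assume A, B and C do not alias:

/* Sketch only: assumes the same N and BLOCK_SIZE macros as in matmul.c. */
static void matmul_blocked(const double *restrict A,
                           const double *restrict B,
                           double *restrict C) {
    for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
        for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
            for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
                int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
                int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
                for (int i = i_block; i < maxi; i++) {
                    for (int j = j_block; j < maxj; j++) {
                        const double b_ij = B[i*N + j]; /* reused across the k loop */
                        for (int k = k_block; k < maxk; k++) {
                            C[i*N + k] += A[j*N + k] * b_ij;
                        }
                    }
                }
            }
        }
    }
}

Called from main as matmul_blocked(A, B, C) between the two clock() calls, it is a drop-in replacement for the loop nest and gives the C compiler the no-aliasing guarantee that the Fortran compiler already has for its distinct local arrays A, B and C.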