I am doing a comparison between the Fortran and C programming languages for matrix operations.
This time I have written two files (matmul.c and matmul.f90) that both do the same thing, i.e. multiply two square matrices A and B of size N by N.
In both files I have implemented loop blocking and paid attention to efficient memory access for better performance. However, Fortran is significantly faster at multiplying A and B. When N is 1024, Fortran is about three times faster: C takes about 0.57 seconds to complete the multiplication, while Fortran takes about 0.18 seconds.
Can anyone take a look at my code and tell me what can be done to reduce this difference?
I am running Windows 10, with an Intel i7-6700 CPU and 16 GB of RAM. For compilation I am using MinGW 14.2.
To compile matmul.f90 I used this command in cmd:
gfortran -O3 -march=native -funroll-all-loops matmul.f90
To compile matmul.c I used this command in cmd:
gcc -O3 -march=native -funroll-all-loops matmul.c
Here is the Fortran code:
program matmul_fortran
    implicit none
    integer, parameter :: N = 1024          ! matrix size
    integer, parameter :: BLOCK_SIZE = 32   ! block size
    real(8), dimension(N, N) :: A, B, C
    integer :: i, j, k, i_block, j_block, k_block
    real(8) :: start, finish, temp

    ! initialize matrices A and B with random values
    call random_seed()
    call random_number(A)
    call random_number(B)
    C = 0.0 ! set matrix C to zero values

    call cpu_time(start)
    ! multiplication
    do i_block = 1, N, BLOCK_SIZE
        do j_block = 1, N, BLOCK_SIZE
            do k_block = 1, N, BLOCK_SIZE
                do i = i_block, min(i_block + BLOCK_SIZE - 1, N)
                    do j = j_block, min(j_block + BLOCK_SIZE - 1, N)
                        do k = k_block, min(k_block + BLOCK_SIZE - 1, N)
                            C(k, i) = C(k, i) + A(k, j)*B(j, i)
                        end do
                    end do
                end do
            end do
        end do
    end do
    call cpu_time(finish)

    print *, "Fortran Matrix Multiplication Time: ", finish - start, " seconds"
end program matmul_fortran
Here is the C code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024          // matrix size
#define BLOCK_SIZE 32   // block size

// function to initialize matrices with random values
void initialize_matrix(double *matrix) {
    for (int i = 0; i < N * N; i++) {
        matrix[i] = (double)rand() / RAND_MAX; // Random values between 0 and 1
    }
}

int main() {
    double *A, *B, *C;
    clock_t start, end;

    A = (double *)malloc(N * N * sizeof(double));
    B = (double *)malloc(N * N * sizeof(double));
    C = (double *)malloc(N * N * sizeof(double));

    // set matrix C to zero values
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i * N + j] = 0.0;

    // Initialize matrices
    srand(time(NULL));
    initialize_matrix(A);
    initialize_matrix(B);

    start = clock();
    // multiplication
    for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
        for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
            for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                for (int i = i_block; i < i_block + BLOCK_SIZE && i < N; i++) {
                    for (int j = j_block; j < j_block + BLOCK_SIZE && j < N; j++) {
                        for (int k = k_block; k < k_block + BLOCK_SIZE && k < N; k++) {
                            C[i*N + k] += A[j*N + k]*B[i*N + j];
                        }
                    }
                }
            }
        }
    }
    end = clock();

    printf("C Matrix Multiplication Time: %.6f seconds\n", ((double)(end - start)) / CLOCKS_PER_SEC);

    free(A);
    free(B);
    free(C);
    return 0;
}
3 Answers
@Adrian McCarthy had the right observation. In the C version, the condition that terminates each blocked loop is a compound expression involving an addition and the logical && operator. Fortran, on the other hand, just calls the min function once to determine the upper bound. Some compilers might be able to optimize this away, but others may not.
When changing the loop nest to
for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
    for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
        for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
            int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
            for (int i = i_block; i < maxi; i++) {
                int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
                for (int j = j_block; j < maxj; j++) {
                    int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
                    for (int k = k_block; k < maxk; k++) {
                        C[i*N + k] += A[j*N + k]*B[i*N + j];
                    }
                }
            }
        }
    }
}
the performance for the GCC compiler improves dramatically. In my case it went from 0.68 s to 0.28 s. Fortran still remains slightly faster at 0.25 s, but the speeds are much closer now. Intel is still better on my machine at 0.18-0.19 s with -O3 -xHost.
The new version enables GCC to do more vectorization. Although I did not examine the assembly, only the intermediate GIMPLE, I can see vector operations such as
<bb 7> [local count: 10406813]:
# ivtmp.131_127 = PHI <ivtmp.178_547(6), ivtmp.131_217(9)>
# ivtmp.139_438 = PHI <_22(6), ivtmp.139_172(9)>
_20 = MEM[(double *)_365 + ivtmp.131_127 * 8];
_390 = ivtmp.168_88 + _367;
_391 = _390 * 8;
vectp.57_387 = C_44(D) + _391;
vectp.60_395 = (double *) ivtmp.139_438;
vect_cst__404 = {_20, _20, _20, _20};
_352 = ivtmp.139_438 + 8192;
vectp.67_416 = (double *) _352;
D__lsm0.98_9 = MEM[(double *)_365 + 8B + ivtmp.131_127 * 8];
<bb 8> [local count: 56196792]:
# ivtmp.122_502 = PHI <0(7), ivtmp.122_512(8)>
vect__9.58_394 = MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1];
vect__15.61_403 = MEM <vector(4) double> [(double *)vectp.60_395 + ivtmp.122_502 * 1];
vect__21.62_405 = vect__15.61_403 * vect_cst__404;
vect__22.63_406 = vect__9.58_394 + vect__21.62_405;
vect_cst__415 = {D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9, D__lsm0.98_9};
vect__122.68_425 = MEM <vector(4) double> [(double *)vectp.67_416 + ivtmp.122_502 * 1];
vect__123.69_426 = vect_cst__415 * vect__122.68_425;
vect__124.70_427 = vect__22.63_406 + vect__123.69_426;
MEM <vector(4) double> [(double *)vectp.57_387 + ivtmp.122_502 * 1] = vect__124.70_427;
ivtmp.122_512 = ivtmp.122_502 + 32;
if (ivtmp.122_512 != 256)
goto <bb 8>; [83.33%]
else
goto <bb 9>; [16.67%]
which were not there before.
From my admittedly uneducated examination of this GIMPLE and of the GIMPLE of the Fortran version, it seems to me that gfortran does more unrolling than gcc does for the C code, which might explain the remaining difference:
<bb 7> [local count: 267178624]:
# ivtmp.52_174 = PHI <ivtmp.79_246(6), ivtmp.52_196(7)>
# ivtmp.54_142 = PHI <_214(6), ivtmp.54_24(7)>
_126 = (void *) ivtmp.52_174;
_14 = MEM[(real(kind=8) *)_126 + -8200B];
_135 = MEM[(real(kind=8) *)_126 + -8192B];
vect_cst__93 = {_14, _14, _14, _14};
vect_cst__70 = {_135, _135, _135, _135};
_216 = (void *) ivtmp.67_176;
vect__6.26_146 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216];
_75 = (void *) ivtmp.54_142;
vect__11.29_8 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8200B];
vect__15.30_138 = vect__11.29_8 * vect_cst__93;
vect__16.31_43 = vect__15.30_138 + vect__6.26_146;
vect__126.36_3 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8B];
vect__125.37_4 = vect__126.36_3 * vect_cst__70;
vect__124.38_5 = vect__125.37_4 + vect__16.31_43;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216] = vect__124.38_5;
vect__6.26_114 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B];
vect__11.29_108 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8168B];
vect__15.30_49 = vect_cst__93 * vect__11.29_108;
vect__16.31_51 = vect__15.30_49 + vect__6.26_114;
vect__126.36_89 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 24B];
vect__125.37_88 = vect_cst__70 * vect__126.36_89;
vect__124.38_87 = vect__16.31_51 + vect__125.37_88;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 32B] = vect__124.38_87;
vect__6.26_59 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B];
vect__11.29_67 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8136B];
vect__15.30_123 = vect__11.29_67 * vect_cst__93;
vect__16.31_42 = vect__6.26_59 + vect__15.30_123;
vect__126.36_31 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 56B];
vect__125.37_30 = vect__126.36_31 * vect_cst__70;
vect__124.38_29 = vect__125.37_30 + vect__16.31_42;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 64B] = vect__124.38_29;
vect__6.26_159 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B];
vect__11.29_160 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8104B];
vect__15.30_161 = vect_cst__93 * vect__11.29_160;
vect__16.31_162 = vect__6.26_159 + vect__15.30_161;
vect__126.36_164 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 88B];
vect__125.37_165 = vect_cst__70 * vect__126.36_164;
vect__124.38_166 = vect__16.31_162 + vect__125.37_165;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 96B] = vect__124.38_166;
vect__6.26_181 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B];
vect__11.29_182 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8072B];
vect__15.30_183 = vect_cst__93 * vect__11.29_182;
vect__16.31_184 = vect__6.26_181 + vect__15.30_183;
vect__126.36_186 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 120B];
vect__125.37_187 = vect_cst__70 * vect__126.36_186;
vect__124.38_188 = vect__16.31_184 + vect__125.37_187;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 128B] = vect__124.38_188;
vect__6.26_203 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B];
vect__11.29_204 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8040B];
vect__15.30_205 = vect_cst__93 * vect__11.29_204;
vect__16.31_206 = vect__6.26_203 + vect__15.30_205;
vect__126.36_208 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 152B];
vect__125.37_209 = vect_cst__70 * vect__126.36_208;
vect__124.38_210 = vect__16.31_206 + vect__125.37_209;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 160B] = vect__124.38_210;
vect__6.26_225 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B];
vect__11.29_226 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -8008B];
vect__15.30_227 = vect_cst__93 * vect__11.29_226;
vect__16.31_228 = vect__6.26_225 + vect__15.30_227;
vect__126.36_230 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 184B];
vect__125.37_231 = vect_cst__70 * vect__126.36_230;
vect__124.38_232 = vect__16.31_228 + vect__125.37_231;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 192B] = vect__124.38_232;
vect__6.26_22 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B];
vect__11.29_94 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + -7976B];
vect__15.30_92 = vect_cst__93 * vect__11.29_94;
vect__16.31_91 = vect__6.26_22 + vect__15.30_92;
vect__126.36_71 = MEM <vector(4) real(kind=8)> [(real(kind=8) *)_75 + 216B];
vect__125.37_69 = vect_cst__70 * vect__126.36_71;
vect__124.38_68 = vect__125.37_69 + vect__16.31_91;
MEM <vector(4) real(kind=8)> [(real(kind=8) *)_216 + 224B] = vect__124.38_68;
ivtmp.52_196 = ivtmp.52_174 + 16;
ivtmp.54_24 = ivtmp.54_142 + 16384;
if (ivtmp.54_24 == ivtmp.69_195)
goto <bb 8>; [6.25%]
else
goto <bb 7>; [93.75%]
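As a side note, dumps like the ones above can be obtained by adding -fdump-tree-optimized (the flag Vladimir F mentions in the comments) to the compile command, e.g. gcc -O3 -march=native -funroll-all-loops -fdump-tree-optimized matmul.c.
If the remaining gap really does come from unrolling, one thing that might be worth trying (a sketch on my part, not something I have benchmarked for this answer) is asking GCC explicitly to unroll the innermost k loop of the hoisted-bounds version above, using the GCC-specific unroll pragma available since GCC 8:

int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
#pragma GCC unroll 8   /* request 8-way unrolling of the k loop (GCC >= 8) */
for (int k = k_block; k < maxk; k++) {
    C[i*N + k] += A[j*N + k]*B[i*N + j];
}

Whether this actually closes the gap will depend on the compiler version and target, so treat it as an experiment to try rather than a guaranteed fix.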
I am going to answer my own question, at least partially.
I have tried compiling and running the matrix multiplication codes on my laptop, which has an AMD Ryzen 8845HS processor and runs Ubuntu 22.04.5.
The compilation commands are the same, i.e.
Fortran: gfortran -O3 -march=native -funroll-all-loops matmul.f90
C: gcc -O3 -march=native -funroll-all-loops matmul.c
The situation on Ubuntu is different: Fortran takes 0.13 seconds, while C takes 0.24 seconds. The difference is still quite significant.
However, when I use the Intel Fortran and Intel C compilers (installed on my Windows 10 PC with the i7-6700 processor), things are completely different. For 1024 by 1024 matrices, the Fortran-compiled code takes 0.079 seconds, while the C code takes 0.077 seconds, so the C code is actually slightly faster.
The compilation options for the Intel Fortran compiler that I have used are:
ifx -fast matmul.f90 /heap-arrays
The compilation options for the Intel C compiler that I have used are:
ifc -fast matmul.c
It looks like gcc is not as great as I initially thought. This is strange, because I believed it was the best C compiler out there.
In my opinion it is simpler than it seems. Given the six nested loops, essentially all the work is done by the single line that updates each cell. Comparing the two lines and counting the operations written in the source, you have:
For C:
C[i*N + k] += A[j*N + k]*B[i*N + j]  =>  4 additions and 4 multiplications
For Fortran:
C(k, i) = C(k, i) + A(k, j)*B(j, i)  =>  1 addition and 1 multiplication
then:
Complexity(C) = (4-1)/1 * Complexity(Fortran)
The Fortran language steers the compiler toward a more efficient expression of this calculation than C does. A simpler C compiler like gcc gives the programmer more freedom, but also more responsibility for the complexity of the written expressions; a more aggressive optimizer, such as Intel's, may recover these issues for simple expressions like these.
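To illustrate this point with a sketch of my own (assuming the flat row-major arrays and N macro from the question and the hoisted maxi/maxj/maxk bounds from the first answer; I have not benchmarked this variant), the index arithmetic and the B element can be hoisted out of the inner loop in C as well, so that the loop body performs just one floating-point multiply and one add, as the Fortran source does:

for (int i = i_block; i < maxi; i++) {
    double *Ci = &C[i*N];               /* row i of C, address computed once */
    for (int j = j_block; j < maxj; j++) {
        const double *Aj = &A[j*N];     /* row j of A, address computed once */
        const double b_ij = B[i*N + j]; /* scalar reused across the whole k loop */
        for (int k = k_block; k < maxk; k++) {
            Ci[k] += Aj[k] * b_ij;      /* one multiply and one add per iteration */
        }
    }
}

An optimizing compiler can usually do this hoisting and strength reduction by itself, so the source-level operation count does not necessarily translate into a 3x difference in runtime, but writing it explicitly makes the two programs express the same amount of work per iteration.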
Comments on the question:
… temp for B[i*N + j] like you do in Fortran (to avoid the matrix access in every iteration of the inner loop). – wohlstad, Mar 16 at 15:23
… matmul() as well. – John Bollinger, Mar 16 at 15:49
real(8) does not necessarily mean an 8-byte floating-point number. Size and representation are implementation dependent. It is safer to assume that Fortran double precision corresponds to C double, but if you want to be positive (probably unnecessary here) then you should use Fortran's C interop features to get the real kind number that yields the same type as the system C's double. – John Bollinger, Mar 16 at 15:53
… restrict in C. Unless you do overlap pointers, it is a no-brainer. Also, given that you use GCC, you can just examine to what intermediate representations the two frontends compile your code. Use -fdump-tree-optimized. – Vladimir F Героям слава, Mar 16 at 15:54
… main(). It is not what you normally want to do, but it is what you did in Fortran... You should keep the codes as similar as possible. If it is the restrict issue, though, it will remain when you move the Fortran loops into a subroutine. – Vladimir F Героям слава, Mar 16 at 15:59
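Picking up Vladimir F's restrict suggestion, here is a minimal sketch (my own, not benchmarked here) of how the blocked kernel could be moved out of main into a function whose pointer parameters are declared restrict, so the compiler may assume A, B and C do not alias:

/* Sketch only: assumes the same N and BLOCK_SIZE macros as in matmul.c. */
static void matmul_blocked(const double *restrict A,
                           const double *restrict B,
                           double *restrict C) {
    for (int i_block = 0; i_block < N; i_block += BLOCK_SIZE) {
        for (int j_block = 0; j_block < N; j_block += BLOCK_SIZE) {
            for (int k_block = 0; k_block < N; k_block += BLOCK_SIZE) {
                int maxi = i_block + BLOCK_SIZE < N ? i_block + BLOCK_SIZE : N;
                int maxj = j_block + BLOCK_SIZE < N ? j_block + BLOCK_SIZE : N;
                int maxk = k_block + BLOCK_SIZE < N ? k_block + BLOCK_SIZE : N;
                for (int i = i_block; i < maxi; i++) {
                    for (int j = j_block; j < maxj; j++) {
                        const double b_ij = B[i*N + j]; /* reused across the k loop */
                        for (int k = k_block; k < maxk; k++) {
                            C[i*N + k] += A[j*N + k] * b_ij;
                        }
                    }
                }
            }
        }
    }
}

Called from main as matmul_blocked(A, B, C) between the two clock() calls, it is a drop-in replacement for the loop nest and gives the C compiler the no-aliasing guarantee that the Fortran compiler already has for its distinct local arrays A, B and C.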