The task was to implement various matrix multiplication algorithms using OpenMP. It turned out that with num_threads(1), the program runs faster than with any other number of threads. Is this due to the overhead of data transfer taking too much time, or is the issue something else?
#include <omp.h>
#include <stdio.h>
#include <cstdlib>
#include <ctime>

int main() {
    const int n = 1024;
    const int m = 128;
    int** A = new int*[n];
    int** B = new int*[n];
    int** C = new int*[n];
    for (int i = 0; i < n; i++) {
        A[i] = new int[n];
        B[i] = new int[n];
        C[i] = new int[n]();  // value-initialize: C must start at zero before +=
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            A[i][j] = rand() % 10;
            B[i][j] = rand() % 10;
        }
    }

    clock_t start = clock();
    #pragma omp parallel for num_threads(6) schedule(dynamic, m)
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
    }
    clock_t end = clock();
    printf("Time for j-i-k: %f\n", (double)(end - start) / CLOCKS_PER_SEC);

    start = clock();
    // Note: parallelizing over k makes different threads update the same
    // C[i][j], which is a data race.
    #pragma omp parallel for num_threads(6) schedule(dynamic, m)
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];
        }
    }
    end = clock();
    printf("Time for k-i-j: %f\n", (double)(end - start) / CLOCKS_PER_SEC);

    for (int i = 0; i < n; i++) {
        delete[] A[i];
        delete[] B[i];
        delete[] C[i];
    }
    delete[] A;
    delete[] B;
    delete[] C;
}
asked yesterday by Илья Анненков
1 Answer
I believe the problem is with your clock usage. clock() reports CPU time, which is summed across all running threads, so it grows with the thread count. OpenMP provides the function omp_get_wtime(), which returns wall-clock time and gives correct measurements inside and around parallel regions. After running your code with omp_get_wtime() and comparing num_threads(1) against num_threads(6), the time difference is quite drastic.
[Image: comparison of the measured times]
You can also set the number of threads with OMP_NUM_THREADS=1 or OMP_NUM_THREADS=6 instead of hard-coding it, which allows for faster testing. Also make sure to compile with the -fopenmp flag.
Comments:
… std::vector<std::vector<int>> (you're already leaking memory). As for using clock to measure performance... I would recommend you use a profiler; that will tell you where your bottlenecks really are. – Pepijn Kramer, yesterday
new/new[] and delete/delete[] don't belong in modern C++. Use smart pointers, RAII, etc. Don't write C++ like we did in 1998. – Jesper Juhl, yesterday
clock indeed. It measures CPU time, which means it increases with the number of threads running. Note that OpenMP spawns threads, not processes. Oh, and a dynamic schedule is generally not a good idea, especially for a matrix multiplication with a very stable, uniform workload. Besides, using num_threads(6) is bad practice here. You should really use the environment variable OMP_NUM_THREADS instead. – Jérôme Richard, yesterday