#1 |
"Robert Gerbicz"
Oct 2005
Hungary
1,429 Posts |
None of the example codes speeds up the computations on my AMD Phenom(tm) 8450 Triple-Core Processor (gcc 4.3.2 under openSUSE Linux 11.1). Here is an example from https://computing.llnl.gov/tutorials/openMP/ , modified slightly to obtain timing information and to try different input array sizes:
Code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define CHUNKSIZE 100

int main()
{
    int i, chunk, N;
    double *a, *b, *c, dtime;

    printf("give size of the array\n");
    scanf("%d", &N);
    a = (double*) malloc(N * sizeof(double));
    b = (double*) malloc(N * sizeof(double));
    c = (double*) malloc(N * sizeof(double));

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    dtime = clock();
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    dtime = (double) (clock() - dtime) / CLOCKS_PER_SEC;
    printf("single thread time=%lf sec.\n", dtime);

    dtime = clock();
    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    } /* end of parallel section */
    dtime = (double) (clock() - dtime) / CLOCKS_PER_SEC;
    printf("multithreaded time=%lf sec.\n", dtime);

    return 0;
}
Code:
give size of the array
10000000
single thread time=0.170000 sec.
multithreaded time=0.220000 sec.
#2 |
Tribal Bullet
Oct 2004
6716₈ Posts |
What is CLOCKS_PER_SEC on your system? It could be as small as 18 or so, meaning the time to complete your computation is rounded to the nearest few milliseconds.
#3 | |
"Robert Gerbicz"
Oct 2005
Hungary
1,429 Posts |
Quote:
But I tried the above code with larger arrays too, where the running time exceeds 1 second, and the single-threaded version is still faster. Say:

give size of the array
60000000
single thread time=1.100000 sec.
multithreaded time=1.920000 sec.

Last fiddled with by R. Gerbicz on 2011-09-13 at 20:21
#4 | |
"Ben"
Feb 2007
3,361 Posts |
Quote:
Last fiddled with by bsquared on 2011-09-13 at 21:05 Reason: reference |
#5 |
"Ben"
Feb 2007
3,361 Posts |
On my system, CHUNKSIZE appears to be nearly optimal at about 4000 instead of 100. I guess that makes sense, given that the data cache of the Intel chip is 32 kB and 4000 * sizeof(double) ≈ 32 kB.
#6 |
"Ben"
Feb 2007
D21₁₆ Posts |
Example (code attached - too long to post):
Code:
% omp
give size of the array
100000000
start.
single thread time=0.684803 sec.
multi-threaded time=0.208925 sec.
#7 |
"Robert Gerbicz"
Oct 2005
Hungary
1,429 Posts |
Yes, and with GMP we can also use OpenMP! My first example shows how to multiply pairs of large numbers (six multiplications altogether); these numbers have approximately e bytes each, where e is the exponent the code asks for.
Code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <sys/time.h>
#include "gmp.h"

void mult(mpz_ptr n1, mpz_t n2, mpz_t n3)
{
    mpz_mul(n1, n2, n3);
    int th_id = omp_get_thread_num();
    printf("thread id=%d\n", th_id);
    return;
}

typedef struct {
    long secs;
    long usecs;
} TIME_DIFF;

TIME_DIFF * my_difftime (struct timeval *, struct timeval *);

TIME_DIFF * my_difftime (struct timeval * start, struct timeval * end)
{
    TIME_DIFF * diff = (TIME_DIFF *) malloc ( sizeof (TIME_DIFF) );

    if (start->tv_sec == end->tv_sec) {
        diff->secs = 0;
        diff->usecs = end->tv_usec - start->tv_usec;
    }
    else {
        diff->usecs = 1000000 - start->tv_usec;
        diff->secs = end->tv_sec - (start->tv_sec + 1);
        diff->usecs += end->tv_usec;
        if (diff->usecs >= 1000000) {
            diff->usecs -= 1000000;
            diff->secs += 1;
        }
    }
    return diff;
}

int main ()
{
    int i, j, e, th;
    mpz_t a[6][3], b[6];
    struct timeval start, end;
    TIME_DIFF *diff;

    for (i = 0; i < 6; i++)
        for (j = 0; j < 3; j++)
            mpz_init(a[i][j]);
    for (i = 0; i < 6; i++)
        mpz_init(b[i]);

    printf("please give the exponent\n");
    scanf("%d", &e);
    for (i = 0; i < 6; i++)
        for (j = 1; j < 3; j++)
            mpz_ui_pow_ui(a[i][j], 257 + 6*i + 2*j, e);
    printf("start\n");

    gettimeofday(&start, NULL);
    for (i = 0; i < 6; i++)
        mult(b[i], a[i][1], a[i][2]);
    gettimeofday(&end, NULL);
    diff = my_difftime(&start, &end);
    printf("single thread time=%lf sec.\n",
           (double)diff->secs + (double)diff->usecs/1000000.);

    for (th = 2; th <= 3; th++) {
        omp_set_num_threads(th);
        gettimeofday(&start, NULL);
        #pragma omp parallel for schedule(static,6/th)
        for (i = 0; i < 6; i++)
            mult(a[i][0], a[i][1], a[i][2]);
        gettimeofday(&end, NULL);
        diff = my_difftime(&start, &end);
        printf("multi-threaded time with %d threads=%lf sec.\n", th,
               (double)diff->secs + (double)diff->usecs/1000000.);
        // safety check that the multiplications done in parallel were also correct
        for (i = 0; i < 6; i++)
            assert(mpz_cmp(b[i], a[i][0]) == 0);
    }

    for (i = 0; i < 6; i++)
        for (j = 0; j < 3; j++)
            mpz_clear(a[i][j]);
    for (i = 0; i < 6; i++)
        mpz_clear(b[i]);
    return 0;
}
Code:
gerbicz@linux-9puk:~/gmp-5.0.1> gcc -Wall -fopenmp -o omp ompgmp.c .libs/libgmp.a
gerbicz@linux-9puk:~/gmp-5.0.1> ./omp
please give the exponent
10000000
start
thread id=0
thread id=0
thread id=0
thread id=0
thread id=0
thread id=0
single thread time=9.692724 sec.
thread id=0
thread id=1
thread id=0
thread id=1
thread id=0
thread id=1
multi-threaded time with 2 threads=5.187532 sec.
thread id=0
thread id=2
thread id=1
thread id=0
thread id=2
thread id=1
multi-threaded time with 3 threads=3.636895 sec.

Last fiddled with by R. Gerbicz on 2011-09-13 at 22:54 Reason: bytes, not bits
#8 |
(loop (#_fork))
Feb 2006
Cambridge, England
6,379 Posts |
Try the environment variable OMP_NUM_THREADS.
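For reference, the variable can be exported for the whole session or set per run; the `./omp` binary name is taken from the earlier post and is only illustrative here.

```shell
# Cap the OpenMP thread team at 3 for everything started from this shell.
export OMP_NUM_THREADS=3
# ./omp                      # would now run with at most 3 threads
# OMP_NUM_THREADS=2 ./omp    # or override it for a single invocation
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Note that omp_set_num_threads() in the code overrides this environment setting.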
#9 |
∂2ω=0
Sep 2002
República de California
10110101000010₂ Posts |
While we're at it, the next OpenMP topic: I want to OpenMP-parallelize my SSE2-enabled FFT code. The functional API looks the same for the scalar-double (C) and the SSE2 (C + inline assembly) builds. One key difference is that in SSE2 mode, instead of declaring lots of local variables to hold intermediate data, I allocate a local memory store which contains a mix of precomputed constants (not declared const, since they are just part of the malloc'ed storage block that is also used for writing intermediates) and memory locations used for 128-bit packed-double xmm register spills. After allocating the memory block I compute several static pointers to key parts of it: one pointer to the beginning of the trig-data section, one to the beginning of the DFT intermediates, etc.
The question is: how do I generalize this to a multithreaded setting? Would I simply allocate as many such local memory stores as there are threads, copy the constant data into each, and then define thread-specific versions of the local-store access pointers? Would there be any need for private/shared declarations for these data?
#10 | |
"Ben"
Feb 2007
3,361 Posts |
Quote:
#11 | |
∂2ω=0
Sep 2002
República de California
2·3·1,931 Posts |
Quote:
Currently I use a common chunk of locally malloc'ed memory for both the read-only and read-write data. It is probably a good idea to make separate chunks for these as you say, but in my case the read-only data tend to be only a small fraction of the memory chunk, so duplicating them would not be terribly wasteful. (I wanted the read-only and read-write data in the same chunk for cache reasons, since they are used by the same DFT code, i.e. are always 'hot' at the same time.) Mainly I wanted to make sure there are no issues with the one-mem-block-copy-per-thread approach. I presume pthreads and OpenMP should not differ in this regard? (And is there any reason to prefer one over the other in general?)