mersenneforum.org OpenMP

 2011-09-13, 20:01   #1
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

1,429 Posts

OpenMP

None of the example codes speeds up the computation on my AMD Phenom(tm) 8450 Triple-Core Processor (with gcc 4.3.2 under openSUSE Linux 11.1). See an example from https://computing.llnl.gov/tutorials/openMP/ , modified a little to print timing information and to try different sizes of input arrays:
Code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define CHUNKSIZE 100

int main () {

int i, chunk, N;
double *a,*b,*c,dtime;

printf("give size of the array\n");
scanf("%d",&N);
a=(double*)malloc(N*sizeof(double));
b=(double*)malloc(N*sizeof(double));
c=(double*)malloc(N*sizeof(double));

/* Some initializations */
for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;

dtime=clock();
for (i=0; i < N; i++)
    c[i] = a[i] + b[i];
dtime=(clock()-dtime)/CLOCKS_PER_SEC;
printf("single thread time=%lf sec.\n",dtime);

dtime=clock();
#pragma omp parallel for shared(a,b,c,chunk) private(i) schedule(static,chunk)
for (i=0; i < N; i++)
    c[i] = a[i] + b[i];
dtime=(clock()-dtime)/CLOCKS_PER_SEC;
printf("multithreaded time=%lf sec.\n",dtime);

return 0;
}
 2011-09-13, 20:08 #2 jasonp Tribal Bullet     Oct 2004 6716₈ Posts What is CLOCKS_PER_SEC on your system? It could be as small as 18 or so, meaning the time to complete your computation is rounded to the nearest few milliseconds.
2011-09-13, 20:13   #3
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

1,429 Posts

Quote:
 Originally Posted by jasonp What is CLOCKS_PER_SEC on your system? It could be as small as 18 or so, meaning the time to complete your computation is rounded to the nearest few milliseconds.
I've seen: CLOCKS_PER_SEC=1000000.
But tried the above code for larger arrays also when the running time is more than 1 second, and the single threaded version is still faster.
Say:
give size of the array
60000000
single thread time=1.100000 sec.
multithreaded time=1.920000 sec.

Last fiddled with by R. Gerbicz on 2011-09-13 at 20:21

2011-09-13, 21:03   #4
bsquared

"Ben"
Feb 2007

3,361 Posts

Quote:
 Originally Posted by R. Gerbicz I've seen: CLOCKS_PER_SEC=1000000. But tried the above code for larger arrays also when the running time is more than 1 second, and the single threaded version is still faster. Say: give size of the array 60000000 single thread time=1.100000 sec. multithreaded time=1.920000 sec.
clock() is not multi-thread aware - so the times that are printed come from cycles that are elapsing on *all* of the threads. Use something like gettimeofday() instead. For something that is perfectly parallelizable like this, the multi-threaded time you are observing can be divided by the number of threads to roughly get at the wall clock time.

Last fiddled with by bsquared on 2011-09-13 at 21:05 Reason: reference

 2011-09-13, 21:11 #5 bsquared     "Ben" Feb 2007 3,361 Posts On my system, CHUNKSIZE appears to be nearly optimal at about 4000, instead of 100. I guess that makes sense given that the data cache of the Intel chip is 32kB and 4000 * sizeof(double) ~= 32k.
2011-09-13, 21:33   #6
bsquared

"Ben"
Feb 2007

D21₁₆ Posts

Example (code attached - too long to post):
Code:
% omp
give size of the array
100000000
start.
multi-threaded time=0.208925 sec.
Is there a way to control how many threads are used?
Attached Files
 omp.txt (1.8 KB, 94 views)

2011-09-13, 22:53   #7
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

1,429 Posts

Quote:
 Originally Posted by bsquared Is there a way to control how many threads are used somehow?
Yes, and in GMP we can use OpenMP too! My first example shows how to multiply pairs of numbers (there are altogether 6 multiplications); these numbers have approximately e bytes each, where e is the exponent that the code asks for.
Code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <sys/time.h>
#include "gmp.h"

void mult(mpz_ptr n1,mpz_t n2,mpz_t n3) {
mpz_mul(n1,n2,n3);
return;
}

typedef struct {
long secs;
long usecs;
} TIME_DIFF;

TIME_DIFF * my_difftime (struct timeval *, struct timeval *);

TIME_DIFF * my_difftime (struct timeval * start, struct timeval * end)
{
TIME_DIFF * diff = (TIME_DIFF *) malloc ( sizeof (TIME_DIFF) );

if (start->tv_sec == end->tv_sec) {
diff->secs = 0;
diff->usecs = end->tv_usec - start->tv_usec;
}
else {
diff->usecs = 1000000 - start->tv_usec;
diff->secs = end->tv_sec - (start->tv_sec + 1);
diff->usecs += end->tv_usec;
if (diff->usecs >= 1000000) {
diff->usecs -= 1000000;
diff->secs += 1;
}
}

return diff;
}

int main () {

int i,j,e,th;
mpz_t a[6][3],b[6];
struct timeval start, end;
TIME_DIFF *diff;

for(i=0;i<6;i++)for(j=0;j<3;j++)mpz_init(a[i][j]);
for(i=0;i<6;i++)mpz_init(b[i]);
scanf("%d",&e);

for(i=0;i<6;i++)
for(j=1;j<3;j++)mpz_ui_pow_ui(a[i][j],257+6*i+2*j,e);
printf("start\n");

gettimeofday(&start, NULL);
for(i=0;i<6;i++)mult(b[i],a[i][1],a[i][2]);
gettimeofday(&end, NULL);
diff = my_difftime(&start, &end);
printf("single thread time=%lf sec.\n",(double)diff->secs + (double)diff->usecs/1000000.);

for(th=2;th<=3;th++){
gettimeofday(&start, NULL);
#pragma omp parallel for num_threads(th) schedule(static,6/th)
for(i=0;i<6;i++)
mult(a[i][0],a[i][1],a[i][2]);
gettimeofday(&end, NULL);
diff = my_difftime(&start, &end);
printf("multi-threaded time with %d threads=%lf sec.\n",th,(double)diff->secs + (double)diff->usecs/1000000.);
// sanity check that the multiplications done in parallel were also correct
for(i=0;i<6;i++)
assert(mpz_cmp(b[i],a[i][0])==0);
}

for(i=0;i<6;i++)for(j=0;j<3;j++)mpz_clear(a[i][j]);
for(i=0;i<6;i++)mpz_clear(b[i]);

return 0;
}
on my triple core:
Code:
gerbicz@linux-9puk:~/gmp-5.0.1> gcc -Wall -fopenmp -o omp ompgmp.c .libs/libgmp.a
gerbicz@linux-9puk:~/gmp-5.0.1> ./omp
10000000
start
multi-threaded time with 3 threads=3.636895 sec.

Last fiddled with by R. Gerbicz on 2011-09-13 at 22:54 Reason: bytes, not bits

 2011-09-13, 22:55 #8 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 6,379 Posts try the environment variable OMP_NUM_THREADS
 2011-09-21, 23:34   #9
ewmayer
∂2ω=0

Sep 2002
República de California

10110101000010₂ Posts

While we're at it, next OpenMP topic: I want to OpenMP-parallelize my SSE2-enabled FFT code. The functional API looks the same for the scalar-double (C) code and the SSE2 (C + inline assembler) builds.

One key difference is that in SSE2 mode, instead of declaring lots of local variables to hold intermediate data, I allocate a local memory store which contains a mix of precomputed constants (not declared as such, since they are just part of the malloc'ed storage block also used for writing intermediates) and memory locations used for 128-bit packed-double xmm register spills. After allocating the memory block I compute values of several static pointers to key parts of it: one pointer to the beginning of the trig-data section, one to the beginning of the DFT intermediates, etc.

The question is: how to generalize this to a multithreaded setting? Would I simply alloc as many such local memory stores as there are threads, copy the const data to each, and then define thread-specific versions of the local-store access pointers? Would there be any need for private/shared declarations for these data?
2011-09-22, 00:06   #10
bsquared

"Ben"
Feb 2007

3,361 Posts

Quote:
 Originally Posted by ewmayer While were at it, next OpenMP topic: I want to OpenMP-parallelize my SSE2-enabled FFT code. The functional API looks the same for the scalar-double (C) code and the SSE2 (C + inline assembler) builds. One key difference is that in SSE2 mode, instead of declaring lots of local variables to hold intermediate data, I allocate a local memory store which contains a mix of precomputed constants (not declared so, since they are just part of the malloc'ed storage block also used for writing intermediates to) and memory locations used for 128-bit packed-double xmm register spills. After allocating the memory block I compute values of several static pointers to key parts of it: One pointer to the beginning of the trig-data section, one to the beginning of the DFT-intermediates etc. The question is: How to generalize this to a multithreaded setting? Would I simply alloc as many such local memory stores as threads, copy the const data to each, and then define thread-specific versions of the local-store access pointers? Would there be any need for private/shared declarations for these data?
So you have a portion of the memory space read only and a portion that is read/write? If so then what I've done with a similar scenario using pthreads is nothing at all for the read only portion (each thread uses the same pointer to the read only structure and pthreads "handles it" somehow), and separate mallocs for the read/write stuff. Each thread is then assigned its own pointer to its own R/W space (and any initial values computed/copied to each).

2011-09-22, 00:30   #11
ewmayer
∂2ω=0

Sep 2002
República de California

2·3·1,931 Posts

Quote:
 Originally Posted by bsquared So you have a portion of the memory space read only and a portion that is read/write? If so then what I've done with a similar scenario using pthreads is nothing at all for the read only portion (each thread uses the same pointer to the read only structure and pthreads "handles it" somehow), and separate mallocs for the read/write stuff. Each thread is then assigned its own pointer to its own R/W space (and any initial values computed/copied to each).
Thanks, Ben.

Currently I use a common chunk of locally-malloc'ed memory for both the read-only and read-write data. Probably a good idea to make separate chunks for these as you say, but in my case the read-only data tend to be only a small fraction of the memory chunk, so duplicating those would also not be terribly wasteful. (I wanted the read-only and read/write to be in the same chunk for cache reasons, since these data are used by the same DFT code, i.e. are always 'hot' at the same time.)

Mainly I wanted to make sure there are no issues with the one-mem-block-copy-per-thread approach. I presume pthreads and OpenMP should not be different in this regard? (And is there any reason to prefer one over the other in general?)


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.