mersenneforum.org > Extra Stuff > Programming
2011-09-13, 20:01   #1   R. Gerbicz

OpenMP

None of the example codes speeds up the computation on my AMD Phenom(tm) 8450 Triple-Core Processor (with gcc 4.3.2 under openSUSE Linux 11.1). See an example from https://computing.llnl.gov/tutorials/openMP/ , modified a little to obtain timing information and to try different input array sizes:
Code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define CHUNKSIZE 100

int main ()  
{

int i, chunk,N;
double *a,*b,*c,dtime;

printf("give size of the array\n");
scanf("%d",&N);
a=(double*)(malloc)(N*sizeof(double));
b=(double*)(malloc)(N*sizeof(double));
c=(double*)(malloc)(N*sizeof(double));

/* Some initializations */
for (i=0; i < N; i++)
  a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;

dtime=clock();
for(i=0;i<N;i++)c[i]=a[i]+b[i];
dtime=(double) (clock()-dtime)/CLOCKS_PER_SEC;
printf("single thread time=%lf sec.\n",dtime);


dtime=clock();
#pragma omp parallel shared(a,b,c,chunk) private(i)
  {

  #pragma omp for schedule(dynamic,chunk) nowait
  for (i=0; i < N; i++)
    c[i] = a[i] + b[i];

  }  /* end of parallel section */

dtime=(double) (clock()-dtime)/CLOCKS_PER_SEC;
printf("multithreaded time=%lf sec.\n",dtime);

return 0;
}
For example on my computer:
Code:
give size of the array
10000000
single thread time=0.170000 sec.
multithreaded time=0.220000 sec.
I compiled the omp.c code with g++ -Wall -fopenmp -o omp omp.c (no additional optimization flags here, but those do not change the time ratio). Why do I see a slowdown when I use multiple cores?
2011-09-13, 20:08   #2   jasonp

What is CLOCKS_PER_SEC on your system? It could be as small as 18 or so, meaning the time to complete your computation is rounded to the nearest few milliseconds.
2011-09-13, 20:13   #3   R. Gerbicz

Quote:
Originally Posted by jasonp
What is CLOCKS_PER_SEC on your system? It could be as small as 18 or so, meaning the time to complete your computation is rounded to the nearest few milliseconds.
I've seen: CLOCKS_PER_SEC=1000000.
But I have also tried the above code with larger arrays, where the running time exceeds 1 second, and the single-threaded version is still faster.
Say:
give size of the array
60000000
single thread time=1.100000 sec.
multithreaded time=1.920000 sec.

2011-09-13, 21:03   #4   bsquared

Quote:
Originally Posted by R. Gerbicz
I've seen: CLOCKS_PER_SEC=1000000.
But tried the above code for larger arrays also when the running time is more than 1 second, and the single threaded version is still faster.
Say:
give size of the array
60000000
single thread time=1.100000 sec.
multithreaded time=1.920000 sec.
clock() is not multi-thread aware - so the times that are printed come from cycles that are elapsing on *all* of the threads. Use something like gettimeofday() instead. For something that is perfectly parallelizable like this, the multi-threaded time you are observing can be divided by the number of threads to roughly get at the wall clock time.

2011-09-13, 21:11   #5   bsquared

On my system, CHUNKSIZE appears to be nearly optimal at about 4000, instead of 100. I guess that makes sense given that the data cache of the Intel chip is 32kB and 4000 * sizeof(double) ~= 32k.
2011-09-13, 21:33   #6   bsquared

Example (code attached - too long to post):
Code:
% omp
give size of the array
100000000
start.
single thread time=0.684803 sec.
multi-threaded time=0.208925 sec.
Is there a way to control how many threads are used?
Attached: omp.txt (1.8 KB)
2011-09-13, 22:53   #7   R. Gerbicz

Quote:
Originally Posted by bsquared
Is there a way to control how many threads are used?
Yes, and we can also use OpenMP with GMP! Here is a first example showing how to multiply pairs of numbers (there are six multiplications altogether); these numbers have approximately e bytes, where e is the exponent that the code asks for.
Code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <sys/time.h>
#include "gmp.h"

void mult(mpz_ptr n1,mpz_t n2,mpz_t n3) {
     mpz_mul(n1,n2,n3);
     int th_id = omp_get_thread_num();
     printf("thread id=%d\n",th_id);
     return;
}

typedef struct {
	long secs;
	long usecs;
} TIME_DIFF;

TIME_DIFF * my_difftime (struct timeval *, struct timeval *);

TIME_DIFF * my_difftime (struct timeval * start, struct timeval * end)
{
	TIME_DIFF * diff = (TIME_DIFF *) malloc ( sizeof (TIME_DIFF) );

	if (start->tv_sec == end->tv_sec) {
		diff->secs = 0;
		diff->usecs = end->tv_usec - start->tv_usec;
	}
	else {
		diff->usecs = 1000000 - start->tv_usec;
		diff->secs = end->tv_sec - (start->tv_sec + 1);
		diff->usecs += end->tv_usec;
		if (diff->usecs >= 1000000) {
			diff->usecs -= 1000000;
			diff->secs += 1;
		}
	}
	
	return diff;
}

int main () {

  int i,j,e,th;
  mpz_t a[6][3],b[6];
  struct timeval start, end;
  TIME_DIFF *diff;

  for(i=0;i<6;i++)for(j=0;j<3;j++)mpz_init(a[i][j]);
  for(i=0;i<6;i++)mpz_init(b[i]);
  printf("please give the exponent\n");
  scanf("%d",&e);

  for(i=0;i<6;i++)
     for(j=1;j<3;j++)mpz_ui_pow_ui(a[i][j],257+6*i+2*j,e);
  printf("start\n");
  
gettimeofday(&start, NULL);
for(i=0;i<6;i++)mult(b[i],a[i][1],a[i][2]);
gettimeofday(&end, NULL);
diff = my_difftime(&start, &end);
printf("single thread time=%lf sec.\n",(double)diff->secs + (double)diff->usecs/1000000.);

for(th=2;th<=3;th++){
omp_set_num_threads(th);
gettimeofday(&start, NULL);
#pragma omp parallel for schedule(static,6/th)
    for(i=0;i<6;i++) 
        mult(a[i][0],a[i][1],a[i][2]);
gettimeofday(&end, NULL);
diff = my_difftime(&start, &end);
printf("multi-threaded time with %d threads=%lf sec.\n",th,(double)diff->secs +(double) diff->usecs/1000000.);
// sanity check that the parallel multiplications were also correct
  for(i=0;i<6;i++)
     assert(mpz_cmp(b[i],a[i][0])==0);
}

  for(i=0;i<6;i++)for(j=0;j<3;j++)mpz_clear(a[i][j]);
  for(i=0;i<6;i++)mpz_clear(b[i]);

  return 0;
}
on my triple core:
Code:
gerbicz@linux-9puk:~/gmp-5.0.1> gcc -Wall -fopenmp -o omp ompgmp.c .libs/libgmp.a
gerbicz@linux-9puk:~/gmp-5.0.1> ./omp
please give the exponent
10000000
start
thread id=0
thread id=0
thread id=0
thread id=0
thread id=0
thread id=0
single thread time=9.692724 sec.
thread id=0
thread id=1
thread id=0
thread id=1
thread id=0
thread id=1
multi-threaded time with 2 threads=5.187532 sec.
thread id=0
thread id=2
thread id=1
thread id=0
thread id=2
thread id=1
multi-threaded time with 3 threads=3.636895 sec.

2011-09-13, 22:55   #8   fivemack

try the environment variable OMP_NUM_THREADS
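For example (OMP_NUM_THREADS is a standard OpenMP environment variable; `./omp` stands in for the test program from this thread):

```shell
# Ask the OpenMP runtime for 3 threads in the next program run:
export OMP_NUM_THREADS=3
printenv OMP_NUM_THREADS
# then run the program, e.g.:  ./omp
```

Calling omp_set_num_threads() in the code, as in post #7, overrides this environment setting.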
2011-09-21, 23:34   #9   ewmayer

While we're at it, the next OpenMP topic: I want to OpenMP-parallelize my SSE2-enabled FFT code. The functional API looks the same for the scalar-double (C) and the SSE2 (C + inline assembler) builds. One key difference is that in SSE2 mode, instead of declaring lots of local variables to hold intermediate data, I allocate a local memory store containing a mix of precomputed constants (not declared const, since they are just part of the malloc'ed storage block that is also used for writing intermediates) and memory locations used for 128-bit packed-double xmm register spills. After allocating the memory block I compute several static pointers to key parts of it: one to the beginning of the trig-data section, one to the beginning of the DFT intermediates, etc.

The question is: How to generalize this to a multithreaded setting? Would I simply alloc as many such local memory stores as threads, copy the const data to each, and then define thread-specific versions of the local-store access pointers? Would there be any need for private/shared declarations for these data?
2011-09-22, 00:06   #10   bsquared

Quote:
Originally Posted by ewmayer
While we're at it, the next OpenMP topic: I want to OpenMP-parallelize my SSE2-enabled FFT code. The functional API looks the same for the scalar-double (C) and the SSE2 (C + inline assembler) builds. One key difference is that in SSE2 mode, instead of declaring lots of local variables to hold intermediate data, I allocate a local memory store containing a mix of precomputed constants (not declared const, since they are just part of the malloc'ed storage block that is also used for writing intermediates) and memory locations used for 128-bit packed-double xmm register spills. After allocating the memory block I compute several static pointers to key parts of it: one to the beginning of the trig-data section, one to the beginning of the DFT intermediates, etc.

The question is: How to generalize this to a multithreaded setting? Would I simply alloc as many such local memory stores as threads, copy the const data to each, and then define thread-specific versions of the local-store access pointers? Would there be any need for private/shared declarations for these data?
So you have a portion of the memory space read only and a portion that is read/write? If so then what I've done with a similar scenario using pthreads is nothing at all for the read only portion (each thread uses the same pointer to the read only structure and pthreads "handles it" somehow), and separate mallocs for the read/write stuff. Each thread is then assigned its own pointer to its own R/W space (and any initial values computed/copied to each).
2011-09-22, 00:30   #11   ewmayer

Quote:
Originally Posted by bsquared
So you have a portion of the memory space read only and a portion that is read/write? If so then what I've done with a similar scenario using pthreads is nothing at all for the read only portion (each thread uses the same pointer to the read only structure and pthreads "handles it" somehow), and separate mallocs for the read/write stuff. Each thread is then assigned its own pointer to its own R/W space (and any initial values computed/copied to each).
Thanks, Bruce.

Currently I use a common chunk of locally-malloc'ed memory for both the read-only and read-write data. Probably a good idea to make separate chunks for these as you say, but in my case the read-only data tend to be only a small fraction of the memory chunk, so duplicating those would also not be terribly wasteful. (I wanted the read-only and read/write to be in the same chunk for cache reasons, since these data are used by the same DFT code, i.e. are always 'hot' at the same time.)

Mainly I wanted to make sure there are no issues with the one-mem-block-copy-per-thread approach. I presume pthreads and OpenMP should not differ in this regard? (And is there any reason to prefer one over the other in general?)