mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2016-02-01, 03:13   #1
Fred ("Ron", Jan 2016, Fitchburg, MA)

Memory Bandwidth

So, again, please feel free to answer in 5 year old speak, this is all new to me.

I'm trying to understand what I'm looking at in the benchmark section "Benchmarking multiple workers to measure the impact of memory bandwidth".

In an ideal world with no memory bandwidth issues, would the "Total throughput (iter/sec)" increase proportionally as additional cores/workers are tested? In the example below, since 1cpu, 1 worker had a Total throughput of 56.02 iter/sec, would 4 cpus, 4 workers have a total throughput of 224.08, but instead it's only 143.99 because memory speed is a bottleneck?

Code:
Timing 4096K FFT, 1 cpu, 1 worker.  Average times: 17.85 ms.  Total throughput: 56.02 iter/sec.
Timing 4096K FFT, 2 cpus, 2 workers.  Average times: 19.62, 19.46 ms.  Total throughput: 102.35 iter/sec.
Timing 4096K FFT, 3 cpus, 3 workers.  Average times: 22.49, 22.40, 22.35 ms.  Total throughput: 133.86 iter/sec.
Timing 4096K FFT, 4 cpus, 4 workers.  Average times: 27.89, 27.62, 27.83, 27.78 ms.  Total throughput: 143.99 iter/sec.
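The arithmetic in the question can be checked directly against the benchmark output; a quick illustrative script (numbers taken from the Code block above):

```python
# Check the scaling in the benchmark lines above: ideal throughput is the
# 1-worker figure times the worker count; efficiency is measured / ideal.
single = 56.02  # iter/sec for 1 cpu, 1 worker

measured = {1: 56.02, 2: 102.35, 3: 133.86, 4: 143.99}
for workers, total in measured.items():
    ideal = single * workers          # 4 workers -> 224.08 iter/sec ideal
    print(f"{workers} workers: {total / ideal:.0%} of ideal")
```

With these numbers, 4 workers reach only about 64% of the ideal 224.08 iter/sec.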

Last fiddled with by Fred on 2016-02-01 at 03:14
Old 2016-02-01, 03:43   #2
Mark Rose ("/X\(‘-‘)/X\", Jan 2013)

It's more a matter of memory bandwidth than memory speed, but yes.
Old 2016-02-01, 05:06   #3
Madpoo (Serpentine Vermin Jar, Jul 2014)

Quote:
Originally Posted by Fred View Post
<snip>
...would 4 cpus, 4 workers have a total throughput of 224.08, but instead it's only 143.99 because memory speed is a bottleneck?
Removing any memory bandwidth constraints, it would theoretically scale much better.

George pointed out that his code for splitting the work among different cores may not divide it into even amounts between all of them.

I don't know the details at all, but in my mind I imagined (a very simple example) a big honking multiplication that gets split into 6 chunks of work and distributed among 4 cores.

For the first batch, each core gets a chunk. For the second batch, only 2 of the 4 get any work and the other 2 sit idle.

Then of course at the end it's the job of one of the cores in the worker to collect the chunks of work and combine them all together, during which time the other cores would be idle.

I may be totally wrong in my understanding of the issue, but that was how my feeble mind understood the problem.

I don't know what would be involved in taking any given FFT size and splitting the work evenly across however many cores that worker has. It seems technically possible, but whether it's practical or not, I don't know.

Also, for the final step that combines the work from each core, I wonder: instead of having one core do all of that, would it be faster to pair up cores, so that core #1 combines the work of #1 and #2, and core #3 combines the work of #3 and #4? Then finally core #1 would combine those two chunks.

Would that be any faster in practice than simply having one core combine all the chunks at once? No idea.
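The pairwise-combine idea above can be sketched as a toy tree reduction. This is purely illustrative: addition stands in for whatever the real combine step would be, and it says nothing about how Prime95 actually merges FFT chunks.

```python
from concurrent.futures import ThreadPoolExecutor

def combine_sequential(chunks):
    # One core merges every chunk in turn; the others would sit idle.
    result = chunks[0]
    for c in chunks[1:]:
        result = result + c
    return result

def combine_tree(chunks):
    # Merge pairs in parallel rounds: (#1+#2) and (#3+#4) run
    # concurrently, then one final merge joins the two partial results.
    while len(chunks) > 1:
        pairs = [chunks[i:i + 2] for i in range(0, len(chunks), 2)]
        with ThreadPoolExecutor() as pool:
            chunks = list(pool.map(sum, pairs))
    return chunks[0]

chunks = [10, 20, 30, 40]
print(combine_sequential(chunks), combine_tree(chunks))  # both give 100
```

The tree version needs only log2(n) merge rounds instead of n-1 sequential merges, which is the potential speedup being asked about.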

I mean, when I talk about "chunks" I'm not even sure what those chunks are... I gather it's a way to split a large multiplication into several smaller ones, but beyond that... well, I haven't a clue.

So, all that being said, yes, more memory bandwidth is awesome, but the program itself and the way it distributes work isn't perfect, so it still wouldn't scale exactly, but it would be closer.
Old 2016-02-01, 05:36   #4
Prime95 (P90 years forever!, Aug 2002, Yeehaw, FL)

Quote:
Originally Posted by Madpoo View Post
George pointed out that in his code to split the work among different cores, it may not split the work into even amounts between all of the cores.
You are confusing this case (4 workers on 4 cores) with the multi-threaded case (1 worker on 4 cores).
Old 2016-02-01, 05:39   #5
Prime95 (P90 years forever!, Aug 2002, Yeehaw, FL)

Quote:
Originally Posted by Fred View Post
...because memory speed is a bottleneck?
Yes. There is also some speed loss due to contention for the L3 cache, but main memory bandwidth is the main culprit.
Old 2016-02-01, 05:43   #6
axn (Jun 2003)

Quote:
Originally Posted by Madpoo View Post
Removing any memory bandwidth constraints, it would theoretically scale much better.
<snip>
So, all that being said, yes, more memory bandwidth is awesome, but the program itself and the way it distributes work isn't perfect, so it still wouldn't scale exactly, but it would be closer.
Ummm... That is valid for multi-threaded FFTs (e.g. 1 worker, 4 threads), but OP is asking about single-threaded FFTs.

Short answer is yes, it should scale (nearly) linearly if memory bandwidth is not a constraint. Except for turbo boost!

EDIT:- Ninja'd by George!

Last fiddled with by axn on 2016-02-01 at 05:44
Old 2016-02-01, 13:53   #7
Fred ("Ron", Jan 2016, Fitchburg, MA)

You guys have been super helpful. Thanks for all your responses.

So, is this bottleneck something seen in almost all systems running LL tests? Or are there optimal builds where there would be no bottleneck?

I've been considering a build using an i5-6600. Y'all helped me recognize that that CPU with DDR4-2133 would probably have memory bandwidth issues, and that going with DDR4-3000 or 3200 would be a better choice. Do you suspect an i5-6600 with DDR4-3200 would have NO memory bottleneck issues? Or would it just be LESS of an issue?
Old 2016-02-01, 14:10   #8
science_man_88 ("Forget I exist", Jul 2009, Dumbassville)

Quote:
Originally Posted by Fred View Post
<snip>
Do you suspect an i5-6600 with DDR4-3200 would have NO memory bottleneck issues? Or it would just be LESS of an issue?
I'm not even close to an expert, but https://en.wikipedia.org/wiki/Memory...d_nomenclature may help you realize that speed isn't the only factor in memory bandwidth: number of channels, channel width, clock frequency, etc.
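As a rough illustration of how those factors multiply together, peak theoretical bandwidth is roughly transfers per second × bus width × channel count. This sketch assumes the standard 64-bit (8-byte) bus per DDR4 channel and ignores real-world efficiency losses:

```python
def peak_bandwidth_gbs(mt_per_s, channels, bus_bytes=8):
    # Each DDR4 channel moves bus_bytes (64 bits) per transfer; MT/s is
    # millions of transfers per second, so /1000 converts MB/s to GB/s.
    return mt_per_s * bus_bytes * channels / 1000

print(peak_bandwidth_gbs(2133, 2))  # dual-channel DDR4-2133: ~34.1 GB/s
print(peak_bandwidth_gbs(3200, 2))  # dual-channel DDR4-3200: ~51.2 GB/s
```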
Old 2016-02-01, 15:10   #9
Mark Rose ("/X\(‘-‘)/X\", Jan 2013)

Quote:
Originally Posted by Fred View Post
<snip>
Do you suspect an i5-6600 with DDR4-3200 would have NO memory bottleneck issues? Or it would just be LESS of an issue?
It would be less of an issue. I would say an i5-6400 or i5-6600 with DDR4-3000 would be a cost-optimal system. The price jump to DDR4-3200 isn't worth it. If you're going with a non-overclockable motherboard, and stuck with DDR4-2133, I would suggest the i5-6400.
Old 2016-02-01, 16:28   #10
Madpoo (Serpentine Vermin Jar, Jul 2014)

Quote:
Originally Posted by axn View Post
Ummm... That is valid for multi-threaded FFTs (eg:- 1 worker, 4 threads), but OP is asking for single-threaded FFTs.
<snip>
My mistake, I glossed over that tiny little detail in the comment.
Old 2016-02-01, 16:43   #11
Madpoo (Serpentine Vermin Jar, Jul 2014)

Quote:
Originally Posted by science_man_88 View Post
I'm not even close to an expert, but https://en.wikipedia.org/wiki/Memory...d_nomenclature may help you realize that speed isn't the only factor in memory bandwidth: number of channels, channel width, clock frequency, etc.
Correct... and if it were me, I'd lean towards a system that supported quad-channel memory instead of just dual.

Quad channel at DDR4-2133 speeds will beat dual channel DDR4-3200 in every way (I think... where's that post I made a while back showing benchmarks of things exactly like that?)
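Back-of-the-envelope peak numbers (assuming the standard 8-byte bus per DDR4 channel, and ignoring latency differences) support that comparison:

```python
# Peak theoretical bandwidth = MT/s * 8 bytes per channel * channels, in GB/s.
quad_2133 = 2133 * 8 * 4 / 1000   # ~68.3 GB/s
dual_3200 = 3200 * 8 * 2 / 1000   # ~51.2 GB/s
print(quad_2133 > dual_3200)      # quad DDR4-2133 out-paces dual DDR4-3200
```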

It wasn't this article, but I found it somewhat amusing that the author concluded quad channel wasn't any better, based on those other "real world" benchmarks. In his case I guess that's true, if those are the programs he uses, but it would have been nice to test an app that is known to be memory-bandwidth starved (like Prime95):
http://www.pcworld.com/article/29829...rformance.html

That seems to be a trend when people online discuss the pros/cons of quad channel... "it doesn't make my game run any faster, so why bother".

To which I would only say: if that game needed more memory bandwidth, it would have helped. It turns out memory isn't the bottleneck for most games, apparently; it has more to do with the GPU and CPU.

It's like someone concluding that expensive snow tires aren't worth the cost, never mind that they live in a desert and stick to dry roads only. Ask someone who drives on icy roads if snow tires help and you're going to get a different answer.

Gamers can be a funny breed... if it doesn't make their game run any faster, it's useless (to them...and they'd be right in their own way).

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.