mersenneforum.org Memory Bandwidth

 2016-02-01, 03:13   #1
Fred

"Ron"
Jan 2016
Fitchburg, MA

97 Posts

So, again, please feel free to answer in five-year-old speak; this is all new to me. I'm trying to understand what I'm looking at in the benchmark section "Benchmarking multiple workers to measure the impact of memory bandwidth". In an ideal world with no memory bandwidth issues, would the "Total throughput (iter/sec)" increase proportionally as additional cores/workers are tested? In the example below, since 1 cpu, 1 worker had a total throughput of 56.02 iter/sec, would 4 cpus, 4 workers have a total throughput of 224.08? Instead it's only 143.99, because memory speed is a bottleneck?

Code:
Timing 4096K FFT, 1 cpu, 1 worker.  Average times: 17.85 ms.  Total throughput: 56.02 iter/sec.
Timing 4096K FFT, 2 cpus, 2 workers.  Average times: 19.62, 19.46 ms.  Total throughput: 102.35 iter/sec.
Timing 4096K FFT, 3 cpus, 3 workers.  Average times: 22.49, 22.40, 22.35 ms.  Total throughput: 133.86 iter/sec.
Timing 4096K FFT, 4 cpus, 4 workers.  Average times: 27.89, 27.62, 27.83, 27.78 ms.  Total throughput: 143.99 iter/sec.

Last fiddled with by Fred on 2016-02-01 at 03:14
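For reference, the "ideal world" comparison can be worked out directly from the numbers in the benchmark output; a quick sketch of the scaling efficiency those figures imply:

```python
# Scaling efficiency from the benchmark numbers posted above.
# Ideal throughput for N workers would be N x the 1-worker figure;
# the measured numbers fall short of that as workers are added.

single = 56.02  # iter/sec with 1 cpu, 1 worker
measured = {1: 56.02, 2: 102.35, 3: 133.86, 4: 143.99}

for n, got in measured.items():
    ideal = single * n
    print(f"{n} workers: ideal {ideal:6.2f}, measured {got:6.2f}, "
          f"efficiency {got / ideal:.0%}")
```

With 4 workers the measured 143.99 iter/sec is about 64% of the ideal 224.08, which is the gap being attributed to the memory bottleneck.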
 2016-02-01, 03:43   #2
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

2,917 Posts

It's memory bandwidth more than memory speed, but yes.
2016-02-01, 05:06   #3
Serpentine Vermin Jar

Jul 2014

29×113 Posts

Quote:
 Originally Posted by Fred In an ideal world with no memory bandwidth issues, would the "Total throughput (iter/sec)" increase proportionally as additional cores/workers are tested? ... would 4 cpus, 4 workers have a total throughput of 224.08, but instead it's only 143.99 because memory speed is a bottleneck?
Removing any memory bandwidth constraints, it would theoretically scale much better.

George pointed out that his code that splits the work among different cores may not split it into even amounts across all of the cores.

I don't know the details at all, but in my mind I imagined (a very simple example) a big honking multiplication that gets split into 6 chunks of work and distributed among 4 cores.

For the first batch, each core gets a chunk. For the second batch, only 2 of the 4 get any work and the other 2 sit idle.
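That picture can be sketched as a toy model. The chunk count and round-robin assignment here are made up for illustration and are not how Prime95 actually schedules work:

```python
# Toy model (NOT Prime95's real scheduling): 6 equal chunks of work
# handed round-robin to 4 cores. Two cores end up with 2 chunks while
# two end up with 1, so the busier cores gate the finish time.

chunks, cores = 6, 4
per_core = [0] * cores
for c in range(chunks):
    per_core[c % cores] += 1

print(per_core)  # [2, 2, 1, 1]

# Finish time is set by the busiest core, so the speedup over one core
# is chunks / max(per_core) = 3x, instead of the ideal 4x.
print(chunks / max(per_core))
```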

Then, of course, at the end it's the job of one of the cores in the worker to collect the chunks of work and combine them, during which time the other cores would be idle.

I may be totally wrong in my understanding of the issue, but that was how my feeble mind understood the problem.

I don't know what would be involved in taking any given FFT size and splitting the work evenly to however many cores that worker has. It seems technically possible, but whether it's practical or not, I don't know.

Also, the final step that combines the work from each core, I wonder if instead of having one core do all of that, would it be faster to pair up cores so that core #1 combines the work of #1 and #2, and core #3 combines the work of #3 and #4. Then finally core #1 would combine those 2 chunks.

Would that be any faster in practice than simply having one core combine all the chunks at once? No idea.
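The pairing idea in the last two paragraphs is what's usually called a tree reduction. A rough sketch of why it can help, under the simplifying assumptions that every merge costs the same and cores never wait on each other:

```python
import math

# One core merging all N chunks does N-1 merges back to back, while
# the pairwise scheme is a tree reduction: ceil(log2(N)) rounds, with
# the merges inside each round running in parallel on different cores.

def serial_rounds(n):
    return n - 1  # one core does every merge, one after another

def tree_rounds(n):
    return math.ceil(math.log2(n)) if n > 1 else 0  # parallel rounds

for n in (2, 4, 8, 16):
    print(f"{n} chunks: serial {serial_rounds(n)} steps, "
          f"tree {tree_rounds(n)} rounds")
```

For the 4-core example above, the tree takes 2 rounds instead of 3 serial merges; the win grows with the core count. Whether the merges in a real FFT are actually equal-cost and parallelizable like this is a separate question.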

I mean, when I talk about "chunks" I'm not even sure what those chunks are... I gather it's a way to split a large multiplication into several smaller ones, but beyond that... well, I haven't a clue.

So, all that being said: yes, more memory bandwidth is awesome, but the program and the way it distributes work aren't perfect, so it still wouldn't scale exactly linearly; it would just be closer.

2016-02-01, 05:36   #4
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2³·3·7·43 Posts

Quote:
 Originally Posted by Madpoo George pointed out that in his code to split the work among different cores, it may not split the work into even amounts between all of the cores.
You are confusing this case (4 workers on 4 cores) with the multi-threaded case (1 worker on 4 cores).

2016-02-01, 05:39   #5
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

2³·3·7·43 Posts

Quote:
 Originally Posted by Fred ...because memory speed is a bottleneck?
Yes. There is also some speed loss due to contention for the L3 cache, but main memory bandwidth is the main culprit.

2016-02-01, 05:43   #6
axn

Jun 2003

12B5₁₆ Posts

Quote:
 Originally Posted by Madpoo Removing any memory bandwidth constraints, it would theoretically scale much better. So, all that being said, yes, more memory bandwidth is awesome, but the program itself and the way it distributes work isn't perfect, so it still wouldn't scale exactly, but it would be closer.

Short answer is yes, it should scale (nearly) linearly if memory bandwidth is not a constraint. Except for turbo boost! (With only one core active, the CPU runs at a higher clock, so the one-worker baseline is inflated and the scaling looks worse than it really is.)
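To illustrate the turbo-boost caveat with made-up numbers (these clocks are hypothetical, not measurements of any real CPU):

```python
# Many CPUs turbo higher with fewer active cores, which inflates the
# 1-worker baseline. Even with perfect parallel scaling, throughput
# grows as (active cores x clock), so the ratio to the baseline lags N.

turbo_clock = {1: 3.9, 2: 3.7, 3: 3.5, 4: 3.3}  # GHz, hypothetical

base = turbo_clock[1]
for n, clk in turbo_clock.items():
    print(f"{n} workers: {n * clk / base:.2f}x of baseline (ideal {n}x)")
```

With these numbers, 4 workers would show only about 3.38x even with zero memory contention.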

EDIT:- Ninja'd by George!

Last fiddled with by axn on 2016-02-01 at 05:44

 2016-02-01, 13:53   #7
Fred

"Ron"
Jan 2016
Fitchburg, MA

97 Posts

You guys have been super helpful. Thanks for all your responses. So, is this bottleneck something seen in almost all systems running LL tests, or are there optimal builds where there would be no bottleneck? I've been considering a build using an i5-6600. Y'all helped me recognize that that CPU with DDR4-2133 would probably have memory bandwidth issues, and that going with DDR4-3000 or 3200 would be a better choice. Do you suspect an i5-6600 with DDR4-3200 would have NO memory bottleneck issues, or would it just be LESS of an issue?
2016-02-01, 14:10   #8
science_man_88

"Forget I exist"
Jul 2009
Dumbassville

2⁶×131 Posts

Quote:
 Originally Posted by Fred ... Do you suspect an i5-6600 with DDR4-3200 would have NO memory bottleneck issues? Or it would just be LESS of an issue?
I'm not even close to an expert, but https://en.wikipedia.org/wiki/Memory...d_nomenclature may help you see that speed isn't the only factor in memory bandwidth: number of channels, channel width, clock frequency, etc.
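To make that concrete, theoretical peak bandwidth works out from exactly those factors. This ignores latency, timings, and real-world efficiency, so sustained numbers will be lower:

```python
# Theoretical peak bandwidth from the module naming scheme:
# channels x bus width x transfer rate. DDR4 has a 64-bit (8-byte)
# bus per channel; "DDR4-2133" means 2133 mega-transfers per second.

def peak_gb_s(channels, mt_per_s, bus_bytes=8):
    return channels * bus_bytes * mt_per_s / 1000  # MB/s -> GB/s

print(peak_gb_s(2, 2133))  # dual-channel DDR4-2133: ~34.1 GB/s
print(peak_gb_s(2, 3200))  # dual-channel DDR4-3200: ~51.2 GB/s
```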

2016-02-01, 15:10   #9
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

2,917 Posts

Quote:
 Originally Posted by Fred ... Do you suspect an i5-6600 with DDR4-3200 would have NO memory bottleneck issues? Or it would just be LESS of an issue?
It would be less of an issue. I would say an i5-6400 or i5-6600 with DDR4-3000 would be a cost-optimal system. The price jump to DDR4-3200 isn't worth it. If you're going with a non-overclockable motherboard, and stuck with DDR4-2133, I would suggest the i5-6400.

2016-02-01, 16:28   #10
Serpentine Vermin Jar

Jul 2014

29·113 Posts

Quote:
 Originally Posted by axn Ummm... That is valid for multi-threaded FFTs (eg:- 1 worker, 4 threads), but OP is asking for single-threaded FFTs. Short answer is yes, it should scale (nearly) linearly if memory bandwidth is not a constraint. Except for, turbo boost! EDIT:- Ninja'd by George!
My mistake, I glossed over that tiny little detail in the comment.

2016-02-01, 16:43   #11
Serpentine Vermin Jar

Jul 2014

29×113 Posts

Quote:
 Originally Posted by science_man_88 I'm not even close to an expert but https://en.wikipedia.org/wiki/Memory...d_nomenclature may help realize that speed isn't the only factor in memory bandwidth. number of channels, size of channels, clock frequency etc.
Correct... and if it were me, I'd lean towards a system that supported quad-channel memory instead of just dual.

Quad channel at DDR4-2133 speeds will beat dual channel DDR4-3200 in every way (I think... where's that post I made a while back showing the benchmarks of things exactly like that...)
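On raw theoretical bandwidth at least, the arithmetic backs that up (real sustained bandwidth will be lower in both cases, and latency differs too):

```python
# Theoretical peak check of the quad-vs-dual claim:
# channels x 8 bytes per transfer x mega-transfers/sec.
quad_2133 = 4 * 8 * 2133 / 1000  # ~68.3 GB/s
dual_3200 = 2 * 8 * 3200 / 1000  # ~51.2 GB/s
print(quad_2133, dual_3200)      # quad DDR4-2133 wins on raw bandwidth
```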

Well, I couldn't find that post, but I did find it somewhat amusing that the author of this article concluded quad channel wasn't any better, based on those "real world" benchmarks. In his case I guess that's true, if those are the programs he uses. But it would have been nice to test an app that's known to be memory-bandwidth-starved (like Prime95):
http://www.pcworld.com/article/29829...rformance.html

That seems to be a trend when people online discuss the pros/cons of quad channel... "it doesn't make my game run any faster, so why bother".

To which I would only say: if the game needed more memory bandwidth, it would have helped. It turns out memory isn't the bottleneck for most games, apparently; it has more to do with the GPU and CPU.

It's like someone who lives in a desert and sticks to dry roads complaining that expensive snow tires aren't worth the cost. Ask someone who drives on icy roads whether snow tires help and you'll get a different answer.

Gamers can be a funny breed... if it doesn't make their game run any faster, it's useless (to them...and they'd be right in their own way).

