Old 2016-08-02, 22:53   #1
chaoz23
 
"Oz"
Aug 2016
Seattle

2 Posts
NVIDIA GeForce GTX 1060 - PCIe 2.0 vs. 3.0

Sharing some interesting results on PCIe version and processor affinity.

TL;DR: Using a PCIe 2.0 slot instead of PCIe 3.0 with a GTX 1060 running CUDALucas results in roughly a 20% reduction in performance.

Detailed Results
https://docs.google.com/spreadsheets...it?usp=sharing

It appears critical to have a PCIe 3.0 slot, and, if using more than one card in the same machine, important to get 1:1 slot-to-processor affinity.

In my case, even using two PCIe 3.0 slots in the HP Z820 (slots 2 and 6, which both share the same processor) resulted in a ~20% reduction in performance for the second GPU. Switching the second GPU to slot 4 (which is tied to the second processor) got performance back up to expected levels.

Z820 PCI Slot Diagram with two GTX 1060s


PCI Slot | GTX 1060 | HP Z820 ms/it
===================================
2        | Card A   | 4.37
2        | Card B   | 4.42
4        | Card A   | 4.42
6        | Card A   | 5.71
6        | Card B   | 5.91



Piling on more evidence: the HP Z400 results I generated also appear to have been held back by having only PCIe 2.0 slots.

HP Z400 Slot Diagram

So if other folks are seeing ~5.5 ms/it results for the GTX 1060 vs. the expected ~4.4 ms/it, check whether you are using a PCIe 2.0 slot and re-slot the card into a 3.0 slot for improved performance.
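A quick way to confirm what your slot is actually negotiating, without opening the case: nvidia-smi can report the current and maximum PCIe link generation and width per GPU. Here is a minimal Python sketch wrapping that query (it assumes nvidia-smi is on the PATH; the field names come from `nvidia-smi --help-query-gpu`):

Code:
import subprocess

# Query current vs. maximum PCIe link generation and width for each GPU.
FIELDS = ("index,name,"
          "pcie.link.gen.current,pcie.link.gen.max,"
          "pcie.link.width.current,pcie.link.width.max")

out = subprocess.check_output(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
    text=True)

for line in out.strip().splitlines():
    idx, name, gen_cur, gen_max, w_cur, w_max = [f.strip() for f in line.split(",")]
    print(f"GPU {idx} ({name}): PCIe gen {gen_cur}/{gen_max}, width x{w_cur}/x{w_max}")

Note that an idle card can report a lower current link generation because of power management, so check while CUDALucas is actually running.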

If using multiple GTX 1060 cards in PCIe 3.0 slots, aim for 1:1 processor-to-slot isolation.
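On Linux, one way to get that 1:1 mapping is to look up which NUMA node each GPU hangs off (`nvidia-smi topo -m` shows this) and pin each CUDALucas process to that node's CPUs before launch. Below is a minimal sketch of the idea; the GPU-to-node map is something you would read off topo -m for your own box, and the "-d <device>" CUDALucas invocation is an assumption on my part:

Code:
import os
import subprocess

# GPU index -> NUMA node, read off `nvidia-smi topo -m` for this particular box (assumption).
GPU_NUMA_NODE = {0: 0, 1: 1}

def cpus_of_node(node):
    """CPU ids belonging to a NUMA node, parsed from Linux sysfs (e.g. "0-7,16-23")."""
    cpus = set()
    with open(f"/sys/devices/system/node/node{node}/cpulist") as fh:
        for part in fh.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

procs = []
for gpu, node in GPU_NUMA_NODE.items():
    procs.append(subprocess.Popen(
        ["./CUDALucas", "-d", str(gpu)],   # "-d <device>" invocation assumed
        preexec_fn=lambda n=node: os.sched_setaffinity(0, cpus_of_node(n))))

for p in procs:
    p.wait()

(On Windows, `start /affinity` can do the equivalent pinning.)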

I bet a dollar (USD) that this pattern is going to show up on other machines.

Hoping this helps you all get the best performance possible out of your 1060s; they are truly awesome cards for performance vs. cost and power!

If you have questions or want to collaborate, let me know; I am open to it.

-Oz in Seattle
Old 2016-08-03, 10:45   #2
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

2·29·101 Posts

Interesting. I would be interested in comparing my 750Ti, which is in a PCIe 1.1 x16 socket, with a 750Ti in a more recent socket, if anyone can run a benchmark in a later socket.
Old 2016-08-03, 12:11   #3
airsquirrels
 
 
"David"
Jul 2015
Ohio

1005₈ Posts

This is with CUDALucas?

I expect mfakto not to have any performance difference, given the minimal PCIe bandwidth it needs. I actually wouldn't expect it with CUDALucas either, as it is only supposed to copy the data off the GPU every 10k iterations for the checkpoint, so there are likely new optimizations to be made to keep things on the GPU.

There may be changes in how CUDA 8 / compute capability 6.1 handles certain operations that cause this.
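For what it's worth, that copy is easy to put a number on: a ~128 MB device-to-host transfer (a stand-in for a large-FFT residue buffer, size assumed) takes on the order of 10 ms on PCIe 3.0 x16 and roughly twice that on 2.0 x16, which shouldn't matter once per 10k iterations. Here is a minimal sketch of how one could time it, using pycuda (an assumption on my part; CUDALucas itself doesn't use it):

Code:
import numpy as np
import pycuda.autoinit          # creates a CUDA context on GPU 0
import pycuda.driver as drv

# ~128 MB of doubles, a stand-in for a large-FFT residue buffer (size is an assumption)
host_buf = drv.pagelocked_empty(16 * 1024 * 1024, dtype=np.float64)
dev_buf = drv.mem_alloc(host_buf.nbytes)

start, stop = drv.Event(), drv.Event()
start.record()
drv.memcpy_dtoh(host_buf, dev_buf)      # device -> host, the direction a checkpoint copy uses
stop.record()
stop.synchronize()

ms = start.time_till(stop)
print(f"D->H copy of {host_buf.nbytes / 2**20:.0f} MB: {ms:.2f} ms "
      f"({host_buf.nbytes / (ms * 1e-3) / 1e9:.1f} GB/s)")

If that per-copy time, spread over 10k iterations, is still tiny compared to the ms/it differences in the table above, the slowdown can't just be the checkpoint copies.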
Old 2016-08-08, 15:56   #4
tServo
 
 
"Marv"
May 2009
near the Tannhäuser Gate

2·313 Posts

Quote:
Originally Posted by airsquirrels
This is with CUDALucas?

I expect mfakto not to have any performance difference, given the minimal PCIe bandwidth it needs. [...]
I agree with airsquirrels. I have run the NVIDIA trace programs on several of the CUDA programs we have available, and all of them show only a tiny amount of overhead for the transfers from device to host and vice versa.

I also wonder if it is something buried in the prerelease of CUDA 8 (such as debugging code) that causes this. I would wait for the actual release.

Last fiddled with by tServo on 2016-08-08 at 15:57
Old 2016-08-08, 16:32   #5
chaoz23
 
"Oz"
Aug 2016
Seattle

10₂ Posts

Fascinating; it was surprising for sure.

I'm willing to try a few more versions out if useful.

-Oz
Old 2017-07-30, 04:26   #6
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·719 Posts

Quote:
Originally Posted by chaoz23
Sharing some interesting results on PCIe version and processor affinity.

TL;DR: Using a PCIe 2.0 slot instead of PCIe 3.0 with a GTX 1060 running CUDALucas results in roughly a 20% reduction in performance. [...]
Looks interesting, and probably represents considerable work. The column with all 0x0 residues at 30,000 iterations into a much larger exponent concerns me. There was a code fix to detect that illegal value and also 0x02. The 64-bit residue 0x0 is only legal output at the final iteration. CUDALucas generating 0x02, 0x00, or 0xfffffffffffffffd is a known problem with 2.05.1 for Windows (other than the CUDA 8 flavor), and presumably with earlier CUDALucas versions at any CUDA level. One of the symptoms of the error is faster iterations. Resolve the residue issue, then rerun it, or a subset (fractional factorial design)?
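For anyone who wants to screen an existing set of results before rerunning, here is a minimal sketch of the kind of scan this suggests. The file format is an assumption (it just treats any 16-hex-digit token on a line as an interim residue; real CUDALucas output lines vary between versions), and the bad values are the ones listed above:

Code:
import re
import sys

# Interim 64-bit residues that indicate a known-bad run (0x0 is only legal
# as the *final* residue); values taken from the post above.
BAD_RESIDUES = {0x0, 0x2, 0xFFFFFFFFFFFFFFFD}

# Hypothetical log format: any 16-hex-digit token is treated as a residue.
RESIDUE_RE = re.compile(r"\b([0-9a-fA-F]{16})\b")

def scan(path):
    suspect = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            for token in RESIDUE_RE.findall(line):
                if int(token, 16) in BAD_RESIDUES:
                    suspect.append((lineno, token))
    return suspect

if __name__ == "__main__":
    for lineno, token in scan(sys.argv[1]):
        print(f"line {lineno}: suspicious interim residue 0x{token}")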
Old 2017-08-03, 04:25   #7
storm5510
Random Account
 
 
Aug 2009

3577₈ Posts

Quote:
Originally Posted by henryzz
Interesting. I would be interested in comparing my 750Ti, which is in a PCIe 1.1 x16 socket, with a 750Ti in a more recent socket, if anyone can run a benchmark in a later socket.
I have a GTX-750Ti in a PCIe 3.0 slot in an HP workstation. Which program do you need the benchmark for? Give me the test parameters you need and I will give it a shot.

Last fiddled with by storm5510 on 2017-08-03 at 04:34 Reason: Updating
Old 2017-08-03, 08:40   #8
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

5858₁₀ Posts

Quote:
Originally Posted by storm5510
I have a GTX-750Ti in a PCIe 3.0 slot in an HP workstation. Which program do you need the benchmark for? Give me the test parameters you need and I will give it a shot.
Unfortunately the PC with the PCIe 1.1 slot died in September.
henryzz is online now   Reply With Quote