#1
"Oz"
Aug 2016
Seattle
2 Posts
Sharing some interesting results on PCIe version and processor affinity.
TL;DR: Using PCIe 2.0 instead of PCIe 3.0 with a GTX 1060 for CUDALucas results in a ~20% reduction in performance.

Detailed results: https://docs.google.com/spreadsheets...it?usp=sharing

It appears critical to have a PCIe 3.0 slot, and important, if using more than one card in the same machine, to get 1:1 slot-to-processor affinity. In my case, even using two PCIe 3.0 slots in the HP Z820 (slots 2 and 6, which share the same processor) resulted in a ~20% reduction in performance for the second GPU. Switching the second GPU to slot 4 (which is tied to the second processor) brought performance back up to expected levels.

Z820 results with two GTX 1060s (see the Z820 PCIe slot diagram):

PCIe Slot | GTX 1060 | HP Z820 ms/iter
----------|----------|----------------
2         | Card A   | 4.37
2         | Card B   | 4.42
4         | Card A   | 4.42
6         | Card A   | 5.71
6         | Card B   | 5.91

Piling on some evidence: the HP Z400 results I generated also appear to have been held back by having only PCIe 2.0 slots (see the HP Z400 slot diagram).

So if other folks are seeing ~5.5 ms/iter with a GTX 1060 versus the expected ~4.4 ms/iter, check whether you are using a PCIe 2.0 slot and re-slot into a 3.0 slot for improved perf. If using multiple GTX 1060 cards in PCIe 3.0 slots, shoot for 1:1 processor-to-slot isolation. I bet a dollar (USD) this pattern is going to show up on other machines.

Hoping this helps you all get the best perf possible out of your 1060s; truly an awesome card for perf vs. cost and power! If you have questions or want to collaborate, let me know, I am open to it.

-Oz in Seattle
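For anyone wanting to check which link their card actually trained at, nvidia-smi can report the current PCIe generation and width per GPU. A minimal sketch: the query fields below are standard nvidia-smi options, and the parsing assumes its CSV `noheader` output format; `query_pcie_links` would only run on a machine with an NVIDIA driver installed.

```python
# Sketch: report each GPU's current PCIe link generation and width
# via nvidia-smi's CSV query output. Running the query needs an
# NVIDIA driver; the parser itself works on any sample text.
import subprocess

def parse_pcie_links(csv_text):
    """Parse 'gen, width' CSV lines into a list of (gen, width) ints."""
    links = []
    for line in csv_text.strip().splitlines():
        gen, width = (field.strip() for field in line.split(","))
        links.append((int(gen), int(width)))
    return links

def query_pcie_links():
    """Ask nvidia-smi for each GPU's current PCIe link gen and width."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=pcie.link.gen.current,pcie.link.width.current",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_pcie_links(out)

# Parsing demonstrated on sample output (no GPU needed):
sample = "3, 16\n2, 16\n"
links = parse_pcie_links(sample)  # [(3, 16), (2, 16)]
```

Note that GPUs can downtrain the link at idle, so it is worth reading these fields while the card is under load.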
#2
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·29·101 Posts
Interesting. I would be interested in comparing my 750 Ti, which is in a PCIe 1.1 x16 socket, against a 750 Ti in a more recent socket, if anyone can run a benchmark in a later one.
#3
"David"
Jul 2015
Ohio
10058 Posts
This is with CUDALucas?
I expect mfakto not to show any performance difference, given the minimal PCIe bandwidth it needs. I actually wouldn't expect it with CUDALucas either, since it is only supposed to copy the data off the GPU every 10k iterations for the checkpoint; so there may be new optimizations to be made to keep things on the GPU. There may also be changes in how CUDA 8 / compute capability 6.1 handle certain operations that cause this.
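A quick back-of-envelope supports this skepticism: a checkpoint copy amortized over 10k iterations should cost far less than the ~1.3 ms/iter gap in the table above. A minimal sketch, assuming (purely for illustration, not measured) a ~32 MB residue buffer and effective bandwidths of ~6 GB/s for PCIe 2.0 x16 and ~12 GB/s for 3.0 x16:

```python
# Back-of-envelope: amortized PCIe copy cost per CUDALucas iteration.
# Buffer size and bandwidths are illustrative assumptions, not measurements.

def amortized_ms_per_iter(buffer_mb, bandwidth_gb_s, iters_between_copies=10_000):
    """Cost in ms/iter of copying buffer_mb MB over PCIe once every
    iters_between_copies iterations (1 GB/s == 1 MB/ms)."""
    copy_ms = buffer_mb / bandwidth_gb_s
    return copy_ms / iters_between_copies

pcie2 = amortized_ms_per_iter(32, 6.0)   # ~0.00053 ms/iter on PCIe 2.0 x16
pcie3 = amortized_ms_per_iter(32, 12.0)  # ~0.00027 ms/iter on PCIe 3.0 x16
extra = pcie2 - pcie3                    # microseconds, not milliseconds
```

Even with generous assumptions, the difference is well under a microsecond per iteration, nowhere near the observed gap, which would point at something other than checkpoint traffic (link behavior, driver differences, or per-iteration transfers the program isn't supposed to be doing).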
#4
"Marv"
May 2009
near the Tannhäuser Gate
2·313 Posts

Quote:
I also wonder if it is something buried in the prerelease of CUDA 8 (such as debugging code) that causes it. I would wait for the actual release.

Last fiddled with by tServo on 2016-08-08 at 15:57
#5
"Oz"
Aug 2016
Seattle
102 Posts
Fascinating; it was surprising for sure.

I'm willing to try a few more versions out if that would be useful. -Oz
#6
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7·719 Posts

Quote:
#7
Random Account
Aug 2009
35778 Posts
I have a GTX 750 Ti in a PCIe 3.0 slot in an HP workstation. Which program do you need the benchmark for? Give me the test parameters you need and I will give it a shot.
Last fiddled with by storm5510 on 2017-08-03 at 04:34 Reason: Updating |
#8
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
585810 Posts
Thread | Thread Starter | Forum | Replies | Last Post |
mfaktc and PCIe bus width | Chuck | GPU Computing | 47 | 2016-01-08 07:51 |
Geforce GTX Titan 6GB | ATH | GPU Computing | 295 | 2013-05-12 21:35 |
GeForce GTX 580 for sale | TObject | GPU Computing | 13 | 2013-05-07 05:59 |
Cuda on GEForce 210? | Christenson | GPU Computing | 8 | 2011-03-22 02:33 |
nVIDIA's GeForce 9800/G92 series to hit 1 TFLOPS | ixfd64 | Hardware | 0 | 2007-10-01 08:05 |