
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GMP-ECM (https://www.mersenneforum.org/forumdisplay.php?f=55)
-   -   gmp-ecm step 1 performance with hyperthreading (https://www.mersenneforum.org/showthread.php?t=26309)

nordi 2020-12-14 23:18

gmp-ecm step 1 performance with hyperthreading
 
While there's some [URL="https://www.mersenneforum.org/showthread.php?t=23245"]good information about gmp-ecm step 1 performance on a single core[/URL], and many postings about running multiple gmp-ecm instances, I could not find any information about whether it benefits from CPU threads (a.k.a. SMT, a.k.a. Hyperthreading). So I decided to test it myself.

I tested how gmp-ecm stage 1 performs when using all physical CPU cores compared to using all threads of the CPU. Specifically, I checked how stage 1 performs with B1=1e8 for [URL="https://www.mersenne.ca/exponent/1489"]M1489[/URL]. I chose these parameters because
[LIST][*]stage 1 takes much more time than stage 2, so its performance is more relevant[*]B1=1e8 gets results reasonably quickly but still runs long enough for reliable results[*]M1489 has so many known factors that <1000 bits remain to be factored[/LIST]The surprising result was that using all CPU threads almost doubled the throughput! Time needed for 1 curve in each gmp-ecm process:

AMD Ryzen 9 3950X
16 cores: 570 seconds
32 threads: 589 seconds
throughput change: +91%

Intel Core i7 6600U
2 cores: 858 seconds
4 threads: 984 seconds
throughput change: +74%
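
(Throughput here means curves completed per unit time, so it scales with the number of parallel processes divided by the time per curve. For the i7, for example: (4/984) / (2/858) = 2*858/984 ≈ 1.74, i.e. +74%.)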

I also cross-tested against mprime, which is said to be faster for stage 1. Interestingly, mprime working on M1489 with one thread and B1=1e9 takes ~6400s on the Ryzen, compared to ~5850s with gmp-ecm. I assume this happens because mprime does not use the known factors when doing ECM. So for this particular number, gmp-ecm is actually a bit faster in step 1 than mprime.

nordi 2020-12-14 23:22

For reference, the benchmark was performed on Linux like this:
[code]
#!/bin/bash

# Run one curve of ECM stage 1 (B1=1e8, B2=0) on M1489 with its known
# factors divided out, pinned to the CPU given as $1.
function launch_ecm {
    echo '(2^1489-1) / 71473 / 27201739919 / 51028917464688167 / 13822844053570368983 / 122163266112900081138309323835006063277267764895871 / 95909518295775374166321292697000685895150503357477127' | \
    time taskset --cpu-list $1 ecm -v -modmuln 1e8 0 >> log_$1 &
}

# First pass: one process per physical core (even-numbered CPUs only).
for i in $(seq 0 2 31); do
    launch_ecm $i
done
wait

echo -e '\n\n\n\n\n\n'
sleep 30

# Second pass: one process per CPU thread.
for i in $(seq 0 31); do
    launch_ecm $i
done
wait
[/code]
Replace the "31" in the for-loops with your thread count minus 1.


I'm using "taskset" to ensure that in the first loop, each physical core gets exactly one ecm process. Without it, some cores would get two and some would get none.
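
Note that this assumes the two SMT siblings of each physical core are numbered adjacently (0/1, 2/3, ...); some systems number them 0/16, 1/17, etc. instead. One way to check the actual layout (assuming util-linux's lscpu is available):
[code]
# List each logical CPU together with the physical core it belongs to.
lscpu --extended=CPU,CORE

# Or query the siblings of a single logical CPU directly:
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
[/code]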

VBCurtis 2020-12-15 01:07

GMP-ECM is quite kind to HT, yes. It's also rather kind to memory bandwidth, so it makes a nice hyperthread-partner to LLR or P95 work.
mprime's stage 1 ECM uses FFT arithmetic, similar to the LL test, so it's not as memory-kind. This is another reason that, for workloads involving hyperthreads, it may be faster to run GMP-ECM rather than P95-ECM on these small Mersenne composites.

kruoli 2020-12-15 14:54

You may want to test with the "known factors format":
[QUOTE=readme.txt][c]ECM2=k,b,n,c,B1,B2,curves_to_run[,"comma-separated-list-of-known-factors"][/c][/QUOTE]

That would result in:
[CODE]
ECM2=1,2,1489,-1,1000000000,0,1,"71473,27201739919,51028917464688167,13822844053570368983,122163266112900081138309323835006063277267764895871,95909518295775374166321292697000685895150503357477127"[/CODE]
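
(For reference, that line would go into mprime's worktodo.txt; per the format quoted above, k,b,n,c = 1,2,1489,-1 describes 2^1489-1, and the known factors are listed at the end.)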

You should also have a look at an exponent with fewer known factors. It could change the outcome of GMP-ECM vs. Prime95.

According to undoc.txt, [c]MaxStage0Prime[/c] does not change the behavior of ECM's first stage. But maybe the same optimization as in P-1 could be applied here as well?

nordi 2021-11-07 23:14

I also benchmarked Step 2 on my AMD Ryzen 9 3950X, using M1217 with B2=1e13, to answer two questions:
[LIST=1][*]does it make sense to run Step 2 on every CPU thread?[*]does it make sense to run Steps 1 and Step 2 in parallel on a physical core, using its two threads?[/LIST]
For question 1, I got
16 physical cores with Step 2: 357.5 seconds per curve
32 CPU threads with Step 2: 631.5 seconds per curve
throughput: 357.5/631.5*2 = 113.2%
which is [B]13.2%[/B] more throughput.


For question 2, I got
Step 2 while Step 1 is running on the sibling thread: 611.8 seconds
Step 2 throughput: 357.5/611.8 = 58.4%
Step 1 while Step 2 is running on the sibling thread: 599.6 seconds
Step 1 without Step 2 running: 354.0 seconds
Step 1 throughput: 354.0/599.6 = 59.0%
overall throughput: 58.4% + 59.0% = 117.4%
which is [B]17.4%[/B] more throughput.


The additional throughput is not as significant as for Step 1 and comes at the expense of either doubled RAM requirements (case 1) or a longer time during which the RAM is in use (case 2). But if you have enough RAM, it makes sense to use all CPU threads.
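
For anyone who wants to try case 2: a minimal sketch of what the pairing could look like, assuming adjacent SMT siblings (CPUs 0 and 1 on one physical core) and a residue file saved from an earlier stage 1 run with -save (the filenames and $NUMBER are placeholders):
[code]
# Stage 1 on one thread of the core, saving the residue for a later stage 2.
echo "$NUMBER" | taskset --cpu-list 0 ecm -modmuln -save residues_new 1e8 0 &

# Stage 2 on the sibling thread, resuming a previously saved stage 1 residue.
taskset --cpu-list 1 ecm -modmuln -resume residues_old 1e8 1e13 &

wait
[/code]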

