
Xyzzy 2020-02-03 22:39

Msieve benchmarking
 
We have uploaded a data file to use for msieve benchmarking.

[URL]https://www.dropbox.com/s/si1kyxq7yerahcw/benchmark.tar.gz[/URL]
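To grab and unpack it from a shell, something like this should work (a sketch; note that Dropbox links generally need ?dl=1 appended to serve the file directly):

[CODE]wget -O benchmark.tar.gz "https://www.dropbox.com/s/si1kyxq7yerahcw/benchmark.tar.gz?dl=1"
tar -xzf benchmark.tar.gz[/CODE]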

It would be cool if timings for various setups were posted here.

If you need help, please ask!

:mike:

VBCurtis 2020-02-15 23:29

Machine: HP Z620, dual 10-core Ivy Bridge Xeon @ ~2.6 GHz
64GB memory per socket
-nc1 was run with target-density 134. After remdups and adding free relations back in, msieve reports 99.3M unique relations; the matrix came out at 4.57M dimensions. TD=140 did not complete filtering.
Used taskset -c 10-19 to lock the job to socket #2, with an msieve compiled with the VBITS=128 option.
10-threaded ETA after 1% of job: 5 hr 25 min.
5-threaded ETA after 1% of job: 9 hr 30 min.
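For reference, the invocations would look something like this (a sketch only; the binary path and the exact -nc1 option string are assumed, not copied from the run):

[CODE]# filtering with target density 134
./msieve -v -nc1 "target_density=134"
# LA pinned to cores 10-19 (socket #2), 10 threads
taskset -c 10-19 ./msieve -v -nc2 -t 10[/CODE]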

Future tests will explore ETAs of smaller target densities, as well as splitting the job over two sockets.

bsquared 2020-02-17 22:20

Machine: 2 sockets of 20-core Cascade-Lake Xeon
Just used the default density.
matrix is 5149968 x 5150142 (1913.9 MB) with weight 597210677 (115.96/col)

Here is a basic 40-threaded job across both sockets (actually, I guess it is capped at 32 threads):
4 hrs 58 min: ./msieve -v -nc2 -t 40

Using MPI helps a lot. Here are various configurations using different VBITS settings (timings taken after 1% elapsed). The two numbers after -nc2 give the MPI grid dimensions, so -np must equal their product:

[CODE]2x20 core VBITS=64
2 hrs 30 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
2 hrs 43 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
3 hrs 1 min: mpirun -np 20 msieve -nc2 4,5 -v -t 2
3 hrs 23 min: mpirun -np 40 msieve -nc2 5,8 -v
3 hrs 23 min: mpirun -np 40 msieve -nc2 8,5 -v

2x20 core VBITS=128
2 hrs 32 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
2 hrs 36 min: mpirun -np 20 msieve -nc2 4,5 -v -t 2
2 hrs 45 min: mpirun -np 40 msieve -nc2 5,8 -v
2 hrs 47 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
2 hrs 54 min: mpirun -np 40 msieve -nc2 8,5 -v

2x20 core VBITS=256
2 hrs 43 min: mpirun -np 40 msieve -nc2 5,8 -v
2 hrs 44 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
2 hrs 47 min: mpirun -np 40 msieve -nc2 8,5 -v
2 hrs 49 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
3 hrs 2 min: mpirun -np 20 msieve -nc2 4,5 -v -t 2[/CODE]

VBITS=128 seems to be the most consistently fast.

Grids that were significantly less square (e.g., 2x20 or 4x1 -t 10) didn't do as well.
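For concreteness, the less-square shapes mentioned would be launched like this (same syntax as the block above; a sketch, not timed runs):

[CODE]mpirun -np 40 msieve -nc2 2,20 -v
mpirun -np 4 msieve -nc2 4,1 -v -t 10[/CODE]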

RichD 2020-02-18 01:50

I have a Sandy Bridge Core-i5 (4 cores, no HT). Would this small machine be useful for a benchmark? I have two versions of Msieve: one with VBITS=128, and an older one without VBITS that I use for poly search (GPU enabled). When I built the VBITS=128 version I noticed about a 10% boost (or more) in my post-processing speed.

VBCurtis 2020-02-18 01:58

RichD- yes! I'd like to see how various generations of hardware compare, regular desktop or Xeon-grade. This also helps others see whether their msieve copy is as fast as it could be (e.g., compiling it oneself can prove *much* faster if the binary one finds online wasn't compiled for the same architecture).
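For anyone trying that, a from-source build might look like the following sketch. It assumes your msieve copy's Makefile honors a VBITS variable, as the posts above suggest; check your Makefile if the setting seems to be ignored.

[CODE]make clean
make all VBITS=128    # VBITS as a make variable is an assumption; some
                      # source trees may want -DVBITS edited into CFLAGS
./msieve -v -nc2 -t 4 # sanity-check the ETA against your old binary[/CODE]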

Xyzzy 2020-02-18 03:28

Is there any way to tell how many cores and what target density were used by viewing the log file?

Maybe we are looking in the wrong place? Or maybe we can patch the source to include this info?

:mike:

VBCurtis 2020-02-18 04:51

Target density is listed in the log just below the polynomial, before msieve begins reading relations. If no density line is evident, then none was specified by the user and the default density of 70 was used.
I believe the number of cores is listed when -nc2 phase begins; something like "8 threads" usually appears in the lines just before the first ETA is printed to the log.
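A quick way to check an existing log (a sketch; the Lanczos line is verbatim from the logs quoted later in this thread, while the density line's exact wording may vary by version):

[CODE]grep -i "density" msieve.log
grep "commencing Lanczos iteration" msieve.log[/CODE]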

Xyzzy 2020-02-18 12:53

Since there is no mention of threads or target density in the log files, these runs must have been done with the default target density and one thread.

linux.log.gz is an AMD 1920X CPU with dual-channel DDR4-2666 memory.
windows.log.gz is an Intel i7-9700K CPU with quad-channel DDR4-3200.

We will re-run these later with various settings to tune our systems better.

:mike:

bsquared 2020-02-18 14:44

Looks like 2 threads for 43.9 hrs in the linux case:
[QUOTE=linux.log.gz]

[U]Thu Jan 30 18:22:25 2020 commencing Lanczos iteration (2 threads)[/U]
Thu Jan 30 18:22:25 2020 memory use: 1762.9 MB
Thu Jan 30 18:23:17 2020 linear algebra at 0.0%, ETA 46h11m
Thu Jan 30 18:23:34 2020 checkpointing every 120000 dimensions
Sat Feb 1 14:08:28 2020 lanczos halted after 81439 iterations (dim = 5149917)
Sat Feb 1 14:08:33 2020 recovered 25 nontrivial dependencies
Sat Feb 1 14:08:33 2020 BLanczosTime: 158039

[/QUOTE]

Summary:
43.9 hrs: 2 threads Linux AMD 1920X CPU with dual-channel DDR4-2666 memory
36.6 hrs: 1 thread Windows Intel i7-9700K CPU with quad-channel DDR4-3200
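(For anyone cross-checking these totals: BLanczosTime is logged in seconds, so the conversion is just

[CODE]awk 'BEGIN { printf "%.1f hours\n", 158039/3600 }'   # 43.9 hours[/CODE]

assuming the log's units haven't changed between versions.)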

Xyzzy 2020-03-10 12:45

Here are benchmarks for 1 through 12 cores on our 1920X and a pretty chart.

The blue line in the chart represents perfect additional core utilization. For example, two cores would be twice as fast as one.

We graphed the linear algebra times.
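The ideal line is just T(n) = T(1)/n. A throwaway sketch to print it, with a placeholder 1-core time:

[CODE]T1=43.9   # hypothetical 1-core LA time in hours; substitute your measurement
for n in $(seq 1 12); do
    awk -v t="$T1" -v n="$n" 'BEGIN { printf "%2d cores: %6.2f hr\n", n, t/n }'
done[/CODE]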

All benchmarks were done on an otherwise idle system. IRL, with lots of stuff running, things slow down dramatically.

:mike:

VBCurtis 2020-03-10 18:07

Xyzzy-
While unlikely, it is possible that 20 or 24 threads yield a bit of improvement. Hyperthreads don't always help on matrix solving, but since this is a benchmark thread it might be nice to demonstrate that.

I suggest 20 as an alternative because using every possible HT might be impacted by any background process, but that effect should be reduced if we leave a few HTs 'open'. I've found situations where using N-1 cores runs faster than N cores, presumably for similar reasons.

