mersenneforum.org Mlucas 20.1 on Power9 v2 18-core

2021-10-27, 20:45   #12
ewmayer
2ω=0

Sep 2002
República de California

2·13·449 Posts

Quote:
 Originally Posted by jas Running it takes around 38 minutes. Below is the mlucas.cfg and I put test.log here: https://gist.github.com/jas4711/100d...68592ae56e9a12
Thanks - what does "grep passed test.log" give?

Suggest you save that 1-core/1-thread cfg-file as mlucas.cfg.1c1t so subsequent self-tests don't overwrite it.

The /proc/cpuinfo file alas says nothing about which quartets of entries map to the same one of the 18 physical cores, and I still haven't found any docs which explain the logical-core numbering convention for your SMT4 setup. So let's see if there is any appreciable difference between the 4-thread timings given by the following:

./Mlucas -s m -cpu 0:3 >& test2.log
./Mlucas -s m -cpu 0:71:18 >& test3.log

The first uses the AMD core numbering convention, in which logical cores 0-3 all map to the same physical core; the second uses the Intel convention, in which, for an 18-physical-core 4-way-SMT CPU, logical cores 0,18,36,54 all map to the same physical core. I plan to add support for the hwloc topology-extracting freeware library next year; I need to see if there's a simple way to build/install it in standalone mode so one can use it as-is to get the topology of one's platform.
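In the meantime, on Linux the kernel's sysfs files expose the SMT-sibling mapping directly, without needing hwloc; a quick sketch (standard sysfs paths; the output for each CPU is a list such as "0-3" for AMD-style grouping or "0,18,36,54" for Intel-style striping):

```shell
# List the SMT siblings of logical CPUs 0-3; quartets that share a
# physical core show up as a single list, e.g. "0-3" (AMD-style
# grouping) versus "0,18,36,54" (Intel-style striping).
for f in /sys/devices/system/cpu/cpu[0-3]/topology/thread_siblings_list; do
  [ -r "$f" ] || continue   # skip CPUs not present on this box
  printf '%s: ' "$f"
  cat "$f"
done
```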

It would be best to suspend your ongoing DCs and whatnot via 'kill -STOP <pid>' while running the above, then resume them afterward with 'kill -CONT <pid>'. Based on your 1c1t runtime, each should take ~10 minutes.
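For reference, the STOP/CONT mechanics can be demoed safely on a stand-in process (a sleep here; for the real thing, substitute the PID from 'pgrep Mlucas'):

```shell
# Freeze a process with SIGSTOP (state stays in RAM, nothing is lost),
# confirm the kernel marks it stopped, then thaw it with SIGCONT.
sleep 60 & pid=$!
kill -STOP "$pid"
sleep 1                                        # let the kernel update the state
awk '{print "state:", $3}' "/proc/$pid/stat"   # prints "state: T" (stopped) on Linux
kill -CONT "$pid"                              # thaw: picks up where it left off
kill "$pid"                                    # clean up the stand-in
```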

Thanks,
-Ernst

2021-11-10, 10:35   #13
jas

"Simon Josefsson"
Jan 2020
Stockholm

5×7 Posts

Quote:
 Originally Posted by ewmayer Thanks - what does "grep passed test.log" give?
I put it here:

https://gist.github.com/jas4711/100d...omment-3957631

Quote:
 Suggest you save that 1-core/1-thread cfg-file as mlucas.cfg.1c1t so subsequent self-tests don't overwrite it.
Thanks -- I keep multiple mlucas.cfg's around for various settings I've tried.

Quote:
 The /proc/cpuinfo file alas says nothing about which quartets of entries map to the same one of the 18 physical cores, and I still haven't found any docs which explain the logical core numbering convention for your SMT4 setup.
I think the output below answers it, but I'll run the suggested tests to confirm.

Code:
root@vello:~# ppc64_cpu --cores-present
Number of cores present = 18
root@vello:~# ppc64_cpu --threads-per-core
Threads per core: 4
root@vello:~# ppc64_cpu --info
Core   0:    0*    1*    2*    3*
Core   1:    4*    5*    6*    7*
Core   2:    8*    9*   10*   11*
Core   3:   12*   13*   14*   15*
Core   4:   16*   17*   18*   19*
Core   5:   20*   21*   22*   23*
Core   6:   24*   25*   26*   27*
Core   7:   28*   29*   30*   31*
Core   8:   32*   33*   34*   35*
Core   9:   36*   37*   38*   39*
Core  10:   40*   41*   42*   43*
Core  11:   44*   45*   46*   47*
Core  12:   48*   49*   50*   51*
Core  13:   52*   53*   54*   55*
Core  14:   56*   57*   58*   59*
Core  15:   60*   61*   62*   63*
Core  16:   64*   65*   66*   67*
Core  17:   68*   69*   70*   71*
root@vello:~#

FWIW, it passed another LL DC, so I'm pretty confident everything works, even if it can probably be optimized further.

https://www.mersenne.org/report_expo...2113993&full=1

2021-11-11, 08:03   #14
jas

"Simon Josefsson"
Jan 2020
Stockholm

5×7 Posts

Quote:
 Originally Posted by ewmayer ./Mlucas -s m -cpu 0:3 >& test2.log ./Mlucas -s m -cpu 0:71:18 >& test3.log

These didn't take the ~10 minutes you guessed; not sure what is wrong, but here are the results:

https://gist.github.com/jas4711/100d...68592ae56e9a12

Scroll down or search for "test2.log" and "test3.log" respectively. As you can see, the first run took 4.5 hours and the second run took 1h40m. I put the 'grep passed' output in a comment at the bottom.

The optimal setting I've found by experimenting is still -cpu 0:63:2.

2021-11-23, 23:30   #15
ewmayer
2ω=0

Sep 2002
República de California

2×13×449 Posts

Quote:
 Originally Posted by jas These didn't take the 10 minutes you guessed, not sure what is wrong but here is the results... https://gist.github.com/jas4711/100d...68592ae56e9a12 Scroll down or search for "test2.log" and "test3.log" respectively. As you can see, the first run took 4.5 hours and the second run took 1h40m. I put the 'grep passed' output in a comment at the bottom. The optimal setting I've found by experimenting is still -cpu 0:63:2.
Apologies for the delayed reply - been full-up busy with ongoing work on the latest Mlucas release.

Thanks for the data - my comments:

o Leading radix-32 appears to have been miscompiled - suggest trying an incremental rebuild at a lower opt-level: in the obj-dir, 'gcc -c -O2 -g -DUSE_THREADS -march=native ../src/radix32_*cy*c && gcc -o Mlucas *o -lm -lrt -lpthread -lgmp', then './Mlucas -fft 2M -iters 100 -radset 32,32,32,32' to quick-test. If that still yields incorrect residues, try -O0. That said, I didn't see any cases in your test-logs where having said leading radix working properly would have yielded the best timing at a given FFT length.

o I also forgot to ask for the mlucas.cfg files resulting from each self-test with the different -cpu arguments, but those numbers are easily enough extracted from the test-log data. Best timings in msec/iter, and the corresponding radix sets, for various FFT lengths and -cpu args; asterisks mark timing anomalies, i.e. cases where the timing for a given FFT length is slower than that for the next-larger one:
Code:
FFT		1c1t			0:3			0:71:18
----	--------------------	--------------------	--------------------
2048	 75.47	1024,32,32	 58.82	64,32,32,16	20.40	1024,32,32
2304	 91.20	36,32,32,32	 66.96	36,32,32,32	26.82	36,32,32,32
2560	108.77	40,32,32,32	 82.46	40,32,32,32	30.54	40,32,32,32
2816	114.72	44,32,32,32	 86.36	44,32,32,32	32.17	176,32,16,16
3072	121.73	48,32,32,32	 92.61	48,32,32,32	33.26	48,32,32,32
3328	137.32	52,32,32,32	102.31	208,32,16,16	38.02	52,32,32,32
3584	144.97	56,32,32,32	109.29	224,32,16,16	38.08	56,32,32,32
3840	157.37	60,32,32,32	118.77*	60,32,32,32	43.04	60,32,32,32
4096	163.65	128,32,32,16	117.51	128,32,32,16	43.85	64,32,32,32
4608	183.77	144,32,32,16	135.94	144,32,32,16	48.24	144,32,32,16
5120	234.53*	160,32,32,16	170.85*	160,32,32,16	61.20*	160,32,32,16
5632	228.96	176,32,32,16	169.98	176,32,32,16	59.85	176,32,32,16
6144	252.86	192,32,32,16	188.42	192,32,32,16	65.20	192,32,32,16
6656	272.21	208,32,32,16	199.11	208,32,32,16	70.45	208,32,32,16
7168	291.60	224,32,32,16	214.91	224,32,32,16	75.87	224,32,32,16
7680	352.16	240,32,32,16	248.33	240,32,32,16	91.04	240,32,32,16
As you note, you get the best timings with -cpu 0:71:18, which makes sense in light of the 'ppc64_cpu --info' data in your above post: this platform follows the AMD logical core numbering convention, with physical core 0 mapping to logical cores 0:3, physical core 1 to logical cores 4:7, and so forth. So we would expect the -cpu 0:71:18 timings to be best, since that puts one thread on each of logical cores 0,18,36,54, which map to physical cores 0,4,9,13, respectively. We would expect similar timings from any -cpu argument which puts one of the 4 threads on each of 4 separate physical cores, e.g. -cpu 0:15:4.

But that says little about maximizing total system throughput. The -cpu 0:3 numbers actually look most promising in that regard, because they indicate that putting 4 threads on a single physical core appreciably boosts throughput over just 1 thread on the same physical core. Even that is not completely certain, though: per the pthread affinity standard, such user assignment preferences are *hints* to the OS; there is no guarantee that a given OS will respect them, one simply hopes it does to as good a degree as possible given whatever else may be running on the system.
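One can at least verify after the fact where the OS actually allowed things to run; e.g. with util-linux's taskset, which can query the allowed-CPU list of any live PID (shown here on the current shell, but 'taskset -cp $(pgrep Mlucas)' works the same way):

```shell
# Read-only query of a process's allowed-CPU list; on Linux this reflects
# whatever affinity the process (or its launcher) last requested.
if command -v taskset >/dev/null 2>&1; then
  taskset -cp $$    # e.g. "pid 1234's current affinity list: 0-71"
else
  echo "taskset not installed (part of util-linux)"
fi
```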

The only real way to be sure is to use the above data to guide experiment. I suggest saving the 3 cfg-files corresponding to each of the above best-timing columns as mlucas.cfg.1c1t, mlucas.cfg.1c4t, and mlucas.cfg.4c4t, respectively. Then, under the dir containing the Mlucas binary and these cfg-files, create 18 run directories, say run0 through runH, using a hexadecimal-style naming scheme (digits 0-9, then letters A-H for 10-17). Put a separate worktodo.ini file into each rundir. Then try two different 18-instance configurations, each of which fills up all 72 logical cores, but in 2 different ways:
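The rundir creation can be sketched as follows (names per the scheme above; worktodo.ini contents elided, one assignment per dir):

```shell
# Create run0..runH (digits 0-9, then A-H for 10-17), each with its own
# initially-empty worktodo.ini to be populated with one assignment.
for d in 0 1 2 3 4 5 6 7 8 9 A B C D E F G H; do
  mkdir -p "run$d"
  touch "run$d/worktodo.ini"
done
ls -d run?   # sanity check: should list 18 directories
```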

[1] In each of run0-H, 'ln -sf ../mlucas.cfg.1c4t mlucas.cfg', then in dir run0 invoke the binary with -cpu 0:3, in run1 with -cpu 4:7, and so on through runH with -cpu 68:71. Let those 18 jobs get through at least several savefile writes, or just let them run for around 24 hours, before doing 'killall -9 Mlucas'. Note down the average time between checkpoints: 'tail -1 run*/p*stat' alas won't work because 'tail' has a stupid limitation that disallows a -[line count] argument for wildcarded invocations, but if you have multitail installed, that will work nicely.

[2] In each of run0-H, 'ln -sf ../mlucas.cfg.4c4t mlucas.cfg', then in dir run0 invoke the binary with -cpu 0:71:18, in run1 with -cpu 1:71:18, and so on through runH with -cpu 17:71:18. (Stride 18, so each instance gets 4 logical cores on 4 distinct physical cores, and the 18 instances together cover all 72 logical cores.) Let those 18 jobs get through at least several savefile writes, or just let them run for around 24 hours. Compare the average time between checkpoints against that for setup [1]; if it's slower, again 'killall -9 Mlucas' and revert to setup [1].

Whichever of [1] and [2] is faster, that setup can be encoded in a bash script to make run resumption after machine downtime or whatnot simple. It'll be interesting to see those total-throughput numbers, and to compare them to the iters/sec you get for -cpu 0:63:2, which puts 2 threads on each of physical cores 0-15.
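A sketch of such a resumption script, under the dir-naming and cfg-file conventions above ('./resume.sh 1' relaunches setup [1] with per-quartet affinities 0:3 .. 68:71; './resume.sh 2' relaunches setup [2] with stride-18 affinities, which give each instance a disjoint set of 4 logical cores, one per SMT4 quartet). The actual launch line is left as a comment so nothing fires accidentally; adapt paths to taste:

```shell
#!/bin/bash
# Resume-all sketch for the 18-instance setups. Dir names (run0..runH),
# cfg-file names, and the Mlucas location follow this thread's layout;
# they are conventions of this discussion, not anything Mlucas mandates.
setup=${1:-1}
dirs=(0 1 2 3 4 5 6 7 8 9 A B C D E F G H)
for i in $(seq 0 17); do
  d="run${dirs[$i]}"
  if [ "$setup" = 1 ]; then
    cfg=mlucas.cfg.1c4t; cpu="$((4*i)):$((4*i + 3))"   # one quartet per instance
  else
    cfg=mlucas.cfg.4c4t; cpu="$i:71:18"                # stride-18 striping
  fi
  mkdir -p "$d"
  ln -sf "../$cfg" "$d/mlucas.cfg"
  echo "$d: -cpu $cpu"   # actual launch: (cd "$d" && nohup ../Mlucas -cpu $cpu &)
done
```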

Last fiddled with by ewmayer on 2021-11-24 at 18:52

