#67
tdulcet

"Teal Dulcet"
Jun 2018

71 Posts

Quote:
 Originally Posted by drkirkby
 Why would 3 workers give the most throughput on a dual-socket computer?
I ran the throughput benchmark on a c5.metal instance and got different results. Specifically, two workers were faster at the higher FFT lengths. Here are the fastest numbers of workers for each supported FFT length benchmarked by default:
• 6 workers: 2048K, 2100K, 2160K, 2240K, 2304K, 2400K
• 4 workers: 2520K, 2560K, 2592K, 2688K, 2880K, 2940K, 3000K, 3072K, 3136K, 3200K, 3360K, 3456K, 3600K, 3840K, 3920K, 4200K, 4320K, 4480K, 4800K
• 3 workers: 4032K
• 2 workers: 4608K, 4704K, 5040K, 5120K, 5184K, 5376K, 5760K, 6048K, 6144K, 6272K, 6400K, 6720K, 7056K, 7168K, 7200K, 7680K, 8064K
Here are the actual results for one of the FFT lengths used for wavefront first time tests:
Code:
Timings for 6144K FFT length (48 cores, 1 worker): 1.35 ms. Throughput: 740.08 iter/sec.
Timings for 6144K FFT length (48 cores, 2 workers): 1.16, 1.19 ms. Throughput: 1697.08 iter/sec.
Timings for 6144K FFT length (48 cores, 3 workers): 3.04, 3.07, 1.23 ms. Throughput: 1470.23 iter/sec.
Timings for 6144K FFT length (48 cores, 4 workers): 3.05, 3.02, 3.02, 3.00 ms. Throughput: 1322.79 iter/sec.
Timings for 6144K FFT length (48 cores, 6 workers): 5.47, 5.47, 5.48, 5.39, 5.37, 5.39 ms. Throughput: 1105.26 iter/sec.
Timings for 6144K FFT length (48 cores, 8 workers): 7.56, 7.54, 7.56, 7.55, 7.41, 7.50, 7.46, 7.44 ms. Throughput: 1066.38 iter/sec.
Timings for 6144K FFT length (48 cores, 12 workers): 11.56, 11.61, 11.62, 12.32, 11.57, 11.51, 11.54, 11.55, 11.29, 11.40, 11.43, 11.25 ms. Throughput: 1039.05 iter/sec.
Timings for 6144K FFT length (48 cores, 16 workers): 20.99, 20.72, 20.95, 20.82, 20.78, 21.04, 20.89, 20.78, 14.67, 13.45, 14.54, 14.61, 13.46, 13.71, 13.94, 14.91 ms. Throughput: 949.13 iter/sec.
Timings for 6144K FFT length (48 cores, 24 workers): 57.30, 56.56, 56.51, 56.29, 56.69, 56.99, 56.67, 56.94, 56.71, 56.65, 56.85, 56.70, 26.03, 30.28, 25.78, 27.11, 29.24, 29.75, 27.03, 29.71, 27.35, 28.07, 30.19, 28.15 ms. Throughput: 637.96 iter/sec.
Timings for 6144K FFT length (48 cores, 48 workers): 130.05, 132.14, 128.50, 128.65, 129.51, 129.92, 128.45, 129.71, 128.78, 129.41, 130.18, 128.95, 130.09, 129.10, 130.14, 129.61, 128.04, 130.51, 129.25, 129.42, 129.92, 130.49, 129.53, 131.25, 86.04, 102.15, 87.65, 103.38, 74.75, 91.32, 91.88, 76.62, 75.89, 103.66, 101.44, 101.42, 95.30, 93.57, 79.79, 102.96, 72.71, 95.29, 98.47, 87.29, 100.54, 87.55, 94.32, 102.50 ms. Throughput: 449.43 iter/sec.
MPrime by default wanted to use 12 workers, but 2 workers is significantly faster: at the 6144K FFT length, 1697.08 vs. 1039.05 iter/sec, about 63% more throughput.
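
For anyone who wants to repeat this comparison on their own results, here is a minimal Python sketch that parses "Timings for ..." lines in the format shown above and picks the worker count with the highest reported throughput per FFT length. The function names (`parse_timings`, `best_workers`, `estimated_throughput`) are made up for illustration and are not part of MPrime. The last helper rests on an assumption: the reported throughput looks like roughly the sum of the per-worker rates (1000 / ms per iteration), which matches the sample numbers to within about 1%, presumably because the printed timings are rounded.

```python
import re

# Sample lines copied from the 6144K results above (c5.metal).
SAMPLE = """\
Timings for 6144K FFT length (48 cores, 1 worker): 1.35 ms. Throughput: 740.08 iter/sec.
Timings for 6144K FFT length (48 cores, 2 workers): 1.16, 1.19 ms. Throughput: 1697.08 iter/sec.
Timings for 6144K FFT length (48 cores, 3 workers): 3.04, 3.07, 1.23 ms. Throughput: 1470.23 iter/sec.
Timings for 6144K FFT length (48 cores, 4 workers): 3.05, 3.02, 3.02, 3.00 ms. Throughput: 1322.79 iter/sec.
"""

LINE_RE = re.compile(
    r"Timings for (?P<fft>\d+)K FFT length \((?P<cores>\d+) cores?, "
    r"(?P<workers>\d+) workers?\): (?P<times>[\d., ]+) ms\. +"
    r"Throughput: (?P<tput>[\d.]+) iter/sec\."
)


def parse_timings(text):
    """Return one dict per 'Timings for ...' line found in text."""
    results = []
    for m in LINE_RE.finditer(text):
        results.append({
            "fft_k": int(m.group("fft")),
            "cores": int(m.group("cores")),
            "workers": int(m.group("workers")),
            "ms_per_iter": [float(t) for t in m.group("times").split(",")],
            "throughput": float(m.group("tput")),
        })
    return results


def best_workers(results):
    """Map each FFT length (in K) to the worker count with the highest throughput."""
    best = {}
    for r in results:
        current = best.get(r["fft_k"])
        if current is None or r["throughput"] > current["throughput"]:
            best[r["fft_k"]] = r
    return {fft: r["workers"] for fft, r in best.items()}


def estimated_throughput(ms_per_iter):
    """Assumption: total iter/sec is about the sum of each worker's 1000 / ms."""
    return sum(1000.0 / ms for ms in ms_per_iter)


if __name__ == "__main__":
    parsed = parse_timings(SAMPLE)
    print(best_workers(parsed))
    for r in parsed:
        print(r["workers"], r["throughput"], round(estimated_throughput(r["ms_per_iter"]), 2))
```

Pointing `parse_timings` at the contents of a full results.bench.txt file should give the same kind of per-FFT-length summary as the bullet list above, provided the file uses this exact line format.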

#68
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

449 Posts

Quote:
 Originally Posted by tdulcet
 I ran the throughput benchmark on a c5.metal instance and got different results.
Are you using the same CPUs as I had - 2 x Intel Xeon Platinum 8275CL CPU @ 3.00GHz? The c5.metal does not specify the CPU type.

Amazon AWS seems to use some odd CPUs, which makes them difficult to use in other machines. I had a couple of high-spec CPUs (I think 26 core 2.6 GHz), but they would not run in my Dell 7920. Apparently they were used by Amazon for AWS, but were not supported by many motherboards.

#69
tdulcet

"Teal Dulcet"
Jun 2018

1078 Posts

Quote:
 Originally Posted by drkirkby
 Are you using the same CPUs as I had - 2 x Intel Xeon Platinum 8275CL CPU @ 3.00GHz? The c5.metal does not specify the CPU type.
Yes, the same CPU:
Code:
$ wget https://raw.github.com/tdulcet/Linux-System-Information/master/info.sh -qO - | bash -s

Linux Distribution:             Ubuntu 20.04.3 LTS
Linux Kernel:                   5.11.0-1020-aws
Computer Model:                 Amazon EC2 c5.metal 1.0
Processor (CPU):                Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU Cores/Threads:              48/96
Architecture:                   x86_64 (64-bit)
Total memory (RAM):             193053 MiB (189GiB) (202431 MB (203GB))
Total swap space:               0 MiB (0 MB)
Disk space:                     nvme0n1: 512000 MiB (500GiB) (536870 MB (537GB))
I was not sure how many workers would be used, so I ended up getting much more disk space than I now need for the PRP proof files, even at proof power 10...

2022-09-14, 15:22
#70
tdulcet

"Teal Dulcet"
Jun 2018

71 Posts

AWS has new C6i instances, which have 33% more CPU cores and memory compared to the C5 instances and, they claim, provide up to 9% higher memory bandwidth. I ran the MPrime throughput benchmark on a c6i.metal instance and they are just over 3% faster when accounting for the extra CPU cores. As before, two workers were faster at the higher FFT lengths. Here are the fastest numbers of workers for each supported FFT length benchmarked by default:
• 4 workers: 3072K, 3136K, 3200K, 3360K, 3456K, 3600K, 3840K, 3920K, 4032K, 4200K, 4320K, 4480K, 4608K, 4704K, 4800K, 5040K, 5120K, 5184K, 5760K
• 2 workers: 5376K, 6048K, 6144K, 6272K, 6400K, 6720K, 7056K, 7168K, 7200K, 7680K, 8064K
and the actual results for one of the FFT lengths used for wavefront first time tests:
Code:
Timings for 6144K FFT length (64 cores, 1 worker): 1.24 ms. Throughput: 807.35 iter/sec.
Timings for 6144K FFT length (64 cores, 2 workers): 0.86, 0.87 ms. Throughput: 2318.22 iter/sec.
Timings for 6144K FFT length (64 cores, 4 workers): 1.95, 1.91, 1.98, 1.94 ms. Throughput: 2058.45 iter/sec.
Timings for 6144K FFT length (64 cores, 8 workers): 5.27, 5.25, 5.25, 5.27, 5.28, 5.23, 5.30, 5.26 ms. Throughput: 1520.27 iter/sec.
Timings for 6144K FFT length (64 cores, 16 workers): 11.41, 11.86, 11.38, 11.40, 11.26, 11.44, 11.39, 11.43, 11.37, 11.28, 11.30, 11.13, 11.39, 11.33, 11.33, 11.42 ms. Throughput: 1405.95 iter/sec.
Timings for 6144K FFT length (64 cores, 32 workers): 23.49, 23.06, 23.28, 23.11, 23.65, 23.80, 23.32, 23.34, 23.03, 23.23, 23.34, 23.20, 23.13, 23.23, 23.33, 23.40, 23.42, 23.30, 23.32, 23.12, 23.20, 23.21, 23.22, 23.36, 23.16, 23.22, 23.56, 23.30, 23.16, 23.24, 23.39, 23.52 ms. Throughput: 1373.40 iter/sec.
Timings for 6144K FFT length (64 cores, 64 workers): 47.16, 47.00, 47.11, 47.05, 46.94, 47.28, 46.44, 46.86, 46.70, 47.37, 46.98, 46.54, 46.59, 47.13, 46.90, 47.48, 46.78, 47.04, 47.02, 46.70, 47.30, 46.80, 47.10, 46.58, 47.08, 46.86, 46.91, 47.04, 47.32, 46.99, 47.56, 47.58, 47.04, 47.14, 46.68, 47.16, 47.37, 46.90, 47.05, 46.90, 46.88, 46.86, 46.83, 47.05, 47.24, 46.82, 46.86, 46.95, 46.49, 47.07, 47.31, 46.97, 47.04, 47.16, 47.00, 46.72, 46.85, 47.16, 46.63, 46.64, 47.34, 47.38, 47.11, 47.50 ms. Throughput: 1361.62 iter/sec.
I attached the full results.bench.txt file. MPrime by default wanted to use 16 workers.
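
As a rough check of the "accounting for the extra CPU cores" comparison, here is a small Python sketch that normalizes the best 6144K throughput on each instance by its core count. It uses only the two-worker 6144K figures quoted in this thread, so it covers a single FFT length; the "just over 3%" figure above presumably reflects the whole benchmark, and this sketch does not try to reproduce it. The names `C5`, `C6I`, `per_core`, and `percent_gain` are illustrative only.

```python
# Best 6144K results quoted in this thread (2 workers on both instances).
C5 = {"cores": 48, "throughput": 1697.08}   # c5.metal, iter/sec
C6I = {"cores": 64, "throughput": 2318.22}  # c6i.metal, iter/sec


def per_core(instance):
    """Throughput divided by physical core count (iter/sec per core)."""
    return instance["throughput"] / instance["cores"]


def percent_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    print("extra cores:       %.1f%%" % percent_gain(C6I["cores"], C5["cores"]))
    print("total throughput:  %.1f%%" % percent_gain(C6I["throughput"], C5["throughput"]))
    print("per-core gain:     %.1f%%" % percent_gain(per_core(C6I), per_core(C5)))
```

At 6144K alone the per-core gain comes out at a little under 3%, so the exact figure depends on which FFT lengths are included in the comparison.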
When using my Mlucas install script to do a throughput benchmark with Mlucas, either 32 or 64 workers were faster for most FFT lengths, but with only half the total throughput:
Code:
Benchmark Summary

Adjusted msec/iter times (ms/iter) vs Actual iters/sec total throughput (iter/s) for each combination

FFT     #1                #2                #3                #4                #5                #6                #7                #8                #9                #10               #11               #12               #13               #14
length  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s  ms/iter   iter/s
2048K       9.2 3877.988      9.9 4049.134     10.8 3677.103    12.16 4297.167    16.48 3212.725    41.28 1524.390   197.76  308.261     8.61 3204.588     9.04 3302.408    10.08 3315.682    12.08 4032.574       20 2565.038    90.88 1011.809        -        -
2304K     11.08 3387.758    11.48 3209.781    13.16 3084.030    15.36 3222.328     21.6 2330.516    47.68 1339.686        -        -    10.17 2816.611    10.92 2933.168    12.04 2714.630    14.64 3034.511    25.12 2059.512        -        -        -        -
2560K     11.98 3071.143    12.44 2906.295    13.56 2851.886    15.76 2918.760    21.92 2293.556    49.92 1284.611        -        -    11.07 2551.369    11.62 2421.313    13.04 2466.145     16.4 2814.493    26.24 1960.345        -        -        -        -
2816K     14.06 2743.639    14.88 2596.983     16.4 2348.945    18.32 2365.856    25.12 2038.319    54.08 1090.517        -        -    12.79 2233.350    13.78 2423.773    14.92 2132.498    18.56 2271.102    27.84 1774.640        -        -        -        -
3072K      14.2 2572.612    14.82 2587.992    15.96 2337.116     18.8 2569.891     24.8 2052.901    53.76 1078.431        -        -    13.28 2091.157    13.84 2085.101    14.88 1936.213    19.28 1794.908    29.76 1704.247        -        -        -        -
3328K     16.13 2347.267    16.78 2249.035    18.32 1831.000    20.88 1857.355    26.56 2006.182    55.68 1018.065        -        -    14.97 1908.566     15.8 2041.972       17 1875.230    20.32 1549.497    30.24 1655.682        -        -        -        -
3584K     16.74 2184.298     17.4 2071.261    18.84 2015.334    22.08 1701.082    26.24 2001.601    55.68 1008.492        -        -    15.67 1776.107    16.12 1700.491    17.76 1779.794    21.28 1468.572    29.92 1642.884        -        -        -        -
3840K     18.91 2007.035     19.2 1927.621    21.24 1824.570     22.8 1704.757       28 1827.642    56.64 1004.005        -        -    17.55 1642.385    18.74 1571.324    19.76 1593.616     22.4 1345.302    32.16 1562.182        -        -        -        -
4096K     19.24 1908.389    19.76 1829.629    20.84 1773.479     22.8 1648.035    29.12 1884.797    53.76 1008.081   222.72  275.938    18.56 1555.231     19.2 1493.142    19.88 1434.561     22.4 1261.338    32.48 1575.412   126.08  817.072        -        -
4608K     23.77 1425.222    24.22 1403.128    25.24 1370.652    26.88 1273.997    34.72 1353.899    63.36  838.937   242.56  252.525    21.44 1513.768    22.08 1453.030    23.12 1374.998    26.56 1078.660     38.4 1202.060   133.44  749.489        -        -
5120K        26 1299.523    26.62 1270.726    28.28 1259.305    29.28 1151.423    35.52 1273.578    68.48  929.660   257.28  240.442    23.68 1365.599    24.44 1319.097    25.04 1250.206    28.08  987.904     40.8 1073.267   138.56  654.660        -        -
5632K     31.53 1139.601    32.04 1121.889    33.48 1099.795    34.96 1011.509    43.84 1035.287    74.88  728.940   280.32  206.228    27.91 1228.281    28.54 1217.586    29.76 1130.747    33.04  887.546    44.16  864.658   159.36  593.240        -        -
6144K     34.24 1054.845    34.98 1069.896    36.48 1020.605    38.16  935.837    46.56  877.383    79.04  691.432   277.12  212.540    30.84 1117.139    31.42 1083.632     32.4 1018.938    35.68  826.421    48.48  737.946   158.72  577.898        -        -
6656K     36.88  978.016    38.14  967.198    39.28  947.125    40.64  864.520    49.76  742.998    86.08  643.316   288.64  210.393    32.93 1047.741    33.64 1004.116    34.84  956.815    39.28  774.321    52.48  636.751    161.6  568.586        -        -
7168K     40.28  909.258    41.96  898.860    43.24  894.615    44.24  826.761     53.6  676.205     86.4  637.452   288.64  215.889    35.86  969.420    36.72  959.746     37.4  878.828     40.8  724.044    54.08  579.357   170.88  546.273        -        -
7680K     45.68  827.973    47.08  823.036    48.76  807.410    50.48  740.681     58.4  607.710    93.12  611.168   309.76  189.394    40.41  895.935    41.44  863.145    42.16  820.114    46.32  657.822    60.64  536.359   179.84  404.555        -        -

Fastest combination
#   Workers/Runs  Threads  First -cpu argument
1   64            1        0

Mean ± σ std dev faster  #   Workers/Runs  Threads  First -cpu argument
1.023 ± 0.028 (2.3%)     2   32            2        0:1
1.079 ± 0.065 (7.9%)     3   16            4        0:3
1.119 ± 0.089 (11.9%)    4   8             8        0:7
1.211 ± 0.134 (21.1%)    5   4             16       0:15
1.951 ± 0.439 (95.1%)    6   2             32       0:31
6.030 ± 2.440 (503.0%)   7   1             64       0:63
1.097 ± 0.141 (9.7%)     8   64            2        0,64
1.107 ± 0.133 (10.7%)    9   32            4        0:1,64:65
1.158 ± 0.122 (15.8%)    10  16            8        0:3,64:67
1.300 ± 0.156 (30.0%)    11  8             16       0:7,64:71
1.426 ± 0.145 (42.6%)    12  4             32       0:15,64:79
2.137 ± 0.627 (113.7%)   13  2             64       0:31,64:95
I attached the full bench.txt file.

For reference, here is the system information:
Code:
$ wget https://raw.github.com/tdulcet/Linux-System-Information/master/info.sh -qO - | bash -s

Linux Distribution:             Ubuntu 22.04 LTS
Linux Kernel:                   5.15.0-1011-aws
Computer Model:                 Amazon EC2 c6i.metal 110-003545-001
Processor (CPU):                Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Architecture:                   x86_64 (64-bit)
Total memory (RAM):             257746 MiB (252GiB) (270266 MB (271GB))
Total swap space:               0 MiB (0 MB)
Disk space:                     nvme0n1: 30720 MiB (30GiB) (32212 MB (33GB))
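
To make the "First -cpu argument" column of the Mlucas benchmark summary easier to read, here is a small Python sketch that expands strings such as `0:3,64:67` into explicit logical CPU indices. It assumes that `lo:hi` is an inclusive range and that commas separate independent ranges; that reading agrees with the Threads column of the summary for every combination listed, but the real Mlucas option may accept other forms that this sketch does not handle. The function name `expand_cpu_argument` is hypothetical and not part of Mlucas or the install script.

```python
def expand_cpu_argument(arg):
    """Expand a '-cpu' style string like '0:3,64:67' into a list of CPU indices.

    Assumption: 'lo:hi' means the inclusive range lo..hi, and commas join ranges.
    """
    cpus = []
    for part in arg.split(","):
        if ":" in part:
            lo, hi = (int(x) for x in part.split(":"))
            cpus.extend(range(lo, hi + 1))
        else:
            cpus.append(int(part))
    return cpus


# (first -cpu argument, threads per worker) copied from the benchmark summary.
SUMMARY = [
    ("0", 1), ("0:1", 2), ("0:3", 4), ("0:7", 8), ("0:15", 16),
    ("0:31", 32), ("0:63", 64), ("0,64", 2), ("0:1,64:65", 4),
    ("0:3,64:67", 8), ("0:7,64:71", 16), ("0:15,64:79", 32),
    ("0:31,64:95", 64),
]

if __name__ == "__main__":
    for arg, threads in SUMMARY:
        cpus = expand_cpu_argument(arg)
        # The expanded list length should equal the Threads column.
        print(arg, threads, len(cpus) == threads)
```

For example, `0:3,64:67` expands to CPUs 0 through 3 plus 64 through 67, which is eight threads for the first worker of combination #10.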
