mersenneforum.org  

2020-10-15, 00:02   #1
M344587487
"Composite as Heck"
Oct 2017
2×349 Posts

Ryzen 4700u Benchmarks

The usual benchmarking has been done in the usual threads, but it's been too long since I've properly fondled some hardware, so I plan to bench this to hell and back. The hardware is an Asus PN50 SFF PC paired with 2x16GB SO-DIMM DDR4 3200 CL22 DR and a 1TB NVMe TLC M.2 SSD. The BIOS is extremely basic, with no overclocking options or much of anything else, so the RAM is what it is AFAIK, and the CPU/GPU can only be fiddled with using whatever tools exist for Linux.

ryzenadj ( https://github.com/FlyGoat/RyzenAdj ) seems to work well at setting power targets for the CPU, so that's how I've tested underclocking with M60721417:
Code:
CPU Clock     Power    Power At
  (MHz)     Target (W)  Wall (W)  ms/it   it/s   j/it  Command
  2825        20         33.5      4.87  205.3  0.163  "ryzenadj --stapm-limit=20000 --fast-limit=20000 --slow-limit=20000"
  2400        15         27        4.93  203.0  0.133  "ryzenadj --stapm-limit=15000 --fast-limit=15000 --slow-limit=15000"
  2025        13         24        5.01  199.6  0.120  "ryzenadj --stapm-limit=13000 --fast-limit=13000 --slow-limit=13000"
  1730        12         23        5.15  194.1  0.118  "ryzenadj --stapm-limit=12000 --fast-limit=12000 --slow-limit=12000"
  1450        11.5       21.5      5.52  181.1  0.118  "ryzenadj --stapm-limit=11500 --fast-limit=11500 --slow-limit=11500"
  1400        11         21        5.55  180.1  0.116  "ryzenadj --stapm-limit=11000 --fast-limit=11000 --slow-limit=11000"
  1375        10.5       20.5      5.85  170.9  0.119  "ryzenadj --stapm-limit=10500 --fast-limit=10500 --slow-limit=10500"
  1350        10         20        6.14  162.8  0.122  "ryzenadj --stapm-limit=10000 --fast-limit=10000 --slow-limit=10000"
  1250         8.5       17.5      7.77  128.7  0.135  "ryzenadj --stapm-limit=8500 --fast-limit=8500 --slow-limit=8500"
   630         7         15       10.15   98.5  0.152  "ryzenadj --stapm-limit=7000 --fast-limit=7000 --slow-limit=7000"
   400         5         11.5     16.90   59.1  0.194  "ryzenadj --stapm-limit=5000 --fast-limit=5000 --slow-limit=5000"
Stock settings appear to target 25W for a few minutes before settling down to a 15W target for a sustained workload, so the 15W row above is representative of stock settings. The clock, power at the wall and ms/it are all eyeballed; most of the clocks varied within a few hundred MHz (except the 1400 and 400 ones, which remained solid). The wall power figures have the potential to be lower than shown: the NVMe drive is potentially power hungry, and there are twice as many RAM chips as necessary if you can find dual rank 2x8 SO-DIMMs. Wi-Fi was on, Bluetooth may have been, and Ubuntu 20.04 is stock with no power saving measures attempted yet.
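
For anyone wanting to redo the derived columns, here is a minimal Python sketch of how it/s and j/it appear to follow from the eyeballed wall power and ms/it (it/s = 1000 / ms/it, j/it = wall watts × ms/it / 1000). The function names are made up for illustration, and the ryzenadj flags are only the ones already used in the table.

```python
def derived_metrics(wall_watts, ms_per_it):
    """Return (iterations per second, joules per iteration at the wall)."""
    it_per_s = 1000.0 / ms_per_it
    joules_per_it = wall_watts * ms_per_it / 1000.0
    return it_per_s, joules_per_it


def ryzenadj_command(target_watts):
    """Build the ryzenadj call used in the table; limits are given in milliwatts."""
    mw = round(target_watts * 1000)
    return f"ryzenadj --stapm-limit={mw} --fast-limit={mw} --slow-limit={mw}"


# (power target W, wall W, ms/it) copied from a few rows of the table
rows = [(20, 33.5, 4.87), (15, 27, 4.93), (11, 21, 5.55), (5, 11.5, 16.90)]

for target, wall, ms in rows:
    it_s, j_it = derived_metrics(wall, ms)
    print(f"{target:>5} W  {it_s:6.1f} it/s  {j_it:.3f} j/it  {ryzenadj_command(target)}")
```

The results should agree with the table to within rounding of the last digit, since the inputs themselves are eyeballed.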

Tried to run FlopsCL ( http://olab.is.s.u-tokyo.ac.jp/~kami.../projects.html ) to measure the GFlops of the iGPU. It compiled after setting OPENCL_LIBRARY_DIR = /opt/rocm/lib/ and OPENCL_INCLUDE_DIR = /opt/rocm/opencl/include/ in the Makefile, but the kernels fail. It might be because the ROCm install doesn't appear to include OpenCL 2.0? Using /opt/rocm/opencl/lib/ made no difference:
Code:
pn50@pn50:~/FlopsCL_src_linux$ ./flops

        1 OpenCL platform(s) detected:

        Platform 0: Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing OpenCL 2.0 AMD-APP (3186.0), FULL_PROFILE

                1 device(s) found supporting OpenCL:

                Device 0:
                        CL_DEVICE_NAME = gfx900
                        CL_DEVICE_VENDOR = Advanced Micro Devices, Inc.
                        CL_DEVICE_VERSION = OpenCL 2.0 
                        CL_DRIVER_VERSION = 3186.0 (HSA1.1,LC)
                        CL_DEVICE_MAX_COMPUTE_UNITS = 27
                        CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
                        CL_DEVICE_MAX_WORK_ITEM_SIZES = 1024 / 1024 / 1024 
                        CL_DEVICE_MAX_WORK_GROUP_SIZE = 256
                        CL_DEVICE_MAX_CLOCK_FREQUENCY = 1600 MHz
                        CL_DEVICE_GLOBAL_MEM_SIZE = 512 MB
                        CL_DEVICE_ERROR_CORRECTION_SUPPORT = NO
                        CL_DEVICE_LOCAL_MEM_SIZE = 64 kB
                        CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = 445644 kB
                                Compiling...
                                Starting tests...
ERROR: clEnqueueNDRangeKernel failed, cl_invalid_work_group_size
                                [float   ] Time: 0.016776s, 16385.00 GFLOP/s
ERROR: clEnqueueNDRangeKernel failed, cl_invalid_work_group_size
                                 [float2  ] Time: 0.016776s, 32770.00 GFLOP/s
The 27 CU figure looks suspect and there are 7 GPU cores that don't look like they're represented, but it could be legit for all I know. If you squint real hard it almost looks like a 16 TFlops iGPU exists so that's fun.

2020-10-15, 09:17   #2
M344587487
"Composite as Heck"
Oct 2017
2BA₁₆ Posts

Power Saving Tools

Tried a few power saving tools. tlp on its own gave some nice power savings; because of the way the CPU has been "underclocked" (by capping the power target), this translates into better iteration timings instead of lower power consumption. Adding powertop on top of tlp regresses the power savings:
Code:
CPU Clock     Power    Power At
  (MHz)     Target (W)  Wall (W)  ms/it   it/s   j/it  Command
  1400        11         21        5.55  180.2  0.117 "ryzenadj --stapm-limit=11000 --fast-limit=11000 --slow-limit=11000"
  1400        11         21        5.18  193.0  0.109 "tlp bat && ryzenadj --stapm-limit=11000 --fast-limit=11000 --slow-limit=11000"
  1400        11         21        5.28  189.4  0.111 "powertop && tlp bat && ryzenadj --stapm-limit=11000 --fast-limit=11000 --slow-limit=11000"

2020-10-15, 16:26   #3
M344587487
"Composite as Heck"
Oct 2017
2×349 Posts

Got FlopsCL working by lowering the THREADS_PER_BLOCK variable from the default of 1024 to 256. The average of some runs yields this:
Code:
[float   ] 1410.088 GFlop/s
[float2  ] 1416.99 GFlop/s
[float4  ] 1424.106 GFlop/s
[float8  ] 1427.09 GFlop/s
[float16 ] 1427.668 GFlop/s
[double  ] 89.354 GFlop/s
[double2 ] 89.364 GFlop/s
[double4 ] 89.39 GFlop/s
[double8 ] 89.39 GFlop/s
[double16] 89.39 GFlop/s
Does this look right? I think the paper figure should be 1433.6 GFlop/s of SP and 89.6 GFlop/s of DP, based on this formula and a 1:16 ratio, but I could be wrong:
Code:
shader_units * speed_MHz * instructions_per_clock / 1000
= 448 * 1600 * 2 / 1000
= 1433.6
Assuming the above is right and FlopsCL is a pretty accurate reflection of GFlop/s, can someone with a Radeon VII run FlopsCL to determine once and for all what the DP ratio is for that card? There was some question as to if the ratio was 1:4 like they said it was or if it got the full 1:2 treatment, I don't think there was ever a definitive answer. FlopsCL is here if you want to try it: http://olab.is.s.u-tokyo.ac.jp/~kami.../projects.html
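
The paper figures quoted in this post can be checked against the measured FlopsCL averages with a few lines of Python. This is only a sketch: peak_gflops is a made-up helper name, the factor of 2 assumes one fused multiply-add per shader per clock, and the 1:16 DP ratio is the assumption stated in the post, not a confirmed spec.

```python
def peak_gflops(shader_units, clock_mhz, flops_per_clock=2):
    """Paper throughput in GFlop/s; 2 flops per clock assumes one FMA per shader per cycle."""
    return shader_units * clock_mhz * flops_per_clock / 1000.0


sp_peak = peak_gflops(448, 1600)   # 448 shaders at 1600 MHz, as in the post
dp_peak = sp_peak / 16             # assumed 1:16 DP:SP ratio

# Best measured averages from the FlopsCL runs in this post
sp_measured = 1427.668  # float16
dp_measured = 89.39     # double4 and wider

print(f"SP paper {sp_peak:.1f} GFlop/s, measured {100 * sp_measured / sp_peak:.1f}% of that")
print(f"DP paper {dp_peak:.1f} GFlop/s, measured {100 * dp_measured / dp_peak:.1f}% of that")
print(f"measured SP:DP ratio {sp_measured / dp_measured:.2f}:1")
```

Both measured figures come out above 99% of the paper numbers, and the measured SP:DP ratio lands just under 16:1, which is at least consistent with the formula.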

2020-10-15, 17:43   #4
Xyzzy
"Mike"
Aug 2002
1111010000010₂ Posts

Quote:
Originally Posted by M344587487
ryzenadj ( https://github.com/FlyGoat/RyzenAdj ) seems to work well at setting power targets for the CPU…
Have you looked at this?

https://github.com/sbski/Renoir-Mobile-Tuning

2020-10-15, 20:28   #5
M344587487
"Composite as Heck"
Oct 2017
2BA₁₆ Posts

Thanks I hadn't come across that one. It's C# so I probably can't get it to run on Linux, always had trouble getting supposedly cross-platform .NET stuff running. Doesn't look like it matters either way as I'm running headless via ssh and it's a GUI program. This is probably one of the better GUI tools for Windows users, I'm just in the exact wrong niche.

2020-10-16, 13:02   #6
M344587487
"Composite as Heck"
Oct 2017
2BA₁₆ Posts

Run from RAM, no NVMe present

Tried running an Ubuntu livecd from RAM (by adding toram as a boot parameter, so the USB stick can be removed and the entire OS resides in RAM) so that the NVMe could be removed entirely to see how much impact the SSD power draw has. It was a massive failure. The OS was updated and the power target set to 11W; tlp was not installed as it wouldn't install on the livecd, so the comparison point is in the first table. The no-NVMe timings are just over 8 ms/it with the same wall power draw, a massive regression compared to the 5.55 ms/it figure from above. The 5.55 ms/it figure was headless, but I tried again with it not headless using both i3 and gnome and the timings stayed the same within variability.

A livecd uses a few extra gigs of RAM as storage, but they shouldn't be in active use so that shouldn't hinder p95 timings, and I was under the impression that all 32GB of RAM has to be refreshed regardless of whether it's in use, so RAM occupancy shouldn't matter when it comes to power draw. To rule occupancy out, a ramdisk occupying most of the remaining RAM was created and filled with random data instead of zeroes, just in case that matters. As expected there was no noticeable difference to timings or power use.

Can't see how removing the NVMe can be a massive negative, at worst there should be no difference and at best a power save, so the conclusion is that a livecd just has some undetermined nonsense that makes it unsuitable for use as a compute environment. Doesn't make a lot of sense as if anything I'd expect a livecd to be lighter on the system than its installed counterpart, but there we are.

Last fiddled with by M344587487 on 2020-10-16 at 13:47 Reason: title

2020-10-17, 14:26   #7
aheeffer
Aug 2020
25 Posts

Quote:
Originally Posted by M344587487
Assuming the above is right and FlopsCL is a pretty accurate reflection of GFlop/s, can someone with a Radeon VII run FlopsCL to determine once and for all what the DP ratio is for that card? There was some question as to if the ratio was 1:4 like they said it was or if it got the full 1:2 treatment, I don't think there was ever a definitive answer. FlopsCL is here if you want to try it: http://olab.is.s.u-tokyo.ac.jp/~kami.../projects.html
I wrote these down from a Radeon VII forum but forgot to save the link:

fp32 13481
fp64 3417

2020-10-17, 15:22   #8
chris2be8
Sep 2009
11110011010₂ Posts

Quote:
Originally Posted by M344587487
Can't see how removing the NVMe can be a massive negative,
Try running the live CD with the NVMe plugged in but not being used, and compare power draw with when it's not plugged in. That should isolate its power draw. You probably want to compare both cases when the system's idle and when it's under load (see if the difference between idle and busy varies if the NVMe is there).

Chris

2020-10-17, 16:12   #9
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
3⁴·71 Posts

Quote:
Originally Posted by M344587487
Thanks I hadn't come across that one. It's C# so I probably can't get it to run on Linux, always had trouble getting supposedly cross-platform .NET stuff running. Doesn't look like it matters either way as I'm running headless via ssh and it's a GUI program. This is probably one of the better GUI tools for Windows users, I'm just in the exact wrong niche.
I believe .net core is much better at running on linux

2020-10-20, 16:45   #10
M344587487
"Composite as Heck"
Oct 2017
2×349 Posts

Quote:
Originally Posted by aheeffer
I wrote these down from a Radeon VII forum but forgot to save the link:

fp32 13481
fp64 3417
Thanks, 1:4 ratio it is.
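
As a quick sanity check of that conclusion, a small Python sketch, assuming the two quoted numbers are GFlop/s measured the same way (the candidate ratios listed are just the ones mentioned in this thread):

```python
fp32_gflops = 13481  # quoted Radeon VII figures, assumed to be GFlop/s
fp64_gflops = 3417

ratio = fp32_gflops / fp64_gflops
candidates = {"1:2": 2, "1:4": 4, "1:16": 16}
# Pick the nominal DP:SP ratio whose divisor is closest to the measured one
closest = min(candidates, key=lambda name: abs(candidates[name] - ratio))

print(f"measured SP/DP = {ratio:.2f}, closest nominal ratio is {closest}")
```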
Quote:
Originally Posted by chris2be8
Try running the live CD with the NVMe plugged in but not being used, and compare power draw with when it's not plugged in. That should isolate its power draw. You probably want to compare both cases when the system's idle and when it's under load (see if the difference between idle and busy varies if the NVMe is there).

Chris
I'll revisit NVMe when Ubuntu 20.10 is out in a few days and try what you suggest. Trying to get the iGPU running nicely ATM which is trickier than it sounds.
Quote:
Originally Posted by henryzz
I believe .net core is much better at running on linux
It probably is, I'm just stubbornly against the entire ecosystem as I've been burned too many times. If I had a penny for every community tool I've encountered that doesn't have source (or does, but it's not written for Core) and only has a Windows .NET binary that absolutely fails in Wine, even if you use MS's libraries that you need to accept their EULA for, well, I'd have a handful of pennies, but it's very annoying. EXE tools written in C/C++ tend to work flawlessly under Wine unless they interact closely with the hardware. I'd prefer a community tool be written in Java than .NET, which is saying something.

iGPU debugging
Did an mfakto run with stock settings but found that the iGPU endlessly cycles between being occupied for ~10 seconds and then dropping to its minimum frequency for a few seconds; a run completed in 1h37m. mfakto live output shows 110 GHz-d/day at high frequency and 55 GHz-d/day at low frequency. Wall power is 5-6W at low frequency, so the chip must be using very little power then. Searching suggests that perhaps there's a bug with the VRAM/RAM split: using more than the 512MiB dedicated to the GPU means relying on the OS to manage it dynamically, and it apparently can bottleneck somehow. Unfortunately my BIOS doesn't expose the split as a variable, so it's fixed at 512MiB. To try to confirm that going into dynamically allocated memory is what's causing the cycling, I'm setting an upper VRAM limit using amdgpu.vramlimit=x as a kernel boot parameter (default is vramlimit=0, meaning no limit):

Factor=77936863,71,72 tests:
Code:
Boot Parameter         Time (min)
amdgpu.vramlimit=256       83
amdgpu.vramlimit=448       85
amdgpu.vramlimit=512      122
No boot parameter          97
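
To put the table in relative terms, a minimal Python sketch comparing each run against the no-parameter baseline. The numbers are copied from the table; percent_change is a made-up helper name, and since these are single runs with a lot of variability the percentages are ballpark only.

```python
# Minutes per Factor=77936863,71,72 run, copied from the table
times_min = {
    "amdgpu.vramlimit=256": 83,
    "amdgpu.vramlimit=448": 85,
    "amdgpu.vramlimit=512": 122,
}
baseline_min = 97  # no boot parameter (the 1h37m stock run)


def percent_change(minutes, baseline=baseline_min):
    """Negative means the run finished faster than the baseline."""
    return 100.0 * (minutes - baseline) / baseline


for param, minutes in times_min.items():
    print(f"{param}: {minutes} min, {percent_change(minutes):+.1f}% vs no parameter")
```

That works out to roughly 14% and 12% less time for the 256 and 448 limits, and about 26% more time for 512.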
vramlimit of 256 and 448 acted similarly: the minimum frequency was still used, but for much less time than with no VRAM limit, and live output most often stuck to 110 GHz-d/day but went up to as much as 134 GHz-d/day. There's still cycling but improved throughput. vramlimit=512 is an odd result that I might investigate further at a later date (longer stretches of minimum frequency and no boosting beyond 110 GHz-d/day); it could be that it has the worst of both worlds (not enough RAM but enough to trigger a dynamic allocation bug). There is a lot of variability, so the tests are only ballpark, but they show a trend. Setting vramlimit to 511 or below spams dmesg with errors like the one below, probably because vramlimit is then below the amount dedicated to the GPU, but the tests complete fine:
Code:
[ 3581.454872] ------------[ cut here ]------------
[ 3581.454974] WARNING: CPU: 0 PID: 2178 at /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_gmc.h:265 amdgpu_cs_bo_validate+0x196/0x1c0 [amdgpu]
[ 3581.454977] Modules linked in: ccm nls_iso8859_1 rtsx_usb_ms memstick btusb btrtl btbcm btintel bluetooth joydev input_leds ecdh_generic ecc snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel tps6598x snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm iwlmvm snd_seq_midi snd_seq_midi_event mac80211 edac_mce_amd snd_rawmidi libarc4 kvm_amd ccp kvm snd_seq crct10dif_pclmul ghash_clmulni_intel snd_seq_device aesni_intel ipmi_devintf iwlwifi crypto_simd snd_timer wmi_bmof cryptd k10temp ipmi_msghandler eeepc_wmi glue_helper asus_wmi sparse_keymap snd_rn_pci_acp3x snd cfg80211 soundcore snd_pci_acp3x ite_cir rc_core ucsi_acpi typec_ucsi typec i2c_multi_instantiate mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) zlua(PO) rtsx_usb_sdmmc rtsx_usb hid_generic usbhid amdgpu(OE) amd_iommu_v2 amd_sched(OE) amdttm(OE) amdkcl(OE) i2c_algo_bit drm_kms_helper
[ 3581.455027]  nvme syscopyarea sysfillrect crc32_pclmul sysimgblt fb_sys_fops ahci drm i2c_piix4 libahci nvme_core r8169 realtek wmi video i2c_hid hid
[ 3581.455038] CPU: 0 PID: 2178 Comm: Xorg:cs0 Tainted: P        W  OE     5.4.0-51-generic #56-Ubuntu
[ 3581.455040] Hardware name: ASUSTeK COMPUTER INC. MINIPC PN50/PN50, BIOS 0416 08/27/2020
[ 3581.455137] RIP: 0010:amdgpu_cs_bo_validate+0x196/0x1c0 [amdgpu]
[ 3581.455142] Code: ff 77 42 74 1d f6 83 b8 02 00 00 01 74 14 49 8b 86 c0 00 00 00 49 39 86 d0 00 00 00 0f 83 ef fe ff ff 44 8b 03 e9 eb fe ff ff <0f> 0b e9 2f ff ff ff 8b 53 04 44 39 c2 0f 84 77 ff ff ff 41 89 d0
[ 3581.455143] RSP: 0018:ffffad61844c7a20 EFLAGS: 00010206
[ 3581.455144] RAX: 0000000000000000 RBX: ffff9be9a2b1ac00 RCX: 0000000010000000
[ 3581.455146] RDX: 0000000000000001 RSI: 0000000000040002 RDI: 0000000000000000
[ 3581.455147] RBP: ffffad61844c7a78 R08: 0000000000000002 R09: ffff9be9a2b1ac14
[ 3581.455148] R10: ffff9be9ef417848 R11: 0000000000000000 R12: ffff9be9a2b1ac50
[ 3581.455149] R13: ffff9be9a2b1ac30 R14: ffffad61844c7b80 R15: ffff9be9d95e50a0
[ 3581.455150] FS:  00007fc891a42700(0000) GS:ffff9be9ef400000(0000) knlGS:0000000000000000
[ 3581.455152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3581.455155] CR2: 000055ad591d7fe4 CR3: 00000007a467e000 CR4: 0000000000340ef0
[ 3581.455156] Call Trace:
[ 3581.455254]  amdgpu_cs_validate+0x17/0x40 [amdgpu]
[ 3581.455354]  amdgpu_cs_list_validate+0x100/0x140 [amdgpu]
[ 3581.455454]  amdgpu_cs_ioctl+0x1a55/0x1f00 [amdgpu]
[ 3581.455557]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[ 3581.455577]  drm_ioctl_kernel+0xae/0xf0 [drm]
[ 3581.455594]  drm_ioctl+0x234/0x3d0 [drm]
[ 3581.455694]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[ 3581.455792]  amdgpu_drm_ioctl+0x4e/0x80 [amdgpu]
[ 3581.455798]  do_vfs_ioctl+0x407/0x670
[ 3581.455802]  ? do_futex+0x160/0x1e0
[ 3581.455806]  ksys_ioctl+0x67/0x90
[ 3581.455808]  __x64_sys_ioctl+0x1a/0x20
[ 3581.455811]  do_syscall_64+0x57/0x190
[ 3581.455815]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3581.455819] RIP: 0033:0x7fc89834350b
[ 3581.455820] Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
[ 3581.455821] RSP: 002b:00007fc891a41868 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 3581.455823] RAX: ffffffffffffffda RBX: 00007fc891a418d0 RCX: 00007fc89834350b
[ 3581.455824] RDX: 00007fc891a418d0 RSI: 00000000c0186444 RDI: 000000000000000d
[ 3581.455825] RBP: 00000000c0186444 R08: 00007fc891a41a20 R09: 0000000000000020
[ 3581.455826] R10: 00007fc891a41a20 R11: 0000000000000246 R12: 00005606f5538940
[ 3581.455827] R13: 000000000000000d R14: 00005606f55f4c2c R15: 00005606f55eca38
[ 3581.455829] ---[ end trace f6219db7c8930d05 ]---
There is another boot parameter, vis_vramlimit=x, which can be used to limit how much dynamic memory is used; I haven't investigated it yet. The default of 0 means unlimited, so I'm unsure how to set dynamic memory to zero. Logically, dynamic memory use should naturally be 0 if vramlimit is set to 512 or less, but there may be shenanigans.
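
When juggling boot parameters like these it's easy to lose track of what the running kernel was actually booted with. A minimal Python sketch for checking, assuming a Linux system where /proc/cmdline is readable; amdgpu_params is a made-up helper name, not part of any existing tool:

```python
from pathlib import Path


def amdgpu_params(cmdline):
    """Extract amdgpu.* options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with
    proc = Path("/proc/cmdline")
    cmdline = proc.read_text() if proc.exists() else ""
    params = amdgpu_params(cmdline)
    # An absent vramlimit means the default of 0 (no limit)
    print("vramlimit =", params.get("vramlimit", "0 (default, not set)"))
```

This only reports what was passed on the command line; it doesn't show how much memory the driver actually ended up using.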

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.