2021-01-16, 21:06   #70
ewmayer
2ω=0
 
 
Sep 2002
República de California

10110101101101₂ Posts

@tdulcet: Extremely busy this past month working on a high-priority 'intermediate' v19.1 release (this will restore Clang/LLVM buildability on Arm; the problem was first IDed on the new Apple M1 CPU but is more general), so alas no time to give the automation of best-total-throughput-finding the attention it deserves. But that's where folks like you come in. :)

First off - the mi64.c compile issue has been fixed in the as-yet-unreleased 19.1 code. As the mods in that file are small, I will attach it here; I suggest you save a copy of the old one so you can diff and see the changes for yourself. Briefly, a big chunk of x86_64 inline-asm needed extra wrapping inside a '#ifdef YES_ASM' preprocessor directive. That flag is def'd (or not) in mi64.h like so:
Code:
  #if(defined(CPU_IS_X86_64) && defined(COMPILER_TYPE_GCC) && (OS_BITS == 64))
    #define YES_ASM
  #endif
Re. your core/thread-combos-to-try on an example 8c/16t system, those look correct. The remaining trick, though, is figuring out which of the most promising c/t combos give the best total-throughput on the user's system. For example - sticking to just 1-thread-per-physical-core for the moment - we expect 1t to run roughly 2x slower than 2t. Say the ratio is 1.8, and the user has an 8-core system. The real question is, how does the total-throughput compare for 8x1t jobs versus 4x2t?
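
To make the 8x1t-versus-4x2t comparison concrete, here is a minimal sketch in C. The helper names total_throughput() and best_config() are hypothetical and do not come from the Mlucas sources. It assumes each job's speedup relative to a single 1-thread job has been measured with all jobs of that configuration running at the same time; a ratio timed with one job running alone, like the 1.8 above, will likely be optimistic once the jobs start competing for memory bandwidth.

```c
#include <assert.h>

/* Hypothetical helper, not part of Mlucas: total throughput of `njobs`
   simultaneous jobs, in units of the rate of one 1-thread job.
   `speedup` is how many times faster each job runs than a single
   1-thread job, ideally timed with all njobs running concurrently. */
double total_throughput(int njobs, double speedup)
{
    assert(njobs > 0 && speedup > 0.0);
    return njobs * speedup;
}

/* Hypothetical helper: index of the candidate configuration with the
   highest total throughput. Ties go to the earliest candidate. */
int best_config(const int njobs[], const double speedup[], int ncfg)
{
    assert(ncfg > 0);
    int best = 0;
    double best_tp = total_throughput(njobs[0], speedup[0]);
    for (int i = 1; i < ncfg; i++) {
        double tp = total_throughput(njobs[i], speedup[i]);
        if (tp > best_tp) {
            best_tp = tp;
            best = i;
        }
    }
    return best;
}
```

With the numbers above, total_throughput(8, 1.0) is 8 and total_throughput(4, 1.8) is about 7.2, so on paper 8x1t wins by roughly 10%. That only holds if eight simultaneous 1-thread jobs each still run at their standalone speed, which is exactly what needs timing on the user's system.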

Similarly, we usually see a steep dropoff in || scaling beyond 4 cores - but that need not imply that running two 4-thread jobs is better than one 8-thread one. If said dropoff is due to the workload saturating the memory bandwidth, we might well see a similar performance hit with two 4-thread jobs.
Attached Files
File Type: bz2 mi64.c.bz2 (75.6 KB, 48 views)