#45 |
Mar 2003
New Zealand
10010000101₂ Posts |
The main loop for SSE2 and x86-64 machines is now 100% assembly instead of a mixture of C and inline assembly, and tries to read memory in a more predictable way.
The 32-bit executable runs about 15% faster on my P4, and the 64-bit executable runs about 60% faster on my C2D (64-bit is now almost twice as fast as 32-bit on the C2D).
#46 |
Mar 2003
New Zealand
13·89 Posts |
The main loop for x86 machines without SSE2 is now 100% assembly. It runs about 30% faster on my P3.
#47 |
Mar 2003
New Zealand
485₁₆ Posts |
Version 1.0.16 has support for software prefetching, using the prefetchnta instruction available on SSE machines, or GCC's __builtin_prefetch() function for non-x86/x86-64 builds.
Prefetching should result in a speedup in the case that the sieve is too large to fit in L2 cache (each sieve term takes 8 bytes), but on some machines it results in a slowdown instead, probably because it interferes with the automatic hardware prefetcher. So before sieving starts some test runs are made with and without prefetch, and the faster method selected. Use the --verbose switch to see whether prefetch was selected.

To override the automatic selection, use these new switches:
--prefetch: Force use of prefetch.
--no-prefetch: Prevent use of prefetch.

Here are some times for a 216000 term sieve (Primegrid Cullen 10M) at p=1000e9:
Code:
                       --no-prefetch    --prefetch
P3 450MHz, 512Kb L2:     1167 p/sec     1502 p/sec   +29%
P3 600MHz, 256Kb L2:     1462 p/sec     1993 p/sec   +36%
P4 2.9GHz, 512Kb L2:    12224 p/sec    11711 p/sec    -4%
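To put the 8 bytes per term in perspective, here is a rough Python sketch of the arithmetic behind the figures above. The function names are made up for illustration and are not part of gcwsieve, and it assumes the sieve footprint is simply terms × 8 bytes, ignoring any other working data.

```python
BYTES_PER_TERM = 8  # from the post: each sieve term takes 8 bytes


def sieve_bytes(terms):
    """Approximate memory footprint of the sieve, in bytes."""
    return terms * BYTES_PER_TERM


def fits_in_l2(terms, l2_bytes):
    """True if the whole sieve would fit in an L2 cache of l2_bytes."""
    return sieve_bytes(terms) <= l2_bytes


def percent_change(without_prefetch, with_prefetch):
    """Relative change in p/sec when prefetch is enabled, in percent."""
    return 100.0 * (with_prefetch - without_prefetch) / without_prefetch


if __name__ == "__main__":
    terms = 216000  # the Primegrid Cullen 10M sieve quoted above
    print(sieve_bytes(terms), "bytes")
    print("fits in 512 KB L2:", fits_in_l2(terms, 512 * 1024))
    print("P3 450MHz: %+.0f%%" % percent_change(1167, 1502))
```

The 216000 term sieve comes to 1,728,000 bytes, more than three times the size of a 512 KB L2 cache, so none of the three test machines can hold it in L2.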
#48 |
Jun 2007
Moscow,Russia
85₁₆ Posts |
Could you provide an executable for a Windows XP Athlon machine?
#49 |
Mar 2003
New Zealand
13×89 Posts |
The executable in gcwsieve-X.Y.Z-windows-x86.zip at http://www.geocities.com/g_w_reynolds/gcwsieve/ should work, or is there a problem on that machine?
#50 |
Mar 2003
New Zealand
1157₁₀ Posts |
Version 1.0.17 should properly detect the availability of prefetch instructions on AMD machines with 3DNow! but without SSE (some earlier Athlons).
A more compact ABC file format will now be written by default. The old format will still be written if the --multisieve switch is given. Either format can be used for the input file:

Old format:
Code:
ABC $a*$b^$a$c // CW Sieved to: 100000000000 with gcwsieve
2000055 2 +1
2000110 2 +1
2000116 2 +1
2000128 2 +1

New format:
Code:
ABC $a*2^$a+1 // CW Sieved to: 100000000000 with gcwsieve
2000055
2000110
2000116
2000128
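For anyone who wants to convert files by hand, the relationship between the two layouts can be sketched in Python as below. This is only an illustration based on the two samples above, not gcwsieve's own code: the function name `to_compact` is hypothetical, and it assumes every term in the file shares the same base and the same sign, so that both can be folded into the header line.

```python
OLD_FORM = "ABC $a*$b^$a$c"


def to_compact(old_lines):
    """Convert old-format ABC lines (n, b, c per term) to the compact layout.

    Assumes all terms share one base b and one sign c (an assumption taken
    from the sample file, not a documented gcwsieve rule).
    """
    header, *terms = [ln.strip() for ln in old_lines if ln.strip()]
    form, sep, comment = header.partition(" //")
    if form != OLD_FORM:
        raise ValueError("not an old-format header: %r" % header)

    parsed = [t.split() for t in terms]          # each term: [n, b, c]
    bases = {b for _, b, _ in parsed}
    signs = {c for _, _, c in parsed}
    if len(bases) != 1 or len(signs) != 1:
        raise ValueError("mixed bases or signs cannot share one header")

    b, c = bases.pop(), signs.pop()
    new_header = f"ABC $a*{b}^$a{c}{sep}{comment}"
    return [new_header] + [n for n, _, _ in parsed]


if __name__ == "__main__":
    old = [
        "ABC $a*$b^$a$c // CW Sieved to: 100000000000 with gcwsieve",
        "2000055 2 +1",
        "2000110 2 +1",
    ]
    print("\n".join(to_compact(old)))
```

The comment after `//` is carried over unchanged, so the "Sieved to" depth is preserved.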
#51 |
Mar 2003
New Zealand
13×89 Posts |
This version has two minor bugfixes:
Test for Extended 3DNow instead of just 3DNow to determine whether the prefetchnta instruction is available on AMD CPUs. This affected K6-2 CPUs.
Use the best benchmark time instead of the average benchmark time when deciding whether or not to use software prefetching. The average times could be inaccurate when there were other processes running on the same CPU.

There are also some changes to the status line display:

The percentage of CPU usage (cpu_time/elapsed_time) is now reported. The status line alternates between these two sets of stats:
Code:
p=1071802477019, 249775 p/sec, 16 factors, 100.0% cpu, 2953 sec/factor
p=1071817422251, 249836 p/sec, 16 factors, 16.9% done, ETA 24 Aug 14:23

-R --report-primes: Reports primes/sec (the number of prime factors tested per second) instead of p/sec (the increase in p per second).
-e --elapsed-time: Reports p/sec, primes/sec, and sec/factor using elapsed time instead of CPU time.
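The reasoning behind the second bugfix above (best time rather than average time) can be shown with a small Python sketch. It is not the gcwsieve benchmark itself: the function names and the sample timings are invented for illustration. The point is that interference from another process only ever makes a run slower, so a single disturbed run inflates the average but leaves the minimum untouched.

```python
import time


def best_time(func, repeats=5):
    """Run func several times and return the fastest wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def use_prefetch(times_without, times_with):
    """Decide by best (minimum) time: robust against disturbed runs."""
    return min(times_with) < min(times_without)


def use_prefetch_by_average(times_without, times_with):
    """Decide by average time: one disturbed run can flip the result."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(times_with) < mean(times_without)


if __name__ == "__main__":
    # Invented timings in seconds; the 1.60 run was disturbed by another task.
    without = [1.00, 1.02, 1.01]
    with_pf = [0.95, 1.60, 0.96]
    print("by best time:   ", use_prefetch(without, with_pf))
    print("by average time:", use_prefetch_by_average(without, with_pf))
```

With these made-up numbers the two rules disagree: the best-time rule picks prefetch, while the average is dragged up by the one disturbed run and rejects it.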
#52 |
Mar 2003
New Zealand
13×89 Posts |
The x86-64 executable now has separate code paths optimised for Intel (Core 2) and AMD (Athlon 64) CPUs. The Athlon 64 code should be about 15% faster than previous versions. Thanks to jmblazek for testing it.
The appropriate code path should be selected automatically, but can be overridden with the --amd or --intel command-line switches. |
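As a rough picture of how automatic selection with an override might work, here is a short Python sketch. It is a guess at the general shape, not the actual gcwsieve logic: the function name is hypothetical, it assumes selection is based on the CPUID vendor string ("GenuineIntel" or "AuthenticAMD"), and the fallback to the Intel path for any other vendor is an arbitrary choice made for the example.

```python
def select_code_path(vendor, override=None):
    """Pick "amd" or "intel", letting a command-line override win.

    vendor   -- CPUID vendor string, e.g. "AuthenticAMD" or "GenuineIntel"
    override -- "amd" or "intel" if --amd / --intel was given, else None
    """
    if override is not None:
        if override not in ("amd", "intel"):
            raise ValueError("override must be 'amd' or 'intel'")
        return override
    if vendor == "AuthenticAMD":
        return "amd"
    # Assumption for this sketch only: any other vendor uses the Intel path.
    return "intel"


if __name__ == "__main__":
    print(select_code_path("AuthenticAMD"))           # automatic selection
    print(select_code_path("AuthenticAMD", "intel"))  # forced with --intel
```

Forcing the other path is mainly useful for comparing the two on one machine, since either switch simply replaces the detected value.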