2003-12-08, 12:53   #2
Dresdenboy
Apr 2003
Berlin, Germany
Ok, some more details about the use of these large pages:

Oracle managed to get an 8% speedup by using large pages. Although I have little experience in this area, I think the speedup for FFTs will be much larger, because:
  • Even if the data are already in the L1 cache, access time can increase if their memory addresses are spread over many memory pages (and thus many TLB entries).
  • The limited number of TLB entries requires fine-tuning of FFT algorithms to avoid TLB thrashing as much as possible - but this avoidance can force less efficient algorithms.
  • Why is it so hard for large FFTs to come even close to the MFLOPS of FFTs running completely inside the L1 (or L2) cache, in these times of memory prefetching?
  • A large FFT needs at least two memory read/write passes - but today's maximum transfer rates for P4/Opteron/AFX systems (6.4 GB/s, i.e. reading up to ~750 times the 1024K FFT data set per second) are hardly reachable, because the effective rate drops significantly for large strides.
I roughly estimate that a speedup of at least 10-30% could be possible.

Older discussions regarding this topic can be found here: