mersenneforum.org  

Old 2006-06-05, 14:30   #89
R.D. Silverman
 

Quote:
Originally Posted by TheJudger
Kind of...

1024K FFT, 15 minutes of test time:

Dual-core Opteron 170 clocked at 3.2 GHz:
http://img153.imageshack.us/my.php?i...20015044gz.png

lower bound: 5 × 4000 iterations / (15 × 60 s) = 22.2 iterations per second <=> 45 ms per iteration
upper bound: 6 × 4000 iterations / (15 × 60 s) = 26.7 iterations per second <=> 37.5 ms per iteration

http://img242.imageshack.us/my.php?i...71313250sz.png
Conroe E6600 clocked at 2.7 GHz:
lower bound: 9 × 4000 iterations / (15 × 60 s) = 40.0 iterations per second <=> 25 ms per iteration
upper bound: 10 × 4000 iterations / (15 × 60 s) = 44.4 iterations per second <=> 22.5 ms per iteration

Nice.

If I provide code (source if you like) and data could you run both a
single-thread and double-thread benchmark of the lattice sieve on this
machine?
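
As a quick cross-check of the figures quoted above, the bounds follow from simple arithmetic on the screenshots; a minimal sketch (the 4000-iteration report interval and the line counts are taken from the quoted post, nothing else is assumed):

    /* Sketch: reproduce the iteration-rate bounds quoted above.
     * The screenshots show a progress line every 4000 iterations; counting
     * 5..6 such lines (Opteron) or 9..10 (Conroe) in a 15-minute window
     * bounds the iteration rate from below and above. */
    #include <stdio.h>

    static void bounds(const char *cpu, int lines_low, int lines_high,
                       double minutes, int iters_per_line)
    {
        double secs = minutes * 60.0;
        double low  = lines_low  * iters_per_line / secs;   /* iterations/s */
        double high = lines_high * iters_per_line / secs;
        printf("%-24s %.1f-%.1f iter/s  (%.1f-%.1f ms/iter)\n",
               cpu, low, high, 1000.0 / high, 1000.0 / low);
    }

    int main(void)
    {
        bounds("Opteron 170 @ 3.2 GHz",  5,  6, 15.0, 4000);
        bounds("Conroe E6600 @ 2.7 GHz", 9, 10, 15.0, 4000);
        return 0;
    }

This reproduces the 22.2-26.7 iter/s (Opteron) and 40.0-44.4 iter/s (Conroe) figures above.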
Old 2006-06-05, 15:23   #90
TheJudger
 

Quote:
Originally Posted by R.D. Silverman
If I provide code (source if you like) and data could you run both a
single-thread and double-thread benchmark of the lattice sieve on this
machine?
If I owned such a machine: sure.
But sadly it's not my machine :(

I found these screenshots in a German hardware forum.
Old 2006-06-08, 16:52   #91
dsouza123
 

There are now two companies making FPGA-based Opteron coprocessors.

http://www.eetimes.com/news/semi/sho...leID=188702712

The coprocessors plug directly into an empty CPU socket and can be dynamically reconfigured, thus permitting users to change logic configurations to better match the algorithms that need acceleration.

DRC Computer Corp. and XtremeData Inc. are delivering programmable solutions that can accelerate time-critical algorithms.

These coprocessors leverage the flexibility of Xilinx and Altera FPGAs, respectively, so that they can be configured to accelerate graphics, XML, floating point, video transcoding and other applications.

Both the DRC and XtremeData solutions are modules that combine an FPGA with static RAM, flash memory (XtremeData only), and interface logic to support 8- or 16-bit HyperTransport interfaces. DRC offers three versions of its module: the DRC100-L60ES and L60, which are based on the 60k logic cell LX60 Virtex 4 FPGA, and the DRC110-L160, which is based on the 152k logic cell LX160 FPGA.

The XD1000 from XtremeData employs Altera's largest Stratix II FPGA, the EP2S180 ... the company has several enhanced versions of the XD1000 planned for future release.

To develop the hardware-based algorithms, XtremeData leverages Altera's SOPC Builder and C2H (C-language to hardware) tools, as well as Altera's soft intellectual property blocks such as the NIOS processor core. A full development system with a dual-socket motherboard and one XD1000 module sells for about $15,000 in small quantities; the XD1000 module sells for $6,500 apiece.
Old 2006-06-08, 18:06   #92
dsouza123
 

A link with more details on the XtremeData coprocessor using Altera's FPGA:

http://www.altera.com/corporate/news...tml?f=hp&k=wn1

XtremeData has packaged the Stratix II EP2S180 device, the industry’s highest-density, highest-performance FPGA in production, onto a credit card-sized board that fits into the secondary CPU sockets of any 2P or 4P AMD Opteron processor-based motherboard. The XD1000 supports tight board-height form factors, including 1U servers, server blades and Advanced Telecom Computing Architecture (ATCA) platforms. The XD1000 includes multiple HyperTransport interfaces that are 16 bits wide running at 3.2 Gbps. It also features a 128-bit-wide DDR333 memory interface, up to 8 Mbytes of high-speed SRAM and 32 Mbytes of flash memory. Additionally, XtremeData has several next-generation variants of XD1000 planned for future release.
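
For a rough sense of these figures, here is a back-of-the-envelope sketch of the peak transfer rates. It assumes DDR333 means 333 MT/s and a 1.6 Gbps per-lane HyperTransport signaling rate per direction; the latter is an assumption, not something stated in the release:

    /* Back-of-the-envelope peak bandwidths for the interfaces quoted above.
     * Assumptions (not from the press release): DDR333 = 333 MT/s, and the
     * 16-bit HyperTransport link signals at 1.6 Gbps per lane per direction. */
    #include <stdio.h>

    int main(void)
    {
        double ddr_bytes_per_transfer = 128.0 / 8.0;        /* 128-bit interface */
        double ddr_peak = 333e6 * ddr_bytes_per_transfer;   /* bytes/s */

        double ht_lanes     = 16.0;
        double ht_lane_rate = 1.6e9;                        /* bits/s per lane (assumed) */
        double ht_peak = ht_lanes * ht_lane_rate / 8.0;     /* bytes/s, one direction */

        printf("DDR333, 128-bit interface: %.1f GB/s peak\n", ddr_peak / 1e9);
        printf("HT link, 16-bit          : %.1f GB/s peak per direction\n", ht_peak / 1e9);
        return 0;
    }

Under these assumptions the point-to-point HT link and the on-module memory interface both land in the same few-GB/s range, which is the basis of the bandwidth claim below.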

XtremeData used Altera’s SOPC Builder system integration tool and the Nios® II soft-core CPU to develop the XD1000. The XD1000 uses a HyperTransport bus to achieve low-latency communication with the host AMD Opteron processor. This means that the traditional latency chain of CPU-to-north bridge-to-south bridge (via PCI interface)-to-FPGA has been reduced to a point-to-point CPU-to-FPGA link. Compared to competing I/O board systems, the XD1000 offers a more scalable solution. It gives access to more memory (via DIMM modules) and provides higher bandwidth and lower latency interconnects than north bridge solutions, at a much lower total cost of ownership.

For example, FPGA-based hardware acceleration used in medical CT imaging runs the overall application 10 times faster when each 3-GHz AMD Opteron processor is coupled with an FPGA, resulting in significant system-level savings for power, space and cost.

The XtremeData coprocessing development system is a complete design environment. It includes a 2P AMD Opteron processor-based PC with an XD1000 coprocessor module, a reference design containing HyperTransport and DDR interfaces and a JTAG download cable for configuring the FPGA and probing internal FPGA signals using Altera’s SignalTap® II embedded logic analyzer. Altera and XtremeData are committed to jointly developing libraries and tools that can be easily used by application developers. The two companies are also working with several leading universities to make the XD1000 available as a research platform to enable additional innovations.
Attached image: altera.jpg

Old 2006-10-11, 12:30   #93
Dresdenboy
 

AMD presented more details at MPF (the Fall Microprocessor Forum):
http://www.thechannelinsider.com/pri...ls/191008.aspx
A photo of the beast:
http://news.com.com/2300-1006_3-6124...4500&subj=news

The most interesting features for Prime95 should be these (many are already known, but several details were not):
  • 128-bit SSE paths
  • 2 x 128-bit L1D bandwidth
  • 32-byte instruction fetch window (relevant since Prime95 uses lots of long SSE2 instructions)
  • 128-bit L2/NB bandwidth
  • 36 dedicated 128-bit ops in the FPU scheduler (vs. effectively only 18 128-bit ops before)
  • the FMISC unit can execute SSE MOV (128 bit/cycle)
  • an AMD slide mentioned a maximum of 2 x 128-bit SSE ops + 1 SSE MOV + 2 SSE loads per cycle (as memory operands)
  • generally 2 SSE loads/cycle (if there is no bottleneck like the one in K8, this should actually quadruple the load bandwidth during the blocks of MOVPDs (128-bit loads) at the beginning of most FFT butterfly macros; see the sketch after this list)
  • the L3 cache and separate 64-bit memory channels might help reduce latency for the multi-megabyte Prime95 working sets, especially for multiple instances (rather than a multithreaded variant), which work on different working sets
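
To illustrate the load pattern referred to in the list above, here is a hypothetical SSE2 intrinsics sketch of a radix-2 butterfly. Prime95's real macros are hand-written assembly and the butterfly here is deliberately simplified (real twiddle only); the point is the block of 128-bit loads at the top, which K8 splits into 64-bit micro-ops and Barcelona would issue two per cycle:

    /* Hypothetical sketch of the load block at the start of an FFT butterfly,
     * using SSE2 intrinsics (not Prime95's actual assembly macro).
     * Each _mm_load_pd typically compiles to a 128-bit MOVAPD load. */
    #include <emmintrin.h>
    #include <stdio.h>

    /* One radix-2 butterfly on complex doubles stored as {re, im}.
     * All pointers are assumed 16-byte aligned, as Prime95's data is. */
    static void butterfly2(double *a, double *b, const double *w)
    {
        __m128d va = _mm_load_pd(a);   /* 128-bit load: a.re, a.im          */
        __m128d vb = _mm_load_pd(b);   /* 128-bit load: b.re, b.im          */
        __m128d vw = _mm_load_pd(w);   /* twiddle, duplicated as {w, w}     */

        /* t = b * w -- real twiddle only, to keep the sketch short;
         * the load block above is what the list item is about. */
        __m128d t = _mm_mul_pd(vb, vw);

        _mm_store_pd(a, _mm_add_pd(va, t));   /* a' = a + b*w */
        _mm_store_pd(b, _mm_sub_pd(va, t));   /* b' = a - b*w */
    }

    int main(void)
    {
        /* 16-byte aligned scratch data (GCC-style alignment attribute). */
        __attribute__((aligned(16))) double a[2] = { 1.0, 2.0 };
        __attribute__((aligned(16))) double b[2] = { 3.0, 4.0 };
        __attribute__((aligned(16))) double w[2] = { 0.5, 0.5 };

        butterfly2(a, b, w);
        printf("a = {%g, %g}, b = {%g, %g}\n", a[0], a[1], b[0], b[1]);
        return 0;
    }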
Old 2006-10-11, 22:52   #94
dsouza123
 

Other AMD features (reductions):

The L1 cache drops from 128KB (64KB data and 64KB code) to 64KB (32 and 32), and the L2 drops from 1024KB to 512KB.

The 64KB L1 is a surprising change: the Athlon/Opteron chips have had 128KB since the beginning. The 512KB L2 is within the range of previous L2 sizes, which went from 256KB to 512KB and, more recently, 1024KB.
Old 2006-10-12, 07:47   #95
Dresdenboy
 

Quote:
Originally Posted by dsouza123 View Post
Other AMD features (reductions):

The L1 cache drops from 128KB (64KB data and 64KB code) to 64KB (32 and 32), and the L2 drops from 1024KB to 512KB.

The 64KB L1 is a surprising change: the Athlon/Opteron chips have had 128KB since the beginning. The 512KB L2 is within the range of previous L2 sizes, which went from 256KB to 512KB and, more recently, 1024KB.
That is still being discussed on some forums out there. But months ago someone from AMD already stated that the L1 caches will still be 2 x 64 KB per core. The die plots and die photos also support this, since the size of the L1 caches relative to the core didn't change much (the differences are mostly due to different SRAM cells). And many knowledgeable people (e.g. Hans de Vries, who discovered the 64-bit extensions in a Prescott die photo) didn't see a cache reduction.

The confusion might be caused by an AMD slide showing the cache infrastructure, where only 64 KB of L1 per core is shown. But that is actually just the infrastructure for the data cache. See here:
http://epscontest2.home.comcast.net/...ad/Slide51.JPG
About these 64 KB the slide says: "keeps most critical data", "2 128-bit data paths" (L1D and L1I together will have four 128-bit data paths in Barcelona), and "2 loads per cycle" (same as for the K8 L1D).
Old 2006-10-13, 07:09   #96
Dresdenboy
 

Confirmation of the 128 KB L1 from Johan (of AnandTech):
Quote:
On our last phone call, Damon Muzny repeated at least 3 times that the figure with the data cache might have confused a lot of people: the L1 is still 64 KB D + 64 KB I, just like it was before.
http://www.aceshardware.com/forums/r...8309&forumid=1
Old 2007-05-16, 07:06   #97
Dresdenboy
 
Optimization Manual

The "Software Optimization Guide for AMD Family 10h Processors" is available now:
http://www.amd.com/us-en/assets/cont...docs/40546.pdf

Besides all the stuff that was already known, there is some information that is even new to me, such as the L3 cache being bandwidth adaptive: it runs at lower latency (and lower bandwidth) when there is little traffic, and increases bandwidth (while also increasing latency) once cache traffic reaches some threshold.

Most SSEn instructions are now decoded more efficiently, allowing more of them to reside in the scheduler, so that it can exploit ILP better.

I've got an idea of how to find out how Prime95 might run on K10 compared to K8. The availability of this manual makes it possible to run some simulations, which should come closer to the reality in the labs than any SWAG.
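
As a baseline for comparing such simulations against the measurements earlier in this thread, here is a crude back-of-the-envelope model. The ~2 x 5 N log2 N flop count per LL iteration and the constant factors are rough assumptions, not figures from the optimization guide or from Prime95 itself:

    /* Crude estimate for a 1024K-FFT LL iteration (assumptions, not measured
     * figures): ~2 * 5*N*log2(N) flops per iteration (forward + inverse
     * weighted FFT) and a working set of N doubles. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double n = 1024.0 * 1024.0;                 /* FFT length */
        double flops = 2.0 * 5.0 * n * log2(n);     /* ~2.1e8 flops/iteration */
        double working_set_mb = n * sizeof(double) / (1024.0 * 1024.0);

        /* Iteration rates bounded earlier in the thread (posts #89/#90). */
        double rates[] = { 22.2, 26.7, 40.0, 44.4 };
        printf("working set: %.0f MB, ~%.0f Mflop/iteration\n",
               working_set_mb, flops / 1e6);
        for (int i = 0; i < 4; i++)
            printf("%.1f iter/s  ->  ~%.1f Gflop/s sustained\n",
                   rates[i], rates[i] * flops / 1e9);
        return 0;
    }

Under these assumptions, the 22-44 iterations/s measured above correspond to very roughly 5-9 Gflop/s sustained on an 8 MB working set that far exceeds any of the cache sizes discussed earlier, which is why the memory subsystem details matter so much here.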