mersenneforum.org Google Cloud Compute 31.4 Trillion Digits of Pi

2019-06-28, 00:39   #12
aurashift

Jan 2015

11×23 Posts

Howdy.

How long would it take with, say, 25, 50, 75, 120, 320, or 960 GB/s of storage bandwidth, and 20 million to hundreds of millions of IOPS? That's using the (hypothetical) EPYC2 64-core parts with 1, 2, or 4 sockets, across 1-3 nodes.

Can the software run in parallel on more than one compute node if they're tied into the same storage device?

edit:

Quote:
 There were performance issues with live migration due to the memory-intensiveness of the computation. (The 1.4 TB of memory would have been completely overwritten roughly once every ~10 min. for much of the entire computation.)
Quote:
 Computation: 1 x n1-megamem-96 (96 vCPU, 1.4TB) with 30TB of SSD
I'm assuming this was run on a quad-socket Intel 24c box? 6 channels of RAM per socket? Do you think the UPI intersocket communication speed may have dragged down your CPU throughput?

Last fiddled with by aurashift on 2019-06-28 at 00:53 Reason: i felt like it

2019-06-28, 01:26   #13
aurashift

Jan 2015

11×23 Posts

Then again, if this is limited to one compute node... maybe this is the solution, granted the network fabric is fast enough (hint: it might be): https://www.scalemp.com/products/vsmp-clusterone/

2019-06-28, 03:52   #14
Mysticial

Sep 2016

2³×43 Posts

Quote:
 Originally Posted by aurashift Howdy. So, after reading your blog post I have some questions. How long would it take with say, 25, 50, 75, 120, 320, and 960GB/s storage bandwidth? and 20M-Hundreds of millions of IOPS? using the (hypothetical) EPYC2 64 cores with 1, 2, or 4 sockets, across 1-3 nodes?
320 GB/s and up isn't possible since that's more than the memory bandwidth of the system. 120 GB/s is as good as infinite for that node since that's about the memory bandwidth. And that would probably make the computation go under 1 month.

Quote:
 Can the software run in parallel on more than one compute node if they're tied into the same storage device?
Currently no. All the computation needs to be one process under the same address space.

Quote:
 edit: From your blog: I'm assuming this was run on a quad socket intel 24c box? 6 channels of RAM per socket? do you think maybe the UPI intersocket communication speed maybe lagged down your CPU throughput?
Intersocket communication absolutely is overhead. But it doesn't matter because the disk access was 8x worse.

Quote:
 Originally Posted by aurashift Then again if this is limited to one compute node...maybe this is the solution, granted the network fabric is fast enough: hint:it might be https://www.scalemp.com/products/vsmp-clusterone/
The scalability of the program drops drastically with more than several NUMA nodes. So it's not going to be able to efficiently utilize something like a 32 socket system.

Last fiddled with by Mysticial on 2019-06-28 at 04:00

2019-06-28, 20:20   #15
aurashift

Jan 2015

11×23 Posts

Quote:
 Originally Posted by Mysticial 320 GB/s and up isn't possible since that's more than the memory bandwidth of the system. 120 GB/s is as good as infinite for that node since that's about the memory bandwidth. And that would probably make the computation go under 1 month.
Yeah, for Intel. Faster isn't necessarily doable right this minute, and I can't guarantee anything, but unofficially I've got a nod and a wink from AMD that we're going to see a doubling of DIMM channels in ROME from 8 to 16 this year. They're already doubling the PCIe lanes from ~160ish to ~320ish. Hopefully it isn't a bastardization like Cascade Lake-AP where they just stuck two dies on one socket (such a freaking letdown).

Quote:
 Currently no. All the computation needs to be one process under the same address space. Intersocket communication absolutely is overhead. But it doesn't matter because the disk access was 8x worse. The scalability of the program drops drastically with more than several NUMA nodes. So it's not going to be able to efficiently utilize something like a 32 socket system.
Good to know. QPI/UPI/Socket to Socket infinity fabric is kind of irksome, not something many people think much about normally day to day, and something I'm trying to banish from my boxes now that the race for core count has essentially been won for the moment.
Given that lately the modern UPI is about 10 GT/s, and there's usually 2-3 UPI links in a given box, it may be possible that if I achieve 400, 600, 800Gbit and maybe even 1Tbit in this POC, then spanning a single OS across multiple single-socket systems is faster than using the internal NUMA... if I'm doing the numbers right in my head anyway. I don't have any hands-on experience actually doing this yet, but it's coming. HPE's Superdome Flex is a good example of this in action, but it's not quite as bleeding edge as what I'm describing. Superdome X and MC990x are other examples of 32 socket suckers that probably don't have the beefy backend to get it done right.
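For anyone who wants to redo that head math, here is a rough sketch of the comparison. It only looks at raw bandwidth. The per-link UPI figure is a parameter, defaulting to the roughly 8 GB/s guessed elsewhere in this thread, which is an assumption rather than a spec value, and the function names are made up for illustration.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a line rate in Gbit/s to GB/s (8 bits per byte, no protocol overhead)."""
    return gbit_per_s / 8.0


def upi_aggregate_gbyte(links: int, per_link_gbyte: float = 8.0) -> float:
    """Aggregate socket-to-socket bandwidth for `links` UPI links.

    per_link_gbyte is an ASSUMED figure taken from the thread, not from a datasheet.
    """
    return links * per_link_gbyte


if __name__ == "__main__":
    for fabric_gbit in (400, 600, 800, 1000):
        fabric = gbit_to_gbyte(fabric_gbit)
        for links in (2, 3):
            upi = upi_aggregate_gbyte(links)
            verdict = "fabric ahead" if fabric > upi else "UPI ahead"
            print(f"{fabric_gbit:>4} Gbit/s fabric = {fabric:6.1f} GB/s "
                  f"vs {links} UPI links = {upi:5.1f} GB/s -> {verdict}")
```

Bandwidth is only part of the picture: latency over a network fabric is typically much higher than over a socket-to-socket link, and this sketch ignores it entirely.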

Some of the numbers in the first post I gave are still up in the air since this is still in the planning phase, but they're definitely going to be feasible and not much of a stretch.

The fact that you were only getting 2-3 GB/s out of your 24 node storage cluster tells me there was some kind of crippling bottleneck somewhere in the pipeline. I'd be curious to know how that was configured if it's not too much bother.

2019-06-28, 20:34   #16
Mysticial

Sep 2016

2³·43 Posts

Quote:
 Originally Posted by aurashift I should introduce this thread before continuing https://mersenneforum.org/showthread.php?t=24547 Yeah for Intel, and faster isn't necessarily doable right this minute, and I can't guarantee anything but unofficially I've got a nod and a wink from AMD that we're going to see a doubling of DIMM channels in ROME from 8 to 16 this year. They're already doubling the PCIe lanes from ~160ish to ~320ish Hopefully it isn't a bastardization like Cascade Lake-AP where they just stuck two dies on one socket (such a freaking letdown). Good to know. QPI/UPI/Socket to Socket infinity fabric is kind of irksome, not something many people think much about normally day to day, and something I'm trying to banish from my boxes now that the race for core count has essentially been won for the moment. Given that lately the modern UPI is about 10 GT/s, and there's usually 2-3 UPI's in a given box, it may be possible that if I achieve 400, 600, 800Gbit and maybe even 1Tbit in this POC, then spanning a single OS across multiple single socket systems is faster than using the internal NUMA...if I'm doing the numbers right in my head anyway. I as of yet don't have any hands on experience actually doing this, but its coming. HPE's SuperDome Flex is a good example of this in action, but its not quite as bleeding edge as what I'm describing. Superdome X and MC990x are other examples of 32 socket suckers that probably don't have the beefy backend to get it done right. Some of the numbers in the first post I gave are still up in the air since this is still in the planning phase, but they're definitely going to be feasible and not much of a stretch. The fact that you were only getting 2-3 GB/s out of your 24 node storage cluster tells me there was some kind of crippling bottleneck somewhere in the pipeline. I'd be curious to know how that was configured if its not too much bother.
It's explained in my blog. The bottleneck is a combination of the network connection and artificial limits set by the platform. The network card on each node itself isn't capable of going above like 3.1 GB/s. The GCP platform itself imposes quotas of 2 GB/s per node egress. Since there's only one compute node, the entire computation was limited to 3.1 GB/s read and 1.8 GB/s write.
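The read-side numbers above can be put into a tiny sketch of the "slowest stage wins" reasoning. The figures are the ones quoted in this thread; the function and constant names are hypothetical, and modelling the pipeline as a simple minimum is a simplification.

```python
def bottleneck(*stage_rates_gbs: float) -> float:
    """A pipeline's sustained rate is capped by its slowest stage (GB/s)."""
    return min(stage_rates_gbs)


STORAGE_NODES = 24          # storage nodes, per the thread
EGRESS_QUOTA_GBS = 2.0      # per-node egress quota quoted above
COMPUTE_NIC_READ_GBS = 3.1  # what the single compute node's NIC could take in

# What the storage cluster could push out if nothing downstream limited it.
aggregate_storage_gbs = STORAGE_NODES * EGRESS_QUOTA_GBS

# What the one compute node actually sees on reads.
read_rate = bottleneck(aggregate_storage_gbs, COMPUTE_NIC_READ_GBS)

print(f"storage side could offer ~{aggregate_storage_gbs:.0f} GB/s, "
      f"compute node reads at most {read_rate} GB/s")
```

Adding more storage nodes raises the first number only, which is why the single compute node's link dominates here.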

The artificial cap is somewhat of a self-inflicted wound. But they seem to do it for market segmentation. And there's no "unbounded" tier that completely eliminates the cap.

When I spoke with some of the people at GCP who were involved in this, I told them that if they want GCP to enter the HPC space for stuff like this, they will need to upgrade their interconnects to something comparable to supercomputers.

------

Given what the program was capable of and what GCP had to offer, there wasn't much left to improve the performance strictly from a configuration standpoint. But there is plenty of room for improvement on the $ efficiency of the computation.

I can disclose this now since the person has published a blog on it. But there is a computation of 50 trillion digits going on right now using less capable computing hardware. He recently upgraded the storage to where it's now about double the speed of Google's run. But it's taking longer than we anticipated since I underestimated the impact of having very little ram.

Last fiddled with by Mysticial on 2019-06-28 at 20:55

2019-06-28, 21:03   #17
aurashift

Jan 2015

11·23 Posts

Quote:
 Originally Posted by Mysticial When I spoke with some of the people at GCP who were involved in this, I told them that if they want GCP to enter the HPC space for stuff like this, they will need to upgrade their interconnects to something comparable to supercomputers.
\/\/\/\/\/\/\/\/\/\/
>>>YUP<<<
^^^^^^^^^^^^

The next great bottleneck is the interconnects, be it network, PCIe, etc. The uprising of NVMe, EDSFF, and other storage innovations, along with the exponentially rising amount of data needing to be stored and processed in the enterprise makes this an immediately clear pain point. The market is realizing this, and you're seeing things happen in the news. Nvidia just bought Mellanox. Intel just bought someone (forget who) and they've also got OmniPath. HPE bought Arista. Couple of companies are throwing stuff on top of PCIe for interconnects besides the network. HPE is trying to replace all the electronics with photonics.

I'm already over 100Gbit. I want terabit, gosh darn it. I did three and a half maths to figure it out, you need a PCIe 5.0 x32 slot for a terabit NIC. Soon...

Anyway, for now I'll play around with y-cruncher on my laptop to familiarize and keep it in mind for a test case in the next year.

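The "PCIe 5.0 x32 for a terabit NIC" figure can be checked with a short sketch. It uses the per-lane signalling rates (8, 16, 32 GT/s for PCIe 3.0, 4.0, 5.0) and the 128b/130b line encoding those generations share. Packet and protocol overhead are ignored, so real-world needs would be somewhat higher; the function name is made up for illustration.

```python
import math

# Per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_LANE = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}

# PCIe 3.0 and later carry 128 payload bits in every 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130


def lanes_needed(target_gbit_per_s: float, generation: str) -> int:
    """Minimum lane count whose raw post-encoding bandwidth covers the target.

    Ignores TLP/DLLP packet overhead, so this is a lower bound.
    """
    per_lane_gbit = GT_PER_LANE[generation] * ENCODING_EFFICIENCY
    return math.ceil(target_gbit_per_s / per_lane_gbit)


if __name__ == "__main__":
    for gen in ("3.0", "4.0", "5.0"):
        print(f"PCIe {gen}: {lanes_needed(1000, gen)} lanes for 1 Tbit/s")
```

For PCIe 5.0 this comes out at 32 lanes, which matches the figure in the post.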
2019-06-28, 21:20   #18
Mysticial

Sep 2016

158₁₆ Posts

Quote:
 Originally Posted by aurashift \/\/\/\/\/\/\/\/\/\/ >>>YUP<<< ^^^^^^^^^^^^ The next great bottleneck is the interconnects, be it network, PCIe, etc. The uprising of NVMe, EDSFF, and other storage innovations, along with the exponentially rising amount of data needing to be stored and processed in the enterprise makes this an immediately clear pain point. The market is realizing this, and you're seeing things happen in the news. Nvidia just bought mellanox. Intel just bought someone (forget who) and they've also got OmniPath. HPE bought Arista. Couple of companies are throwing stuff on top of PCIe for interconnects besides the network. HPE is trying to replace all the electronics with photonics. I'm already over 100Gbit. I want terabit, gosh darn it. I did three and a half maths to figure it out, you need a PCIe 5.0 x32 slot for a terabit NIC. Soon... Anyway for now I'll play around with ycruncher on my laptop to familiarize and keep it in mind for a test case in the next year.
I recently refactored much of the disk swapping code to throw in an interface that separates the computation from all the swap file implementation. So now I can do new swap file implementations and just plug it in.

One of the things that I'm considering is a two-level implementation that combines the classic massive-RAID0 hard drive solution with a massive-RAID0 NVMe cache. IOW, I'm imagining a scenario where you have put like 16 NVMes into the system (with the help of PCIe cards) along with the SATA cards needed to hold 300+TB of HD storage.

There are two problems though:
• This is going to be one of the fastest ways to kill SSDs.
• It's unclear how much spatial locality there is to exploit. You could just be burning SSDs for little to no gain. There will probably need to be algorithm tweaks to expose some locality.

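To make the idea concrete, here is a toy sketch of what a pluggable swap interface with a two-level backend could look like. This is not y-cruncher's actual interface: every class and method name below is hypothetical, the slow tier is just an in-memory dict standing in for the hard drive array, and the LRU write-back policy is an assumption, since the post does not say which caching policy would be used.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class SwapBackend(ABC):
    """Hypothetical interface: the computation only ever sees these two calls."""

    @abstractmethod
    def read_block(self, block_id: int) -> bytes: ...

    @abstractmethod
    def write_block(self, block_id: int, data: bytes) -> None: ...


class DictBackend(SwapBackend):
    """Stand-in for the slow tier (e.g. a hard drive array); counts its I/O."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}
        self.reads = 0
        self.writes = 0

    def read_block(self, block_id: int) -> bytes:
        self.reads += 1
        return self.blocks[block_id]

    def write_block(self, block_id: int, data: bytes) -> None:
        self.writes += 1
        self.blocks[block_id] = data


class TwoLevelBackend(SwapBackend):
    """LRU write-back cache (the fast tier) in front of a slow backend."""

    def __init__(self, slow: SwapBackend, capacity_blocks: int) -> None:
        self.slow = slow
        self.capacity = capacity_blocks
        self.cache: OrderedDict[int, tuple[bytes, bool]] = OrderedDict()

    def _store(self, block_id: int, data: bytes, dirty: bool) -> None:
        # Insert or refresh the block as most recently used, then evict
        # least recently used blocks, writing dirty ones back to the slow tier.
        self.cache[block_id] = (data, dirty)
        self.cache.move_to_end(block_id)
        while len(self.cache) > self.capacity:
            old_id, (old_data, old_dirty) = self.cache.popitem(last=False)
            if old_dirty:
                self.slow.write_block(old_id, old_data)

    def read_block(self, block_id: int) -> bytes:
        if block_id in self.cache:
            data, dirty = self.cache[block_id]
            self._store(block_id, data, dirty)  # cache hit: refresh recency
            return data
        data = self.slow.read_block(block_id)   # cache miss: go to slow tier
        self._store(block_id, data, False)
        return data

    def write_block(self, block_id: int, data: bytes) -> None:
        self._store(block_id, data, True)       # absorbed by the fast tier

    def flush(self) -> None:
        """Write every dirty cached block back to the slow tier."""
        for block_id, (data, dirty) in list(self.cache.items()):
            if dirty:
                self.slow.write_block(block_id, data)
                self.cache[block_id] = (data, False)
```

With little locality in the access pattern, nearly every access misses and every eviction turns into a slow-tier write, which is the "burning SSDs for little to no gain" case described above.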
2019-06-28, 22:01   #19
aurashift

Jan 2015

253₁₀ Posts

Quote:
 Originally Posted by Mysticial I recently refactored much of the disk swapping code to throw in an interface that separates the computation from all the swap file implementation. So now I can do new swap file implementations and just plug it in. One of the things that I'm considering is a two-level implementation that combines the classic massive-RAID0 hard drive solution with a massive-RAID0 NVMe cache. IOW, I'm imagining a scenario where you have put like 16 NVMes into the system (with the help of PCIe cards) along with the SATA cards needed to hold 300+TB of HD storage. There are two problems though: This is going to be one of the fastest ways to kill SSDs. It's unclear how much spatial locality there is to exploit. You could just be burning SSDs for little to no gain. There will probably need to be algorithm tweaks to expose some locality.
My comments on this will probably change after I actually run the program for the first time, but...

Locality can be super abstract but still important to think about. Unless you're dealing with a single node, it isn't going to be local whether it ends up going over FC/FCoE/iSCSI/RDMA/ROCE/IWARP/NFS/SMB/NVMEoF/whatever to a NAS/SAN/ClusterFS, etc. Especially if you're talking about 170+TB files.

Sorry if you already know some of this but I don't like to make assumptions...

-RAID0 isn't necessarily a good or bad thing. It really depends on the implementation and a lot of other factors. In the consumer space, I tried a RAID0 with my laptop's two 2TB M.2 NVMe drives and it was faster to just run them as independent disks.
-A lot of the CURRENT commodity servers I see are limited to about ~20 U.2 NVMe disks.
-M.2's have problems and I won't touch them or give them much further thought, because...
-EDSFF disks are coming. Not the same as Intel's ruler or Samsung's NF1 design, but very similar.
-Each EDSFF interface is spec'd to max out at 112Gbit/s (14GB/s)
-A gen3 PCIe x1 lane is pretty close to a 1GB/s speed, so this makes these comparisons easy.
-Intel is crippled in the expansion bus area compared to AMD EPYC nowadays. The latest Xeons give you 40-48 PCIe lanes per socket.
-The 10GT/s per UPI link (approx 8GB/s, it's pretty much a custom PCIe 3.0 x8 lane from what I can tell...) bottleneck doesn't mean you can just throw four sockets in a system and get 40+40+40+40 PCIe lanes for 160GB/s of bandwidth. As you know, NUMA is a PITA.
-Your PCIe card to SATA disks (SAS, let's just say?) is going to be the bottleneck. Even if you somehow get 120 SAS disks all capable of 2GB/s per disk for a theoretical 240GB/s, you're still going to be limited to that card's x8 or x16 PCIe data transfer rate.

Some truly dense 800TB 108x EDSFF nodes are just barely hitting the market, and I *think* they use a PCIe switch or something so you're not just going to see 108*112Gbit/s levels of bandwidth. That's why I'm so excited for ROME (epyc2) with its 320 PCIe lanes and will probably start using those right out of the gate.

https://echostreams.com/products/flachesan2n108m-un

SuperMicro has some EDSFF stuff too now.
https://ftpw.supermicro.com.tw/en/products/nvme-edsff

2019-07-17, 22:40   #20
aurashift

Jan 2015

11111101₂ Posts

Sooo... If I break this record where can I upload the results? My company most likely wouldn't shell out a few $k a month to host a big honkin file like that on the public cloud.
Backblaze has unlimited space for $5 a month for consumer purposes... I'd probably be breaking an EULA somehow though.

Last fiddled with by aurashift on 2019-07-17 at 22:42

2019-07-18, 04:24   #21
Mysticial

Sep 2016

530₈ Posts

Quote:
 Originally Posted by aurashift Sooo...If I break this record where can I upload the results? My company most likely wouldn't shell out a few $k a month to host a big honkin file like that on the public cloud. Backblaze has unlimited space for $5 a month for consumer purposes....I'd probably be breaking an EULA somehow though.
I don't actually have a good answer for that. It's just something I kindly ask the record breakers to do out of goodwill.

The point is that people will want to access the digits. And I don't have the capacity to host them.

2019-07-18, 16:56   #22
aurashift

Jan 2015

11×23 Posts

Seems like the sort of thing BitTorrent might be good at, if we broke up the file and got a few hundred volunteers to download and seed a chunk of it... that sounds like a lot of effort but I don't have other ideas besides getting someone to sponsor it.

There's a potential possibility I could break 1PB... that might be newsworthy enough for corporate sponsorship. A petabyte of pi.
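A quick sketch of what the volunteer idea would ask of each person, purely as arithmetic. The 1 PB total and "a few hundred volunteers" come from the post; the replication factor, the decimal definition of a petabyte, and the function name are assumptions made for illustration.

```python
def chunk_size_bytes(total_bytes: int, volunteers: int, replicas: int = 1) -> int:
    """Bytes each volunteer must store if every chunk is seeded by `replicas` people.

    The file is split into volunteers // replicas equal chunks (last one may be shorter).
    """
    distinct_chunks = volunteers // replicas
    if distinct_chunks < 1:
        raise ValueError("need at least `replicas` volunteers")
    # Ceiling division so the chunks together cover the whole file.
    return -(-total_bytes // distinct_chunks)


if __name__ == "__main__":
    ONE_PB = 10**15  # decimal petabyte, an assumption
    for replicas in (1, 3):
        size = chunk_size_bytes(ONE_PB, volunteers=300, replicas=replicas)
        print(f"300 volunteers, {replicas} seeder(s) per chunk: "
              f"{size / 10**12:.2f} TB each")
```

Even with no redundancy at all, each of 300 volunteers would be holding a few terabytes, so the effort concern in the post looks well founded.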
