mersenneforum.org How-to guide for running LL tests on the Amazon EC2 cloud

2021-06-05, 20:58   #56
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×151 Posts

I was having a read on Amazon, and at least some of the c5.* (certainly c5.metal) instances use 2nd generation Intel Skylake processors. The instance c5.metal gives 96 vCPUs (48 cores). If you are going to pay for AWS, it seems a false economy to use anything other than the 96 vCPUs. About $4.50/hour might seem a lot, but it is more cost-effective to run one of those instances now and again, rather than use the cheaper instances all the time.

For what it is worth, the CPUs in my Dell 7920, the 26-core 2.0 GHz Platinum 8167Ms, are 1st generation Skylake. I would love to upgrade and put the 2nd generation Skylake CPUs in, but they are very expensive. I could probably get about 60% better performance than I currently have, if I spend 1000% more on CPUs than I have done.

2021-06-06, 00:35   #57
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

7×11×73 Posts

Quote:
 Originally Posted by drkirkby If you are going to pay for AWS, it seems a false economy to use anything other than the 96 vCPUs. About $4.50/hour might seem a lot, but it is more cost-effective to run one of those instances now and again, rather than use the cheaper instances all the time.
Please show your work. How did you calculate this? By "cost-effective", you mean lowest cost per prp test (for a candidate of fixed size), right?

2021-06-06, 10:59   #58
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

1C5₁₆ Posts

Quote:
 Originally Posted by VBCurtis Please show your work. How did you calculate this? By "cost-effective", you mean lowest cost per prp test (for a candidate of fixed size), right?
If you go to https://calculator.aws/ then scroll down to Amazon EC2, you will see you can open a pricing calculator. You need to pick a physical location for the server - there's no option for cheapest, but USA East (N. Virginia) seems to be the cheapest in all circumstances. Next pick the number of vCPUs you want. Leave RAM at 1 GB, as any instance they quote you for will have more than 1 GB RAM.

The calculator then gives you the cheapest instance. Here are some examples. You might get slightly different results, as prices can change. I've based all prices on using "On-demand hourly cost" - you can get cheaper prices if you agree to a contract for a set period (1 or 3 years) or set CPU hours per month. But I'm basing prices on what you can get without tying yourself into a contract.

Note that Amazon's vCPU counts include hyper-threading, so N vCPUs is only N/2 cores.

At this moment, this is what the Amazon calculator is showing.
* 16 vCPUs will be instance a1.metal, which will give you 16 vCPUs 16 GB RAM at $0.408/hour. So the cost per vCPU per hour is 0.408/16=$0.0255.
* 32 vCPUs will be instance c6g.8xlarge, which will give you 32 CPUs, 16 GB RAM, The cost is $1.088/hour. So cost per vCPU per hour = 1.088/16=$0.068
* 96 vCPUs will be instance c5a.24xlarge, which will give you 96 vCPUs and 192 GB RAM. Cost per hour is $3.696. So the cost per vCPU per hour = 3.696/96=$0.0385
* 100 vCPUs will require instance x1.32xlarge, which gives you 128 vCPUs and 1952 GB RAM, at a cost of $13.338/hour. 13.338/128 =$0.104203
* 256 vCPUs will require instance u-6tb1.112xlarge, which gives you 448 vCPUs and 6144 GB RAM at a massive cost of $54.6/hour. So the cost per vCPU per hour = 54.8/448=$0.122321
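
Each per-vCPU figure in the list above is just the hourly price divided by the number of vCPUs the instance actually delivers. A minimal Python sketch of that arithmetic follows, using the prices and vCPU counts as quoted in the post. The dictionary layout and the function name are illustrative assumptions, not anything from the AWS calculator.

```python
# Hourly on-demand prices (USD) and vCPU counts as quoted in the post above.
# The data structure and names are illustrative assumptions.
INSTANCES = {
    "a1.metal": (0.408, 16),
    "c6g.8xlarge": (1.088, 32),
    "c5a.24xlarge": (3.696, 96),
    "x1.32xlarge": (13.338, 128),
    "u-6tb1.112xlarge": (54.6, 448),  # hourly price as stated in the list
}


def cost_per_vcpu_hour(price_per_hour: float, vcpus: int) -> float:
    """Return the hourly price divided by the number of vCPUs delivered."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return price_per_hour / vcpus


if __name__ == "__main__":
    for name, (price, vcpus) in INSTANCES.items():
        per_vcpu = cost_per_vcpu_hour(price, vcpus)
        print(f"{name:18s} {vcpus:4d} vCPUs  ${per_vcpu:.4f} per vCPU-hour")
```

Note that this divides by the vCPU count actually delivered, so the c6g.8xlarge entry is computed as 1.088/32. The figure says nothing about how fast each vCPU is, so it is only a starting point for comparing instances.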

Although initially it might look as if 16 vCPUs is the cheapest, they are running on very old hardware. I forget the instance I had, but it had 16 vCPUs. The CPUs were very slow, and when I looked them up, I realised the CPUs had been released in 2014. I had run one for about 4 days, and completed 5% of a PRP test around the 110 M mark.

Assuming the time it takes to perform a PRP test of a given size scales linearly with the number of vCPUs, I think 96 vCPUs is optimal. I need to get my account upgraded to run the 96 vCPU instances, so have not at this point verified this.
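
To make the cost-per-test reasoning concrete: the cost of one test is the hourly price multiplied by the hours the test needs, so a low price per vCPU-hour only helps if each vCPU is reasonably fast. A rough Python sketch follows, using the 16-vCPU data point mentioned above (about 4 days for 5% of a test). The $0.408/hour price is an assumption made purely for illustration, since the post does not say which 16-vCPU instance was actually used, and the function names are hypothetical.

```python
def hours_for_full_test(hours_elapsed: float, fraction_done: float) -> float:
    """Extrapolate total run time from progress so far, assuming a constant rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return hours_elapsed / fraction_done


def cost_per_test(price_per_hour: float, hours_per_test: float,
                  concurrent_tests: int = 1) -> float:
    """Cost of one test when `concurrent_tests` run side by side on one instance."""
    if concurrent_tests < 1:
        raise ValueError("concurrent_tests must be at least 1")
    return price_per_hour * hours_per_test / concurrent_tests


if __name__ == "__main__":
    # About 4 days of running for 5% of a PRP test, as reported in the post.
    total_hours = hours_for_full_test(hours_elapsed=4 * 24, fraction_done=0.05)
    print(f"Estimated full test: {total_hours / 24:.0f} days")

    # ASSUMPTION: the $0.408/hour quoted for the 16-vCPU instance in the list.
    print(f"Estimated cost per test: ${cost_per_test(0.408, total_hours):.2f}")
```

On those assumptions the slow instance would need roughly 80 days and several hundred dollars per test, which is why the lowest price per vCPU-hour is not necessarily the lowest cost per test.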

I have a couple of free Amazon accounts, which give one vCPU each. They are going to take over 500 days to complete the PRP test. They are all category 4, so I have a year to complete them, but even a year would not be long enough. I will move them to a faster machine later.

My experience running on hardware at home is that P-1 factoring is using 305 GB RAM on current wavefront exponents, so just maybe there would be some advantage in having more than the 192 GB you get with 96 vCPUs. The instance for that would give you the same number of vCPUs, but 384 GB RAM at $4.128/hour. I can't believe that it is worth paying 4.128-3.696 = $0.432 per hour just to get the extra RAM that would only benefit stage 2 of the P-1 test.

Based on the Amazon calculator, and my experience of running on the smaller instances on older hardware, I conclude that 96 vCPUs is optimal.

Last fiddled with by drkirkby on 2021-06-06 at 11:02

2021-06-06, 14:51   #59
chalsall
If I May

"Chris Halsall"
Sep 2002

10101100110011₂ Posts

Quote:
 Originally Posted by drkirkby Based on the Amazon calculator, and my experience of running on the smaller instances on older hardware, I conclude that 96 vCPUs is optimal.
Are you taking into account "Spot Instances" in your calculus?

Separately... Only P[-/+]1 work needs much RAM by today's standards. So unless you're doing that type of work it would likely be more cost-effective to choose the instance type with the most compute as a function of cost-per-hour, ignoring the RAM given.

Let's leave out the "wall-clock" vs. "cpu-clock" dimension of the analysis for now...

2021-06-06, 15:24   #60
axn

Jun 2003

5×1,087 Posts

Quote:
 Originally Posted by drkirkby * 32 vCPUs will be instance c6g.8xlarge, which will give you 32 CPUs, 16 GB RAM, The cost is $1.088/hour. So cost per vCPU per hour = 1.088/16=$0.068
1.088/32 = 0.034

c6g's are cheaper than c5a's are cheaper than c5's. Fortunately for you, it doesn't make any difference that c6g is the cheapest. Because those are Amazon's custom Graviton chips which are ARM-based and cannot run P95.

I believe c5a is AMD and c5 is Intel Xeons. If you can get an Intel Xeon supporting AVX-512 (Skylake or better), then you'll get much higher per-core performance. Therefore mindlessly comparing $/core is useless.

Also, the assumption that thruput scales linearly with cores is suspect. Short of actual benchmarks, you can't make any such conclusions.

Finally, as chalsall indicated, spot instances are much cheaper than even 3-year reserved instances, let alone on-demand ones. However, you do need to do a fair bit of babysitting, since they can get terminated abruptly.

EDIT:- It is almost as if you haven't read thru the very thread you're posting to.

Last fiddled with by axn on 2021-06-06 at 15:25

2021-06-06, 15:30   #61
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

453₁₀ Posts

Quote:
 Originally Posted by chalsall Are you taking into account "Spot Instances" in your calculus? Separately... Only P[-/+]1 work needs much RAM by today's standards. So unless you're doing that type of work it would likely be more cost-effective to choose the instance type with the most compute as a function of cost-per-hour, ignoring the RAM given. Let's leave out the "wall-clock" vs. "cpu-clock" dimension of the analysis for now...
I'm using the calculator at https://calculator.aws/#/createCalculator/EC2 and taking the "On-Demand hourly cost".

For 99% of my post, I was ignoring memory. That's why I wrote "Leave RAM at 1 GB, as any instance they quote you for will have more than 1 GB RAM." (Looking again, I do see there are a few instances that come with 0.5 GB RAM, but they have <=2 vCPUs, so are pretty irrelevant). Then Amazon will select an instance that has at least the number of vCPUs you specify, and at least the 1 GB RAM. Generally speaking, the more vCPUs you have, the more memory you will get - whether you want that RAM or not. You will get far more RAM than you want, but unfortunately you do not get the option to relinquish some of that RAM to gain extra vCPUs or reduce the cost.

Only once did I briefly mention adding extra RAM, and concluded that the extra cost did not warrant the marginal benefit from the 2nd stage of P-1 factoring.

Amazon also have GPUs. They might be worth trying instead of CPUs. I did look at some of the prices of GPUs and see they were quite modest. I assume that setting up the GIMPS software to use Amazon's GPUs would take more effort, so one might as well spend that time on a slow, low-cost instance.

Last fiddled with by drkirkby on 2021-06-06 at 15:35

2021-06-06, 15:39   #62
chalsall
If I May

"Chris Halsall"
Sep 2002
Barbados

10101100110011₂ Posts

Quote:
 Originally Posted by drkirkby I assume...
Sigh...

2021-06-06, 15:57   #63
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

3·151 Posts

Quote:
 Originally Posted by axn I believe c5a is AMD and c5 is Intel Xeons. If you can get an Intel Xeon supporting AVX-512 (Skylake or better), then you'll get much higher per-core performance. Therefore mindlessly comparing $/core is useless. EDIT:- It is almost as if you haven't read thru the very thread you're posting to.
I believe c5.metal is 2nd generation Skylake. My Dell 7920 uses a pair of 1st generation Skylake processors which support AVX-512. Changing from 26 cores to 13 cores seems to fairly well double the time per iteration. I need to have two workers, since there are dual CPUs. I am guessing multiple workers would be the same on Amazon.

One does not have to go far back in this thread to see posts written in 2017. A lot would have changed since then, and I don't have time to read every post when the date on it would suggest the information has a reasonable probability of being outdated.

2021-06-06, 18:17   #64
chalsall
If I May

"Chris Halsall"
Sep 2002

2B33₁₆ Posts

Quote:
 Originally Posted by drkirkby A lot would have changed since then, and I don't have time to read every post when the date on it would suggest the information has a reasonable probability of being outdated.
Read the prior art. And/or run your own experiments. So you can answer any and all questions authoritatively.

Otherwise, you're just "dead in the water". Sometimes radiating a lot of noise.

IMHO.

2021-06-14, 00:44   #65
drkirkby

"David Kirkby"
Jan 2021
Althorne, Essex, UK

3×151 Posts

I spun up a c5.metal instance yesterday (2 x Intel Xeon Platinum 8275CL CPU @ 3.00GHz) in N. Virginia. Those CPUs are non-standard ones, and are probably only available to Amazon. There's a good chance they would not work in a mainstream computer (Dell, IBM, HP etc). Supermicro motherboards tend to be a good choice if one wants to use obscure CPUs.

I ran the mprime benchmark at the single FFT size needed for the exponent I had been allocated (around 111 million, as the server will only hand out category 4 to a new machine). I suppose getting a manual assignment would have been more sensible, as the exponent would have been smaller than 111 million.

What surprised me was that throughput was best with 3 workers. Given it's a dual-socket server, I would have expected the optimal number of workers to be an integer multiple of 2, but it was actually 3 workers. My Dell 7920 has similar, but older/slower CPUs. I don't think I ever measured the throughput with 3 workers - I will try that some time, as it might give better throughput than the two workers I use.

After determining 3 workers gave the best throughput, I set mprime running with 3 workers, each with 16 cores. However, this gave one exponent that would have completed in 54 hours, and two that would have taken 4 days (rough figures). Had it been two working fast, and one working slow, I could have understood that. That could happen with
1) Worker 1 on CPU 1 using 16 cores
2) Worker 2 on CPU 2 using 16 cores
3) Worker 3, using both CPU 1 and CPU 2, with 8 cores from each, which would slow it.

But instead I had one fast(ish) worker, and two slow ones. Does anyone know why that might happen? Why would 3 workers give the most throughput on a dual-socket computer? (The c5.metal is bare hardware, so there's no virtualisation, so one is not sharing the hardware with anyone else.)

I hit another problem I was not expecting - I only had 8 GB of disk space, which seems a bit silly with 48 cores (96 vCPUs) and 192 GB RAM. I never bothered checking the disk space before launching the instance - I just assumed it would be reasonable given the number of cores and RAM.

Anyway, with the limited disk space and the very unequal estimated completion times, I decided to run with just 2 workers. That's going at a pretty decent rate

Code:
[Worker #1 Jun 13 23:41] Iteration: 53430000 / 111178363 [48.05%], ms/iter: 1.109, ETA: 17:47:09
[Worker #2 Jun 13 23:41] Iteration: 54010000 / 111178423 [48.57%], ms/iter: 1.086, ETA: 17:14:28

My Dell 7920 has slightly more cores (52 vs 48), but significantly slower CPUs (2.0 vs 3.0 GHz) compared to the Amazon machine. My CPUs are first generation Skylake too. The Dell 7920, with its 2 GHz Platinum 8167M CPUs, is taking around 1.5 ms/iteration on exponents around 104 million, so there's no doubt the Amazon c5.metal is significantly quicker than my Dell 7920, given the Amazon hardware is taking less time per iteration, on larger exponents.

I hit another problem I should have expected, although it is difficult to know how to prevent it. Every exponent I was given needed P-1 factoring. Given they were all started together, they could each have expected to reach stage-2 at the same time, so each would have wanted a lot of RAM at the same time. Hence I had to restrict the RAM on each worker. I don't have this problem on my Dell, since I run 2 exponents, such that one is around 50% complete as the other starts.

Code:
[Worker #1 Jun 14 01:02] Iteration: 43500000 / 104329649 [41.69%], ms/iter: 1.460, ETA: 24:40:02
[Worker #2 Jun 14 01:04] Iteration: 90200000 / 104331181 [86.45%], ms/iter: 1.536, ETA: 06:01:47

So no two exponents ever need a lot of RAM at the same time. So I can give P-1 factoring 370 GB RAM, without putting any constraints on any worker.

(Also, since my Dell has been doing a lot of work quickly, I get only category 0 or 1 exponents. On most occasions, someone else has already done the P-1 factoring.)

Last fiddled with by drkirkby on 2021-06-14 at 00:46

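
The ETA figures in the log lines quoted in this post can be roughly reproduced from the iteration counts and the ms/iter value, which also gives an estimate of how long a complete test would take at that rate. A small Python sketch follows. The regular expression is an assumption inferred only from the line format shown in this post, not from any mprime documentation, and the function names are hypothetical.

```python
import re

# Pattern inferred from the log lines quoted in the post; an assumption,
# not an official description of mprime's output format.
LINE_RE = re.compile(
    r"Iteration: (?P<done>\d+) / (?P<total>\d+) \[[\d.]+%\], "
    r"ms/iter:\s*(?P<ms>[\d.]+)"
)

SAMPLE = ("[Worker #1 Jun 13 23:41] Iteration: 53430000 / 111178363 "
          "[48.05%], ms/iter: 1.109, ETA: 17:47:09")


def parse_progress(line: str) -> tuple[int, int, float]:
    """Return (iterations done, total iterations, milliseconds per iteration)."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    return int(match["done"]), int(match["total"]), float(match["ms"])


def remaining_hours(done: int, total: int, ms_per_iter: float) -> float:
    """Hours left if the current ms/iter stays constant."""
    return (total - done) * ms_per_iter / 1000 / 3600


def full_test_hours(total: int, ms_per_iter: float) -> float:
    """Hours for a whole test of `total` iterations at a constant ms/iter."""
    return total * ms_per_iter / 1000 / 3600


if __name__ == "__main__":
    done, total, ms = parse_progress(SAMPLE)
    print(f"Remaining: {remaining_hours(done, total, ms):.2f} hours")
    print(f"Whole test at this rate: {full_test_hours(total, ms):.2f} hours")
```

For the first c5.metal log line this gives a little under 18 hours remaining, in line with the reported ETA, and about 34 hours for a whole test. At the roughly $4.50/hour mentioned earlier in the thread, with two tests running side by side, that would be somewhere around $75 to $80 per test on demand, ignoring P-1 and any spot discount.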
2021-06-14, 14:05   #66
chalsall
If I May

"Chris Halsall"
Sep 2002

11,059 Posts

Quote:
 Originally Posted by drkirkby Does anyone know why that might happen? Why would 3 workers give the most throughput on a dual-socket computer?
Just a guess... Possibly scaling of the IPC between threads?

Because George codes so "close to the metal", empirical experiments are just about the only way to find "optimal" for a particular goal when you're working "at the edge" with non-nominal kit. (Optimal being defined as Wall-clock vs CPU-clock, etc.)

Quote:
 Originally Posted by drkirkby I hit another problem I should have expected, although it is difficult to know how to prevent it. Every exponent I was given needed P-1 factoring. Given they were all started together, they could each have expected to reach stage-2 at the same time, so each would have wanted a lot of RAM at the same time. Hence I had to restrict the RAM on each worker.
I don't have time to point you to the exact settings, but George foresaw this, and there's a way of constraining the overall RAM usage such that only one P-1 worker will be doing Stage 2 at any one time. See the UnDoc docs...


All times are UTC.
