mersenneforum.org > Math Let's attack the Bayesian-ECM-bounds again
 Register FAQ Search Today's Posts Mark Forums Read

2016-07-03, 11:26   #1
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

22×1,613 Posts
Let's attack the Bayesian-ECM-bounds again

The Bayesian analysis is in some sense really easy.

For each number of digits D in the putative factor, consider P(finding such a factor | running 1 curve at size B1). This can, it turns out, be very well-approximated as $$P_B = a_{B} * \exp(-b_{B} D)$$) for values in the table below

Code:
B  multiplier  exponent
43e6   29310  -0.385
110e6  16139  -0.356
260e6   9669  -0.331
850e6   5081  -0.302
Make the initial assumption that the probabilities of having a given 'reasonable' number of digits in the factor are all the same \alpha

Then the posterior distribution, PP(didn't find a factor of D digits after N curves at B1=B) is just $\alpha * 1-(1-P_B)^N$. But that's not a real distribution, so you have apply a scale so that all the PP values add up to 1.

This gives you a sigmoidal distribution for possible-factor-size-left; you then compute P(success for a D-digit factor after N' curves at B1=B'), multiply the sigmoidal distribution by this new probability, and the sum over all D of that column is the probability of success.

If you then divide the runtime for N' curves by the probability of success, and look at this value for lots of different N', you see that there is a minimum (which happens when you're running just a single curve); which translates to there being no point running the larger curves if this minimum is longer than the time it takes to do NFS.

See the attached Excel spreadsheet, or do the calculations yourself.
Attached Files
 bayes-ECM-progess.xlsx.gz (88.4 KB, 274 views)

Last fiddled with by fivemack on 2016-07-04 at 08:41

2016-07-10, 15:48   #2
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

22×1,613 Posts
Implementation

Unpack the small attached file into a directory of your choice, and run with some command line like

python ecm-toy.py G182 18000@43

python ecm-toy.py S240 1@3

and it will try to give you a recipe, defined by the rules above and using fits to quite a lot of measured data, for how much more ECM you should do at which bit levels before switching to NFS, assuming you have already done a certain amount of ECM.

The recipes come out as a lot less than the current suggested values, because the current suggested values really don't put enough emphasis on the fact that a number is quite likely to have a factor completely intractable by SNFS. I would be happy if someone could find a problem with my code, but I'm reasonably confident with my measured timings and with the stats I'm using.

ecm-timings.txt simply contains lots of lines of the form

tractor 3 180 24.55

where that gives the timing in seconds (I've only measured them on one computer, but there's a computer-name field in case you have much faster equipment somewhere) for running 1 curves of ECM with B1=3e6 and the next prime larger than 10^180 as the input number.
Attached Files
 ecm-toy.tgz (2.3 KB, 195 views)

 2016-07-10, 18:15 #3 henryzz Just call me Henry     "David" Sep 2007 Liverpool (GMT/BST) 10111010000112 Posts Had some problems initially getting this running under python 3 on windows. I had to add brackets around all of the print statements and replace R.sort() with R = sorted(R). This produced interesting results with reduced curves at smaller B1s and small amounts at larger B1s. It would be interesting to see results for the small numbers that are normally run through yafu starting from even smaller ecm than 3e6. I would imagine that there could be great improvements. If possible it would be interesting for it to output information like the expected factor size/probability of finding factor of size n etc. I am not convinced by your prior being constant for different sizes of factor. akruppa's post here suggests to me that the probability of having a 45 digit factor is quite different to 60 digit factor for instance. Last fiddled with by henryzz on 2016-07-10 at 18:16
2016-07-10, 20:07   #4
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

22×1,613 Posts

With the amount of evidence pushed in by the extra curves, Bayes is strikingly insensitive to the choice of prior. I've attached a new version of the code where you can switch between a flat prior and akruppa's log-ratio prior (by changing line 129), and which prints out the posterior distribution at the end (columns are P(this range) and cumulative). I get

Flat prior
Code:
% python ecm-toy.py G173 7600@43
[[43, 5800], [110, 700]]
30..34 0.0000 0.0000
35..39 0.0000 0.0000
40..44 0.0000 0.0000
45..49 0.0024 0.0024
50..54 0.0627 0.0651
55..59 0.1333 0.1985
60..64 0.1514 0.3499
65..69 0.1544 0.5043
70..74 0.1549 0.6592
75..79 0.1549 0.8141
80..84 0.1549 0.9690
85..89 0.0310 1.0000
Kruppa prior
Code:
[[43, 5800], [110, 700]]
30..34 0.0000 0.0000
35..39 0.0000 0.0000
40..44 0.0000 0.0000
45..49 0.0024 0.0024
50..54 0.0628 0.0652
55..59 0.1334 0.1986
60..64 0.1515 0.3502
65..69 0.1544 0.5046
70..74 0.1548 0.6594
75..79 0.1548 0.8142
80..84 0.1548 0.9690
85..89 0.0310 1.0000
If I put in the obviously-wrong prior that the probability of an N-digit factor is proportional to N, it's a little different, but only a few percent out ... note that it runs rather fewer curves, because with the probability density skewed so much towards curves-that-can't-work the probabilities of success have gone down a bit.

Code:
[[43, 3500]]
30..34 0.0000 0.0000
35..39 0.0000 0.0000
40..44 0.0000 0.0000
45..49 0.0039 0.0039
50..54 0.0581 0.0620
55..59 0.1124 0.1744
60..64 0.1340 0.3085
65..69 0.1468 0.4553
70..74 0.1581 0.6134
75..79 0.1691 0.7825
80..84 0.1801 0.9627
85..89 0.0373 1.0000
Attached Files
 ecm-toy-2.tgz (2.6 KB, 193 views)

Last fiddled with by fivemack on 2016-07-10 at 20:07

 2016-07-11, 10:52 #5 henryzz Just call me Henry     "David" Sep 2007 Liverpool (GMT/BST) 3·5·397 Posts Ok I was unsure about how much the prior would effect the posterior. Switching between the modes is a good check of sanity. Setting priors is one of the things that turns people off Bayesian statistics. This is interesting stuff. I have reckoned for a while that running a few higher curves is beneficial to the chance of finding a factor rather than just running a full t50 etc. This is a situation where having RDS to check the maths would be useful. There would be nothing worse than discovering that there was a significant bug in a year or two.
2016-07-13, 23:22   #6
WraithX

Mar 2006

5·101 Posts

Quote:
 Originally Posted by fivemack Unpack the small attached file into a directory of your choice, and run with some command line like python ecm-toy.py G182 18000@43 python ecm-toy.py S240 1@3 and it will try to give you a recipe, defined by the rules above and using fits to quite a lot of measured data, for how much more ECM you should do at which bit levels before switching to NFS, assuming you have already done a certain amount of ECM. The recipes come out as a lot less than the current suggested values, because the current suggested values really don't put enough emphasis on the fact that a number is quite likely to have a factor completely intractable by SNFS. I would be happy if someone could find a problem with my code, but I'm reasonably confident with my measured timings and with the stats I'm using. ecm-timings.txt simply contains lots of lines of the form tractor 3 180 24.55 where that gives the timing in seconds (I've only measured them on one computer, but there's a computer-name field in case you have much faster equipment somewhere) for running 1 curves of ECM with B1=3e6 and the next prime larger than 10^180 as the input number.
I don't know enough (or, anything yet, really) about Bayesian statistics, but this is a topic that I've been interested in for a long time, so this is a great way to get into that. I wasn't sure what the output of your program meant until I saw henryzz's post saying that a line like [[43, 200],[110,1000]] meant that you should run 200@43e6 and 1000@110e6, and then switch to nfs. Now that makes a little more sense. Also, is the input on the command line telling the program how much ecm has already been done?

One big question I have is, does your analysis take into account the fact that curves at one B1 level have a chance to find factors that are normally found at a different B1 level? For example, gmp-ecm (7) will tell us (with -v) that curves at 110e6 have the following expected number of curves:
Code:
35      40      45      50      55      60      65      70      75      80
38      151     692     3583    20479   128305  872747  6392233 5e+007  4.3e+008
So, if we run 7000@110e6, then we have done: 184.21*t35, 46.357*t40, 10.115*t45, 1.953*t50, 0.341*t55, 0.054*t60, etc.
Does your analysis take this into account? Does it need to?

And my next big question is, when we say we have completed 1*t55, I have heard that there is still a chance that we have missed a factor = e^-1 ~= 36.78%. If we complete 2*t55, does that mean the chance of a missed "55-digit" factor is e^-2 ~= 13.53%? Or, if we have completed 0.3*t55, is the chance of a missed "55-digit" factor = e^-0.3 ~= 74.08%? And with these different t levels, does your analysis take these chance-of-missed-factor into account? Does it need to?

On a side note, but kind of related, I've wanted to ask everyone what they think about changing the list of recommended B1 values. I've been thinking that we could use a list that grows in a predictable way, that will give us a formula for easily figuring out how much work has been done at a specific t level and to make it easy to extend in the future. The new list of B1 that I am thinking of is:
Code:
 Digits                 B1            ecm7 curves
-------------------------------------------------
5 -             1,000 =   1.00e3          1
10 -             3,160 =   3.16e3          2
15 -            10,000 =  10.00e3          9
20 -            31,600 =  31.60e3         34
25 -           100,000 = 100.00e3        123
30 -           316,000 = 316.00e3        389
35 -         1,000,000 =   1.00e6       1071
40 -         3,160,000 =   3.16e6       2642
45 -        10,000,000 =  10.00e6       5613
50 -        31,600,000 =  31.60e6      12111
55 -       100,000,000 = 100.00e6      22113
60 -       316,000,000 = 316.00e6      39678
65 -     1,000,000,000 =   1.00e9      67663
70 -     3,160,000,000 =   3.16e9     107544
75 -    10,000,000,000 =  10.00e9       etc
80 -    31,600,000,000 =  31.60e9
85 -   100,000,000,000 = 100.00e9
90 -   316,000,000,000 = 316.00e9
95 - 1,000,000,000,000 =   1.00e12
100 - 3,160,000,000,000 =   3.16e12
These values are based on powers of sqrt(10) ~= 3.16. This list is reasonably close to the existing list, but this would be much easier to expand or figure out how much equivalent work has been done. With this list we have two associated formulas:
B1 = 10^(digits/10 + 2.5)
digits = 10*log10(B1) - 25

ie, if you wanted to do work up to t53, then you would calculate your B1 = 10^(53/10 + 2.5) ~= 63e6.
or, if you have done work at B1 = 400e6, then the corresponding digit level would be = 10*log10(400e6) - 25 ~= 61.02

I don't have enough of a math(s) background to know if this is a sensible recommendation. I'm hoping we can discuss it here to find out if this is a good recommendation or not. If you don't think this topic is right for this discussion, I can start a new thread about it.

Also, in another thread, henryzz said:
Quote:
 Originally Posted by henryzz edit: It maybe be worth making a script that people can run that generates all the timing and probability data. This would allow full flexibility in the optimization of ecm effort. There is no reason that we need to be restrained by the traditional bounds. The optimal solution would have a different bound for each curve.
I have put together a list of numbers from 150 to 300 digits, in increment of 5, that I am putting into a python script that will automate the gathering of timing information. I haven't quite finished it because I was wondering what you thought of the multiple t level issues I discussed up above. I can post a basic version of the script in the next day or two. Should I go lower than 150? Would we need information below 150, perhaps down to 130 or 120? Also, do we need to gather data from several different computers? Or is it enough to have one complete set of data for one computer and run all of our calculations based on that?

2016-07-14, 07:11   #7
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

22×1,613 Posts

Quote:
 Originally Posted by WraithX Also, is the input on the command line telling the program how much ecm has already been done?
Yes, precisely.

Quote:
 One big question I have is, does your analysis take into account the fact that curves at one B1 level have a chance to find factors that are normally found at a different B1 level?
Yes, that's exactly what the program does; its main data structure stores, for each digit count, the probability that the work we've done so far has left a factor of that size, and then applies to it the probability that the work we might want to do next would remove a factor of that size.

By the end of the run it knows what the probability is of having factors of any size, and has determined that no ECM that it knows about has an expected time to find a factor less than the time it will take to run NFS.

Quote:
 On a side note, but kind of related, I've wanted to ask everyone what they think about changing the list of recommended B1 values. I've been thinking that we could use a list that grows in a predictable way, that will give us a formula for easily figuring out how much work has been done at a specific t level and to make it easy to extend in the future.
I think my work makes the concept of 'at a specific t level' somewhat go away, since it lets you record the amount of work that has been done in the literal sense of 'this many curves at this B1 bound', and the whole point is to reflect that a given curve has some effect on the probability distribution across all digit counts.

The 'classic' tN ECM bounds are from the rather flat bottom of an optimisation of run_time * probability_of_success_at_N_digits; it doesn't make all that much sense to use them as the only sample points when doing something more continuous, but since I needed to pick some sample points I thought I might as well use familiar ones. Maybe going to something more like powers of 10^(0.25) might work if we're doing the time-success tradeoff in a different way.

The two obvious improvements I need to make to the toy are to allow specification of input size as well as SNFS difficulty in the SNFS case (because at the moment I'm assuming they're equal, which means I'm modelling the ECM as slower than it is), and to provide GPU support since I'll in the future be running most of my step-1 ECM on GPUs. I suspect this means I need to split the ECM time into step-1 and step-2 since a 'GPU curve' is a step-1 GPU curve followed by a step-2 CPU curve, and to change the accounting a bit since a GPU does step-1 curves in blocks of a large fixed size ... so if you are doing large calibration runs, please keep the output file from ecm so you can reparse it and get the values at both stages.

Last fiddled with by fivemack on 2016-07-14 at 17:52

2016-07-14, 08:06   #8
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

174316 Posts

Quote:
 Originally Posted by WraithX I have put together a list of numbers from 150 to 300 digits, in increment of 5, that I am putting into a python script that will automate the gathering of timing information. I haven't quite finished it because I was wondering what you thought of the multiple t level issues I discussed up above. I can post a basic version of the script in the next day or two. Should I go lower than 150? Would we need information below 150, perhaps down to 130 or 120? Also, do we need to gather data from several different computers? Or is it enough to have one complete set of data for one computer and run all of our calculations based on that?
Personally I would include smaller numbers if only because they won't take so long to run the curves. I would also aim to run a wide range of b1s both high and low. I would go into the traditional 80 digit curves or even more. This is based upon the behaviour I have seen so far when small numbers have much larger curves run on them than expected. We don't want to limit the larger numbers.

 2016-07-14, 08:58 #9 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 22·1,613 Posts The problem with calibration is that the GNFS and SNFS calibration data is recorded on a single computer (the 'tractor' that appears in all the files), and is the reflection of a couple of CPU-centuries of computation; I don't think there are very many people in a position to re-collect that information. I would be a bit inclined to acquire ECM data from other computers, convert it to a 'speed ratio to tractor', and multiply the NFS calibration figures by that ratio; because it's important, for the time-estimation procedure to make sense, that all times in the program are in the same units, and at the moment that unit is 'seconds taken on tractor'.
2016-07-14, 17:03   #10
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

3·5·397 Posts

Quote:
 Originally Posted by fivemack The problem with calibration is that the GNFS and SNFS calibration data is recorded on a single computer (the 'tractor' that appears in all the files), and is the reflection of a couple of CPU-centuries of computation; I don't think there are very many people in a position to re-collect that information. I would be a bit inclined to acquire ECM data from other computers, convert it to a 'speed ratio to tractor', and multiply the NFS calibration figures by that ratio; because it's important, for the time-estimation procedure to make sense, that all times in the program are in the same units, and at the moment that unit is 'seconds taken on tractor'.
I would be happy with a ratio for ecm being applied to nfs. It is hard for most people to tune nfs. Maybe we could add a smalled nfs or two to the tuning. It would have to be smaller than most of the numbers we have looked at so far to be part of an automated tuning system.

2016-07-17, 01:05   #11
WraithX

Mar 2006

5·101 Posts

Quote:
 Originally Posted by fivemack The 'classic' tN ECM bounds are from the rather flat bottom of an optimisation of run_time * probability_of_success_at_N_digits; it doesn't make all that much sense to use them as the only sample points when doing something more continuous, but since I needed to pick some sample points I thought I might as well use familiar ones. Maybe going to something more like powers of 10^(0.25) might work if we're doing the time-success tradeoff in a different way. The two obvious improvements I need to make to the toy are to allow specification of input size as well as SNFS difficulty in the SNFS case (because at the moment I'm assuming they're equal, which means I'm modelling the ECM as slower than it is), and to provide GPU support since I'll in the future be running most of my step-1 ECM on GPUs. I suspect this means I need to split the ECM time into step-1 and step-2 since a 'GPU curve' is a step-1 GPU curve followed by a step-2 CPU curve, and to change the accounting a bit since a GPU does step-1 curves in blocks of a large fixed size ... so if you are doing large calibration runs, please keep the output file from ecm so you can reparse it and get the values at both stages.
Ok, I've created a simple version of a program that gathers timings for running gmp-ecm on various input numbers at various B1 values. I've included the traditional list of B1 values, and my updated list of B1 values. By default the program will use the traditional list of B1 values. There is a a user-defined-variables section where you can specify a computer name, you can pick which B1 list you want, you can specify which B1 "levels" you want to test (say 20 digits to 50 digits, or 35 digits to 80 digits, for example), and you can specify if you are using gmp-ecm 6 or 7, and with 7 you can specify if you want to use param=0 or param=1 (ecm7 with param=1 is fastest for almost all cases). While the script is running, it will show you the equivalent command line for gmp-ecm and the output from the program. At the end, it will print out the results to screen and to a file called 'ecm_times.txt'. One important note, this script is written on the premise that the ecm executable is 1) named ecm and 2) that it is either in the same directory as the script, or somewhere in your current PATH. I can make this more flexible in a future update if that would help anyone out.

Here are two examples of the output:
Code:
param = 0 (this line not currently included in any output, it's just here so you can better understand timing differences)
# computer, digits, 11e3, 50e3, 250e3, 1e6, 3e6, 11e6, 43e6
pc1, 150, 0.031s+0.032s, 0.109s+0.062s, 0.499s+0.250s, 1.841s+1.061s, 5.491s+2.387s, 19.718s+8.034s, 73.928s+23.354s
pc1, 155, 0.031s+0.016s, 0.140s+0.078s, 0.577s+0.249s, 2.199s+1.077s, 6.474s+2.527s, 23.836s+8.471s, 92.305s+25.850s
pc1, 160, 0.062s+0.016s, 0.140s+0.094s, 0.530s+0.265s, 2.106s+1.107s, 6.177s+2.496s, 22.698s+8.361s, 88.140s+25.475s
pc1, 165, 0.031s+0.015s, 0.124s+0.063s, 0.577s+0.265s, 2.091s+1.123s, 6.287s+2.558s, 22.526s+8.642s, 88.312s+26.801s
pc1, 170, 0.031s+0.032s, 0.140s+0.078s, 0.577s+0.265s, 2.121s+1.186s, 6.208s+2.590s, 22.542s+8.564s, 88.514s+26.302s

param = 1 (this line not currently included in any output, it's just here so you can better understand timing differences)
# computer, digits, 11e3, 50e3, 250e3, 1e6, 3e6, 11e6, 43e6
pc1, 150, 0.046s+0.016s, 0.093s+0.078s, 0.390s+0.234s, 1.498s+0.998s, 3.978s+2.277s, 14.555s+7.675s, 57.003s+23.431s
pc1, 155, 0.031s+0.031s, 0.109s+0.078s, 0.468s+0.250s, 1.670s+1.092s, 4.852s+2.511s, 17.722s+8.408s, 69.155s+25.693s
pc1, 160, 0.031s+0.031s, 0.109s+0.078s, 0.468s+0.250s, 1.654s+1.108s, 4.821s+2.496s, 17.628s+8.346s, 69.015s+25.490s
pc1, 165, 0.046s+0.032s, 0.109s+0.093s, 0.468s+0.265s, 1.654s+1.139s, 4.851s+2.543s, 17.675s+8.611s, 68.812s+26.442s
pc1, 170, 0.031s+0.032s, 0.109s+0.078s, 0.452s+0.250s, 1.654s+1.139s, 4.867s+2.558s, 17.628s+8.611s, 68.968s+26.349s
I recommend that we go to this comma-separated-values (csv) format to make the results more compact, easier to read, and easier to parse/search in future scripts. The first line is basically just a column header line, describing the computer name, the number of digits in the tested number, and which B1 value was used to generate the given times. I have split the times up into step1+step2, and I have converted from milliseconds to seconds. It will be easy to write a function to add the step1 and step2 times together to get a total time for a given B1 value.

Please let me know if you have any problems with this script, or would like to see any features added or modified.

Quote:
 Originally Posted by fivemack The 'classic' tN ECM bounds are from the rather flat bottom of an optimisation of run_time * probability_of_success_at_N_digits; it doesn't make all that much sense to use them as the only sample points when doing something more continuous, but since I needed to pick some sample points I thought I might as well use familiar ones. Maybe going to something more like powers of 10^(0.25) might work if we're doing the time-success tradeoff in a different way.
If you think my list of new B1 values is appropriate, we can generate as continuous a list of B1 values as you want. We could generate B1 values for every digit, or every 0.1 digits, etc. With the formula B1 = 10^(digits/10 + 2.5), we could get a list like:
Code:
digits -    B1
40 -  3.16e6
41 -  3.98e6
42 -  5.01e6
43 -  6.30e6
44 -  7.94e6
45 - 10.00e6
...etc...
That way, we could just pick which "digit levels" to test (which can be either "for d in range(a,b)", or "for d in d_list") and then the formula will give us the corresponding B1 values to use. I was going to implement this in the program, but wanted to ask what you thought of it first.
Attached Files
 gen_ecm_times.zip (2.6 KB, 199 views)

 Similar Threads Thread Thread Starter Forum Replies Last Post MooMoo2 Soap Box 40 2018-01-19 23:48 144 Information & Answers 5 2017-03-15 13:36 Uncwilly GPU Computing 29 2013-09-08 20:53 S485122 Hardware 3 2010-08-24 01:19 ewmayer Lounge 12 2007-01-30 05:56

All times are UTC. The time now is 04:24.

Thu Jan 20 04:24:30 UTC 2022 up 180 days, 22:53, 0 users, load averages: 1.39, 1.37, 1.43