 mersenneforum.org Generating a uniform random n-bit semiprime
 Register FAQ Search Today's Posts Mark Forums Read  2017-09-12, 22:50 #1 danaj   "Dana Jacobsen" Feb 2011 Bangkok, TH 22·227 Posts Generating a uniform random n-bit semiprime I have a routine that generates a random n-bit semiprime and would like to improve its efficiency. This is the "unrestricted" semiprime in that I am not just looking for an even bit split, so different than RSA key generation or examples for hard factoring. There are many papers about generation methods for those. The input is a number of bits n, and the output is an n-bit semiprime. The probability of a given result should be equal to that of all other n-bit semiprimes. Hence with an input of 6, the values 33,34,35,38,39,46,49,51,55,57,58,62 should be returned with equal likelihood. See CRGreathouse's 2010 post for one possible implementation. For very small bits, e.g. 7 and under, we can just uniformly select from a list. Perfect distribution, super fast. For larger inputs, we can randomly select an n-bit number, do an is_semiprime() test, repeating until that is true. This is reasonably fast at small sizes, e.g. average 1 microsecond for 20-bit, 3uS for 32-bit, 10uS for 48-bit, 40uS for 60-bit. Your implementation will differ of course. But at some point this really starts slowing down, and basically for >64-bit this is not a good method. Charles' idea is clever, weighting the distributions. At a high level this gives a good approximation to the distribution. However it skews heavily toward primes with a large following gap. So much so that 3 occurs as often as 2, and 7 more than 5. I did basically a hack on this method, generating a (mostly) proper curve using a table for small primes then use Charles' method if we selected a larger one. This seems to work reasonably well and gives results much closer to the rejection sampling method. Any ideas on better ways?   2017-09-13, 01:13 #2 a1call   "Rashid Naimi" Oct 2015 Remote to Here/There 2,017 Posts A bit over my head. With that disclaimer and for what it's worth. Not sure how is_semiprime() works. But with the right trial-by-division algorithm up to the cube-root of n (which is significantly less taxing than trial-by-division up to the square-root of) n you will be left by a set of primes and semi-primes.   2017-09-13, 02:53   #3
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

16148 Posts Quote:
 Originally Posted by a1call Not sure how is_semiprime() works.
There are various ways to implement it. With mine, 95% of all inputs require little more than a primality test, which is often trivially decided. Then partial factoring quickly finds most solutions. Charles has an example in Pari: http://www.mersenneforum.org/showpos...postcount=2074 that is similar. We both found these tests work well up to a certain point after which it is faster to do more sophisticated factoring. Yes we stop if the input is tiny and we pass the cube root of n.

Quote:
 But with the right trial-by-division algorithm up to the cube-root of n (which is significantly less taxing than trial-by-division up to the square-root of) n you will be left by a set of primes and semi-primes.
This sounds like, to generate a single 64-bit semiprime, we'd create a list of all 64-bit primes and semiprimes (or on average half of them since we're randomly selecting one). That's a stupendously large list. 128-bit semiprimes would be far worse. If n is 1024, then we'd need to work up to the cube root of 2^1024 -- not going to happen.   2017-09-13, 04:16 #4 a1call   "Rashid Naimi" Oct 2015 Remote to Here/There 2,017 Posts Again keeping my disclaimer in mind, i will offer some more simpleton comments which is all I can do. The following example can be stretched in different ranges and is just a concept. Suppose you are aiming to produce the set of all the primes and semiprimes between 100 to 200. You could do trial by division up to n^(1/4). This would filter out all the factors of 2 and 3 for this example range. Then you could perform gcd operations between the remaining items which will distinguish all the primes from 5 to 37. The remaining primes would be to large to form a semi prime in the range (unless multiplied by 2 or 3). Last fiddled with by a1call on 2017-09-13 at 04:38   2017-09-13, 05:12   #5
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

16148 Posts Quote:
 Originally Posted by a1call Suppose you are aiming to produce the set of all the primes and semiprimes between 100 to 200.
That's not what I'm asking for.

My problem is, for example, generating 1000 random 128-bit semiprimes, each with approximately uniform probability. I don't want to generate all the semiprimes in the range, even if that were at all feasible. There are something like 8 * 10^36 of them.

Even with more tame sizes, the point isn't to generate all of them. It is to quickly generate a single random semiprime that has the right probability. Generating them with a somewhat incorrect probability is pretty easy (e.g. generate a uniform random prime p between 2 and 2^(n-1), then generate another uniform random prime q in the range such that p*q is an n-bit number).   2017-09-13, 13:07   #6
Dr Sardonicus

Feb 2017
Nowhere

3×11×137 Posts Quote:
 Originally Posted by danaj The input is a number of bits n, and the output is an n-bit semiprime. The probability of a given result should be equal to that of all other n-bit semiprimes. Hence with an input of 6, the values 33,34,35,38,39,46,49,51,55,57,58,62 should be returned with equal likelihood.
Hmm. I'm not sure about the usage of either "uniform" or "random" here. You're looking at the interval

[2^(n-1), 2^n - 1].

A uniform distribution in an interval usually means, the probability of landing in a sub-interval is proportional to its length. But the numbers from the set you're interested in aren't uniformly distributed in the interval. So a process which does well at selecting numbers which are uniformly distributed in the interval, may do less well at selecting members of the subset of semiprimes with equal probability.

As to randomness in your subset, theoretical perfection would seem to require knowing in advance all the numbers in the set (here, the n-bit semiprimes), then having a process to select one of them randomly, say by enumerating them 1 to K, and randomly selecting an integer from 1 to K.

Obviously, knowing the complete set of semiprimes in the interval is impracticable when the bit size is large (as is, therefore, being able to know how "random" your choice of semiprime might be.)

So it seems to me that some sort of "weighted" approach -- as already mentioned -- might be a practical compromise, to get reasonably close to what you want with a reasonable amout of computing power.

There is an asymptotic estimate due to Landau for the number of semiprimes less than x,

x*(log(log(x))/log(x)

but this estimate is not all that great.

I'm sure there are estimates of the distribution in terms of the size of the smaller of the two prime factors. I'm equally sure that, if I tried to reproduce such an estimate "on the fly" I would screw it up royally. But such an estimate seems to me to be a good way to go. Use the distribution in terms of the size of the smaller factor to create a "stratified random sample," and then cross your fingers...   2017-09-13, 15:28   #7
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

38C16 Posts I think you've covered most of what I've got.

Two ways that generate the right distribution are:
• Rejection sampling. Generate a uniform value in the range 2^(n-1) to 2^n-1. If not semiprime, try again. The chances of any given semiprime being chosen are identical to any other. The main downside is that is_semiprime gets expensive as the size goes up (e.g. n > 64)
• Determine the number of semiprimes S_n in the range 2^(n-1) to 2^n-1. Generate a uniform integer 0 < i <= S_n. Select the i'th semiprime in the range. The downside is having to have efficient versions of semiprime_count and nth_semiprime.

Selecting uniform random primes in a range can be done similarly, and I have done each in that context. For primes the second is somewhat reasonable given extremely optimized prime count and nth prime routines, and for semiprimes Charles and I use this in a hardcoded fashion for tiny inputs. With primes, the former has some optimizations that cut out large groups of composites from the selection. I don't think those work nearly as well with semiprimes.

The rejection sampling method is practical to 64-bits, and feasible probably up into 128 or so bits. Depends on ones available factoring algorithms and performance tolerance. It'd be nice to have something better, but this works.

Quote:
 Use the distribution in terms of the size of the smaller factor to create a "stratified random sample," and then cross your fingers...
I believe technically this is related to inverse transform sampling. Charles does this selecting the first prime. I found it didn't work as well as I'd like, so applied some hacky corrections, which makes it better but still dubious. At least at this size almost nobody will notice if we're not perfect.   2017-09-13, 20:34 #8 CRGreathouse   Aug 2006 32×5×7×19 Posts If you use 'my' method, be sure to replace the hacky nextprime() bit with the proper random prime generation which should make the overall method nearly uniform. (PARI/GP now has this bit built in as randomprime.)   2017-09-13, 22:49   #9
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

22·227 Posts Quote:
 Originally Posted by CRGreathouse If you use 'my' method, be sure to replace the hacky nextprime() bit with the proper random prime generation which should make the overall method nearly uniform. (PARI/GP now has this bit built in as randomprime.)
The issue is with the selection of p rather than q. The difference in probability of (2,3,5,7) is completely different with inputs of 25 vs. 26, where the former uses rejection sampling and the latter selects two primes using the expected probability.

For 100k 26-bit random semiprimes, using the first code (rejection sampling) and looking at the histogram of smallest factor:

Code:
17026 2
11633 3
7168 5
5300 7
3447 11
2956 13
2296 17
2098 19
1810 23
1368 29
Then the result with the second code (probability):

Code:
15553 3
14520 2
8509 7
7661 5
4050 13
2964 23
2651 19
2649 11
2100 31
1540 17   2017-09-14, 08:32 #10 axn   Jun 2003 22×3×7×59 Posts Here's an engineer's solution. Build a table of frequency of b-bit semiprimes where the smaller factor is p. But rather than do it for each prime < sqrt(2^b), we do it in batches. Since the smaller primes have more contribution, we can do it at invidividual prime levels for small primes (say < 128). After that we do it at at batches of, say, 0.5 bits, like 2^7-2^7.5, 2^7.5-2^8, etc... upto 2^(b/2). For individual primes, we can estimate the # of semiprimes as Pi( 2^b/p ) - Pi( 2^(b-1)/p ). For a batch of primes between p&q, we can calculate it as if there are n primes (n=Pi(q)-Pi(p)) each of size Mean(p,q) [i dont know which would give better result, AM,GM or HM]. Once we have the frequency counts, we can generate a random number in [0,1) and use it to select prime/batch (in case of batch, select a random prime within the batch). There will be lot of approximations & estimations, but just might work. For a given b, there will be some initialization which can be amortized over multiple calls. Last fiddled with by axn on 2017-09-14 at 08:35   2017-09-14, 15:48   #11
Antonio

"Antonio Key"
Sep 2011
UK

32·59 Posts Don't know if it helps, but I have used the attached perl code which seems to work ok. It uses your random_prime.

Prime distribution, routine executed until one of the counted prime factors = 100k (26bit semi-primes):
Code:
Prime   Count
2    98900
3    98719
5    99101
7    99327
11    99254
13    98703
17    98651
19    99279
23    99474
29    99749
31    98857
37    99465
41    99004
43    99085
47    98477
53   100000
59    98947
61    98380
67    98631
71    98491
73    99117
79    99165
83    98799
89    98832
97    99597
101    98376
103    99035
107    98870
109    98463
113    98726
127    98853
131    98661
137    98324
139    98539
149    98855
151    99396
157    99167
163    98829
167    98532
173    99200
179    99195
181    99165
191    98809
193    99163
197    99418
199    99166
Attached Files random_semi.7z (605 Bytes, 74 views)   Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post sweety439 sweety439 11 2020-09-23 01:42 jasong Lounge 46 2017-05-09 12:32 tal Msieve 31 2011-05-22 18:17 davar55 Math 14 2011-02-20 16:06 Greenk12 Factoring 1 2008-11-15 13:56

All times are UTC. The time now is 23:57.

Thu May 6 23:57:40 UTC 2021 up 28 days, 18:38, 0 users, load averages: 1.90, 1.68, 1.57