mersenneforum.org Draft proposed standard for Mersenne hunting software neutral exchange format

2018-06-09, 14:55   #1
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

Draft proposed standard for Mersenne hunting software neutral exchange format

The idea for a portable, software-neutral exchange format, and the discussion of it, began at http://www.mersenneforum.org/showthr...493#post489493, with considerable contribution by Preda, including the original idea and a sketch of a possible format, and some contributions by ewmayer and kriesel. (Possibly others; I don't recall.) The idea of speeding confirmation of a new prime discovery by running parallel verification from a collection of interim save files has been kicking around for a while. A neutral exchange capability would aid that.

In the quote insert below is a draft I've put together quickly from that discussion and some further thought on a possible standard.

Creating a draft standard raises the question of who gets to declare it a standard, or conversely to discourage the development of neutral file exchange capability. I think that would be the Mersenne Research Inc. board. Or, as things often go here, informally and by volunteer action: the authors or maintainers of the various Mersenne hunting programs adopt it, or something similar. Another possible approach to exchange is to create separate standalone export and import utilities. I would be very interested in Prime95's thoughts on the pros and cons of a draft standard and such capabilities, as well as those of other authors of relevant code. Should standalone translators be discouraged as possible threats to data integrity or to the provenance of the data?

Quote:

2018-06-09, 17:52   #2
Uncwilly
6809 > 6502

Aug 2003

Not writing any of the code that tests Mersenne numbers, you can ignore my comments if you wish.

For P-1, bounds should be recorded in a common format and location. I don't know if any software other than Prime95 does the B-S extension and if that would make a difference in a save file.

Why not include TF in the format? It takes up almost no space, but if we are seeking to unify PRP, P-1, and LL, why not add TF?

In the NOTE section: if software imports a file from a different version or program, it should write the original line #2 (your section 3.4), and the iteration line, into the NOTE section. If a different program is used to continue, it should preserve the previous NOTE and add the appropriate lines. Maybe with a tag that indicates that it is the history of the test. (I will use HISTORY in my example.)

Example of original file:
Code:
Prime95, v29.4b
Type LL
Exponent 332191111
Iteration 100000

Example after CUDALucas resumed from that file:
Code:
CUDALucas, v2.06beta, c149447533., "2018-06-09 15:37:23 UTC"
Type LL
Exponent 332191111
Iteration 1000400
......
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000

Example after MLucas resumed from that file:
Code:
Mlucas, V17.0
Type LL
Exponent 332191111
Iteration 30003300
.....
NOTE HISTORY CUDALucas, v2.06beta, c149447533., "2018-06-09 15:37:23 UTC"
NOTE HISTORY Iteration 1000400
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000

What about the file name? Shouldn't that be specified? How will the software know which file to look at? Maybe Mexponent_number YY-MM-DD hh:mm:ss-ms [tz].MNE, so for the above example it would be: M332191111 18-06-09 02:59:59-2001 +5.MNE. That way the save files would have unique names and the software could find the latest, even if the original file dates and times are lost in transfer and copying.

Last fiddled with by Uncwilly on 2018-06-09 at 18:05
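A minimal sketch, in Python, of the history-preserving resume described above. The NOTE HISTORY tag and the header layout are taken from the examples in the post; the helper name is hypothetical, and whether a resuming program would store the header as a list of lines is an assumption:

```python
def history_lines(old_header):
    """Return the NOTE HISTORY lines a resuming program should carry forward:
    new entries for the previous program's identification line and its
    Iteration line, followed by any history already accumulated
    (sketch of Uncwilly's proposal; tag and layout are not settled)."""
    prog_line = old_header[0]                              # e.g. "Prime95, v29.4b"
    iter_lines = [l for l in old_header if l.startswith("Iteration")]
    prior = [l for l in old_header if l.startswith("NOTE HISTORY")]
    new = ["NOTE HISTORY " + l for l in [prog_line] + iter_lines]
    return new + prior
```

Applied twice in succession, this reproduces the newest-first ordering shown in the MLucas example.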
2018-06-09, 18:59   #3
kriesel


Quote:
 Originally Posted by Uncwilly

Not writing any of the code that tests Mersenne numbers, you can ignore my comments if you wish.

For P-1, bounds should be recorded in a common format and location. I don't know if any software other than Prime95 does the B-S extension and if that would make a difference in a save file.

Why not include TF in the format? It takes up almost no space, but if we are seeking to unify PRP, P-1, and LL, why not add TF?

In the NOTE section: if software imports a file from a different version or program, it should write the original line #2 (your section 3.4), and the iteration line, into the NOTE section. If a different program is used to continue, it should preserve the previous NOTE and add the appropriate lines. Maybe with a tag that indicates that it is the history of the test. (I will use HISTORY in my example.)

Example of original file:
Code:
Prime95, v29.4b
Type LL
Exponent 332191111
Iteration 100000

Example after CUDALucas resumed from that file:
Code:
CUDALucas, v2.06beta, c332191111., "2018-06-09 15:37:23 UTC"
Type LL
Exponent 332191111
Iteration 1000400
......
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000

Example after MLucas resumed from that file:
Code:
Mlucas, V17.0
Type LL
Exponent 332191111
Iteration 30003300
.....
NOTE HISTORY CUDALucas, v2.06beta, c332191111., "2018-06-09 15:37:23 UTC"
NOTE HISTORY Iteration 1000400
NOTE HISTORY Prime95, v29.4b
NOTE HISTORY Iteration 100000

My intent was not to unify the various computation types, but to include in the draft standard the ones where there's sufficient utility and interest, and to generate discussion of what would be most useful. It's one draft describing several different neutral exchange file formats, for the very different computation types, which share some common elements.

The exponent is constant, as is the computation type, for the life of a particular exchange file. Stage, iteration, etc. may change as computation progresses, but exponent and computation type do not. Each exchange file communicates one exponent's state of progress in one computation type. CUDALucas includes the exponent in the checkpoint file name.

CUDAPm1 v0.20 does P-1 stage 1 and stage 2, and the B-S extension. Its storage format has up to 9 or 25 32-bit unsigned integer words beyond the long binary data, including several reserved words. It's not simple.

TF may be included at some point, if the standard goes forward. TF was not included, for four reasons. One, the draft took a while to write even without it, and quickly became long.
Two, I wanted to get some feedback before I went further, especially in case the idea doesn't fly.
Three, the run time represented by a TF checkpoint is generally smaller than for LL or PRP.
Four, the data size is much smaller for TF: bytes vs. megabytes. One could muddle through with a hex editor on a TF file, or even Notepad. Here's what one contained, more or less, for M651m: "651302251 80 81 4620 0.20: 392 0 056C1BA7" (a few characters have been changed here). I think those fields are exponent, starting bit level, next bit level, class count, mfaktc version, current class, ??, ??. That's roughly 42 bytes, versus the tens of megabytes for P-1 or LL or PRP of 100-megadigit exponents.
Mfakto's is very similar: something like "580700307 80 81 4620 mfakto 0.15pre6-Win: 525 0 57C49BFA". Mfakto was derived from mfaktc, hence the close similarity.
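A sketch of parsing those two TF checkpoint lines. The field meanings are the guesses made above, not a published format; the colon ending the version field is the one visible in both samples:

```python
def parse_tf_checkpoint(line):
    """Split an mfaktc/mfakto-style TF checkpoint line into named fields.
    Field meanings follow the guesses in the post; treat them as assumptions."""
    # Split at the colon that ends the program-version field; the version
    # may span several tokens (e.g. "mfakto 0.15pre6-Win").
    left, right = line.split(":", 1)
    lf = left.split()
    exponent, bit_min, bit_max, classes = (int(x) for x in lf[:4])
    version = " ".join(lf[4:])
    cur_class, unknown, checksum = right.split()
    return {
        "exponent": exponent,
        "bit_min": bit_min,     # starting bit level (guess from the post)
        "bit_max": bit_max,     # next bit level (guess)
        "classes": classes,     # class count (guess)
        "version": version,
        "class": int(cur_class),
        "unknown": unknown,     # meaning not identified in the post
        "checksum": checksum,   # meaning not identified in the post
    }
```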

I came at this from the point of view of having read engineering data format standards (IGES, STEP, etc.) and seen the inevitable differences in implementations from one software package to the next. As a user and CAD system manager, I occasionally wrote "IGES standard" to "IGES standard" translators, since some software subsetted the standard so much that conversion from software A to IGES to software B was poor.

I like your idea of a program-generated NOTE HISTORY subcase. I'd probably make it a multivalue record, perhaps something like
"NOTE HISTORY Prime95, v29.4b, Start iteration 100000, End iteration 150000".
Then other runs, on Prime95 or elsewhere, could read the history chronologically, line by line.
Or maybe it merits its own case: HISTORY.
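A sketch of that multivalue record, assuming the layout suggested just above (the field names are a suggestion, not settled; note the program field itself contains a comma, so parsing splits from the right):

```python
def format_history(program, start_iter, end_iter):
    # One multivalue NOTE HISTORY record per completed run segment (sketch).
    return f"NOTE HISTORY {program}, Start iteration {start_iter}, End iteration {end_iter}"

def parse_history(record):
    body = record[len("NOTE HISTORY "):]
    # rsplit from the right: the program name may contain ", " itself.
    program, start, end = body.rsplit(", ", 2)
    return program, int(start.split()[-1]), int(end.split()[-1])
```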

Last fiddled with by kriesel on 2018-06-09 at 19:11

2018-06-09, 19:25   #4
kriesel


Quote:
 Originally Posted by Uncwilly What about the file name? Shouldn't that be specified? How will the software know which file to look at?
That's what the proposed program extension, supporting the command line option -import <filename>, is for.
Expanding a bit: I think, by design, exchange file import and export should not be an automatic function. I feel it should require user intervention, and be infrequent.

The exchange formats are not intended as substitutes for the programs' native file format(s).
Exchange files are fundamentally different from a worktodo.add file, which is supposed to be found and processed automatically. They're also fundamentally different from routine save files, which are supposed to be generated automatically every n iterations.
Exchange files are not supposed to be processed or generated automatically. The use cases are the exceptions, not the rule; not production running.

The time/date stamp of file creation (export) is among the data included in the header: line 2 in the exchange file, section 3.4 in the draft standard.

Last fiddled with by kriesel on 2018-06-09 at 19:30

2018-06-09, 22:51   #5
ewmayer
∂2ω=0

Sep 2002
República de California

Hi, Ken:

Thanks for taking the initiative on this. Couple of questions:

1. Is there any good reason to save P-1 stage 2 residues? It was always my understanding that only saving the stage 1 residue was important, since a stage 2 run for any desired stage 2 prime [plo,phi] bounds needs only the stage 1 residue, i.e. if another user did some stage 2 work starting from the same stage 1 residue and covering (say) a lower numeric stage 2 interval, any resulting stage 2 residue is irrelevant to anyone else's stage 2 work.

2. "Binary form is an ordered byte stream, most significant byte first." -- If I'm writing/reading a bytewise residue to/from a file, I have the residue in a byte array and just do fwrite/fread of that, which results in a low-to-high byte ordering. Since for these kinds of residue data every byte is as important as every other (i.e. there is no relative-significance hierarchy in terms of data integrity), why make the required coding more awkward than it needs to be?
2018-06-10, 02:27   #6
preda

"Mihai Preda"
Apr 2015


Quote:
 Originally Posted by kriesel 3.5 Third line is the type of test and subtype if applicable, in ASCII text. Examples: "Type LL" "Type PRP"
I see two distinct ways to identify the field: one is the line number ("third line"), and the other is the name of the field ("Type"). I think we should settle on one or the other. (My preference would be for field-ids, being explicit, rather than line position. If line position is used, then no field-id should be included.)

I'll give it more thought.

Last fiddled with by preda on 2018-06-10 at 02:27

2018-06-10, 02:31   #7
preda


Quote:
 Originally Posted by ewmayer 2. "Binary form is an ordered byte stream, most significant byte first." -- If I'm writing/reading a bytewise residue to/from a file, I have the residue in a byte array and just do fwrite/fread of that, which results in a low-to-high byte ordering. Since for these kinds of residue data every byte is as important as every other (i.e. there is no relative-significance hierarchy in terms of data integrity), why make the required coding more awkward than it needs to be?

Yep, I think we all store "byte-0" (least significant) first, so probably we'll do the same in the save-file as well. I don't think there's any problem here (just need to update that bit of spec).
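A sketch of the byte-0-first (least-significant-byte-first) convention being agreed on here, using Python's arbitrary-precision integers as a stand-in for a residue buffer. The exponent is a known Mersenne prime exponent; the residue value itself is a tiny placeholder, not a real test residue:

```python
p = 86243                       # exponent; a residue is a value mod 2^p - 1
nbytes = (p + 7) // 8           # a full residue occupies ceil(p/8) bytes
residue = pow(3, 5, 2**p - 1)   # tiny stand-in value (243), not a real LL residue

# Byte 0 of the stream is the least significant byte, matching what a
# plain fwrite of a low-to-high byte array produces in the C programs.
blob = residue.to_bytes(nbytes, "little")
restored = int.from_bytes(blob, "little")
```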

2018-06-10, 13:39   #8
kriesel


Quote:
 Originally Posted by ewmayer

Hi, Ken: Thanks for taking the initiative on this. Couple of questions:

1. Is there any good reason to save P-1 stage 2 residues? It was always my understanding that only saving the stage 1 residue was important, since a stage 2 run for any desired stage 2 prime [plo,phi] bounds needs only the stage 1 residue, i.e. if another user did some stage 2 work starting from the same stage 1 residue and covering (say) a lower numeric stage 2 interval, any resulting stage 2 residue is irrelevant to anyone else's stage 2 work.

2. "Binary form is an ordered byte stream, most significant byte first." -- If I'm writing/reading a bytewise residue to/from a file, I have the residue in a byte array and just do fwrite/fread of that, which results in a low-to-high byte ordering. Since for these kinds of residue data every byte is as important as every other (i.e. there is no relative-significance hierarchy in terms of data integrity), why make the required coding more awkward than it needs to be?
1. If there's no intent to extend B2 later (which I've read is complicated; it's not clear to me whether it's even practical, and it's not implemented for CUDAPm1), and there's no need for debugging the file contents because the run completed, then CUDAPm1 does not keep stage 2 files. It also does not keep stage 1 files if stage 2 completes. That's how it manages its own checkpoint files, which can be considered temporary files.

That's entirely separate from save files, which are saved for potential restart from any point in case of error, in a specified subdirectory, and accumulate there until removed by the user, if periodic save files are enabled. And exchange files are separate from both save files and checkpoint files; exchange files would be created either by an extension to the program or by a separate standalone program, and only when the user takes action to cause it, such as by using the -export option of the extended program, or running the standalone program. (64-bit residues along the way appear in console output, and would get logged if logging were in place; they are a currently untapped means of detecting and halting on certain errors.)

For production running, there may be (usually) no reason to keep stage one or stage two checkpoint files or savefiles or exchange files after the P-1 run is done. For debugging, or for working around program limitations, there may be. Possibly if a remarkably large factor was found, that passed pseudoprime testing itself, keeping the files for confirmation and proof might be useful.

Running CUDAPm1 as it is, I am finding that for large exponents some runs get nearly through stage 1 and then the gcd fails. Others get through the stage 1 gcd and then fail early in stage 2, in a variety of ways. These failed-run files are being kept; I hope to finish them later and also examine what's different about them. I wonder what gmp-ecm's input form for that would look like, or whether there's a better candidate for carrying them to completion.

One of the ways CUDAPm1 stage 2 fails: the 64-bit residue in its progress output always matches the last one from stage 1. This, I think, occurs when there's not quite enough memory for stage 2. Such a checkpoint file is broken in some way; the repeating residue persists even if the file is moved to a gpu with much more memory. An earlier checkpoint must be used instead.

2. Point taken; it will be revised. (Murphy and a coin toss?) LSB-first is how CUDAPm1 checkpoint files are organized too, if I'm reading the hex editor correctly. (Source indicates 32-bit unsigned ints, and the hex editor shows little-endian storage; not surprising.) What's in the checkpoint files is one iteration beyond the iteration count and printed 64-bit residue, which complicates things.

I knocked out a prototype CUDAPm1 standalone translator since posting the draft. It's terribly memory-inefficient, and probably run-time-inefficient too, but it has the right number of bytes in the output, looks right to me, and is doing LSB first. A rewrite is coming up. The exercise has provided a useful look under the hood of the CUDAPm1 checkpoint storage format.

Your commentary is exactly the sort of feedback I was looking for. Thanks Ernst.

2018-06-10, 14:25   #9
kriesel


Quote:
 Originally Posted by preda I see two distinct ways to identify the field: one is the line number ("third line"), and the other is the name of the field ("Type"). I think we should settle for one or the other. (my preference would be for field-ids (being explicit), not line position). (If the line position is used, then no field-id should be included). I'll give it more thought.
A little redundancy is a good thing.
1) If there is no redundancy in a format, errors could go undetected.
2) A fixed position means there's no need to search for the type; we know where to find it.
3) Unlikely, but if there isn't well-formed type information in record three, the file was probably written or transmitted incorrectly.
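A sketch of how that redundancy would be used on import, checking the Type record both by fixed position (the draft's "third line") and by its field-id. The set of valid types here is an assumption for illustration:

```python
def check_type_record(lines):
    """Validate the 'Type' record both ways discussed in the thread:
    by fixed position (record three, per draft section 3.5) and by
    field-id (the redundant 'Type' label)."""
    rec = lines[2]                      # fixed position: third record
    if not rec.startswith("Type "):    # field-id check catches mangled files
        raise ValueError("record three is not a well-formed Type record")
    kind = rec.split(None, 1)[1]
    if kind not in {"LL", "PRP", "P-1"}:   # assumed set of computation types
        raise ValueError("unknown computation type: " + kind)
    return kind
```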

Having the type record early in the file is useful, I think. It tells how to interpret the rest.

Flexibility is also a good thing. The type-specific header items are not all specified as to position. They may not benefit from any particular ordering, as far as the programs are concerned.

In the header, a convention on ordering, plus clear labeling, makes it more user-friendly and readable, I think. A familiar, stable ordering makes it easier to see if something is missing or mangled. For a CUDAPm1 export translator, I used an ordering that matches, in most respects, the sequential storage of the data in the checkpoint file. That would be convenient for the program authors.

There are some features not present in this draft spec that are common in other standards for somewhat similar purposes. For example, IGES includes counts of how many records there should be in a particular section, as a separate record. https://en.wikipedia.org/wiki/IGES#File_format shows the ugly old fixed format and Hollerith character handling of this format, developed 30 years ago for systems that were not new then. It looks like punch card data from 40+ years ago.
STEP followed: https://en.wikipedia.org/wiki/ISO_10303

Something I'm considering is whether there should be an explicit byte count for the bulk data, included in the header; something like "Bulk Data Size 12345678". For a given application and computation type it's perhaps derivable from the exponent, possibly requiring additional parameters, but it might be more convenient to have it there for comparison.
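For the full-residue case, the derivation from the exponent is just ceil(p/8) bytes for a residue mod 2^p-1 (whether the format stores exactly this, or padded data, is an assumption; an explicit "Bulk Data Size" field would make the check unambiguous):

```python
def expected_residue_bytes(p):
    # Bytes needed to hold a value mod 2^p - 1: ceil(p/8), via integer arithmetic.
    return (p + 7) // 8
```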

Also, is it worth putting a compact header CRC after the "End of Header" record? As things stand now, there's no header CRC, line count, byte count, etc.
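If such a field were adopted, CRC-32 over the header bytes would be one cheap option; a sketch (the draft currently specifies no such field, and the choice of CRC-32 and the 8-hex-digit rendering are assumptions):

```python
import zlib

def header_crc(header_bytes):
    # CRC-32 of everything up to and including the End of Header record,
    # rendered as 8 uppercase hex digits.
    return format(zlib.crc32(header_bytes) & 0xFFFFFFFF, "08X")
```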

Data transmission has become much more reliable over the past 30 years, so maybe we can be a bit more relaxed about redundancy. But data sizes have gone up a lot: 77M bits vs. 110k, over those 30 years. http://primes.utm.edu/mersenne/index.html#known

Last fiddled with by kriesel on 2018-06-10 at 14:31

2018-06-10, 15:28   #10
M344587487
"Composite as Heck"

Oct 2017

For simplicity, consider locking down the newlines in the header to \n only (excluding \r\n), and ending the header with 0x00. Easily extracting the header as a C-string, and separating header from data, is a useful feature IMO.
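A sketch of the header/data split that the 0x00 terminator enables, shown here in Python rather than C for brevity. Splitting the decoded header on line boundaries also tolerates both \n and \r\n terminators on input (the function name is hypothetical; this is not part of the draft yet):

```python
def split_header(blob):
    # Everything before the first 0x00 byte is the ASCII header
    # (the equivalent of reading it as a C-string); everything after
    # is the binary bulk data.
    nul = blob.index(b"\x00")
    header_lines = blob[:nul].decode("ascii").splitlines()
    return header_lines, blob[nul + 1:]
```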
2018-06-10, 16:24   #11
kriesel


Quote:
 Originally Posted by M344587487 For simplicity consider locking down the newlines in the header to \n only (excluding \r\n), and end the header with 0x00. Easily extracting the header as a C-string and separating header from data is a useful feature IMO.
Thanks for your input. As often happens, it's led to me thinking about some other aspects too.

I'm guessing what you suggest for header line separators would be popular with Linux users, and unpopular with Windows users, for whom the absence of 0d 0a means no line wrapping occurs until each file is opened in a suitable editor and saved again (which is likely to hose the binary data following). The opposite case, \r\n regardless of OS, would mean Linux systems show double-spaced header lines. Linux versions of import software could simply ignore zero-length header lines in files received from Windows.
The current, OS-dependent termination allows a common use case, exchange between disparate programs on the same system, plus user review of the files, without such nuisances. I'm in the mostly-Windows camp. Part of the motivation for defining the draft as I did is to make it user-friendly. (That's why there's an ASCII hexadecimal choice for the bulk data.) I'd rather not get drawn into the \n vs. \r\n and OS supremacy and popularity battles. https://stackoverflow.com/questions/...-and-r#1761086 (I had an instance of mfakto on Windows 10 exhibiting the \n-only behavior, and it's an ongoing nuisance to look at its logs; things are not in columns as they should be, and it's much less readable as a result. Mfakto does not do that on Windows 7.)
Perhaps the exchange software on all OSes ought to be prepared to deal with both header record termination cases as input, and to translate to its host OS convention automatically for local user convenience, like some modern text editors do.
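A minimal sketch of that accept-both, write-host-convention behavior (the function name is hypothetical; splitlines accepts \n, \r\n, and \r alike, and os.linesep supplies the host convention):

```python
import os

def normalize_header_text(text):
    # Accept any line-termination convention on import, rewrite with the
    # host OS convention for local user viewing.
    return os.linesep.join(text.splitlines()) + os.linesep
```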

Since the header being readable ASCII ensures 0x00 won't be present there, I see no immediate issue with having a 0x00 byte between the End of Header record's termination byte(s) and the bulk data. I suppose I should nail down the header character set some; no control characters (bell, tab, del, esc, etc.) in the header section except record terminations.

I think there's no interest in having the binary aligned to a quad-byte boundary. If there were, it could be accomplished with a varying number of nul bytes. A single nul is simpler.

