2020-09-08, 22:35
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

37·193 Posts

Quote:
 Originally Posted by Mark Rose So I got some literally free AWS hardware and over the past week I've been running mprime v30b3 on an m5.12xlarge. The 8 GB disk I gave it filled up at some point during the past week with .residues files. Would it be possible to have mprime/Prime95 configure itself appropriately based on the available disk space at the time a PRP test is started? It would have been fine with a Proof Power of 7 for the two 110M exponents it's working on. After increasing the disk size to 50 GB, mprime wrote out a 3.3 GB and a 14 GB file. It should have never thought it would be possible to write out a 14 GB file in the first place. Seems processing was completely stuck for 5 or 6 days. The system has 186 GB of memory free, so it could have kept processing....
There are settings for these things (disk space to use and emergency memory). Assuming you left settings at the default (6GB disk / worker), I'd be interested in why prime95 was trying to write a 14GB file. Can you send screen outputs, log files, etc. that might shed some light on what happened?

 Build 5 is now available. Other than a couple of little fixes, this version uses a new proof file format that is, on average, 0.000128% faster with default settings on a 100Mbit exponent! OK, that's not very impressive. The proof files are about 5% easier to process on the server plus they allow us to calculate the res64 value from the proof file -- a small extra level of checking.
2020-09-09, 00:25
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

2·1,439 Posts

Quote:
 Originally Posted by Prime95 There are settings for these things (disk space to use and emergency memory). Assuming you left settings at the default (6GB disk / worker), I'd be interested in why prime95 was trying to write a 14GB file. Can you send screen outputs, log files, etc. that might shed some light on what happened?
local.txt was configured entirely by using the menu system or automatically
prime.txt was primarily configured by me
ls -l.txt shows the system as it is now
screen snapshot.txt shows the error messages I was seeing. I don't have screen logs from when the errors started, but the errors went away once I added more space

It's possible the system wasn't actually idle during the past week. These 110M exponents take a long time.
2020-09-09, 00:29
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

101100111110₂ Posts

Quote:
 Originally Posted by Prime95 There are settings for these things (disk space to use and emergency memory). Assuming you left settings at the default (6GB disk / worker), I'd be interested in why prime95 was trying to write a 14GB file. Can you send screen outputs, log files, etc. that might shed some light on what happened?
Attached here are the logs.

gwnum.txt was truncated at 16 KB; I guess running out of space for writes will do that

One strange thing I did notice was that the expected completion at PrimeNet had these exponents as finishing yesterday and today. Once I added disk space, I did stop and restart mprime, which then communicated the current expected time for the exponents to finish. That's why I thought processing may have completely stopped.
Last fiddled with by Mark Rose on 2020-09-09 at 00:30

2020-09-09, 01:50
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

37×193 Posts

Quote:
 Originally Posted by Mark Rose local.txt was configured entirely by using the menu system or automatically prime.txt was primarily configured by me ls -l.txt shows the system as it is now.
From the file sizes, it is clear that worker #1 is trying to generate a power=10 proof (14GB) and worker #2 is generating the expected power=8 (3GB) proof.

The question is, was mprime's temp disk space ever set to more than 6GB -- specifically at the time worker #1 started its PRP assignment? If not, and I suspect it never was, then mprime has a bug where under some conditions it miscalculates the proper proof power to use.
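As a rough cross-check of those file sizes, here is a minimal Python sketch. It assumes (this is not taken from the mprime source) that the interim residues file holds 2^power residues of about p/8 bytes each. The function names and the decimal 6 GB budget are made up for illustration.

```python
import math


def residue_bytes(exponent: int) -> int:
    """Bytes for one residue mod 2^exponent - 1 (assumed: exponent bits, rounded up)."""
    return math.ceil(exponent / 8)


def proof_temp_bytes(exponent: int, power: int) -> int:
    """Assumed size of the interim residues file: 2^power stored residues."""
    return (2 ** power) * residue_bytes(exponent)


def max_power_for_budget(exponent: int, budget_bytes: int, max_power: int = 10) -> int:
    """Largest power in 1..max_power whose file fits the budget, or 0 if none fits."""
    for power in range(max_power, 0, -1):
        if proof_temp_bytes(exponent, power) <= budget_bytes:
            return power
    return 0


if __name__ == "__main__":
    p = 110_000_000  # round stand-in for "a 110M exponent"
    for power in (8, 10):
        print(power, proof_temp_bytes(p, power) / 1e9, "GB")
    print("power for a 6 GB budget:", max_power_for_budget(p, 6 * 10**9))
```

Under that assumption a 110M exponent needs about 3.5 GB at power 8 and about 14 GB at power 10, roughly in line with the two files observed, and a 6 GB budget tops out at power 8.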

 2020-09-09, 02:03
Prime95

Another weird thing. At start of the PRP test, the entire interim residues file is allocated. If an error occurs, mprime drops the proof power down. Why did this fail-safe not work?
 2020-09-09, 02:36
LaurV

Stupid question: can it be that the resources on those AWS clusters vary in some cases by occupancy or by method of interrogation? (i.e. when you ask how much space is free on disk or in memory, you get larger values, but in reality, or later as the cluster gets busier, the values are smaller / restricted, etc. - you know what I mean, that's why it's called "elastic" computing, haha. Well, sorry for the stupidity, I am the layman here, I never played with EC2.)
2020-09-09, 02:49
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

5476₈ Posts

Quote:
 Originally Posted by Prime95 From the file sizes, it is clear that worker #1 is trying to generate a power=10 proof (14GB) and worker #2 is generating the expected power=8 (3GB) proof. The question is, was mprime's temp disk space ever set to more than 6GB -- specifically at the time worker #1 started its PRP assignment? If not, and I suspect it never was, then mprime has a bug where under some conditions it miscalculates the proper proof power to use.
Quote:
 Originally Posted by Prime95 Another weird thing. At start of the PRP test, the entire interim residues file is allocated. If an error occurs, mprime drops the proof power down. Why did this fail-safe not work?
This is a brand new installation. I had never configured it to be anything other than 6 GB, and I believe I had done that via the menus when I first set it up.

I had initially copied my config file which tells mprime to do DC work, with WorkPreference=151 iirc. mprime fetched 6 LLDC assignments for 6 workers. I then ran some benchmarks and determined the optimal number of workers to run on this hardware was 2. I stopped mprime, edited worktodo.txt for two workers and modified local.txt from 6 workers 4 cores each to 2 workers 12 cores each.

It hadn't started any PRP work until three or so days later, well after I made changes to config files. I didn't configure it to do PRP, but somehow the WorkPreference got changed to 0.

It does appear worker #1 is significantly ahead of worker #2 and started doing its PRP assignment 16 hours earlier based on the NF-PM1 results I see at mersenne.org.

Also, mersenne.org is still showing 6 workers for the CPU instead of the 2 that are configured. Don't know if that's a bug with mprime or mersenne.org.

2020-09-09, 02:52
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

2·1,439 Posts

Quote:
 Originally Posted by LaurV Stupid question: can it be that the resources on those AWS clusters vary in some cases by occupancy or by method of interrogation? (i.e. when you ask how much space is free on disk or in memory, you get larger values, but in reality, or later as the cluster gets busier, the values are smaller / restricted, etc. - you know what I mean, that's why it's called "elastic" computing, haha. Well, sorry for the stupidity, I am the layman here, I never played with EC2.)
Not in this case. This instance is using the ext4 filesystem, with an initial size of 8 GB. I manually grew it. Log files and whatnot can consume a little space, but nothing that would make it think there was more disk available.

You could get weird stuff like that using btrfs in raid1.

AWS also has an NFS implementation they call EFS, where I suppose that could happen.

2020-09-09, 03:59
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

37·193 Posts

Quote:
 Originally Posted by Mark Rose I didn't configure it to do PRP, but somehow the WorkPreference got changed to 0.
We're not going to be able to reproduce this problem. It looks like memory corruption. WorkPreference and Allowable temp disk space are both stored in global variables. Given AWS' proven reliability, we conclude it is a rather rare program bug.

I'll try testing running out of disk space on a Windows box to see if I can discover why this was not handled better.

2020-09-09, 04:52
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

37·193 Posts

Quote:
 Originally Posted by Prime95 I'll try testing running out of disk space on a Windows box to see if I can discover why this was not handled better.
Aha - a bug.

Code:
[Mon Aug 31 09:07:44 2020]
Error pre-allocating proof interim residues file
Errno: 2, No such file or directory
Will use proof power 10 instead of 8.
The code that was supposed to reduce the proof power when an error occurs while preallocating the disk space was in fact increasing the proof power.
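For illustration, a fail-safe of the intended kind could look like the Python sketch below. This is not mprime's code: the names are hypothetical, stepping down one power at a time is an assumption, and `simulated_disk` stands in for a real preallocation call.

```python
import errno
import os
from typing import Callable

# An allocator reserves `size` bytes and raises OSError when it cannot.
Allocator = Callable[[int], None]


def simulated_disk(free_bytes: int) -> Allocator:
    """Stand-in for a real preallocation call (e.g. os.posix_fallocate on Linux)."""
    def allocate(size: int) -> None:
        if size > free_bytes:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
    return allocate


def power_after_prealloc(residue_bytes: int, power: int, allocate: Allocator) -> int:
    """Return the proof power whose residues file could be reserved, or 0 if none."""
    while power > 0:
        try:
            allocate((2 ** power) * residue_bytes)
            return power
        except OSError:
            power -= 1  # step DOWN on failure; stepping up is the bug described above
    return 0


if __name__ == "__main__":
    residue = 13_750_000  # about 110,000,000 bits, an assumed round figure
    print(power_after_prealloc(residue, 10, simulated_disk(6 * 10**9)))
```

With roughly 6 GB free, a request for power 10 falls back to power 8 here, which is the direction the log message above should have gone.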

Now the only unexplained phenomenon is the work preference changing.