mersenneforum.org  

Old 2020-09-08, 22:35   #243
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

37·193 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
So I got some literally free AWS hardware and over the past week I've been running mprime v30b3 on an m5.12xlarge. The 8 GB disk I gave it filled up at some point during the past week with .residues files. Would it be possible to have mprime/Prime95 configure itself appropriately based on the available disk space at the time a PRP test is started? It would have been fine with a Proof Power of 7 for the two 110M exponents it's working on.

After increasing the disk size to 50 GB, mprime wrote out a 3.3 GB and a 14 GB file. It should have never thought it would be possible to write out a 14 GB file in the first place.

Seems processing was completely stuck for 5 or 6 days. The system has 186 GB of memory free, so it could have kept processing....
There are settings for these things (disk space to use and emergency memory). Assuming you left settings at the default (6GB disk / worker), I'd be interested in why prime95 was trying to write a 14GB file. Can you send screen outputs, log files, etc. that might shed some light on what happened.
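As a rough illustration of the request above (sizing temp files against the disk space actually free when a PRP test starts), here is a minimal Python sketch. It is not mprime's code: `effective_temp_budget`, `configured_limit_bytes`, and `reserve_bytes` are hypothetical names, and the 6 GiB figure only mirrors the default per-worker limit mentioned in this thread. `shutil.disk_usage` is a standard-library call.

```python
import shutil

GIB = 1024 ** 3


def effective_temp_budget(path: str,
                          configured_limit_bytes: int,
                          reserve_bytes: int = 1 * GIB) -> int:
    """Bytes a worker may use for temp files under `path`.

    The configured limit is capped by the space that is really free,
    minus a safety reserve for logs and save files (hypothetical policy).
    """
    free = shutil.disk_usage(path).free
    return max(0, min(configured_limit_bytes, free - reserve_bytes))


if __name__ == "__main__":
    budget = effective_temp_budget(".", 6 * GIB)
    print(f"usable temp space: {budget / GIB:.2f} GiB")
```

With two workers sharing one small disk, the free space would also have to be divided between them; that policy is left out of the sketch.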
Old 2020-09-08, 22:50   #244
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

37·193 Posts
Default Build 5

Build 5 is now available. Other than a couple of little fixes, this version uses a new proof file format that is, on average, 0.000128% faster with default settings on a 100Mbit exponent!

OK, that's not very impressive. The proof files are about 5% easier to process on the server plus they allow us to calculate the res64 value from the proof file -- a small extra level of checking.
Old 2020-09-09, 00:25   #245
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

2·1,439 Posts
Default

Quote:
Originally Posted by Prime95 View Post
There are settings for these things (disk space to use and emergency memory). Assuming you left settings at the default (6GB disk / worker), I'd be interested in why prime95 was trying to write a 14GB file. Can you send screen outputs, log files, etc. that might shed some light on what happened.
local.txt was configured entirely by using the menu system or automatically
prime.txt was primarily configured by me
ls -l.txt shows the system as it is now
screen snapshot.txt shows the error messages I was seeing. I don't have screen logs from when the errors started, but the errors went away once I added more space

It's possible the system wasn't actually idle during the past week. These 110M exponents take a long time.
Attached Files
File Type: txt ls -l.txt (4.8 KB, 22 views)
File Type: txt local.txt (683 Bytes, 19 views)
File Type: txt prime.txt (541 Bytes, 22 views)
File Type: txt screen snapshot.txt (6.9 KB, 18 views)
File Type: txt worktodo.txt (143 Bytes, 20 views)
Old 2020-09-09, 00:29   #246
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

101100111110₂ Posts
Default

Quote:
Originally Posted by Prime95 View Post
There are settings for these things (disk space to use and emergency memory). Assuming you left settings at the default (6GB disk / worker), I'd be interested in why prime95 was trying to write a 14GB file. Can you send screen outputs, log files, etc. that might shed some light on what happened.
Attached here are the logs.

gwnum.txt was truncated at 16 KB; I guess running out of space for writes will do that

One strange thing I did notice was that the expected completion at PrimeNet had these exponents as finishing yesterday and today. Once I added disk space, I did stop and restart mprime, which then communicated the current expected time for the exponents to finish. That's why I thought processing may have completely stopped.
Attached Files
File Type: txt gwnum.txt (16.0 KB, 19 views)
File Type: log prime.log (20.8 KB, 22 views)
File Type: txt results.txt (4.0 KB, 16 views)
File Type: txt results.json.txt (8.8 KB, 18 views)

Last fiddled with by Mark Rose on 2020-09-09 at 00:30
Old 2020-09-09, 01:50   #247
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

37×193 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
local.txt was configured entirely by using the menu system or automatically
prime.txt was primarily configured by me
ls -l.txt shows the system as it is now.
From the file sizes, it is clear that worker #1 is trying to generate a power=10 proof (14GB) and worker #2 is generating the expected power=8 (3GB) proof.

The question is, was mprime's temp disk space ever set to more than 6GB -- specifically at the time worker #1 started its PRP assignment? If not, and I suspect it never was, then mprime has a bug that under some conditions makes it miscalculate the proper proof power to use.
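The file sizes quoted above are consistent with a simple model in which the interim residues file holds 2^power residues of about exponent/8 bytes each. The Python sketch below works through that arithmetic and picks the largest power that fits a disk budget. It is an illustration under that assumption, not mprime's actual selection logic; the function names and the `max_power` / `min_power` bounds are hypothetical, and any file header or overhead is ignored.

```python
from typing import Optional

GIB = 1024 ** 3


def residues_file_bytes(exponent: int, power: int) -> int:
    """Approximate size of the interim residues file.

    Assumption: 2**power residues, each ceil(exponent / 8) bytes.
    """
    return (2 ** power) * ((exponent + 7) // 8)


def largest_power_that_fits(exponent: int,
                            budget_bytes: int,
                            max_power: int = 10,
                            min_power: int = 1) -> Optional[int]:
    """Highest proof power whose residues file fits in the budget, else None."""
    for power in range(max_power, min_power - 1, -1):
        if residues_file_bytes(exponent, power) <= budget_bytes:
            return power
    return None


if __name__ == "__main__":
    p = 110_000_000  # "110M" exponent, rounded for illustration
    for power in (7, 8, 10):
        print(power, residues_file_bytes(p, power) / 1e9, "GB")
    print("power for a 6 GiB budget:", largest_power_that_fits(p, 6 * GIB))
```

Under this model a 110M exponent needs roughly 3.5 GB at power 8 and roughly 14 GB at power 10, which matches the two files seen on disk, and a 6 GiB budget yields power 8.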
Old 2020-09-09, 02:03   #248
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

7141₁₀ Posts
Default

Another weird thing. At start of the PRP test, the entire interim residues file is allocated. If an error occurs, mprime drops the proof power down. Why did this fail-safe not work?
Old 2020-09-09, 02:36   #249
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

2²×7×317 Posts
Default

Stupid question: can it be that the resources on those AWS clusters vary in some cases by occupancy or by method of interrogation? (I.e., when you ask how much space is free on disk or in memory, you get larger values, but in reality, or later as the cluster gets busier, the values are smaller / restricted, etc. You know what I mean, that's why it is called "elastic" computing, haha. Sorry for the stupidity, I am the layman here, I never played with EC2.)
Old 2020-09-09, 02:49   #250
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

5476₈ Posts
Default

Quote:
Originally Posted by Prime95 View Post
From the file sizes, it is clear that worker #1 is trying to generate a power=10 proof (14GB) and worker #2 is generating the expected power=8 (3GB) proof.

The question is, was mprime's temp disk space ever set to more than 6GB -- specifically at the time worker #1 started its PRP assignment? If not, and I suspect it never was, then mprime has a bug that under some conditions makes it miscalculate the proper proof power to use.
Quote:
Originally Posted by Prime95 View Post
Another weird thing. At start of the PRP test, the entire interim residues file is allocated. If an error occurs, mprime drops the proof power down. Why did this fail-safe not work?
This is a brand new installation. I had never configured it to be anything other than 6 GB, and I believe I had done that via the menus when I first set it up.

I had initially copied my config file which tells mprime to do DC work, with WorkPreference=151 iirc. mprime fetched 6 LLDC assignments for 6 workers. I then ran some benchmarks and determined the optimal number of workers to run on this hardware was 2. I stopped mprime, edited worktodo.txt for two workers and modified local.txt from 6 workers 4 cores each to 2 workers 12 cores each.

It hadn't started any PRP work until three or so days later, well after I made changes to config files. I didn't configure it to do PRP, but somehow the WorkPreference got changed to 0.

It does appear worker #1 is significantly ahead of worker #2 and started doing its PRP assignment 16 hours earlier based on the NF-PM1 results I see at mersenne.org.

Also, mersenne.org is still showing 6 workers for the CPU instead of the 2 that are configured. Don't know if that's a bug with mprime or mersenne.org.
Old 2020-09-09, 02:52   #251
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013

2·1,439 Posts
Default

Quote:
Originally Posted by LaurV View Post
Stupid question: can it be that the resources on those AWS clusters vary in some cases by occupancy or by method of interrogation? (I.e., when you ask how much space is free on disk or in memory, you get larger values, but in reality, or later as the cluster gets busier, the values are smaller / restricted, etc. You know what I mean, that's why it is called "elastic" computing, haha. Sorry for the stupidity, I am the layman here, I never played with EC2.)
Not in this case. This instance is using the ext4 filesystem, with an initial size of 8 GB. I manually grew it. Log files and whatnot can consume a little space, but nothing that would make it think there was more disk available.

You could get weird stuff like that using btrfs in raid1.

AWS also has an NFS implementation they call EFS, where I suppose that could happen.
Old 2020-09-09, 03:59   #252
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

37·193 Posts
Default

Quote:
Originally Posted by Mark Rose View Post
I didn't configure it to do PRP, but somehow the WorkPreference got changed to 0.
We're not going to be able to reproduce this problem. It looks like memory corruption. WorkPreference and allowable temp disk space are both stored in global variables. Given AWS' proven reliability, we conclude it is a rather rare program bug.

I'll try testing running out of disk space on a Windows box to see if I can discover why this was not handled better.
Old 2020-09-09, 04:52   #253
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

37·193 Posts
Default

Quote:
Originally Posted by Prime95 View Post
I'll try testing running out of disk space on a Windows box to see if I can discover why this was not handled better.
Aha - a bug.

From your results.txt:

Code:
[Mon Aug 31 09:07:44 2020]
Error pre-allocating proof interim residues file
Errno: 2, No such file or directory
Will use proof power 10 instead of 8.
The code that was supposed to reduce the proof power when an error occurs preallocating the disk space was in fact increasing the proof power.
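For readers following along, the intended fail-safe can be sketched as a loop that only ever steps the power down after a failed pre-allocation. This Python version is a hypothetical illustration, not the mprime source: `choose_power_with_fallback` and the `try_preallocate` callback are made-up names, and the callback stands in for whatever really reserves the interim residues file.

```python
from typing import Callable, Optional


def choose_power_with_fallback(preferred_power: int,
                               try_preallocate: Callable[[int], bool],
                               min_power: int = 1) -> Optional[int]:
    """Return the first power at or below `preferred_power` that can be allocated.

    `try_preallocate(power)` must return True on success, False on failure.
    Returns None if no power down to `min_power` can be allocated.
    """
    power = preferred_power
    while power >= min_power:
        if try_preallocate(power):
            return power
        power -= 1  # fall back to a SMALLER proof, never a larger one
    return None


if __name__ == "__main__":
    residue_bytes = 13_750_000    # about 110M bits per residue (assumed)
    free_bytes = 2_000_000_000    # pretend only 2 GB can be allocated

    def fake_preallocate(power: int) -> bool:
        # Stand-in for a real allocation: succeeds only if the file fits.
        return (2 ** power) * residue_bytes <= free_bytes

    print(choose_power_with_fallback(8, fake_preallocate))
```

The point of the design is that the search is bounded above by the preferred power, so an allocation error can never lead to a larger file than the one that just failed.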

Now the only unexplained phenomenon is the work preference changing.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.