mersenneforum.org, PrimeNet forum: OFFICIAL "SERVER PROBLEMS" THREAD (https://www.mersenneforum.org/showthread.php?t=5758)

snme2pm1 2014-05-24 02:23

[QUOTE=chalsall;374104]Question everything[/QUOTE]

I come from an early background of electronics fault finding, and later entered into computing.
I have learned to not assume that repairing a single fault will resolve an issue. Multiple faults can seriously frustrate fault finding efforts, and isolation techniques can assist.
The experience of diagnosing a piece of equipment on the workbench is, of course, not comparable to the circumstances confronting an active, world-facing server with all sorts of dynamics, where nasty agents are constantly in play.

TheMawn 2014-05-25 17:47

Well, I'm back. Looks like we had some fun while I was away.

LaurV 2014-05-26 03:25

[QUOTE=TheMawn;374252]Well, I'm back. Looks like we had some fun while I was away.[/QUOTE]
So, was that because of you ? :tantrum:
Next time you stay here! don't go anywhere!

kracker 2014-05-26 03:33

[QUOTE=LaurV;374284]So, was that because of you ? :tantrum:
Next time you stay here! don't go anywhere![/QUOTE]


Well.... :razz:

Chuck 2014-05-30 13:15

MISFIT not uploading results
 
I can access Primenet pages but MISFIT cannot upload results.

[CODE]
5/30/2014 9:07:06 AM:Stand by for queue check...
5/30/2014 9:07:11 AM:Checking GIOM_STAGED for files to upload
5/30/2014 9:07:11 AM:Found 5 file(s)
5/30/2014 9:07:11 AM:Begin upload process for GIOM_STAGED\130459223203721471-9c23e.txt
5/30/2014 9:08:42 AM:Error! The remote server returned an error: (502) Bad Gateway.
Will try the upload again in about 30 minutes...
[/CODE]

lycorn 2014-05-30 14:28

There must be some issue with uploading "Factors found" lines. It's now more than 3 hours since the server last recorded an "F" in the Recent Cleared report.

James Heinrich 2014-05-30 14:47

[QUOTE=lycorn;374589]There must be some issue with uploading "Factors found" lines. It's now more than 3 hours since the server last recorded an "F" in the Recent Cleared report[/QUOTE]Factors are validated on submission. That's where the memory leak occurs: in the spawned process that validates the submitted factor. Eventually it causes primenet to grind to a halt, but in this intermediate state between working and broken you can submit all the no-factor results you want just fine, while any factor report will hang. Once George gives it a swift kick in the reboot it'll be back to normal.

LaurV 2014-05-30 15:07

Same here. Misfit crashed today (?!?!?!?); when I came back from work I restarted it and tried to report, and it crashed again. I removed all "factor found" lines from all files and reported again, and that went through immediately. When trying to report factors there is an "error 502 (bad gateway)", i.e. Misfit says that "the server reported that..." and it crashes (i.e. becomes non-responsive for a while). This is very strange. I went back to an older Misfit (2.6.4), which does not crash but still gets the same error and can't report factors.

Backed them up for now :smile:
Will report them later.

brilong 2014-05-30 15:47

I've also experienced multiple problems over the last few weeks. At first I thought it was the submit_spider Perl script I'm using. Occasionally it gives output like this (missing the number before GHz):

[code]
20140530_144705 INFO: M69277711 submitted; 13.8069 GHz Days credit.
Use of uninitialized value $GHzDays in concatenation (.) or string at /home/horde/GIMPS/gpu0/submit_spider line 154, <IN> line 12.
20140530_144728 INFO: M69277711 submitted; GHz Days credit.
[/code]

It turns out the server is causing this issue (notice the 23 second gap). I've been able to resubmit the same entry again and it usually works fine, but it's a manual process.

I've also had issues submitting new factors. Even when I try to submit them on the Manual webpage, I get an error:
[code]CGI Timeout

The specified CGI application exceeded the allowed time for processing. The server has deleted the process.[/code]

It appears submit_spider needs to be improved to catch these errors, sleep a while and try resubmitting. It does NOT label the results file "not_submitted" with the bad entry in this particular case.
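
A minimal sketch of that "catch the error, sleep a while, try resubmitting" loop, written in Python rather than the Perl of submit_spider. The `submit` callable, the attempt count and the delays are assumptions for illustration only; `submit` is expected to return True only when the server reply actually confirms the result (for example, a GHz Days credit figure is present), and anything else is treated as a failure:

```python
import time
from typing import Callable, Iterable, List, Tuple


def submit_with_retry(
    lines: Iterable[str],
    submit: Callable[[str], bool],   # hypothetical: True only on a confirmed submission
    max_attempts: int = 3,           # assumed value, tune to taste
    base_delay: float = 1800.0,      # assumed: 30 minutes before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str]]:
    """Submit each result line, retrying with a doubling delay.

    Returns (accepted, not_submitted) so the caller can write the second
    list to a "not_submitted" file instead of losing it.
    """
    accepted: List[str] = []
    not_submitted: List[str] = []
    for line in lines:
        for attempt in range(1, max_attempts + 1):
            try:
                ok = submit(line)
            except Exception:
                # CGI timeouts, 502s and the like all count as "try again".
                ok = False
            if ok:
                accepted.append(line)
                break
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
        else:
            # Every attempt failed: keep the line for a later run.
            not_submitted.append(line)
    return accepted, not_submitted
```

Treating an unconfirmed reply as a failure is the point here: a line only counts as submitted once the server has said so.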

chalsall 2014-05-30 16:10

[QUOTE=brilong;374601]It appears submit_spider needs to be improved to catch these errors, sleep a while and try resubmitting. It does NOT label the results file "not_submitted" with the bad entry in this particular case.[/QUOTE]

Thank you for the detailed bug report. I will try to find cycles to deal with this this weekend, and report when an update is available dealing with this issue.

kracker 2014-05-30 17:30

Yep, primenet is down ([I]again[/I])...

kladner 2014-05-30 17:33

Same here: UTC 1733.

EDIT: Back up by UTC 1825.

Prime95 2014-05-30 18:28

She's rebooted.

I'm not convinced the factoring program is the problem -- though there definitely is a problem related to it. The factoring program the server uses is msieve, albeit a very old version, which is fairly high quality software. The problem could just as easily be some other resource leakage during the spawning of the msieve process by SQLServer. In theory, any resource leak by msieve should be cleaned up when the msieve process terminates.

When Primenet was originally written, msieve was not called for factors less than something like 30 digits as we were TFing in mid-60s bit area (msieve is only called to test if the factor is composite and break it down into prime factors). This worked well until someone started submitting composite 25-or-so digit factors of Mersenne numbers that already had known tiny factors (P-1 or ECM on small Mersennes).
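
For readers unfamiliar with what such a check involves, here is a toy Python sketch. It is not the server's code: the divisibility test is the standard fact that f divides 2^p - 1 exactly when 2^p ≡ 1 (mod f), and the trial-division routine is only a stand-in for msieve that works for tiny numbers. Both function names are made up for this example:

```python
from typing import List


def divides_mersenne(p: int, f: int) -> bool:
    """True if f divides 2**p - 1, i.e. 2**p is congruent to 1 modulo f."""
    return f > 1 and pow(2, p, f) == 1


def split_into_primes(n: int) -> List[int]:
    """Toy trial-division factorization (a stand-in for msieve, tiny inputs only)."""
    factors: List[int] = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors
```

For example, M11 = 2047 = 23 × 89, so a submitted "factor" of 2047 passes the divisibility test but is composite and splits into 23 and 89.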

BTW, does anyone know how to auto-reboot the MS SQLServer and IIS services?

kracker 2014-05-30 18:33

[QUOTE=Prime95;374615]She's rebooted.

I'm not convinced the factoring program is the problem -- though there definitely is a problem related to it. The factoring program the server uses is msieve, albeit a very old version, which is fairly high quality software. The problem could just as easily be some other resource leakage during the spawning of the msieve process by SQLServer. In theory, any resource leak by msieve should be cleaned up when the msieve process terminates.

When Primenet was originally written, msieve was not called for factors less than something like 30 digits as we were TFing in mid-60s bit area (msieve is only called to test if the factor is composite and break it down into prime factors). This worked well until someone started submitting composite 25-or-so digit factors of Mersenne numbers that already had known tiny factors (P-1 or ECM on small Mersennes).

BTW, does anyone know how to auto-reboot the MS SQLServer and IIS services?[/QUOTE]

Dunno, try updating msieve?

chalsall 2014-05-30 19:05

[QUOTE=Prime95;374615]She's rebooted.[/QUOTE]

Thanks George.

[QUOTE=Prime95;374615]The problem could just as easily be some other resource leakage during the spawning of the msieve process by SQLServer. In theory, any resource leak by msieve should be cleaned up when the msieve process terminates.[/QUOTE]

Agree. But, are you absolutely sure that MS SQLS is what launches msieve? I would have thought it would instead be the MS IIS which launches that sub-process.

[QUOTE=Prime95;374615]BTW, does anyone know how to auto-reboot the MS SQLServer and IIS services?[/QUOTE]

I don't personally. But I suspect there are many here skilled in such arts.

Very sincere regards.

James Heinrich 2014-05-30 20:25

[QUOTE=Prime95;374615]BTW, does anyone know how to auto-reboot the MS SQLServer and IIS services?[/QUOTE]You could use [url=http://clickhome.freshdesk.com/support/articles/107608-how-to-setup-windows-auto-scheduler-task-to-restart-clickhome-windows]this[/url] as a rough guide (it has pretty pictures), substituting the appropriate service name(s). Basically: make a batch file with [b]net stop [i]servicename[/i][/b] and [b]net start [i]servicename[/i][/b] and call it with Task Scheduler. For my WAMP development machine the batch file looks like this:[code]net stop apache2.2
net stop mysql
net start mysql
net start apache2.2[/code]I'm not sure what IIS and MSSQL services are called. Then you can schedule it to restart the services every day (or week or whatever), or when triggered by any number of events as allowed by Task Scheduler.

snme2pm1 2014-05-30 22:12

[QUOTE=Prime95;374615]The problem could just as easily be some other resource leakage[/QUOTE]

Many process consumption figures could be viewed at the server, and radical growth statistics localised.
I tend to use procexp (process explorer) but other tools exist.
By the way, what OS version is the server?

Prime95 2014-05-30 22:50

[QUOTE=snme2pm1;374634]By the way, what OS version is the server?[/QUOTE]

Um, Windows Server 2000. Isn't that the latest version?

science_man_88 2014-05-30 22:54

[QUOTE=Prime95;374637]Um, Windows Server 2000. Isn't that the latest version?[/QUOTE]

[URL="https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=windows+server+2012&safe=active"]windows server 2012[/URL]

quick search gets these results for an attempt at windows server versions.

kracker 2014-05-30 22:56

[QUOTE=Prime95;374637]Um, Windows Server 2000. Isn't that the latest version?[/QUOTE]

Heh... Try Wikipedia .

snme2pm1 2014-05-30 22:57

[QUOTE=Prime95;374637]Um, Windows Server 2000. Isn't that the latest version?[/QUOTE]

I sometimes spend a little time fiddling with a couple of 2003 boxes, which are close to WinXP lineage.
MS is so far still issuing updates for Server 2003.

axn 2014-05-31 06:39

[QUOTE=science_man_88;374639][URL="https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=windows+server+2012&safe=active"]windows server 2012[/URL]

quick search gets these results for an attempt at windows server versions.[/QUOTE]

[QUOTE=kracker;374640]Heh... Try Wikipedia .[/QUOTE]

Woosh!

snme2pm1 2014-06-09 22:41

[QUOTE=James Heinrich;369618]It's true, I'm failing. :no:

I did [Nov 2013] rewrite the results-parsing code (basically lifted directly from mersenne.ca) but I had great trouble with troubleshooting the process of inserting results into the database. I speak MySQL, Primenet speaks MS-SQL, and we don't get on. The process looks something like this: :bangheadonwall:

I guess this is my poke to have a 4th look at it and see if I can get somewhere closer than I was before. I used to be optimistic, that has gone, but I'll try taking another look.[/QUOTE]

I recently lodged TF results with first line ... has a factor, accidentally.
Yet this time such result was properly reported as F rather than F-PM1.
So do we believe this ancient problem is fixed now?

James Heinrich 2014-06-09 23:21

[QUOTE=snme2pm1;375459]So do we believe this ancient problem is fixed now?[/QUOTE]Not by anything I've done, unfortunately.
The wrong-factor-type issue only occurs under some circumstances. It is also possible that the ancient code guesses correctly that it's TF, depending (mostly) on the factor size.

Prime95 2014-06-10 00:13

[QUOTE=snme2pm1;375459]I recently lodged TF results with first line ... has a factor, accidentally.
Yet this time such result was properly reported as F rather than F-PM1.
So do we believe this ancient problem is fixed now?[/QUOTE]

No. As long as there is one "no factor" line all will be well. The line can appear anywhere in the submission.

brilong 2014-06-17 13:12

[QUOTE=chalsall;374605]Thank you for the detailed bug report. I will try to find cycles to deal with this this weekend, and report when an update is available dealing with this issue.[/QUOTE]

Any updates on an improved submit_spider script? I've had more issues where submit_spider saves results in a "submitted" file, but the server does not register the factor as completed. If I grab the results from the submitted file, append it to results.txt and rerun the spider, it works. This is a tedious process.

chalsall 2014-06-18 19:34

[QUOTE=brilong;376023]Any updates on an improved submit_spider script? I've had more issues where submit_spider saves results in a "submitted" file, but the server does not register the factor as completed. If I grab the results from the submitted file, append it to results.txt and rerun the spider, it works. This is a tedious process.[/QUOTE]

Did these situations involve the previously reported "undefined variable" issue?

I've been collecting cases from my own machine which I can use for experimentation.

I'm sorry if this has caused you problems. I'm working on it as much as I have time for.

NBtarheel_33 2014-06-18 20:07

Hey mods, this thread needs weedwhacked...
 
This thread would do well with a clean-up. It is to facilitate quick, urgent reports of server problems so that they can easily be spotted and addressed. But the thread has sort of stumbled off in every direction and is now 21 pages long. Some pruning, redirecting, and clean-up might be in order at this point.

kladner 2014-06-18 20:58

I agree about the cleanup. On the subject of prompt response to problems, please note the following from Old Man Prime Net, in the second post of the thread-
[QUOTE]For immediate communications please email me the particulars of any issue at [EMAIL="primenet@mersenne.org"]primenet@mersenne.org[/EMAIL].[B] I respond to email significantly earlier than forum posts.[/B][/QUOTE](emphasis mine)

chalsall 2014-06-21 15:42

Just a heads up... "Spidy" is seeing lots of errors from Primenet. Probably time for a restart before she becomes fully inoperative.

snme2pm1 2014-06-28 05:54

After entering username and password, server spews...
 
Warning: session_write_close() [function.session-write-close]: write failed: No space left on device (28) in C:\v5\www\2013\userid_session_state.inc.php on line 42

Warning: session_write_close() [function.session-write-close]: Failed to write session data (files). Please verify that the current setting of session.save_path is correct (/php/sessionstate/) in C:\v5\www\2013\userid_session_state.inc.php on line 42

snme2pm1 2014-06-28 07:08

Work distribution statistics void
 
[url]http://www.mersenne.org/primenet/[/url]
Main table was void for 6:01 UTC and also 7:00 UTC.

lycorn 2014-06-30 09:59

Can't access Primenet pages
I get either a "plain" CGI timeout or
Warning: odbc_pconnect() [function.odbc-pconnect]: SQL error: [Microsoft][ODBC SQL Server Driver]Timeout expired, SQL state S1T00 in SQLConnect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 21
pnErrorResult=3 pnErrorDetail=Database unavailable ==END==

And that's it...

Prime95 2014-06-30 12:30

Rebooted

tcharron 2014-07-01 01:04

Still some issues...
[code]
Warning: odbc_pconnect() [function.odbc-pconnect]: SQL error: [Microsoft][ODBC SQL Server Driver][Shared Memory]General network error. Check your network documentation., SQL state 08S01 in SQLConnect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 21
pnErrorResult=3 pnErrorDetail=Database unavailable ==END==[/code]

ric 2014-07-01 15:59

"Server database full or broken"
 
This is what I got a few moments ago (17:47 CEST), as per prime.log (running ver 26.6 on an xp-64 box)

[CODE]
PrimeNet error 13: Server database full or broken
ar Insert t_gimps_results_log failed: <my id> GUID: <redacted>, exponent: 279003281, C2D_CPU_GHz_days: 0.053567397141541
[/CODE]

Rodrigo 2014-07-01 16:52

Me, too.

A few minutes ago I tried to do a manual submission and it accepted the first couple dozen results, then those same errors started coming in. Retrying the failed results yielded the same errors.

Rodrigo

Gordon 2014-07-01 17:54

[QUOTE=Rodrigo;377124]Me, too.

A few minutes ago I tried to do a manual submission and it accepted the first couple dozen results, then those same errors started coming in. Retrying the failed results yielded the same errors.

Rodrigo[/QUOTE]

and now it's really flucked up. given that 3TB of disk space is about £60 how can any machine nowadays ever run out of storage....

**

Warning: odbc_exec() [function.odbc-exec]: SQL error: [Microsoft][ODBC SQL Server Driver][SQL Server]Could not allocate space for object 'dbo.t_gimps_results_log'.'ix_dt_received' in database 'primenet' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup., SQL state 37000 in SQLExecDirect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 284

Error code: 13, error text: ar Insert t_gimps_results_log failed: nitro GUID: xxxxxxxxxxxxxxxxx, exponent: 69045971, C2D_CPU_GHz_days: 27.706488236821
Processing result: no factor for M69045997 from 2^73 to 2^74 [mfaktc 0.20 barrett76_mul32_gs]

Chuck 2014-07-01 19:10

I am having similar problems; I am stopping MISFIT automatic uploads until this gets resolved.

TheMawn 2014-07-01 20:18

Same thing. About three hours ago, my results wouldn't go in. Prime95 tried again just a few minutes ago, and it worked fine.

kracker 2014-07-01 20:57

Same. If it really is "low on space" then....

:gah: Seriously?

Chuck 2014-07-02 01:10

It worked for a while and is now failing again. This is especially aggravating since MISFIT does not detect the unsuccessful upload and then the next upload, even if successful with GIMPS, is rejected because previous bitlevels are missing.

garo 2014-07-02 07:49

Am having trouble logging in to the manual results page

_write_close() [[URL="http://www.mersenne.org/manual_result/function.session-write-close"]function.session-write-close[/URL]]: write failed: No space left on device (28) in [B]C:\v5\www\2013\userid_session_state.inc.php[/B] on line [B]42[/B]

tha 2014-07-02 09:28

hourly reports are not generated either

LaurV 2014-07-02 15:29

Last server adventures affected me in a strange way: even if I was logged in, when I reported one DC and few P-1 results, the credit went to anonymous, and the feedback report came back to me logging me off. I tried again and it said the results not needed, this computer already reported, blah blah, and logged me off again. I am starting to get frustrated. The exponent was 32347897 and this is the completion line, including the (valid) assignment key. [CODE]M( 32347897 )C, 0xcba8b97b84ff9e94, offset = 1624, n = 1728K, CUDALucas v2.05 Beta, g_AID: 1867B4C287AB0DBA897442E4BCC56FA0[/CODE]I forgot the P-1 exponents, and don't have log (got them from GPU72) I think one was 72035111 (I remember the ones).
It would be nice to have the credit back, at least for the DC, but even nicer would be to fix these problems once and for all; I would even forget about the credit :razz:

[edit: interesting enough, after reporting it and registered as anonymous, the assignment disappeared from my assignment list, but I assume this is related to the key, and not to the user who reports]

kracker 2014-07-03 01:52

[code]
Warning: odbc_pconnect() [function.odbc-pconnect]: SQL error: [Microsoft][ODBC SQL Server Driver][Shared Memory]General network error. Check your network documentation., SQL state 08S01 in SQLConnect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 21
pnErrorResult=3 pnErrorDetail=Database unavailable ==END==
[/code]
.......

Prime95 2014-07-05 02:41

I'm going to take the server offline and try to rebuild some indexes to defragment them

snme2pm1 2014-07-05 03:38

Take as many hours as you reckon are needed to make the site database stable again, assuming that is the crux of the issues, after many outages in recent days.
Best let it be known if lots of hours are needed.
In somewhat similar circumstances I recall my boss hovering over me to ask when the system would be available again! Estimates are never good enough for the boss people.

Prime95 2014-07-05 05:06

Uh oh. That didn't go well. SQLServer is not happy. I've emailed Scott and gone to bed -- the server is offline until this gets sorted out.

snme2pm1 2014-07-05 09:07

[QUOTE=Prime95;377428]Uh oh...gone to bed[/QUOTE]

Hilarious...
Good thing that I procured a queue of work many hours ago, perhaps tough for others.

Prime95 2014-07-05 18:58

The server magically healed itself while I slept.

I rebuilt a few indexes, but some failed due to lack of disk space. Similarly a "DBCC CHECKDB" failed due to disk space.

I've emailed Scott about buying a couple of cheap SCSI drives. I don't know if the machine has the space or power to support more drives. I'll let everyone know what he says.

Anyway, we are back online for now. Free disk space is still a paltry 1.6GB.

chalsall 2014-07-05 19:25

[QUOTE=Prime95;377456]Anyway, we are back online for now. Free disk space is still a paltry 1.6GB.[/QUOTE]

Ouch!!! That's less than the amount of RAM on a modern smart-phone!!! :sad:

Gordon 2014-07-05 19:47

[QUOTE=chalsall;377459]Ouch!!! That's less than the amount of RAM on a modern smart-phone!!! :sad:[/QUOTE]

How about we just run an [URL="http://droidphp.github.io/"]Android Web Server[/URL] complete with php and mysql...

This project can't take up much space. I'm a coordinator over at the [URL="http://confluence.org"]Confluence Project[/URL], where we store over 100,000 digital photos and the backup is only about 40gig...and I can store ALL of that on my mobile phone.

Why not open up a donations page so we can buy a "proper" server?

Uncwilly 2014-07-05 20:54

[QUOTE=Prime95;377456]I've emailed Scott about buying a couple of cheap SCSI drives. I don't know if the machine has the space or power to support more drives. I'll let everyone know what he says.[/QUOTE]What is the current config of the machine? And instead of adding drives, what about a wholesale change out to big'uns?

chalsall 2014-07-05 21:10

[QUOTE=Uncwilly;377468]What is the current config of the machine? And instead of adding drives, what about a wholesale change out to big'uns?[/QUOTE]

Indeed.

I've said before; I'll say again. Consider leasing a modern co-located server.

It costs me USD $50 a month for the server to run GPU72.

The prices have recently dropped to $40 a month.

YMMV.

srow7 2014-07-05 21:33

hourly reports
 
hourly reports not being generated.

chalsall 2014-07-05 22:07

[QUOTE=srow7;377472]hourly reports not being generated.[/QUOTE]

Where? (BTW, srow7 is a recent GPU72 worker.)

Oh, yeah. I see what you mean. [url]http://mersenne.org/primenet/[/url] hasn't updated for over a day!

TheMawn 2014-07-05 22:16

[QUOTE=Uncwilly;377468]What is the current config of the machine? And instead of adding drives, what about a wholesale change out to big'uns?[/QUOTE]

[QUOTE=chalsall;377470]I've said before; I'll say again. Consider leasing a modern co-located server.[/QUOTE]

I like the sound of this. If money is an issue, I'm also totally for the idea of setting up a donation box. Hell, if it's really just $40 per month, 2014 Q4 is on me.

Prime95 2014-07-06 00:04

[QUOTE=TheMawn;377475]I like the sound of this. If money is an issue, I'm also totally for the idea of setting up a donation box. Hell, if it's really just $40 per month, 2014 Q4 is on me.[/QUOTE]

Money is not the issue. Converting from Windows to Linux, SQLServer to MySQL, transferring the database, volunteer time to get all that done is the issue.

Even if we leased, it would take time to do all the above work. In the meantime, we are stuck with the current server. If I can upgrade the disks for peanuts and buy time, it seems like a no-brainer.

Basically, we need both a short-term solution and a long-term solution.

chalsall 2014-07-06 01:23

[QUOTE=Prime95;377477]Money is not the issue. Converting from Windows to Linux, SQLServer to MySQL, transferring the database, volunteer time to get all that done is the issue.

Even if we leased, it would take time to do all the above work. In the meantime, we are stuck with the current server. If I can upgrade the disks for peanuts and buy time, it seems like a no-brainer.

Basically, we need both a short-term solution and a long-term solution.[/QUOTE]

George...

With all due respect...

There have been many people willing to pay for the new server.

And many people willing to do the work to migrate from a M$ stack to a LAMP stack.

James has already invested a great deal of time wrapping his head around the M$ stack, and clearly understands LAMP stacks well.

Perhaps we can take the next step forward?

James Heinrich 2014-07-06 13:25

[QUOTE=chalsall;377481]James has already invested a great deal of time wrapping his head around the M$ stack, and clearly understands LAMP stacks well.[/QUOTE]I more-or-less gave up on revitalizing the existing WIMP mersenne.org because I [i]couldn't[/i] wrap my head around MSSQL, more wrapping my head into it :bangheadonwall:

That said, give me the MSSQL database and six hours and I'll give you a MySQL version. Possibly not counting any stored procedures, as I don't have a clear picture of what's involved there -- I know there are some (e.g. for hourly report generation) but I'm not sure how extensively they're used.

PHP is generally very nicely platform-independent, there should be very little needed in the way of PHP code modification.

VictordeHolland 2014-07-06 15:02

Offline (Offsite) Backup
 
Just a general question/remark: Is there a (recent) offline backup in case these server issues get worse and the database won't rebuild itself? External HDDs are so cheap and it would be such a pity if we lose days/weeks of work...

Prime95 2014-07-06 20:23

[QUOTE=VictordeHolland;377503]Just a general question/remark: Is there a (recent) offline backup in case these server issues get worse and the database won't rebuild itself? External HDDs are so cheap and it would be such a pity if we lose days/weeks of work...[/QUOTE]

Scott is in charge of backup strategy. I know he is saving the daily transaction log on another disk. I don't know how often he creates full backups.

chalsall 2014-07-06 23:48

[QUOTE=Prime95;377530]Scott is in charge of backup strategy. I know he is saving the daily transaction log on another disk. I don't know how often he creates full backups.[/QUOTE]

One of the most important "take-aways" for me from United States President Ronald Reagan was "Trust, but verify."

Gordon 2014-07-08 09:12

Gone again....

Warning: odbc_pconnect() [function.odbc-pconnect]: SQL error: [Microsoft][ODBC SQL Server Driver][Shared Memory]General network error. Check your network documentation., SQL state 08S01 in SQLConnect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 21
pnErrorResult=3 pnErrorDetail=Database unavailable ==END==

beduzar 2014-07-08 12:04

Again, problem downloading a result
 
Hello,

A failure occurred again this morning sending result to server for number M61637711.

[CODE][Comm thread Jul 8 08:19] Sending result to server: UID: beduzar/vega, M61637711 is not prime. Res64: 044D5FA731574B8A. We8: 7A9A216C,49492046,00000000, AID: B6DEB7FAB2B4D88B39817D5124750817
[Comm thread Jul 8 08:19]
[Worker #1 Jul 8 08:19] Starting primality test of M33352387 using AVX FFT length 1728K, Pass1=384, Pass2=4608
[Comm thread Jul 8 08:22] CURL library error: Operation timed out after 180000 milliseconds with 0 bytes received
[Comm thread Jul 8 08:22] CURL library error: Operation timed out after 180000 milliseconds with 0 bytes received
[Comm thread Jul 8 08:22] Visit http://mersenneforum.org for help.
[Comm thread Jul 8 08:22] Will try contacting server again in 70 minutes.
[Worker #2 Jul 8 08:29] Iteration: 37400000 / 61992289 [60.33%]. Per iteration time: 0.023 sec.
[Worker #1 Jul 8 08:39] Iteration: 100000 / 33352387 [0.29%]. Per iteration time: 0.012 sec.
[Worker #1 Jul 8 08:59] Iteration: 200000 / 33352387 [0.59%]. Per iteration time: 0.012 sec.
[Worker #2 Jul 8 09:08] Iteration: 37500000 / 61992289 [60.49%]. Per iteration time: 0.023 sec.
[Worker #1 Jul 8 09:18] Iteration: 300000 / 33352387 [0.89%]. Per iteration time: 0.012 sec.
[Comm thread Jul 8 09:32] Sending result to server: UID: beduzar/vega, M61637711 is not prime. Res64: 044D5FA731574B8A. We8: 7A9A216C,49492046,00000000, AID: B6DEB7FAB2B4D88B39817D5124750817
[Comm thread Jul 8 09:32]
[Comm thread Jul 8 09:32] PrimeNet error 40: No assignment
[Comm thread Jul 8 09:32] This computer has already sent in this LL result for M61637711
[Comm thread Jul 8 09:32] Done communicating with server.
[/CODE]The problem is the same as on 07 Apr 14,

[URL]http://www.mersenneforum.org/showthread.php?p=370469#post370469[/URL]

The result is well recorded but it is not credited to my account.

Last time Prime95 fixed the problem by [B]deleting the row from the LL results table and resubmitting the result manually[/B]. I don't think I can do that myself so please, could you help me again?

ric 2014-07-08 12:52

[QUOTE=Gordon;377646]Gone again....

Warning: odbc_pconnect() [function.odbc-pconnect]: SQL error: [Microsoft][ODBC SQL Server Driver][Shared Memory]General network error. Check your network documentation., SQL state 08S01 in SQLConnect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 21
pnErrorResult=3 pnErrorDetail=Database unavailable ==END==[/QUOTE]
Same thing happening to me, when trying to retrieve manual assignments from the site page...

TheMawn 2014-07-08 18:27

[QUOTE]Sending result to server: UID: beduzar/vega, M61637711 is not prime. Res64: 044D5FA731574B[B]XX[/B]. We8: 7A9A216C,49492046,00000000, AID: B6DEB7FAB2B4D88B39817D5124750817
[Comm thread Jul 8 09:32][/QUOTE]

For anyone reading this, I would like to remind you that it is not a good idea to post full 64-bit residues in a public place. I wasn't paying attention and made the same mistake myself a few days ago.

I have grabbed M61637711 as a DC assignment to make sure that this assignment is done properly. Otherwise, someone could fake the test and submit the same result line as you and get free credit for an assignment that was not done, and potentially miss any error you might have had during the test.

Normally, the last two digits of the residue are masked so I would not be able to do that.
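
A small Python sketch for masking a log before pasting it here. The regular expression assumes the residue appears as `Res64: ` followed by 16 hex digits, as in the log lines quoted in this thread; the function name is made up for this example:

```python
import re

# "Res64: " followed by 16 hex digits; the last two get replaced by "XX".
_RES64 = re.compile(r"(Res64: [0-9A-Fa-f]{14})[0-9A-Fa-f]{2}\b")


def mask_residues(text: str) -> str:
    """Hide the last two hex digits of every 64-bit residue in a log excerpt."""
    return _RES64.sub(r"\1XX", text)
```

Running this over the quoted line turns `Res64: 044D5FA731574B8A.` into `Res64: 044D5FA731574BXX.` and leaves the rest of the text alone.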

chalsall 2014-07-08 18:56

[QUOTE=TheMawn;377680]Normally, the last two digits of the residue are masked so I would not be able to do that.[/QUOTE]

Fast learn you do.

Madpoo 2014-07-09 19:24

[QUOTE=Prime95;374615]BTW, does anyone know how to auto-reboot the MS SQLServer and IIS services?[/QUOTE]

So, I know I'm SUPER late with this boring reply, and George probably figured it out by now, but...

Restarting SQL can be accomplished by:
net stop /y mssqlserver
net start mssqlserver
(if SQL agent is used add this line too)
net start sqlserveragent

The first stop command with /y will stop the service and any dependencies as well, like SQL Agent. I'm also partial to the "psservice.exe" from the SysInternals set of tools... with that it's as easy as running "psservice restart mssqlserver" which will restart it and all dependencies in one tidy command.

Resetting IIS is just running this:
iisreset

It takes care of restarting the web service and WAS if there's any .NET stuff involved.
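
If a script is preferred over a batch file, the same sequence could be wrapped like this. A rough Python sketch only: it uses exactly the commands listed above (service names `mssqlserver` and `sqlserveragent` assumed to be the defaults on the box), defaults to a dry run that just prints them, and would still need to be called from Task Scheduler with sufficient rights:

```python
import subprocess
from typing import List


def restart_commands(use_sql_agent: bool = True) -> List[List[str]]:
    """Build the restart sequence: SQL Server first, then IIS."""
    cmds = [
        ["net", "stop", "/y", "mssqlserver"],   # /y also stops dependent services
        ["net", "start", "mssqlserver"],
    ]
    if use_sql_agent:
        cmds.append(["net", "start", "sqlserveragent"])
    cmds.append(["iisreset"])
    return cmds


def restart_services(dry_run: bool = True, use_sql_agent: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in restart_commands(use_sql_agent):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```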

If George or anyone wants any help with server junk, let me know. George and Scott are probably still familiar with me, from a certain incident back in 1998 or so. :)

Right now I only have a single system contributing to GIMPS, but besides that I'm still doing system admin things in the Windows world, including extensive IIS and MS SQL work. I'd be more than happy to look over anything and see if I have anything useful to add. Optimization in the Microsoft world is one of my specialties.

From my 30,000 foot level, without knowing squat about how SQL is setup, it sounds like there's not enough disk space but that could be from the transaction log growing uncontrollably. Maybe it's due to transactions being stalled while factors are being checked by an external process?

Maybe it already does this but if I were designing a system to accept submissions like factor results and then check them, I'd put them in a pending column and send the factor job off on it's own to handle them as needed where that job would independently move them from pending to confirmed. Could be a totally different machine or the same...whatever. So long as it doesn't tie up SQL in the meantime.

Maybe it means newly submitted requests are "pending" for a while, even days in the case of that person who submits thousands at a time, but it won't take down the site. And the processing could be scheduled by user, so one user submitting thousands gets their own queue of "one at a time" and won't block other users who submit one or two.
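
To make the per-user scheduling idea concrete, here is a rough Python sketch. It is only an illustration under my own assumptions: `PendingQueue`, `submit` and `next_batch` are invented names, everything lives in memory rather than in SQL, and I have no idea how the real server stores pending results.

```python
from collections import OrderedDict, deque


class PendingQueue:
    """Holds unverified results per user and hands them out round-robin."""

    def __init__(self):
        # user name -> deque of that user's pending results, oldest first
        self._by_user = OrderedDict()

    def submit(self, user, result):
        """Accept a result immediately; verification happens later."""
        self._by_user.setdefault(user, deque()).append(result)

    def next_batch(self):
        """Take at most one pending result from each user.

        A user with thousands of submissions contributes one item per
        batch, so they cannot starve a user who submitted one or two.
        """
        batch = []
        for user in list(self._by_user):
            queue = self._by_user[user]
            batch.append((user, queue.popleft()))
            if not queue:
                del self._by_user[user]
        return batch


if __name__ == "__main__":
    pending = PendingQueue()
    for n in range(3):
        pending.submit("bulk_submitter", f"factor-{n}")
    pending.submit("casual_user", "factor-x")
    print(pending.next_batch())
    print(pending.next_batch())
```

A separate checker job (same machine or another one) would call `next_batch`, verify each factor, and only then mark it confirmed.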

chalsall 2014-07-14 17:33

She's dead (again) Jim....
 
Reboot, rinse, repeat....

brilong 2014-07-14 17:41

What's the hardware config for the backend? If we need a bandaid approach to keep it running, I have some SAS and/or SATA disks I could possibly send you depending on the geographic location (San Diego?). What's the server model number, RAID controller, etc?

kladner 2014-07-16 01:14

She may not be dead, Jim, but she's one sick puppy.

[QUOTE][B]Warning[/B]: odbc_pconnect() [[URL="http://www.mersenne.org/function.odbc-pconnect"]function.odbc-pconnect[/URL]]: SQL error: [Microsoft][ODBC SQL Server Driver][Shared Memory]General network error. Check your network documentation., SQL state 08S01 in SQLConnect in [B]C:\v5\www\2013\v5server\0.96_database.inc.php[/B] on line [B]21[/B]
pnErrorResult=3 pnErrorDetail=Database unavailable ==END== [/QUOTE]

Prime95 2014-07-16 01:48

[QUOTE=kladner;378190]She may not be dead, Jim, but she's one sick puppy.[/QUOTE]

Yes, I did a "preventative" restart 5 minutes ago.

kladner 2014-07-16 02:40

[QUOTE=Prime95;378193]Yes, I did a "preventative" restart 5 minutes ago.[/QUOTE]

Thanks!

Robert_JD 2014-07-16 08:31

[QUOTE=Prime95;378193]Yes, I did a "preventative" restart 5 minutes ago.[/QUOTE]

So far so good...Thanks George! :smile:

NBtarheel_33 2014-07-16 18:27

She's mortally wounded, James.

chalsall 2014-07-16 19:11

[QUOTE=NBtarheel_33;378246]She's mortally wounded, James.[/QUOTE]

George...

Would it help if GPU72 didn't run its queries so often?

Would it help if very old log files were deleted from Primenet's hard drives?

This issue is clearly critical. Primenet now seems more unstable than stable most of the time (happy to share the logs empirically demonstrating this).

TheMawn 2014-07-16 20:22

Warning: odbc_pconnect() [function.odbc-pconnect]: SQL error: [Microsoft][ODBC SQL Server Driver][Shared Memory]General network error. Check your network documentation., SQL state 08S01 in SQLConnect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 21
pnErrorResult=3 pnErrorDetail=Database unavailable ==END==


EDIT: Oops same message as kladner.

Prime95 2014-07-16 20:33

[QUOTE=chalsall;378253]Would it help if GPU72 didn't run its queries so often?[/quote]

I have no idea. If I knew why the database crashes so often I might could answer that question more intelligently.

[quote]Would it help if very old log files were deleted from Primenet's harddrives?[/QUOTE]

I've been cleaning up some. This latest crash was not related to disk space -- just an ordinary database death for no good reason.

TheMawn 2014-07-16 20:57

Well unless I am missing something, our options are to either get rid of the data or to expand storage capacity. I always think there is no such thing as too much data, but I also like to live in that perfect world where storage space is not an issue. GIMPS and I clearly don't live together in that world.

How much data is kept for every exponent? I can only work from memory at this point because the server is dead, but when I look up an exponent, I vaguely remember seeing every stage of trial factoring, including who did it and when; any P-1 work, including who did it, when and what the bounds were; LL and DC tests, if applicable, with 64-bit residues, and who did it and when.

For the sake of history, I think that keeping every bit of data we can is important, but if it's a choice between server stability and precision of data, I'd choose stability any day. To this end, perhaps if we simplified all activity before xx/yy/zzzz, we would free up enough space to keep us going until a new server is arranged?

Presently an exponent might look like:

[CODE]No factors below 2^66
No factors from 2^66 to 2^68, TheMawn, 20/03/2010
No factors from 2^68 to 2^69, Joe Pesci, 16/04/2010
No factors from 2^69 to 2^70, Factor_Dude_669, 05/05/2010
P-1 B1=135000 B2=1697500, P-1_Dude_669, 03/07/2011
No factors from 2^70 to 2^72, TheMawn, 19/11/2010
MXXXXXXXX is not prime Res64: BLABLA, curtisc, 03/01/2012
No factors from 2^72 to 2^73, John Travolta, 01/02/2012
No factors from 2^73 to 2^74, FourHorsemen, 01/07/2013
P-1 B1=249000 B2=3125000, Darian Durant, 07/07/2013
MXXXXXXXX is not prime Res64: BLABLA, DC_Dude_669, 16/07/2013[/CODE]

If we wanted to simplify the data, we really have a lot of options. I don't really care who did the TF and when. How often have we found erroneous TF results? Do we really care if we found out that Factor_Dude_669 missed a factor three years ago? Would we do anything about it? Then why bother remembering?

Same goes for the P-1. What value do we really get from the time stamp and the user ID?

For the LL and DC, I understand that the user and date are a bit more important in case a particular user turns out to have a high error rate or is suspected of falsifying results, so it might be worth keeping those. And if a DC hasn't been completed yet, or if the DC comes out incorrect, it would be very important that we continue to keep all the information we can.

If the data looked like:

[CODE]
No factors below 2^74
P-1 B1=135000 B2=1697500
P-1 B1=249000 B2=3125000
MXXXXXXXX is not prime Res64: BLABLA, curtisc
MXXXXXXXX is not prime Res64: BLABLA, DC_Dude_669[/CODE]

then we might save a lot of space and lose little of importance. If we were to compress everything before, say, 01/01/2014, and keep the rest on an external source until we get the issues permanently resolved, this could buy us some time?
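
Roughly what I mean, as a Python sketch. This is purely hypothetical: the record layout (dicts with a `type` field) and the function name `compress_history` are made up for illustration, and it assumes the TF ranges are contiguous, which the real database may or may not guarantee.

```python
def compress_history(records):
    """Collapse a detailed exponent history into the short form.

    Assumptions (mine, not Primenet's): each record is a dict with a
    "type" of "TF", "P-1" or "LL", and the TF ranges have no gaps, so
    the highest "to_bits" value is the overall trial-factoring limit.
    """
    tf_limit = 0
    kept = []
    for rec in records:
        if rec["type"] == "TF":
            # Drop user and date; remember only how far TF has gone.
            tf_limit = max(tf_limit, rec["to_bits"])
        elif rec["type"] == "P-1":
            # Keep the bounds, drop user and date.
            kept.append(f"P-1 B1={rec['b1']} B2={rec['b2']}")
        elif rec["type"] == "LL":
            # Keep the residue and the user for error-rate tracking.
            kept.append(
                f"MXXXXXXXX is not prime Res64: {rec['res64']}, {rec['user']}"
            )
    return [f"No factors below 2^{tf_limit}"] + kept


if __name__ == "__main__":
    history = [
        {"type": "TF", "to_bits": 66},
        {"type": "TF", "to_bits": 68, "user": "TheMawn"},
        {"type": "P-1", "b1": 135000, "b2": 1697500, "user": "P-1_Dude_669"},
        {"type": "TF", "to_bits": 74, "user": "FourHorsemen"},
        {"type": "LL", "res64": "BLABLA", "user": "curtisc"},
    ]
    for line in compress_history(history):
        print(line)
```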


EDIT: Now that the server is back up, I went and looked at the history for M55174403, one that I am working on right now. It looks a bit different than I remembered. In fact, there is a little box with the reader's digest of the work done on the exponent, but there is also a detailed history. Temporarily getting rid of that might reduce the strain all the same?

Gordon 2014-07-16 22:24

[QUOTE=TheMawn;378262]Well unless I am missing something, our options are to either get rid of the data or to expand storage capacity. I always think there is no such thing as too much data, but I also like to live in that perfect world where storage space is not an issue. GIMPS and I clearly don't live together in that world.



[snip]

[/QUOTE]

Now let's see I can buy a [URL="http://www.aria.co.uk/SuperSpecials/Other+products/4TB+Seagate+Barracuda+ST4000DM000+3.5%22+SATA+III+Hard+Drive+-+HDD+?productId=54551"]4 TB drive[/URL] for about $150.

That's 4 MILLION megabytes of storage.

Let's pretend that each candidate has 4 KB of data attached to it. Yes, I know that's ludicrously high, but go with the flow.

Our single hard drive can store data for about 1 thousand million exponents...seriously, data storage space is in any practical sense of the word unlimited. In anybody's world.
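
For anyone who wants to check the arithmetic, here it is spelled out in Python. The only assumptions are decimal units (4 TB = 4 x 10^12 bytes, the way drive makers count) and the deliberately generous 4 KB per exponent from above.

```python
# Decimal units, as used on drive labels (assumption: 1 TB = 10**12 bytes).
DRIVE_BYTES = 4 * 10**12          # one 4 TB drive
BYTES_PER_MEGABYTE = 10**6
BYTES_PER_EXPONENT = 4 * 10**3    # the "ludicrously high" 4 KB guess

megabytes = DRIVE_BYTES // BYTES_PER_MEGABYTE
exponents = DRIVE_BYTES // BYTES_PER_EXPONENT

if __name__ == "__main__":
    print(f"{megabytes:,} MB on the drive")
    print(f"{exponents:,} exponents at 4 KB each")
```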

chalsall 2014-07-16 22:34

[QUOTE=Gordon;378279]Our single hard drive can store data for about 1 thousand million exponents...seriously, data storage space is in any practical sense of the word unlimited. In anybody's world.[/QUOTE]

Thanks for bringing this forward. 8-|

Uncwilly 2014-07-16 23:29

[QUOTE=Gordon;378279]Now let's see I can buy a [url]http://www.aria.co.uk/SuperSpecials/Other+products/4TB+Seagate+Barracuda+ST4000DM000+3.5%22+SATA+III+Hard+Drive+-+HDD+?productId=54551[/url]
4 TB drive for about $150.[/QUOTE]
The problem with that, as I understand it is: the server's RAID system uses SCSI. Changing out the whole thing to a different system might be more complex than just plugging in a single new drive.

TheMawn 2014-07-17 00:10

[QUOTE=Gordon;378279]Now let's see I can buy a [URL="http://www.aria.co.uk/SuperSpecials/Other+products/4TB+Seagate+Barracuda+ST4000DM000+3.5%22+SATA+III+Hard+Drive+-+HDD+?productId=54551"]4 TB drive[/URL] for about $150.

That's 4 MILLION megabytes of storage.

Let's pretend that each candidate has 4 KB of data attached to it. Yes, I know that's ludicrously high, but go with the flow.

Our single hard drive can store data for about 1 thousand million exponents...seriously, data storage space is in any practical sense of the word unlimited. In anybody's world.[/QUOTE]

I have thought of this, too, believe me. So have others. I've asked in more ways than one about how much data is required per exponent, trying to hint at "do we need more?" There have been direct suggestions by other people to upgrade the storage capacity and the like, also.

Because of all the no-responses, I decided to stop. I figured the powers that be have their reasons for not stepping up the storage.

Case in point:


[QUOTE=Uncwilly;378284]The problem with that, as I understand it is: the server's RAID system uses SCSI. Changing out the whole thing to a different system might be more complex than just plugging in a single new drive.[/QUOTE]


Luckily, I believe we are getting closer to getting something done but it will be difficult.

As always, I am available to help, however limited my capacity to do so!

flagrantflowers 2014-07-17 04:49

Friend of mine used to tell a good story about how they got a commercial drive from somewhere and decided to see how long it would last in one of the server racks. I don't know what they were doing exactly, early GPS truck/traffic tracking (READ: a lot of reads and writes).

Long story short, the drive head melted and smoked in under an hour. I think the actual figure was 20 minutes, but you get the point. The duty cycle of a server hard drive is not compatible with a commercial drive.

Madpoo 2014-07-17 05:22

[QUOTE=Uncwilly;378284]The problem with that, as I understand it is: the server's RAID system uses SCSI. Changing out the whole thing to a different system might be more complex than just plugging in a single new drive.[/QUOTE]

I've seen lots of folks offering to pay for hardware in this thread, so if it's a problem that could be solved by throwing hardware at it, you'd think George/Scott would go for it.

For my part, I've got a pile of older HP hard drives (3.5"), new, in packages, that have been sitting for years. Never did use them, and now all the new HP servers I use are all SFF (2.5" drives) SAS, so I have nowhere to use them.

The drives I have are all Ultra320 SCSI, LFF (3.5") and have SCA connectors (80-pin) for hot-plug backplanes. You can use them with a normal 68-pin SCSI using an adapter that breaks out the data and power. I even have some of those adapters I'd throw in if needed... I think I have 4 or maybe more.

4 x 300GB 15K drives, and a 72GB 15K drive, a 144GB 15K drive and another 144GB 10K drive. :)

The 300GB drives are all new, still in the anti-stat bags. The other 3 smaller drives were used for maybe 10-20 hours total before they ended up getting upgraded. Those new drives were stock I kept on hand for hot-swapping failed drives, but the servers themselves got retired before I ever had to dig into the drive stockpile.

I also have a lot of older memory modules for servers... DDR2 mostly. Registered (they were from HP servers after all), and most were never used. They're the ones that came with the server before I yanked them and upgraded with larger modules. I must have something like 100 or more 1GB DDR2 registered DIMMs.

But since I don't know what hardware the Primenet stuff runs on, no idea if any of that would work.

Where I work now, we even have some retired servers... HP Proliant DL360 G5 and DL380G5. They're sitting in our branch server room powered off, taking space. We were looking at recycling them but I'm hanging onto a few for spare parts... I bet if I asked we could sell one for the price of shipping. After all, the alternative we're looking at is to give them away to a recycler who will take them off our hands, no charge to us. :)

Those servers won't take the older 300GB drives I mentioned but I could see what it'd take to fit one out with 8x72GB drives, maybe 8x144GB (I don't remember what all we have on those). 32GB RAM, and I think the max processor specs on any of them is dual X5470 processors.

The key to SQL is more drives, more drives, and more drives. I'd go with 10 x 400GB drives over one 4TB drive any day if SQL was the application.

Other general notes for SQL would be to make sure data is indexed...check what kind of data gets called for and create indexes to make retrieval quicker. And optimize those indexes at least on some weekly basis. Do backups and make sure transaction logs aren't getting crazy. Do a check every now and then after SQL's been running a while and look at the stats...which sprocs run the most often and how much time is it taking? Optimize the slowest ones to start out, or see if some sprocs are being called excessively when they don't really need to be.

Most of that is just basic SQL optimization and would be true whether it's MSSQL, MySQL or whatever.
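
To show the indexing point in a way anyone can run, here is a small Python sketch using the built-in SQLite module instead of MSSQL. Everything in it is invented for illustration (the `results` table, its columns, the index name and the helper `query_plan`); the real Primenet schema is unknown to me. It asks the database for its query plan before and after adding an index on the column being searched.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable plan SQLite would use for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the plan description text.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    "id INTEGER PRIMARY KEY, exponent INTEGER, user TEXT, res64 TEXT)"
)
conn.executemany(
    "INSERT INTO results (exponent, user, res64) VALUES (?, ?, ?)",
    [(50_000_000 + i, f"user{i % 10}", "BLABLA") for i in range(1000)],
)

lookup = "SELECT * FROM results WHERE exponent = ?"

# Without an index the database has to scan the whole table.
plan_before = query_plan(conn, lookup, (50_000_123,))

# With an index on the searched column it can seek straight to the row.
conn.execute("CREATE INDEX idx_results_exponent ON results (exponent)")
plan_after = query_plan(conn, lookup, (50_000_123,))

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

The exact plan wording differs between SQLite versions, but the second plan should name the index. MSSQL has its own execution plan tools for the same kind of check.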

Gordon 2014-07-17 15:38

[QUOTE=Uncwilly;378284]The problem with that, as I understand it is: the server's RAID system uses SCSI. Changing out the whole thing to a different system might be more complex than just plugging in a single new drive.[/QUOTE]

That may well be true, but my point still stands: there is zero reason nowadays for anybody to ever say "run out of storage."

flashjh 2014-07-17 15:42

[QUOTE=Madpoo;378300]I've seen lots of folks offering to pay for hardware in this thread, so if it's a problem that could be solved by throwing hardware at it, you'd think George/Scott would go for it.[/QUOTE]

Calculate the costs to upgrade and put them here for everyone to help pay, we'll make it happen.

kracker 2014-07-17 16:03

[QUOTE=flashjh;378325]Calculate the costs to upgrade and put them here for everyone to help pay, we'll make it happen.[/QUOTE]

[url]http://mersenneforum.org/showpost.php?p=377477&postcount=253[/url]

flashjh 2014-07-17 16:14

If we keep waiting until it 'hard' breaks, then we're all up a creek... Let's put something together for the short-term fix and then get a plan for long-term.

With all the experts here, I'm sure it can happen.

What is the current configuration and what does it need to be? For the short-term, I don't even know what the current hardware is right now and what would be an improvement?

chalsall 2014-07-17 17:09

[QUOTE=flashjh;378331]If we keep waiting until it 'hard' breaks, then we're all up a creek... Let's put something together for the short-term fix and then get a plan for long-term.[/QUOTE]

Seconded!

It would be a real shame if Mersenne.org Version 6 happened the same way as Version 5....

flashjh 2014-07-17 19:10

[QUOTE=chalsall;378336]Seconded!

It would be a real shame if Mersenne.org Version 6 happened the same way as Version 5....[/QUOTE]

What do you recommend for migration?

chalsall 2014-07-17 19:23

[QUOTE=flashjh;378349]What do you recommend for migration?[/QUOTE]

I've said before; I'll say again...

1and1 have served me well for many years, for many servers. (And, for the record, I make no commission recommending them.) EC2 or RackSpace et al should also be considered.

Renting or leasing servers generally makes more sense than colocating a server of your own in an ISP's rack nowadays.

Mark Rose 2014-07-17 20:00

[QUOTE=chalsall;378350]I've said before; I'll say again...

1and1 have served me well for many years, for many servers. (And, for the record, I make no commission recommending them.) EC2 or RackSpace et al should also be considered.

Renting or leasing servers generally makes more sense than colocating a server of your own in an ISP's rack nowadays.[/QUOTE]

EC2 should only be considered if going with a clustered system (such as Percona XtraDB Cluster). They will occasionally retire hardware, sometimes with only days of notice. EC2 is also expensive if you don't buy a reservation.

chalsall 2014-07-17 20:14

[QUOTE=Mark Rose;378354]EC2 should only be considered if going with a clustered system (such as Percona XtraDB Cluster). They will occasionally retire hardware, sometimes with only days of notice. EC2 is also expensive if you don't buy a reservation.[/QUOTE]

I will defer to you for EC2 advice.

I will stick to my recommendation of 1and1. Reliable, and inexpensive.

And, just as an aside, I ran an experiment for a client recently: I tried to get in touch with a "human" at EC2 -- after two days (via both phone and email), no contact.

It took less than 90 seconds to reach a "human" at 1and1 via a 1-800 number which worked here in Bim.

Batalov 2014-07-18 17:58

[QUOTE=mersenne.org]Warning: odbc_pconnect() [function.odbc-pconnect]: SQL error: [Microsoft][ODBC SQL Server Driver][Shared Memory]General network error. Check your network documentation., SQL state 08S01 in SQLConnect in C:\v5\www\2013\v5server\0.96_database.inc.php on line 21
pnErrorResult=3 pnErrorDetail=Database unavailable ==END==[/QUOTE]
Check your network documentation. Hmmm...

It better go to the log, not to the viewers. Some may be compelled to spend hours checking their network documentation, and shaking random wires on their computer... :smile:


All times are UTC.