2016-07-15, 23:33   #1057
Madpoo
Serpentine Vermin Jar

Jul 2014

2·1,637 Posts

FYI, I've seen sporadic problems through the day where the monitoring service reported short duration outages. Looking into it, it appears a certain IP address (with a user agent string of "Ruby") has decided to do a massive crawl of the exponent reports (EDIT: by "massive" I mean they're trying to crawl at a rate of 30+ pages per second for extended periods of time... 603K pages hit in the past nearly 24 hours).

Before I dig into it more and try to mitigate, may I suggest that if anyone is trying to gather data on exponents, please either download the daily archives of the result logs, or crawl the XML specific page which operates much more efficiently. It would be too bad to have to implement some rate controls on the server side, so if you must crawl for data, try to do an appropriate rate limit on the crawler.

Last fiddled with by Madpoo on 2016-07-15 at 23:39
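For anyone wondering what "an appropriate rate limit on the crawler" could look like, here is a minimal client-side sketch. The `Throttle` class and its parameters are made up for illustration and are not anything mersenne.org provides; the idea is simply to call `wait()` before every request so that calls are spaced at least `1 / rate` seconds apart.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` of them happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        if rate <= 0:
            raise ValueError("rate must be positive")
        self.min_interval = 1.0 / rate
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

A crawler would create one `Throttle(rate=1)` (one request per second is an arbitrary, conservative choice, not a number from the admin) and call `throttle.wait()` immediately before each HTTP request.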
2016-07-16, 03:01   #1058
Madpoo

Jul 2014

CCA₁₆ Posts

Quote:
 Originally Posted by Madpoo It would be too bad to have to implement some rate controls on the server side, so if you must crawl for data, try to do an appropriate rate limit on the crawler.
Well, like I said, I hated to do it, but after setting up a rate limiter in logging only mode for a few hours and checking to make sure no other traffic is caught in the net, I've turned it on because that user is still aggressively crawling and sometimes forcing other connections to fail.

If that user was you (I know what city and ISP but beyond that I haven't tried to narrow it down) just PM me here and we'll work out a better way to do whatever you're doing.

2016-07-16, 05:23   #1059
0PolarBearsHere

Oct 2015

2·7·19 Posts

Quote:
 Originally Posted by Madpoo and we'll work out a better way to do whatever you're doing.
Like not using RubyOnRails :P ? (wasn't me by the way)

Last fiddled with by 0PolarBearsHere on 2016-07-16 at 05:23

2016-07-16, 14:48   #1060
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

5,741 Posts

Quote:
 Originally Posted by Madpoo If that user was you (I know what city and ISP but beyond that I haven't tried to narrow it down) just PM me here and we'll work out a better way to do whatever you're doing.
Not me either, but I just want to mention the possibility of a VPN. If it was me all you'd see was whichever VPN I chose, so perhaps instead of an IP block, a content/agent block might be better?

2016-07-17, 04:11   #1061
Madpoo

Jul 2014

2×1,637 Posts

Quote:
 Originally Posted by Madpoo Well, like I said, I hated to do it, but after setting up a rate limiter in logging only mode for a few hours and checking to make sure no other traffic is caught in the net, I've turned it on because that user is still aggressively crawling and sometimes forcing other connections to fail. If that user was you (I know what city and ISP but beyond that I haven't tried to narrow it down) just PM me here and we'll work out a better way to do whatever you're doing.
I don't know who it was (I searched for past hits from that IP address to see if I could match to a user, but I couldn't). Best I could tell it was some new user since they checked out the home page, looked at the download page and a few other things and then started crawling the exponent reports for every. single. exponent. one. by. one.

I was on the road all day today and kept getting emails from the monitoring setup about downtime, so now that I'm back at my computer I finally just blocked their IP address altogether.

Not something I wanted to do, but the problem was that even blocking them when they did 50 requests in 5 seconds meant some were still being processed and the backlog it built up made each request take progressively longer, which made other requests queue up. Even though it was unintentional I'd have to classify it as a DoS so a block is appropriate, unfortunately.
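The "50 requests in 5 seconds" rule above can be pictured as a per-IP sliding window. This is only a rough sketch of that kind of rule, not the actual server configuration (the thread does not say how the limiter is implemented); the class name and the in-memory storage are assumptions for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests=50, window_seconds=5.0):
        self.max_requests = max_requests
        self.window = window_seconds
        # Maps client IP -> timestamps of its recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, client_ip, now):
        """Return True if a request from client_ip at time `now` is accepted."""
        hits = self._hits[client_ip]
        # Forget accepted requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject (e.g. with an error status)
        hits.append(now)
        return True
```

Note that a check like this only decides whether to reject. Requests accepted just before the limit trips are still processed, so a slow page can still build a backlog, which is the problem described above.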

So, if you're that person in a certain US state with lots of lakes, and you came here looking for help on why you can't hit mersenne.org any more and your crawler keeps getting 403 errors, PM me and let's work on a better way to do this.

Meanwhile this will urge me to get off my butt and set up better dynamic restrictions to prevent this from happening in the future. It's also good motivation to dig into that page again and optimize it, something that's been in the back of my mind for a while now.

 2016-07-17, 23:12 #1062 Mark Rose
2016-07-18, 00:10   #1063
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

5,741 Posts

Quote:
 Originally Posted by Madpoo ... I finally just blocked their IP address altogether.
Thankfully it was not my VPN you blocked. But I do hope that in general IP blocks are not going to have to be common things?

2016-07-18, 03:38   #1064
Madpoo

Jul 2014

6312₈ Posts

Quote:
 Originally Posted by retina Thankfully it was not my VPN you blocked. But I do hope that in general IP blocks are not going to have to be common things?
I hope not either. It's a pain. Just a result of someone crawling far too aggressively.

I'll reiterate that for anyone looking to capture data on exponents, there are options far more suited to that than crawling the HTML report_exponent pages. XML reports are awesome and faster, and you can specify a range of exponents instead of doing one request per exponent. Or, if you want a specific large batch of something or another, ask, and if it's not too cumbersome I may be able to do a BCP package or something. Since that's a manual thing on my part, I won't make any promises on if/when.
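To illustrate the difference between one request per exponent and one request per range, here is a small sketch that splits a span of exponents into inclusive ranges. The helper name `chunk_ranges` and the chunk size are hypothetical, and the thread does not give the XML report's URL or parameter names, so the sketch stops at computing the ranges and leaves the actual request out.

```python
def chunk_ranges(first, last, size):
    """Yield inclusive (start, end) ranges of at most `size` exponents each,
    together covering first..last."""
    if size < 1:
        raise ValueError("size must be at least 1")
    start = first
    while start <= last:
        end = min(start + size - 1, last)
        yield (start, end)
        start = end + 1


# Covering 1,000 exponents in chunks of 100 means 10 range requests
# instead of 1,000 single-exponent page loads.
ranges = list(chunk_ranges(1, 1000, 100))
```

Each `(start, end)` pair would then become a single range query against the XML report, ideally with a pause between queries.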

2016-07-18, 04:00   #1065
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

2·1,433 Posts

Quote:
 Originally Posted by Madpoo I hope not either. It's a pain. Just a result of someone crawling far too aggressively. I'll reiterate that for anyone looking to capture data on exponents, there are options far more suited to that than crawling the html report_exponent pages. XML reports are awesome and faster, you can also specify a range of exponents instead of doing one request per exponent, or if you want a specific large batch of something or another, ask and if it's not too cumbersome I may be able to do a BCP package or something, but since that's a manual thing on my part, I won't make any promises on if/when.
Perhaps the page should be updated to tell people there are other options.

 2016-07-18, 20:41 #1066 henryzz
 2016-07-18, 20:58 #1067 James Heinrich

