mersenneforum.org Low memory line counting in huge files

 2021-09-20, 15:17 #1 henryzz Just call me Henry     "David" Sep 2007 Cambridge (GMT/BST) 2²×1,481 Posts

Low memory line counting in huge files

I have been running a program that outputs a line to a file every time it encounters a composite. Many of these lines are duplicated and I am interested in targeting the composites that are encountered the most. The issue is that the files generated by my program can get extremely large (150+ GB). In an ideal world I would run a command such as:

sort -n comps.txt | uniq -c | sort -n | tail -n 10000 > best10k.txt

However, running this in WSL2 requires memory equal to twice the file size, which is unmanageable (I have 48 GB of memory). So far I have been splitting the file into pieces based on the first one or two digits using awk and then processing them separately, appending the results to one file:

awk '{file = "comps_" substr($0, 1, 2) ".txt"; print > file}' < comps.txt

However, sometimes the file is dominated by one composite, which means one of the split files gets too big. The awk command is also quite slow. Does anyone have any suggestions about how I could make this run faster or with less memory? The output of sort -n comps.txt | uniq -c should easily fit in memory (probably twice over). If running PowerShell commands is a better option, that is possible.
 2021-09-20, 15:36 #2 kruoli     "Oliver" Sep 2017 Porta Westfalica, DE 2×3×113 Posts Have you tried using the --buffer-size=SIZE and --temporary-directory=DIR options of sort? It should not need so much memory then, and neither should uniq. Or is disk space a problem? Does your sort default to in-memory sorting? Also, use sort [other options] -r in the second occurrence and head instead of tail. Edit: I can think of a more efficient implementation in PowerShell; since you can now pass C# code to it directly, I'll give it a try later, because the normal PowerShell syntax is still a struggle for me. We should be able to nail down the memory requirements quite drastically this way. Last fiddled with by kruoli on 2021-09-20 at 15:46 Reason: Additions.
 2021-09-20, 15:59 #3 chris2be8     Sep 2009 2,179 Posts How many different composites are there? If there is a reasonably small number, say under 100,000, I would write a perl script that reads the file and updates a hash, using the composite as the key and the count of matches as the value. At EOF, sort the hash by value and output the top n records. This assumes you can fit all the unique composites into memory. If that needs too much memory, I would do it in batches, outputting "composite count" pairs from each batch, sorted by composite. Then do a series of merges of two or more batches, replacing matching records with a single record whose count is the sum of the input counts. When eventually down to one file, just run through it extracting the n largest counts (and any other data you want, e.g. how many singletons there are). HTH
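That batch-and-merge idea can be sketched in Python rather than perl. This is only a sketch of the approach chris2be8 describes, not a drop-in tool: the function name, batch size, and tab-separated run-file format are my own choices, and it has not been tried on 150 GB inputs.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby, islice

def count_batch_merge(path, batch_lines=5_000_000, top_n=10_000):
    # Pass 1: count each batch in memory, then spill a run of
    # "composite<TAB>count" lines, sorted by composite, to a temp file.
    run_files = []
    with open(path) as f:
        while True:
            chunk = list(islice(f, batch_lines))
            if not chunk:
                break
            counts = Counter(line.rstrip() for line in chunk)
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            for comp in sorted(counts):
                tmp.write(f"{comp}\t{counts[comp]}\n")
            tmp.close()
            run_files.append(tmp.name)

    def parse(fh):
        # Yield (composite, count) tuples from one sorted run file.
        for line in fh:
            comp, cnt = line.rstrip("\n").rsplit("\t", 1)
            yield comp, int(cnt)

    # Pass 2: merge the sorted runs; groupby collapses equal composites
    # across runs so their counts can be summed. A bounded min-heap keeps
    # only the top_n largest totals in memory.
    readers = [open(name) for name in run_files]
    best = []  # min-heap of (total_count, composite)
    for comp, group in groupby(heapq.merge(*(parse(fh) for fh in readers)),
                               key=lambda kv: kv[0]):
        total = sum(cnt for _, cnt in group)
        heapq.heappush(best, (total, comp))
        if len(best) > top_n:
            heapq.heappop(best)
    for fh in readers:
        fh.close()
    for name in run_files:
        os.unlink(name)
    return sorted(best, reverse=True)  # (count, composite), biggest first
```

Memory use is bounded by the batch size plus the heap, regardless of how many times one composite dominates the file.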
2021-09-20, 17:36   #4
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

5924₁₀ Posts

Quote:
 Originally Posted by kruoli Have you tried using the --buffer-size=SIZE and --temporary-directory=DIR options of sort? It should not need so much memory then, and neither should uniq. Or is disk space a problem? Does your sort default to in-memory sorting? Also, use sort [other options] -r in the second occurrence and head instead of tail. Edit: I can think of a more efficient implementation in PowerShell; since you can now pass C# code to it directly, I'll give it a try later, because the normal PowerShell syntax is still a struggle for me. We should be able to nail down the memory requirements quite drastically this way.
I have used those options (and forgot about them when writing the post). They reduce the memory usage to the buffer size plus the file size, but slow things down significantly.

The number of unique composites is fairly limited compared with the total number of composites. They should fit in memory with little issue. Summing the counts is probably a good idea. Maybe better would be a perl or python script that counts the composites. I suppose I should just bite the bullet and write that script.

2021-09-20, 18:09   #5
nordi

Dec 2016

7·11 Posts

Quote:
 Originally Posted by henryzz The number of unique composites is fairly limitted compared with the number of composites. They should fit in memory with little issue.
In that case you can use the following Python script:
Code:
from collections import Counter

with open("YOUR_FILE_NAME") as f:
    print(Counter(line.rstrip() for line in f).most_common(10000))

2021-09-21, 08:36   #6
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

5924₁₀ Posts

Quote:
 Originally Posted by nordi In that case you can use the following Python script: Code: from collections import Counter with open("YOUR_FILE_NAME") as f: print(Counter(line.rstrip() for line in f).most_common(10000))
Thank you. This works with very low memory usage (not even 1 GB).

The only issue is that it outputs in a different format:
Code:
[('comp1', count1), ('comp2', count2)]
while the previous method outputs:
Code:
count1 comp1
count2 comp2

I can alter my parsing to cope with this, although an easier solution would be useful if there is one.

This sort of code snippet is why I posted. If I had written this, it would have been done from scratch using hash tables. While I can read and write basic python, I don't know the libraries at all.
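For what it's worth, the from-scratch hash-table version is barely longer than the Counter one. A sketch (the function name is my own), using heapq.nlargest to pick the top entries, which is essentially what most_common() does internally:

```python
import heapq

def top_composites(path, n=10_000):
    # Plain-dict equivalent of collections.Counter over the file's lines.
    counts = {}
    with open(path) as f:
        for line in f:
            comp = line.rstrip()
            counts[comp] = counts.get(comp, 0) + 1
    # Pick the n entries with the largest counts without sorting everything.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])
```

Counter mainly saves the `.get(comp, 0) + 1` boilerplate and bundles the top-n selection into most_common().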

2021-09-21, 15:39   #7
nordi

Dec 2016

1001101₂ Posts

Quote:
 Originally Posted by henryzz the previous method outputs: Code: count1 comp1 count2 comp2

You can get that with
Code:
from collections import Counter

with open("FILE_NAME") as f:
    counter = Counter(line.rstrip() for line in f)

for composite, count in counter.most_common(10000):
    print(count, composite)
That's also easier to adjust if you want a different format.
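For output written straight to a file in exactly the count-first shape of the original pipeline, the print can be replaced with a formatted write. A sketch (the function name is my own; the width-7 padding roughly mimics GNU uniq -c, and reversing puts the biggest counts last, like sort -n | uniq -c | sort -n):

```python
from collections import Counter

def write_best(path, out_path, n=10_000):
    # Count duplicate lines, then write the top n as "count composite".
    with open(path) as f:
        counter = Counter(line.rstrip() for line in f)
    with open(out_path, "w") as out:
        # Smallest of the top n first, biggest last.
        for composite, count in reversed(counter.most_common(n)):
            out.write(f"{count:7d} {composite}\n")
```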

2021-09-21, 18:05   #8
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2²·1,481 Posts

Quote:
 Originally Posted by nordi You can get that with Code: from collections import Counter with open("FILE_NAME") as f: counter = Counter(line.rstrip() for line in f) for composite, count in counter.most_common(10000): print(count, composite) That's also easier to adjust if you want a different format.

Thank you. That does look much easier to fiddle with. Most of my contact with python has been through editing existing scripts on this forum (sometimes heavily), which has led to a fairly narrow skill set. This solution would have been obvious to me in C#.

