Cost-efficiently pair each line of a file with all others

Question

I have a very large file containing only numbers, one per line, and I want to pair each line with all the others. The output should look like this:

123212,234234
123212,12324
123212,1243223
123212,5453443
234234,123212
234234,12324
234234,1243223
234234,5453443
12324,123212
12324,234234
12324,1243223
12324,5453443
1243223,123212
1243223,234234
1243223,12324
1243223,5453443
5453443,123212
5453443,234234
5453443,12324
5453443,1243223

The input file contains more than 50 lakh (5 million) records, so doing this with a loop will be a costly operation.

GNU parallel might handle this efficiently, but I'm not sure how to test that. — Joshua Goldberg, Jan 19 '23 at 22:51

John1024 · Answer 1 · 2015-09-03T04:50:37.367

All methods for creating this output will be costly. This approach, however, will work even if the file is much larger than RAM:

$ while read n; do awk -v n="$n" '$1!=n{print n "," $1}' file; done <file
123212,234234
123212,12324
123212,1243223
123212,5453443
234234,123212
234234,12324
234234,1243223
234234,5453443
12324,123212
12324,234234
12324,1243223
12324,5453443
1243223,123212
1243223,234234
1243223,12324
1243223,5453443
5453443,123212
5453443,234234
5453443,12324
5453443,1243223

Written over multiple lines

while read n
do
    awk -v n="$n" '$1!=n{print n "," $1}' file
done <file

read n reads numbers from file one at a time. For each n, the awk script is run to create that part of the output for which n is in the first column. The option -v n="$n" creates an awk variable named n which has the same value as the shell variable n. The condition $1!=n selects those lines in file for which the number on that line of the file differs from n. For those lines, we print the number n, followed by a comma, followed by the number on the line.
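Note that the condition $1!=n compares values, so a number that occurs on several lines is never paired with its other occurrences. If such pairs should be kept, a variant can compare line numbers instead. The following is a minimal sketch under that assumption; the function name pair_by_line and the counter i are illustrative and not part of the original answer:

```shell
# pair_by_line FILE
# Pair every line of FILE with every other line, skipping only the line
# itself. Lines are compared by line number, not by value, so duplicate
# values on different lines are still paired with each other.
pair_by_line() {
    file=$1
    i=0
    while IFS= read -r n; do
        i=$((i + 1))
        # NR is awk's current line number; skip the line we are pairing from.
        awk -v n="$n" -v i="$i" 'NR != i { print n "," $1 }' "$file"
    done < "$file"
}
```

It would be called as pair_by_line file > output. Like the original loop, it starts one awk process per input line, so it trades speed for a small memory footprint.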

iruvar · Answer 2 · 2015-09-04T04:24:53.567

I agree with John, this is going to be expensive no matter what.

join -o 1.2,1.3,2.2,2.3 -j 1 <(awk '{printf "%s %d %s\n", "x", FNR, $0}' file) \
<(awk '{printf "%s %d %s\n", "x", FNR, $0}' file) |
awk '$1 != $3{print $2 "," $4}'

This starts two process substitutions that each use awk to return the contents of the file with two synthetic fields inserted at the beginning of each record: the first field holds a fixed value (x in the example above) and the second holds the line number. Both streams are fed to join with field 1 as the join field, so every record from the first stream matches every record from the second. A final awk step discards the records that were paired with themselves, which are exactly the cases where the two line numbers are equal.
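As a sketch of the same idea without process substitution (so that it also runs in a plain POSIX shell), the decorated copy can be written to a temporary file once and joined with itself. The function name cross_join and the temporary file are assumptions made for illustration, and the output uses a comma to match the format requested in the question:

```shell
# cross_join FILE
# Decorate each line with a constant key "x" and its line number, join the
# decorated file with itself on the constant key (which pairs every line
# with every line), then drop the pairs whose line numbers are equal.
cross_join() {
    decorated=$(mktemp) || return 1
    awk '{ print "x", FNR, $0 }' "$1" > "$decorated"
    # Output fields: left line number, left value, right line number, right value.
    join -1 1 -2 1 -o 1.2,1.3,2.2,2.3 "$decorated" "$decorated" |
        awk '$1 != $3 { print $2 "," $4 }'
    rm -f "$decorated"
}
```

This assumes the values contain no whitespace, which holds for a file of plain numbers.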

Thomas Erker · Answer 3 · 2015-09-04T17:40:17.873

Create a SQLite database and join each line with every other line:

sqlite3 tmp.db
sqlite> CREATE TABLE T (x INTEGER);
sqlite> .import input_file T
sqlite> .mode csv
sqlite> .output output_file
sqlite> SELECT * FROM T JOIN T AS S WHERE T.x != S.x;

This solution does not guarantee that the output follows the order of the input lines, but it starts only one process, has no external loops, and should work with limited RAM.

Update: Fixed the SELECT statement so that it does not join a value with itself. If equal values are OK as long as they do not come from the same line, use WHERE T.rowid != S.rowid instead.
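The rowid variant can also be run non-interactively. The following is a rough sketch, assuming the sqlite3 command-line shell is installed; the function name pair_with_sqlite and the throwaway database file are illustrative. It uses the default list mode with a comma separator rather than csv mode, because csv mode may end each row with CRLF:

```shell
# pair_with_sqlite INPUT OUTPUT
# Import INPUT (one number per line) into a temporary SQLite database and
# write every pair of distinct rows to OUTPUT as "a,b". Rows are compared
# by rowid, so equal values on different lines are still paired.
pair_with_sqlite() {
    input=$1
    output=$2
    db=$(mktemp) || return 1
    sqlite3 "$db" <<EOF
CREATE TABLE T (x INTEGER);
.import "$input" T
.separator ","
.output "$output"
SELECT T.x, S.x FROM T JOIN T AS S WHERE T.rowid != S.rowid;
EOF
    rm -f "$db"
}
```

As with the interactive session, the order of the output rows is not guaranteed.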

score 0 · Answer 4 · answered Sep 03 '15 at 15:56

Would you also consider using a totally different application, like kdb+?

(its 32-bit version is free-as-in-beer with a 4 GB memory limit)

Some basics:

Loading your file as a single-column numeric list.
```
flip (enlist "I";",") 0: hsym `$"/path/to/input"
```
- 0: is a multi-purpose function to load from the input file. For the purpose of this question, treat (enlist "I";",") simply as the file format specification, and then apply a flip to turn the output to a usable list.
Applying the cross function.
```
a cross a:... <from above>
```
- q (the language of kdb+) can be quite terse, but that also means a variable can be assigned (e.g. a:42 assigns 42 to a) and used within the same expression. Here, we assign the file input to a variable a so that we can cross it with itself.
Prepare the string-ed output.
```
"," 0: flip a... <from above>
```
- Once again, 0: is used to prepare the results into comma-delimited strings here.
Write to output file.
```
(hsym `$"/path/to/output") 0: ","... <from above>
```
- This time round, we need () around the left argument of 0: to make the functional usage for hsym explicit. Finally, 0: is used here for the third time to write to a file.

Putting it all together:

(hsym`$"/path/to/output")0:","0:flip a cross a:flip(enlist"I";",")0:hsym`$"/path/to/input"

And now, for the bad news...

The 4 GB RAM limitation of the 32-bit free version only handles up to about 6000 lines...

q)\ts (hsym`$"output6k.txt")0:","0:flip a cross a:flip(enlist"I";",")0:hsym`$"test6k.txt"
23428 3378126736
q)count distinct flip (enlist "I";",") 0:hsym`$"test6k.txt"
6000

\ts shows that the time taken is just under 24 seconds, using up almost 3.4 GB of memory.
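To see why in-memory approaches hit a wall so quickly, it helps to count the output. Excluding self-pairs, n lines produce n(n-1) ordered pairs. Below is a rough sketch of that arithmetic in shell; the function names and the assumed 7-digit line width are illustrative only, and 64-bit shell arithmetic is assumed:

```shell
# pair_count N: number of ordered pairs of distinct lines in an N-line file.
pair_count() {
    n=$1
    echo $(( n * (n - 1) ))
}

# output_bytes N DIGITS: approximate size of the paired output, assuming
# every number has DIGITS digits (each output line is then two numbers,
# one comma, and one newline).
output_bytes() {
    n=$1
    digits=$2
    echo $(( n * (n - 1) * (2 * digits + 2) ))
}

pair_count 6000            # the 6000-line test above
pair_count 5000000         # roughly the size mentioned in the question
output_bytes 5000000 7     # approximate output size in bytes
```

For the roughly 5 million lines mentioned in the question, this gives about 2.5e13 output lines, on the order of 400 TB of text under the 7-digit assumption, regardless of which tool produces it.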

(I decided to still post this as an answer, to not let my efforts go to waste...)
