
Given a file L with one non-negative integer per line and a text file F, what would be a fast way to keep only those lines in F whose line number appears in file L?

Example:

$ cat L.txt
1
3

$ cat F.txt
Hello World
Hallo Welt
Hola mundo

$ command-in-question -x L.txt F.txt
Hello World
Hola mundo

I'm looking for a command that can handle a file L with 500 million or more entries; file L is sorted numerically.

Note: I'm halfway through an implementation for a command-in-question, but I just wondered whether one might be able to use some Unix tools here as well.
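
For illustration only, here is a rough sketch (in Python, not the actual command-in-question) of the single-pass idea that a sorted L allows: read the next wanted line number from L, then advance through F until that line is reached. It assumes the line numbers in L are positive and strictly increasing.

#!/usr/bin/env python3
# Rough sketch only: stream both files once, relying on L being numerically sorted.
import sys

def filter_lines(l_path, f_path, out=sys.stdout):
    with open(l_path) as l_file, open(f_path) as f_file:
        # lazily yield the wanted line numbers, in ascending order
        wanted = (int(n) for n in l_file if n.strip())
        target = next(wanted, None)
        for lineno, line in enumerate(f_file, start=1):
            if target is None:
                break                       # no more wanted line numbers
            if lineno == target:
                out.write(line)
                target = next(wanted, None)

if __name__ == '__main__':
    filter_lines(sys.argv[1], sys.argv[2])

Called with L.txt and F.txt as arguments it should print the example output above, and memory use stays constant no matter how large L gets.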


Update: Thanks for all the answers, I learned a lot today! I would like to accept more than one answer, but that's not possible.

I took the fastest solution from the current answers and put it into a standalone tool: filterline.

miku

10 Answers


grep -n | sort | sed | cut

(   export LC_ALL=C
    grep -n ''   | sort -t:  -nmk1,1 ./L - |
    sed /:/d\;n  | cut  -sd: -f2-
)   <./F

That should work pretty quickly (some timed tests are included below) with input of any size. Some notes on how:

  • export LC_ALL=C
    • Because the point of the following operation is to get the entire file of ./F stacked up inline with its ./L lineno's file, the only characters we'll really need to worry about are ASCII [0-9]digits and the :colon.
    • For that reason it is simpler to worry about finding those 11 characters in a set of 128 possibles than it is if UTF-8 is otherwise involved.
  • grep -n ''
    • This inserts each line's number followed by a : colon - the string LINENO: - at the head of every line in stdin - or <./F.
  • sort -t: -nmk1,1 ./L -
    • sort neglects to sort its input files at all, and instead (correctly) presumes they are presorted and -merges them in -numerically sorted order, ignoring basically anything beyond any possible -k1,1st occurring -t:colon character anyway.
    • While this may require some temp space to do (depending on how far apart some sequences may occur), it will not require much as compared to a proper sort, and it will be very fast because it involves zero backtracking.
    • sort will output a single stream where any lineno's in ./L will immediately precede the corresponding lines in ./F. ./L's lines always come first because they are shorter.
  • sed /:/d\;n
    • If the current line matches a /:/colon delete it from output. Else, auto-print the current and next line.
    • And so sed prunes sort's output down to sequential pairs - a line which does not match a colon followed by the line after it - or, in other words, to a line from ./L and then the next line.
  • cut -sd: -f2-
    • cut -suppresses from output those of its input lines which do not contain at least one of its -d:elimiter strings - and so ./L's lines are pruned completely.
    • For those lines which do, their first : colon-delimited -field is cut away - and so goes all of grep's inserted lineno's.

small input test

seq 5 | sed -ne'2,3!w /tmp/L
        s/.*/a-z &\& 0-9/p' >/tmp/F

...generates 5 lines of sample input. Then...

(   export LC_ALL=C; </tmp/F \
    grep -n ''   | sort -t:  -nmk1,1 ./L - |
    sed /:/d\;n  | cut  -sd: -f2-
)|  head - /tmp/[FL]

...prints...

==> standard input <==
a-z 1& 0-9
a-z 4& 0-9
a-z 5& 0-9

==> /tmp/F <==
a-z 1& 0-9
a-z 2& 0-9
a-z 3& 0-9
a-z 4& 0-9
a-z 5& 0-9

==> /tmp/L <==
1
4
5

bigger timed tests

I created a couple of pretty large files:

seq 5000000 | tee /tmp/F |
sort -R | head -n1500000 |
sort -n >/tmp/L

...which put 5mil lines in /tmp/F and 1.5mil randomly selected lines of that into /tmp/L. I then did:

time \
(   export LC_ALL=C
    grep -n ''   | sort -t:  -nmk1,1 ./L - |
    sed /:/d\;n  | cut  -sd: -f2-
)   <./F |wc -l

It printed:

1500000
grep -n '' \
    0.82s user 0.05s system 73% cpu 1.185 total
sort -t: -nmk1,1 /tmp/L - \
    0.92s user 0.11s system 86% cpu 1.185 total
sed /:/d\;n \
    1.02s user 0.14s system 98% cpu 1.185 total
cut -sd: -f2- \
    0.79s user 0.17s system 80% cpu 1.184 total
wc -l \
    0.05s user 0.07s system 10% cpu 1.183 total

(I added the backslashes there)

Among the solutions currently offered here, this is the fastest of all of them but one when pitted against the dataset generated above on my machine. Of the others only one came close to contending for second-place, and that is meuh's perl here.

This is by no means the original solution offered - it has dropped a third of its execution time thanks to advice/inspiration offered by others. See the post history for slower solutions (but why?).

Also, it is worth noting that some other answers might very well contend better if it were not for the multi-cpu architecture of my system and the concurrent execution of each of the processes in that pipeline. They all work at the same time - each on its own processor core - passing around the data and doing their small part of the whole. It's pretty cool.

but the fastest solution is...

But it is not the fastest solution. The fastest solution offered here, hands-down, is the C program. I called it cselect. After copying it to my X clipboard, I compiled it like:

xsel -bo | cc -xc - -o cselect

I then did:

time \
    ./cselect /tmp/L /tmp/F |
wc -l

...and the results were...

1500000
./cselect /tmp/L /tmp/F  \
    0.50s user 0.05s system 99% cpu 0.551 total
wc -l \
    0.05s user 0.05s system 19% cpu 0.551 total
mikeserv
  • You can make it significantly faster (almost as quick as mine on multi-core systems) with sed -ne'/:/!{n;p;}' | cut -d: -f2- instead of sed -ne'/:/!N;/\n/s/[^:]*://p' – Stéphane Chazelas Jun 14 '15 at 09:05
  • @StéphaneChazelas - you might get better results if you switch seds - the sed I'm using is the heirloom sed - you can see the alias value in the time results. My heirloom package, by the way, is statically compiled against a musl libc - the regex implementation for which is based on TRE. When I switch it to the GNU sed - and run it without cut - it adds a full second to the completion time (2.8 secs) - compounds it by more than a third. And that's only .3 secs faster than yours on my system. – mikeserv Jun 14 '15 at 09:59
  • sort -mn as opposed to sort -nmk1,1 might be better as you don't need to do the splitting here (not tested) – Stéphane Chazelas Jun 14 '15 at 16:34
  • @StéphaneChazelas - yeah, I thought the same and I tried it every which way. -n is spec'd just to do the first numeric string on a line so I figured, ok -mn or -nm and, for whatever reason the only times it ever dipped below 2sec in completion time was when I added in all of the options as is. It's weird - and it's the reason yesterday I didn't tack on the -m in the first place - I knew what I was about, but it seemed just to work out as some kind of auto-optimization thing. Interestingly, the heirloom sort has a -z string-length option which only applies to -[cm].... – mikeserv Jun 14 '15 at 16:50
  • -n is not the first numeric string on the line. It just considers the line as a number so abc 123 would be 0. So it can't be less efficient than with -t: -k1,1 – Stéphane Chazelas Jun 14 '15 at 17:08
  • @StéphaneChazelas - I meant the first in the key. But I think the reason it is faster: -n - "Restrict the sort key to an initial numeric string, consisting of optional blank characters, optional -, and digits with an optional radix character and thousands separators (as defined in the current locale), which shall be sorted by arithmetic value. An empty digit string shall be treated as zero. Leading zeros and signs on zeros shall not affect ordering." - is that there are a lot of \(.*\)* in there. I guess just [:\n] shuts that down. I did think so too, but tests showed otherwise. – mikeserv Jun 14 '15 at 17:15

I'd use awk, but without storing the whole content of L.txt in memory or doing unnecessary hash lookups ;-).

list=L.txt file=F.txt
LIST="$list" awk '
  function nextline() {
    if ((getline n < list) <=0) exit
  }
  BEGIN{
    list = ENVIRON["LIST"]
    nextline()
  }
  NR == n {
    print
    nextline()
  }' < "$file"
  • Exactly, I tried hash-maps and they would exceed memory; bitsets will buy you more headroom; but by using the fact that the input is sorted, you can get rid of this (space) problem entirely. – miku Jun 13 '15 at 10:55
  • Since L is sorted, this is probably the preferable solution with such huge data sets (+1). - One question, though: is there a reason why you use a variable 'list' instead of just using the file name literally in getline? – Janis Jun 13 '15 at 10:57
  • @Janis; isn't that just a case of standard good coding practice: do not hard code literals - use variables instead ... (more flexible and less error prone, and easier to maintain) – Peter.O Jun 13 '15 at 11:42
  • @StéphaneChazelas: It needs pre-loop initialisation of n, otherwise (as-is) it misses 1 in L.txt – Peter.O Jun 13 '15 at 12:03
  • @Peter.O, oops, that's what I had tried to address with NR>=n, but that was wrong. Should be better now. – Stéphane Chazelas Jun 13 '15 at 12:17
  • @Janis, the idea was that if that code was to be embedded in a command-in-question script, then you can't have the filename embedded in the code. -v list="$opt_x" doesn't work either because of the backslash-processing done by awk on it. Which is why I use ENVIRON instead here. – Stéphane Chazelas Jun 13 '15 at 12:19
  • Nice one! With a list named \List.txt and using the -v option, it fails: "awk: warning: escape sequence \L treated as plain L" -- but perhaps worse than just failing is that if a file named List.txt (no backslash) exists, the script processes the wrong file... When using ENVIRON it works as expected on the oddly named \List.txt file – Peter.O Jun 13 '15 at 12:35
  • @Peter.O; It's certainly good practice in large programs, specifically where one literal would be used many times. Putting every literal in a variable is not per se good practice, often rather the opposite. It's also good practice if one wants to have several declarations collected in one place, as Stephane's latest version shows (but not the original one). - Stephane's explanation still holds, to have the script externally parameterized. – Janis Jun 13 '15 at 14:27
  • @Janis, actually factorising code by using functions incurs a significant performance cost here. – Stéphane Chazelas Jun 14 '15 at 09:18
  • @Stephane; I just noticed that with my own performance measurements. Updated my post with that information. – Janis Jun 14 '15 at 10:26

I'd use awk:

awk 'NR==FNR {a[$1]; next}; FNR in a' L.txt F.txt

Update: I've done performance measurements; it seems this version scales even better with very large data sets (as is the case with the stated requirements), since the comparison is very fast and more than compensates for the effort necessary to build up the hash table.

Janis
  • @miku; Yes, it's a nice compact solution. But a caveat; not all awks might be able to handle such huge data sets. - I am using GNU awk and there are no problems; the test with 500 million lines of data required 7 minutes. – Janis Jun 13 '15 at 11:01
  • This is rather slow (by comparison): real 16m3.468s - user 15m48.447s - sys 0m10.725s. It used 3.3 GB of RAM testing a 1/10'th size L with 50,000,000 lines; and F with 500,000,000 lines - vs time for Stéphane Chazelas' awk answer: real 2m11.637s - user 2m2.748s - sys 0m6.424s - I'm not using a fast box, but the comparison is interesting. – Peter.O Jun 13 '15 at 22:19
  • @Peter.O; Thanks for the data! A slower speed was to be expected, given that (in my own test case) half a billion lines were stored in an associative array. (That's why I commented "(+1)" above for Stephane's proposal.) - Though I was astonished that this terse solution was still processing 1 million lines per second! I think it makes this code pattern (because of its simplicity!) a viable option, and specifically in cases with less extreme data sizes. – Janis Jun 14 '15 at 04:38
  • It definitely is a viable solution. On the test data I used (5mil lines/1.5mil L) yours completed in a little over 4 seconds - only a second behind Stephane's answer. The code used to gen the test set is in my answer, but it's mostly just seq output and then a smaller, randomly selected subset of same in L. – mikeserv Jun 14 '15 at 05:42
  • I just made some more performance measurements with a data file size of 500 million lines and a key file size of 50 million and resp. 500 million lines, with a noteworthy observation. With the smaller key file the times are 4 min (Stephane) vs. 8 min (Janis), while with the larger key file it's 19 min (Stephane) vs. 12 min (Janis). – Janis Jun 14 '15 at 10:11
  • Manufactured random data (with guaranteed properties; i.e. disjunct data and keys, and all keys appear in the data). – Janis Jun 14 '15 at 10:30
  • You manufactured 500mil lines of random data? How? – mikeserv Jun 14 '15 at 10:32

With C, omitting meaningful error messages:

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {

    FILE *L;
    FILE *F;

    unsigned int to_print;
    unsigned int current = 0;
    char *line = NULL;
    size_t len = 0;

    if ((L = fopen(argv[1], "r")) == NULL) {
        return 1;
    } else if ((F = fopen(argv[2], "r")) == NULL) {
        fclose(L);
        return 1;
    } else {

        while (fscanf(L, "%u", &to_print) > 0) {
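            /* advance through F, counting lines, until the wanted line number from L is reached */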
            while (getline(&line, &len, F) != -1 && ++current != to_print);
            if (current == to_print) {
                printf("%s", line);
            }
        }

        free(line);
        fclose(L);
        fclose(F);
        return 0;
    }
}
FloHimself
  • This is the most performant answer here. At least, it is so by my tests. In case anyone is interested, I compiled it like: xsel -bo | cc -xc - -o cselect. And it just worked - it only needs the two libs. – mikeserv Jun 14 '15 at 00:41
  • Thanks, this is great! I hope you don't mind, but I wrapped your code up into a little tool. – miku Jun 15 '15 at 07:29
  • @miku Go ahead, I'm glad I could help. I noticed you've increased LINE_MAX in your version, so you probably work with very large lines in your files. I've updated the answer with a version using getline() to remove the line size limit. – FloHimself Jun 15 '15 at 10:53
  • @FloHimself, well, thanks again : ) Indeed, some input lines might exceed LINE_MAX, so getline seems just right. – miku Jun 15 '15 at 10:56

I wrote a simple Perl script to do that:

Usage: script.pl inputfile_f inputfile_l

#!/usr/bin/env perl

$number_arguments = $#ARGV + 1;
if ($number_arguments != 2) {
    die "Usage: script.pl inputfile_f inputfile_l\n";
}

open($f, '<', $ARGV[0])
    or die "$ARGV[0]: Not found\n";
open($l, '<', $ARGV[1])
    or die "$ARGV[1]: Not found\n";

@line_numbers = <$l>;

while ($line = <$f>) {
    $count_f ++;
    if ($count_f == $line_numbers[$count_l]) {
        print $line;
        $count_l ++;
    }
}
  • Loads F.txt
  • Loads L.txt
  • Stores each line of L.txt into an array
  • Reads F.txt line by line, tracking its current line number and the current array index; increases the F.txt current line number; if the F.txt current line number matches the content of the array at the current array index, it prints the current line and increases the index

Cost and complexity considerations:

Given N1 as the number of lines in F.txt and N2 as the number of lines in L.txt (obviously assuming N1 > N2), and counting the cost of assignments, comparisons and prints, the while loop runs at most N1 times; that amounts to 2N1 + N2 assignments, 2N1 comparisons and N2 prints. Taking each operation as equally expensive, the total cost of running the while loop is 4N1 + 2N2, which makes the script O(N).
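
Restated as a formula (the same count as above, just written out; T(N1, N2) is the total number of elementary operations, in LaTeX notation):

T(N_1, N_2) = \underbrace{2N_1 + N_2}_{\text{assignments}} + \underbrace{2N_1}_{\text{comparisons}} + \underbrace{N_2}_{\text{prints}} = 4N_1 + 2N_2 \in O(N_1)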

Test on a 10-million-line input file:

Using a 10-million-line F.txt file containing random 50-character-long lines and a 10-million-line L.txt file containing the numbers from 1 to 10000000 (worst-case scenario):

~/tmp$ for ((i=0; i<3; i++)); do time ./script.pl F.txt L.txt > output; done

real    0m15.628s
user    0m13.396s
sys 0m2.180s

real    0m16.001s
user    0m13.376s
sys 0m2.436s

real    0m16.153s
user    0m13.564s
sys 0m2.304s
kos

Just for completeness: we can merge the excellent awk script in the answer by Stéphane Chazelas and the perl script in the answer by kos, but without keeping the entire list in memory, in the hope that perl might be faster than awk. (I've changed the order of args to match the original question).

#!/usr/bin/env perl
use strict;

die "Usage: $0 l f\n" if $#ARGV+1 != 2;
open(L,$ARGV[0]) or die "$ARGV[0]: $!";
open(F,$ARGV[1]) or die "$ARGV[1]: $!";

while(my $number = <L>){
    #chop $number;
    while (<F>) {
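        # $. is the line number of the most recently read line of F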
        if($. == $number){
            print;
            last;
        }
    }
}
meuh
  • This is way faster than the awk. It's about as fast as mine - I tested both three times just now and each time mine handled my 5mil line testset in 1.8... seconds and yours 1.9... seconds each time. The testset gen code is in my answer if you care, but the point is that it is very good. What's more, the output is correct - I still can't make the awk work... Still though, both of our answers are put to shame by FloHimself's. – mikeserv Jun 14 '15 at 06:32
  • @mikeserv, we must have different awks. On your sample, I get 1.4s with gawk (4s for Janis'), 0.9s with mawk, 1.7s with this perl solution, 2.3s with kos', 4.5s with yours (GNU sed), and 1.4s with yours (GNU sed) and my suggested improvement (and 0.5s for the C solution). – Stéphane Chazelas Jun 14 '15 at 12:55
  • @mikeserv, ah! of course with your approach, the locale makes a difference. Down from 4.5s to 2.3s here when switching from UTF-8 to C. – Stéphane Chazelas Jun 14 '15 at 12:59

This perl solution is faster than the other awk or perl solutions by 20% or so, but obviously not as fast as the solution in C.

perl -e '
  open L, shift or die $!;
  open F, shift or die $!;
  exit if ! ($n = <L>);
  while (1) {
    $_ = <F>;
    next if $. != $n;
    print;
    exit if ! ($n = <L>);
  }
' -- L F
jrw32982
cat <<! >L.txt
1
3
!

cat <<! >F.txt
Hello World
Hallo Welt
Hola mundo
!

cmd(){
 L=$1 F=$2
 cat -n $F |
 join $L - |
 sed 's/[^ ]* //'
}

cmd L.txt F.txt
Hello World
Hola mundo

Since L.txt is sorted you can use join. Just number each line in F.txt, join the two files, then remove the line number. No large intermediate files are needed.

Actually, the above will mangle your data lines by replacing all white space with a single space. To keep the line intact you need to choose as a delimiter some character that does not appear in your data, e.g. "|". The cmd is then

cmd(){
 L=$1 F=$2
 cat -n $F |
 sed 's/^ *//;s/\t/|/' |
 join -t'|' $L - |
 sed 's/[^|]*|//'
}

The first sed removes leading spaces from the "cat -n" output and replaces the tab. The second sed removes the line number and "|".

meuh
  • I'm afraid this won't work on larger files. It needs <10 lines. I had the same idea and tried join L.txt <(nl F.txt ) but it won't work on large files. Welcome to the site, by the way, it's not often we get such clear and well-formatted answers from new users! – terdon Jun 13 '15 at 19:24
  • @terdon, Yes, a shame that join/comm can't work with numerically sorted input. – Stéphane Chazelas Jun 13 '15 at 20:42
  • @terdon: I followed up on your lead (now deleted), and tried join -t' ' <(<L.txt awk '{printf("%010s\n",$0)}') <(<F.txt awk '{printf("%010s %s\n",NR,$0)}') | cut -d' ' -f2- -- It was slow! - and even when I fed in prepared files with suitable 0-padded keys join -t' ' L.txt F.txt | cut -d' ' -f2-, it was still slow (not including the preparation time) - slower than the awk answer by @Janis (where I've posted a comment re the actual times taken for both his and @StéphaneChazelas' answer) – Peter.O Jun 13 '15 at 23:33
  • @Peter.O yeah. I tried a similar approach which avoids one of the awks but I couldn't find a way to make it both work and be worth it. – terdon Jun 13 '15 at 23:36
  • @terdon and others: The actual time for the join + awk printf process substitution was real 20m11.663s user 19m35.093s sys 0m10.513s vs Stéphane Chazelas' real 2m11.637s user 2m2.748s sys 0m6.424s using L 50 million lines, F 500 million lines. – Peter.O Jun 13 '15 at 23:43
  • Welcome to Stack Exchange.  You should always quote shell variables (e.g., "$F" and "$L") unless you have a good reason not to, and you’re sure you know what you’re doing. – G-Man Says 'Reinstate Monica' Jun 14 '15 at 02:31

For completeness, another attempt at the join solution:

sed -r 's/^/00000000000000/;s/[0-9]*([0-9]{15})/\1/' /tmp/L | join <( nl -w15 -nrz /tmp/F ) - | cut -d' ' -f2-

This works by formatting the line-number column that join works on as fixed length with leading zeros, so that the numbers are always 15 digits long. This circumvents the problem of join not liking normal numerical sort order, as the column has effectively been forced into dictionary sort order. nl is used to add line numbers in this format to F.txt. Unfortunately sed needs to be used to reformat the numbering in L.txt.

This approach seems to work OK on the test data generated using @mikeserv's method. But it is still very slow - the C solution is 60x faster on my machine. About 2/3 of the time is spent in sed and 1/3 in join. Perhaps there is a better sed expression...

  • Ok - but why are we prepending all of the zeroes? I'm trying to get a feel for this. Also, nl's super cool, but you can't robustly use it on untested input. One of the things which makes it so cool is its logical page delimiter. By default, if there is any line in input consisting of only the delimiter string \: occurring one, two, or three times in succession, your counts will go a little crazy. Experiment w/ it - it's pretty neat. Especially have a look at what happens when nl reads a line with 1 delimiter string and then later another w/ 3 or 2 – mikeserv Jun 15 '15 at 05:11

Since the accepted answer is in C, I figured it's OK to throw a python solution up here:

# Read mask
with open('L.txt', 'r') as f:
    mask = [int(line_num) for line_num in f.read().splitlines()]

# Filter input file
filtered_lines = []
with open('F.txt', 'r') as f:
    for i, line in enumerate(f.read().splitlines()):
        if (i+1) in mask:
            filtered_lines.append(line)

# Write newly filtered file
with open('F_filtered.txt', 'w') as f:
    for line in filtered_lines:
        f.write('%s\n' % line)

If using an external library like numpy, a solution would look even more elegant:

import numpy as np

with open('L.txt', 'r') as f:
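    # line numbers in L.txt are 1-based; subtract 1 for 0-based indexing into the lines array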
    mask = np.array([int(line_num)-1 for line_num in f.read().splitlines()])

with open('F.txt', 'r') as f:
    lines = np.array(f.read().splitlines())
filtered_lines = lines[mask]

with open('F_filtered.txt', 'w') as f:
    for line in filtered_lines:
        f.write('%s\n' % line)
Alex