
I have a large file (say reads.fasta) which has around 5,000,000 lines, and I have another file reads_of_interest which contains a list of the line numbers of reads.fasta that I want to extract.

Is there an easy command line method to do this?

In other words, there is a file large_file.txt. There is another file line_numbers.txt which is of the form

12 
134
1456

And I want to extract lines 12, 134, 1456 from large_file.txt. The number of lines I want to extract is of the order of 500,000.

Thanks!

don_crissti
Devil

  • Could you update the question to include typical line length? Could be that the file could comfortably sit in memory in a perl/awk array. – steve Dec 05 '15 at 00:15
  • 5mil lines is ridiculous. Are you sure that is the best way to delimit the file? Has it been blocked such that the lines are fixed width? That would be the sane way to go, so you could just seek through to offsets. In any case, you'll find benchmarks for several different solutions against a 5mil-line file at this duplicate question: Filter File by Line Number – mikeserv Dec 05 '15 at 05:08
  • Thanks. Yes, that was exactly what I was looking for. The file is a file of DNA reads; I wanted to restrict the downstream processing to a particular subset of reads. – Devil Dec 05 '15 at 05:19

2 Answers


This is an easy and direct way to get what you want. The drawback is that the entire large_file.txt will be scanned. If this is too slow, there are other things to try; one would be loading the file into a database keyed on the line numbers, which would give extremely fast retrieval compared to scanning the file.

#!/bin/sh
awk '
    NR == FNR {
        # first file: store each line number as a key of the linenums array
        for (i=1; i<=NF; i++) {
            linenums[$i]
        }
    }
    NR != FNR {
        # later file(s): print any line whose number is a stored key
        if (FNR in linenums) {
            print
        }
    }
' line_numbers.txt large_file.txt
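For example, run against tiny stand-in files (the contents here are made up for illustration), the script selects exactly the listed lines:

```shell
# Tiny stand-ins for the real files, just for illustration
printf '%s\n' 2 4 > line_numbers.txt
printf 'line%s\n' 1 2 3 4 5 > large_file.txt

# Same logic as the script above, condensed to one line per rule
awk '
    NR == FNR { for (i=1; i<=NF; i++) linenums[$i] }
    NR != FNR { if (FNR in linenums) print }
' line_numbers.txt large_file.txt
```

This prints line2 and line4, the second and fourth lines of large_file.txt.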

NR is the current record number (Number of Records), and FNR is the current record number within the current file.

So when NR == FNR, awk is processing the first file argument; when NR != FNR, awk is processing the second (or later) file.

This reads all of the line numbers from line_numbers.txt and stores them as keys into an array with no data elements, only keys (the linenums array).

When the second file, large_file.txt, is being read, if the current record number has been stored as a key in the array linenums, then the line from large_file.txt will be printed.

The method of looking up the line numbers in the linenums array is relatively fast because awk uses an internal hashing algorithm to lookup the keys.
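As a rough sketch of the database alternative mentioned above, assuming the sqlite3 command-line tool is available (the file and table names here are illustrative, not from the question):

```shell
# Illustrative sketch, assuming sqlite3 is installed; names are made up.
printf 'line%s\n' 1 2 3 4 5 > large_file.txt
rm -f lines.db

# Key every line by its line number, then import into an indexed table.
awk '{print NR "\t" $0}' large_file.txt > keyed.tsv
sqlite3 lines.db <<'SQL'
CREATE TABLE lines(n INTEGER PRIMARY KEY, text TEXT);
.mode tabs
.import keyed.tsv lines
SQL

# Lookups now go through the primary-key index instead of a full scan.
sqlite3 lines.db 'SELECT text FROM lines WHERE n IN (2, 4);'
```

The one-time import still reads the whole file, so this only pays off when you extract lines repeatedly.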

RobertL

  • "If this is too slow there are other things to try. One of which would be loading the file into a database keyed on the line numbers, which would give extremely fast retrieval compared to scanning the file." That would be much faster, but readers need to note that the original loading of the file into the DB would be slower than just extracting the desired lines, as both require reading the entire file, but only one requires inserting the data into the DB. If you need to extract lines more than once, that (or similar pre-splitting) is by far the best solution. – Andrew Henle Dec 05 '15 at 01:32

Assuming file_numbers.txt contains all the line numbers, space-separated, on a single line, and that line is not too large, the following should work:

sed -n "$(<file_numbers.txt sed -e "s/ /p;/g" -e "s/$/p/")" large_file.txt
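To see what the command substitution does, here is the expansion on tiny stand-in files (contents made up for illustration):

```shell
printf '2 4\n' > file_numbers.txt
printf 'line%s\n' 1 2 3 4 5 > large_file.txt

# The inner sed turns the single line "2 4" into the sed program "2p;4p"
<file_numbers.txt sed -e "s/ /p;/g" -e "s/$/p/"

# ...which the outer sed -n then runs against large_file.txt,
# printing only lines 2 and 4
sed -n "$(<file_numbers.txt sed -e "s/ /p;/g" -e "s/$/p/")" large_file.txt
```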
iruvar
  • I suggest changing that to sed -n -f <(<file_numbers.txt sed -e 's/ /p;/g' -e 's/$/p/') large_file.txt, so you don't pass the entire command list on the command line.  Actually, I would say sed -n -f <(sed -e 's/ /p;/g' -e 's/$/p/' file_numbers.txt) large_file.txt, because I don't like putting I/O redirection first.  Also, you should document your assumptions. – Scott - Слава Україні Dec 05 '15 at 04:23
  • That's not gonna work for 500k lines. – mikeserv Dec 05 '15 at 05:10
  • @mikeserv, not sure what you mean - file_numbers.txt is a single-line file. It's large_file.txt that's 500k lines long – iruvar Dec 05 '15 at 05:30
  • largefile is 5mil lines and linenumbers is 500k. – mikeserv Dec 05 '15 at 05:47