11

I went through the answers in this helpful thread, but my problem seems to be different enough that I can't think of a good answer (at least with sed).

I have a large CSV file (200+ GB) with rows that look like the following:

<alphanumerical_identifier>,<number>

where <alphanumerical_identifier> is unique across the entire file. I would like to create a separate file that replaces the first column with an index, i.e.

<index>,<number>

so that we get:

1, <number>
2, <number>
3, <number>

Can awk generate an increasing index without loading the full file in memory?

Since the index increases monotonically, it may be even better to just drop the index entirely. Would the solution for that be very different? I.e.:

<number>
<number>
<number>
  • I am not sure about this solution's feasibility, but how about just generating as many numbers as there are rows in the CSV file in a separate file, and then appending the second column of the CSV file to that file? – Ramesh Sep 18 '14 at 18:38
  • @Ramesh That is perfectly fine as long as the output is correct. – Amelio Vazquez-Reina Sep 18 '14 at 18:44
  • Try: perl -pe '$_ = "$.,$_"' bigfile.csv > newfile.csv, but it might load the full file into memory; I'm not sure. – Raza Sep 18 '14 at 18:45
  • I suspect I'm misunderstanding something; otherwise, awk -F, '{print ++n, $2}' would work. Or awk -F, '{print $2}' for the second variation. – G-Man Says 'Reinstate Monica' Sep 18 '14 at 18:52
  • @G-Man, that's probably it, although FNR would serve just as well as ++n. – iruvar Sep 18 '14 at 18:56
  • I'd triple-check that you really can get rid of that unique identifier... why not add a first (3rd) column with the index, but still keep the identifier? Isn't that identifier used anywhere else? – Olivier Dulac Sep 19 '14 at 07:19

2 Answers

13

Not near a terminal to test, but how about the oft-overlooked nl command? Something like:

cut -f 2 -d , original.csv | nl -w 1 -p -s , > numbered.csv
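
For a quick sanity check, here is the pipeline on a tiny made-up sample (both cut and nl stream line by line, so the 200+ GB file never has to fit in memory):

$ cat original.csv
ax27qp,1001
bz81rm,1002
cd14eh,1003
$ cut -f 2 -d , original.csv | nl -w 1 -p -s ,
1,1001
2,1002
3,1003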

bishop
  • P.S.: A 200GB CSV file? Wow, and I thought working with the North American Ported Number Database as a CSV (a couple of DVDs) was huge! – bishop Sep 18 '14 at 19:53
  • It works, although there is a big blank after the number. I would replace it with: cut -d, -f 2- /tmp/aa | nl -w 1 -p -s , – Ángel Sep 18 '14 at 20:02
  • @Angel: Thanks, updated my answer to use the width option -w 1 instead of left-numbering. – bishop Sep 18 '14 at 20:12
  • Thanks @bishop - Where do the input and output filenames go? – Amelio Vazquez-Reina Sep 18 '14 at 20:13
  • @user815423426 Haha, yeah, whoops... the file names. Added them. – bishop Sep 18 '14 at 20:15
  • Thanks! I guess I know the answer, but what if I just want to keep the second column (as per my OP update)? Is the first part of the command enough? – Amelio Vazquez-Reina Sep 18 '14 at 20:17
  • @user815423426 Yes, the cut command before the pipe symbol (|) will give you just the second column, effectively having implicit line numbers. – bishop Sep 18 '14 at 20:20
7

Here are a few approaches, but none of them will match the speed of the cut and nl solution above:

  1. awk

    awk -F, '{$1=NR;print $1","$2;}' file.csv > newfile.csv
    
  2. Perl

    perl -pe 's/[^,]+/$./' file.csv > newfile.csv
    

    or

    perl -F, -ane '$F[0]=$.; print join ",", @F' file.csv
    
  3. Shell (but I don't recommend it for a 200G file; it will take ages)

    # -r keeps read from mangling backslashes; quoting "$num" preserves the field verbatim
    i=1; while IFS=, read -r foo num; do
            printf "%d,%s\n" "$((i++))" "$num"
    done < file.csv > newfile.csv
    

The above solutions are sorted in order of speed. I tested on my laptop with a 40M file; averaged over 10 runs, they took 2.2282s (awk), 2.4555s (first perl), 3.1825s (second perl), and a whopping 48.6035s for the shell loop. The very clever cut and nl solution you already have was about 4 times faster, at 0.6078s.
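
If you want to reproduce numbers like these, here is a minimal sketch of a timing harness; the file names, the run count, and the use of GNU time are my assumptions, not the exact setup used above:

# a sketch, assuming GNU time is available at /usr/bin/time
for i in $(seq 1 10); do
    /usr/bin/time -f '%e' awk -F, '{$1=NR; print $1","$2}' file.csv > newfile.csv
done 2> times.txt
# average the elapsed-seconds values that GNU time wrote to stderr
awk '{sum += $1} END {print sum/NR "s average over " NR " runs"}' times.txt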

terdon
  • Nice, thanks for the stats! The shell result surprises me, somewhat. If you replace printf with echo, does the timing significantly improve? – bishop Sep 19 '14 at 01:03
  • 40G file processed in 2.2282 seconds? Where can I get me that laptop? – John B Sep 19 '14 at 04:06
  • @JohnB umm, yes, sorry, that was 40M, not G :) – terdon Sep 19 '14 at 13:08
  • I like the resetting of $1 approach with awk. The cut solution is definitely much faster, but that's to be expected as it doesn't replace <alphanumerical_identifier> with anything. I think the fastest awk variant might be something like: mawk 'BEGIN{FS=OFS=","}{$1=NR}1' file.csv > newfile.csv. – John B Sep 19 '14 at 13:54
  • @JohnB ah, yes, I would guess that using OFS instead of explicitly printing , would be slightly faster, and that may well add up to a significant difference in a huge file. – terdon Sep 19 '14 at 14:02