11

I have around 50 very large CSV files, each with thousands of lines.

And I only want to keep the first 200 lines for each of them - I'm okay if the generated files overwrite the original ones.

What command should I use to do this?

Kusalananda
  • 333,661
AGamePlayer
  • 7,605
  • If you have perl this is a one-liner. – Thorbjørn Ravn Andersen Jan 12 '20 at 21:41
  • We now have at least two separate interpretations of what you would want to achieve. Do you want to, for each file, extract 200 lines and put them into a new file (or the original), or do you want to extract 200 lines from each file and put them all into a single new file? – Kusalananda Jan 13 '20 at 07:56

6 Answers

43

Assuming that the current directory contains all CSV files and that they all have a .csv filename suffix:

for file in ./*.csv; do
    head -n 200 "$file" >"$file.200"
done

This outputs the first 200 lines of each CSV file to a new file using head and a redirection. The new file's name is the same as the old one, but with .200 appended. There is no check for whether the new filename already exists.

If you want to replace the originals:

for file in ./*.csv; do
    head -n 200 "$file" >"$file.200" &&
    mv "$file.200" "$file"
done

The && at the end of the head command ensures that mv is not run if there was some issue with running head.

If your CSV files are scattered in subdirectories under the current directory, enable the globstar shell option with shopt -s globstar and replace the pattern ./*.csv in the loop with ./**/*.csv. This will locate any CSV file in or below the current directory and perform the operation on each. The ** globbing pattern matches "recursively" down into subdirectories, but only if the globstar shell option is set.
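
For example, the replace-in-place loop above then becomes (a sketch; globstar is a bash option, so this assumes bash):

shopt -s globstar

for file in ./**/*.csv; do
    head -n 200 "$file" >"$file.200" &&
    mv "$file.200" "$file"
done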


For CSV files containing data with embedded newlines, the above will not work properly as you may possibly truncate a record. Instead, you would have to use some CSV-aware tool to do the job for you.

The following uses CSVkit, a set of command-line tools for parsing and in general working with CSV files, together with jq, a tool for working with JSON files.

There is no tool in CSVkit that can truncate a CSV file at a particular point, but we can convert the CSV files to JSON and use jq to output only the first 200 records:

for file in ./*.csv; do
    csvjson -H "$file" | jq -r '.[:200][] | map(values) | @csv' >"$file.200" &&
    mv "$file.200" "$file"
done

Given some CSV file like the below short example,

a,b,c
1,2,3
"hello, world",2 3,4
"hello
there","my good
man",nice weather for ducks

the csvjson command would produce

[
  {
    "a": "a",
    "b": "b",
    "c": "c"
  },
  {
    "a": "1",
    "b": "2",
    "c": "3"
  },
  {
    "a": "hello, world",
    "b": "2 3",
    "c": "4"
  },
  {
    "a": "hello\nthere",
    "b": "my good\nman",
    "c": "nice weather for ducks"
  }
]

The jq tool would then take this, and for each object in the array (restricted to the first 200 objects), extract the values as an array and format it as CSV.
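
On the short example above, the whole pipeline would therefore produce output along these lines (jq's @csv quotes the string values, and the embedded newlines stay inside the quoted fields):

"a","b","c"
"1","2","3"
"hello, world","2 3","4"
"hello
there","my good
man","nice weather for ducks"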

It's probably possible to do this transformation directly with csvpy, another tool in CSVkit, but as my Python skills are non-existent, I will not attempt to come up with a solution that does that.

Kusalananda
  • 333,661
  • 2
    A CSV file can have newlines in values if they are quoted, so your solution does not work for every CSV file. – talex Jan 13 '20 at 09:38
  • 3
    @talex This is absolutely true. I chose to take the words of the question literally and interpret "lines" as "lines" rather than as "records". I will add a more proper variation taking CSV records into account as soon as my work day allows me to do so. – Kusalananda Jan 13 '20 at 10:19
  • 1
    @talex Something that does this is now added. – Kusalananda Jan 13 '20 at 12:00
  • 1
    Awk will happily deal with CSV values containing a newline. Because of the quoting rules (quote values containing comma, newline, quotes; and double internal quotes), a newline in a value must have an odd number of quotes before it. So you just getline and append until the number of quotes is even (with error checking and runaway format issues, naturally). Performance and data volume don't matter as we are only reading 200 lines, and can then break. We can store 200 lines in an array, and write them back to FILENAME. We can skip that if there are fewer than 200 lines in any case. (A sketch of this approach follows these comments.) – Paul_Pedant Jan 13 '20 at 12:34
  • Any reason for the ./ prefix paranoia? :-) – Jens Jan 14 '20 at 19:01
  • @Jens It's either prefixing the pattern with ./ or calling the utility with -- after the options (csvjson -H -- "$file"). Either of these will avoid the issue that you run into when you have filenames that start with a dash. Try for example to create a file called -f with touch, and then to remove it with rm. – Kusalananda Jan 14 '20 at 19:42
  • If you're using GNU sed then sed -i '201,$d' "$file" should do in-place editing of the file – CSM Jan 14 '20 at 21:09
  • @CSM Yes, but there are variants of this in other answers already. – Kusalananda Jan 14 '20 at 21:12
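
The awk approach Paul_Pedant outlines in the comments could be sketched roughly as follows, run on one file at a time (file.csv is a placeholder); the quote counting, the 200-record array, and the write-back to FILENAME follow his description, but treat it as an illustration rather than a hardened script:

awk '
{
    rec = $0
    nq = gsub(/"/, "\"", rec)            # count double quotes in this line
    while (nq % 2 == 1) {                # odd count: a quoted field continues
        if ((getline more) <= 0) break   # EOF inside a quoted field
        rec = rec "\n" more
        nq += gsub(/"/, "\"", more)
    }
    recs[++count] = rec
    if (count == 200) exit
}
END {
    if (count == 200) {                  # skip files with fewer records
        for (i = 1; i <= count; i++)
            print recs[i] > FILENAME     # write the kept records back
    }
}' file.csv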
24

Previous answers copy data and overwrite files. This technique should keep the same inodes, do no copying, and run a whole lot faster. For each file:

(a) Find the length, in bytes, of the first 200 lines by reading them.

(b) Truncate the file to that length using truncate from GNU coreutils, or with the truncate found on some BSD systems:

SZ="$( head -n 200 -- "${file}" | wc -c )"
truncate -s "${SZ}" -- "${file}"
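
Wrapped in the same sort of loop as the other answers (a sketch, assuming the CSV files are the *.csv files in the current directory):

for file in ./*.csv; do
    SZ="$( head -n 200 -- "${file}" | wc -c )"
    truncate -s "${SZ}" -- "${file}"
done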
Paul_Pedant
  • 8,679
15

Using sed with shell globbing:

sed -ni '1,200p' *.csv

Using globbing/sed/parallel:

printf '%s\n' *.csv | parallel -- sed -ni '1,200p' {}

This will find all .csv files in the current directory and feed them to GNU parallel, which will execute a sed command on them to keep only the first 200 lines. Note this will overwrite the files in place.

Or using head with parallel:

printf '%s\n' *.csv | parallel -- head -n 200 {} ">" {}.out

This will create new files with the .out suffix.
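
If you still want to overwrite the originals, you could then move the .out files back over them, for example (a sketch; it assumes no unrelated .out files are present):

for f in ./*.csv; do
    mv -- "$f.out" "$f"
done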

jesse_b
  • 37,005
  • 2
    It might be worth noting that the same glob + sed approach could be used without parallel (in case the OP doesn't have access to it) - either with xargs or, since the number of files is not huge, plain sed -ni '1,200p' *.csv – steeldriver Jan 11 '20 at 19:02
  • 1
    I tried your example printf '%s\n' *.csv | parallel sed -ni '1,200p' {}, but that did not change any of my test-files. Did something get lost in formatting? – markgraf Jan 11 '20 at 21:34
  • @markgraf: it depends on what version of sed you have, for example with bsd sed you must use -i with a suffix like: sed -ni.bak '1,200p' {} – jesse_b Jan 11 '20 at 21:35
  • That doesn't work either. I'm using parallel from moreutils on ubuntu18.04. Are you using another? – markgraf Jan 11 '20 at 21:39
  • Okay, my version of parallel wants parallel sed -ni '1,200p' -- *.csv to do what it's supposed to do. – markgraf Jan 11 '20 at 21:59
  • @markgraf: Ah that is likely because something in your filename(s) is being interpreted as a command line option. They probably contain dashes? It really should have nothing to do with the version of parallel. – jesse_b Jan 11 '20 at 22:00
  • 2
    No, just a.csv, b.csv ... But apparently parallel from the package moreutils is not the same as the package parallel which is GNU parallel. :-) – markgraf Jan 11 '20 at 22:02
  • Oh so it is, didn't even know that existed. – jesse_b Jan 11 '20 at 22:03
  • Sadly, I don't have the parallel command. How do I install this? – AGamePlayer Jan 12 '20 at 02:50
3

With ksh93 and a POSIX compliant head implementation (one that leaves the cursor within stdin just after the last line it outputs), you can do:

for file in ~(N)./*; do
  [ -f "$file" ] || continue # skip non-regular files
  head -n 200 0<>; "$file" > /dev/null
done

The <>; redirection operator is a variant on the standard <> operator that truncates the file in place after the redirected command has returned, provided the command returns with a success exit status.

Here, we discard head's output, we're just interested in its ability to leave the cursor just after the 200th line.

Unfortunately, ksh93's builtin head (which is enabled if you issue builtin head or if /opt/ast/bin is ahead of any directory with a head command in it in $PATH) does not behave POSIXly in this instance. It reads the input in chunks (like most other head implementations), but does not bother seeking back to the end of the 200th line when invoked that way. To force it to do that seeking back, we need to execute an external command which defeats the purpose of having a builtin head in the first place:

builtin head # enable ksh93's head builtin
{ head -n 200 && /bin/true; } 0<>; file > /dev/null

Another working approach that doesn't involve invoking an external utility would be to do an explicit 0-offset seek after head returns:

builtin head # enable ksh93's head builtin
for file in ~(N)./*; do
  [ -f "$file" ] || continue # skip non-regular files
  { head -n 200 && exec <#((CUR)); } 0<>; "$file" > /dev/null
done

For CSV input specifically, and to retain the first 200 CSV records (as opposed to lines, since a CSV record can contain more than one line when newlines are embedded in "..."-quoted fields), you could use ksh93's read -S, especially designed to read CSVs, in a loop:

for file in ~(N)./*.csv; do
  [ -f "$file" ] || continue # skip non-regular files
  for ((i=0;i<200;i++)); do 
    IFS=, read -rSA discard
  done 0<>; "$file"
done
1

I'm relatively new so please be gentle. I would appreciate constructive feedback if the solution I am proposing isn't optimal.

I created 4 sample files numbered 1 to 4 (e.g. touch {1..4}) and put 10 sample lines in each: lines 1 through 10 in the first file, lines 11 through 20 in the next, and so on.

File 1

Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10 

File 2

Line 11
Line 12
Line 13
Line 14
Line 15
Line 16
Line 17
Line 18
Line 19
Line 20

To extract the first 2 lines of each file as an example (which can be extrapolated to 200), the command head -n 2 {1..4} returns the following output:

==> 1 <==
Line 1
Line 2

==> 2 <==
Line 11
Line 12

==> 3 <==
Line 21
Line 22

==> 4 <==
Line 31
Line 32

The output can be redirected to another file with the command head -n 2 {1..4} > ExtractedOutput

Ryan
  • 177
  • This correctly "keeps n lines for each of them", but it outputs these lines to a single new file, with new content added (the headers from head). I believe that most people would interpret the question as wanting to extract n lines from each file into separate files (possibly "overwriting the original files"). – Kusalananda Jan 13 '20 at 07:43
  • @Kusalananda - I interpreted "And I only want to keep the first 200 lines for each of them - I'm okay if the generated files overwrite the original ones" as asking for a single file, since overwriting is optional. I take the point on the additional headers as a means to identify the source of the data set. Maybe the question could be elaborated on as to the format in which the information should be extracted? If you believe the answer is inappropriate, I would be happy to remove it. – Ryan Jan 13 '20 at 07:53
1

Use ed to truncate each file.

for f in *.csv; do
  printf '201,$d\nwq\n' | ed "$f"
done

If you want to save a backup, it might be easier to use ex instead. (You might also consider ex simpler to use, regardless; just drop the w!%.bak| to skip making a backup first.)

for f in *.csv; do
    ex -c 'w!%.bak|201,$d|wq' "$f"
done
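
For reference, dropping the w!%.bak| as described above gives something like:

for f in *.csv; do
    ex -c '201,$d|wq' "$f"
done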
chepner
  • 7,501