I have around 50 very large CSV files; they have thousands of lines.
And I only want to keep the first 200 lines for each of them - I'm okay with the generated files overwriting the original ones.
What command should I use to do this?
Assuming that the current directory contains all CSV files and that they all have a .csv filename suffix:

for file in ./*.csv; do
    head -n 200 "$file" >"$file.200"
done
This outputs the first 200 lines of each CSV file to a new file using head and a redirection. The new file's name is the same as the old but with .200 appended to the end of the name. There is no check to see if the new filename already exists or not.
If you want to replace the originals:
for file in ./*.csv; do
    head -n 200 "$file" >"$file.200" &&
        mv "$file.200" "$file"
done
The && at the end of the head command makes it so that the mv won't be run if there was some issue with running head.
If your CSV files are scattered in subdirectories under the current directory, then use shopt -s globstar and replace the pattern ./*.csv in the loop with ./**/*.csv. This will locate any CSV file in or below the current directory and perform the operation on each. The ** globbing pattern matches "recursively" down into subdirectories, but only if the globstar shell option is set.
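A minimal sketch of that recursive variant, assuming bash (globstar is a bash shell option):

shopt -s globstar                  # make ** match down into subdirectories
for file in ./**/*.csv; do
    head -n 200 "$file" >"$file.200" &&
        mv "$file.200" "$file"
done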
For CSV files containing data with embedded newlines, the above will not work properly as you may possibly truncate a record. Instead, you would have to use some CSV-aware tool to do the job for you.
The following uses CSVkit, a set of command-line tools for parsing and in general working with CSV files, together with jq, a tool for working with JSON files. There is no tool in CSVkit that can truncate a CSV file at a particular point, but we can convert the CSV files to JSON and use jq to only output the first 200 records:
for file in ./*.csv; do
    csvjson -H "$file" | jq -r '.[:200][] | map(values) | @csv' >"$file.200" &&
        mv "$file.200" "$file"
done
Given some CSV file like the below short example,
a,b,c
1,2,3
"hello, world",2 3,4
"hello
there","my good
man",nice weather for ducks
the csvjson command would produce
[
{
"a": "a",
"b": "b",
"c": "c"
},
{
"a": "1",
"b": "2",
"c": "3"
},
{
"a": "hello, world",
"b": "2 3",
"c": "4"
},
{
"a": "hello\nthere",
"b": "my good\nman",
"c": "nice weather for ducks"
}
]
The jq tool would then take this, and for each object in the array (restricted to the first 200 objects), extract the values as an array and format it as CSV.
It's probably possible to do this transformation directly with csvpy, another tool in CSVkit, but as my Python skills are non-existent, I will not attempt to come up with a solution that does that.
Note the use of ./ in the globbing patterns, or, alternatively, calling the utility with -- after the options (csvjson -H -- "$file"). Either of these will avoid the issue that you run into when you have filenames that start with a dash. Try for example to create a file called -f with touch, and then to remove it with rm.
– Kusalananda
Jan 14 '20 at 19:42
sed -i '201,$d' "$file" should do in-place editing of the file.
– CSM
Jan 14 '20 at 21:09
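A minimal sketch of applying that comment's suggestion to all of the files, assuming GNU sed (-i without a suffix argument is a GNU extension):

for file in ./*.csv; do
    sed -i '201,$d' "$file"    # delete lines 201 through the end, editing the file in place
done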
Previous answers copy data and overwrite files. This technique should keep the same inodes, do no copying, and run a whole lot faster. For each file:
(a) Find the byte length of its first 200 lines by reading them.
(b) Truncate the file to that length, using truncate from GNU coreutils, or with the truncate found on some BSD systems:
SZ="$( head -n 200 -- "${file}" | wc -c )"
truncate -s "${SZ}" -- "${file}"
The two steps can also be combined: truncate -s "$(head -200 -- "${file}" | wc -c)" -- "${file}"
– jiwopene
Jan 12 '20 at 16:58
Using sed with shell globbing:
sed -ni '1,200p' *.csv
Using globbing/sed/parallel:
printf '%s\n' *.csv | parallel -- sed -ni '1,200p' {}
This will find all .csv files in the current directory and feed them to GNU parallel, which will execute a sed command on them to keep only the first 200 lines. Note this will overwrite the files in place.
Or using head with parallel:
printf '%s\n' *.csv | parallel -- head -n 200 {} ">" {}.out
This will create new files with the .out suffix.
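If you then want those .out files to replace the originals, one small follow-up step might look like this (an addition, not part of the answer above):

for f in ./*.csv; do
    mv -- "$f.out" "$f"    # overwrite each original with its truncated copy
done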
It might be worth mentioning an alternative to parallel (in case the OP doesn't have access to it) - either with xargs or, since the number of files is not huge, plain sed -ni '1,200p' *.csv
– steeldriver
Jan 11 '20 at 19:02
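One way the xargs alternative from that comment might look, assuming GNU sed and an xargs with -0 support:

printf '%s\0' ./*.csv | xargs -0 sed -ni '1,200p'    # GNU sed -i edits each file separately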
I tried printf '%s\n' *.csv | parallel sed -ni '1,200p' {}, but that did not change any of my test-files. Did something get lost in formatting?
– markgraf
Jan 11 '20 at 21:34
Try adding a suffix to -i, like: sed -ni.bak '1,200p' {}
– jesse_b
Jan 11 '20 at 21:35
I have parallel from moreutils on ubuntu18.04. Are you using another?
– markgraf
Jan 11 '20 at 21:39
I needed parallel sed -ni '1,200p' -- *.csv to do what it's supposed to do.
– markgraf
Jan 11 '20 at 21:59
parallel from the package moreutils is not the same as the package parallel, which is GNU parallel. :-)
– markgraf
Jan 11 '20 at 22:02
With ksh93 and a POSIX-compliant head implementation (one that leaves the cursor within stdin just after the last line it outputs), you can do:
for file in ~(N)./*; do
    [ -f "$file" ] || continue # skip non-regular files
    head -n 200 0<>; "$file" > /dev/null
done
The <>; redirection operator is a variant on the <> standard operator that truncates the file in place after the redirected command has returned, provided the command returns with a success exit status. Here, we discard head's output; we're just interested in its ability to leave the cursor just after the 200th line.
Unfortunately, ksh93's builtin head (which is enabled if you issue builtin head, or if /opt/ast/bin is ahead of any directory with a head command in it in $PATH) does not behave POSIXly in this instance. It reads the input in chunks (like most other head implementations), but does not bother seeking back to the end of the 200th line when invoked that way. To force it to do that seeking back, we need to execute an external command, which defeats the purpose of having a builtin head in the first place:
builtin head # enable ksh93's head builtin
{ head -n 200 && /bin/true; } 0<>; file > /dev/null
Another working approach that doesn't involve invoking an external utility would be to do an explicit 0-offset seek after head returns:
builtin head # enable ksh93's head builtin
for file in ~(N)./*; do
    [ -f "$file" ] || continue # skip non-regular files
    { head -n 200 && exec <#((CUR)); } 0<>; "$file" > /dev/null
done
For CSV input specifically, and to retain the first 200 CSV records (as opposed to lines, since a CSV record can contain more than one line via embedded newlines in "..." quoted fields), you could use ksh93's read -S, especially designed to read CSVs, in a loop:
for file in ~(N)./*.csv; do
    [ -f "$file" ] || continue # skip non-regular files
    for ((i=0;i<200;i++)); do
        IFS=, read -rSA discard
    done 0<>; "$file"
done
I'm relatively new so please be gentle. I would appreciate constructive feedback if the solution I am proposing isn't optimal.
I created 4 sample files numbered 1 to 4, e.g. with touch {1..4}, and each file contains 10 sample lines: lines 1 through 10 in the first file, lines 11 through 20 in the next file, and so on.
File 1
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
File 2
Line 11
Line 12
Line 13
Line 14
Line 15
Line 16
Line 17
Line 18
Line 19
Line 20
To extract the first 2 lines as an example (which can be extrapolated to 200), the command head -n 2 {1..4} returns the output:
==> 1 <==
Line 1
Line 2
==> 2 <==
Line 11
Line 12
==> 3 <==
Line 21
Line 22
==> 4 <==
Line 31
Line 32
The output can be redirected to another file with the command head -n 2 {1..4} > ExtractedOutput.
This does extract the first n lines for each of them, but it outputs these lines to a single new file, with new contents added (the headers from head). I believe that most people would interpret the question as wanting to extract n lines from each file into separate files (possibly "overwriting the original files").
– Kusalananda
Jan 13 '20 at 07:43
I interpreted "And I only want to keep the first 200 lines for each of them - I'm okay with the generated files overwriting the original ones" as allowing a single file, since overwriting is optional. I take the point on the additional headers as a means to identify the source of the data set. Maybe the question could be elaborated on as to the format in which the information should be extracted? If you believe the answer is inappropriate, I would be happy to remove it.
– Ryan
Jan 13 '20 at 07:53
Use ed to truncate each file.
for f in *.csv; do
    printf '201,$d\nwq\n' | ed "$f"
done
If you want to save a backup, it might be easier to use ex instead. (You might also consider ex simpler to use, regardless; just drop the w!%.bak| to skip making a backup first.)
for f in *.csv; do
    ex -c 'w!%.bak|201,$d|wq' "$f"
done
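As the note above suggests, dropping the w!%.bak| part gives a no-backup variant; a sketch of how that might look, assuming each file has at least 201 lines as in the question:

for f in *.csv; do
    ex -c '201,$d|wq' "$f"    # truncate in place, no .bak copy
done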
With perl this is a one-liner. – Thorbjørn Ravn Andersen Jan 12 '20 at 21:41
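The comment doesn't spell the one-liner out; one way it might look (my assumption, using perl's in-place editing):

perl -i -ne 'print if $. <= 200; close ARGV if eof' ./*.csv
# -i edits each file in place; $. is the current line number,
# and closing ARGV at end-of-file resets it for the next file.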