
I have a large (~300) set of .csv files, each of which is ~200k lines long, with a regular filename pattern:

outfile_n000.csv
outfile_n001.csv
outfile_n002.csv
.
.
.
outfile_nXXX.csv

I need to extract a range of lines (100013-200013) from each file and save that extracted region to a new .csv file, adding a ptally_ prefix to the filename to differentiate it from the original, while preserving the original file.

I know that I can use

sed -n '100013,200013p' outfile_nXXX.csv > ptally_outfile_nXXX.csv

to do this to a single file, but I need a way to automate it for large batches of files. I can get close by using sed's -i option:

sed -iptally_* -n '100013,200013p' outfile_nXXX.csv > ptally_outfile_nXXX.csv

but this writes the extracted lines to outfile_nXXX.csv and leaves the original content in a backup named ptally_outfile_nXXX.csv, since that is exactly what -i is designed to do.
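(For reference, that matches GNU sed's documented behaviour: a * in the -i backup suffix is replaced by the file name, so the prefix ends up on the backup copy rather than on the extracted output. Illustration with a single file, not a solution:)

sed -i'ptally_*' -n '100013,200013p' outfile_n000.csv
# afterwards:
#   outfile_n000.csv         holds lines 100013-200013 (edited in place)
#   ptally_outfile_n000.csv  is the backup holding the original, full-length file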

Likewise, brace expansion in bash won't do the trick, as brace expansion and wildcards don't mix:

sed -n '100013,200013p' *.csv > {,ptally_}*.csv
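(A quick check shows why: brace expansion runs before pathname expansion, so the braces just produce two unrelated globs, and bash rejects a redirection that expands to more than one word as an "ambiguous redirect".)

echo {,ptally_}*.csv
# the braces expand first into two separate globs:  *.csv  ptally_*.csv
# each is then matched independently -- no pairing of original and ptally_ names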

Any elegant ways to combine the extraction and renaming into a simpler process? Currently, I'm using a bash script to perform the swap between the outfile_nXXX.csv and ptally_outfile_nXXX.csv filenames, but I would prefer a more straightforward workflow. Thanks!


2 Answers


Use a for loop.

for f in outfile_n???.csv; do
  sed -n '100013,200013p' "$f" > ptally_"$f"
done

Alternatively, depending on your exact requirements, it may be more appropriate to use csplit. Some of the GNU extensions extend its power considerably.
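For example, a minimal sketch assuming GNU csplit (the piece_ prefix and the cleanup of the unwanted pieces are illustrative choices, not anything csplit requires):

for f in outfile_n???.csv; do
  # split at lines 100013 and 200014: piece_00 = 1..100012, piece_01 = 100013..200013, piece_02 = the rest
  csplit -s -f piece_ "$f" 100013 200014
  mv piece_01 "ptally_$f"     # keep the middle piece under the new name
  rm -f piece_00 piece_02     # discard the leading and trailing pieces
done

(If a file could end before line 200014, add csplit's -k option so the pieces already written are kept rather than removed on the resulting error.)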

– Wildcard

Not sed, but quite an elegant way:

awk 'NR >= 100013 && NR <= 200013 { print > ("ptally_" FILENAME) }' outfile_nXXX.csv

To bulk-extract into appropriately named new files, use FNR (the per-file line number, which resets for each input file) instead of NR:

awk 'FNR >= 100013 && FNR <= 200013 { print > ("ptally_" FILENAME) }' outfile_n*
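With ~300 input files, note that some awk implementations can run out of open file descriptors, because each print > file keeps its target open (GNU awk usually copes by multiplexing descriptors). A minimal variant of the same idea that closes the previous output whenever a new input file starts:

awk 'FNR == 1 { if (out) close(out); out = "ptally_" FILENAME }
     FNR >= 100013 && FNR <= 200013 { print > out }' outfile_n*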

Also, you can store the filename in a variable before passing it to sed:

filename="outfile_nXXX.csv"

sed -n '100013,200013p' "$filename" > "ptally_$filename"
– MiniMax

  • XXX is a placeholder, not part of the actual name of the file. Your last example with storing it in a variable doesn't make sense at all given that there are around 300 files to be handled. – Wildcard Oct 06 '17 at 04:24
  • @Wildcard It was just an example, not a working solution; an idea. Sure, it should be a for loop iterating through all 300 files, and outfile_nXXX.csv represents the general file name. Anyway, the awk solution is more elegant, so I didn't develop the bash way any further :) – MiniMax Oct 06 '17 at 10:06