Leaving the original files intact and doing the processing on copies is a very good idea. You should take it further and not reuse the intermediate files either. If you reuse the intermediate files, and the process is interrupted, you'll have no way to know at which point it got interrupted.
You're applying the same transformation to two files. Don't write the code twice! Write the code once, using variables as necessary, and call that piece of code once for each file. In a shell script, the tool for this is to write a function (or, if you need that piece of code to be called from more than one script, make it a separate script).
All the text processing tools you're using can read from standard input and write to standard output. You can combine them by putting a pipe between one tool's output and the next tool's input. That way, you don't need so many intermediate files — in fact you don't need any intermediate file in this case. Pipes are a fundamental design feature of Unix.
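As a minimal illustration of the idea (with printf standing in for an input file), the output of each command feeds straight into the next, with no file in between:

```shell
# Strip CR characters and count the lines, all in one pipeline.
# printf supplies two CRLF-terminated lines as sample input.
printf 'a\r\nb\r\n' | tr -d '\r' | wc -l
```

This prints 2: the data flows from printf through tr into wc without ever touching the disk.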
A further shell programming tip: always put double quotes around variable expansions, i.e. write "$foo", not a bare $foo.
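To see why, consider a hypothetical filename containing a space; without the double quotes, the shell would split it into two separate words:

```shell
# Hypothetical file whose name contains a space.
f='my data.csv'
printf 'hello\r\n' >"$f"
# "$f" passes the name as one argument; a bare $f would be split into
# the two words "my" and "data.csv", and the redirection would fail.
tr -d '\r' <"$f" >"${f%.csv}.clean"
```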
#!/bin/bash
preprocess_csv () {
  <"$1" \
  tr -d '\r' |
  awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' >"${1%.csv}.clean"
}
preprocess_csv "$1"
preprocess_csv "$2"
do_stuff_with_preprocessed_files "${1%.csv}.clean" "${2%.csv}.clean" >global_output
I used the parameter expansion construct ${1%.csv} to transform e.g. foo.csv into foo, so that the output file from this transformation is foo.clean.
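A quick illustration of this suffix-removal expansion (foo.csv is just a placeholder name):

```shell
file=foo.csv
# ${file%.csv} removes the shortest match of the pattern ".csv"
# from the end of the value of $file.
echo "${file%.csv}"         # prints: foo
echo "${file%.csv}.clean"   # prints: foo.clean
```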
That script is simpler than what you had, but it can still be improved. There are better tools than shell scripts to describe a chain of file processing commands: build automation tools such as the classic make. See Execute a list of commands with checkpointing? for an introduction to make with a similar use case. Here's how the transformation you have can be expressed with make. Call this file Makefile. Note that where the lines below are indented with 8 spaces, you need to replace the 8 spaces by a tab character; it's a quirk of make.
default: global_output

%.clean: %.csv
        <'$<' tr -d '\r' | awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' >'$@'

global_output: input1.clean input2.clean
        do_stuff_with_preprocessed_files input1.clean input2.clean >'$@'
$< in a command stands for the dependency (the file on the right of the colon in a target: dependency line above) and $@ stands for the target. With the makefile above, if you run the command make global_output (or just make, thanks to the default: line at the beginning), it will run the transformations to produce the .clean files (the .csv files must already exist) and then it will run do_stuff_with_preprocessed_files to produce global_output.
This makefile is fragile because it leaves partially processed files if interrupted midway. To fix this, use temporary files in each rule, as explained in Execute a list of commands with checkpointing?.
default: global_output

%.clean: %.csv
        <'$<' tr -d '\r' | awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' >'$@.tmp'
        mv '$@.tmp' '$@'

global_output: input1.clean input2.clean
        do_stuff_with_preprocessed_files input1.clean input2.clean >'$@.tmp'
        mv '$@.tmp' '$@'
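The same temporary-file-then-rename trick works in a plain shell script too; a sketch, with printf standing in for the real processing command:

```shell
out=global_output
# Write to a temporary file first...
printf 'processed\n' >"$out.tmp" &&
# ...and rename only if the command succeeded, so an interrupted run
# never leaves a half-written global_output behind (on the same
# filesystem, mv is an atomic rename).
mv -- "$out.tmp" "$out"
```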