
I have a script that takes two files as input.
Before the processing starts, some preparation is done on the files.
My idea was to leave the original files untouched: do everything on copies, print the required output, and delete the copies.
This approach, however, left the script with many variables and made it error-prone.
Example:

#!/bin/bash                                                                                                      
[[ -z $1 ]] && echo 'We need input file a' && exit 1;  
[[ -z $2 ]] && echo 'We need input file b' && exit 1;  

A_CSV=$1;  
B_CSV=$2;  

A_FILE="$A_CSV.tmp";  
B_FILE="$B_CSV.tmp";  

[[ -f $A_FILE ]] && rm $A_FILE;  
[[ -f $B_FILE ]] && rm $B_FILE;  

tr -d "\r" < $A_CSV >  $A_FILE;  
tr -d "\r" < $B_CSV > $B_FILE;  

awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' $A_FILE > "$A_FILE.bck";
awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' $B_FILE > "$B_FILE.bck";

rm $A_FILE && mv "$A_FILE.bck" $A_FILE;   
rm $B_FILE && mv "$B_FILE.bck" $B_FILE;   
# extra logic following the same pattern  

You can see how a copy is created, updated, and renamed back over and over.

Is there a way to improve this to make the script less error prone?

Jim
    Simply related: the section "Warning regarding >" at http://unix.stackexchange.com/a/186126/117549 – Jeff Schaller Feb 03 '17 at 21:34
  • @JeffSchaller: That link is just awesome! Thanks a lot. Also I realized that the in-place editing is sub-optimal in terms of memory requirements – Jim Feb 05 '17 at 12:17

2 Answers

4

This is achieved with a pipe (|); there are many good tutorials about pipes online.

#!/bin/bash
[[ -z $1 ]] && echo 'We need input file a' && exit 1;
[[ -z $2 ]] && echo 'We need input file b' && exit 1;  

A_CSV=$1;  
B_CSV=$2;  

A_FILE="$A_CSV.tmp";  
B_FILE="$B_CSV.tmp";  

[[ -f $A_FILE ]] && rm $A_FILE;
[[ -f $B_FILE ]] && rm $B_FILE;

tr -d "\r" < $A_CSV | awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' > $A_FILE
tr -d "\r" < $B_CSV | awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' > $B_FILE

I would personally create a function to deal with a single operation since you do the same thing on both files. rm -f $A_FILE $B_FILE would also look better in my opinion.
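For instance, here is a minimal sketch of that idea (the function name clean_csv is only illustrative); it would replace everything after the argument checks:

# Strip carriage returns and a leading UTF-8 BOM in one pass.
clean_csv() {
    tr -d '\r' < "$1" | awk '{ if (NR == 1) sub(/^\xef\xbb\xbf/, ""); print }' > "$1.tmp"
}

rm -f "$1.tmp" "$2.tmp"
clean_csv "$1"
clean_csv "$2"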

3

Leaving the original files intact and doing the processing on copies is a very good idea. You should take it further and not reuse the intermediate files either. If you reuse the intermediate files, and the process is interrupted, you'll have no way to know at which point it got interrupted.

You're applying the same transformation to two files. Don't write the code twice! Write the code once, using variables as necessary, and call that piece of code once for each file. In a shell script, the tool for this is to write a function (or, if you need that piece of code to be called from more than one script, make it a separate script).

All the text processing tools you're using can read from standard input and write to standard output. You can combine them by putting a pipe between one tool's output and the next tool's input. That way, you don't need so many intermediate files — in fact you don't need any intermediate file in this case. Pipes are a fundamental design feature of Unix.

A further shell programming tip: always put double quotes around variable expansions, i.e. write "$foo" rather than $foo.
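A quick illustration of what goes wrong without the quotes (the filename here is just an example):

f='my data.csv'
printf '<%s>\n' $f      # unquoted: word splitting produces two words, <my> and <data.csv>
printf '<%s>\n' "$f"    # quoted: a single word, <my data.csv>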

#!/bin/bash                                                                                                      

preprocess_csv () {
  <"$1" \
  tr -d '\r' |
  awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' >"${1%.csv}.clean"
}

preprocess_csv "$1"
preprocess_csv "$2"

do_stuff_with_preprocessed_files "${1%.csv}.clean" "${2%.csv}.clean" >global_output

I used the parameter expansion construct ${1%.csv} to transform e.g. foo.csv into foo, so that the output file from this transformation is foo.clean.
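For example (foo.csv is just an illustrative name):

f=foo.csv
echo "${f%.csv}"          # foo       (the .csv suffix is stripped)
echo "${f%.csv}.clean"    # foo.clean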

That script is simpler than what you had, but it can still be improved. There are better tools than shell scripts to describe a chain of file processing commands: build automation tools such as the classic make. See Execute a list of commands with checkpointing? for an introduction to make with a similar use case. Here's how your transformation can be expressed with make. Call this file Makefile. Note that where the lines below are indented with 8 spaces, you need to replace the 8 spaces with a tab character; that's a quirk of make.

default: global_output

%.clean: %.csv
        <'$<' tr -d '\r' | awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' >'$@'

global_output: input1.clean input2.clean
        do_stuff_with_preprocessed_files input1.clean input2.clean >$@

$< in a command stands for the dependency (the file on the right of the target: dependency above) and $@ stands for the target. With the makefile above, if you run the command make global_output (or just make, thanks to the default: line at the beginning), it will run the transformations to produce the .clean files (the .csv files must already exist) and then it will run do_stuff_with_preprocessed_files to produce global_output.
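As a rough usage sketch (assuming input1.csv and input2.csv exist and do_stuff_with_preprocessed_files is an executable on your PATH):

make                # builds input1.clean and input2.clean, then global_output
make                # nothing changed, so make reports everything is up to date
touch input2.csv    # pretend one input was modified
make                # only input2.clean and global_output are rebuilt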

This makefile is fragile because it leaves partially processed files if interrupted midway. To fix this, use temporary files in each rule, as explained in Execute a list of commands with checkpointing?.

default: global_output

%.clean: %.csv
        <'$<' tr -d '\r' | awk '{ if(NR == 1) sub(/^\xef\xbb\xbf/,""); print }' >'$@.tmp'
        mv '$@.tmp' '$@'

global_output: input1.clean input2.clean
        do_stuff_with_preprocessed_files input1.clean input2.clean >'$@.tmp'
        mv '$@.tmp' '$@'
  • 1) Can I pass arguments to a shell function? In your example you are using the script's arguments within the function. 2) I don't understand the syntax <"$1" tr -d '\r'. Shouldn't it be tr -d '\r' < "$1"? – Jim Feb 05 '17 at 12:33
  • Is there a benefit to leaving intermediate files instead of logs to figure out where the process was interrupted? – Jim Feb 05 '17 at 12:38
  • @Jim 1) The function takes arguments. I'm not using the script's arguments as such; in fact, the function body cannot access the script's arguments directly. In the function, $1 is the first argument passed to the function. 2) Redirections can go at the beginning, at the end or even in the middle of the command. I prefer to put an input redirection at the beginning of a pipeline because then the order from left to right is stage 0, transformation 0→1, transformation 1→2, stage 2 rather than transformation 0→1, stage 0, transformation 1→2, stage 2. – Gilles 'SO- stop being evil' Feb 05 '17 at 18:57
  • @Jim 3) Leaving intermediate files is easier (no logs to arrange) and has the benefit that you can pick up where you left off. – Gilles 'SO- stop being evil' Feb 05 '17 at 18:57