I have 7 CSV files containing climatic data. The files are named SMVV50065-2015-01.csv, *2015-02.csv, *2015-03.csv, etc. When I open the CSV files, I see lines like this:

" SMVV, 2015-01-01 00:00,50065,780,7,1000,-2,18, , ,1000"

which refer to temperature, pressure, humidity, etc. measurements. The ", ," are missing data. I used the sed command to change the missing values from blanks to NA. More specifically, I wrote

sed 's/ ,/NA/g' SMVV50065-2015-01.csv > newfile01.csv

I managed to change all the gaps to NA. The problem is that I want to do the same to the rest of my files using the foreach command and save them, after the change, in new files named newfile01.csv, newfile02.csv, etc. Do you have any idea what the exact syntax of the command is?

steve
  • 21,892

2 Answers

I'm assuming that your CSV files have strictly no quotes with commas in them, and that they contain no fields with newlines.

This will change empty fields or fields containing only spaces to NA:

awk 'BEGIN { FS=OFS="," } { for (i=1;i<=NF;++i) if ($i ~ /^ *$/) $i = "NA"; print }'

For each comma-separated field on each line of input, we test whether it matches the regular expression ^ *$. If it does, the field is replaced by the string NA. The FS and OFS variables in the BEGIN block are the input and output field separators, respectively. NF is the number of fields that awk detects in the current line of input and if i is an integer, $i will be the field corresponding to that integer, counting from one.

Your example line,

SMVV, 2015-01-01 00:00,50065,780,7,1000,-2,18, , ,1000

would be turned into

SMVV, 2015-01-01 00:00,50065,780,7,1000,-2,18,NA,NA,1000

Now, to run this on all your files I'm assuming they are all in a directory called dir and that the filenames match the pattern SMVV50065*.csv.

Looping over these files is a matter of

for name in dir/SMVV50065*.csv; do
    test -f "$name" || continue
    # construct new name and call awk here
done

We test with test -f whether the $name is actually a regular file and skip the rest of the iteration if it's not. It would not be if the pattern matches any directory names, or if the pattern doesn't match anything (in which case it will be left unexpanded).
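To see the unexpanded-pattern case in action, here is a minimal sketch. It assumes an sh-like shell with default globbing settings and uses a throwaway empty directory from mktemp -d (my addition, not part of the answer's script):

```shell
# Empty throwaway directory, so the pattern cannot match anything
tmpdir=$(mktemp -d)

count=0
for name in "$tmpdir"/SMVV50065*.csv; do
    # With no matching files, $name holds the literal pattern string,
    # which is not a regular file, so the iteration is skipped
    test -f "$name" || continue
    count=$(( count + 1 ))
done

printf 'regular files processed: %d\n' "$count"
rm -rf "$tmpdir"
```

The loop body still runs once, with the pattern itself as the value of $name, which is why the test -f guard is needed.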

To construct new filenames in the pattern that you suggest, we can keep a counter variable that is incremented in each iteration, from one onwards, and call printf with a formatting string that gives the output filename using this variable:

i=1
for name in dir/SMVV50065*.csv; do
    test -f "$name" || continue

    newname=$( printf 'newfile%02d.csv' "$i" )
    i=$(( i + 1 ))

    # call awk here
done

The %02d in the printf format gives us a 2-digit zero-filled integer from $i.
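A quick check of that format string on its own (the numbers 3 and 12 are just illustrative values):

```shell
# %02d pads to at least two digits with leading zeros
first=$(printf 'newfile%02d.csv' 3)
second=$(printf 'newfile%02d.csv' 12)

printf '%s\n%s\n' "$first" "$second"
```

This should give newfile03.csv and newfile12.csv, so the names keep sorting correctly once the counter passes nine.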

Now it's just a matter of calling awk on the old filename and writing the result to the new one. We will write the result to files in the result directory, just to keep them separated from the original files.

#!/bin/sh

mkdir -p result

i=1
for name in dir/SMVV50065*.csv; do
    test -f "$name" || continue

    newname=$( printf 'newfile%02d.csv' "$i" )
    i=$(( i + 1 ))

    awk 'BEGIN { FS=OFS="," } { for (i=1;i<=NF;++i) if ($i ~ /^ *$/) $i = "NA"; print }' "$name" >result/"$newname"
done

The only other thing I've done here is to make sure that the result directory actually exists with mkdir -p result at the start. I've also added a #!-line at the top to say that this is a sh script.
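If you want to try the loop without touching real data, something along these lines should work. It is only a sketch: the sandbox directory comes from mktemp -d, the first sample row is the line from the question, and the second row is made up for illustration.

```shell
# Throwaway sandbox with the same dir/ and result/ layout as the script
sandbox=$(mktemp -d)
mkdir -p "$sandbox/dir" "$sandbox/result"

# First row is from the question; the second row is invented test data
printf '%s\n' 'SMVV, 2015-01-01 00:00,50065,780,7,1000,-2,18, , ,1000' >"$sandbox/dir/SMVV50065-2015-01.csv"
printf '%s\n' 'SMVV, 2015-02-01 00:00,50065,781,,1001,-3,17,4, ,1000' >"$sandbox/dir/SMVV50065-2015-02.csv"

i=1
for name in "$sandbox"/dir/SMVV50065*.csv; do
    test -f "$name" || continue

    newname=$( printf 'newfile%02d.csv' "$i" )
    i=$(( i + 1 ))

    # Same awk program as above: blank or space-only fields become NA
    awk 'BEGIN { FS=OFS="," } { for (i=1;i<=NF;++i) if ($i ~ /^ *$/) $i = "NA"; print }' "$name" >"$sandbox/result/$newname"
done

cat "$sandbox"/result/newfile*.csv
```

The i inside the awk program is awk's own variable and does not interfere with the shell's i counter.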

And again, with a bit of diagnostics and parametrisation added:

#!/bin/sh

indir=dir
outdir=result

mkdir -p "$outdir"

i=1
for name in "$indir"/SMVV50065*.csv; do
    if [ ! -f "$name" ]; then
        printf 'Not a regular file: "%s"\n' "$name" >&2
        continue
    fi

    newname=$( printf '%s/newfile%02d.csv' "$outdir" "$i" )
    i=$(( i + 1 ))

    printf 'Processing "%s" into "%s"...\n' "$name" "$newname" >&2

    awk 'BEGIN { FS=OFS="," } { for (i=1;i<=NF;++i) if ($i ~ /^ *$/) $i = "NA"; print }' "$name" >"$newname"
done

You could obviously put your sed command in here in place of my awk thingy, if you want.
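Before swapping the commands, it may be worth comparing what each one does to the sample line from the question. A small sketch:

```shell
line='SMVV, 2015-01-01 00:00,50065,780,7,1000,-2,18, , ,1000'

# The sed expression from the question replaces "space, comma" with NA,
# so the comma is consumed together with the space
with_sed=$(printf '%s\n' "$line" | sed 's/ ,/NA/g')

# The awk program only rewrites the field contents and keeps the separators
with_awk=$(printf '%s\n' "$line" | awk 'BEGIN { FS=OFS="," } { for (i=1;i<=NF;++i) if ($i ~ /^ *$/) $i = "NA"; print }')

printf 'sed: %s\nawk: %s\n' "$with_sed" "$with_awk"
```

With sed the line ends in 18,NANA1000, while awk gives 18,NA,NA,1000. If the number of fields matters to whatever reads these files later, the awk variant is probably the safer choice.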


Question in comments:

The above seems difficult, why can't we just do

foreach file (ls SMVV50065-2015-0[1-7].csv)
    sed 's/ ,/NA/g' > newfile0[1-7].csv
end 

Reply:

We first have to start with using the correct syntax. This looks somewhat like the csh shell's syntax, but since no particular shell was mentioned in the question and since sh-like shells are more commonly used, and since I personally have very little experience with csh and tcsh, I'm going to convert it into sh syntax.

The loop in sh shells is for rather than foreach, and we use in and do instead of the parentheses. You also suggest using ls for the loop, but ls is strictly an interactive command whose output is only for looking at (see "Why *not* parse `ls`?"). A filename globbing pattern is enough to generate a list of filenames to loop over.

So let's use your loop with correct syntax:

for file in SMVV50065-2015-0[1-7].csv; do
    sed 's/ ,/NA/g' > newfile0[1-7].csv
done

The next issue with the looping here is that we don't know if $file will be a useful value at all. If the pattern SMVV50065-2015-0[1-7].csv matches a directory name or if it doesn't match anything at all, then we should not use $file, so let's test for that:

for file in SMVV50065-2015-0[1-7].csv; do
    test -f "$file" || continue

    sed 's/ ,/NA/g' > newfile0[1-7].csv
done

Now for the sed invocation: You will need to pass the filename $file to sed so that it has something to work on:

for file in SMVV50065-2015-0[1-7].csv; do
    test -f "$file" || continue

    sed 's/ ,/NA/g' "$file" > newfile0[1-7].csv
done

The next problem is that you can't actually redirect the output from sed to a filename globbing pattern like newfile0[1-7].csv. A globbing pattern will be expanded by the shell to all the names that match that pattern, or it will remain unexpanded if it doesn't match anything.

Assume that you have no file in the current directory that matches the newfile0[1-7].csv pattern. The loop will then create a file called newfile0[1-7].csv, and this file will be overwritten in each iteration of the loop.
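This can be checked with a small sketch, assuming an sh-like shell with default globbing settings and a throwaway directory from mktemp -d (both my assumptions). What happens when files matching the pattern already exist varies between shells, so that case is not covered here.

```shell
# Throwaway directory, so the oddly named file is easy to clean up
tmpdir=$(mktemp -d)

(
    cd "$tmpdir" || exit 1
    for n in 1 2 3; do
        # Nothing matches the pattern, so it is used literally as the filename
        printf 'iteration %d\n' "$n" > newfile0[1-7].csv
    done
)

ls "$tmpdir"
cat "$tmpdir/newfile0[1-7].csv"
```

Only one file is left afterwards, literally named newfile0[1-7].csv, holding the output of the last iteration.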

This is why I introduced my variable i, so that I could construct a new filename in each iteration:

i=1
for file in SMVV50065-2015-0[1-7].csv; do
    test -f "$file" || continue

    sed 's/ ,/NA/g' "$file" >"newfile0$i.csv"
    i=$(( i + 1 ))
done

I was assuming you may have far more than just seven files to process, and that's why I went through that extra bit of trouble to generate the output filename using printf, to ensure that we got a filename with a zero-filled number in it.

The above loop may work for you as it is, but let me reformulate it a bit (assign the new filename to a variable and use that with sed):

i=1
for file in SMVV50065-2015-0[1-7].csv; do
    test -f "$file" || continue

    newname="newfile0$i.csv"
    i=$(( i + 1 ))

    sed 's/ ,/NA/g' "$file" >"$newname"
done

You see? We're more or less back at my solution (without the extra bells and whistles of my last variation). The only radical difference is that you here assume that all the files will be available in the current directory, and that the output files should be created alongside the original files.

Kusalananda
  • 333,661
  • First of all, thank you very much for such a detailed analysis! Secondly, since I'm a complete novice at scripting, I thought it would be much easier than that :( This seems very difficult. I'm a little confused. Forgive me for trying to make it easier, but I just want to use the foreach and sed commands to make the change. I was thinking about this:

    foreach file (ls SMVV50065-2015-0[1-7].csv) sed 's/ ,/NA/g' > newfile0[1-7].csv end

    Is it completely wrong?

    – H. Johnson Feb 24 '18 at 21:35
  • It depends on what the foreach program does. I believe it is unknown on Unix and Linux shells. Moreover, you'll be writing your output to one file with a strange name, "newfile0[1-7].csv". One more thing: your input and your output, although they have the extension "csv", will not contain valid CSV as long as you ignore the double quotes that start and end the line. – Gerard H. Pille Feb 24 '18 at 23:45
  • @H.Johnson I have updated my answer with a reply to your comment. – Kusalananda Feb 25 '18 at 07:10
  • foreach is the analogue in csh of for in traditional shells. – Mark Plotnick Feb 25 '18 at 08:46
  • @MarkPlotnick Ah yes, it is. Unfortunately my csh-foo is lacking. – Kusalananda Feb 25 '18 at 08:48
  • Your answer was excellent! You have helped me not only to achieve my goal, but also to fully understand the process followed. – H. Johnson Feb 28 '18 at 21:43

Below is what I tried.

filenames.txt contains all the filenames (one per line, with no spaces in the names).

i=1; for j in $(cat filenames.txt); do sed 's/ ,/NA/g' "$j" >"newfiles_$i"; i=$(( i + 1 )); done