
I have a number of .csv files with N columns and different numbers of rows (lines). I would like to append enough empty lines of the form ;...; (just the N−1 separating semicolons) to make them all the same length. I can find the length of the longest file manually, but it would be good to have that determined automatically as well.

For example:

I have,

file1.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
171; pep; 73; 22:26:10; 3; 72

file2.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
121; fng; 96; 09:42:10; 3; 52
141; gep; 53; 21:22:10; 3; 62
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892

file3.csv

121; fng; 96; 09:42:10; 3; 52
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892
141; gep; 53; 21:22:10; 3; 62

I need,

file1.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
171; pep; 73; 22:26:10; 3; 72
;;;;;
;;;;;
;;;;;

file2.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
121; fng; 96; 09:42:10; 3; 52
141; gep; 53; 21:22:10; 3; 62
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892

file3.csv

121; fng; 96; 09:42:10; 3; 52
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892
141; gep; 53; 21:22:10; 3; 62
;;;;;
;;;;;
Jeff Schaller
myradio
  • A simple (but probably not optimal) way to do it would be to use wc to count the lines of each file to find the max. You can then echo ";;;;" >> file in each file until its line count reaches the max. – Bear'sBeard Dec 04 '18 at 10:45
  • Why do you want the files to have the same number of lines? Maybe there is a good method where you can use the files as they are (with their different numbers of lines). – sudodus Dec 04 '18 at 11:12
  • @Bear'sBeard Yep, something like that did it; I was looking for a more compact way. – myradio Dec 04 '18 at 11:50
  • @sudodus Well, there are people before and after me in the pipeline; things must match certain formats... – myradio Dec 04 '18 at 11:51
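The approach from the first comment (find the max with wc, then append empty records) can be sketched as a small shell function. This is a minimal sketch assuming six-column files, so an empty record is five semicolons; the name pad_csvs is made up:

```shell
# pad_csvs FILE... : append empty ";;;;;" records to each file until
# all of them have as many lines as the longest one.
# Assumes six columns, so an empty record is five semicolons.
pad_csvs() {
    max=0
    for f in "$@"; do
        n=$(wc -l < "$f")
        if [ "$n" -gt "$max" ]; then max=$n; fi
    done
    for f in "$@"; do
        n=$(wc -l < "$f")
        while [ "$n" -lt "$max" ]; do
            echo ";;;;;" >> "$f"
            n=$((n + 1))
        done
    done
}
```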

3 Answers


Thanks @Sparhawk for the suggestions in the comments; I've updated based on those:

#!/bin/bash

emptyLine=";;;;;;;"
rr=($(wc -l files*pattern.txt | awk '{print $1}' | sed '$ d'))
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
for name in files*pattern.txt; do
    lineNumber=$(wc -l < "$name")
    missing=$((max - lineNumber))
    for ((i = 0; i < missing; i++)); do
        echo "$emptyLine" >> "$name"
    done
done

Well, not elegant nor efficient. It takes a couple of seconds, which sounds like an eternity given the small size of the data. Nevertheless, it works. This was my original version:

#!/bin/bash

emptyLine=";;;;;;;"
rr=($(wc -l files*pattern.txt | awk '{print $1}' | sed '$ d'))
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
for name in $(ls files*pattern.txt); do
    lineNumber=$(cat $name | wc -l)
    let missing=max-lineNumber
    for ((i = 0; i < missing; i++)); do
        echo "$emptyLine" >> "$name"
    done
done

I just put this script in the directory containing the files; all it needs is a pattern I can use to list them, here files*pattern.txt.
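One pitfall here: sort -nr compares lines, so the counts must reach it one per line; a per-line printf is safer than echo "${rr[*]}", which joins the array into a single line. A quick check of the max computation, with sample counts standing in for the wc output:

```shell
# Feed sample line counts (one per line) through the max pipeline.
# echo "${rr[*]}" would join them into a single line and break sort.
rr=(3 6 4)
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
echo "$max"   # prints 6
```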

myradio

An improvement on @myradio's answer: the part inside the loop is written in awk, which should be much faster.

max=$(wc -l file*.csv | sed '$ d' | sort -n | tail -n1 | awk '{print $1}' )
for f in file*.csv; do
    awk -F';' -v max="$max" \
      'END{
         s=sprintf("%*s",NF-1,"");   # NF-1 separators for NF fields
         gsub(/ /,FS,s);
         for(i=NR;i<max;i++)
           print s;
       }' "$f" >> "$f"
done

With -F you set the correct field separator for your files (here -F';').

The s=sprintf(); gsub(); part dynamically builds a line with the right number of field separators (a trick adapted from another answer).
You could simply replace it with print ";;;;;" or other static content if you like.
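To see the trick in isolation, one can feed awk a single sample record (this one is taken from the question):

```shell
# %*s takes the field width from an argument: NF-1 spaces for NF fields.
# gsub then turns each space into the separator, yielding an empty record.
echo '128; pep; 93; 22:22:10; 3; 11' |
  awk -F';' '{ s = sprintf("%*s", NF - 1, ""); gsub(/ /, FS, s); print s }'
# prints ";;;;;"
```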

pLumo
  • I like this solution. It's certainly harder to read, but it's good that it has a dynamic FS. Nevertheless, two things: 1. About the efficiency, I don't know what your time results mean, because that depends on the number and size of the files. 2. I actually wanted to try this to compare timings with my version, but I hit a problem: awk complains about gensub being undefined. Is gensub maybe in gawk instead? – myradio Dec 05 '18 at 09:22
  • yeah that seems to be GNU Awk. I replaced it with the gsub solution from the linked answer. – pLumo Dec 05 '18 at 09:52
  • About the time results, I think I misunderstood the statement "it takes a couple of seconds" from your answer to be the time needed to process the example files from your question ... I removed that. – pLumo Dec 05 '18 at 09:54

In order to count the lines in each file only once:

wc -l *csv | sort -nr | sed 1d | {
    read max file
    pad=$(sed q "$file" | tr -cd ";")  # extract separators from first record
    while read lines file; do
        while [ $((lines+=1)) -le "$max" ]; do
            echo "$pad" >> "$file"
        done
    done
}

Note that newlines in any of the filenames will cause problems for both sort and the while read loop, but filenames containing ordinary spaces are handled correctly.
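To check the pipeline end to end, here is a throwaway run on two small files (the names one.csv and two.csv are made up for the demonstration):

```shell
# Create two sample files of different lengths in a scratch directory,
# then run the single-pass padding pipeline and inspect the result.
dir=$(mktemp -d); cd "$dir"
printf 'a;b;c\nd;e;f\n' > one.csv
printf 'g;h;i\n'        > two.csv
wc -l *.csv | sort -nr | sed 1d | {
    read max file
    pad=$(sed q "$file" | tr -cd ";")   # separators from the first record
    while read lines file; do
        while [ $((lines+=1)) -le "$max" ]; do
            echo "$pad" >> "$file"
        done
    done
}
wc -l *.csv   # both data files now report 2 lines
```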

JigglyNaga