
I have a number of .csv files with N columns and different numbers of rows (lines). I would like to append enough empty lines of the form ;...; (just the N−1 separating semicolons) to make them all the same length. I can find the length of the longest file manually, but it would be good to have that determined automatically as well.

For example:

I have,

file1.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
171; pep; 73; 22:26:10; 3; 72

file2.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
121; fng; 96; 09:42:10; 3; 52
141; gep; 53; 21:22:10; 3; 62
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892

file3.csv

121; fng; 96; 09:42:10; 3; 52
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892
141; gep; 53; 21:22:10; 3; 62

I need,

file1.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
171; pep; 73; 22:26:10; 3; 72
;;;;;
;;;;;
;;;;;

file2.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
121; fng; 96; 09:42:10; 3; 52
141; gep; 53; 21:22:10; 3; 62
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892

file3.csv

121; fng; 96; 09:42:10; 3; 52
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892
141; gep; 53; 21:22:10; 3; 62
;;;;;
;;;;;
Jeff Schaller
myradio
  • A simple (but probably not optimal) way to do it would be to use wc to count the lines of each file to find the max. You can then echo ";;;;" >> file in each file until its line count reaches the max. – Bear'sBeard Dec 04 '18 at 10:45
  • Why do you want the files to have the same number of lines? Maybe there is a good method where you can use the files as they are (with their different numbers of lines). – sudodus Dec 04 '18 at 11:12
  • @Bear'sBeard Yep, something like that did it; I was looking for a more compact way. – myradio Dec 04 '18 at 11:50
  • @sudodus Well, there are people before and after me in the pipeline; things must match certain formats... – myradio Dec 04 '18 at 11:51
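The approach from the first comment (find the max with wc, then append empty records) can be sketched as a small shell function. This is a minimal sketch assuming six-column files, so an empty record is five semicolons; the name pad_csvs is made up:

```shell
# pad_csvs FILE... : append empty ";;;;;" records to each file until
# all of them have as many lines as the longest one.
# Assumes six columns, so an empty record is five semicolons.
pad_csvs() {
    max=0
    for f in "$@"; do
        n=$(wc -l < "$f")
        if [ "$n" -gt "$max" ]; then max=$n; fi
    done
    for f in "$@"; do
        n=$(wc -l < "$f")
        while [ "$n" -lt "$max" ]; do
            echo ";;;;;" >> "$f"
            n=$((n + 1))
        done
    done
}
```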

3 Answers


Thanks @Sparhawk for the suggestions in the comments; I've updated based on those:

#!/bin/bash

emptyLine=";;;;;;;"
rr=($(wc -l files*pattern.txt | awk '{print $1}' | sed '$ d'))
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
for name in files*pattern.txt; do
    lineNumber=$(wc -l < "$name")
    missing=$((max - lineNumber))
    for ((i = 0; i < missing; i++)); do
        echo "$emptyLine" >> "$name"
    done
done

Well, not elegant nor efficient. It takes a couple of seconds, which sounds like an eternity given the small size of the data. Nevertheless, it works. This was my original version:

#!/bin/bash

emptyLine=";;;;;;;"
rr=($(wc -l files*pattern.txt | awk '{print $1}' | sed '$ d'))
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
for name in $(ls files*pattern.txt); do
    lineNumber=$(cat $name | wc -l)
    let missing=max-lineNumber
    for ((i = 0; i < missing; i++)); do
        echo "$emptyLine" >> "$name"
    done
done

I just put this script in the directory containing the files; all it needs is a pattern I can use to list them, here files*pattern.txt.
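One pitfall here: sort -nr compares lines, so the counts must reach it one per line; a per-line printf is safer than echo "${rr[*]}", which joins the array into a single line. A quick check of the max computation, with sample counts standing in for the wc output:

```shell
# Feed sample line counts (one per line) through the max pipeline.
# echo "${rr[*]}" would join them into a single line and break sort.
rr=(3 6 4)
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
echo "$max"   # prints 6
```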

myradio

An improvement on @myradio's answer: the part inside the loop is written in awk, which should be much faster.

max=$(wc -l file*.csv | sed '$ d' | sort -n | tail -n1 | awk '{print $1}' )
for f in file*.csv; do
    awk -F';' -v max="$max" \
      'END{
         s=sprintf("%*s",NF-1,"");   # NF-1 separators for NF fields
         gsub(/ /,FS,s);
         for(i=NR;i<max;i++)
           print s;
       }' "$f" >> "$f"
done

With -F you set the correct field separator for your files (here -F';').

The s=sprintf(); gsub(); part dynamically builds a line with the right number of field separators (a trick adapted from another answer).
You could simply replace it with print ";;;;;" or other static content if you like.
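To see the trick in isolation, one can feed awk a single sample record (this one is taken from the question):

```shell
# %*s takes the field width from an argument: NF-1 spaces for NF fields.
# gsub then turns each space into the separator, yielding an empty record.
echo '128; pep; 93; 22:22:10; 3; 11' |
  awk -F';' '{ s = sprintf("%*s", NF - 1, ""); gsub(/ /, FS, s); print s }'
# prints ";;;;;"
```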

pLumo
  • I like this solution. It's certainly harder to read, but it's good that it has a dynamic FS. Nevertheless, two things: 1. About the efficiency, I don't know what your time results mean, because that depends on the number and size of the files. 2. I actually wanted to try this to compare timings with my version, but I hit a problem: awk complains about gensub being undefined. Is gensub maybe in gawk instead? – myradio Dec 05 '18 at 09:22
  • yeah that seems to be GNU Awk. I replaced it with the gsub solution from the linked answer. – pLumo Dec 05 '18 at 09:52
  • About the time results, I think I misunderstood the statement "it takes a couple of seconds" from your answer to be the time needed to process the example files from your question ... I removed that. – pLumo Dec 05 '18 at 09:54

In order to count the lines in each file only once:

wc -l *csv | sort -nr | sed 1d | {
    read max file
    pad=$(sed q "$file" | tr -cd ";")  # extract separators from first record
    while read lines file; do
        while [ $((lines+=1)) -le "$max" ]; do
            echo "$pad" >> "$file"
        done
    done
}

Note that newlines in any of the filenames will cause problems for both sort and the while read loop, but filenames containing ordinary spaces are handled correctly.
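To check the pipeline end to end, here is a throwaway run on two small files (the names one.csv and two.csv are made up for the demonstration):

```shell
# Create two sample files of different lengths in a scratch directory,
# then run the single-pass padding pipeline and inspect the result.
dir=$(mktemp -d); cd "$dir"
printf 'a;b;c\nd;e;f\n' > one.csv
printf 'g;h;i\n'        > two.csv
wc -l *.csv | sort -nr | sed 1d | {
    read max file
    pad=$(sed q "$file" | tr -cd ";")   # separators from the first record
    while read lines file; do
        while [ $((lines+=1)) -le "$max" ]; do
            echo "$pad" >> "$file"
        done
    done
}
wc -l *.csv   # both data files now report 2 lines
```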

JigglyNaga