5

I want to read a multi-line file in a bash script, using the file path from a variable, then merge the lines using a multi-char delimiter and save the result to another variable.

I want to skip blank lines and trailing newlines, and I do not want a trailing delimiter.

Additionally, I want to support \r\n and, if it comes at no further "cost", also \r as a line break (and of course \n).

The script should run on RHEL with GNU's bash 4.2.46, sed 4.2.2, awk 4.0.2, grep 2.20, coreutils 8.22 (tr, cat, paste, sort, cut, head, tail, tee, ...), xargs 4.5.11 and libc 2.17 and with perl 5.16.3, python 2.7.5 and openjdk 11.0.8.

It should run about twice per day on files with ca. 10 lines on a decent machine/VM. If readability, maintainability and brevity don't suffer too much, I'm very open to more performant solutions though.

The files to be read from can be created and modified either on the same machine or other Win7 or Win10 systems.

My approach so far is

joined_string_var=$(sed 's/\r/\n/g' $filepathvar | grep . | sed ':a; N; $!ba; s/\n/; /g')
  • So first I replace \r with \n to cover all newline formats and make the output readable for grep.

  • Then I remove blank lines with grep .

  • And finally I use sed for the actual line merging.

I used sed instead of tr in the first step to avoid using cat, but I'm not quite sure if I prefer it like that:

joined_string_var=$(cat $filepathvar | tr '\r' '\n' | grep . | sed ':a; N; $!ba; s/\n/; /g')

UPDATE: I somehow completely missed simple redirection:

joined_string_var=$(tr '\r' '\n' <$filepathvar | grep . | sed ':a; N; $!ba; s/\n/; /g')
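A quick throwaway test of any of the variants above against all three line-ending styles could look like this (the file name and contents are just an example, not part of my actual setup):

printf 'aa\r\nbb\r\n\r\ncc\rdd\n' > /tmp/mixed_endings.txt   # CRLF lines, a blank line, a lone CR, an LF
filepathvar=/tmp/mixed_endings.txt
joined_string_var=$(tr '\r' '\n' <"$filepathvar" | grep . | sed ':a; N; $!ba; s/\n/; /g')
printf '[%s]\n' "$joined_string_var"   # expect [aa; bb; cc; dd] with no trailing delimiter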

Any thoughts how this might be done more elegantly (less commands, better performance, not much worse brevity and readability)?

Andreas
  • What about GNU sed: sed -zE 's/\r\n|\r|\n/; /g;s/; $//' "$filepathvar" ? (empty lines not removed (could be added), but just one command). –  Oct 14 '20 at 07:52
  • You are creating empty newlines with tr '\r' '\n'. You could simply delete \r with tr -d '\r' – Panki Oct 14 '20 at 07:53
  • @Panki then I wouldn't support \r style newlines (e.g. classic MacOS) anymore – Andreas Oct 14 '20 at 07:56
  • @Isaac then I could also use [\n\r]+, right? – Andreas Oct 14 '20 at 07:58
  • @Isaac joined_string_var=$(sed -zE 's/[\r\n]+/; /g;s/; $//' $filepathvar) just worked perfectly, fulfilling all the requirements. AFAICS. -z did the trick here, not requiring the label workaround, right? THANKS! – Andreas Oct 14 '20 at 08:05
  • 1
    @Andreas Yes, that works for your requirements. You are welcome. :-) –  Oct 14 '20 at 08:07
  • So with the delimiter from a variable it's joined_string_var=$(sed -zE 's/[\r\n]+/'"$delim"'/g;s/'"$delim"'$//' $filepathvar) – Andreas Oct 14 '20 at 08:09
  • 1
    Well, using a $delim inside a sed command is fraught with caveats. For example, it cannot contain any /. –  Oct 14 '20 at 08:15
  • It fits my specific needs though :) – Andreas Oct 14 '20 at 09:12

5 Answers

6

The elegance may come from the correct regex. Instead of changing every \r to a \n (s/\r/\n/g) you can convert every line terminator \r\n, \r, \n to the delimiter you want (in GNU sed, as few sed implementations will understand \r, and not all will understand -E):

sed -E 's/\r\n|\r|\n/; /g'

Or, if you also want to remove empty lines, convert any run of such line terminators:

sed -E 's/[\r\n]+/; /g'

That only works if all the line terminators can be captured in the pattern space, which means slurping the whole file into memory so that they can be edited.

So, you can use the simpler (one command for GNU sed):

sed -zE 's/[\r\n]+/; /g; s/; $/\n/' "$filepathvar"

The -z option takes null bytes as line terminators, effectively getting all \r and \n characters into the pattern space.

The s/[\r\n]+/; /g converts all types of line delimiters to the string you want.

The s/; $/\n/ converts the (last) trailing delimiter to an actual newline.
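As a quick illustration (the sample file here is made up, not from the question), capturing the result into a variable could look like:

printf 'one\r\ntwo\r\n\r\nthree\r\n' > sample.txt            # CRLF file with a blank line
joined_string_var=$(sed -zE 's/[\r\n]+/; /g; s/; $/\n/' sample.txt)
printf '%s\n' "$joined_string_var"                           # one; two; three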


Notes

The -z sed option means to use the zero byte (0x00) as the delimiter. The use of that delimiter started with find's need to pass along filenames that may contain newlines (-print0), matched by the xargs -0 option. As a consequence, some other tools were also modified to process zero-delimited strings.

It is a non-POSIX option that breaks input at zero bytes instead of newlines.

POSIX text files must contain no zero (NUL) bytes, so using that option means, in practice, capturing the whole file in memory before processing it.

Breaking the input on NULs means that newline characters end up being editable in sed's pattern space. If the file happens to contain some NUL bytes, the idea still works correctly for newlines, as they still end up editable within each chunk of the file.

The -z option was added by GNU sed. The AT&T sed (on which POSIX was based) did not have such an option (and still doesn't); some BSD seds also still don't.

An alternative to the -z option is to capture the whole file in memory yourself. That can be done POSIXly in a couple of ways:

sed 'H;1h;$!d'          # capture whole file in hold space.
sed ':a;N;$!ba'         # capture whole file in pattern space.

Having all newlines (except the last one) in the pattern space makes it possible to edit them:

sed -Ee 'H;1h;$!d;x' -e 's/(\r\n|\r|\n)/; /g'

With older seds it is also necessary to use the longer, more explicit (\r\n|\r|\n)+ instead of [\r\n]+, because such seds don't understand \r or \n inside bracket expressions [].
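Putting those pieces together, a sketch of the whole pipeline for a sed that understands -E and \r but lacks -z (here the trailing delimiter is simply removed rather than turned into a newline) might look like:

sed -Ee 'H;1h;$!d;x' -e 's/(\r\n|\r|\n)+/; /g' -e 's/; $//' "$filepathvar"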

Line oriented

A solution that works one line at a time (with \r also accepted as a valid line terminator), and which therefore does not need to keep the whole file in memory, is possible with GNU awk:

awk -vRS='[\r\n]+' 'NR>1{printf "; "}{printf "%s", $0}END{print ""}' file

It must be GNU awk because of the regex record separator [\r\n]+; in most other awks, the record separator is taken to be a single character.
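As a quick check (the CRLF sample file below is hypothetical), the result can be captured into the variable from the question:

printf 'one\r\n\r\ntwo\r\nthree\r\n' > sample.txt
joined_string_var=$(awk -v RS='[\r\n]+' 'NR>1{printf "; "}{printf "%s", $0}END{print ""}' sample.txt)
printf '%s\n' "$joined_string_var"   # one; two; three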

5

Just use perl. Sed is more complicated to use with newlines, but perl can handle them easily:

printf 'aa\nbb\ncc\n' > file
printf 'aa2\r\nbb2\r\ncc2\r\n' > file2
printf 'aa3\rbb3\rcc3\r' > file3

So, file has \n line endings, file2 has \r\n and file3 has \r (which is obsolete these days, by the way; there's not much point in supporting it). Now, concatenate them into a string:

$ joined_string_var=$(perl -pe 's/(\r\n|\r|\n)/; /g' file file2 file3)
$ echo "$joined_string_var"
aa; bb; cc; aa2; bb2; cc2; aa3; bb3; cc3; 

You'll need a second pass to remove the trailing ; delimiter though:

$ joined_string_var=$(perl -pe 's/(\r\n|\r|\n)/; /g' file file2 file3 | sed 's/; $//')
$ echo "$joined_string_var" 
aa; bb; cc; aa2; bb2; cc2; aa3; bb3; cc3

Or, remove it in perl:

$ joined_string_var=$(perl -ne 's/(\r\n|\r|\n)/; /g; $k.=$_; END{$k=~s/; $//; print $k}' file file2 file3)
$ echo "$joined_string_var" 
aa; bb; cc; aa2; bb2; cc2; aa3; bb3; cc3
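When checking whether a trailing delimiter is really gone, echo can be misleading, since it appends a newline of its own. Wrapping the value in visible markers makes it unambiguous (assuming the variable was filled as above):

printf '[%s]\n' "$joined_string_var"   # a leftover '; ' would show up right before the closing bracket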
fra-san
terdon
  • Thanks, but here I have a trailing delimiter, which I didn't want – Andreas Oct 14 '20 at 08:10
  • 3
    @Andreas no you don't. The variable has no trailing delimiter, it's only echo that adds it. Try joined_string_var=$(perl -pe 's/(\r\n|\r|\n)/; /g' file file2 file3) and then printf '%s' "$joined_string_var". – terdon Oct 14 '20 at 08:11
  • 1
    @terdon, Yes, indeed he does get a trailing ;. –  Oct 14 '20 at 08:24
  • @Isaac oooh! Yes indeed. I was thinking of trailing newlines. – terdon Oct 14 '20 at 08:55
  • @Andreas I'm sorry, I misunderstood your comment, I thought you were referring to the \n added by echo. You're quite right, I was adding an extra delimiter. I've added one more step to remove it now, but that does make the command more complex. – terdon Oct 14 '20 at 08:57
  • You could use: cat file1 file2 file3 | perl -pe 's/(\r\n|\r|\n)/; /g; s/; $// if eof' (reads the files only once; or use only one file). –  Oct 14 '20 at 09:46
  • @fra-san I don't see anything wrong with the test data formerly containing different newline styles as I explicitly did want to support them. – Andreas Oct 15 '20 at 08:54
  • @terdon I only want to join lines of single files to each variable (but supporting multiple files obviously doesn't "hurt"). Maybe that should be made clear in the question... – Andreas Oct 15 '20 at 08:55
  • @Andreas fra-san just corrected a mistake I'd made. I wanted one file with \n, one with \r and one with \r\n, but in my answer I was showing two files with \r\n instead. As for the multiple files, the same command can handle one or many files, so it doesn't really make a difference. If you give it one file, then it will only join one file. – terdon Oct 15 '20 at 09:01
3

For the record, in zsh (for those coming here with a similar requirement, but without the bash limitation), you'd do:

IFS=$'\r\n'
joined=${(j[; ])$(<$filepathvar):#}
  • IFS=$'\r\n' sets the field separator for word splitting to CR or LF characters (using the ksh93-style $'...' quotes).
  • $(<file): like in ksh, expands to the contents of file (without the trailing newline characters), subject to word splitting.
  • ${list:#pattern} expands to the elements of the list that don't match the pattern (an extension of ksh's ${list#pattern}). Here, the empty string is used as the pattern, so empty lines are removed.
  • ${(j[; ])list} joins the elements of the list with "; ".
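A quick way to try it out from a zsh shell, with a made-up CRLF sample file containing a blank line:

printf 'aa\r\n\r\nbb\r\n' > sample.txt
IFS=$'\r\n'
joined=${(j[; ])$(<sample.txt):#}
print -r -- "$joined"   # aa; bb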
0
f=file

# Python: slurp the file, trim leading/trailing CR/LF, then squeeze every run of CR/LF into the delimiter
python3 -c "import re
print(re.sub(r'[\r\n]+', '; ', open('$f').read().strip('\r\n')))"

# Perl: slurp the file (-0777), autosplit (-a) on runs of CR/LF, print the fields joined by '; '
perl -anF'[\r\n]+' -0777E '$,="; ";
  say @F;
' file
0

A possibly elegant, surely non-portable, GNU awk variation that uses the join function from the library shipped alongside gawk itself:

joined_string=$(awk -i join -v RS='[\n\r]+' -v sep='; ' '
  { a[++i] = $0 } END { print join(a, 1, i, sep) }
' "$filepathvar")

The arguments to the join function are: an array to join (a), the start element's position (1), the end element's position (i), the string to use as the separator (sep).
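To see what join does on its own, here is a toy run (assuming awk is GNU awk 4.1 or later, so that -i is available and join.awk is found on the default AWKPATH):

gawk -i join 'BEGIN { a[1]="aa"; a[2]="bb"; a[3]="cc"; print join(a, 1, 3, "; ") }'
# prints: aa; bb; cc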

GNU awk's non-standard -i (or --include) option is used to extend its features by loading source libraries. The interpretation of RS as a regular expression is also an extension to the standard, supported by GNU awk and some other implementations (e.g. mawk, BusyBox awk).

Note that this approach is not suitable for large amounts of data because the whole file has to be stored in memory.

fra-san