
Tools like sed, awk or perl -n process their input one record at a time, records being lines by default.

Some, like awk with RS, GNU sed with -z, or perl with -0ooo, can change the type of record by selecting a different record separator.

perl -n can make the whole input (each individual file when passed several files) a single record with the -0777 option (or -0 followed by any octal number greater than 0377, 777 being the canonical one). That's what they call the slurp mode.
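
For instance, to get each file's entire content as one record in $_ (file1 and file2 being placeholder names):

perl -0777 -ne 'printf "%s: %d bytes\n", $ARGV, length' file1 file2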

Can something similar be done with awk's RS or any other mechanism, so that awk processes each file's content as a whole, in order, as opposed to each line of each file?

1 Answer


You can take different approaches depending on whether awk treats RS as a single character (like traditional awk implementations do) or as a regular expression (like gawk or mawk do). Empty files are also tricky, as awk tends to skip them.

gawk, mawk or other awk implementations where RS can be a regexp.

In those implementations (for mawk, beware that some OSes like Debian ship a very old version instead of the modern one maintained by @ThomasDickey), if RS contains a single character, the record separator is that character; if RS is empty, awk enters paragraph mode; otherwise, RS is treated as a regular expression.
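
A quick illustration of those three behaviours (a sketch using gawk; the printf invocations just provide sample input):

printf 'a b\nc\n' | gawk -v RS=' ' '{print NR": "$0}'       # single-character RS: 2 records
printf 'a\n\nb\n' | gawk -v RS= '{print NR": "$0}'          # empty RS: paragraph mode, 2 records
printf 'a1b2c\n'  | gawk -v RS='[0-9]' '{print NR": "$0}'   # regexp RS: 3 records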

The solution there is to use a regular expression that can't possibly be matched. Some come to mind like x^ or $x (x before the start, or after the end). However some (particularly with gawk) are more expensive than others. So far, I've found that ^$ is the most efficient one. It can only match on an empty input, but then there would be nothing to match against.

So we can do:

awk -v RS='^$' '{printf "%s: <%s>\n", FILENAME, $0}' file1 file2...

One caveat though is that it skips empty files (contrary to perl -0777 -n). That can be addressed with GNU awk by putting the code in an ENDFILE statement instead. But we also need to reset $0 in a BEGINFILE statement, as it would otherwise not be reset after processing an empty file:

gawk -v RS='^$' '
   BEGINFILE{$0 = ""}
   ENDFILE{printf "%s: <%s>\n", FILENAME, $0}' file1 file2...

traditional awk implementations, POSIX awk

In those, RS is just one character, they don't have BEGINFILE/ENDFILE, they don't have the RT variable, they also generally can't process the NUL character.

You would think that RS='\0' could work then, since those implementations can't process input containing the NUL byte anyway, but no: in traditional implementations, RS='\0' is treated as RS=, which is paragraph mode.
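
You can check that behaviour with a traditional awk (here invoked as original-awk, the name Debian gives the BWK awk; a sketch, details may vary between implementations):

printf 'a\n\nb\n' | original-awk -v RS='\0' '{print NR": "$0}'

That prints two records (a and b) as in paragraph mode, rather than treating the whole input as a single record.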

One solution can be to use a character that is unlikely to be found in the input, like \1. In multibyte character locales, you can even use byte sequences that are very unlikely to occur, as they form characters that are not assigned or are non-characters, like $'\U10FFFE' in UTF-8 locales. That's not really foolproof though, and you have a problem with empty files as well.
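
A sketch of that first approach, assuming the input never contains the \1 byte:

awk 'BEGIN{RS = "\1"} {printf "%s: <%s>\n", FILENAME, $0}' file1 file2...

Since end-of-file ends the current record, each file still ends up as one record, like with the RS='^$' variant above (and empty files are still skipped).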

Another solution can be to store the whole input in a variable and to process that in the END statement at the end. That means you can process only one file at a time though:

awk '{content = content $0 RS}
     END{$0 = content
       printf "%s: <%s>\n", FILENAME, $0
     }' file

That's the equivalent of sed's:

sed '
  :1
  $!{
   N;b1
  }
  ...' file1

Another issue with that approach is that, if the file didn't end in a newline character (and wasn't empty), one is still arbitrarily added in $0 at the end (with gawk, you'd work around that by using RT instead of RS in the code above). One advantage is that you do have a record of the number of lines in the file in NR/FNR.
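
With gawk, that RT-based variant would look like this (RT holds the text that actually matched RS, so it's empty for a last record not terminated by a newline):

gawk '{content = content $0 RT}
      END{$0 = content
        printf "%s: <%s>\n", FILENAME, $0
      }' file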

To work with several files at a time, one approach would be to do all the file reading by hand in a BEGIN statement (here assuming a POSIX awk, not the /bin/awk of Solaris with the API from the 70s):

awk -- '
  BEGIN {
    for (i = 1; i < ARGC; i++) {
      FILENAME = ARGV[i]
      $0 = ""
      while ((getline line < FILENAME) > 0)
        $0 = $0 line "\n"
      close(FILENAME) # avoid running out of file descriptors with many files

      # actual processing here, example:
      print i". "FILENAME" has "NF" fields and "length()" characters."
    }
  }' *.txt

Same caveats about trailing newlines. That one has the advantage of being able to work with filenames that contain = characters.
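
That matters because awk treats a command-line operand containing = as a variable assignment rather than a file name. A sketch, with a=b.txt as a hypothetical file name:

echo test > 'a=b.txt'
awk '{print}' 'a=b.txt' < /dev/null   # prints nothing: operand parsed as assignment a=b.txt
awk 'BEGIN{while ((getline l < ARGV[1]) > 0) print l}' 'a=b.txt'   # prints test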

  • as for the last part ("if the file wasn't ending in a newline character (and wasn't empty), one is still arbitrarily added in $0 at the end"): for text files, they are supposed to have an ending newline. vi adds one, for example, and thus modifies the file when you save it. Not having a terminating newline makes some commands discard the last "line" (ex: wc) but others still 'see' the last line... ymmv. Your solution is therefore valid, imo, if you are supposed to treat text files (which is probably the case, as awk is good for text processing but not so good for binaries ^^ ) – Olivier Dulac Sep 14 '16 at 16:31
  • trying to slurp it all in may hit some limitations... traditional awk apparently had (has?) a limit of 99 fields on a line... so you may need to use a different FS as well to avoid that limit, but there may also be limits on how long the total length of a line (or the whole thing, if you manage to get it all on one line) can be? – Olivier Dulac Sep 14 '16 at 16:33
  • finally: a (silly...) hack could be to first parse the whole file and look for a char that isn't in there, then tr '\n' 'thatchar' the file before sending it to awk, and tr 'thatchar' '\n' the output? (you may need to still append a newline to ensure, like I noted above, your input file has a terminating newline: { tr '\n' 'missingchar' < thefile ; printf "\n" ;} | awk ..... | { tr 'missingchar' '\n' ;} (but that adds a '\n' at the end, that you may need to get rid of... maybe adding a sed before the final tr? if that tr accepts files without terminating newlines...) – Olivier Dulac Sep 14 '16 at 16:36
  • @OlivierDulac, the limit on the number of fields would only be hit if we were accessing NF or any field. awk doesn't do the splitting if we don't. Having said that, not even the /bin/awk of Solaris 9 (based on the 1970s awk) had that limitation, so I'm not sure we can find one that does (still possible, as SVR4's oawk had a limit of 99 and nawk 199, so it's likely the lifting of that limit was added by Sun and may not be found in other SVR4-based awks; can you test on AIX?). – Stéphane Chazelas Sep 01 '18 at 07:21
  • FWIW if you wanted to process multiple files with a POSIX awk you could just create a function that looks a lot like the code in your END block and call that function in every FNR==1 block as well as in the END (see the sketch after these comments). content would be reset after use in the function or in the FNR==1 block. – Ed Morton Jan 05 '21 at 20:30
  • @Ed, that would skip empty files. Another approach could be to loop through ARGV and read the lines with getline by hand in a BEGIN statement. – Stéphane Chazelas Jan 06 '21 at 09:18
  • Yeah, I thought about mentioning that but since you had already mentioned issues with empty files I didn't bother. – Ed Morton Jan 06 '21 at 15:06
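
A sketch of the per-file-function approach Ed Morton describes above (as noted in the comments, it skips empty files):

awk '
  function process() {
    $0 = content
    printf "%s: <%s>\n", prev, $0
    content = ""
  }
  FNR == 1 && NR > 1 {process()}   # a new file started: process the previous one
  {content = content $0 RS; prev = FILENAME}
  END {if (NR) process()}' file1 file2...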