You can take different approaches depending on whether awk treats RS as a single character (like traditional awk implementations do) or as a regular expression (like gawk or mawk do). Empty files are also tricky to handle, as awk tends to skip them.
gawk, mawk or other awk implementations where RS can be a regexp
In those implementations (for mawk, beware that some OSes like Debian ship a very old version instead of the modern one maintained by @ThomasDickey), if RS contains a single character, the record separator is that character; awk enters paragraph mode when RS is empty; and it treats RS as a regular expression otherwise.
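For instance, a quick illustration of the regexp behaviour (just a demonstration, not part of the solution):

$ printf 'aXXXb' | gawk -v RS='X+' '{print NR": "$0}'
1: a
2: b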
The solution there is to use a regular expression that can't possibly be matched. Some come to mind, like x^ or $x (x before the start, or after the end). However, some (particularly with gawk) are more expensive than others. So far, I've found that ^$ is the most efficient one. It can only match on an empty input, but then there would be nothing to match against.
So we can do:
awk -v RS='^$' '{printf "%s: <%s>\n", FILENAME, $0}' file1 file2...
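For example, with a (hypothetical) file foo containing the two lines a and b:

$ printf 'a\nb\n' > foo
$ awk -v RS='^$' '{printf "%s: <%s>\n", FILENAME, $0}' foo
foo: <a
b
>

Note how $0 contains the whole file, including the trailing newline.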
One caveat though is that it skips empty files (contrary to perl -0777 -n). That can be addressed with GNU awk by putting the code in an ENDFILE statement instead. But we also need to reset $0 in a BEGINFILE statement, as it would otherwise not be reset after processing an empty file:
gawk -v RS='^$' '
  BEGINFILE{$0 = ""}
  ENDFILE{printf "%s: <%s>\n", FILENAME, $0}' file1 file2...
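As a quick check (foo and empty being hypothetical sample files created for the demonstration):

$ printf 'a\n' > foo; : > empty
$ gawk -v RS='^$' '
    BEGINFILE{$0 = ""}
    ENDFILE{printf "%s: <%s>\n", FILENAME, $0}' foo empty
foo: <a
>
empty: <>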
traditional awk implementations, POSIX awk
In those, RS is just one character, they don't have BEGINFILE/ENDFILE, they don't have the RT variable, and they also generally can't process the NUL character.
You would think that using RS='\0' could work then, since they can't process input that contains the NUL byte anyway, but no: RS='\0' in traditional implementations is treated as RS=, which is paragraph mode.
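You can see that behaviour with, for instance, Debian's original-awk (a hypothetical test; the command name varies between systems):

$ printf 'a\n\nb\n' | original-awk -v RS='\0' 'END{print NR}'
2

Two paragraph records (a and b) are found, instead of the single record covering the whole input that you would get if RS were really a NUL character.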
One solution can be to use a character that is unlikely to be found in the input, like \1. In multibyte character locales, you can even make it a byte sequence that is very unlikely to occur, as it forms a character that is not assigned or a non-character, like $'\U10FFFE' in UTF-8 locales. Not really foolproof though, and you have a problem with empty files as well.
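For instance, with a shell that supports ksh93-style $'...' quotes (a sketch, assuming the input contains no 0x1 byte):

awk -v RS=$'\1' '{printf "%s: <%s>\n", FILENAME, $0}' file1 file2...

Like with RS='^$', $0 ends up containing the whole file (trailing newline included), but empty files are still skipped.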
Another solution can be to store the whole input in a variable and to process that in the END statement at the end. That means you can process only one file at a time though:
awk '{content = content $0 RS}
     END{$0 = content
         printf "%s: <%s>\n", FILENAME, $0
     }' file
That's the equivalent of sed's:
sed '
:1
$!{
N;b1
}
...' file1
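Where the ... could be, for instance, a substitution applied to the whole content (a sketch; remember that in sed, ^ and $ match the start and end of the pattern space, which by that point holds the whole file):

sed '
:1
$!{
N;b1
}
s/^/</
s/$/>/' file1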
Another issue with that approach is that if the file didn't end in a newline character (and wasn't empty), one is still arbitrarily added in $0 at the end (with gawk, you'd work around that by using RT instead of RS in the code above, as shown below). One advantage is that you do have a record of the number of lines in the file in NR/FNR.
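With gawk, that workaround could look like this (a sketch; RT contains the text that matched RS, and is empty for the last record of a file that doesn't end in a newline, so the input is reassembled byte for byte):

gawk '{content = content $0 RT}
      END{$0 = content
          printf "%s: <%s>\n", FILENAME, $0
      }' file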
To work with several files at a time, one approach would be to do all the file reading by hand in a BEGIN statement (here assuming a POSIX awk, not the /bin/awk of Solaris with the API from the 70s):
awk -- '
  BEGIN {
    for (i = 1; i < ARGC; i++) {
      FILENAME = ARGV[i]
      $0 = ""
      while ((getline line < FILENAME) > 0)
        $0 = $0 line "\n"
      close(FILENAME) # avoid running out of file descriptors with many files

      # actual processing here, example:
      print i". "FILENAME" has "NF" fields and "length()" characters."
    }
  }' *.txt
Same caveats about trailing newlines apply. That one has the advantage of being able to work with filenames that contain = characters.
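To see that = caveat with normal operand handling (x=1.txt being a hypothetical file name):

$ echo test > 'x=1.txt'
$ awk -v RS='^$' '{print FILENAME}' 'x=1.txt'

awk treats x=1.txt as a variable assignment and waits for input on stdin; the usual workaround with the other approaches is to pass it as ./x=1.txt. The BEGIN loop above doesn't have the problem, as awk exits before it ever processes its operands.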