
I would like to remove all empty lines that are ONLY located at the end of the file using awk

I was able to successfully find a way to delete all of the empty lines at the top only with the following command:

awk '/^$/ && a!=1 {a=0} !/^$/ {a=1} a==1 {print}' file.txt

However, I didn't know how to reverse it so I could remove the bottom lines instead. I know I could just use the command above and pipe it through tac, but I would prefer a direct approach using the awk command only (if possible).

To clarify, a line is considered "empty" if it is "visually empty", i.e. contains at most spaces and/or tabs.

AdminBee
  • Is a line that only contains spaces (e.g. blanks and/or tabs) considered "empty" or not? – Ed Morton Aug 27 '21 at 19:36
  • @EdMorton It doesn't matter as long as it is empty (regardless of whether it contains spaces and/or tabs). The goal here is just to make sure the last printed line contains info/data. My priority is that it should be done using awk – CodingNoob Aug 27 '21 at 19:59
  • For future questions - "as long as it is empty" is an ambiguous statement because some people consider a line of blanks to be empty (as you apparently do) while others consider only a line that has no contents to be empty. So if/when you find yourself referring to any data as "empty" in future, be sure to state what you mean by that. Most of the answers you have so far (e.g. all those that test ^$) will fail given an "empty" input line that contains blanks, since we're all making different assumptions about what "empty" means. – Ed Morton Aug 27 '21 at 20:23
  • @EdMorton Oh yeah.. I get your point. You are right. I should have phrased my question better. Thank you for your clarification, brother. – CodingNoob Aug 27 '21 at 20:26

6 Answers


Awk

Since Awk reads the file sequentially, from the first to the last line, without external help (e.g. tac) it can only tell whether a block of empty lines is at the end of the file once it actually reaches the end of the file.

What you can do is keep a variable with the empty lines (i.e., only newline characters, the default record separator RS) and print those empty lines whenever you reach a non-empty line:

awk '/^$/{n=n RS}; /./{printf "%s",n; n=""; print}' file
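
As a quick sanity check, here is that command run on a small throwaway file (the file name and its contents are made up purely for illustration):

printf 'a\n\nb\n\n\n' > sample.txt
awk '/^$/{n=n RS}; /./{printf "%s",n; n=""; print}' sample.txt

This prints "a", one empty line and "b"; the two trailing empty lines are accumulated in n but never printed.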

"I don't understand why there is a difference between print n and printf n."

print appends the output record separator (ORS, by default a newline) to the expression to be printed. Thus, with print n you would get an extra newline, i.e. an extra empty line before each non-empty line. You could also write it with a single output statement as in

awk '/^$/{n=n RS}; /./{printf "%s%s%s",n,$0,RS; n=""}' file

Ed or Ex

To print the output (just as Awk did), choose either of

printf '%s\n' 'a' '' '.' '?.?+1,$d' ',p' 'Q'  | ed -s file
printf '%s\n' 'a' '' '.' '?.?+1,$d' '%p' 'q!' | ex -s file

To directly apply the changes to the file, choose either of

printf '%s\n' 'a' '' '.' '?.?+1,$d' 'w' 'q'   | ed -s file
printf '%s\n' 'a' '' '.' '?.?+1,$d' 'x'       | ex -s file

To understand what's going on:
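
Roughly, this is my own annotation of what each command does (the same logic applies to both the ed and the ex variants):

a             start appending text after the last line of the buffer
(blank line)  the single empty line being appended
.             stop appending
?.?+1,$d      search backwards for the last line containing any character, then
              delete everything from the following line through the end; the
              appended empty line guarantees this address exists even when the
              file has no trailing blank lines
,p / %p       print the whole buffer (',p' in ed, '%p' in ex)
Q / q!        quit without writing; the 'w'+'q' and 'x' variants instead write
              the changes back to the file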

Command substitution

Shells strip trailing newline characters in command substitution.

printf '%s\n' "$(cat file)"

Mind that some shells will not handle large files and error with "argument list too long".
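
A quick illustration of that stripping (the file name is made up):

printf 'a\nb\n\n\n' > sample.txt
wc -l < sample.txt                          # 4
printf '%s\n' "$(cat sample.txt)" | wc -l   # 2: the trailing newlines are gone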

Inspired by this answer.

Quasímodo
  • Note that in the general case printf var is wrong as the first argument of printf is the format. So it's better to use printf "%s", var. Using printf n here should be harmless though as the variable only contains newline characters so should be guaranteed not to contain % characters (note that for the printf utility, backslash characters are also a problem in addition to %). – Stéphane Chazelas Aug 27 '21 at 18:35
  • @StéphaneChazelas Thanks, very true. Although that's no concern here as you said, it's good to stand by the good practice, so I've edited the answer. – Quasímodo Aug 27 '21 at 18:40
  • The ‘nifty trick’ is likely to be inefficient for all but very small files, and fail completely for large ones. (The exact limit, and the behaviour for files exceeding it, will depend upon the shell you're using.) – gidds Aug 28 '21 at 15:45
  • Thanks @gidds, I have added a warning about that limit. I tried it for ¬(very small files), namely for a 10MB file with 116600 lines, and it executed in a fraction of a second, while Awk was five times faster (so indeed slower, but not that bad). – Quasímodo Aug 28 '21 at 17:50
  • I think with ex you could do something like printf '%s\n' \$ ?.?+,\$d %p q\! (where I’ve eschewed the append and used the idiomatic % instead of ,) – D. Ben Knoble Aug 28 '21 at 21:50
  • @D.BenKnoble You can only skip appending the empty line if you can guarantee the file has a trailing empty line. This is similar to the approach contained in the answer I linked to. Thanks for catching the ,p, that one is only for Ed. – Quasímodo Aug 30 '21 at 11:26
awk 'length == 0 { ++n; next } { for (i = 1; i <= n; ++i) print ""; n = 0 }; 1' file

or shortened, as suggested in comments,

awk 'length == 0 { ++n; next } { while (n) { print ""; --n } }; 1'

This keeps track of runs of empty lines in the counter n.

Whenever an empty line is seen (length == 0), the counter is incremented but nothing is output.

When a non-empty line is seen, the appropriate number of empty lines is first output before the current line. The counter n is also reset.

This avoids outputting empty lines from the end of the file.
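
For example, on a small made-up sample:

printf 'x\n\n\ny\n\n' > sample.txt
awk 'length == 0 { ++n; next } { while (n) { print ""; --n } }; 1' sample.txt

Output: "x", two empty lines, "y"; the final empty line only increments n and is never printed.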


Using standard sed:

sed -n -e :again -e N -e '/[^\n]/!b again' -e p file

This introduces an explicit loop that adds lines to the buffer until there is something other than just newlines in it. At that point, the buffer is output. If the input file ends while reading with N, the data in the buffer (which will be only newlines) will not be output.

Annotated code (the initial #n turns off the default output, just like using -n would do):

#n

# Label to branch to later.
:again

# Append next line of input to buffer
# with a delimiting newline.
N

# Branch (jump) to :again if there's
# only newlines in the buffer.
/[^\n]/!b again

# Output buffer.
p
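
To run the annotated version, save it to a file (say strip.sed, a name chosen just for illustration) and pass it to sed with -f:

sed -f strip.sed file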

Kusalananda
  • Don't forget that [^\n] is a nonportable, GNU-specific regex. – guest_7 Aug 27 '21 at 23:53
  • +1 for the awk approach. It can further be simplified via a down-counting while loop. – guest_7 Aug 27 '21 at 23:54
  • @guest_7 From POSIX: "The escape sequence '\n' shall match a <newline> embedded in the pattern space. A literal <newline> shall not be used in the BRE of a context address or in the substitute function." What is GNU specific is to insert a literal newline using \n with the s/// command. I have tested my code with sed on OpenBSD and Plan 9. – Kusalananda Aug 28 '21 at 05:37

This 1-pass approach will work whether the input is coming from a pipe or a file but has to store each block of empty lines in memory (which probably won't really be an issue unless you have billions of contiguous empty lines in your input):

awk 'NF{print s $0; s=""; next} {s=s ORS}' file

This 2-pass approach won't work if the input is a pipe, but it will if the input is a file (as you said in the question), and it uses almost no memory:

awk 'NR==FNR{if (NF) n=NR; next} FNR>n{exit} 1' file file

The above assumes that a line that contains only spaces is considered "empty". If that's wrong then change NF to /./.
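
A quick check of the two-pass version on a made-up sample that also contains a space-only "empty" line:

printf 'a\n \nb\n\n \n' > sample.txt
awk 'NR==FNR{if (NF) n=NR; next} FNR>n{exit} 1' sample.txt sample.txt

The first pass records n=3 (the last line with fields); the second pass prints "a", the space-only line and "b", then exits before the trailing blank and space-only lines.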

Ed Morton
  • I like your approach. I understood most parts, but I still need your help for more clarification (if you don't mind, of course). I am really confused about what s=s ORS means. Can you explain what that line means? Would there be a difference if I used RS instead? Thank you very much for your time and help. Thanks to you, I learned that NF ignores everything including multiple spaces and tabs while !/^$/ doesn't. I also learned from you about next and how I can use it. :D – CodingNoob Aug 28 '21 at 00:32
  • s = s ORS is just collecting a string of newlines (empty lines), which will get printed when the next non-empty line is hit, or discarded if those empty lines were at the end of the file. ORS is the Output Record Separator, RS is the input Record Separator. We're building a string of lines to output, so while in your case RS and ORS are probably the same thing (\n) and functionally it doesn't matter which you use, technically ORS is correct to ensure your "empty" lines have the same terminator (ORS) as your non-empty lines being output by print (i.e. also ORS). – Ed Morton Aug 28 '21 at 02:43
  • You'd see the difference between using RS and ORS when populating s if you wanted your output lines to end in |\n or something. If you're using ORS when populating s then all you need to do is set -v ORS='|\n' on the command line and everything will work, but if you're using RS to populate s then you have a problem and some coding to do. Try it to see what I mean (see the sketch below). – Ed Morton Aug 28 '21 at 02:47
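
A minimal sketch of the difference hinted at in the comment above (file name made up):

printf 'a\n\nb\n' > sample.txt
awk -v ORS='|\n' 'NF{print s $0; s=""; next} {s=s ORS}' sample.txt

This prints a|, then |, then b|. If s were built with s=s RS instead, the middle (formerly empty) line would come out without the | terminator.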

Using awk and its range operator (,):

awk '
!NF,NF{
  t = ($0 = t $0) ORS
  if (!NF) next
  t=""
}1' file

We accumulate the empty lines plus one non-empty line in t. Once a non-empty line is seen, we flush the buffer t and restart the process of accumulation.
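
A quick check on a made-up sample:

printf 'a\n\n\nb\n\n' > sample.txt
awk '
!NF,NF{
  t = ($0 = t $0) ORS
  if (!NF) next
  t=""
}1' sample.txt

This prints "a", two empty lines and "b"; the final empty line is only accumulated in t and never reaches the output.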


With all-POSIX constructs, we can do it using the sed utility as shown:

sed -e '
  /^\n*$/!b
  $d;N;s/./&&/;D
' file
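
The sed variant can be checked against the same sample file (run here with GNU sed):

sed -e '
  /^\n*$/!b
  $d;N;s/./&&/;D
' sample.txt

Again the output is "a", two empty lines and "b", with the trailing empty line dropped.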
guest_7
co=$(awk '!/^$/{x=NR}END{print x}' filename)   # line number of the last non-empty line
co=$((co+1))                                   # first line of the trailing block of empty lines
j="$co,$"                                      # sed address range: from there to the last line
sed -i "${j}d" filename                        # delete that range in place

Tested and worked fine


Same idea as in this answer

"What you can do is keep a variable with the empty lines (i.e., only newline characters, the default record separator RS) and print those empty lines whenever you reach a non-empty line"

but using sed

sed -n -e '/^$/H' -e '/^$/!{x;/^$/!{s/\n//;p;z};x;p}'
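
A quick check on a made-up sample (note that the z command used above is a GNU sed extension):

printf 'a\n\n\nb\n\n' > sample.txt
sed -n -e '/^$/H' -e '/^$/!{x;/^$/!{s/\n//;p;z};x;p}' sample.txt

This prints "a", two empty lines and "b"; the trailing empty line stays in the hold space and is never printed.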
CervEd