26

The awk command below removes all duplicate lines, as explained here:

awk '!seen[$0]++'

If the text contains empty lines, all but one empty line will be deleted.

How can I keep all empty lines whilst deleting all non-empty duplicate lines, using only awk? Please, also include a brief explanation.

7 Answers

39

Another option is to check NF, e.g.:

awk '!NF || !seen[$0]++'

Or equivalently:

awk '!(NF && seen[$0]++)'
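
A quick way to convince yourself that the two forms agree (they are related by De Morgan's law) is to run both on the same input. The sketch below uses a made-up sample; the `sample` helper and the variable names are only for illustration:

```shell
# Hypothetical sample input: duplicated "a" and "b" lines mixed with empty lines.
sample() { printf 'a\n\nb\n\na\n\nb\nc\n'; }

# NF is 0 on an empty line, so the || and && short-circuit
# before seen[$0]++ is ever evaluated for that line.
out1=$(sample | awk '!NF || !seen[$0]++')
out2=$(sample | awk '!(NF && seen[$0]++)')

printf '%s\n' "$out1"
[ "$out1" = "$out2" ] && echo "both forms agree"
```

On this input all three empty lines survive, while the second "a" and the second "b" are dropped.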
Thor
  • 17,182
16

Alternatively

awk '!/./ || !seen[$0]++' file

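The main trick is the same: seen[$0]++ creates an entry in the seen associative array whose key is the current line ($0) and increments its count. Therefore, !seen[$0]++ is true only the first time a line is seen and false on every later occurrence. The /./ pattern matches any line containing at least one character, so !/./ matches empty lines. Note that a line holding only spaces or tabs still contains characters, so it is deduplicated like any other line. Combined with || !seen[$0]++, the command prints every empty line and the first occurrence of every other line, and drops the remaining duplicates.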
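
To watch the counter at work, the sketch below (with a made-up four-line input) prints the value that seen[$0]++ returns for each line. Only lines where it returns 0 would pass the !seen[$0]++ test:

```shell
# Post-increment returns the old count, so the first occurrence of a line yields 0.
counts=$(printf 'a\nb\na\na\n' | awk '{ printf "%d %s\n", seen[$0]++, $0 }')

printf '%s\n' "$counts"
```

The first "a" and the "b" report 0; the later "a" lines report 1 and 2, which is why they are filtered out.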

terdon
  • 242,166
8

Here is another awk solution, similar to @Thor's answer, less concise but more efficient:

awk '!NF {print;next}; !($0 in a) {a[$0];print}' file

With this, we only check whether a[$0] already exists. If it does not, we create it and print the line. If it does exist, we neither reference nor assign to a[$0].
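
The distinction this relies on can be checked in isolation: testing membership with `in` does not create an array element, whereas a bare reference such as a["x"] does. A minimal sketch, where the key "x" and the counter n are arbitrary names chosen for the demo:

```shell
result=$(awk 'BEGIN {
    # A membership test must not create the element.
    if ("x" in a) print "unexpected"
    n = 0; for (k in a) n++
    print "after in-test:", n

    # A bare reference creates the element with an empty value.
    a["x"]
    n = 0; for (k in a) n++
    print "after bare reference:", n
}')

printf '%s\n' "$result"
```

Whether this translates into a measurable speed difference is a separate question; the sketch only shows which operations touch the array.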

cuonglm
  • 153,898
  • I did not measure any significant time difference with my 288-line test file. However, your code certainly catches the prize for being the most readable. – Serge Stroobandt Nov 05 '15 at 20:24
  • 1
    Or, shorter: '!($0 in a) {if (NF) a[$0]; print}', or more readably: '!($0 in a) {if (NF) {a[$0]}; print}'. By ensuring that $0 is only stored if the line is non-empty, ($0 in a) will always be false for empty lines. – AdminBee May 24 '22 at 11:16
  • According to "terdon" and "Ed Morton" this is not more efficient than the top voted answer. Neither explained why, however. Hopefully they will explain further. – End Anti-Semitic Hate Mar 01 '24 at 12:25
5
awk '/^[[:blank:]]*$/ { print; next; }; !seen[$0]++'

All you have to do is check for an empty (really empty or just blank) line first.
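
The difference between "really empty" and "just blank" shows up on input that contains whitespace-only lines. The sketch below compares a test for strictly empty lines (/^$/) with the blank-aware pattern above; the `sample` and `count` helpers are made up for the demo:

```shell
# Hypothetical input: "a", a single space, empty, a single space, empty, "a".
sample() { printf 'a\n \n\n \n\na\n'; }
count() { awk 'END { print NR }'; }

# /^$/ matches only truly empty lines, so the second " " line is dropped as a duplicate.
empty_only=$(sample | awk '/^$/ || !seen[$0]++' | count)

# [[:blank:]]* also matches lines made of spaces or tabs, so every blank line is kept.
blank_too=$(sample | awk '/^[[:blank:]]*$/ { print; next }; !seen[$0]++' | count)

echo "empty-only test keeps $empty_only lines"
echo "blank-aware test keeps $blank_too lines"
```

This assumes an awk that supports POSIX character classes such as [[:blank:]]; some older implementations may not.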

Hauke Laging
  • 90,279
1

Move the logical NOT operator outside the parentheses:

awk '!(NF && seen[$0]++)'
nezabudka
  • 2,428
  • 6
  • 15
1
awk '/^$/ || !seen[$0]++' filename
-2

This answer is similar to cuonglm's answer, but is much shorter and easier to read (at least for me). All credit goes to AdminBee:

awk '!($0 in a) {if (NF) a[$0]; print}' filename

By ensuring the $0 is only stored if the line is non-empty, ($0 in a) will always be false for empty (blank) lines.

  • 2
    Is there any benefit in this over the top answers? All of them are significantly shorter than this, so I don't quite see the benefit here. – terdon Feb 28 '24 at 12:50
  • @terdon Yes, it's much more efficient than the top 2 answers, and shorter than the third answer. This answer was provided in a helpful comment by AdminBee, and I simply converted it into a real answer so others can benefit from it as I did (comments are often deleted). – End Anti-Semitic Hate Feb 28 '24 at 22:23
  • 1
    How is it more efficient? Easier to read is subjective, I find this far more obscure since the behavior of awk to automatically increment something without explicitly being told to (with just a[$0]) is kind of esoteric. However, if you find it easier, that's absolutely fine. I just don't see how this would be more efficient than awk '!NF || !seen[$0]++'. I tested on a file with ~100k lines and couldn't see any significant difference in speed between that and yours. And both keep the same amount of data in memory, so it isn't "much more efficient" – terdon Feb 29 '24 at 11:56
  • @terdon Did you read the answer for which AdminBee's comment was posted, and the explanation provided in that answer by cuonglm? Why are you not questioning that post too? Please ask for clarification in the comment section of that answer: https://unix.stackexchange.com/a/131261/141350 – End Anti-Semitic Hate Feb 29 '24 at 12:02
  • 1
    I am questioning, as you say, because this is a new answer which, as far as I can tell, isn't providing any benefit over any of the existing answers. Cuonglm's was posted ten years ago and at the time of posting was providing a new approach. As you can see, I haven't downvoted, I was just wondering why this was posted since it didn't seem to be adding anything new, so I asked in case I was missing the benefit. You claimed better readability (fair enough, I don't see it, but this is subjective) and then that this is "much more efficient" and that doesn't seem to be true. That's all. – terdon Feb 29 '24 at 12:40
  • @EdMorton If you believe it is less efficient, please explain why here, where it was originally posted: https://unix.stackexchange.com/a/131261/141350 – End Anti-Semitic Hate Mar 01 '24 at 12:22