Remove duplicate lines with possibly different spacing, but ignore lines that start with #

Question

This is a follow-up to an earlier question. I use the following command to remove duplicate lines even when their fields are spaced differently:

awk '{$1=$1};!NF||!seen[$0]++' /tmp/fstab

Now I want to exclude lines that start with #, so I tried this:

awk '/^#/ || "'!'"'{$1=$1};!NF||!seen[$0]++' /tmp/fstab
-bash: !: event not found

What is wrong with my syntax?

AdminBee · Accepted Answer · 2021-03-04T15:09:53.120

How about

awk '!NF||$1~/^#/ {print; next} {$1=$1} !seen[$0]++' /tmp/fstab

This immediately prints, unchanged, any line that is empty or whose first field starts with #, and then skips to the next line so that the remaining code is bypassed. All other lines are rebuilt (which normalizes their spacing) and printed only if they have not been encountered before.

The reason for testing $1~/^#/ instead of matching against the entire line (i.e. simply /^#/) is that this also catches comment lines where the # is preceded by whitespace. Although the manpage for fstab mandates that a comment line must have # as its first character, as @StephenKitt noted, the Linux implementation in libmount skips leading whitespace and also accepts a line as a comment if its first non-whitespace character is a #.
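To see the behaviour described above, here is a small sketch that runs the same awk program on a made-up sample. The sample lines are hypothetical and not taken from a real fstab; the point is only that the indented comment and the empty line pass through untouched while the differently spaced duplicate is dropped.

```shell
#!/bin/sh
# Hypothetical sample input; not taken from a real fstab.
sample=$(mktemp)
printf '%s\n' \
  '# comment' \
  '/dev/sda1   /   ext4   defaults   0 1' \
  '   # indented comment' \
  '/dev/sda1 / ext4 defaults 0 1' \
  '' \
  '# comment' > "$sample"

# Comments and empty lines are printed as-is; other lines are rebuilt
# with single spaces and printed only the first time they are seen.
result=$(awk '!NF||$1~/^#/ {print; next} {$1=$1} !seen[$0]++' "$sample")
rm -f "$sample"

printf '%s\n' "$result"
```

Note that the first data line is printed in its rebuilt form (single spaces), since the rebuild happens before the line is printed.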

FelixJN · Answer 2 · 2021-03-05T12:31:41.123

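Since bash is complaining (and not awk), and you have put single quotes around your !, the problem is clear: those quotes end the single-quoted awk program, so the ! is left unquoted and an interactive bash tries to perform history expansion on it. Inside the single-quoted program the ! needs no extra quoting at all.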
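The quoting effect can be illustrated with a minimal sketch, meant to be run as a script rather than typed at a prompt. The variable names are made up for the illustration, and the letter B stands in for the ! so that nothing depends on the shell's history settings.

```shell
#!/bin/sh
# Inside single quotes the shell hands ! and $0 to the program untouched.
inside=$(printf '%s' '!seen[$0]++')

# Closing and reopening the quotes leaves the middle piece (B) outside
# any single quotes; the shell simply glues the three pieces together.
pieces=$(printf '%s' 'A "'B'" C')

printf '%s\n' "$inside" "$pieces"
```

In the command from the question, the piece in the position of B is the !, which is why bash rather than awk reports the error.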

awk '{$1=$1} /^#/ || !seen[$0]++' file

I.e. rebuild the line first, then do the checks. Disadvantage: this also squeezes the whitespace inside comments, although duplicate comments are still kept. Avoid that by saving the original line first and printing the saved copy:

awk  '{a=$0 ; $1=$1}  /^#/ || !seen[$0]++ {print a}' file

Input:

#comment
duplicate line
#comment
duplicate    line 
not duplicate
not duplicate 2
duplicate        line
#comment2
#comment2
#comment 3
#comment      3

Output (first code)

#comment
duplicate line
#comment
not duplicate
not duplicate 2
#comment2
#comment2
#comment 3
#comment 3

Output (second code)

#comment
duplicate line
#comment
not duplicate
not duplicate 2
#comment2
#comment2
#comment 3
#comment      3
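The two listings can be reproduced with a short script such as the following sketch. It feeds the input shown above to both commands (the temporary file and the variable names are just for the illustration) so that the only difference, the spacing of the last comment, can be checked.

```shell
#!/bin/sh
# Sample input copied from the listing above (line 4 has a trailing space).
input=$(mktemp)
printf '%s\n' \
  '#comment' \
  'duplicate line' \
  '#comment' \
  'duplicate    line ' \
  'not duplicate' \
  'not duplicate 2' \
  'duplicate        line' \
  '#comment2' \
  '#comment2' \
  '#comment 3' \
  '#comment      3' > "$input"

# First command: comments pass through, but with their spacing squeezed.
first=$(awk '{$1=$1} /^#/ || !seen[$0]++' "$input")

# Second command: the saved original line is printed instead.
second=$(awk '{a=$0 ; $1=$1} /^#/ || !seen[$0]++ {print a}' "$input")

rm -f "$input"
printf 'first:\n%s\n\nsecond:\n%s\n' "$first" "$second"
```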

edited Mar 05 '21 at 12:31

answered Mar 04 '21 at 14:08

not good, because duplicate lines that differ only in spacing will not be removed – yael Mar 04 '21 at 14:11
so you need to use {$1=$1}; – yael Mar 04 '21 at 14:11
see also https://unix.stackexchange.com/questions/637572/how-to-remove-duplicate-lines-ignoring-tab-or-spaces – yael Mar 04 '21 at 14:12
@FelixJN duplicate a with one space, and duplicate a with two spaces. – Stephen Kitt Mar 04 '21 at 14:13
@yael I see, see correction. – FelixJN Mar 05 '21 at 12:11
@StephenKitt TY, was unaware of what was meant in the first place. – FelixJN Mar 05 '21 at 12:14

score 1 · Answer 3 · answered Mar 06 '21 at 13:51

Using GNU sed with extended regular expressions enabled (-E):

sed -Ee '
  # print empty|blank|comment lines
  /^\s*(#|$)/b
  s/\s+/ /g;s/^ | $//g;G;        # squeeze whitespace and append the previously seen unique lines
  /^([^\n]*)\n(.*\n)?\1(\n|$)/d; # delete if already seen
  P;h;d;                         # print a line seen for the first time, then update the seen list
'  file
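As a quick check of the sed approach, here is a sketch that pipes a few hypothetical lines through the same commands with the comments stripped. It assumes GNU sed, since \s and \n inside a bracket expression are GNU extensions.

```shell
#!/bin/sh
# Assumes GNU sed (for -E, \s and [^\n]); the sample lines are hypothetical.
result=$(printf '%s\n' \
  '#comment' \
  'a  b' \
  '' \
  '  #indented   comment' \
  'a b ' \
  'c' \
  '#comment' |
sed -Ee '
  /^\s*(#|$)/b
  s/\s+/ /g;s/^ | $//g;G
  /^([^\n]*)\n(.*\n)?\1(\n|$)/d
  P;h;d
')

printf '%s\n' "$result"
```

The hold space accumulates every unique line seen so far, so on very large files this approach can be noticeably slower than the awk versions.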
