
I encountered this use case today. It seems simple at first glance, but fiddling around with sort, uniq, sed and awk revealed that it's nontrivial.

How can I delete all pairs of duplicate lines? In other words, if there is an even number of duplicates of a given line, delete all of them; if there is an odd number of duplicate lines, delete all but one. (Sorted input can be assumed.)

A clean elegant solution is preferable.

Example input:

a
a
a
b
b
c
c
c
c
d
d
d
d
d
e

Example output:

a
d
e
Wildcard

12 Answers


I worked out the sed answer not long after I posted this question; no one else has used sed so far, so here it is:

sed '$!N;/^\(.*\)\n\1$/d;P;D'

A little playing around with the more general problem (what about deleting lines in sets of three? Or four, or five?) provided the following extensible solution:

sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp

Extended to remove triples of lines:

sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp

Or to remove quads of lines:

sed -e ':top' -e '$!{/\n.*\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1\n\1$/d;P;D' temp

sed has an additional advantage over most other options, which is its ability to truly operate in a stream, with no more memory storage needed than the actual number of lines to be checked for duplicates.
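The pairwise streaming idea can be sketched in Python for comparison (an illustrative sketch, not the sed mechanism itself; `remove_pairs` is a made-up name, and sorted input is assumed):

```python
import sys

def remove_pairs(lines):
    """Drop matched pairs of adjacent duplicate lines, keeping one
    line of any odd-length run. Buffers at most one line at a time."""
    held = None
    for line in lines:
        if held is None:
            held = line                # nothing buffered: hold this line
        elif held == line:
            held = None                # matched pair: discard both
        else:
            yield held                 # unpaired line passes through
            held = line
    if held is not None:
        yield held                     # possible trailing odd line

# Stream usage: sys.stdout.writelines(remove_pairs(sys.stdin))
print(list(remove_pairs(["a", "a", "a", "b", "b", "e"])))  # ['a', 'e']
```

Like the sed version, this only ever holds one pending line in memory, so it works on arbitrarily large sorted streams.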


As cuonglm pointed out in the comments, setting the locale to C is necessary to avoid failures to properly remove lines containing multi-byte characters. So the commands above become:

LC_ALL=C sed '$!N;/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n/!{N;b top' -e '};};/^\(.*\)\n\1$/d;P;D' temp
LC_ALL=C sed -e ':top' -e '$!{/\n.*\n/!{N;b top' -e '};};/^\(.*\)\n\1\n\1$/d;P;D' temp
# Etc.
Wildcard

  • @Wildcard: You may want to set the locale to C; otherwise, in a multi-byte locale, an invalid character in that locale causes the command to fail. – cuonglm Apr 19 '16 at 04:54

It's not very elegant, but it's as simple as I can come up with:

uniq -c input | awk '{if ($1 % 2 == 1) { print substr($0, 9) }}'

The substr() just trims off the uniq output. That'll work until you have more than 9,999,999 duplicates of a line (in which case uniq's output may spill over 9 characters).
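If you'd rather not depend on uniq's column width at all, the count can be parsed off explicitly. A small illustrative Python sketch (not part of the answer; `parse_count` is a made-up helper):

```python
import re

def parse_count(line):
    """Split one line of `uniq -c` output into (count, original text),
    regardless of how wide the count column happens to be."""
    m = re.match(r'\s*(\d+) (.*)', line)
    return int(m.group(1)), m.group(2)

print(parse_count("      3 foo   bar"))  # (3, 'foo   bar')
```

This keeps any internal whitespace in the line itself, since only the leading padding, digits, and one separator space are stripped.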

Jeff Schaller
  • I tried uniq -c input | awk '{if ($1 %2 == 1) { print $2 } }' and it seemed to work equally well. Any reason the substr version is better? – Joseph R. Apr 18 '16 at 21:50
  • @JosephR., if there is any whitespace in the lines, the version in your comment will fail. – Wildcard Apr 18 '16 at 21:51
  • That is true. In that case, wouldn't a loop to print the fields $2 to $NF be more robust? – Joseph R. Apr 18 '16 at 23:08
  • @JosephR.: Why do you believe that your alternative would be more robust?  You might have difficulty getting it to work correctly when there are multiple consecutive spaces; e.g., foo   bar. – G-Man Says 'Reinstate Monica' Apr 18 '16 at 23:35
  • @JosephR., no, because it would change/eliminate the whitespace delimiting. uniq (at least in GNU coreutils) seems to reliably use exactly 9 characters before the text itself; I can't find this documented anywhere, though, and it's not in the POSIX specs. – Wildcard Apr 18 '16 at 23:36

Try the awk script below:

#!/usr/bin/awk -f
{
  # When the line changes, print the previous line if its run length was odd.
  if ((NR!=1) && (previous!=$0) && (count%2==1)) {
    print previous;
    count=0;
  }
  previous=$0;
  count++;
}
END {
  # Flush the final run.
  if (count%2==1) {
    print previous;
  }
}

It is assumed that the lines.txt file is sorted.

The test:

$ chmod +x script.awk
$ ./script.awk lines.txt
a
d
e
Jay jargot

With pcregrep, for the given sample:

pcregrep -Mv '(.)\n\1$' file

or in a more general way:

pcregrep -Mv '(^.*)\n\1$' file
jimmij

If input is sorted:

perl -0pe  'while(s/^(.*)\n\1\n//m){}'
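The same repeated-substitution idea, sketched in Python for comparison (illustrative only; like `perl -0`, it assumes the whole sorted input fits in memory — `squeeze_pairs` is a made-up name):

```python
import re

def squeeze_pairs(text):
    """Repeatedly delete the first adjacent duplicate pair of lines,
    mirroring the Perl loop while(s/^(.*)\n\1\n//m){}."""
    prev = None
    while prev != text:          # stop once no substitution happened
        prev = text
        text = re.sub(r'^(.*)\n\1\n', '', text, count=1, flags=re.M)
    return text

print(squeeze_pairs("a\na\na\nb\nb\ne\n"), end="")  # prints a, then e
```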
JJoao
  • You have an anchoring failure here. Try running it on e.g. pineapple\napple\ncoconut and the output is pinecoconut. – Wildcard May 21 '16 at 11:51
  • @Wildcard: thank you. You are right. See if my update makes sense... – JJoao May 22 '16 at 20:53
  • Yep. I was wondering why you were using \n instead of $ given the /m modifier, but then I realized that using $ would leave a blank line in place of deleted lines. Looks good now; I've removed the incorrect version since it just added noise. :) – Wildcard May 22 '16 at 22:21
  • @wildcard, thank you for the noise reduction ☺ – JJoao May 22 '16 at 22:31

I like Python for this; for example, with Python 2.7+:

from itertools import groupby
with open('input') as f:
    for k, g in groupby(f):
        if len(list(g)) % 2:
            print(k),
iruvar

As I understood the question, I opted for awk, using each record as a hash key. Here I assume RS="\n", but the script can be changed to handle other arrangements, or adapted (with a parameter or a small dialog) to keep even rather than odd numbers of repetitions. Every line is used as a key into an associative array and its count is incremented; at the end of the file, the array is scanned and every record with an odd count is printed. I'm including the count in the output in order to check the results; removing a[x] from the print statement is enough to solve the problem as stated.

HTH

countlines code

#!/usr/bin/nawk -f
{a[$0]++}
END{for (x in a) if (a[x]%2!=0) print x,a[x] }

Sample Data:

a
One Sunny Day
a
a
b
my best friend
my best friend
b
c
c
c
One Sunny Day
c
d
my best friend
my best friend
d
d
d
One Sunny Day
d
e
x
k
j
my best friend
my best friend

Sample Run:

countlines feed.txt
j 1
k 1
x 1
a 3
One Sunny Day 3
d 5
e 1
  • It's a nice piece of awk code, but unfortunately awk associative arrays are not ordered at all, nor are they order-preserving. – Wildcard Apr 19 '16 at 03:31
  • @Wildcard, I agree with you. If you require the input order rather than a sorted order, it can be implemented with an extra hash key. The advantage of this approach is that you don't have to sort the input, since the sorting can be done at the end on a smaller output ;) – Moises Najar Apr 19 '16 at 05:06
  • @Wildcard if you need the order to be preserved, please mention that in the question. This approach was also my first thought and you don't mention order other than to say that we can assume the file is sorted. Of course, if the file is sorted, you can always pass the output of this solution through sort. – terdon Apr 19 '16 at 16:03
  • @terdon, of course you're correct; the output can just be sorted again. Good point. It's also worth noting that the !=0 is implied by how awk converts numbers to true/false values, making this reducible to awk '{a[$0]++}END{for(x in a)if(a[x]%2)print x}' – Wildcard Apr 19 '16 at 16:44

Using shell constructs,

uniq -c file | while read a b; do if (( a % 2 == 1 )); then echo "$b"; fi; done
Guido

If the input is sorted, what about this awk:

awk '{ x[$0]++; if (prev != $0 && x[prev] % 2 == 1) { print prev; } prev = $0; } END { if (x[prev] % 2 == 1) print prev; }' sorted
taliezin

with perl:

uniq -c file | perl -lne 'if (m(^\s*(\d+) (.*)$)) {print $2 if $1 % 2 == 1}'
xx4h

Fun puzzle!

In Perl:

#! /usr/bin/env perl

use strict;
use warnings;

my $prev;
while (<>) {
  $prev = $_, next unless defined $prev;  # prime the pump

  if ($prev ne $_) {
    print $prev;
    $prev = $_;                           # first half of a new pair
  }
  else {
    undef $prev;                          # discard and unprime the pump
  }
}

print $prev if defined $prev;             # possible trailing odd line

Verbosely in Haskell:

main :: IO ()
main = interact removePairs
  where removePairs = unlines . go . lines
        go [] = []
        go [a] = [a]
        go (a:b:rest)
          | a == b = go rest
          | otherwise = a : go (b:rest)

Tersely in Haskell:

import Data.List (group)
main = interact $ unlines . map head . filter (odd . length) . group . lines

A version: I use "delimiter" lines to simplify the main loop. It assumes the first input line is not __unlikely_beginning__ and that the text does not end with the line __unlikely_ending__; that special delimiter line is appended to the end of the input, so the algorithm can rely on both assumptions:

{ cat INPUTFILE_or_just_- ; echo "__unlikely_ending__" ; } | awk '
  BEGIN { mem = "__unlikely_beginning__"; occurred = 0; }

  ($0 == mem)    { occurred++; next }

  (occurred % 2) { print mem }
                 { mem = $0; occurred = 1 }
'

So:

  • we remember the pattern we are currently looking at, increasing its count by one every time it reoccurs (and if it did reoccur, we skip the next two actions, which handle the case when the pattern changes);
  • when the pattern CHANGES:
    • if the count is not a multiple of 2, we print one occurrence of the memorized pattern;
    • and in every case, the new memorized pattern is the current pattern, and we have only seen it once.
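For comparison, the same run-counting logic sketched in Python, where explicit end-of-input makes the sentinel lines unnecessary (an illustrative sketch; `odd_runs` is a made-up name):

```python
def odd_runs(lines):
    """Yield one copy of each run of equal consecutive lines whose
    length is odd -- the same logic as the awk script above."""
    mem, occurred = None, 0
    for line in lines:
        if line == mem:
            occurred += 1            # same pattern: just count it
            continue
        if occurred % 2:             # pattern changed: emit if odd run
            yield mem
        mem, occurred = line, 1
    if occurred % 2:                 # flush the final run
        yield mem

print(list(odd_runs(["a", "a", "a", "b", "b", "e"])))  # ['a', 'e']
```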