Remove duplicate lines from file that are in a different order

Question

My file is like this:

alice, bob
bob, cat
cat, dennis
cat, bob
dennis, alice

I want to remove lines where the same words are repeated in reverse order. In this example, bob, cat and cat, bob contain the same words, so cat, bob should be removed and my output should be:

alice, bob
bob, cat
cat, dennis
dennis, alice

How can I do this?

Any restrictions regarding the other lines? I.e. can the fields be resorted and the lines be resorted, too? — FelixJN, Aug 04 '19 at 16:18
No such restrictions. Sorting can be done any number of times. — , Aug 04 '19 at 16:28

steeldriver · Accepted Answer · 2019-08-04T15:43:32.150

3

You could use a hash that is keyed on the sorted elements:

$ perl -lne 'print unless $h{join ",", sort split /, /, $_}++' file
alice, bob
bob, cat
cat, dennis
dennis, alice

For exactly 2 fields, something like this might suffice:

$ awk -F', ' '!seen[$2 FS $1]; {seen[$0]++}' file
alice, bob
bob, cat
cat, dennis
dennis, alice

edited Aug 04 '19 at 15:43

answered Aug 04 '19 at 15:34

steeldriver

81,074

idk what the perl script does but that awk script will use a lot more memory than necessary, see https://unix.stackexchange.com/a/533876/133219 for the idiomatic awk approach. – Ed Morton Aug 04 '19 at 22:45

score 1 · Answer 2 · answered Aug 04 '19 at 22:44

The idiomatic awk answer:

$ awk -F', ' '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
alice, bob
bob, cat
cat, dennis
dennis, alice

The general approach for any number of fields is to sort them and use the sorted list as the index to seen[].
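A minimal sketch of that general approach, assuming the same ", " separator as in the question. The function name dedupe_unordered is hypothetical, and the sort is a hand-written insertion sort so that the script does not depend on any particular awk implementation:

```shell
# Hypothetical helper: keep the first line for each distinct set of fields,
# whatever order the fields appear in. Assumes ", " as the separator.
dedupe_unordered() {
  awk -F', ' '
  {
    # Copy the fields into an array and insertion-sort them (portable awk
    # has no built-in sort function).
    for (i = 1; i <= NF; i++) f[i] = $i
    for (i = 2; i <= NF; i++) {
      v = f[i]
      for (j = i - 1; j >= 1 && f[j] > v; j--) f[j + 1] = f[j]
      f[j + 1] = v
    }
    # Join the sorted fields into a single key.
    key = ""
    for (i = 1; i <= NF; i++) key = key (i > 1 ? FS : "") f[i]
    # Print only the first line seen for each key.
    if (!seen[key]++) print
  }' "$@"
}

# Prints "a, b, c" and then "a, b".
printf '%s\n' 'a, b, c' 'c, a, b' 'a, b' 'b, a, c' 'b, a' | dedupe_unordered
```

Called as dedupe_unordered file, it should give the same four lines as the two-field one-liner on the sample input, since the original line is printed unchanged and only the key is sorted.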

answered Aug 04 '19 at 22:44

Ed Morton

31,617


Can you please explain the logic? – Death Metal Aug 08 '19 at 20:28

@DeathMetal It creates a common index out of each pair of key fields by putting them in greatest-first order, so A B and B A both become the index B A. Then it just tests whether that index has been seen before: the first time either A B or B A is encountered in the input, seen["B A"]++ is 0, the 2nd time it's 1, and so on. The ! at the front ensures that the default action of printing the current input line only occurs when seen["B A"]++ is zero, i.e. the first time it's seen in the input. – Ed Morton Aug 08 '19 at 20:52
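To make that explanation concrete, here is a small tracing sketch (the function name trace_pairs is hypothetical) that prints, for each input line, the index built from the two fields and the value seen[] holds before it is incremented. A line would be printed by the one-liner only when that value is 0:

```shell
# Hypothetical helper: show the index and counter used by the two-field
# awk one-liner, instead of filtering the lines.
trace_pairs() {
  awk -F', ' '
  {
    # Same greatest-first index as in the one-liner.
    key = ($1 > $2 ? $1 FS $2 : $2 FS $1)
    # seen[key]++ yields the old count and then increments it.
    printf "%s -> index \"%s\", counter %d\n", $0, key, seen[key]++
  }' "$@"
}

printf '%s\n' 'bob, cat' 'cat, dennis' 'cat, bob' | trace_pairs
# bob, cat -> index "cat, bob", counter 0
# cat, dennis -> index "dennis, cat", counter 0
# cat, bob -> index "cat, bob", counter 1
```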

score -1 · Answer 3 · answered Aug 04 '19 at 21:11

This sorts the fields within every line, then sorts the file and keeps only the unique lines:

while read line
  do
    echo $line |
    tr ' ,' '\n' |
    sort |
    tr '\n' ','
done < 1 |
sed -e 's/^,//' -e 's/,$//' -e 's/,,/\n/g' |
sort -u

answered Aug 04 '19 at 21:11

FelixJN

13,566

It will also strip some white space, interpret escape sequences, do globbing, word splitting and file name generation, be extremely slow and rely on sed being able to operate on a non-POSIX-compliant input stream. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice and https://mywiki.wooledge.org/Quotes for a description of some of the issues. – Ed Morton Aug 05 '19 at 00:29
