sort letters in a single word - use it to find permutations (or anagrams)

Question

I have a MySpell dictionary in file.dic. Let's say:

abc
aword
bword
cab
worda
wordzzz

and I'm looking for different words that are permutations (or anagrams) of each other.

If there were a command called "letter-sort", I'd do it more or less like this:

cat file.dic | letter-sort | paste - file.dic | sort

That gives me:

abc abc
abc cab
adorw aword
adorw worda
bdorw bword
dorwzzz wordzzz

so now I can clearly see the anagrams in the file. Is there such a letter-sort command, or some other way to obtain this result?

Thanks @pfnuesel. Fixed now :) My fault, it should be adorw, assuming alphabetical order. – sZpak Nov 17 '16 at 23:51

score 1 · Accepted Answer · answered Nov 17 '16 at 23:50

To sort the letters of each line in a file, you could do something like this:

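# Slow on large files: grep, sort and tr are started for every line.
while IFS= read -r line; do
    # One character per line, sort them, then join them back together.
    grep -o . <<< "${line}" | sort | tr -d '\n'
    echo
done < file.dic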

Output:

abc
adorw
bdorw
abc
adorw
dorwzzz
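
For larger dictionaries, the same per-line letter sort could be sketched in Python, which does the work in a single process. This is only an illustration under assumptions: the names sort_letters and letter_sort_lines are made up here, and sorted() orders characters by code point, which may differ from the locale-aware order that sort uses for non-ASCII letters.

```python
def sort_letters(word):
    """Return the characters of word in sorted (code point) order."""
    return "".join(sorted(word))


def letter_sort_lines(lines):
    """Yield the letter-sorted form of each line, ignoring the trailing newline."""
    for line in lines:
        yield sort_letters(line.rstrip("\n"))


# Sample words from the question; an open file or sys.stdin would work the same way.
sample = ["abc\n", "aword\n", "bword\n", "cab\n", "worda\n", "wordzzz\n"]
for key in letter_sort_lines(sample):
    print(key)
```

Passing sys.stdin instead of the sample list would turn this into a filter that could stand in for the hypothetical letter-sort command in the question's pipeline.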

pfnuesel

5,837

It works, but it's very sloooow :) For a dictionary of ~300k words (lines) it took several minutes. I'll try using pipes with a Python script later on to make it faster. – sZpak Nov 18 '16 at 00:23
I can imagine this is slow. For a faster solution, you should try Python or Perl. – pfnuesel Nov 18 '16 at 00:33
Also see Why is using a shell loop to process text considered bad practice? – Wildcard Nov 18 '16 at 03:34
@pfnuesel, please add a note to your answer about the slowness. This is very bad practice; at the least you should put a warning flag at the start. – Wildcard Nov 18 '16 at 03:35

score 1 · Answer 2 · answered Nov 17 '16 at 23:51

You can use the fold command to break a string into an array of individual characters, as in the script below:

#!/bin/bash

# Split $1 into the array CHARS, one character per element (mapfile needs bash 4+).
mapfile -t CHARS < <(printf '%s\n' "$1" | fold -w1)

for i in "${CHARS[@]}"
do
    # do something with each character
    echo "$i"
done

Assuming that you have saved the script above as test.sh you can run it as follows:

$ ./test.sh abcde

and it will break the string "abcde" into an array of characters, which you can then use to find its anagrams.
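
Once a word is available as a list of characters, the anagram test itself comes down to comparing the sorted lists. A minimal sketch in Python, where list() splits a string into characters (code points) rather than bytes; the helper name are_anagrams is made up for illustration and is not part of the script above.

```python
def are_anagrams(first, second):
    """Return True if both words consist of the same characters with the same counts."""
    # list() splits a str into individual characters (Unicode code points).
    return sorted(list(first)) == sorted(list(second))


print(list("abcde"))                   # ['a', 'b', 'c', 'd', 'e']
print(are_anagrams("abc", "cab"))      # True
print(are_anagrams("aword", "bword"))  # False
```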

jsalatas

389

I've tested it and fold is much faster than grep. Unfortunately, it doesn't work in a non-Latin locale. – sZpak Nov 18 '16 at 13:45
Hmmmm..... it works for me (just tested with greek input) $ echo δοκιμή | fold -w1 δ ο κ ι μ ή – jsalatas Nov 18 '16 at 18:54
Doesn't work with Polish though :/ echo "zażółć gęślą jaźń" | fold -w1 at least on my system. – sZpak Nov 21 '16 at 10:12
Indeed! And I just figured out why: it seems that if all characters in the string are non-ASCII Unicode (like the Greek), it correctly uses two bytes for each character (Unicode representation). In the case of Polish it fails to do so, as it does not correctly interpret each character's length (either one or two bytes) and treats everything as one-byte characters. The same happens if I mix Greek with English, like this: echo δοκιμή test | fold -w1 :\ – jsalatas Nov 21 '16 at 10:33
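
The byte lengths under discussion can be checked directly. The sketch below assumes the input is UTF-8 encoded and lists how many bytes each character occupies; any tool that cuts after every byte rather than after every character would split the two-byte letters. The function name utf8_widths is illustrative only.

```python
def utf8_widths(text):
    """Return the number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


# Greek sample from the comments: every letter takes two bytes.
print(utf8_widths("δοκιμή"))
# Polish sample: one-byte ASCII letters mixed with two-byte letters.
print(utf8_widths("zażółć"))
```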

iruvar · Answer 3 · 2016-11-18T19:54:31.483

You mentioned Python, so just stick with Python. Two words are anagrams of each other if 1. they contain the same letters and 2. their letter frequencies match. The Counter class from the standard collections module can count letter frequencies in one pass, without the need for sorting:

from __future__ import print_function
from collections import Counter, defaultdict

with open('file') as f:
    # Group words by their letter frequencies; anagrams share the same counts.
    perms = defaultdict(list)
    for line in f:
        word = line.rstrip('\n')
        perms[frozenset(Counter(word).items())].append(word)
    # Print each group of anagrams on one line.
    for anagrams in perms.values():
        print(*anagrams)

Output (the order of the groups may differ):

bword
aword worda
abc cab
wordzzz
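
Since the question is about words that are permutations of each other, groups holding a single word could be dropped. A sketch along the same lines, with two assumptions stated plainly: it uses the sorted letters as the grouping key instead of a Counter (a swapped-in technique that yields the same groups), and anagram_groups is an illustrative name.

```python
from collections import defaultdict


def anagram_groups(words):
    """Return the groups of mutual anagrams that have two or more members."""
    groups = defaultdict(list)
    for word in words:
        # Words with the same sorted letters are anagrams of each other.
        groups["".join(sorted(word))].append(word)
    return [group for group in groups.values() if len(group) > 1]


# Sample words from the question.
words = ["abc", "aword", "bword", "cab", "worda", "wordzzz"]
for group in anagram_groups(words):
    print(*group)
```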

score 0 · Answer 4 · answered Aug 29 '17 at 12:53

Perl, with its command-line flags, can be very succinct.

The following command sorts the letters of each input line:

perl -CS -ne 'chomp; print(join("", sort(split("", $_))), "\n")'

In practice, if you are working with anagrams, you might prefer the "an" utility, which can take a dictionary as an argument:

an -d /usr/share/dict/ngerman Anagramword
