Why is uniq ignoring Unicode and lines with a single letter?

Question

I'm trying to combine the American and British dictionaries into one large dictionary and remove all the duplicates from the superset, but it seems that uniq is not outputting words like "épée" or single letters.

This is what I've tried using:

LC_COLLATE=en_US.UTF-8 cat american-english british-english |sort|uniq -u > unique_sorted_combined_dict

If I just do this:

LC_COLLATE=en_US.UTF-8 cat american-english british-english |sort > sorted_combined_dict

"épée" and other such words do show up, along with single letters.

Is there something I'm missing here with uniq?

I should note that I'm using uniq from the GNU coreutils on Ubuntu 12.10, if that makes any difference.

What if you do 'LC_COLLATE=en_US.UTF-8 cat american-english british-english |sort > temp && cat temp | uniq > sorted_combined_dict' ? — schaiba, Feb 04 '13 at 20:57
I decided, per your comment, to just take out the -u flag, and now it's working. I don't know why -u is not giving me the expected results though. — supercheetah, Feb 04 '13 at 21:06
@supercheetah I think I made the same mistake as you when it comes to uniq. https://unix.stackexchange.com/a/394205/45386 answered my concerns. — Ken Sharp, Dec 05 '20 at 23:34

score 3 · Accepted Answer · answered Feb 04 '13 at 21:11

You're setting LC_COLLATE for the cat command only (which doesn't make use of it), while you need to set it for sort and uniq.

Also, you may need to set LC_CTYPE to a UTF-8 locale; otherwise the tools may misinterpret multi-byte characters. I would simply set LC_ALL to en_US.UTF-8.
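As a minimal sketch of that advice: a VAR=value prefix affects only the single command it precedes, so the locale goes on sort and uniq rather than on cat. The two small files below are hypothetical stand-ins for the real word lists, and the sketch assumes the en_US.UTF-8 locale is installed (check with `locale -a`).

```shell
# Hypothetical sample files standing in for american-english and british-english.
tmp=$(mktemp -d)
printf '%s\n' a color épée > "$tmp/american-english"
printf '%s\n' a colour épée > "$tmp/british-english"

# The locale prefix is attached to sort and uniq, the commands that use it.
cat "$tmp/american-english" "$tmp/british-english" \
  | LC_ALL=en_US.UTF-8 sort \
  | LC_ALL=en_US.UTF-8 uniq > "$tmp/sorted_combined_dict"

cat "$tmp/sorted_combined_dict"
```

Alternatively, running `export LC_ALL=en_US.UTF-8` once beforehand makes every command in the pipeline inherit the setting.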

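uniq -u reports only the lines that occur exactly once in its input; a line that is repeated is dropped entirely rather than collapsed to one copy. So if those single-letter words each appear more than once, it's normal that they don't show up.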

On my system, épée does appear twice:

$ cat american-english british-english | sort | grep -x 'épée'
épée
épée

Maybe you meant sort | uniq or sort -u.
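To make the difference concrete, here is a small sketch on made-up input (not the real dictionaries) in which "a" and "épée" each occur twice and "color" once, mirroring the situation in the question.

```shell
# Hypothetical sorted input: "a" and "épée" appear twice, "color" once.
input=$(printf '%s\n' a a color épée épée)

# Plain uniq collapses each run of identical adjacent lines to one copy.
deduped=$(printf '%s\n' "$input" | uniq)
# uniq -u keeps only the lines that are not repeated at all.
only_once=$(printf '%s\n' "$input" | uniq -u)
# uniq -d keeps one copy of each line that is repeated.
repeated=$(printf '%s\n' "$input" | uniq -d)

# sort -u sorts and deduplicates in one step, like sort | uniq.
via_sort_u=$(printf '%s\n' épée a color a épée | sort -u)
via_sort_uniq=$(printf '%s\n' épée a color a épée | sort | uniq)

printf 'uniq:\n%s\n\nuniq -u:\n%s\n\nuniq -d:\n%s\n' "$deduped" "$only_once" "$repeated"
```

Here uniq -u leaves only "color", which is why the words present in both dictionaries vanished from the combined list.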

Yeah, I obviously misunderstood what the -u flag does, and I think that was the cause of my problems. Thanks! — supercheetah, Feb 05 '13 at 23:11
