Frequency of words in non-English language text: how can I merge singular and plural forms etc.?

Question

I'm sorting French words in some text files by frequency, with a focus on insight rather than statistical significance. The challenge is preserving accented characters and handling the elided article forms used in front of vowels (l', d') when shaping word tokens for sorting.

The topic of the most frequent words in a file takes many shapes (1 | 2 | 3 | 4). So I put together this function using GNU utilities:

compt1 () {
for i in *.txt; do
    echo "File: $i"
    sed -e 's/ /\
/g' <"$i" | sed -e 's/^[[:alpha:]][[:punct:]]\(.*\)/\1/' | sed -e 's/\(.*\)/\L\1/' | grep -hEo "[[:alnum:]_'-]+" | grep -Fvwf /path_to_stop_words_file | sort | uniq -c | sort -rn 
done
}

...which trades spaces for newlines; trims a single character followed by punctuation at the beginning of each line (such as l' or d'); converts everything to lowercase; uses a compact grep construct that matches word-constituent characters to create tokens; removes the stop words; and finally does the usual sorting. The stop words file contains a section of individual characters, so you have to be careful with how it's used, but the analysis provided on how to create stems for words in different languages is really interesting!

Now when I compare the frequency of a significant word with the output of grep -c directly on the files, I think it's close enough within some margin of error.

Questions:

How could I modify this to merge the frequency of plurals with their singular forms, i.e. words sharing a common prefix with a varying one-character suffix?
Would the grep part in particular work with what ships on OSX?

^{1. I cannot provide source data but I can provide this file as an example. The words heure and enfant in the text are useful test cases. The former appears twice in the text, including once as "l'heure", and helps validate whether the command works. The latter appears in both singular and plural forms (enfant/enfants) and would benefit from being merged here.}

I've just been told by an expert on this kind of thing that you'll never be able to do this properly with sed & co. You should try a stemmer instead. It was also suggested that you might get better answers on [so] or [linguistics.se]. — terdon, Jul 19 '14 at 14:32
OK, I just checked with one of the [linguistics.se] mods and they said it's on topic there. Since they deal with this kind of thing professionally, they might be able to help out more. The question is perfectly on topic here though so it's completely up to you. If you would like it migrated, just flag for moderator attention and let us know. — terdon, Jul 19 '14 at 16:00
@terdon Thank you very much for looking into it and enabling the answer! As for a migration path I just don't know. As I find the topic interesting I should maybe expand my account to Linguistics and write a better focused question which links to this here... — , Jul 19 '14 at 22:18
As you wish. The linguists are willing to take this so it's really up to you, whatever you prefer. — terdon, Jul 20 '14 at 12:38

score 10 · Accepted Answer · answered Jul 19 '14 at 14:53

You really are not going to be able to do this with a simplistic sed script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.

That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.

That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of language, including for the inflections and to distinguish homographs.

I have done very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian peninsula:

    # 1st strip leading articles
    s/^L'//;    # Catalan
    s{ ^
        (?:
          # Castilian
            El
          | Los
          | La
          | Las

          # Catalan
          | Els
          | Les
          | Sa
          | Es

          # Gallego
          | O
          | Os
          | A
          | As
        )
        \s+
    }{}x;

    # 2nd strip interior particles
    s/\b[dl]'//g;   # Catalan
    s{
        \b
        (?:
            el  | los | la | las | de  | del | y          # ES
          | els | les | i  | sa  | es  | dels             # CA
          | o   | os  | a  | as  | do  | da | dos | das   # GAL
        )
        \b
    }{}gx;

That strips the articles and particles so that they don’t count for purposes of sortation. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones with a s/’/'/g first.
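
As a rough sketch of that normalization step, here is the same idea with sed instead of Perl. The function name normalize_elisions is made up for illustration. It assumes UTF-8 input with one already-lowercased token per line, and it only covers the two elided forms mentioned in the question (l' and d'):

```shell
# normalize_elisions: hypothetical helper, reads one token per line on stdin.
# Assumes UTF-8 input that has already been lowercased.
normalize_elisions() {
    # 1. turn U+2019 RIGHT SINGLE QUOTATION MARK into a straight apostrophe
    # 2. drop a leading elided article or particle (l' or d')
    sed -e "s/’/'/g" \
        -e "s/^[dl]'//"
}

printf '%s\n' "l’heure" "d'enfant" "heure" | normalize_elisions
```

With that input it prints heure, enfant and heure, so both spellings of "l'heure" end up counted together with the bare word.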

Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.

Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.

But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.

This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they would probably be better answered on Linguistics.SE; I don’t know.

Thank you for taking the time to expose the underlying considerations. I had never considered that language-specific morphology expertise would be required - but I did get that right away, as I read Mr Porter's take on French. In particular the region with the vowel after the first non-vowel; and the second region with the same construct. I thought if I had a stem file for all the French stems, then I could make a comparison which folds the matches to stems. I will take more time to analyze what you wrote and what it entails. Ty! — , Jul 19 '14 at 22:31

score 3 · Answer 2 · edited May 23 '17 at 12:40

Natural language processing is complex. Doing it with regular expressions is like parsing HTML with regular expressions, only worse. Read tchrist's excellent answer for some insight as to how to approach your problem. I'm going to briefly answer the part about the portability of your use of unix text processing tools.

The common denominator to all modern unix-like systems is the POSIX specification. The most useful resource is the Open Group Specification Issue 6 a.k.a. Single Unix Specification version 3 (OGS Issue 7 = SUS version 4 is not fully implemented on many systems), which includes and extends POSIX and, usefully, is available online and for download (e.g. in Debian). If you're only interested in portability to non-embedded Linux (and Cygwin) and to OSX, check the GNU manuals and the OSX man pages.

You are using several non-POSIX options to grep, but all of them are available in both GNU and OSX (OSX uses the grep from FreeBSD which seeks to emulate most GNU constructs). If you want POSIX, you'll need to avoid a few options:

grep -h to suppress the file name: call grep on one file at a time, or pass the files to cat first.
grep -o to output only the matched part: use sed or awk instead.
grep -w to match only whole words: search for a pattern like (^|[^[:alnum:]])needle($|[^[:alnum:]]).
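
A minimal sketch of those three substitutions, assuming one token per line is the goal. The function names tokens and drop_stop_words are invented for this example. Note that the stop word filter matches whole tokens exactly, which is stricter than grep -w, and that which bytes count as [:alnum:] depends on the locale, just as with grep:

```shell
# tokens FILE...: print one token per line (stand-in for `grep -ho PATTERN`).
# cat removes the need for -h; sed replaces each run of characters that
# cannot be part of a token with a newline, then blank lines are dropped.
tokens() {
    cat -- "$@" |
        sed -e "s/[^[:alnum:]_'-][^[:alnum:]_'-]*/\\
/g" |
        sed -e '/^$/d'
}

# drop_stop_words STOPFILE: copy stdin to stdout, minus tokens listed in
# STOPFILE (one stop word per line). Stand-in for `grep -Fvwf STOPFILE`,
# but comparing the whole token rather than words inside the line.
drop_stop_words() {
    awk -v stopfile="$1" '
        BEGIN { while ((getline w < stopfile) > 0) stop[w] = 1 }
        !($0 in stop)
    '
}

# Possible use (paths are placeholders):
#   tokens *.txt | drop_stop_words /path_to_stop_words_file | sort | uniq -c | sort -rn
```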

You are using one GNU-only construct in sed: the \L directive to lowercase a replacement in the s command. There's nothing like that in other sed implementations. In general, you can use awk instead: break down the input to isolate the string to replace and call tolower. To lowercase the whole input, call tr '[:upper:]' '[:lower:]'.
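
For instance, a whole-line replacement for the \L step could look like the sketch below (the function name lowercase is only illustrative). Whether accented capitals such as É are folded depends on the awk implementation and the active locale, so it is worth checking on each target system:

```shell
# lowercase: hypothetical helper, lowercases every line read from stdin.
# tolower() is part of POSIX awk; handling of non-ASCII letters varies
# between awk implementations and locales.
lowercase() {
    awk '{ print tolower($0) }'
}

printf '%s\n' 'Les Enfants' | lowercase
```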

Thank you! I had forgotten about that sed bit. I used it because non-heirloom tr can't do capitalized accented chars but I'll use awk for that like you suggested. Plus I'll make a habit of validating target versions of the utilities so as to save time during design instead of adapting afterwards! — , Jul 20 '14 at 05:01

score 1 · Answer 3 · edited Apr 13 '17 at 12:54

The selected answer really provides a great introduction to the challenges in the field of Natural Language Processing and Computational Linguistics and there is surely further information on dedicated SE assets. I wanted to provide a complement which underscores these challenges and provides me with a temporary "fix".

I think I can in some cases¹ trim the last s with sed to achieve a pretty safe yet interesting result:

s/\(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö]\)s$/\1/

This compacts some 50 lines in the provided sample when used with the original function.
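
To illustrate where that expression would sit, here is a small sketch that applies it to a stream of tokens (one per line) just before counting. The function name merge_plurals is made up, and the input is assumed to be lowercased already:

```shell
# merge_plurals: hypothetical helper. Reads one lowercased token per line,
# strips a final "s" that follows one of the listed letters, then counts.
merge_plurals() {
    sed -e 's/\(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö]\)s$/\1/' |
        sort | uniq -c | sort -rn
}

printf '%s\n' enfant enfants heure | merge_plurals
```

Here enfant and enfants are counted together (2) while heure stays at 1. The rule is purely orthographic, so it also folds unrelated words together: fils (son) would be counted as fil (thread), for example.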

So I tried sed with the following, which is both incomplete and not working as intended - but showcases difficulties and is helpful in my opinion in understanding what the answer explained:

sed '

h;
s/^\(par\|col\|tap.*\)/\1/
t RVv

h;
s/^\(par\|col\|tap.*\)/\1/
t RVc

h;
s/^\([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù]..*\)$/\1/
t RVnotpctv_v

h;
s/^\(.*.[aeiouyâàëéêèïîôûù]....*\)/\1/
t RVnotpctother
b

:RVv
s/^\(par\|col\|tap[bcdfghjklmnpqrstvwxz][aeiouyâàëéêèïîôûù].*\)/\1/
t R1

:RVc
s/^\(par\|col\|tap[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*\)/\1/
t R1

:RVnotpctv_v
s/^\([aeiouyâàëéêèïîôûù][aeiouyâàëéêèïîôûù].[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*\)$/\1/
t R1

:RVnotpctother
s/^\(.*[aeiouyâàëéêèïîôûù][bcdfghjklmnpqrstvwxz].*\)/\1/
t R1

:R1        
s/ement$\|ements$\|ité$\|ités$\|if$\|ive$\|ifs$\|ives$\|euse$\|euses$//
s/é$\|ée$\|ées$\|és$\|èrent$\|er$\|era$\|erai$\|eraIent$\|erais$\|erait$\|eras$\|erez$\|eriez$\|erions$\|erons$\|eront$\|ez$\|iez$\|ions$\|eons$//
s/eâmes$\|eât$\|eâtes$\|ea$\|eai$\|eaIent$\|eais$\|eait$\|eant$\|eante$\|eantes$\|eants$\|eas$\|easse$\|eassent$\|easses$\|eassiez$\|eassions$//
s/âmes$\|ât$\|âtes$\|a$\|ai$\|aIent$\|ais$\|ait$\|ant$\|ante$\|antes$\|ants$\|as$\|asse$\|assent$\|asses$\|assiez$\|assions$//
s/[bcdfghjklmnpqrstvwxz]îmes$\|ît$\|îtes$\|i$\|ie$\|ies$\|ir$\|ira$\|irai$\|iraIent$\|irais$\|irait$\|iras$\|irent$\|irez$\|iriez$\|irions$\|irons$\|iront$\|is$\|issaIent$\|issais$\|issait$\|issant$\|issante$\|issantes$\|issants$\|isse$\|issent$\|isses$\|issez$\|issiez$\|issions$\|issons$\|it$//
s/Y/i/
s/ç/c/
t R2

:R2
s/ance$\|iqUe$\|isme$\|able$\|iste$\|eux$\|ances$\|iqUes$\|ismes$\|ables$\|istes$//
s/atrice$\|ateur$\|ation$\|atrices$\|ateurs$\|ations$//
s/logie$\|logies$/log/
s/usion$\|ution$\|usions$\|utions$/u/
t Res

:Res
##Residual
s/ier$\|ière$\|Ier$\|Ière$/i/
s/\(.*[bcdefghjklmnpqrtvwxyzéëêàâûùôö]\)s$/\1/
##Undouble
s/\(en\)n$/\1/
s/\(on\)n$/\1/
s/\(et\)t$/\1/
s/\(el\)l$/\1/
s/\(eil\)l$/\1/
##Unaccent
s/\(.*\)\(é\)\([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*\)$/\1e\3/
s/\(.*\)\(è\)\([bcdefghjklmnpqrtvwxyzéëêàâûùôö]*\)$/\1e\3/
s/\(.*\)e$/\1/
t
'

In some instances it succeeds at stripping the word to some stem, but there is a very conscious choice to avoid dealing with words containing only a few characters, because it implements only a few features (and not R2, for instance), and badly at that. But it compacts another 50-60 lines in the sample, as it includes the prior sed expression.² For further insight I'll look into Linguistics!

^{1. This is all based on my "understanding" of the pseudo-code/description of the Snowball French algorithm.}

^{2. It is wrong in many instances, but running it interactively on the line provided me with the insight I was looking for when looking at words like parlons and bonbons. I realized there is nothing intrinsic in these two words which dictates why the first one (a verb) has to be stripped of its ons while the other (a noun) only of its s. It's about parsing the parts of speech as was explained...}
