I'm sorting French words in some text files by frequency, with a focus on insight rather than statistical significance. The challenge is preserving accented characters and handling the elided article forms in front of vowels (l', d') when shaping word tokens for sorting.
The topic of the most frequent words in a file takes many shapes (1 | 2 | 3 | 4). So I put together this function using GNU utilities:
compt1 () {
for i in *.txt; do
echo "File: $i"
sed -e 's/ /\
/g' <"$i" | sed -e 's/^[[:alpha:]][[:punct:]]\(.*\)/\1/' | sed -e 's/\(.*\)/\L\1/' | grep -hEo "[[:alnum:]_'-]+" | grep -Fvwf /path_to_stop_words_file | sort | uniq -c | sort -rn
done
}
...which trades spaces for newlines; trims a single letter followed by punctuation at the beginning of a line; converts everything to lowercase; uses this compact grep construct, which matches word-constituent characters, to create tokens; removes the stop words; and finally does the usual sorting and counting. The stop file contains a segment with individual characters, so you have to be careful with how it's used, but the analysis provided on how to create stems for words in different languages is really interesting!
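To make the steps easier to follow, here is a minimal sketch of the same pipeline laid out one stage per line. It is not a drop-in replacement: `tokens_freq` and its stop-file argument are hypothetical names, it strips only a letter followed by a straight apostrophe (not any punctuation, and not the typographic ’), and it uses `grep -x` so that a stop word only removes a token it matches in full. It also swaps the `\L` replacement, which is a GNU sed extension, for `tr`; note that GNU `tr` works byte by byte, so accented capitals in UTF-8 text may be left unchanged.

```shell
# Hypothetical helper: reads text on stdin, prints "count word" lines.
# $1 is an assumed path to a stop-word file with one word per line.
tokens_freq () {
    tr -s '[:space:]' '\n' |            # one whitespace-separated chunk per line
        sed -e "s/^[[:alpha:]]'//" |    # drop a leading elided article such as l' or d'
        tr '[:upper:]' '[:lower:]' |    # lowercase (ASCII-safe; see the caveat above)
        grep -Eo "[[:alnum:]_'-]+" |    # keep runs of word-constituent characters
        grep -Fvxf "$1" |               # drop tokens that are exactly a stop word
        sort | uniq -c | sort -rn       # count and rank
}
```

It would be called as `tokens_freq /path_to_stop_words_file <"$i"` inside the loop.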
Now when I compare the frequency of a significant word with the output of grep -c run directly on the files, I think it's close enough, within some margin of error.
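One likely source of that margin is that `grep -c` counts matching lines, not matches, so a line containing the word twice is counted once. A minimal sketch of a closer cross-check, assuming a hypothetical helper named `count_word`, prints every match on its own line with `-o` and counts those instead:

```shell
# Hypothetical helper: count whole-word, case-insensitive occurrences
# of the word in $1 across the files given as remaining arguments.
count_word () {
    word=$1
    shift
    # -o: one match per output line, -h: no file name prefix,
    # -i: ignore case, -w: whole words only (so "heures" is not counted)
    grep -ohiw -- "$word" "$@" | wc -l
}
```

For example, `count_word heure *.txt` also counts the occurrence inside "l'heure", because the apostrophe is not a word character.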
Questions:
- How could I modify this to merge the frequency of plurals with their singular forms, i.e. words sharing a common prefix and differing only by a one-character suffix?
- I'm trying to assess whether the grep part in particular would work with what's on OSX?
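For the first question, one crude approach (a sketch, not a real stemmer) is to post-process the `uniq -c` output: a word ending in "s" or "x" is folded into the same word without that last letter, but only when the shorter form also occurs in the list. The function name `merge_plurals` is hypothetical, and the rule will produce false merges for pairs such as fil/fils. Because the tested suffix is a single ASCII byte, the check should not depend on how awk counts accented characters.

```shell
# Hypothetical helper: reads "count word" lines (e.g. from uniq -c) on stdin
# and merges a plural into its singular when both forms are present.
merge_plurals () {
    awk '
        { count[$2] += $1 }                      # tally each word
        END {
            for (w in count) {
                n = length(w)
                stem = substr(w, 1, n - 1)       # word without its last character
                last = substr(w, n, 1)
                # fold into the stem only if the stem was itself seen
                if ((last == "s" || last == "x") && (stem in count))
                    merged[stem] += count[w]
                else
                    merged[w] += count[w]
            }
            for (w in merged) print merged[w], w
        }
    ' | sort -rn
}
```

It would replace the final `sort -rn` of the pipeline, as in `... | sort | uniq -c | merge_plurals`.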
1. I cannot provide the source data, but I can provide this file as an example. The words heure and enfant in the text illustrate the problem. The former appears twice in the text, including once as "l'heure", and helps validate whether the command works. The latter appears in both singular and plural forms (enfant/enfants) and would benefit from being merged here.
sed & co. You should try a stemmer instead. It was also suggested that you might get better answers on [so] or [linguistics.se]. – terdon Jul 19 '14 at 14:32