replacement for tr with utf-8 capabilities

Question

To isolate the last word of each line of a poem (to get a list of all the rhymes), I put together several snippets of code and ended up with this:

awk '{print $NF}' input.txt | tr 'A-Z' 'a-z'  | tr -sc 'a-z' '\n' | rev |  sort | uniq | sort -d | rev

applying it to poems like this:

Se a ciascun l'interno affanno
Si leggesse in fronte scritto
Quanti mai, che invidia fanno
Ci farebbero pietà!

I get

fanno
affanno
scritto
piet

As you can see, the word "pietà" has lost its accented character. I guess this is because tr lacks UTF-8 support. Is there a replacement for tr that can perform the same task in this one-liner while retaining UTF-8 accented characters?

Stéphane Chazelas · Accepted Answer · 2022-05-14T18:32:44.087

The limitations of the GNU implementation of tr with regard to multibyte characters, and some of its alternatives, are covered in "tr analog for unicode characters?".
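To make the multibyte problem concrete, here is a minimal Python sketch. It is only an illustration of byte-wise processing, not how tr is actually implemented, and the function name byte_tr_complement is hypothetical. In UTF-8, à is encoded as two bytes, neither of which lies in a-z, so a byte-by-byte complement of a-z turns both into newlines:

```python
def byte_tr_complement(data: bytes) -> bytes:
    # Rough imitation of a byte-oriented `tr -sc 'a-z' '\n'`:
    # every byte outside a-z becomes a newline, and runs of
    # newlines are squeezed into one.
    out = bytearray()
    for b in data:
        if ord("a") <= b <= ord("z"):
            out.append(b)
        elif not out or out[-1] != ord("\n"):
            out.append(ord("\n"))
    return bytes(out)

word = "piet\u00e0"               # "pietà" with a precomposed à
encoded = word.encode("utf-8")
print(len(word), len(encoded))     # 5 characters, 6 bytes
print(byte_tr_complement(encoded)) # b'piet\n'
```

The two bytes of à are replaced and squeezed into a single newline, which matches the truncated "piet" in the question.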

Here, you can do everything in awk (the GNU implementation at least supports multibyte characters and localisation):

< yourfile awk '{
  last = tolower($NF)
  gsub(/[^[:alpha:]]+/, "\n", last)
  print last}' |
  rev | sort -u | rev

Which gives:


pietà
fanno
affanno
scritto

Or if the intention is to get the last sequence of letters from each line, with perl (with which you can also do all the decoding as per locale, conversion to lowercase, reverse, locale collation sorting):

<your-file perl -Mopen=locale -MPOSIX -lne '
  $word{lc $1}++ if /(\p{Letter}+)\P{Letter}*$/;
  END {
    print $_->[0] for
      sort {strcoll($a->[1], $b->[1])}
      map {[$_, scalar reverse $_]} keys %word
  }'

Or with GNU tools:

<yourfile grep -Po '\pL+(?=\PL*$)' | sed 's/.*/\L&/' | rev | sort -u | rev

Or extracting the last sequence of letters with sed:

<yourfile sed -E '/([[:alpha:]]+)[^[:alpha:]]*$/!d;s//\n\L\1/;s/.*\n//' |
  rev | sort -u | rev

Which can be easier if done after the first rev:

<yourfile rev |
  sed -nE 's/^[^[:alpha:]]*([[:alpha:]]+).*$/\L\1/p' |
  sort -u | rev
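The awk approach and the "last sequence of letters" approaches above differ when the last whitespace-separated field contains punctuation. A small Python sketch of the difference (the function names are hypothetical, and the regex is only an approximation of [[:alpha:]]):

```python
import re

LETTERS = re.compile(r"[^\W\d_]+")  # runs of Unicode letters

def all_runs_of_last_field(line):
    # Like the awk version: take the last whitespace-separated field,
    # lowercase it, and keep every run of letters found in it.
    fields = line.split()
    return LETTERS.findall(fields[-1].lower()) if fields else []

def last_run_of_line(line):
    # Like the perl/grep/sed versions: keep only the last run of letters.
    return LETTERS.findall(line.lower())[-1:]

line = "Se a ciascun l'interno"
print(all_runs_of_last_field(line))  # ['l', 'interno']
print(last_run_of_line(line))        # ['interno']
```

Which behaviour is wanted depends on whether fragments such as the elided article in l'interno should appear in the rhyme list.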

@Quasímodo, my understanding is that the OP's tr -cs is to split things like l'interno into l and interno. — Stéphane Chazelas, May 14 '22 at 18:02

jubilatious1 · Answer 2 · 2022-05-15T19:50:23.993

Using Raku (formerly known as Perl_6)

raku -e 'my @a.=push: .comb(/<alpha>+/)[*-1] if .chars for lines;  \
        .put for @a.unique>>.flip.sort( *.fc.trans: "àèéìòù" => "aeeiou" )>>.flip;'

OR

raku -e 'my @a.=push: .comb(/<alpha>+/)[*-1] if .chars for lines; \
        .put for @a.unique.map(*.flip).sort( *.fc.trans: "àèéìòù" => "aeeiou" ).map(*.flip);'

Above are answers coded in Raku, which is built to handle Unicode by default. Briefly: for each line that contains characters (.chars is non-zero, i.e. the line is not empty), push the last element ([*-1]) of the list obtained by combing the line with the regex <alpha>+, which matches one or more alphabetic characters (a positive selection of the desired elements, as opposed to a destructive split). [Note: words would be a close approximation here, except that words only splits on whitespace and would therefore leave punctuation characters in the resulting elements.]

Once the @a array is filled (here my @a.=push( … ) de-sugars to my @a = @a.push( … )), the elements of @a are unique-ified, flipped, sorted, and flipped back.

Sorting is accomplished with .sort( *.fc.trans: "àèéìòù" => "aeeiou" ), meaning that each element (the *) is sorted on a key that has been case-folded with fc and has had the six accented characters translated to their unaccented counterparts: "àèéìòù" => "aeeiou". Without the trans call, words ending in those six accented characters sort to the end of the list.
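For comparison, the same flip/sort/flip idea can be sketched in Python. This is not a translation of the Raku code: the names strip_accents and rhyme_list are hypothetical, accents are removed via Unicode decomposition instead of the explicit six-letter table, the input is assumed to use precomposed accented letters, and the sort is by code point, not by locale collation.

```python
import re
import unicodedata

def strip_accents(word: str) -> str:
    # Decompose (NFD) and drop combining marks, so "à" becomes "a".
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def rhyme_list(lines):
    # Keep the last run of letters of each line, de-duplicated in input order.
    seen = {}
    for line in lines:
        runs = re.findall(r"[^\W\d_]+", line)
        if runs:
            seen.setdefault(runs[-1], None)
    # Sort on the case-folded, accent-stripped, reversed word.
    return sorted(seen, key=lambda w: strip_accents(w.casefold())[::-1])

poem = [
    "Se a ciascun l'interno affanno",
    "Si leggesse in fronte scritto",
    "Quanti mai, che invidia fanno",
    "Ci farebbero piet\u00e0!",
]
print(rhyme_list(poem))  # ['pietà', 'fanno', 'affanno', 'scritto']
```

Sorting on a derived key leaves the words themselves untouched, so the accents survive in the output while no longer pushing those words to the end of the list.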

Sample Input:

Se a ciascun l'interno affanno
Si leggesse in fronte scritto
Quanti mai, che invidia fanno
Ci farebbero pietà!

Sample Output:

pietà
fanno
affanno
scritto

I took the liberty of testing another poem by Pietro Metastasio, entitled La libertà. Sample output is shown below; note that I added .join(", ") at the end of the code to return comma-separated output (instead of one word per line). The first output below is with the call to trans: "àèéìòù" => "aeeiou", the second without it:

~$ perl6 -e 'my @a.=push: .comb(/<alpha>+/)[*-1] if .chars for lines; .put for @a.unique.map(*.flip).sort( *.fc.trans: "àèéìòù" => "aeeiou" ).map(*.flip).join(", "); put("");' file.txt
fa, ha, bella, quella, pena, catena, ragiona, sprona, ancora, talora, pietà, beltà, sciolta, volta, libertà, rinnova, prova, è, piace, spiace, infelice, Nice, ingannatrice, fé, me, penne, avvenne, core, ardore, colore, rossore, te, amante, incostante, fai, mai, guai, spezzai, dì, miei, sei, sdegni, segni, suoi, tuoi, così, dico, antico, parlando, dimando, umano, vano, sdegno, segno, hanno, sanno, dono, ragiono, sono, sincero, primiero, impero, altero, vero, aggiro, miro, curo, procuro, so, passò, appresso, oppresso, stesso, istesso, ingrato, prato, ascolto, volto, cimento, rammento, sento, contento, estinto, istinto, difetto, aspetto, consolar, parlar, sdegnar, trovar, piacer, pensier, soffrir, morir, cor, amor, favor
~$ perl6 -e 'my @a.=push: .comb(/<alpha>+/)[*-1] if .chars for lines; .put for @a.unique.map(*.flip).sort( *.fc ).map(*.flip).join(", "); put("");' file.txt
fa, ha, bella, quella, pena, catena, ragiona, sprona, ancora, talora, sciolta, volta, rinnova, prova, piace, spiace, infelice, Nice, ingannatrice, me, penne, avvenne, core, ardore, colore, rossore, te, amante, incostante, fai, mai, guai, spezzai, miei, sei, sdegni, segni, suoi, tuoi, dico, antico, parlando, dimando, umano, vano, sdegno, segno, hanno, sanno, dono, ragiono, sono, sincero, primiero, impero, altero, vero, aggiro, miro, curo, procuro, so, appresso, oppresso, stesso, istesso, ingrato, prato, ascolto, volto, cimento, rammento, sento, contento, estinto, istinto, difetto, aspetto, consolar, parlar, sdegnar, trovar, piacer, pensier, soffrir, morir, cor, amor, favor, pietà, beltà, libertà, è, fé, dì, così, passò

Caveat: because all punctuation gets removed, hyphenated words (if any) are split into component words during analysis.
