
I want to replace names that I received in uppercase letters only. The names are mixed with other information which should stay as it is. AUTH: at the beginning of the line identifies it as a line where the names ought to be replaced.

TITLE: Average title
AUTH: SUPERMAN
AFF: Something
AUTH: THE NEW ONE
AFF: Berlin
AUTH: MARS-MENSCH
AFF: Planet Mars

It should then be

TITLE: Average title
AUTH: Superman
AFF: Something
AUTH: The New One
AFF: Berlin
AUTH: Mars-Mensch
AFF: Planet Mars

My question is related to, but different from, the Unix question "Lowercase all but the first (Uppercase) letter from UPPERCASE in cyrillic": I use roman letters and have failed to adapt the solutions proposed there.

To find the files that contain such lines, I use egrep -rl ^"AUTH:" and then follow with | xargs sed -r -i '/(AUTH:)/ ??? /\1/g', where I don't know what to replace ??? with.

MERose

3 Answers

> sed '/^AUTH/{s/^AUTH: //;s/\b\([[:alpha:]]\)\([[:alpha:]]*\)\b/\u\1\L\2/g;s/^/AUTH: /;}' file
TITLE: Average title
AUTH: Superman
AFF: Something
AUTH: The New One
AFF: Berlin
AUTH: Mars-Mensch
AFF: Planet Mars
Hauke Laging
sed -e '/^AUTH:\([^[:alpha:]]*\)/!b' -e 'h;s//\1/;x;s///                                                                  
    s/\([[:alpha:]]\)\([[:alpha:]]*[^[:alpha:]]*\)/\1/g;x;s//\
\2/g
    y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;G;:l
    s/\n\(.*\n\)\(.\)/\2\1/;tl
    s/\(.*\)\n/AUTH:\1/
'<<\IN
TITLE: Average title
AUTH: SUPERMAN
AFF: Something
AUTH: THE NEW ONE
AFF: Berlin
AUTH: MARS-MENSCH
AFF: Planet Mars
AUTH: CONTRARY tO pOPULAR bELIEF, Lorem Ipsu'M is not simply random text. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance.                        
IN

To translate characters portably with sed, you need to use the y/// (translate) command and spell out each character to be translated explicitly. This actually comes in handy: it leaves very little room for confusion or guessing.
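As a minimal sketch of that point, here is a bare y/// command wrapped in a shell function. The name lower_ascii is made up for this illustration, and only the 26 unaccented Latin letters are covered; anything else passes through unchanged.

```shell
#!/bin/sh
# Hypothetical helper: lowercase A-Z using only the portable y/// command.
# Every source character is listed explicitly, paired by position with its target.
lower_ascii() {
    sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
}

printf 'MARS-MENSCH 45 BC\n' | lower_ascii
# prints: mars-mensch 45 bc
```

Because y/// maps characters one to one, digits and punctuation such as the hyphen are simply left alone.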

This script splits each AUTH: line into two parts. In the first, the leading letter of every run of one or more alphabetic characters is replaced with a \newline. The second consists of nothing but those leading letters and is kept in the hold space while the line is translated. After translation, sed Gets the held part (the G command) and, in a loop, replaces each \newline with the first character following the last \newline until none remain.

Here's a look at what the Lorem Ipsum line above looks like just before sed begins the replace loop:

 \nontrary \no \nopular \nelief, \norem \npsum \ns \not \nimply \nand\
om \next. \norem \npsum \nomes \nrom \nections 1.10.32 \nnd 1.10.33 \
\nf "\ne \ninibus \nonorum \nt \nalorum" (\nhe \nxtremes \nf \nood \n\
nd \nvil) \ny \nicero, \nritten \nn 45 \nc. \nhis \nook \ns \n \nreat\
ise \nn \nhe \nheory \nf \nthics, \nery \nopular \nuring \nhe \nenais\
sance.\nCtpbLIinsrtLIcfsaodFBeMTEoGaEbCwiBTbiatottoevpdtR$

The AUTH: prefix is stripped once the line is confirmed as a match and reinserted at the very end, so it does not appear here, but the space that follows it (and any other non-alphabetic characters before the first word) does. You can see that every word begins with a \newline; sed swaps each one, in order, for the next character in the string following the last \newline. sed doesn't distinguish between upper and lower case when saving the first letter of each word: they are all set aside and put back later.

OUTPUT:

TITLE: Average title
AUTH: Superman
AFF: Something
AUTH: The New One
AFF: Berlin
AUTH: Mars-Mensch
AFF: Planet Mars
AUTH: Contrary to popular belief, Lorem Ipsu'M is not simply random text. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 Bc. This book is a treatise on the theory of ethics, very popular during the Renaissance.
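For comparison, the behaviour described above (keep the first letter of each alphabetic run as it is, lowercase the rest) can be sketched in awk. This is not the sed script from this answer, only an illustration of the same rule. The function name titlecase_auth is hypothetical, and letters are assumed to be plain A-Z/a-z.

```shell
#!/bin/sh
# Hypothetical helper: on AUTH: lines, keep the first letter of every
# alphabetic run untouched and lowercase the remaining letters of that run.
titlecase_auth() {
    awk '
    /^AUTH:/ {
        rest = substr($0, 6)          # everything after the "AUTH:" prefix
        out = ""
        in_word = 0                   # 1 while inside a run of letters
        n = length(rest)
        for (i = 1; i <= n; i++) {
            c = substr(rest, i, 1)
            if (c ~ /[A-Za-z]/) {
                # first letter of a run is kept as is, later ones are lowercased
                out = out (in_word ? tolower(c) : c)
                in_word = 1
            } else {
                out = out c
                in_word = 0
            }
        }
        print "AUTH:" out
        next
    }
    { print }                         # all other lines pass through unchanged
    '
}

printf 'AUTH: MARS-MENSCH\nAFF: Planet Mars\n' | titlecase_auth
# prints:
# AUTH: Mars-Mensch
# AFF: Planet Mars
```

As with the sed version, nothing is ever converted to uppercase, so a word that starts in lowercase stays that way and 45 BC still comes out as 45 Bc.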
mikeserv
  • The problems that all (possible) answers will have to deal with (an impossible quest IMO ) are mainly apostrophes/proper nouns. DON'T KNOW WHAT'S UP is replaced with Don'T Know What'S Up although it should be Don't Know What's Up. No big deal to fix, I know, you can lowercase any letter following an apostrophe but then O'MALLEY becomes O'malley. Another thing: MCCABE & MRS. MILLER should be replaced with McCabe & ... not Mccabe & ..., PAIGE VANZANT should be replaced with Paige VanZant etc. Depending on the input file, this type of substitution can turn into a real nightmare. – don_crissti Dec 18 '14 at 00:54
  • @don_crissti - Oh - that's strange. I just changed it around. I probably won't handle VanZant... but, Don'T shouldn't be a problem, I think. Well, the apostrophe thing works now. But... VanZant is right out. – mikeserv Dec 18 '14 at 01:23
  • Yes, I think it's better now. As I said, it's almost impossible to deal with names like D'Artagnan or MacBride. – don_crissti Dec 18 '14 at 01:41
  • @don_crissti - I probably could handle that stuff - but not without an explicit definition list. Not in a simple script, anyway. I'm halfway imagining something like the global s///ubstitution I used here like sed ... "s/\($(printf "\(...%s...*\)*" "$@")\).*/\1/" but I'd need at least two passes at input and a whitelist array. I really appreciate your comments on this stuff, by the way - always spot-on, too. Weird that other people think about it too, actually. – mikeserv Dec 18 '14 at 02:07
  • In fact, it's good when 'T stays 'T because, as I wrote, it's names that I am going to fix and most words should start with capital letters. Look for example D'ARTAGNAN, D'HONDT or DELL'ALBA. – MERose Dec 18 '14 at 09:26
  • @MERose - wait what? You want letters following ' apostrophes not to be converted? That's simply done - but the whole D'ARTAGNAN thing is another story altogether... Ok, it won't convert the letters following a ' now. – mikeserv Dec 18 '14 at 10:15
  • Why not? d'Artagnan is the correct version, so the letter following ' (also those following a -) should be capitals. We don't care if the letter in front of that sign is upper- or lowercase. – MERose Dec 18 '14 at 12:29
  • @MERose - letters following a - are not converted - remember this only does upper to lower case conversions - it doesn't go the other way. So if a letter following an apostrophe is already uppercase it will remain that way - as is also true for dashes. Both of these are represented in the example output. Otherwise though - the lowercase d in d'Artagnan - it gets much more difficult very quickly. I can do it... but what of O'Malley for example? It's very arbitrary. – mikeserv Dec 18 '14 at 12:36
  • Yes, that's what I meant in the above code. We need D'Artagnan and don't care if d'Artagnan is the only correct version. We just think that D'Artagnan is still better than D'artagnan or D'ARTAGNAN. – MERose Dec 18 '14 at 13:35
  • @MERose - well, that's good. maybe you can find a use for it... – mikeserv Dec 18 '14 at 13:38
  • @MERose - either way, you will never get it right via a script like this (unless you manually fix the wrong stuff afterwards). You'd need a dict to handle all particular cases. And even then... Here's another one: KING RICHARD III - it's really ugly when converted to King Richard Iii. – don_crissti Dec 18 '14 at 14:09
  • True. But fortunately, we don't have a King Richard Iii in our dataset. However, I really appreciate the work of you all! – MERose Dec 18 '14 at 14:38
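The explicit definition list mentioned in the comments above could look roughly like this. It is only a sketch: the function name fix_exceptions is hypothetical, and the three spellings in the list are taken from the examples in the comments, not from any real dataset. It is meant to run after one of the title-casing commands.

```shell
#!/bin/sh
# Hypothetical post-processing step: on AUTH: lines, replace any word whose
# lowercase form is in a hand-maintained list with its canonical spelling.
fix_exceptions() {
    awk -v list="McCabe VanZant III" '
    BEGIN {
        # map the lowercase form of each listed word to its canonical spelling
        n = split(list, w, " ")
        for (i = 1; i <= n; i++) canon[tolower(w[i])] = w[i]
    }
    /^AUTH:/ {
        rest = substr($0, 6); out = ""
        # walk through the line one alphabetic run at a time
        while (match(rest, /[A-Za-z]+/)) {
            word = substr(rest, RSTART, RLENGTH)
            key = tolower(word)
            out = out substr(rest, 1, RSTART - 1) ((key in canon) ? canon[key] : word)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print "AUTH:" out rest
        next
    }
    { print }
    '
}

printf 'AUTH: King Richard Iii\n' | fix_exceptions
# prints: AUTH: King Richard III
```

Names with an apostrophe, such as d'Artagnan, are split into two runs here and so cannot be listed as one entry. That limitation matches the point made in the comments that no simple script covers every case.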

With recent versions of GNU sed:

sed -E '/^AUTH:/!b;s/([^\w:])(\w+)/\1\L\u\2/g'

With older versions:

sed -r '/^AUTH:/!b;s/([^[:alnum:]:])([[:alnum:]]+)/\1\L\u\2/g'
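To tie this back to the pipeline in the question, here is a hedged sketch that applies the character-class variant in place to every file under a directory. It assumes GNU grep, GNU xargs and GNU sed (for -Z, -0, -r, -i and the \L\u case escapes). The function name fix_auth_in_tree is made up for this example.

```shell
#!/bin/sh
# Hypothetical wrapper, GNU tools assumed: find files that contain AUTH: lines
# and title-case the words on those lines in place.
fix_auth_in_tree() {
    # -Z / -0 pass NUL-separated file names so spaces in names are safe;
    # -r stops xargs from running sed when grep finds nothing.
    grep -rlZ '^AUTH:' -- "$1" |
        xargs -0 -r sed -E -i \
            '/^AUTH:/s/([^[:alnum:]:])([[:alnum:]]+)/\1\L\u\2/g'
}

# Usage sketch on a throwaway directory
dir=$(mktemp -d)
printf 'TITLE: Average title\nAUTH: THE NEW ONE\nAFF: Berlin\n' > "$dir/record.txt"
fix_auth_in_tree "$dir"
cat "$dir/record.txt"
rm -rf "$dir"
```

Since sed -i rewrites the files directly, it is worth trying the command without -i on a copy first.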