
I have this input file on a CentOS system:

1,,,,ivan petrov,,67,
2,2,,,Vasia pupkin,director,8,
3,,,,john Lenon,,,

The task is to change it to:

1,,,,Ivan Petrov,,67,
2,2,,,Vasia Pupkin,director,8,
3,,,,John Lenon,,,

Name and surname should start with an uppercase letter.

#!/bin/bash
while IFS="," read line
do
    ns=$(echo $line | awk -F, '{print $5}')
    name=$(echo $ns | awk '{print $1}')
    surname=$(echo $ns | awk '{print $2}')
    ns=$(echo ${name^} ${surname^})
    awk -v nm="$ns" 'BEGIN{FS=OFS=","}{$5=nm}1' accnew.csv
done < <(tail -n +2 accnew.csv) > 1new.csv

That's my script, but it doesn't work correctly.

QwertyBot
  • "Doesn't work correctly" isn't very helpful. How does it fail? Are there any errors? Why are you doing this in bash? It's a very bad tool for the job. Are you open to other tools? – terdon Jun 22 '21 at 09:25
  • I can use only bash. About the error: it takes the $ns value and puts it into every line, then it takes the next $ns and also puts it into all lines – QwertyBot Jun 22 '21 at 09:37
  • You are already using awk, so you can use other tools. Why do you say you can only use bash? – terdon Jun 22 '21 at 09:39
  • Only builtin tools – QwertyBot Jun 22 '21 at 09:51
  • Neither tail nor awk is builtin in bash. Why would you want to use only builtin tools, especially in bash, which is among the least efficient of all shells? – Stéphane Chazelas Jun 22 '21 at 09:54
  • I can only use bash and commands built into CentOS – QwertyBot Jun 22 '21 at 09:56
  • OK, nothing is built in, but awk will be installed by default. So you don't need to use bash here (and you really shouldn't, it's a bad tool for the job) and you can use Stéphane's solution instead. – terdon Jun 22 '21 at 10:20
  • You seem to be skipping the first line of your input data with tail. – Kusalananda Jun 22 '21 at 10:58
  • Can you ever have middle names like Ann Sue Smith? Can you ever have single-word names like Cher? Can you have names like john mcloud that should become John McLoud, or sue jones-smith that should become Sue Jones-Smith? Can you ever have unusual names like Elon Musk's kid X Æ A-12? If your input can contain anything other than just the most basic names as shown right now, then please edit your question to include them in your example. – Ed Morton Jun 22 '21 at 18:46

5 Answers

4

Don't use a shell loop to process text. Use a text processing utility.

Here, to capitalise names in the 5th field, if the Lingua::EN::NameCase perl module is available:

perl -Mopen=locale -MLingua::EN::NameCase -F, -ae '
  $F[4] = nc $F[4] unless @F < 5;
  print join ",", @F' < your-file

If not, as an approximation, you could convert to uppercase the first character of every sequence of one or more alphanumeric ones:

perl -Mopen=locale -F, -ae '
  $F[4] =~ s/\w+/\u$&/g unless @F < 5;
  print join ",", @F' < your-file

That would, however, not properly handle names such as McGregor or van Dike, nor names with combining characters.

(perl also has proper CSV-parsing modules, in case your input is not just simple unquoted CSV like your sample.)

The same can be done with standard awk syntax, but it's a lot more cumbersome:

awk -F, -v OFS=, '
  NF >= 5 {
    r = $5; $5 = ""
    while (match(r, "[[:alnum:]]+")) {
      $5 = $5 substr(r, 1, RSTART - 1) \
           toupper(substr(r, RSTART, 1)) \
           substr(r, RSTART + 1, RLENGTH - 1)
      r = substr(r, RSTART + RLENGTH)
    }
    $5 = $5 r
  }
  {print}' < your-file

Slightly easier with GNU awk and its patsplit() function:

gawk -F, -v OFS=, '
  NF >= 5 {
    n = patsplit($5, f, /[[:alnum:]]+/, s)
    $5 = s[0]
    for (i = 1; i <= n; i++)
      $5 = $5 toupper(substr(f[i], 1, 1)) \
              substr(f[i], 2) s[i]
  }
  {print}' < your-file

If you have to use a shell loop, at least use a shell with a capitalisation operator:

#! /bin/zsh -
while IFS=, read -ru3 -A fields; do
  (( $#fields < 5 )) || fields[5]=${(C)fields[5]}
  print -r -- ${(j[,])fields} || exit
done 3< your-file

Note that that one (and the Lingua::EN::NameCase based one) differs from the other ones in that it turns éric serRA into Éric Serra instead of Éric SerRA for instance. You can achieve the same result in perl by changing \u to \u\L and in awk by applying tolower() to the second part of each word.

If you had to only use bash and its builtin commands as you indicate in comments, that would be a lot more cumbersome (in addition to being inefficient) as bash has very limited operators compared to those of zsh or ksh93 for instance and its read -a can't read separated values.

That would have to be something like (here assuming bash 4.0+ for the ${var^} operator):

#! /bin/bash -
set -o noglob -o nounset
IFS=,
re='^([^[:alnum:]]*)([[:alnum:]]+)(.*)$'
while IFS= read -ru3 line; do
  fields=( $line'' )
  if (( ${#fields[@]} >= 5 )); then
    rest="${fields[4]}" fields[4]=
    while [[ "$rest" =~ $re ]]; do
      fields[4]="${fields[4]}${BASH_REMATCH[1]}${BASH_REMATCH[2]^}"
      rest="${BASH_REMATCH[3]}"
    done
  fi
  printf '%s\n' "${fields[*]}" || exit
done 3< your-file

Those assume the input is valid text encoded in the user's locale charset (for instance, that in a UTF-8 locale the é above is encoded in UTF-8 (bytes 0xc3 0xa9), not iso8859-1 or some other charset). The bash (and possibly awk) ones will choke on NUL bytes.

As perl's \w is alnums + underscore, you'll also find a difference for strings like jean_pierre which perl would capitalise as Jean_pierre while the other ones would capitalise it as Jean_Pierre. You may need to adapt to your specific input (also consider combining characters which would also put a spanner in the works here). See also the Lingua::EN::NameCase perl module to handle even more special cases.

As far as which commands are installed by default on which systems: most systems will have perl (possibly with the Text::CSV module, but likely not Lingua::EN::NameCase) and POSIX-compliant awk and sh implementations; many (even some non-GNU systems) have bash (the GNU shell); several have GNU awk (though not some GNU-based systems such as Ubuntu, which at least in some versions prefers mawk). Few currently have zsh installed by default.

CentOS being a GNU system should have bash and gawk installed by default in addition to perl. bash and gawk even provide sh and awk there.

2

If all of your input is simple 2-word names of all English letters with no mid-word capitals like in your posted example, then using any awk in any shell on every Unix box:

$ awk '
    BEGIN { FS=OFS="," }
    { split($5,ns," "); $5 = uc(ns[1]) " " uc(ns[2]) }
    { print }
    function uc(str) { return toupper(substr(str,1,1)) substr(str,2) }
' file
1,,,,Ivan Petrov,,67,
2,2,,,Vasia Pupkin,director,8,
3,,,,John Lenon,,,
Ed Morton
1

An alternative bash take:

while IFS=, read -ra fields; do
  read -ra name <<<"${fields[4]}"
  fields[4]=${name[*]^}
  (IFS=,; echo "${fields[*]}")
done < file
1,,,,Ivan Petrov,,67
2,2,,,Vasia Pupkin,director,8
3,,,,John Lenon,,
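Note that read -ra silently drops one trailing delimiter, which is why the final comma of each input line is missing from the output above. If that matters, a sentinel comma can compensate (a sketch assuming bash 4+ and input lines that all end with a comma, like the sample; /tmp/acc.csv is a hypothetical copy of the data):

```shell
# Sample input from the question (hypothetical path).
cat > /tmp/acc.csv <<'EOF'
1,,,,ivan petrov,,67,
2,2,,,Vasia pupkin,director,8,
3,,,,john Lenon,,,
EOF

out=$(
  while IFS= read -r line; do
    # Appending a sentinel comma keeps the final empty field, similar
    # in spirit to the $line'' trick in Stéphane's answer.
    IFS=, read -ra fields <<<"$line,"
    read -ra name <<<"${fields[4]}"
    fields[4]=${name[*]^}
    (IFS=,; printf '%s\n' "${fields[*]}")
  done < /tmp/acc.csv
)
echo "$out"
```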

and perl

perl -F, -lane '
    $F[4] = join " ", map {ucfirst} split " ", $F[4];
    print join ",", @F;
' file
glenn jackman
1

Here's a perl version using the Text::CSV module mentioned in Stéphane's answer:

#!/usr/bin/perl

use strict;
use Text::CSV;

my $csv = Text::CSV->new();

while (my $row = $csv->getline(*ARGV)) {
    $row->[4] =~ s/\w+/\u$&/g;
    $csv->say(*STDOUT, $row);
}

This is a minimalist script using just the default settings. The Text::CSV module has lots of options for how input is processed (e.g. it can use other characters like : or | as the field separator instead of a comma, and header lines can be defined or even auto-parsed from the first line of the input) and for how output is generated (e.g. text fields are quoted by default, but that can be changed so that only fields containing the separator are quoted). See man Text::CSV for details.

Because it uses a real CSV parser (rather than just splitting the input on commas), it can handle any valid CSV input you give it, and it gracefully deals with most forms of invalid, not-quite-CSV files too.

Using the *ARGV file handle allows it to process data from STDIN or from one or more filenames specified on the command-line. Output is printed to STDOUT. Alternatively, the script could use the open() function to open file-handle(s) for input and/or output.

Example output:

1,,,,"Ivan Petrov",,67,
2,2,,,"Vasia Pupkin",director,8,
3,,,,"John Lenon",,,

If you want it in "one-liner" form to embed in a bash script or whatever:

$ perl -MText::CSV -e '$csv = Text::CSV->new();
                       while ($row = $csv->getline(*ARGV)) {
                         $row->[4] =~ s/\w+/\u$&/g;
                         $csv->say(*STDOUT, $row)
                       }' input.csv

BTW, if either version were modified to have $csv = Text::CSV->new({quote_space => 0});, the output would be:

1,,,,Ivan Petrov,,67,
2,2,,,Vasia Pupkin,director,8,
3,,,,John Lenon,,,

Technically, that's invalid CSV because fields with spaces in them are supposed to be double-quoted. Most programs will handle such not-quite-CSV files without a problem.

cas
  • "invalid CSV because fields with spaces in them are supposed to be double-quoted" [citation needed] – Jun 23 '21 at 09:56
  • Per the RFC, "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes." Fields containing other spaces such as blanks and tabs don't need to be quoted. – Ed Morton Jun 24 '21 at 19:34
0

Using csvjson from csvkit to turn your CSV file into JSON, and then modifying it with jq before outputting the modified data as CSV:

csvjson -H file |
jq -r '
    .[].e |= gsub(
        "(?<a>[[:alnum:]]+)"; 
        .a | sub("(?<b>.)"; .b | ascii_upcase)) |
    .[] | map(.) | @csv'

The csvjson command converts your CSV file into a JSON document: an array with one object per original CSV line, using alphabetical keys for the columns. The jq expression picks out the 5th (e) column of each object and extracts each word therein. Each word has its first character converted to upper case using jq's ascii_upcase function, and the result is then written out as properly quoted CSV data.

Given the data in the question, this would result in

1,,,,"Ivan Petrov",,67,
2,2,,,"Vasia Pupkin","director",8,
3,,,,"John Lenon",,,

This would also cope with CSV fields containing embedded commas and newlines.

Kusalananda