
I have a file with the following format, each column separated by tabs:

C1  C2  C3
a   b,c d
e   f,g,h   i
j   k   l
...

Now I need to expand each line into as many lines as there are comma-separated values in the 2nd column (when there is more than one). Each resulting line must keep exactly one of the values and drop the others. The result would be this:

C1  C2  C3
a   b   d
a   c   d
e   f   i
e   g   i
e   h   i
j   k   l
...
...

As this is needed for work ASAP, I've just put together a "don't do this at home" script that reads the file line by line in a while loop, owing to my lack of awk skills and to not having explored other possible tools. The script is as follows:

I'm revising the script in the meantime

# DON'T DO THIS AT HOME SCRIPT
> duplicados.txt
while IFS= read -r line; do
  # get the value of the column of interest
  cues="$(echo "$line" | awk -F'\t' '{ print $18 }')"
  # if the column has commas then it has multiple values
  if [[ "$cues" =~ , ]]; then
    # count the commas
    c=$(printf "%s" "$cues" | sed 's/[^,]*//g' | wc -c)
    # loop according to the number of commas
    for i in $(seq $(($c + 1))); do
      # get each value of the column of interest according to the position
      cue="$(echo "$cues" | awk -F',' -v c=$i '{ print $c; ++c }')"
      # save the line to a file substituting the whole column for the value
      echo "$line" | sed "s;$cues;$cue;" >> duplicados.txt
    done
    continue
  fi
  # save the single value lines
  echo "$line" >> duplicados.txt
done < inmuebles.txt

With this I get the desired result (as far as I can tell). As you can imagine, the script is slow and very inefficient. How could I do this with awk or other tools?

A sample of the real data is like this, the column of interest being number 18:

1409233 UNION   VIAMONTE    Estatal Provincial  DGEP    3321    VIAMONTE                            -33.7447365;-63.0997115 Rural Aglomerado    140273900   140273900-ESCUELA NICOLAS AVELLANEDA
1402961 UNION   SAN MARCOS SUD  Estatal Provincial  DGEA, DGEI, DGEP    3029, 3311, Z11 SAN MARCOS SUD                          -32.629557;-62.483976 / -32.6302699949582;-62.4824499999125 / -32.632417;-62.484932 Urbano  140049404, 140164000, 140170100, 140173100  140049404-C.E.N.M.A. N° 201 ANEXO SEDE SAN MARCOS SUD, 140164000-C.E.N.P.A. N° 13 CASA DE LA CULTURA(DOC:BERSANO), 140170100-ESCUELA HIPOLITO BUCHARDO, 140173100-J.DE INF. HIPOLITO BUCHARDO
1402960 UNION   SAN ANTONIO DE LITIN    Estatal Provincial  DGEA, DGEI, DGETyFP 3029, TZONAXI, Z11  SAN ANTONIO DE LITIN    3601300101020009    360102097366    0250347         SI / SI -32.212126;-62.635999 / -32.2122558;-62.6360432 / -32.2131931096409;-62.6291815804363   Rural Aglomerado    140049401, 140313000, 140313300, 140483400, 140499800   140049401-C.E.N.M.A. N° 201 ANEXO SAN ANTONIO DE LITIN, 140313000-I.P.E.A. Nº 214. MANUEL BELGRANO, 140313300-J.DE INF. PABLO A. PIZZURNO, 140483400-C.E.N.P.A. DE SAN ANTONIO DE LITIN, 140499800-C.E.N.P.A. B DE SAN ANTONIO DE LITIN
  • In your real data, it is not possible to tell a tab from a literal space. Can you change all tabs to \t so that we can test it on real data? – guest_7 May 01 '22 at 02:34

5 Answers


You could do it in awk by splitting the compound column on , and looping over the result:

awk -F'\t' 'BEGIN{OFS=FS} {n=split($2,a,/,/); for(i=1;i<=n;i++){$2 = a[i]; print}}' file
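As a quick sanity check, the one-liner can be run against the sample from the question (the file name file is assumed only for this demonstration):

```shell
# the 4-line tab-separated sample from the question
printf 'C1\tC2\tC3\na\tb,c\td\ne\tf,g,h\ti\nj\tk\tl\n' > file
# one output line per comma-separated value in column 2
awk -F'\t' 'BEGIN{OFS=FS} {n=split($2,a,/,/); for(i=1;i<=n;i++){$2 = a[i]; print}}' file
# 7 lines out: the header and the j/k/l row pass through unchanged,
# the b,c and f,g,h rows expand to 2 and 3 lines respectively
```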

Perhaps more cleanly, you could do it with Miller - in particular, using the nest verb:

$ cat file
C1      C2      C3
a       b,c     d
e       f,g,h   i
j       k       l

$ mlr --tsv nest --explode --values --across-records --nested-fs ',' -f C2 file
C1      C2      C3
a       b       d
a       c       d
e       f       i
e       g       i
e       h       i
j       k       l

More compactly, --explode --values --across-records --nested-fs ',' may be replaced by --evar ','

steeldriver

As you also tagged the question with sed, I feel urged to add a sed solution:

sed -e '/,/{s//\n/;h;s/[^\t]*\n//;x;s/\n[^\t]*//p;G;D;}'

(Note: For readability I used \n for newline and \t for tab, as you can do with GNU sed. For a portable solution, use a backslash followed by an actual newline instead of \n, and an actual tab for \t, entered as Ctrl-V followed by Tab.)

Lines with a comma are copied to the hold space, one copy gets printed with what is before the comma, the other copy goes to the next cycle with the part after the comma. In detail:

  • To avoid getting mixed up with multiple commas, we replace one with a newline: s//\n/
  • save one copy to the hold space before we mess up the line
  • s/[^\t]*\n// removes the part up to the first comma
  • then we exchange buffers with x
  • s/\n[^\t]*//p removes the part starting from the comma and prints it
  • G appends the hold space to the pattern space. This can contain additional commas, so
  • D removes the first (already printed) line and starts over with the rest of the line
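To see it work end to end (GNU sed and the file name file are assumed in this demonstration):

```shell
# the tab-separated sample from the question
printf 'C1\tC2\tC3\na\tb,c\td\ne\tf,g,h\ti\nj\tk\tl\n' > file
# \n and \t inside the script are GNU sed extensions
sed -e '/,/{s//\n/;h;s/[^\t]*\n//;x;s/\n[^\t]*//p;G;D;}' file
# 7 lines out, matching the desired result in the question
```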
Philippos

awk (or perl in awk mode) is probably the best standard solution, but you can do this reasonably efficiently in most shells, especially those with arrays (ksh, bash, zsh):

set -f # split but don't glob unquoted substitutions

# bash
while IFS=$'\t' read -ra ary; do
# ksh (alternative loop header):
# while read -r line; do IFS=$'\t'; ary=($line)
# (zsh: I haven't worked out the equivalent)

  IFS=,; for v in ${ary[17]}; do
    ary[17]=$v; IFS=$'\t'; printf '%s\n' "${ary[*]}"
  done

done <input >output

  • bash/ksh arrays are 0-origin, versus the 1-origin fields in awk
  • we don't need to special-case the no-comma lines: they split to a single value
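A minimal, runnable sketch of the bash variant (the file names demo.sh and file are assumed; index 1, 0-based, stands in for 17 since the sample has only 3 columns):

```shell
# demo.sh: the bash array loop, adapted to the 3-column sample
cat > demo.sh <<'EOF'
set -f                          # split but don't glob
while IFS=$'\t' read -ra ary; do
  IFS=,
  for v in ${ary[1]}; do        # split column 2 (index 1) on commas
    ary[1]=$v; IFS=$'\t'
    printf '%s\n' "${ary[*]}"   # re-join all fields with tabs
  done
done < file
EOF

# the 4-line sample from the question, tab-separated
printf 'C1\tC2\tC3\na\tb,c\td\ne\tf,g,h\ti\nj\tk\tl\n' > file
bash demo.sh    # 7 lines out, one per comma-separated value
```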

For old/limited shells without arrays, use the positional parameters instead, like this (details may vary):

set -f
while read -r line; do IFS=$'\t'; set -- $line
  IFS=,; for v in ${18}; do
    # can't alter $num so yucky
    for i in $(seq $#); do
      case $i in (1);; (*) printf '\t';; esac
      case $i in (18) printf %s "$v";; (*) eval printf %s \"\${$i}\";; esac
    done
    # or maybe i=1; while [ $i -le $# ]; do ... i=$((i+1)); done
    # where [/test is likely shell builtin and seq is unlikely 
  done
done <input >output

Using perl

## Column of Interest
CoI=2 
perl -sF'\t' -aple '$"="\t";
  $_ = join $\, map { $F[$I]=$_;"@F" } split /,/, $F[$I]
' -- -I="$((CoI-1))" file
  • perl options used:-

    • -p causes Perl to read the input line by line and print each record before the next cycle.
    • -a turns on autosplit mode, whereby the input record ($_) is split and the components put in the array @F.
    • -l makes the input record separator and output record separator newlines.
    • -s turns on rudimentary switch processing. With its help we set a global variable $I on the command line.
    • -F specifies the field separator.
    • -e specifies the Perl code.
  • perl builtin variables used:-

    • $_ current record being processed.
    • $" array elements joiner
    • @F current record split fields stored in this array. It is zero indexed
    • $\ output record separator
  • Perl code:-

    • split the column of interest $F[$I] around commas, progressively assign each split element to the column of interest, join the array @F with the $" variable, and assign the result to the input record ($_)
    • default action is to autoprint the input record.

Using awk:

CoI=2
awk -F '\t' -v coi="$CoI" '
BEGIN { OFS=FS;s[1]=ORS }
NF >= coi {
  split($(coi),a,",")
  for (i=t=""; ++i in a;) {
    $(coi) = a[i]
    t = t s[i>1] $0
  }
  $0=t
}1
' file
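The awk version can be exercised the same way on the sample (file name assumed; column of interest 2):

```shell
# the tab-separated sample from the question
printf 'C1\tC2\tC3\na\tb,c\td\ne\tf,g,h\ti\nj\tk\tl\n' > file
CoI=2
awk -F '\t' -v coi="$CoI" '
BEGIN { OFS=FS; s[1]=ORS }
NF >= coi {
  split($(coi),a,",")
  for (i=t=""; ++i in a;) {
    $(coi) = a[i]            # substitute one value into the column
    t = t s[i>1] $0          # collect the expanded records, ORS-joined
  }
  $0 = t
}1
' file
# 7 lines out, identical to the desired result
```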

Using GNU sed in extended regex mode (-E):

CoI=2
sed -E '
  s/[^\t]+/\n&\n/'"$CoI"'
  s/(\n.*)(\n.*)/\2\1,/
  :loop
    s/\n(.*\n)([^,]+),/\2\1/
    P;/\n$/d
    s/[^\t]+/\n/'"$CoI"'
  tloop
' file

This trace shows how the pattern space and the output progress as sed processes one record:

   pat_spc       output
 a b,c,d e         -
 a \nb,c,d\n e     -
 a \n e\nb,c,d,    -
 a b e\nc,d,      a b e
 a \n e\nc,d,       -
 a c e\nd,        a c e
 a \n e\nd,         -
 a d e\n          a d e
/\n$/ stop, fetch next line

Using python3:

CoI=2
python3 -c 'import sys

ifile,coi = sys.argv[1:]
coi = int(coi)-1
fs,rs,ofs,ors = ("\t","\n") * 2

with open(ifile) as f:
    for l in f:
        F = l.rstrip(rs).split(fs)
        for e in F[coi].split(","):
            F[coi] = e
            print(*F,sep=ofs)
' file "$CoI"


Using Bourne shell builtins and the positional parameter array:

cleanup() {
  echo cleaning up temp files... >&2
  /bin/rm -f -- "$temp"
}
trap cleanup EXIT
set -u

#-------------------+
# user input section
#-------------------+
CoI=2
inp='inmuebles.txt'
#-------------------+

: <<_README_
  1. CoI stands for column of interest.
  2. CoI must be a positive integer.
  3. CoI must not be more than the number of fields in the input.
  4. inp stores the input file name, possibly with full or relative path,
     to make it accessible to the script.
  5. All lines must have the same number of fields.
  6. No field may contain a TAB and/or newline.
  7. Field separator is TAB.
  8. File should be readable by the user and be a regular ascii text file.
_README_

IFS=$(printf '\t')
temp=$(mktemp)
del=

while IFS= read -r line <&3
do
  set -f;set -- $line;set +f

  : ${del:=$(dc <<eof
1 $# $CoI-+n
eof
)}

  while case $# in "$del") break;; esac
  do
    printf '%s\t' "$1"
    shift
  done > "$temp"

  x1=$1;shift

  for csv in $(set -f;IFS=',';set -- $x1;printf '%s\t' "$@")
  do
    printf '%s' "$(cat < "$temp")" "$csv"
    case $# in 0) echo; break;; esac
    printf '\t%s\n' "$*"
  done
done 3< "$inp"


guest_7
while read line
do
  fic=$(echo $line | awk '{print $1}')
  laco=$(echo $line | awk '{print $NF}')
  secon_colu=$(echo $line | awk '$2 ~ /,/{print $2}')
  if [[ "$secon_colu" =~ "," ]]
  then
    for ko in $(echo $line | awk '$2 ~ /,/{print $2}' | sed 's/,/ /g')
    do
      echo "$fic $ko  $laco"
    done
  else
    echo $line
  fi
done < file.txt

Output:

C1 C2 C3
a b  d
a c  d
e f  i
e g  i
e h  i
j k l