6

File.tsv is a tab-delimited file with 7 columns:

cat File.tsv
1   A   J               1
2   B   K   N           1
3   C   L   O   P   Q   1

The following reads File.tsv and stores the 7 columns of each row in an array A.

while IFS=$'\t' read -r -a D; do
    A=("${A[@]}" "${D[i]}" "${D[$((i + 1))]}" "${D[$((i + 2))]}" "${D[$((i + 3))]}" "${D[$((i + 4))]}" "${D[$((i + 5))]}" "${D[$((i + 6))]}")
done < File.tsv
nA=${#A[@]}
for ((i = 0; i < nA; i = i + 7)); do
    SlNo="${A[i]}"
    Artist="${A[$((i + 1))]}"
    VideoTitle="${A[$((i + 2))]}"
    VideoId="${A[$((i + 3))]}"
    TimeStart="${A[$((i + 4))]}"
    TimeEnd="${A[$((i + 5))]}"
    VideoSpeed="${A[$((i + 6))]}"
done

Issue

Certain entries are empty in tsv files, but the empty values are skipped while reading the file.

Note

In the tsv file, empty values are preceded and followed by a tab character.
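For example, the skipping can be reproduced on a single line (bash; the sample values mirror the first row of File.tsv):

```shell
# TAB is an IFS whitespace character, so the run of tabs between
# "J" and "1" is treated as one delimiter and the empty fields vanish
printf '1\tA\tJ\t\t\t\t1\n' | {
    IFS=$'\t' read -r -a D
    echo "fields read: ${#D[@]}"
    printf '<%s>\n' "${D[@]}"
}
```

Only 4 of the 7 fields survive.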

Desired Solution

Empty values should be read and stored in the array.

Porcupine
  • 1,892

5 Answers

10

As I said in my comments, this is not a job for a shell script. bash (and similar shells) are for co-ordinating the execution of other programs, not for processing data.

Use any other language instead - awk, perl, and python are good choices. It will be easier to write, easier to read and maintain, and much faster.

Here's an example of how to read your text file into an Array of Hashes (AoH) in perl, and then use the data in various print statements.

An AoH is a data structure that is exactly what its name says it is - an array where each element is an associative array (aka hash).

BTW, this could also be done with an Array of Arrays (AoA) data structure (also known as a List of Lists or LoL), but it's convenient to be able to access fields by their field name instead of having to remember their field number.

You can read more about perl data structures in the Perl Data Structures Cookbook, which is included with perl. Run man perldsc or perldoc perldsc. You probably also want to read perllol and perlreftut too, and perldata if you're not familiar with perl variables ("Perl has three built-in data types: scalars, arrays of scalars, and associative arrays of scalars, known as hashes". A "scalar" is any single value, like a number or a string or a reference to another variable).

Perl comes with a lot of documentation and tutorials - run man perl for an overview. The included perl docs come to about 14MB, so it's often in a separate package in case you don't want to install it. On debian: apt install perl-doc. Also, each library module has its own documentation.

#!/usr/bin/perl -l

use strict;

# Array to hold the hashes for each record
my @data;

# Array of field header names. This is used to insert the
# data into the %record hash with the right key AND to
# ensure that we can access/print each record in the right
# order (perl hashes are inherently unordered so it's useful
# and convenient to use an indexed array to order it)
my @headers=qw(SlNo Artist VideoTitle VideoId TimeStart TimeEnd VideoSpeed);

# main loop, read in each line, split it by single tabs, build into
# a hash, and then push the hash onto the @data array.
while (<>) {
  chomp;
  my %record = ();

  my @line = split /\t/;

  # iterate over the indices of the @line array so we can use
  # the same index number to look up the field header name
  foreach my $i (0..$#line) {
    # insert each field into the hash with the header as key.
    # if a field contains only whitespace, then make it empty
    ($record{$headers[$i]} = $line[$i]) =~ s/^\s+$//;
  }

  push @data, \%record;
}

# show how to access the AoH elements in a loop:
print "\nprint \@data in a loop:";
foreach my $i (0 .. $#data) {
  foreach my $h (@headers) {
    printf "\$data[%i]->{%s} = %s\n", $i, $h, $data[$i]->{$h};
  }
  print;
}

# show how to access individual elements
print "\nprint some individual elements:";
print $data[0]->{'SlNo'};
print $data[0]->{'Artist'};

# show how the data is structured (requires Data::Dump
# module, comment out if not installed)
print "\nDump the data:";
use Data::Dump qw(dd);
dd \@data;

FYI, as @Sobrique points out in a comment, the my @line =... and the entire foreach loop inside the main while (<>) loop can be replaced with just a single line of code (perl has some very nice syntactic sugar):

  @record{@headers} = map { s/^\s+$//; $_ } split /\t/;

Note: Data::Dump is a perl module for pretty-printing entire data-structures. Useful for debugging, and making sure that the data structure actually is what you think it is. And, not at all co-incidentally, the output is in a form that can be copy-pasted into a perl script and assigned directly to a variable.

It's available for debian and related distros in the libdata-dump-perl package. Other distros probably have it packaged too. Otherwise get it from CPAN. Or just comment out or delete the last three lines of the script - it's not necessary to use it here, it's just another way of printing the data that's already printed in the output loop.

Save it as, say, read-tsv.pl, make it executable with chmod +x read-tsv.pl and run it:

$ ./read-tsv.pl file.tsv                                    
print @data in a loop:
$data[0]->{SlNo} = 1
$data[0]->{Artist} = A
$data[0]->{VideoTitle} = J                        
$data[0]->{VideoId} = 
$data[0]->{TimeStart} = 
$data[0]->{TimeEnd} = 
$data[0]->{VideoSpeed} = 1

$data[1]->{SlNo} = 2
$data[1]->{Artist} = B
$data[1]->{VideoTitle} = K
$data[1]->{VideoId} = N
$data[1]->{TimeStart} = 
$data[1]->{TimeEnd} = 
$data[1]->{VideoSpeed} = 1

$data[2]->{SlNo} = 3
$data[2]->{Artist} = C
$data[2]->{VideoTitle} = L
$data[2]->{VideoId} = O
$data[2]->{TimeStart} = P
$data[2]->{TimeEnd} = Q
$data[2]->{VideoSpeed} = 1

print some individual elements:
1
A

Dump the data:
[
  {
    Artist => "A",
    SlNo => 1,
    TimeEnd => "",
    TimeStart => "",
    VideoId => "",
    VideoSpeed => 1,
    VideoTitle => "J",
  },
  {
    Artist => "B",
    SlNo => 2,
    TimeEnd => "",
    TimeStart => "",
    VideoId => "N",
    VideoSpeed => 1,
    VideoTitle => "K",
  },
  {
    Artist => "C",
    SlNo => 3,
    TimeEnd => "Q",
    TimeStart => "P",
    VideoId => "O",
    VideoSpeed => 1,
    VideoTitle => "L",
  },
]

Notice how the nested for loops print the data structure in the exact order we want (because we iterated over the @headers array), while just dumping it with the dd function from Data::Dump outputs the records sorted by key name (which is how Data::Dump deals with the fact that hashes in perl aren't ordered).


Other comments

Once you have your data in a data structure like this, it's easy to insert it into an SQL database like mysql/mariadb or postgresql or sqlite3. Perl has database modules (see DBI) for all of those and more.

(In debian, etc, these are packaged as libdbd-mysql-perl, libdbd-mariadb-perl, libdbd-pg-perl, libdbd-sqlite3-perl, and libdbi-perl. Other distros will have different package names)

BTW, the main parsing loop could also be implemented using another perl module called Text::CSV, which can parse CSV and similar file formats like Tab separated. Or with DBD::CSV which builds on Text::CSV to allow you to open a CSV or TSV file and run SQL queries against it as if it were an SQL database.

In fact, it's a fairly trivial 10-15 line script to use these modules to import a CSV or TSV file into an SQL database, and most of that is boilerplate setup stuff...the actual algorithm is a simple while loop to run a SELECT query on the source data and an INSERT statement into the destination.

Both of these modules are packaged for debian, etc, as libtext-csv-perl and libdbd-csv-perl. Probably packaged for other distros too. and, as always, available on CPAN.

cas
  • 78,579
  • 2
    BTW, I forgot to mention, you can read more about perl data structures in the Perl Data Structures Cookbook. Run man perldsc or perldoc perldsc. You probably also want to read perllol and perlreftut too. perl comes with a lot of documentation and tutorials - run man perl for an overview. – cas Jun 14 '21 at 02:18
  • 2
    You can simplify your foreach loop considerably with my @record{@header} = split /\t/; using hash slice mechanics. https://perldoc.perl.org/perldata#Slices – Sobrique Jun 14 '21 at 08:54
  • @Sobrique Yeah, but I wrote this to be as simple and straight-forward and easy to understand as possible - loops are common to all languages while perl syntactic sugar isn't. I even hesitated before using ($record{$headers[$i]} = $line[$i]) =~ s/^\s+$//; because modifying a variable with a regexp as it's being assigned is potentially confusing to people unfamiliar with perl. – cas Jun 14 '21 at 09:03
  • 1
    @cas please add Raku to your list of acceptable scripting languages ("awk, perl, and python"). Example code below. Thank you. – jubilatious1 Jun 14 '21 at 09:54
  • 4
    @jubilatious1 It's not a list of acceptable languages. It's only a list of example languages, a tiny subset of "Use any other language instead" that I said. I've got nothing against raku (I even upvoted your answer because it's nice to see an answer using something other than awk/perl/sed here and raku is cool), but don't feel it's necessary to edit my answer just to provide another example - the potential additions are endless. – cas Jun 14 '21 at 10:01
  • @cas Thank you for clarifying! Speaking personally, I'm always on the lookout for new tidbits of advice here on SO, so your Perl code was a nice exposition ;-). As for languages, I'm just afraid i have to disagree, however. I don't think i can name half-a-dozen high-level, open-source, scripting languages that I would spend time mastering for text processing. Just my opinion, YMMV. Finally, someone here on U&L DID mention a new language other than the 5 we've mentioned, but I can't find the link!! I'll add it in a new comment if I run across it. Best. – jubilatious1 Jun 14 '21 at 16:42
  • @cas The language/program I couldn't recall a day ago is named Miller: https://miller.readthedocs.io/en/latest/ – jubilatious1 Jun 15 '21 at 21:40
4

As \t is a whitespace character (a character for which isspace() returns true or grep '[[:space:]]' matches on), that's the standard word splitting behaviour as specified by POSIX: sequences of IFS-whitespace characters (the whitespace characters that are present in $IFS) are treated as one delimiter and leading, trailing ones or the ones on either side of a non-whitespace IFS character are ignored.

Initially, in ksh where that "feature" comes from¹, that special treatment was limited to the TAB, NL and space characters, but POSIX changed it to any character that is considered whitespace in the locale. In practice though, most shells still only do it for TAB/NL/SPC (which also happen to be in the default value of $IFS). The only exception (the only shell that is POSIX-compliant in that regard) that I know is yash.

In bash, the behaviour changed in 5.0, and now bash behaves like ksh93 in that the special treatment is applied to whitespace characters, but only those that are encoded on one byte.

Compare the treatment of the en space character with that of the carriage return:

$ s=$'\u2000' a="a${s}${s}b" bash -c 'IFS=$s; [[ $s = [[:space:]] ]] && echo yes;  printf "<%s>\n" $a'
yes
<a>
<>
<b>
$ s=$'\r' a="a${s}${s}b" bash -c 'IFS=$s; [[ $s = [[:space:]] ]] && echo yes;  printf "<%s>\n" $a'
yes
<a>
<b>

So in bash 5.0+, for bash to no longer give that special treatment to TAB, you'd need to construct a locale where TAB is not considered whitespace. Even then, that would not be enough: even for non-whitespace IFS characters, such as a comma instead of TAB, a,b,c, is split into a, b, and c instead of a, b, c and the empty string - a trailing empty field is still dropped.
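The two behaviours are easy to observe side by side in bash:

```shell
# TAB is IFS whitespace: the run of tabs collapses to one delimiter
IFS=$'\t' read -r -a w <<< $'a\t\tb'
echo "tab-separated: ${#w[@]} fields"        # a, b

# "," is not whitespace: the inner empty field survives,
# but the trailing one is still dropped
IFS=, read -r -a n <<< 'a,,b,'
echo "comma-separated: ${#n[@]} fields"      # a, empty, b
printf '<%s>\n' "${n[@]}"
```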

Better here would be to use a shell where the special treatment of whitespace characters with regards to IFS splitting can be disabled, or have a proper splitting operator, or better still as @cas says, use a proper programming language such as perl instead of a shell which is not designed for that.

In ksh93 or zsh, the special treatment can be removed by doubling the corresponding character in $IFS. zsh's word splitting also doesn't discard a trailing empty element. So in zsh,

IFS=$'\t\t' read -rA array

would actually work. There, you could also do:

IFS= read -r line && array=( "${(@ps[\t])line}" )

To read a line and then split it with the s parameter expansion flag.

In ksh93, you could do the splitting with:

set -o noglob
IFS=$'\t\t'
IFS= read -r line && array=( $line'' )

(there's still a difference with the zsh approaches above in that an empty line results in an array with one empty element instead of an array with no element).

ksh93 does support multidimensional arrays which look like they could be of help to you:

i=0
IFS=$'\t\t'; set -o noglob
typeset -a array=()
while IFS= read -r line; do
  array[i++]=( $line'' )
done < file.tsv

And then ${array[2][5]} would give you the 6th field of the 3rd line for instance.


¹ IFS-splitting initially comes from the Bourne shell, but in the Bourne shell, all characters received that treatment, not just TAB/NL/SPC

3

Using Raku (formerly known as Perl_6)

All code below performed at the bash command line:

raku -e 'my @a = lines>>.split("\t"); .raku.put for @a;'  test_tsv.tsv

RETURNS:

$(("1", "A", "J", "", "", "", "1").Seq)
$(("2", "B", "K", "N", "", "", "1").Seq)
$(("3", "C", "L", "O", "P", "Q", "1").Seq)

Above, lines are each split on \t tabs. The .raku method is used to visualize empty strings. The >> hyper operator is short for a mapping, as in .map(*.split("\t")). The code above returns an object with 3 .Seq elements, however this is easily coerced to a list:

raku -e 'my @a = lines>>.split("\t"); .list.raku.put for @a;'  test_tsv.tsv
("1", "A", "J", "", "", "", "1")
("2", "B", "K", "N", "", "", "1")
("3", "C", "L", "O", "P", "Q", "1")

The elems method is used in Raku to get a count of elements, either elements of the entire array object, or elements of each 'line':

raku -e 'my @a = lines>>.split("\t"); @a.elems.put;'  test_tsv.tsv
3
raku -e 'my @a = lines>>.split("\t"); @a>>.elems.put;'  test_tsv.tsv
7 7 7

Once you're satisfied that you've obtained the structure you desire, it's easy to simplify the code, or add/remove 'Perlish/Rakuish' idioms such as for (to get desired output for visualization, or further manipulation):

raku -e 'my @a = lines>>.split("\t"); .put for @a;'  test_tsv.tsv
1 A J    1
2 B K N   1
3 C L O P Q 1

raku -e 'my @a = lines>>.split("\t"); @a>>.list.raku.put;' test_tsv.tsv
(("1", "A", "J", "", "", "", "1"), ("2", "B", "K", "N", "", "", "1"), ("3", "C", "L", "O", "P", "Q", "1"))

https://raku.org/
https://docs.raku.org/

jubilatious1
  • 3,195
  • 8
  • 17
1

Bash seems to like grouping whitespace characters together, so if you're intent on using bash, then you'll need to replace each \t with something else (I personally like the \034 Field Separator character -- seems fitting).

line=$'col1\tcol2\t\tcol4'
line=$(echo "${line}" | tr '\t' '\034')
IFS=$'\034' line_array=( ${line}'' )
echo "${line_array[@]@Q}"
# 'col1'  'col2'  ''  'col4'

note the '' in ( ${line}'' ) is necessary to handle cases where a line ends with an empty field (see Stéphane's comment for a caveat)
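The effect of the trailing '' can be checked directly (bash, using \034 as the separator as above):

```shell
set -o noglob
IFS=$'\034'
line=$'a\034b\034'      # the line ends in an empty field
A=( ${line} )           # without '': the trailing empty field is lost
B=( ${line}'' )         # with '': it is kept
echo "${#A[@]} ${#B[@]}"    # prints: 2 3
```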

In your case you could use:

set -o noglob
while IFS= read -r line; do
    IFS=$'\034' A=( ${line}'' )

    # Do something with "${A[...]}"
    echo "Processing row: ${A[@]@Q}"
    for value in "${A[@]}"; do
        echo "got value: '${value}'"
    done

    echo
done < <(tr '\t' '\034' < File.tsv)

Example Output

For the following File.tsv...

row1col1    *   row1col3
    *   row2col3
row3col1        row3col3
row4col1    *   

...the while loop would output:

Processing row: 'row1col1' '*' 'row1col3'
got value: 'row1col1'
got value: '*'
got value: 'row1col3'

Processing row: '' '' 'row2col3'
got value: ''
got value: ''
got value: 'row2col3'

Processing row: 'row3col1' '' 'row3col3'
got value: 'row3col1'
got value: ''
got value: 'row3col3'

Processing row: 'row4col1' '' ''
got value: 'row4col1'
got value: ''
got value: ''

Edit

Implemented the following suggestions from Stéphane:

  • instead of using parameter expansion to replace \t with \034 on each line in the loop, just run tr once on the entire file
  • replace ( ${line} ) with ( ${line}'' ) to handle cases where a line ends with an empty field
  • use set -o noglob to handle fields that equal *
  • 1
    In A=( ${line//$'\t'/$'\034'} ), the ${line...} expansion is unquoted so in bash is still subject to split+glob, so you'd also need set -o noglob (see with a line containing /*<TAB>* for instance). – Stéphane Chazelas Sep 15 '22 at 18:35
  • Note that it splits foo<TAB> into "foo" only instead of "foo" and "". A=( ${line...}'' ) would preserve that empty element (but also means that an empty line is split into one empty element). – Stéphane Chazelas Sep 15 '22 at 18:37
  • echo can't be used for arbitrary data anyway. If you apply tr once to the whole input (<file tr '\t' '\34' | while...) as opposed to one on every line, that will likely be faster than using bash parameter expansion operators that are notoriously inefficient (though making read's input a pipe instead of a regular files means it will need to read it one byte at a time, degrading performance). – Stéphane Chazelas Sep 15 '22 at 18:41
  • @StéphaneChazelas Thank you so much! I can't believe I didn't test having an empty last column. And running tr on the whole file did in fact speed things up just a pinch :) For my own use, I'm not reading a file but rather looping over results from a SQL query where I've put each row into an array, so the byte-by-byte pipe thing is a non-issue. But I've added all of this to my response! – Shaun Mitchell Sep 15 '22 at 19:27
  • Instead of <file tr '\t' '\34' | while..., I used while ... done < <(tr '\t' '\034' < file). I believe that should still read line by line instead of byte-by-byte, no? Is there some reason that would be an inoptimal solution (other than that it's still using bash lol) – Shaun Mitchell Sep 15 '22 at 19:29
    That makes no difference, <(...) is also a pipe (named here as /dev/fd/n or a real named pipe on systems that don't have /dev/fd). As the input is not seekable, read needs to read lines one byte at a time to make sure it doesn't read past the line delimiter. With seekable files, it can optimise by reading by blocks and seeking back to just after the line delimiter – Stéphane Chazelas Sep 16 '22 at 06:39
0

If you do want to do it in bash (and I agree that other languages are better), try something like:

sed -e 's/\t/,\t/g' File.tsv | \
while IFS=$'\t' read -r -a D; do
    A=("${A[@]}" "${D[i]%,}" "${D[$((i + 1))]%,}" "${D[$((i + 2))]%,}" "${D[$((i + 3))]%,}" "${D[$((i + 4))]%,}" "${D[$((i + 5))]%,}" "${D[$((i + 6))]%,}")
done 
nA=${#A[@]}

The trick here is to add something to each field (in this case, a comma at the end) so that no field is blank. Then strip it later. If you need to distinguish between an empty and a missing field, add the extra character at the end as well.
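A minimal bash illustration of the pad-then-strip idea (using parameter expansion instead of sed, so no external command is needed):

```shell
line=$'1\tA\t\t\t1'                 # 5 fields, two of them empty
padded=${line//$'\t'/$',\t'}        # a "," ends every field but the last
IFS=$'\t' read -r -a D <<< "$padded"
printf '<%s>\n' "${D[@]%,}"         # strip the padding again
```

All five fields survive, including the empty ones.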

It is also possible to fill in a dummy value for an empty field, though this requires that you be able to choose a dummy value that cannot occur as real data. Note that you need to do the replacement twice, because adjacent tab pairs overlap and a single global pass misses every other empty field:

sed -e 's/\t\t/\tdummy\t/g;s/\t\t/\tdummy\t/g' File.tsv | \

Also, you should simplify by removing i, and using +=. Thus:

    A+=("${D[0]%,}" "${D[1]%,}" "${D[2]%,}" "${D[3]%,}" "${D[4]%,}" "${D[5]%,}" "${D[6]%,}")

If you aren't going to vary the number of columns, I would also recommend reading into seven distinct variables.
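Reading each row straight into named variables could then look like this (a sketch combining the padding trick with the question's field names):

```shell
line=$'1\tA\tJ\t\t\t\t1'            # one row of File.tsv
IFS=$'\t' read -r SlNo Artist VideoTitle VideoId TimeStart TimeEnd VideoSpeed \
    <<< "${line//$'\t'/$',\t'}"
printf 'VideoTitle=<%s> VideoId=<%s> VideoSpeed=<%s>\n' \
    "${VideoTitle%,}" "${VideoId%,}" "${VideoSpeed%,}"
```

Each variable holds its own column, with empty fields preserved as empty strings once the padding comma is stripped.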

David G.
  • 1,369