As I said in my comments, this is not a job for a shell script. bash (and similar shells) are for co-ordinating the execution of other programs, not for processing data.
Use any other language instead - awk, perl, and python are good choices. It will be easier to write, easier to read and maintain, and much faster.
Here's an example of how to read your text file into an Array of Hashes (AoH) in perl, and then use the data in various print statements.
An AoH is a data structure that is exactly what its name says it is - an array where each element is an associative array (aka hash).
BTW, this could also be done with an Array of Arrays (AoA) data structure (also known as a List of Lists or LoL), but it's convenient to be able to access fields by their field name instead of having to remember their field number.
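To make that difference concrete, here's a tiny comparison using the first record of the sample data:

# AoA: each element is an array reference; fields are accessed by position
my @aoa = ( [ 1, 'A', 'J' ] );
print $aoa[0][1];       # 'A' -- you have to remember that field 1 is the Artist

# AoH: each element is a hash reference; fields are accessed by name
my @aoh = ( { SlNo => 1, Artist => 'A', VideoTitle => 'J' } );
print $aoh[0]{Artist};  # 'A' -- the key name says what the field is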
You can read more about perl data structures in the Perl Data Structures Cookbook, which is included with perl - run man perldsc or perldoc perldsc. You probably also want to read perllol and perlreftut too, and perldata if you're not familiar with perl variables ("Perl has three built-in data types: scalars, arrays of scalars, and associative arrays of scalars, known as hashes". A "scalar" is any single value, like a number or a string or a reference to another variable).
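For a quick illustration of those three types (the variable names here are made up for the example):

my $count = 42;                                  # scalar: a single value
my @artists = ('A', 'B', 'C');                   # array: ordered, indexed by number
my %video = (Artist => 'A', VideoTitle => 'J');  # hash: indexed by key name
my $aref = \@artists;                            # scalar holding a reference to the array
print $aref->[0];                                # prints 'A'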
Perl comes with a lot of documentation and tutorials - run man perl for an overview. The included perl docs come to about 14MB, so they're often in a separate package in case you don't want to install them. On debian: apt install perl-doc. Also, each library module has its own documentation.
#!/usr/bin/perl -l

use strict;

# Array to hold the hashes for each record
my @data;

# Array of field header names. This is used to insert the
# data into the %record hash with the right key AND to
# ensure that we can access/print each record in the right
# order (perl hashes are inherently unordered so it's useful
# and convenient to use an indexed array to order it)
my @headers = qw(SlNo Artist VideoTitle VideoId TimeStart TimeEnd VideoSpeed);

# main loop: read in each line, split it by single tabs, build it into
# a hash, and then push the hash onto the @data array.
while (<>) {
  chomp;
  my %record = ();
  my @line = split /\t/;

  # iterate over the indices of the @line array so we can use
  # the same index number to look up the field header name
  foreach my $i (0..$#line) {
    # insert each field into the hash with the header as key.
    # if a field contains only whitespace, then make it empty
    ($record{$headers[$i]} = $line[$i]) =~ s/^\s+$//;
  }

  push @data, \%record;
}

# show how to access the AoH elements in a loop:
print "\nprint \@data in a loop:";
foreach my $i (0 .. $#data) {
  foreach my $h (@headers) {
    printf "\$data[%i]->{%s} = %s\n", $i, $h, $data[$i]->{$h};
  }
  print;
}

# show how to access individual elements
print "\nprint some individual elements:";
print $data[0]->{'SlNo'};
print $data[0]->{'Artist'};

# show how the data is structured (requires the Data::Dump
# module; comment out if not installed)
print "\nDump the data:";
use Data::Dump qw(dd);
dd \@data;
FYI, as @Sobrique points out in a comment, the my @line = split /\t/; line and the entire foreach loop inside the main while (<>) loop can be replaced with just a single line of code (perl has some very nice syntactic sugar):

@record{@headers} = map { s/^\s+$//; $_ } split /\t/;
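With that change, the main input loop of the script above collapses to this (the output is the same for data like this):

while (<>) {
  chomp;
  my %record;
  @record{@headers} = map { s/^\s+$//; $_ } split /\t/;
  push @data, \%record;
}

@record{@headers} is a hash slice: it assigns the list on the right-hand side to the listed keys, one value per key, in the same order as @headers.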
Note: Data::Dump is a perl module for pretty-printing entire data-structures. Useful for debugging, and making sure that the data structure actually is what you think it is. And, not at all co-incidentally, the output is in a form that can be copy-pasted into a perl script and assigned directly to a variable.
It's available for debian and related distros in the libdata-dump-perl package. Other distros probably have it packaged too. Otherwise get it from CPAN. Or just comment out or delete the last three lines of the script - it's not necessary to use it here, it's just another way of printing the data that's already printed in the output loop.
Save it as, say, read-tsv.pl, make it executable with chmod +x read-tsv.pl, and run it:
$ ./read-tsv.pl file.tsv

print @data in a loop:
$data[0]->{SlNo} = 1
$data[0]->{Artist} = A
$data[0]->{VideoTitle} = J
$data[0]->{VideoId} =
$data[0]->{TimeStart} =
$data[0]->{TimeEnd} =
$data[0]->{VideoSpeed} = 1

$data[1]->{SlNo} = 2
$data[1]->{Artist} = B
$data[1]->{VideoTitle} = K
$data[1]->{VideoId} = N
$data[1]->{TimeStart} =
$data[1]->{TimeEnd} =
$data[1]->{VideoSpeed} = 1

$data[2]->{SlNo} = 3
$data[2]->{Artist} = C
$data[2]->{VideoTitle} = L
$data[2]->{VideoId} = O
$data[2]->{TimeStart} = P
$data[2]->{TimeEnd} = Q
$data[2]->{VideoSpeed} = 1

print some individual elements:
1
A

Dump the data:
[
  {
    Artist     => "A",
    SlNo       => 1,
    TimeEnd    => "",
    TimeStart  => "",
    VideoId    => "",
    VideoSpeed => 1,
    VideoTitle => "J",
  },
  {
    Artist     => "B",
    SlNo       => 2,
    TimeEnd    => "",
    TimeStart  => "",
    VideoId    => "N",
    VideoSpeed => 1,
    VideoTitle => "K",
  },
  {
    Artist     => "C",
    SlNo       => 3,
    TimeEnd    => "Q",
    TimeStart  => "P",
    VideoId    => "O",
    VideoSpeed => 1,
    VideoTitle => "L",
  },
]
Notice how the nested for loops print the data structure in the exact order we want (because we iterated over the @headers array), while just dumping it with the dd function from Data::Dump outputs each record sorted by key name (which is how Data::Dump deals with the fact that hashes in perl aren't ordered).
Other comments
Once you have your data in a data structure like this, it's easy to insert it into an SQL database like mysql/mariadb or postgresql or sqlite3. Perl has database modules (see DBI) for all of those and more.
(In debian, etc, these are packaged as libdbd-mysql-perl, libdbd-mariadb-perl, libdbd-pg-perl, libdbd-sqlite3-perl, and libdbi-perl. Other distros will have different package names.)
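As a minimal sketch of that idea (SQLite, the videos.db filename, and the videos table name are just illustrative choices), the @data array built by the script above could be loaded like this:

use DBI;

# connect to (creating if necessary) an SQLite database file
my $dbh = DBI->connect("dbi:SQLite:dbname=videos.db", "", "",
                       { RaiseError => 1 });

# one column per field header
$dbh->do("CREATE TABLE IF NOT EXISTS videos
          (SlNo, Artist, VideoTitle, VideoId, TimeStart, TimeEnd, VideoSpeed)");

# a hash slice on each hashref pulls the fields out in @headers order
my $sth = $dbh->prepare("INSERT INTO videos VALUES (?,?,?,?,?,?,?)");
$sth->execute( @{$_}{@headers} ) for @data;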
BTW, the main parsing loop could also be implemented using another perl module called Text::CSV, which can parse CSV and similar file formats like tab-separated. Or with DBD::CSV, which builds on Text::CSV to allow you to open a CSV or TSV file and run SQL queries against it as if it were an SQL database.
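For example, here's a sketch of the main parsing loop using Text::CSV (the filename is illustrative; the only non-default settings are the tab separator and binary-safe mode). Unlike a plain split, this also copes with quoted fields containing embedded tabs:

use Text::CSV;

my $csv = Text::CSV->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

open my $fh, '<', 'file.tsv' or die "file.tsv: $!";
while (my $row = $csv->getline($fh)) {
  my %record;
  @record{@headers} = @$row;   # same hash slice trick as above
  push @data, \%record;
}
close $fh;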
In fact, it's a fairly trivial 10-15 line script to use these modules to import a CSV or TSV file into an SQL database, and most of that is boilerplate setup stuff...the actual algorithm is a simple while loop to run a SELECT query on the source data and an INSERT statement into the destination.
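Here's roughly what that looks like (a sketch, not a polished script: the filenames, the videos table name, and the assumption that the TSV's first line is a header row are all illustrative):

#!/usr/bin/perl
use strict;
use DBI;

# source: treat *.tsv files in the current directory as tables;
# DBD::CSV takes the column names from the first row of each file
my $src = DBI->connect("dbi:CSV:", undef, undef,
    { f_dir => ".", f_ext => ".tsv", csv_sep_char => "\t", RaiseError => 1 });

# destination: an SQLite database
my $dst = DBI->connect("dbi:SQLite:dbname=videos.db", undef, undef,
    { RaiseError => 1 });
$dst->do("CREATE TABLE IF NOT EXISTS videos
          (SlNo, Artist, VideoTitle, VideoId, TimeStart, TimeEnd, VideoSpeed)");

# the actual algorithm: SELECT from the source, INSERT into the destination
my $sel = $src->prepare('SELECT * FROM file');   # "file" maps to ./file.tsv
my $ins = $dst->prepare('INSERT INTO videos VALUES (?,?,?,?,?,?,?)');

$sel->execute;
while (my @row = $sel->fetchrow_array) {
  $ins->execute(@row);
}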
Both of these modules are packaged for debian, etc, as libtext-csv-perl and libdbd-csv-perl. They're probably packaged for other distros too and, as always, available on CPAN.
Comments:

Is read not capable of doing this? It would be more useful if you could explain what you said. – Porcupine Jun 13 '21 at 14:49

bash is known to have difficulty handling whitespace delimited strings, and TAB is whitespace. You could tr "\t" "," at the beginning of your pipeline, if grep "," File.tsv is empty. I agree with the Perl answers. – waltinator Jun 13 '21 at 16:20