
I have not used bash much in the past for scripting and am currently learning it. I have a file containing a number of fields stored in CSV format. The first script below gathers all the IPs into a file; however, I am also trying to gather the IP along with another field called Network. Does anyone know how I can achieve this?

files=`ls | grep data_batch_`
for file in ${files[@]}
do
  cat ${file} | cut -d , -f2 | grep -v "IP" > data_${file}
done

I have tried to add a boolean operator, but with no success. I also tried more pipes. I have not used bash often, so I may be missing something with the syntax or not understanding why this is not allowed.

files=`ls | grep data_batch`
for file in ${files[@]}
do
  cat ${file} | cut -d , -f2 | cut -d, -f3 | grep -v "IP" && "Network" > data_${file}
done

For some reason when I do this, it appears to overwrite the IP value with the Network value instead of storing them both. Essentially, all I am trying to do is print two fields to a file instead of one, but I am not sure how to achieve this. Any tips will help.

My desired output is the IP address value and the network value stored in the file. At the moment, all I get is the IP. The desired output is below.

1.1.1.1
Network5
– Lam (edited by Jeff Schaller)

3 Answers


There are numerous problems with your script:

files=`ls | grep data_batch_`
for file in ${files[@]}
do
  cat ${file} | cut -d , -f2 | grep -v "IP" > data_${file}
done
  1. Don't Parse ls

  2. Don't use backticks. Use $() instead. It does the same thing but doesn't break quoting and can be nested.
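    A quick demonstration of the nesting point — with `$()`, the inner substitution needs no special escaping:

    ```shell
    # $() nests cleanly; with backticks the inner pair would need backslashes
    outer=$(echo "inner is: $(echo hello)")
    echo "$outer"    # prints: inner is: hello
    ```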

  3. You're using files in the for loop as if it's an array, but it's not an array. You're defining it as a scalar string (the output of ls | grep ...). If you want to define an array, you need to use parentheses, e.g.

    This defines files as a string:

    $ files=$(echo 1 2 3)
    $ declare -p files
    declare -- files="1 2 3"
    

    While this defines it as an array:

    $ files=( $(echo 1 2 3) )
    $ declare -p files
    declare -a files=([0]="1" [1]="2" [2]="3")
    

    Alternatively, you could use mapfile (AKA readarray):

     $ mapfile -t files < <(printf "%s\n" 1 2 3)
     $ declare -p files
     declare -a files=([0]="1" [1]="2" [2]="3")
    
  4. Double-quote your variable expansions. Using curly-braces is NOT a substitute for quoting. See Why does my shell script choke on whitespace or other special characters? and $VAR vs ${VAR} and to quote or not to quote for reasons why.
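    A quick demonstration of why braces alone don't protect you (the filename here is just an example):

    ```shell
    # A filename with a space shows why ${file} is not a substitute for "$file"
    file="my data.csv"
    printf '%s\n' ${file}    # word-splits into two arguments: "my" and "data.csv"
    printf '%s\n' "$file"    # one argument, as intended
    ```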

  5. In your second script, you are piping the output of cut -d, -f2 into cut -d, -f3. That's not going to work.

    The first cut only outputs one field (field 2). The second cut will output exactly the same because there's only one field (or no fields, since there's no comma) in its input and you're telling it to output non-existent field 3. Try running echo 1,2,3 | cut -d, -f2 and then run echo 1,2,3 | cut -d, -f2 | cut -d, -f3 and you'll see that the output for both commands is identical: 2.

    To output two fields with cut -f, list them separated by commas. For example:

    cut -d, -f2,3
    

    BTW, you can also specify a range of fields with -, e.g. if you wanted to output fields 2 to 5, you'd use: cut -d, -f2-5. See man cut.
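    For example, on a sample line:

    ```shell
    echo 'a,b,c,d,e,f' | cut -d, -f2,3    # prints: b,c
    echo 'a,b,c,d,e,f' | cut -d, -f2-5    # prints: b,c,d,e
    ```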

  6. I don't know if this is a problem or not, but it's something to be aware of. Your scripts are redirecting stdout to output files with the same name as the input file, but prefixed with data_. So if your input file is data_batch_1.csv then your output file is going to be data_data_batch_1.csv.

    This may be exactly what you want, in which case it's not a problem - but it will mean that if you run the script again, the file glob will match both your original input files and the output files generated by the first run....resulting in filenames like data_data_data_batch_1.csv. You may want to consider using a different naming convention for the output files, or write them to a different directory.
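    One sketch of the second option (the `output/` directory name is just an example):

    ```shell
    # Write results to a separate directory so re-runs of the script
    # never re-match the generated files
    mkdir -p output
    for file in data_batch_*; do
      cut -d, -f2,3 "$file" | grep -v IP > "output/$file"
    done
    ```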


Anyway, those are the problems. Here are a few solutions. Try something more like this:

for file in *data_batch_*; do
  cut -d, -f2,3 "$file" | grep -v IP > "data_$file"
done

If you really wanted to use an array of filenames, you could use mapfile and find with -print0. e.g.

mapfile -t -d '' files < <(find . -maxdepth 1 -type f -name '*data_batch_*' -print0)
for file in "${files[@]}"; do
   cut -d, -f2,3 "$file" | grep -v IP > "data_$file"
done

Alternatively, you could use awk instead of cut:

awk -F, -v OFS=, '$2$3 !~ /IP/ { print $2, $3 > ("data_" FILENAME) }' *data_batch_*

If neither $2 nor $3 contain "IP", then print them with stdout redirected to a file with the same name as the current filename (awk's FILENAME variable), prefixed with the string "data_".

This will be significantly faster because it doesn't have to fork cut and grep multiple times - once for every file it processes.


Finally, CSV files can (and often do) contain double-quoted string fields - and those quoted fields can contain commas. Simple comma-delimited files without quotes and without commas embedded within fields can be processed reliably with cut. Actual CSV with all its optional extras requires a CSV parser. Your best option for this is to use either:

  1. a language which already has a full-featured CSV parser - e.g. perl has the Text::CSV module and python includes a csv library.

  2. a tool like Miller or csvkit
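For example, here is a minimal sketch of option 1, driving python's csv module from the shell (the file name and field positions are just for illustration):

```shell
# A quoted field containing an embedded comma, which would break cut:
printf 'name,ip,network\n"Smith, Inc",1.1.1.1,Network5\n' > sample.csv

# python's csv module handles the quoting correctly and
# extracts the 2nd and 3rd fields of every record
python3 -c '
import csv, sys
with open(sys.argv[1], newline="") as f:
    for row in csv.reader(f):
        print(row[1], row[2], sep=",")
' sample.csv
```

`cut -d, -f2,3` on the same file would split inside the quoted `"Smith, Inc"` field and output the wrong columns.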

– cas

If you have awk available:

$ cat /tmp/abc
name1,0.0.0.0,NetworkName1
name2,0.4.2.3,NetworkName2
name3,0.1.43.5,NetworkName3

$ awk 'BEGIN { FS = "," } ;{printf $2","$3"\n"}' /tmp/abc
0.0.0.0,NetworkName1
0.4.2.3,NetworkName2
0.1.43.5,NetworkName3

so in this case,

for i in $(ls | grep -E ^test.*[.]csv$)
do
    cat $i | cut -d , -f2,3 >> testing.txt
done

can become

$ awk 'BEGIN { FS = "," } ;{printf $2","$3"\n"}' test*.csv > testing.txt

If you do frequent structured text processing, investing some time in learning awk is going to be beneficial.

  • Wow, this is a very useful command. Do you know if there's a way to add a conditional to this? For example, if I want to output $3 only if it is equal to a certain value? – Lam Jan 27 '22 at 02:09
  • yes,

    $ awk 'BEGIN { FS = "," } ;{ if ($3=="something") {print $3;} else {printf $2","$3"\n";}}' test*.csv > testing.txt

    you may want to have a look at https://books.google.se/books/about/sed_awk.html?id=Xu0G31e-4gIC&redir_esc=y

    – i_am_on_my_way_to_happiness Jan 28 '22 at 12:47

I had some luck with the following:

Contents of directory:

$ ls
test.csv  test1.csv  test3csv test5.txt

where each of the files contains some lines like the following:

name1,0.0.0.0,NetworkName1
name2,0.4.2.3,NetworkName2
name3,0.1.43.5,NetworkName3

The script:


for i in $(ls | grep -E ^test.*[.]csv$)
do
    cat $i | cut -d , -f2,3 >> testing.txt
done

This takes all files that start with test, and end with .csv, cuts out fields two and three, and appends them to the file testing.txt.

The output file looks like this afterwards

0.0.0.0,NetworkName1
0.4.2.3,NetworkName2
0.1.43.5,NetworkName3

listing each IP address and each Network name on a separate line.

In your script, the reason you see things in the output file being overwritten is that you are using the > operator, which truncates the file before writing, while what you probably want is the >> operator, which appends the text to the end of the file.
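The difference is easy to see (`out.txt` is just an example file):

```shell
# '>' truncates the file on every redirection:
echo first  > out.txt
echo second > out.txt
cat out.txt             # only "second" remains

# '>>' appends instead of truncating:
echo third >> out.txt
cat out.txt             # "second" followed by "third"
```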