Attempting to use a variable set from a read of a .csv file as the pattern to grep a second .csv file

Question

I have two .csv files. The first file has a column of words. The second has two columns with the first column containing values that match one of the entries in the first file. I want to read the first file line by line and use each line to make a grep query of the second file. Current code shown

while read line
do
grep $line ./filetwo.csv
done < fileone.csv

This code produces nothing. If I replace $line with a variable that is not assigned by the read of a file it works perfectly. I have been looking at this issue for years and have never found an answer to what looks like a simple problem. I do not understand why a variable assigned by a read of a .csv file does not provide the same results as a variable that is directly assigned. I am using a zsh shell.

A not infrequent cause of issues like this is that one or both of the files has DOS-style (CRLF) line endings — steeldriver, Aug 05 '23 at 17:05

Stéphane Chazelas · Answer 1 · 2023-08-06T17:26:08.513

CSV files are more common in the Microsoft world so you may find that:

they are encoded in UTF-16 rather than the locale's charset so would need to be converted.
or they are encoded in UTF-8 but with a byte-order-mark.
they have CRLF line delimiter.
their last line is not delimited (so read would return false on them).

You can check if that's the case with file yourfile.csv.

Then you could do:

dos2unix < fileone.csv |
  while IFS=, read -r first rest_if_any_ignored; do
    dos2unix < filetwo.csv | grep -Fe "$first"
  done

(note the -F for fixed string search instead of the default which would do regex matching (the re in grep)), but that would be rather inefficient as that runs three commands for each line of fileone.csv and each grep processes the contents filetwo.csv from the start each time.

That also looks for the $first string anywhere in filetwo.csv, not just the first column and doesn't do exact matches. For instance, if $first is foo, that will report both foobar,other and other,foobar lines. That also doesn't handle CSV quoting. So you may be better off use a language with proper CSV parsing.

If those files are simple CSVs, that is without quoting nor headers, that would be a job for join here:

preprocess() {
  dos2unix -O -- "$@" | sort -t, -k1b,1
}
join -t, <(preprocess < fileone.csv) <(preprocess < filetwo.csv)

For real CSVs, with headers and possibly quoting (including of data containing newline characters), you could use a CSV parser such as mlr and its join verb.

For instance, if the first column is called foo in fileone.csv and bar in filetwo.csv:

mlr --csv join -j foo -r bar -f fileone.csv filetwo.csv

That handles CRLF, undelimited lines and UTF-8 with BOM, but not UTF-16 which you'd need to convert to UTF-8 first either with dos2unix or iconv.

mlr can also do simple CSVs and several other tabulated formats. Check its manual for details.

Attempting to use a variable set from a read of a .csv file as the pattern to grep a second .csv file

1 Answers1