CSV files are more common in the Microsoft world so you may find that:
- they are encoded in UTF-16 rather than the locale's charset so would need to be converted.
- or they are encoded in UTF-8 but with a byte-order-mark.
- they have CRLF line delimiter.
- their last line is not delimited (so
read
would return false on them).
You can check if that's the case with file yourfile.csv
.
Then you could do:
dos2unix < fileone.csv |
while IFS=, read -r first rest_if_any_ignored; do
dos2unix < filetwo.csv | grep -Fe "$first"
done
(note the -F
for fixed string search instead of the default which would do regex matching (the re
in grep
)), but that would be rather inefficient as that runs three commands for each line of fileone.csv
and each grep
processes the contents filetwo.csv
from the start each time.
That also looks for the $first
string anywhere in filetwo.csv
, not just the first column and doesn't do exact matches. For instance, if $first
is foo
, that will report both foobar,other
and other,foobar
lines. That also doesn't handle CSV quoting. So you may be better off use a language with proper CSV parsing.
If those files are simple CSVs, that is without quoting nor headers, that would be a job for join
here:
preprocess() {
dos2unix -O -- "$@" | sort -t, -k1b,1
}
join -t, <(preprocess < fileone.csv) <(preprocess < filetwo.csv)
For real CSVs, with headers and possibly quoting (including of data containing newline characters), you could use a CSV parser such as mlr
and its join
verb.
For instance, if the first column is called foo
in fileone.csv
and bar
in filetwo.csv
:
mlr --csv join -j foo -r bar -f fileone.csv filetwo.csv
That handles CRLF, undelimited lines and UTF-8 with BOM, but not UTF-16 which you'd need to convert to UTF-8 first either with dos2unix
or iconv
.
mlr
can also do simple CSVs and several other tabulated formats. Check its manual for details.