Fill empty cell in pipe-separated-value file with default text

Question

I have a PSV file with empty cells. I want to fill the empty cells with the valule 00000000000

Example input

0000000001|00346743139|201901|07
0000000002||201901|00

Desired output

0000000001|003467431|201901|07
0000000002|00000000000|201901|00

I have tried by using

sed -i "s/||/|00000000000|/g" filename

What happened when you tried your sed command? Error messages? Unexpected output? We weren't with you so you need to describe what happened — Chris Davies, Jan 12 '21 at 07:13
Can you check the first line ? Why 00346743139 should became 003467431 ? Your sed command is good, the modification is done in place check your file — Ôrel, Jan 12 '21 at 07:49

AdminBee · Answer 1 · 2021-01-12T09:05:28.373

Your sed command looks ok, and it works if I try it with your input example.

As a general alternative for parsing pipe-separated files, or in general table-formatted values, I would recommend awk, however. It seems you want to replace "empty" columns with 00000000000. To do that, you can use

awk 'BEGIN{FS=OFS="|"} {for (i=1;i<=NF;i++) {if ($i=="") $i="00000000000"}} 1' filename

This will first instruct awk to consider | the field separator for input and output (FS is the input field separator, OFS is the output field separator.
It will then iterate over all fields (=columns) of a line, and if it finds an empty field, set its value to 00000000000 instead ($i referring to "field nr. i of the current line", and NF being an auto-generated variable containing the number of fields).
In the end, it will print all lines including the possible modifications (the seemingly stray 1 at the end)

You can also restrict this easily to only apply to field nr. 2, in case you want to handle other empty columns differently:

awk 'BEGIN{FS=OFS="|"} {if ($2=="") $2="00000000000"} 1' filename

This seems somewhat overkill when the OP hasn't explained why their (apparently valid) sed command isn't sufficient — Chris Davies, Jan 12 '21 at 08:34
@roaima You could say that and I wouldn't really object. However, the OP obviously deals with formatted (though not structured) data, and in these cases using awk is almost always better than using sed, which is what I wanted to point out (while providing a different approach at solving the OPs task). — AdminBee, Jan 12 '21 at 08:38

Kusalananda · Accepted Answer · 2021-01-12T15:14:01.780

Your sed command appears to be replacing all non-overlapping occurrences of the string || with |00000000000|. It is unclear how this does not solve your issue, at least for the data supplied:

% sed 's/||/|00000000000|/g' file.csv
0000000001|00346743139|201901|07
0000000002|00000000000|201901|00

Note that your expected output seems to truncate 00346743139 into 003467431 (the last two integers are removed). It is unclear whether this is intentional.

If there is an issue in using the non-standard -i option with the sed implementation on your Unix, then please see the post How can I achieve portability with sed -i (in-place editing)?

For example, on macOS, your command would give you

$ sed -i "s/||/|00000000000|/g" filename
sed: 1: "filename": invalid command code f

due to the way the -i option is used differently on that system. Use -i '' instead on macOS (and read man sed on your system about this option).

For more robust CSV parsing, you may want to use an actual CSV parser.

Using the csvkit tools to do the CSV parsing, and jq to do the actual processing:

% csvjson -I -H file.csv | jq -r '.[] | map(select(. == null) |= "00000000000") | @csv' | csvformat -D '|'
0000000001|00346743139|201901|07
0000000002|00000000000|201901|00

This first converts your CSV file to JSON using csvjson. The options we use here turns off type inference (so that the numbers are interpreted as strings), and tells the utility that there is no header line in the CSV data. The csvjson tool will automatically detect that | is the correct delimiter used in the data, but you may also explicitly tell it to use | as the delimiter using -d '|'.

The jq code then replaces all empty values with the string 00000000000 and formats the processed data into CSV again.

Since you want to have pipe-delimited output, csvformat is used to change the delimiters outputted by @csv in jq into |-characters.

Redirect the output to a new filename, and then optionally replace the original data with that new file.

Both csvkit and jq are available for macOS via the Homebrew package manager.

non-overlapping -> I think you have likely nailed it, since sed 's/||/|0|/g' <<< '|||' would give |0||, not |0|0|. Of course, we can only guess until the asker edits. — Quasímodo, Jan 12 '21 at 11:28
@Quasímodo I think it's more likely that they are on macOS and are having issues with sed: 1: "filename": invalid command code f since -i works differently from with GNU sed. — Kusalananda, Jan 12 '21 at 11:31

Fill empty cell in pipe-separated-value file with default text

2 Answers2