How do I backslash-ignore a delimiter passed into cut?

Question

I have the following use case:

echo "some comment char '\;' embedded in strings   ; along with inline comments" \
| cut -d';' -f 1

I want:

some comment char ';' embedded in strings

I get:

some comment char '\

How do I hide the delimiter configured for cut from cut, as in this use case? Ideally, cut would read and respect the backslash, but if not that way, is there another?

note that I'd like to use cut as opposed to other more versatile tools like sed, awk, perl, python, etc. Tools like tr or at most grep are fine. — Chris, Jul 21 '23 at 20:30
cut simply isn't that fancy. If you tell it that ; is your delimiter, then every ; counts; there is no escaping. — larsks, Jul 21 '23 at 20:36
Well awk would've been more readable but you can use grep's PCRE as follows: .... |grep -oP '.*(?=(;.*?))' and get the result you want — Valentin Bajrami, Jul 21 '23 at 21:13
... or maybe use both positive lookahead and negative lookbehind grep -Po '.*?(?=(?<!\\);)' although I think perhaps plain perl perl -F'(?<!\\);' -lne 'print $F[0]' is clearer — steeldriver, Jul 21 '23 at 21:16
What about \\;? Can you be sure there will be no quoted backslash? If not then you have to actually parse the string which would be ugly. — Hauke Laging, Jul 21 '23 at 21:32
Change -f 1 to -f 2. Let cut count all the semicolons, adjust your expectations. — waltinator, Jul 21 '23 at 23:11
@waltinator that won't give the desired result, either (try it, and compare the result to what the poster is asking for). Maybe you were thinking cut -f1-2, which will fetch the first two fields, and results in some comment char '\;' embedded in strings. — larsks, Jul 21 '23 at 23:41
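To make the comments above concrete, here is a quick check of what cut returns for the sample line. It is only an illustration: the variable name line is an assumption, and printf is used instead of echo so that backslash handling does not depend on the shell.

```shell
# The sample line from the question, stored in a (hypothetical) variable.
line="some comment char '\;' embedded in strings   ; along with inline comments"

# cut splits at every ; regardless of the backslash in front of it.
printf '%s\n' "$line" | cut -d';' -f1     # some comment char '\
printf '%s\n' "$line" | cut -d';' -f2     # ' embedded in strings (plus trailing blanks)
printf '%s\n' "$line" | cut -d';' -f1-2   # some comment char '\;' embedded in strings (plus trailing blanks)
```

So -f1-2 only happens to work here because the line contains exactly one escaped delimiter before the comment, and the backslash is still left in the output.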

Stéphane Chazelas · Answer 1 · 2023-11-07T13:47:23.683

With GNU grep or compatible (for the non-standard though nowadays fairly common -o option):

grep -Eo '^(\\.|[^\\;])*'

That matches and outputs, at the start of the line (^), the sequence of 0 or more (*)¹ of either a \ followed by any single character (.), which covers an escaped ; but also an escaped \, or any character other than \ and ;.

Example:

$ cat file
foo\;bar;baz
foo\\;bar;baz
$ grep -Eo '^(\\.|[^\\;])*' file
foo\;bar
foo\\

To remove that escaping, pipe to sed 's/\\\(.\)/\1/g', or do the whole thing in sed if your sed supports the -E option as well:

$ sed -E 's/^((\\.|[^\\;])*).*/\1/; s/\\(.)/\1/g' file
foo;bar
foo\
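As a sketch of the two-stage grep-then-sed variant applied to the line from the question: the function name strip_comment is hypothetical, and a grep that supports -o is assumed.

```shell
# strip_comment: hypothetical helper; reads lines on stdin.
strip_comment() {
    grep -Eo '^(\\.|[^\\;])*' |   # keep everything before the first unescaped ;
    sed 's/\\\(.\)/\1/g'          # then remove one level of backslash escaping
}

printf '%s\n' "some comment char '\;' embedded in strings   ; along with inline comments" |
    strip_comment
```

This prints some comment char ';' embedded in strings, followed by the three blanks that preceded the unescaped ; in the input, since nothing here trims trailing whitespace.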

Or with perl:

$ perl -lpe 's/^(\\.|[^;])*+\K.*//; s/\\(.)/$1/g' file
foo;bar
foo\

¹ Though note that grep -o won't output empty matches.
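The footnote matters when a line starts with the delimiter. A small comparison, assuming GNU grep and a sed with -E (the variable name input is made up for the example):

```shell
# Two input lines: the first is nothing but a comment.
input=$(printf '%s\n' ';whole line is a comment' 'code;comment')

# grep -o drops the first line entirely, because its match is empty.
printf '%s\n' "$input" | grep -Eo '^(\\.|[^\\;])*'

# sed keeps it as an empty line, so input and output line numbers stay aligned.
printf '%s\n' "$input" | sed -E 's/^((\\.|[^\\;])*).*/\1/'
```

If the output has to stay line-for-line parallel to the input, the sed (or perl) form is the safer choice.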

Ed Morton · Answer 2 · 2023-11-07T13:27:13.000

Using any awk in any shell on every Unix box:

$ echo "some comment char '\;' embedded in strings   ; along with inline comments" |
awk -F';' '{gsub(/\\\\/,RS); gsub(/\\;/,"\\\\"); gsub(/\\\\/,";",$1); gsub(RS,"\\",$1); print $1}'
some comment char ';' embedded in strings

and borrowing @Stéphane's sample input file:

$ cat file
foo\;bar;baz
foo\\;bar;baz

$ awk -F';' '{gsub(/\\\\/,RS); gsub(/\\;/,"\\\\"); gsub(/\\\\/,";",$1); gsub(RS,"\\",$1); print $1}' file
foo;bar
foo\

and extending that to include a line with more fields:

$ cat file
foo\;bar;baz
foo\\;bar;baz
foo\\;bar\;this\;that\\;baz;here\;and\;there

we can print any or all of the fields as we like (here also outputting the original line first and the field number at the start of each output line that contains a single field):

$ awk -F';' '{print; gsub(/\\\\/,RS); gsub(/\\;/,"\\\\"); for (i=1; i<=NF; i++) { gsub(/\\\\/,";",$i); gsub(RS,"\\",$i); print "   " i, $i }; print "---" }' file
foo\;bar;baz
   1 foo;bar
   2 baz
---
foo\\;bar;baz
   1 foo\
   2 bar
   3 baz
---
foo\\;bar\;this\;that\\;baz;here\;and\;there
   1 foo\
   2 bar;this;that\
   3 baz
   4 here;and;there
---

The above:

1. converts every \\ in the current input line ($0) into a newline (the default value of RS), which is a string that cannot exist within a newline-separated record, so that \\; in the input is handled as an escaped backslash followed by a delimiter rather than as an escaped semi-colon, then
2. converts every \; in $0 into \\, which is now also a string that cannot exist in $0 since step 1 just converted them all to RSs, to get rid of the troublesome ; in it, then
3. relies on the fact that modifying $0 causes awk to resplit $0 into fields at every remaining ;, which puts our desired target string in $1, then
4. converts every \\ (created at step 2) in $1 to ;, then
5. converts every RS (created at step 1) in $1 back to a single \ (the unescaped form of the original \\), then
6. prints that field, $1.

That approach will work for every RS that is a literal string, as defined by POSIX. If your RS is a regexp, as supported by some awks (e.g. GNU awk), then come up with a string without regexp metachars that matches that regexp and use it as the replacement instead of RS.
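The steps above can be traced with a small sketch that prints $0 after each substitution. It is a variant, not the exact script from the answer: it uses awk's SUBSEP character as the second placeholder instead of \\, so that the sketch does not depend on how a given awk treats backslashes in gsub() replacement strings. The names trace_first_field and show are hypothetical, and the input is assumed not to contain the SUBSEP character.

```shell
trace_first_field() {
    awk -F';' '
    # Print a string with both placeholders made visible (illustration only).
    function show(label, s) {
        gsub(RS, "<RS>", s)
        gsub(SUBSEP, "<SEP>", s)
        print label ": " s
    }
    {
        show("input", $0)
        gsub(/\\\\/, RS);     show("step 1", $0)   # escaped backslash -> RS
        gsub(/\\;/, SUBSEP);  show("step 2", $0)   # escaped semicolon -> SUBSEP
        show("step 3", $1)                          # $0 was resplit on the remaining ;
        gsub(SUBSEP, ";", $1)                       # step 4: SUBSEP -> ;
        gsub(RS, "\\", $1)                          # step 5: RS -> one backslash
        show("result", $1)                          # step 6
    }'
}

printf '%s\n' 'foo\;bar\\;baz' | trace_first_field
# input: foo\;bar\\;baz
# step 1: foo\;bar<RS>;baz
# step 2: foo<SEP>bar<RS>;baz
# step 3: foo<SEP>bar<RS>
# result: foo;bar\
```

The trace shows why the order matters: handling \\ first means the ; after it is left as a real delimiter, while \; is hidden before awk resplits the record.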
