I once came up with this which we can refine to:
perl -0777 -pe '
s{
# /* ... */ C comments
/ (?<lc> # line continuation
(?<bs> # backslash in its regular or trigraph form
\\ | \?\?/
)
(?: \n | \r\n?) # handling LF, CR and CRLF line delimiters
)* \* .*? \* (?&lc)* /
| / (?&lc)* / (?:(?&lc) | [^\r\n])* # // C++/C99 comments
| (?<code> # tokenising anything else
"(?:(?&bs)(?&lc)*.|.)*?" # "strings" literals
| '\''(?&lc)*(?:(?&bs)(?&lc)*(?:\?\?.|.))?(?:\?\?.|.)*?'\'' # (w)char literals
| \?\?'\'' # trigraph form of ^
| .[^'\''"/?]* # anything else
)
}{$+{code} eq "" ? " " : $+{code}}exsg'
to handle a few more corner cases.
Note that if you remove a comment, you could change the meaning of the code: 1-/* comment */-1 is parsed like 1 - -1, while 1--1 (which you'd obtain if you removed the comment outright) would give you an error. It's better to replace the comment with a space character (as we do here) instead of completely removing it.
The above should work properly on, for instance, this valid ANSI C code, which tries to include a few corner cases:
#include <stdio.h>
int main()
{
printf("%d %s %c%c%c%c%c %s %s %d\n",
1-/* comment */-1,
/\
* comment */
"/* not a comment */",
/* multiline
comment */
'"' /* comment */ , '"',
'\'','"'/* comment */,
'\
\
"', /* comment */
"\\
" /* not a comment */ ",
"??/" /* not a comment */ ",
'??''+'"' /* "comment" */);
return 0;
}
Which gives this output:
#include <stdio.h>
int main()
{
printf("%d %s %c%c%c%c%c %s %s %d\n",
1- -1,
"/* not a comment */",
'"' , '"',
'\'','"' ,
'\
\
"',
"\\
" /* not a comment */ ",
"??/" /* not a comment */ ",
'??''+'"' );
return 0;
}
Both versions print the same output when compiled and run.
You can compare with the output of gcc -ansi -E to see what the pre-processor would do on it. That code is also valid C99 or C11 code; however, gcc disables trigraph support by default, so it won't work with gcc unless you specify the standard, as in gcc -std=c99 or gcc -std=c11, or add the -trigraphs option.
It also works on this C99/C11 (non-ANSI/C90) code:
// comment
/\
/ comment
// multiline\
comment
"// not a comment"
(compare with gcc -E, gcc -std=c99 -E or gcc -std=c11 -E)
ANSI C didn't support the // form of comment. // is not otherwise valid in ANSI C, so it wouldn't appear there. One contrived case where // may genuinely appear in ANSI C (as noted there, and you may find the rest of the discussion interesting) is when the stringify operator is in use.
This is valid ANSI C code:
#define s(x) #x
s(//not a comment)
And at the time of the discussion in 2004, gcc -ansi -E did indeed expand it to "//not a comment". However, today, gcc-5.4 returns an error on it, so I doubt we'll find a lot of C code using this kind of construct.
The GNU sed equivalent could be something like:
lc='([\\%]\n|[\\%]\r\n?)'
sed -zE "
s/_/_u/g;s/!/_b/g;s/</_l/g;s/>/_r/g;s/:/_c/g;s/;/_s/g;s/@/_a/g;s/%/_p/g;
s@\?\?/@%@g;s@/$lc*\*@:&@g;s@\*$lc*/@;&@g
s:/$lc*/:@&:g;s/\?\?'/!/g
s#:/$lc*\*[^;]*;\*$lc*/|@/$lc*/([\\\\%].|[^\\\\%\n\r])*|(\"($lc|[\\\\%]$lc*[^\r\n]|[^\\\\%\"])*\"|'$lc*([\\\\%]$lc*[^\r\n])?([^\\\\%']|$lc)*'|$lc|[^'\"@;:]+)#<\5>#g
s/<>/ /g;s/!/??'/g;s@%@??/@g;s/[<>@:;]//g
s/_p/%/g;s/_a/@/g;s/_s/;/g;s/_c/:/g;s/_r/>/g;s/_l/</g;s/_b/!/g;s/_u/_/g"
If your GNU sed is too old to support -E or -z, you can replace the sed -zE " line with:
sed -r ":1;\$!{N;b1}