I would like to remove comments starting with `#` from a file. I have tried the simpler approaches described in "How can I remove all comments from a file?", but I have a few additional rules:
- A `#` does not start a comment if it occurs as part of a quoted string.
- Strings can be quoted by single quotes `'` or double quotes `"`.
- Double-quoted strings can contain quotes if preceded by a backslash (`\"`); backslashes are quoted as `\\`.
- All quotes in the input are matched. However, this is not required for quotes that are part of a string's content (in other words, `"'"`, `"\""` and `'"'` are valid strings).
- Quoted strings can't contain newline characters.
- Comments can contain any characters, including any number of `#`, `'`, `"` and `\`.
- Any `#` outside of quotes starts a comment (as Stéphane Chazelas pointed out, code for most shells follows more complex rules; think about Bash's `$#`, which does not start a comment).
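To make the rules concrete, here is a minimal sketch of how they could be read as a three-state scanner in POSIX awk. It is an illustration of the rules, not a vetted solution. The function name `strip_comments` is hypothetical, and two behaviours are my own assumptions rather than part of the rules: whitespace left in front of a removed comment is trimmed, and a line that becomes empty because its comment was removed is dropped, while lines that were already blank are kept.

```shell
# Hypothetical helper: reads text on stdin, writes it without comments on stdout.
strip_comments() {
  awk '
  {
    out = ""; comment = 0
    state = 0                      # 0: outside quotes, 1: single-quoted, 2: double-quoted
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (state == 0) {
        if (c == "#") { comment = 1; break }   # unquoted number sign: rest of line is a comment
        if (c == "\047") state = 1             # \047 is the single quote character
        else if (c == "\"") state = 2
      } else if (state == 1) {
        if (c == "\047") state = 0             # no escapes inside single quotes
      } else {
        if (c == "\\" && i < n) {              # backslash escapes the next character
          out = out c
          i++
          c = substr($0, i, 1)
        } else if (c == "\"") state = 0
      }
      out = out c
    }
    sub(/[ \t]+$/, "", out)                    # assumption: trim whitespace before the comment
    if (out != "" || !comment) print out       # assumption: drop lines that held only a comment
  }'
}
```

With the function defined in the current shell, it could be called as `strip_comments < input.txt`, where `input.txt` is a placeholder file name.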
For example, the following input
# comment only
# comments are allowed to contain quotes "' and # number signs
# comments are allowed to contain pairs 'of' "quotes"
some text # with an explanation
some "quoted text # not a comment" # comment
'# not a comment' and '# not a comment either' # comment
"# not a comment containing 'quotes\"" # another comment
shall be converted into the following output
some text
some "quoted text # not a comment"
'# not a comment' and '# not a comment either'
"# not a comment containing 'quotes\""
I would like to accomplish this with popular Unix command-line tools like `awk`, `grep` and `sed` on modern Debian/Ubuntu systems. I'm not strictly limited to features described by POSIX, although a POSIX-compliant solution would be preferred.