Remove comments from file while ignoring quoted comment signs

Question

I would like to remove comments starting with # from a file. I have tried the simpler approaches described in How can I remove all comments from a file? but I have a few additional rules:

A # does not start a comment if it occurs as part of a quoted string.
Strings can be quoted by single quotes ' or double quotes ".
Double-quoted strings can contain quotes if preceded by a backslash \", backslashes are quoted as \\.
All quotes in the input are matched. However, this is not required for quotes that are part of a string's content (in other words "'", "\"" and '"' are valid strings).
Quoted strings can't contain newline characters.
Comments can contain any characters including any number of #, ', " and \.
Any # outside of quotes starts a comment (as Stéphane Chazelas pointed out, code for most shells follows more complex rules; think of Bash's $#, which does not start a comment).

For example the following input

# comment only
# comments are allowed to contain quotes "' and # number signs
# comments are allowed to contain pairs 'of' "quotes"
some text # with an explanation
some "quoted text # not a comment" # comment
'# not a comment' and '# not a comment either' # comment
"# not a comment containing 'quotes\"" # another comment

shall be converted into the following output


some text
some "quoted text # not a comment"
'# not a comment' and '# not a comment either'
"# not a comment containing 'quotes\""

I would like to accomplish this with popular Unix command line tools like awk, grep and sed on modern Debian/Ubuntu systems. I'm not strictly limited to features described by POSIX although a POSIX-compliant solution would be preferred.
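
For anyone who wants to test a candidate against the example, here is a rough pass/fail sketch. The function names (sample_input, expected_output, check_filter) are made up for illustration, and it assumes that trailing whitespace left in front of a removed comment does not matter, so both sides are trimmed before comparing.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any tool: they wrap the example above.

sample_input() {
  cat <<'EOF'
# comment only
# comments are allowed to contain quotes "' and # number signs
# comments are allowed to contain pairs 'of' "quotes"
some text # with an explanation
some "quoted text # not a comment" # comment
'# not a comment' and '# not a comment either' # comment
"# not a comment containing 'quotes\"" # another comment
EOF
}

expected_output() {
  cat <<'EOF'



some text
some "quoted text # not a comment"
'# not a comment' and '# not a comment either'
"# not a comment containing 'quotes\""
EOF
}

# Assumption: trailing blanks are irrelevant, so strip them on both sides.
trim_trailing() { sed 's/[[:space:]]*$//'; }

# Usage: check_filter COMMAND [ARGS...]; COMMAND must read standard input.
check_filter() {
  # The trailing "echo x" keeps $(...) from dropping final newlines.
  actual=$(sample_input | "$@" | trim_trailing; echo x)
  wanted=$(expected_output | trim_trailing; echo x)
  if [ "$actual" = "$wanted" ]; then
    echo PASS
  else
    echo FAIL
    return 1
  fi
}

check_filter cat || true   # prints FAIL: cat leaves every comment in place
```

A candidate is passed as a command line, for example check_filter sed 's/#.*//' for the naive approach.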

You may be asking for the impossible. Non-Perl regular expressions (which is what standard POSIX tools provide) can only handle certain types of grammar, and depending on the exact grammar of the language in that file (whatever it is, as you do not say), it may simply be impossible to do this correctly with regular expressions. See https://stackoverflow.com/a/590789/340790 and the infamous https://stackoverflow.com/q/1732348/340790 . You need to state what language the file contents are in. — JdeBP, Sep 06 '20 at 04:34
I see you've updated your question to add a description of where comments occur and what they contain. You should update your sample input/output to show all those cases so we have something to test against for a simple pass/fail on a potential solution. — Ed Morton, Sep 06 '20 at 21:54

Stéphane Chazelas · Accepted Answer · 2020-09-07T20:19:17.903

If the point is to remove comments from POSIX sh scripts, note that only the ones marked as YES in the code below are comments:

echo 1 # YES
echo 2 $# NO foo# NO
echo 3;#YES
# YES
cat << E
# NO
E
echo 4 " # NO \" # NO" \" # YES
echo "5
# NO
$(echo 6 # YES
)
`echo 7 \" # NO \"`
"
eval 'echo 8 # NO, then YES'

(and you can see the stackexchange syntax highlighter gets it wrong in most of the cases).

Covering those would take hundreds of lines of awk or sed code.

And rules for csh, fish, perl, python, ruby which are other languages that have "..." and '...' quotes and # as comment leader would be radically different.

If

it's not about shell syntax,
you can assume that there is no escaping of quotes,
that the quoted strings don't contain newline characters,
that all quotes are matched,
that any # outside of quotes starts a comment and not only those following a blank or other delimiter,
that the input is valid text in the current locale

And if by standard you mean POSIX 2018 or earlier, you could do it with sed:

sed "s/^\(\(\([^\"'#]\)*\(\"[^\"]*\"\)\{0,1\}\('[^']*'\)\{0,1\}\)*\)#.*/\1/"

POSIX 2018 sed doesn't support -E for EREs, which would be needed for an alternation operator, but here we get close to it with BREs by using \(a\{0,1\}b\{0,1\}\)* ((a?b?)* in ERE) as an equivalent of (a|b)*. Using (a*b*)* as in Rakesh's answer would also work.
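
The equivalence can be checked on a single line. The sketch below applies the BRE command above and, assuming a sed that accepts -E (GNU sed does), the same regex written with real alternation. The function names strip_bre and strip_ere are only labels for this illustration.

```shell
#!/bin/sh
# BRE form from the answer: alternation emulated with \{0,1\} inside a group.
strip_bre() {
  sed "s/^\(\(\([^\"'#]\)*\(\"[^\"]*\"\)\{0,1\}\('[^']*'\)\{0,1\}\)*\)#.*/\1/"
}

# ERE form with a real alternation operator; needs -E, which POSIX 2018 lacks.
strip_ere() {
  sed -E "s/^(([^\"'#]|\"[^\"]*\"|'[^']*')*)#.*/\1/"
}

line='some "quoted # text" and '\''more # text'\'' # comment'

# Both commands print the line up to, but not including, the last #.
printf '%s\n' "$line" | strip_bre
printf '%s\n' "$line" | strip_ere
```

Both forms leave the blank in front of the removed comment in place.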

grep would not be an option as standard grep only prints the full matching lines. awk uses EREs though. Standard awk doesn't have capture groups, but you should be able to do things like:

awk "match(\$0, /^([^'\"#]|\"[^\"]*\"|'[^']*')*#/) {
       \$0 = substr(\$0, 1, RLENGTH-1)
     }
     {print}"

With your edited requirements, you can handle the escaped quotes by using "(\\.|[^\\"])*" or its BRE equivalent:

sed 's/^\(\(\([^"\\'\''#]\)*\(\\.\)\{0,1\}\("\([^"\\]*\(\\.\)\{0,1\}\)*"\)\{0,1\}\('"'[^']*'\)\{0,1\}\)*\)#.*/\1/"

or:

awk 'match($0, /^([^'\''"\\#]|\\.|"(\\.|[^\\"])*"|'\''(\\.|[^\\'\''])*'\'')*#/) {
       $0 = substr($0, 1, RLENGTH-1)
     }
     {print}'

both of which also handle escaped quotes outside of quotes (as in foo\"bar # comment).
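
A quick way to see the difference is to feed foo\"bar # comment to the earlier awk variant and to the one with the extra \\. alternative. This is only a sketch; strip_plain and strip_escaped are hypothetical wrapper names around the two awk commands from this answer.

```shell
#!/bin/sh
# Earlier awk variant from this answer: no handling of backslash escapes.
strip_plain() {
  awk "match(\$0, /^([^'\"#]|\"[^\"]*\"|'[^']*')*#/) {
         \$0 = substr(\$0, 1, RLENGTH-1)
       }
       {print}"
}

# Variant with the extra \\. alternative, copied from the command above.
strip_escaped() {
  awk 'match($0, /^([^'\''"\\#]|\\.|"(\\.|[^\\"])*"|'\''(\\.|[^\\'\''])*'\'')*#/) {
         $0 = substr($0, 1, RLENGTH-1)
       }
       {print}'
}

# The lone " looks like an unterminated string to strip_plain, so nothing
# matches and the line is printed unchanged.
printf '%s\n' 'foo\"bar # comment' | strip_plain

# strip_escaped consumes \" as one escaped character and removes the comment.
printf '%s\n' 'foo\"bar # comment' | strip_escaped
```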

I've switched to using single quotes here to reduce the number of backslashes needed to get a literal \\, but note that literal single quotes in the data then have to be inserted as 'before'\''after', that is '\'': the first ' closes the 'before' quoted string, \' uses a backslash to quote/escape the literal ' (as you can't insert a single quote inside a single-quoted string), and the last ' opens the 'after' quoted string.
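
The quoting rules can be tried in isolation with printf. This is a small sketch and nothing in it is specific to sed or awk.

```shell
#!/bin/sh
# Three spellings of the same string: before'after
printf '%s\n' 'before'\''after'    # close the quotes, add \', reopen them
printf '%s\n' "before'after"       # inside double quotes ' is not special
printf '%s\n' 'before'"'"'after'   # the ' can also sit in its own "..." part

# Backslashes: single quotes keep them as typed, double quotes halve them.
printf '%s\n' '\\.'                # prints \\.
printf '%s\n' "\\\\."              # prints \\.
```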

Thumbs up - both the sed and awk solutions generate the correct output :-) — Martin Konrad, Sep 08 '20 at 20:58

Rakesh Sharma · Answer 2 · 2020-09-07T06:21:10.717

Based on the rules specified we distinguish 5 kinds of words:

double quoted words "... \"... " (they can include escaped double quotes as well)
single quoted words '...' (they cannot include a single quote)
backslash quoted words \. (basically any escaped char)
non-comment-starting chars [^'#"]
what remains is a comment.

#! /bin/bash
# whitespace and horizontal whitespace
_ws_=$(printf '\t \nx')
ws="[${_ws_%?}]" hws="[${_ws_%??}]"
nac="[^\"'#]" nac="($nac)" # not a comment char
bqw='[\].'    bqw="($bqw)" # backslashed word
sqw="'[^']*'" sqw="($sqw)" # single quoted word
# double quoted word
dqw='
  "
    (
      [^\"]* ([\][\])* [\]"
    )*
    [^"]*
  "
'
# strip the layout whitespace from the regex above
dqw="(${dqw//$ws/})"
sed -E \
  -e '/#/!b' \
  -e "s/^(($sqw*$dqw*$bqw*$nac*)*)#.*/\1/" \
  -e "s/$hws*$//" \
< file

Note this is fully POSIX

This solution generates the correct output :-) Beware that it strips away whitespaces before the comment sign, though. — Martin Konrad, Sep 08 '20 at 20:54

Martin Konrad · Answer 3 · 2020-09-08T19:10:05.920

Solution

The following solution works with popular sed implementations like GNU sed which support extended regular expressions (ERE):

sed -E "s/^(([^#\"'\\]|'[^']*'|\"([^\"\\\\]|\\\\.)*\")*)#.*/\1/" input.txt

The main advantage of this solution is better readability than many other solutions.

Note: The -E switch is not part of POSIX 2018 yet, but it is on its way to becoming part of POSIX 2020. If you need a POSIX-2018-compatible solution see Stéphane Chazelas' answer.

How it works

The following longer version breaks the above regex into pieces that are easier to digest:

NON_QUOTED_TEXT="[^#\"'\\]"
SINGLE_QUOTED_STRING="'[^']*'"
DOUBLE_QUOTED_STRING='"([^"\\]|\\.)*"'
REMOVE_COMMENTS="^((${NON_QUOTED_TEXT}|${SINGLE_QUOTED_STRING}|${DOUBLE_QUOTED_STRING})*)#.*"
sed -E "s/${REMOVE_COMMENTS}/\1/" input.txt

We are using sed to search for text matching the regular expression contained in ${REMOVE_COMMENTS} and replace each match with the content of the first capture group \1. This capture group contains the match of the regular expression between the first opening parenthesis ( and the last closing parenthesis ). This part of the regex matches any text before the first comment sign (#) which doesn't occur as part of a quoted string. Looking at it in detail we are matching a sequence of 0 to N (*) of the following options (a|b|c):

Non-quoted text: characters other than #, ", ' and \.
Single-quoted text: any number (*) of characters other (^) than ' enclosed by a pair of single quotes.
Double-quoted text: A string enclosed by a pair of double quotes. The string is allowed to contain any number of characters other than " and \ or ((a|b)) an arbitrary character preceded by a backslash (\\.).

When combining the parts into the complete solution above, we have to keep in mind that Bash requires slightly different quoting inside single vs. double quotes. See Differences between single and double quotes in Bash for details.
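
The quoting difference is easy to check for the double-quoted-string part of the regex. In this sketch the variable names dq_single and dq_double are illustrative only. Both assignments should yield the same text.

```shell
#!/bin/sh
# The same regex fragment written with both quoting styles.
dq_single='"([^"\\]|\\.)*"'           # single quotes: typed as the regex reads
dq_double="\"([^\"\\\\]|\\\\.)*\""     # double quotes: escape " and double each \

if [ "$dq_single" = "$dq_double" ]; then
  # printf is used instead of echo, which may mangle backslashes in some shells.
  printf 'identical: %s\n' "$dq_single"
fi
```

This is why the one-line version, which sits inside double quotes, needs twice as many backslashes as the DOUBLE_QUOTED_STRING assignment.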

score -1 · Answer 4 · answered Sep 06 '20 at 09:23

command

 sed -e '/^#/d' filename| sed "s/# comment$//g"

Python

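#!/usr/bin/python3
import re

# Lines that start with a comment sign are dropped entirely
d = re.compile(r'^#')
# A trailing comment consisting of the literal text "# comment"
r = re.compile(r'#\scomment$')
with open('p', 'r') as l:
    for i in l:
        if not d.search(i):
            e = r.sub("", i)
            print(e.strip())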

output

some text # with a comment
some "quoted text # not a comment"
'# not a comment' "# it's not a comment" '#still not a comment

'

Praveen Kumar BS

Seems like in the first version of my question I didn't get across that comments can contain any text. Your solution assumes all comments contain the word comment in their text (which was the case in my example). I have updated my question to clarify. – Martin Konrad Sep 06 '20 at 17:34
