13

I tried sed and awk, but it is not working because the pattern contains /, which the command already uses as its delimiter.

Please let me know how I can achieve this.

Below is a sample. We want to remove the commented sections, i.e. /*.....*/

/*This is to print the output
data*/
proc print data=sashelp.cars;
run;
/*Creating dataset*/
data abc;
set xyz;
run;
AdminBee
  • -bash-4.1$ sed 's,/*.**/,,g' test.sas Below is the output I get; the first comment is still there. /This is to print the output data/ proc print data=sashelp.cars; run;

    data abc; set xyz; run;

    – Sharique Alam Jul 21 '16 at 11:18
  • 1
    Thanks for the edit. It would be even better if you included your desired output as well. Also include what you tried and how it failed in the question not in the comments. – terdon Jul 21 '16 at 11:33
  • 2
    What should happen to string literals containing comments or comment delimiters? (e.g. INSERT INTO string_table VALUES('/*'), ('*/'), ('/**/'); ) – zwol Jul 21 '16 at 17:20
  • 1
    Related (sorry I can't resist!): http://codegolf.stackexchange.com/questions/48326/remove-single-line-and-multiline-comments-from-string – ilkkachu Jul 21 '16 at 21:27
  • I updated my post with other solutions; please check whether it works for you now. – Luciano Andress Martini Jun 06 '18 at 14:50
  • Hi, are you still having trouble with this? I wrote a very small C program to help me with it; can I post a new answer? – Luciano Andress Martini Jun 17 '19 at 14:57

9 Answers

23

I think I found an easy solution!

cpp -P yourcommentedfile.txt 

SOME UPDATES:

Quote from the user ilkkachu (original text from the user comments):

I played a bit with the options for gcc: -fpreprocessed will disable most directives and macro expansions (except #define and #undef apparently). Adding -dD will leave defines in too; and std=c89 can be used to ignore new style // comments. Even with them, cpp replaces comments with spaces (instead of removing them), and collapses spaces and empty lines.

But I think it is still a reasonable and easy solution for most cases. If you disable macro expansion and the other features, I think you will get good results, and yes, you can combine it with a shell script to get even better ones.

  • 1
    Using the C preprocessor is likely the most robust solution, since the preprocessor is likely the most robust parser of C comments. Clever. – grochmal Jul 21 '16 at 13:33
  • 1
    It also means if your users want to use include or ifdef they're free to. There are some X11 configuration files that use cpp for this. What's the reason for the tail command? I don't think there's anything in OP's question that calls for it. – Random832 Jul 21 '16 at 13:51
  • 14
    But cpp will do a lot more than removing comments (process #include, expand macros, including builtin ones...) – Stéphane Chazelas Jul 21 '16 at 14:17
  • The tail is there to remove the things Stéphane Chazelas mentioned. – Luciano Andress Martini Jul 21 '16 at 14:55
  • 3
    @LucianoAndressMartini, no, tail -n +7 will just remove the first 7 lines, it will not prevent the #include processing or macro expansions. Try echo __LINE__ | cpp for instance. Or echo '#include /dev/zero' | cpp – Stéphane Chazelas Jul 21 '16 at 15:02
  • You're perfectly right! But I think it will work for the user. – Luciano Andress Martini Jul 21 '16 at 16:06
  • 2
    You probably want to use -P mode if you do this. (This may eliminate the need to use tail.) – zwol Jul 21 '16 at 17:14
  • 3
    I played a bit with the options for gcc: -fpreprocessed will disable most directives and macro expansions (except #define and #undef apparently). Adding -dD will leave defines in too; and std=c89 can be used to ignore new style // comments. Even with them, cpp replaces comments with spaces (instead of removing them), and collapses spaces and empty lines. – ilkkachu Jul 21 '16 at 21:51
  • If you don't use the -P option, the whitespace is not collapsed. – Luciano Andress Martini Jul 22 '16 at 00:57
  • @LucianoAndressMartini but then you get the line number directives... – ilkkachu Jul 22 '16 at 10:30
  • Yes. And you can avoid them with | tail -n+7 – Luciano Andress Martini Jul 22 '16 at 10:59
  • @ilkkachu The comments actually need to be replaced by spaces. Consider 1+/*comment*/++i. If the comment is completely collapsed, it would become 1+++i or ((1)++) + (i), which is an error (1 is an rvalue, suffix ++ requires an lvalue). Instead, with spaces, it would be 1+ ++i, which evaluates to (1) + (++(i)), which, in turn, evaluates to (1) + ((i) += 1) or (1) + ((i) = (i) + 1). – EKons Jul 22 '16 at 13:27
  • @ΈρικΚωνσταντόπουλος, If it were, C, then yes. But the question at hand does not specify this exactly, it only says the comments need to be removed. – ilkkachu Jul 22 '16 at 13:36
  • @ilkkachu My statement remains true: it could have been C, it might be C. For safety, use spaces. – EKons Jul 22 '16 at 13:38
  • @ilkkachu -fpreprocessed -E stripping comments is likely to be unintentional; don't rely on it to stay that way. – zwol Jul 25 '16 at 01:18
  • Note that that stripcmt fails on a few of the corner cases in the sample C file in my answer. It also turns 1 -/*comment*/-1 into 1--1 which then becomes invalid C code. – Stéphane Chazelas Jun 06 '18 at 15:00
  • And that is what you should expect when you want to remove the characters between /* and */. But if you are really writing C, and plainly removing /*...*/ is not what you want but rather some more intelligent behavior, and this tool (note that it is a specialized tool), the C preprocessor, and the sed and/or perl solutions are not enough for you, then you may need to write your own comment-removal program! I think you will agree that this can be so painful that, in most cases, it is easier to fix the few errors the tools generate than to write something like that. – Luciano Andress Martini Jun 06 '18 at 15:28
  • -fpreprocessed doesn't seem to be recognized for me on mac (Apple clang version 12.0.5) – Caleb Stanford Apr 19 '22 at 03:00
11

I once came up with this which we can refine to:

perl -0777 -pe '
  s{
    # /* ... */ C comments
    / (?<lc> # line continuation
        (?<bs> # backslash in its regular or trigraph form
          \\ | \?\?/
        )
        (?: \n | \r\n?) # handling LF, CR and CRLF line delimiters
      )* \* .*? \* (?&lc)* /
    | / (?&lc)* / (?:(?&lc) | [^\r\n])* # // C++/C99 comments
    | (?<code> # tokenising anything else
         "(?:(?&bs)(?&lc)*.|.)*?" # "strings" literals
       | '\''(?&lc)*(?:(?&bs)(?&lc)*(?:\?\?.|.))?(?:\?\?.|.)*?'\'' # (w)char literals
       | \?\?'\'' # trigraph form of ^
       | .[^'\''"/?]* # anything else
      )
  }{$+{code} eq "" ? " " : $+{code}}exsg'

to handle a few more corner cases.

Note that if you remove a comment, you could change the meaning of the code (1-/* comment */-1 is parsed like 1 - -1 while 1--1 (which you'd obtain if you removed the comment) would give you an error). It's better to replace the comment with a space character (as we do here) instead of completely removing it.

The Perl command at the top of this answer should work properly on, for instance, this valid ANSI C code, which tries to include a few corner cases:

#include <stdio.h>
int main()
{
  printf("%d %s %c%c%c%c%c %s %s %d\n",
  1-/* comment */-1,
  /\
* comment */
  "/* not a comment */",
  /* multiline
  comment */
  '"' /* comment */ , '"',
  '\'','"'/* comment */,
  '\
\
"', /* comment */
  "\\
" /* not a comment */ ",
  "??/" /* not a comment */ ",
  '??''+'"' /* "comment" */);
  return 0;
}

Which gives this output:

#include <stdio.h>
int main()
{
  printf("%d %s %c%c%c%c%c %s %s %d\n",
  1- -1,
   
  "/* not a comment */",
   
  '"'   , '"',
  '\'','"' ,
  '\
\
"',  
  "\\
" /* not a comment */ ",
  "??/" /* not a comment */ ",
  '??''+'"'  );
  return 0;
}

Both versions print the same output when compiled and run.

You can compare with the output of gcc -ansi -E to see what the preprocessor would do with it. That code is also valid C99 or C11 code; however, gcc disables trigraph support by default, so it won't work with gcc unless you specify the standard (as in gcc -std=c99 or gcc -std=c11) or add the -trigraphs option.

It also works on this C99/C11 (non-ANSI/C90) code:

// comment
/\
/ comment
// multiline\
comment
"// not a comment"

(compare with gcc -E/gcc -std=c99 -E/gcc -std=c11 -E)

ANSI C didn't support the // form of comment. // is not otherwise valid in ANSI C so wouldn't appear there. One contrived case where // may genuinely appear in ANSI C (as noted there, and you may find the rest of the discussion interesting) is when the stringify operator is in use.

This is valid ANSI C code:

#define s(x) #x
s(//not a comment)

And at the time of the discussion in 2004, gcc -ansi -E did indeed expand it to "//not a comment". However, today gcc-5.4 returns an error on it, so I doubt we'll find a lot of C code using this kind of construct.

The GNU sed equivalent could be something like:

lc='([\\%]\n|[\\%]\r\n?)'
sed -zE "
  s/_/_u/g;s/!/_b/g;s/</_l/g;s/>/_r/g;s/:/_c/g;s/;/_s/g;s/@/_a/g;s/%/_p/g;
  s@\?\?/@%@g;s@/$lc*\*@:&@g;s@\*$lc*/@;&@g
  s:/$lc*/:@&:g;s/\?\?'/!/g
  s#:/$lc*\*[^;]*;\*$lc*/|@/$lc*/([\\\\%].|[^\\\\%\n\r])*|(\"($lc|[\\\\%]$lc*[^\r\n]|[^\\\\%\"])*\"|'$lc*([\\\\%]$lc*[^\r\n])?([^\\\\%']|$lc)*'|$lc|[^'\"@;:]+)#<\5>#g
  s/<>/ /g;s/!/??'/g;s@%@??/@g;s/[<>@:;]//g
  s/_p/%/g;s/_a/@/g;s/_s/;/g;s/_c/:/g;s/_r/>/g;s/_l/</g;s/_b/!/g;s/_u/_/g"

If your GNU sed is too old to support -E or -z, you can replace the first line with:

sed -r ":1;\$!{N;b1}
  • The perl solution has a problem with multi-line comments: test it with this input => echo -e "BEGIN/*comment*/ COMMAND /*com\nment*/END" – Baba Jul 21 '16 at 14:18
  • @Babby, works for me. I've added a multi-line comment and the resulting output in my test case. – Stéphane Chazelas Jul 21 '16 at 14:28
  • The best thing to compare to nowadays would be gcc -std=c11 -E -P (-ansi is just another name for -std=c90). – zwol Jul 21 '16 at 17:16
  • @zwol, the idea is to be able to handle code written for any C/C++ standard (c90, c11 or other). Strictly speaking, it's not possible (see my 2nd contrived example). The code still tries to handle C90 constructs (like ??'), hence we compare with cpp -ansi for those and C99/C11... one (like // xxx), hence we compare with cpp (or cpp -std=c11...) – Stéphane Chazelas Jul 21 '16 at 17:29
  • @zwol, I've split the test case in an attempt to clarify a bit. It looks like trigraphs are still in C11, so my second test case is not standard C anyway. – Stéphane Chazelas Jul 21 '16 at 17:47
  • Right, as far as I know the only official change since C90 in this area is the introduction of // comments in C99. GCC's default non-conversion of trigraphs is an independent thing. – zwol Jul 21 '16 at 18:00
  • @zwol, I've removed that case as it was hardly even an academic one. I've replaced with the one that was referred to in the 2004 discussion (academic as well) – Stéphane Chazelas Jul 21 '16 at 20:50
  • That sed expression near the end scares me. I wonder if, like monsters under the bed, it goes away if you look at it and then scroll away? – user Jul 22 '16 at 12:39
  • @MichaelKjörling, you'll find that it's a lot more straightforward (though less portable) than the (clever) approach at http://sed.sourceforge.net/grabbag/scripts/remccoms3.sed – Stéphane Chazelas Jul 23 '16 at 07:17
6

with sed:

UPDATE

/\/\*/ {
    /\*\// {
        s/\/\*.*\*\///g;
        b next
    };

    :loop;
    /\*\//! {
        N;
        b loop
    };
    /\*\// {
        s/\/\*.*\*\//\n/g
    }
    :next
}

This supports all the cases (multi-line comments, with data before and/or after the comment). Test input:

 e1/*comment*/
-------------------
e1/*comment*/e2
-------------------
/*comment*/e2
-------------------
e1/*com
ment*/
-------------------
e1/*com
ment*/e2
-------------------
/*com
ment*/e2
-------------------
e1/*com
1
2
ment*/
-------------------
e1/*com
1
2
ment*/e2
-------------------
/*com
1
2
ment*/e2
-------------------

run:

$ sed -f command.sed FILENAME

e1
-------------------
e1e2
-------------------
e2
-------------------
e1

-------------------
e1
e2
-------------------

e2
-------------------
e1

-------------------
e1
e2
-------------------

e2
-------------------
Baba
4
 $ cat file | perl -pe 'BEGIN{$/=undef}s!/\*.+?\*/!!sg'

 proc print data=sashelp.cars;
 run;

 data abc;
 set xyz;
 run;

Remove blank lines if any:

 $ cat file | perl -pe 'BEGIN{$/=undef}s!/\*.+?\*/\n?!!sg'

Edit - the shorter version by Stephane:

 $ cat file | perl -0777 -pe 's!/\*.*?\*/!!sg'
Hans Schou
2

Solution using the sed command and no script

Here you are:

sed 's/\*\//\n&/g' test | sed '/\/\*/,/\*\//d'

N.B. This doesn't work on OS X, unless you install gnu-sed. But it works on Linux Distros.
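A quick way to see what the pipeline does and does not handle is to run it on two small inputs. The sketch below assumes GNU sed (for \n in the replacement text) and wraps the pipeline in a function whose name is made up for this illustration:

```shell
# The two-stage pipeline from this answer, wrapped in a hypothetical function.
# Assumes GNU sed, which understands \n in the replacement text.
strip_lines() { sed 's/\*\//\n&/g' | sed '/\/\*/,/\*\//d'; }

# A comment on its own line is removed cleanly.
printf '/*Creating dataset*/\ndata abc;\n' | strip_lines   # prints: data abc;

# Code sharing a line with a comment is lost as well: this prints nothing.
printf 'abc /* def */ ghi\n' | strip_lines
```

The second case happens because the range delete works on whole lines: the first stage only splits the line before */, so both resulting pieces fall inside the deleted range.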

  • 1
    you can use -i option to edit file in-place instead of redirecting output to new file. or much safer -i.bak to backup file – Rahul Jul 21 '16 at 12:18
  • 1
    It does not work in all cases either; try putting a comment on the same line as code and watch what happens. Example: set xy; /*test*/ I think we will need perl to solve this in an easy way. – Luciano Andress Martini Jul 21 '16 at 12:19
  • @Rahul exactly, thanks for mentioning. I just wanted to keep it more simple. –  Jul 21 '16 at 12:21
  • I'm very sorry to say that it is not working for comments on the same line. – Luciano Andress Martini Jul 21 '16 at 12:38
  • @LucianoAndressMartini Now it does! –  Jul 21 '16 at 18:28
  • @Scott I've tested on RHEL 7 and Kali 2, and it worked, but it won't work if you're trying to use sed -e 's/\*\//\n&/g' -e '/\/\*/,/\*\//d' test, it needs a pipe. because first of all, we need to add a newline when there are comments starting and ending in the same line, so it can get it as a range with a starting/ending point since it checks newlines. I'm not sure why it doesn't work for you, but it does for me. I just tested now again to make sure, and it did work. Anyway, Luciano is right, perl can be better and easier, but I prefer sed itself. –  Jul 22 '16 at 16:00
  • I suspect we’re chasing different test cases.  My issue is that a line abc /* def */ ghi gets completely deleted, while (IMO) it should leave abc ghi.  Admittedly, it is an unfortunate nuisance that the OP hasn’t posted expected output, hasn’t addressed edge cases, and seems to have abandoned this question. – Scott - Слава Україні Jul 22 '16 at 19:10
  • @Scott Indeed, in that case you're utterly right mate. –  Jul 22 '16 at 19:30
  • Try using \r instead of \n in the replacement pattern. – Wildcard Nov 15 '16 at 06:22
1

sed operates on one line at a time, but some of the comments in the input span multiple lines. As per https://unix.stackexchange.com/a/152389/90751 , you can first use tr to turn the line-breaks into some other character. Then sed can process the input as a single line, and you use tr again to restore the line-breaks.

tr '\n' '\0' | sed ... | tr '\0' '\n'

I've used null bytes, but you can pick any character that doesn't appear in your input file.

* has a special meaning in regular expressions, so it will need escaping as \* to match a literal *.

.* is greedy -- it will match the longest possible text, including more */ and /*. That means the first comment, the last comment, and everything in between. To restrict this, replace .* with a stricter pattern: comments can contain anything that's not a "*", and also "*" followed by anything that's not a "/". Runs of multiple *s also have to be accounted for:

tr '\n' '\0' | sed -e 's,/\*\([^*]\|\*\+[^*/]\)*\*\+/,,g' | tr '\0' '\n'

This will remove any linebreaks in the multiline comments, ie.

data1 /* multiline
comment */ data2

will become

data1  data2

If this isn't what was wanted, sed can be told to keep one of the linebreaks. This means picking a linebreak replacement character that can be matched.

tr '\n' '\f' | sed -e 's,/\*\(\(\f\)\|[^*]\|\*\+[^*/]\)*\*\+/,\2,g' | tr '\f' '\n'

The special character \f, and the use of a back-reference that may not have matched anything, aren't guaranteed to work as intended in all sed implementations. (I confirmed it works on GNU sed 4.07 and 4.2.2.)
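To see the difference between the greedy and the stricter pattern, the sketch below runs both on a line holding two comments, then applies the stricter one to multi-line input. The function names are made up for this illustration, and the stricter pattern is spelled with sed -E rather than the \| and \+ escapes used above, which is an assumption made here for portability:

```shell
# Greedy pattern: .* runs from the first /* to the last */ on the line.
greedy() { sed -e 's,/\*.*\*/,,g'; }

# Stricter pattern from this answer, written with -E (extended regular
# expressions) instead of the \| and \+ escapes; that spelling is an
# assumption made here for portability.
strict() { sed -E 's,/\*([^*]|\*+[^*/])*\*+/,,g'; }

line='a /* one */ keep /* two */ b'
printf '%s\n' "$line" | greedy   # prints "a  b"
printf '%s\n' "$line" | strict   # prints "a  keep  b"

# Multi-line input: fold newlines into form feeds, strip, unfold.
printf 'data1 /* multiline\ncomment */ data2\n' |
    tr '\n' '\f' | strict | tr '\f' '\n'   # prints "data1  data2"
```

The greedy version loses the word keep because its single match spans both comments.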

JigglyNaga
  • Could you please let me know how it will work? I tried as below. tr '\n' '\0' | sed -e 's,/*([^*]|*+[^*/])**+/,,g' test.sas | tr '\0' '\n' and I got the following: /This is to print the output data/data abcdf; set cfgtr; run; proc print data=sashelp.cars; run;

    data abc; set xyz; run;

    – Sharique Alam Aug 05 '16 at 13:25
  • @ShariqueAlam You've put test.sas in the middle of the pipeline there, so sed reads from it directly, and the first tr has no effect. You need to use cat test.sas | tr ... – JigglyNaga Aug 06 '16 at 14:49
0

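Using a one-line sed command to remove comments. Note that it deletes every whole line that contains /* or */, so it suits input like the sample, where each comment line carries one of the delimiters and no code: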

sed '/\/\*/d;/\*\//d' file

proc print data=sashelp.cars;
run;
data abc;
set xyz;
run;
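The line-based behaviour of this one-liner can be checked on a few inputs. In the sketch below the function name and the test lines are made up for illustration:

```shell
# The one-liner from this answer, wrapped in a hypothetical function.
drop_delim_lines() { sed '/\/\*/d;/\*\//d'; }

# Works when every comment line carries /* or */ and nothing else.
printf '/*Creating dataset*/\ndata abc;\n' | drop_delim_lines   # prints: data abc;

# A three-line comment leaves its middle line behind, and code on the
# same line as a comment is deleted with it.
printf '/* first\nmiddle\nlast */\nx = 1; /* note */\nrun;\n' | drop_delim_lines
# prints:
# middle
# run;
```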
0

The GNU awk manual provides an example with getline that does just that, which I copy here verbatim.

# Remove text between /* and */, inclusive
{
    while ((start = index($0, "/*")) != 0) {
        out = substr($0, 1, start - 1)  # leading part of the string
        rest = substr($0, start + 2)    # ... */ ...    
        while ((end = index(rest, "*/")) == 0) {  # is */ in trailing part?
            # get more text
            if (getline <= 0) {
                print("unexpected EOF or error:", ERRNO) > "/dev/stderr"
                exit
            }
            # build up the line using string concatenation
            rest = rest $0
        }
        rest = substr(rest, end + 2)  # remove comment
        # build up the output line using string concatenation
        $0 = out rest
    }
    print $0
}

Bear in mind that it joins mon/*comment*/key into monkey. As Stéphane Chazelas mentions in his answer, this may effectively change the code, so consider changing $0 = out rest to $0 = out " " rest.

Save that in a file, say commentRemove.awk, and run it on an input file:

awk -f commentRemove.awk inputfile
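For a quick test without creating a file, the same logic can be inlined in a shell function. This is a condensed sketch of the manual's program with the space-inserting change applied, not the verbatim text; the function name strip_awk is made up here, and the error message is shortened:

```shell
# Condensed variant of the awk program above, with " " inserted where
# each comment used to be. strip_awk is a hypothetical name.
strip_awk() {
    awk '{
        while ((start = index($0, "/*")) != 0) {
            out  = substr($0, 1, start - 1)        # text before the comment
            rest = substr($0, start + 2)           # text after the opening /*
            while ((end = index(rest, "*/")) == 0) {
                # comment continues on the next line: read it and append
                if ((getline) <= 0) {
                    print "unexpected EOF or error" > "/dev/stderr"
                    exit 1
                }
                rest = rest $0
            }
            $0 = out " " substr(rest, end + 2)     # a space replaces the comment
        }
        print
    }'
}

printf 'mon/*comment*/key\n' | strip_awk   # prints: mon key
printf 'a/*x\ny*/b\n' | strip_awk          # prints: a b
```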
Quasímodo
0

Why not use cpp (the C preprocessor)?

( sed "s,^\(\s*\)#,\1_:#:_," \
| sed "s,\\\\$,_:%:_," \
| cpp \
| sed "/^\s*#/ d" \
| sed "s,^\(\s*\)_:#:_,\1#," \
| sed "s,_:%:_$,\\\\," \
) < in.c > out.c