Trying to use the regex \/\*(.|\n)*?\*\/ to select every C multi-line comment, but it's not working in sed

Question

I need to match C-style multi-line comments, for example in:

#include <stdio.h>
int main()
{
    // this is a dummy function
    float sum = 0;
    // testing the sed commands
int x = 6; // single-line comment
x = x + 5;

char y = 'n'; /* end of c  *
file */

}

I need to delete all multi-line comments.

So I used sed s/\/\*\(.\|\n\)*\?\*\///, but it doesn't work. I tried replacing the / delimiter with %: s%\/\*\(.\|\n\)*\?\*\/%%, but it still isn't working.

Can anyone please help me put the regex \/\*(.|\n)*?\*\/ into a sed command?

Already answered. See https://stackoverflow.com/questions/13061785/remove-multi-line-comments — fpmurphy, Apr 15 '21 at 08:48
What operating system are you using? We need to know this to know what sed you have. — terdon, Apr 15 '21 at 09:11
well, usually when you end up at "but it doesn't work", it's time to look at the error message and figure out what it might mean. And to include it in the question when you ask for help, so others don't need to guess out everything from zero. Here, you need quotes to protect the sed command from the shell. Without them, sed sees s//*(.|n)*?*///. See https://mywiki.wooledge.org/Quotes, https://unix.stackexchange.com/q/68694/170373, https://unix.stackexchange.com/q/400447/170373, https://unix.stackexchange.com/q/503013/170373 — ilkkachu, Apr 15 '21 at 09:33
In any case, removing comments from C code according to all the actual syntax rules is hideously hard to get right, see e.g. Deleting (some) comments from a C program, Deleting all C comments with sed, and esp. Stéphane's answer in How can I delete all characters falling under /* … */ including /* & */? — ilkkachu, Apr 15 '21 at 09:44
I am using Ubuntu 18.04. The regex itself works fine, but inside sed it fails to detect the match. It doesn't throw any error. I am allowed to use sed only, and no other software. — Tushar Amdoskar, Apr 15 '21 at 09:57
without quotes, GNU sed says sed: -e expression #1, char 14: unknown option to 's' because there's s//...///, i.e. extra slashes after the s///. Unless your shell is something funky, that is, but the other common not-so-POSIX shells like fish, Zsh and tcsh would complain about that glob not matching anything. — ilkkachu, Apr 15 '21 at 10:06
@TusharAmdoskar, you're "allowed to"? So, what are the exact requirements and restrictions then? Standard POSIX sed? GNU sed (the one you have on Ubuntu)? Multi-line comments only, or single-line and multi-line comments both? That last \/\*(.|\n)*?\*\/ looks a lot like a Perl-style regex that would match single and multi-line comments. Please [edit] your question to include the constraints; comments are mostly just good for stuffing information out of sight. — ilkkachu, Apr 15 '21 at 10:11
@TusharAmdoskar, and, I'm sorry if that came out too harshly. It's just that the problem with unix-likes is that they're different: not all systems have tools that support all the same features. Standard POSIX features vs. GNU extensions being a big one. Also, when you say you're allowed to only use X, it makes it sound like a course assignment. And the problem with those is that they're often disconnected from real-world problems. In the real world, it's often better to find a tool that works best for the job, while assignments can have arbitrary limitations like that. — ilkkachu, Apr 15 '21 at 10:56
Let alone the actually hairy stuff you get in the real world, like the fact that (in C++ or C99) // hi /* there is not the start of a /* -style comment, and printf("/* hello */"); also contains no comments. — ilkkachu, Apr 15 '21 at 10:58

terdon · Answer 1 · 2021-04-15T11:02:30.200

Sed works on "records" (lines), which are delimited by a trailing newline (\n) character. Because each record is processed on its own, a regex cannot match past a \n: as far as sed is concerned, the \n is the end of the record. In GNU sed, you can get around this by using -z, which makes NUL (\0) the record separator instead of newline, so the entire file is treated as a single record (unless the file contains NUL bytes, in which case each \0 ends a record):

$ sed -zE 's|/\*.*\n.*\*/||' file.c 
#include <stdio.h>
int main()
{
    // this is a dummy function
    float sum = 0;
    // testing the sed commands
int x = 6; // single-line comment
x = x + 5;

char y = 'n'; 

}
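To see the record behaviour in isolation, here is a minimal sketch, assuming GNU sed (for -z). The two-line sample text and the variable names are made up for illustration:

```shell
# Hypothetical two-line sample: one comment that spans a newline.
sample='a /* one
two */ b'

# Default mode: sed sees "a /* one" and "two */ b" as separate records,
# so a pattern that needs both /* and */ never matches.
by_line=$(printf '%s\n' "$sample" | sed -E 's|/\*.*\*/||')

# GNU sed -z: the whole input is one record, so the match can span the newline.
slurped=$(printf '%s\n' "$sample" | sed -zE 's|/\*.*\*/||')

printf 'line by line:\n%s\n\nwith -z:\n%s\n' "$by_line" "$slurped"
```

The first result should come back unchanged, while the second should have the comment removed.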

However, this sed command will fail if the file contains more than one multi-line comment: sed cannot do non-greedy matching, so it always finds the longest possible match, which here runs from the first /* to the last */ and deletes everything in between. So use a tool that can do non-greedy matching, like perl:

$ perl -0777 -pe 's|/\*.*?\n.*?\*/||gs' file.c 
#include <stdio.h>
int main()
{
    // this is a dummy function
    float sum = 0;
    // testing the sed commands
int x = 6; // single-line comment
x = x + 5;

char y = 'n'; 

}
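If Perl is not an option, the greedy-matching limitation can also be sidestepped within GNU sed by using a pattern that simply cannot step over a closing */. The following is a sketch under that assumption (GNU sed with -z and -E); the input text and variable names are hypothetical:

```shell
# Hypothetical input with two separate multi-line comments.
src='int a; /* first
comment */
int b;
/* second
comment */ int c;'

# Greedy .* runs from the first /* to the last */ and takes "int b;" with it.
greedy=$(printf '%s\n' "$src" | sed -zE 's|/\*.*\*/||g')

# The bracket-based pattern cannot consume a "*/", so each match stops at
# the first closing tag. Note that it also removes single-line /* */ comments.
bracket=$(printf '%s\n' "$src" | sed -zE 's:/\*([^*]|\*+[^*/])*\*+/::g')

printf 'greedy:\n%s\n\nbracket-based:\n%s\n' "$greedy" "$bracket"
```

Like every regex-only approach here, this knows nothing about string literals or // comments, so it is only suitable for input where /* and */ never appear outside real comments.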

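The Perl command, however, will fail if the file also contains single-line /* */ comments: because the pattern demands a newline, a match can start at one single-line comment and run on to the end of a later one, eating the uncommented lines in between. The safest way I can think of is to forget about trying to do this with regular expressions and instead write a little script that keeps count of opening and closing comment tags and deletes accordingly.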

Another problem is that a string with /* or */ will also break it. For example, what if you have something like:

char foo [ ] = "A comment starts with /*";

At the end of the day, the only safe way of doing this will be something like this SO answer by Ed Morton which uses a C preprocessor:

If this is in a C file then you MUST use a C preprocessor for this in combination with other tools to temporarily disable specific preprocessor functionality like expanding #defines or #includes, all other approaches will fail in edge cases. This will work for all cases:
[ $# -eq 2 ] && arg="$1" || arg=""
eval file="\$$#"
sed 's/a/aA/g; s/__/aB/g; s/#/aC/g' "$file" |
          gcc -P -E $arg - |
          sed 's/aC/#/g; s/aB/__/g; s/aA/a/g'
Put it in a shell script and call it with the name of the file you want parsed, optionally prefixed by a flag like "-ansi" to specify the C standard to apply.

See https://stackoverflow.com/a/35708616/1745001 for details.

maybe just perl -0777 -pe 's|/\*.*?\*/||gs', since they already had (.|\n)*? in the question (and . matches \n when /s is in effect IIRC). s|/\*.*?\n.*?\*/||gs has the problem that it eats uncommented lines between single-line comments, since it demands to find a newline. Change the // comments to /* .. */ to see. — ilkkachu, Apr 15 '21 at 09:41
Crap. You're right about the single line comments, @ilkkachu, but the question states "multiline" comments (see question title and "I need to delete all multi-line comments"), so I think the \n is needed. I have to find another workaround so it doesn't eat the uncommented lines between single-line comments. — terdon, Apr 15 '21 at 09:51
something like s,/\* (.(?!\*/))* \n ((.|\n)*?) \*/,,gmx came to mind. Or maybe it's simpler to do s,/\* (.*?) \*/, $1 =~ /\n/ ? "" : $& ,gsxe. — ilkkachu, Apr 15 '21 at 10:02
@ilkkachu neither of those seems to work. Time to post your answer! >:) — terdon, Apr 15 '21 at 10:07
Also might mention strings with /* in them might confuse things. Things like these are best done with a C preprocessor. — Kusalananda, Apr 15 '21 at 10:22
@Kusalananda. ANTLR is an excellent alternative to the C preprocessor for this purpose. — fpmurphy, Apr 15 '21 at 11:14
