9

I would like to print all my C comments to a separate text file.

  • Using awk, sed, grep, or bash
  • output all multi-line C comments between /* ... */ (inclusive)
  • output lines with // comments
  • Optional: print line numbers

I tried these solutions, but they did not work on Ubuntu.

The purpose of my request is to be able to quickly use the source code comments as a starting point for good documentation. I do not like the extra clutter and proprietary embedded commands (e.g. Doxygen) of dedicated documentation programs. For example, properly commenting each source code function and removing superficial one-line comments will be a great time saver and provide a nearly complete reference. This will also encourage better source code comments.

jwzumwalt
  • 259
  • 1
    How robust does it need to be? In particular if you have /* inside a string does it need to work correctly? – icarus Jan 05 '24 at 05:30
  • No special cases. My source files are plain vanilla so a simple straight forward solution is fine. I just want to improve documentation and have a starting point. – jwzumwalt Jan 05 '24 at 05:34
  • 2
    Plain vanilla source files can still have printf("/* Automatically generated file, do not edit!\n") in them! To solve this problem you pretty much need to go character by character through the source file, making choices on each one. This tends to rule out grep. Let me see what I can come up with. – icarus Jan 05 '24 at 05:42
  • Do you allow nested comments? – nobody Jan 05 '24 at 12:25
  • 1
    Programs such as Doxygen deal with all the edge cases that make this kind of thing very difficult, and a simple grep won't work. Also, it works on undocumented code, so you don't need all the 'extra clutter'. – Neil Jan 05 '24 at 15:54
  • This is not going to improve documentation. This will make the documentation harder to read, maintain, and access. That being said, the actual task itself is a perfectly legitimate exercise, and I'm interested in seeing the answer. – Mad Physicist Jan 05 '24 at 19:22
  • Start from here? https://perldoc.perl.org/perlfaq6#How-do-I-use-a-regular-expression-to-strip-C-style-comments-from-a-file? – jubilatious1 Feb 11 '24 at 07:24

5 Answers

31

There have been quite a few answers using shell magic already, but I think it can be done a lot more easily with a tool you probably already have: gcc.

diff -u <(gcc -fpreprocessed -dD -E main.c) main.c | grep '^+' | cut -c 2-

How does it work?

  1. gcc -fpreprocessed -dD -E main.c removes all comments from the file and writes the result to stdout.

  2. diff -u <(...) main.c compares that comment-free output with the original file.

  3. grep '^+' keeps only the lines starting with a +, in other words the lines that exist only in the original and were therefore identified as comments.

  4. cut -c 2- removes the leading + from each of those lines.

No need for super complex regex, perl or awk stuff while also covering all edge cases that the other answers might have missed.
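Steps 2 to 4 can be tried in isolation, without gcc, by comparing a hand-stripped copy of a file against the original. The following is only a sketch under stated assumptions: the helper name added_lines is made up for illustration, and the tail -n +3 stage is an addition that is not part of the answer's pipeline. It skips the two header lines (--- and +++) that diff -u prints first, because the +++ line would otherwise also pass the grep '^+' filter.

```shell
# added_lines STRIPPED ORIGINAL
# Hypothetical helper: print the lines that appear only in ORIGINAL.
added_lines() {
    diff -u "$1" "$2" |   # unified diff: lines only in ORIGINAL are prefixed with +
        tail -n +3 |      # assumption: drop the two-line ---/+++ header
        grep '^+' |       # keep the added lines
        cut -c 2-         # strip the leading +
}
```

In bash, the pipeline from the answer could then be written as added_lines <(gcc -fpreprocessed -dD -E main.c) main.c.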

terdon
  • 242,166
Opifex
  • 420
  • 7
    +1 for parsing C rather than using Regexp (even though I love using regexp) – RedGrittyBrick Jan 05 '24 at 15:29
  • 2
    That doesn't really work in practice on real life C code, partly because gcc -E affects the spacing (that part can be addressed with the -w option of diff), partly because it assumes comments consist of whole lines, and also because diff can include the same (uncommented) line both as -line and +line in some hunks, when the preprocessor adds some #... in addition to removing some comments. You can compare with the output of my answer's solution to see where it fails. – Stéphane Chazelas Jan 06 '24 at 13:01
  • 1
    @RedGrittyBrick, you don't need to do a full C language parsing, only tokenising like the C pre-processor does, which is not hard to do with perl regexps, certainly not as hard as parsing XML with regexps like in the Q&A you link to. – Stéphane Chazelas Jan 06 '24 at 14:34
  • @StéphaneChazelas When writing this answer, I used it on real life C code. Indeed, it doesn't remove the non-comment parts of lines containing inline comments, but... that's not what OP asked. So, yes. Our answers provide slightly different output, but for all practical purposes both are functional and do what OP asked. It's just that, in my humble opinion, using gcc is a lot simpler and readable than the big block of perl code. – Opifex Jan 07 '24 at 00:00
  • As I said, the problems are not limited to comments not taking up the full lines. Try it for instance in the (very short) Src/main.c or Src/builtinc.c in the source code of zsh which are the two files I tried it on. – Stéphane Chazelas Jan 07 '24 at 10:00
  • Adding the -w option to diff (as already noted) and -P option to gcc helps reduce the number of false positives. – Stéphane Chazelas Jan 07 '24 at 10:08
  • Also note that gcc -fpreprocessed assumes line continuations have already been preprocessed, so that means it won't work properly on files that do contain line continuations. – Stéphane Chazelas Jan 07 '24 at 10:18
  • Adding --horizon-lines=0 -d to diff also helps. Combined with -w and -P as already mentioned, that removes all false positives in the builtin.c mentioned above. – Stéphane Chazelas Jan 07 '24 at 10:43
13

It's not as trivial as it may seem once you take into account things like puts("string with /*"), bearing in mind that a " can itself occur in a character constant, as in ch = '"'.

Or line continuations:

printf("...");    /\
* yes, this is a comment */
/\
/ and this as well

Or trigraphs.

To cover those, we can adapt this answer to the opposite question to make it print rather than remove the comments:

perl -0777 -pe '
  s{
    (?<comment>
      # /* ... */ C comments
      / (?<lc> # line continuation
          (?<bs> # backslash in its regular or trigraph form
            \\ | \?\?/
          )
          (?: \n | \r\n?) # handling LF, CR and CRLF line delimiters
        )* \* .*? \* (?&lc)* /
      | / (?&lc)* / (?:(?&lc) | [^\r\n])* # // C++/C99 comments
    ) |
       "(?:(?&bs)(?&lc)*.|.)*?" # "strings" literals
       | '\''(?&lc)*(?:(?&bs)(?&lc)*(?:\?\?.|.))?(?:\?\?.|.)*?'\'' # (w)char literals
       | \?\?'\'' # trigraph form of ^
       | .[^'\''"/?]* # anything else
  }{$+{comment} eq "" ? "" : "$+{comment}\n"}exsg'

Which, on the contrived example from the other question (it covers most of the corner cases):

#include <stdio.h>
int main()
{
  printf("%d %s %s %c%c%c%c%c %s %s %d\n",
  1-/* comment */-1,
  /\
* comment */
  "/* not a comment */",
  /* multiline
  comment */
  // comment
  /\
/ comment
  // multiline\
comment
  "// not a comment",
  '"' /* comment */ , '"',
  '\'','"'/* comment */,
  '\
\
"', /* comment */
  "\\
" /* not a comment */ ",
  "??/" /* not a comment */ ",
  '??''+'"' /* "comment" */);
  return 0;
}

Gives:

/* comment */
/\
* comment */
/* multiline
  comment */
// comment
/\
/ comment
// multiline\
comment
/* comment */
/* comment */
/* comment */
/* "comment" */

Getting the line numbers is a bit trickier, because we're running in slurp mode, where the subject is the whole input rather than one line at a time. It can still be done by using the (?{code}) regexp operator to increment a counter each time a line delimiter (CR, LF or CRLF in C) is found:

perl -0777 -pe '
  s{
    (?<comment>(?{$l=$n+1})
      /
      (?<lc>  # line continuation
        (?<bs> # backslash in its regular or trigraph form
          \\ | \?\?/
        ) (?<nl>(?:\n|\r\n?) (?{$n++})) # handling LF, CR and CRLF line delimiters
      )*
      (?:
        \* (?: (?&nl) | .)*? \* (?&lc)* / # /* ... */ C comments
        | / (?:(?&lc) | [^\r\n])*         # // C++/C99 comments
      )
    ) |
       "(?:(?&bs)(?&lc)*.|.)*?" # "strings" literals
       | '\''(?&lc)*(?:(?&bs)(?&lc)*(?:\?\?.|.))?(?:\?\?.|.)*?'\'' # (w)char literals
       | \?\?'\'' # trigraph form of ^
       | (?&nl)
       | .[^'\''"/?\r\n]* # anything else
  }{$+{comment} eq "" ? "" : sprintf("%5d %s\n", $l, $+{comment})}exsg'

Which on that same sample gives:

    5 /* comment */
    6 /\
* comment */
    9 /* multiline
  comment */
   11 // comment
   12 /\
/ comment
   14 // multiline\
comment
   17 /* comment */
   18 /* comment */
   21 /* comment */
   26 /* "comment" */
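The same tokenising idea can also be written as an explicit character-by-character state machine, which some may find easier to follow than the regexp. The following is a simplified sketch, not a port of the perl code above: it assumes no trigraphs and no backslash line continuations inside the /*, */ or // markers or at the end of a // comment, and the function name c_comments is hypothetical. Each comment is prefixed with the number of the line it starts on.

```shell
# c_comments [FILE...]
# Hypothetical helper: print every C comment with its starting line number.
# Assumptions: no trigraphs, no backslash-newline inside comment markers.
c_comments() {
    awk '
    BEGIN { state = "code" }            # states: code, str, chr, block
    {
        n = length($0); i = 1; cont = 0
        while (i <= n) {
            c  = substr($0, i, 1)
            c2 = substr($0, i, 2)
            if (state == "block") {
                if (c2 == "*/") {       # end of a block comment
                    print buf "*/"; buf = ""; state = "code"; i += 2
                } else { buf = buf c; i++ }
            } else if (state == "str" || state == "chr") {
                if (c == "\\") {        # escaped character, or continuation at end of line
                    if (i == n) cont = 1
                    i += 2
                } else {
                    if (c == (state == "str" ? "\"" : "\047")) state = "code"
                    i++
                }
            } else if (c2 == "//") {    # rest of the line is a comment
                printf "%5d %s\n", NR, substr($0, i)
                break
            } else if (c2 == "/*") {    # start of a block comment
                buf = sprintf("%5d /*", NR); state = "block"; i += 2
            } else {
                if (c == "\"") state = "str"
                else if (c == "\047") state = "chr"
                i++
            }
        }
        if (state == "block") buf = buf "\n"               # comment continues on the next line
        else if (state != "code" && !cont) state = "code"  # unterminated literal: resynchronise
    }' "$@"
}
```

It can be run as c_comments main.c > comments.txt. Tracking string and character literals as separate states is what keeps "/* not a comment */" out of the output.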
  • Thank you for the education. I had no idea there were so many edge cases. I originally started to write a C program for this and got about two hours in to writing the program and realized it was going to be much more difficult than anticipated. :-) – jwzumwalt Jan 05 '24 at 11:09
  • 1
    @jwzumwalt writing a C program is probably a better approach than the answer you have currently accepted (you can change the accepted answer if you think a better one is available). However, using a program to write a C program is an even better approach. This question is almost perfect for a solution written in lex or flex, or re2c or similar tools. It should be about 50 lines. – icarus Jan 06 '24 at 18:34
2

It can be done in awk as follows:

#!/usr/bin/awk -f

# Handles case where both /* and */ are on the same line
{ line_printed = 0; }

# Find the beginning of a multiline comment
/^[[:space:]]*\/\*/ {
    multiline = 1;
    # Remove leading spaces
    sub(/^[[:space:]]+/, "");
    printf "[%d] %s\n", NR, $0;
    line_printed = 1;
}

# Find the end of a multiline comment
/\*\/[[:space:]]*$/ {
    multiline = 0;
    if (line_printed == 0)
        printf "%s", $0;
    print "\n"
    next;
}

# The content between /* and */
{
    if ( multiline == 1 && line_printed == 0 ) {
        print $0;
        next
    }
}

# A single line comment
/^[[:space:]]*\/\// {
    # Remove leading spaces
    sub(/^[[:space:]]+/, "");
    printf "[%d] %s\n\n", NR, $0;
}

Save this script as foo.awk (or any other name; the extension is optional) and then run with awk -f foo.awk input.c. The script will print all comments (separated by an extra newline) and will add the line number before every comment.

terdon
  • 242,166
td211
  • 374
  • 3
    There are several corner cases that this doesn't handle, for example int i; // loop counter, two comments on the same line, strings with /* in them. It may be good enough. – icarus Jan 05 '24 at 06:32
  • How do you use it? I tried "doc.sh < main.c", "cat main.c | doc.sh", etc. – jwzumwalt Jan 05 '24 at 06:46
  • Use it as follows: awk -f script_file source_code.c > comments.txt @jwzumwalt – td211 Jan 05 '24 at 10:42
  • strings with /* in them aren't an issue, since the regex matches the first /* that follows any number of spaces. – td211 Jan 05 '24 at 10:44
  • Works GREAT! Thx. The only special case I have (and it still works) is a line with double //, when a commented line is commented out. - :-) – jwzumwalt Jan 05 '24 at 10:55
  • You're welcome :) If it works accept it as solution for future reference. – td211 Jan 05 '24 at 10:56
  • I have already created a script in my "template" dir so all future projects include the script. I will also do a search and update to include it in current projects. Out of curiosity (yes I am getting greedy) , is it a simple task to have it include src line numbers? Something like [67] /* .... */ – jwzumwalt Jan 05 '24 at 11:00
  • It should be fairly simple, an extra counter and print statement. Do you want the leading spaces or not? – td211 Jan 05 '24 at 11:15
  • Please, one space.... [###]. FYI, 30 years ago (before I retired) I wanted to do this and never got around to it. My projects will be so much better. Where are you? I am in Idaho, USA and it is 4am. Don't you sleep??? – jwzumwalt Jan 05 '24 at 11:21
  • I'm on the other end of the world for you, it is noon here. I highly recommend that you learn awk; its syntax is close to C and it is really versatile. @jwzumwalt – td211 Jan 05 '24 at 11:33
  • Changed it for you. @jwzumwalt – td211 Jan 05 '24 at 11:50
  • I have a 4 inch thick book on awk and sed but never got around to diving into it. In my 40 year career, I probably used awk less than 10 times and sed less than 50. I have over 700 bash scripts and regex usually got things done. – jwzumwalt Jan 05 '24 at 11:50
  • I just learned the basics from a 30 ish pages book. I'm by no means an expert. – td211 Jan 05 '24 at 11:52
  • I was hoping for a source code line number, but if that is difficult then let's skip it. – jwzumwalt Jan 05 '24 at 11:53
  • Fixed again. It is easier indeed. That's what record number NR is for. @jwzumwalt – td211 Jan 05 '24 at 11:57
2

Well, this is definitely not the fanciest one, nor the most recommended, because it has some flaws. But I think it looks really cool:

awk '/\/\//; /\/\*.*\*\//; /\*\//; /^\/\*/ { a=1 } /\*\// { a=0 } a && $0 != "/*" { print }'
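For readers who want to see what each rule does, here is a commented, multi-line reading of the one-liner. It is a sketch rather than an exact copy: the function name comment_lines and the variable name inside are made up for illustration, and the second pattern of the one-liner (/\/\*.*\*\//) is left out because any line it matches is also matched by the */ pattern, so a one-line /* ... */ comment would be printed twice.

```shell
# comment_lines [FILE...]
# Hypothetical helper: an annotated variant of the one-liner above.
comment_lines() {
    awk '
    /\/\//   { print }              # any line containing //
    /\*\//   { print }              # any line containing */ (one-line comments, last line of a block)
    /^\/\*/  { inside = 1 }         # a line starting with /* opens a block
    /\*\//   { inside = 0 }         # a line containing */ closes it
    inside && $0 != "/*" { print }  # lines inside a block, except a bare /* line
    ' "$@"
}
```

Like the original, it prints whole lines rather than just the comment text, and it only notices a block comment that starts in the first column.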
Bog
  • 989
  • Thank you for the effort. This works except it cuts off (does not print) the last line containing .... */ – jwzumwalt Jan 05 '24 at 11:05
  • @jwzumwalt changed it. It does now :) – Bog Jan 05 '24 at 11:37
  • 2
    It is cool, but scary. Why don't you relax it a bit and comment what you're doing? – td211 Jan 05 '24 at 11:49
  • I didn't know about here docs before. Looks interesting to know, thanks! (Writing short but unreadable one liners is cool, but it probably will bite you later, so be careful) @Bog – td211 Jan 05 '24 at 12:07
  • @td211 oh yeah, here docs are pretty useful^^ And of course it is gonna bite me later, it bit me already many many times before. But that's the fun :) – Bog Jan 05 '24 at 12:40
  • Write docs for it, then process the docs with it :) – Mad Physicist Jan 05 '24 at 19:25
  • 3
    I get it, I also enjoy using weird and unreadable code, but I don't post it here without an explanation. The point of these sites isn't to just give a solution, it's to share knowledge, so a complex and opaque solution with no explanation isn't very helpful. – terdon Jan 06 '24 at 12:01
0

Update 2/15/24 - While learning to use Raylib, I came across a C "parser" included in their software suite that appears to do exactly what I needed; see https://github.com/raysan5/raylib. It has the added benefit of finding and nicely formatting all structs, defines, functions, callbacks, etc.

Typical example:

Function 232: DrawRectangleRounded() (4 input parameters)
  Name: DrawRectangleRounded
  Return type: void
  Description: Draw rectangle with rounded edges
  Param[1]: rec (type: Rectangle)
  Param[2]: roundness (type: float)
  Param[3]: segments (type: int)
  Param[4]: color (type: Color)
jwzumwalt
  • 259