3

I have a TOML file in the following format (categories may have any name, the sequential numbering is just an example and not guaranteed):

[CATEGORY_1]
A=1
B=2

[CATEGORY_2]
C=3
D=4

E=5

...

[CATEGORY_N]
Z=26

What I want to achieve is to retrieve the text inside a given category.

So, if I specify, let's say, [CATEGORY_1] I want it to give me the output:

A=1
B=2

I tried using grep for this, with the -z flag so that the whole file is read as a single NUL-terminated record and a match can span newlines, and with this regular expression:

(^\[.*])             # Match the category 
  ((.*\n*)+?         # Match the category content in a non-greedy way
    (?=\[|$))        # Lookahead to the start of other category or end of line

It did not work unless I removed the ^ at the beginning of the expression. However, if I do that, it misinterprets loose pairs of brackets as a category.

Is there a way to do this correctly? If not with grep, then with another tool such as sed or awk?

Kusalananda
  • 333,661
Educorreia
  • 175
  • 9

3 Answers

10

You might consider using tomlq, a TOML wrapper for jq from the yq project. It lets you retrieve the contents of a category by name using ordinary jq syntax, e.g. .name

For example, given:

$ cat file.toml 
[CATEGORY_1]
A=1
B=2

[CATEGORY_2]
C=3
D=4

E=5

[CATEGORY_N]
Z=26

then

$ tomlq -t '.CATEGORY_1' file.toml
A = 1
B = 2

... and with the section name given on the command line:

$ tomlq -t --arg section 'CATEGORY_1' '.[$section]' file.toml
A = 1
B = 2

The output is in TOML format. If you would rather have tab-delimited output:

$ tomlq -r --arg section 'CATEGORY_1' '.[$section] | to_entries[] | [ .key, .value ] | @tsv' file.toml
A       1
B       2

Use @csv in place of @tsv to get CSV output.


Since you originally asked about a grep solution, with pcregrep:

$ pcregrep -Mo '(?s)\[CATEGORY_1\]\n\K.*?(?=\n+\[)' file.toml
A=1
B=2

where (?s) makes . match \n so that .*? matches across multiple lines. You can fake it with GNU grep in PCRE mode using the -z flag:

$ grep -Pzo '(?s)\[CATEGORY_1\]\n\K.*?\n(?=\n+\[)' file.toml
A=1
B=2

Since \[CATEGORY_1\]\n has a fixed length, you could replace \[CATEGORY_1\]\n\K with a lookbehind (?<=\[CATEGORY_1\]\n) to match the lookahead (?=\n+\[) if you prefer the symmetry.
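As a rough sketch of the lookbehind variant, assuming GNU grep built with PCRE support (-P): the function name section_body_grep is made up for illustration, and tr strips the NUL byte that -z appends to each match. Like the commands above, it only matches when another [section] follows the requested one.

```shell
# Hypothetical helper: print the body of [CATEGORY_1] from the file in $1.
# Assumes GNU grep with -P support and a later [section] in the file.
section_body_grep() {
    grep -Pzo '(?s)(?<=\[CATEGORY_1\]\n).*?\n(?=\n+\[)' "$1" | tr -d '\000'
}

# Usage: section_body_grep file.toml
```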

Kusalananda
  • 333,661
steeldriver
  • 81,074
  • Despite being a fit solution, this is going into an environment in which I can't add any external dependencies on tools that aren't already available – Educorreia Jul 29 '21 at 12:20
  • 3
    @Educorreia no problem - this site aims to provide answers that others may find useful as well – steeldriver Jul 29 '21 at 12:21
7

Slightly more complex than pure sed, but with the possibility of more fine-tuning:

$ awk -v catname="[CATEGORY_1]" '/^\[.*\]$/{p=($0==catname)} p' input.toml
[CATEGORY_1]
A=1
B=2

  • You can specify the desired category name on the command line as awk variable catname.
  • Inside the program, the trailing p is a pattern without an action, so awk prints the current line whenever the flag p is non-zero (printing is awk's default action).
  • When we encounter a "category start pattern" (a line that begins with [ and ends with ]), we set p to the result of comparing $0, the current line, with the string stored in catname: 1 if the line is exactly the requested category header, 0 for any other header.

This way, everything starting from the category header up to the next category header will be printed.

Stretch goals

If you want to omit the category header, you can change

{p=($0==catname)}

to

{p=($0==catname); next}

This will then skip processing to the next line immediately after setting the flag, thereby bypassing the conditional print instruction.

If in addition you want to exclude empty lines, change the "seemingly stray" p at the end of the program to p&&NF, which will only be true if the flag p is non-zero and there is at least one "field" (i.e. non-whitespace text) on the current line.
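Putting both stretch goals together might look like the sketch below. The function name toml_section and its two positional arguments (section name without brackets, then file) are assumptions made for illustration, not part of the original command.

```shell
# Hypothetical helper: print the non-empty lines of section $1 in file $2,
# without the [section] header line.
toml_section() {
    awk -v catname="[$1]" '
        /^\[.*\]$/ { p = ($0 == catname); next }  # header: update flag, skip line
        p && NF                                   # print non-blank lines while flag is set
    ' "$2"
}

# Usage: toml_section CATEGORY_1 input.toml
```

Because the value is passed with -v, awk interprets backslash escapes in it, so this sketch assumes section names without backslashes.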

AdminBee
  • 22,803
5

If I understand you correctly, you can use this sed command:

# Choose the category until the next [ character
# and then delete any line starting with the [ character
$ sed -n '/^\[CATEGORY_2\]/,/^\[/p' file | sed '/^\[/d'
C=3
D=4

E=5
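The same range can probably be handled in a single sed call by suppressing the bracketed lines inside the range instead of piping to a second sed. This is only a sketch: the function name sed_section and its arguments are made up for illustration, and it assumes the section name contains no characters that are special in a sed regular expression.

```shell
# Hypothetical helper: print the body of section $1 in file $2 with one sed call.
# Assumes the section name has no regex metacharacters.
sed_section() {
    # Within the range from the requested header to the next header,
    # print every line that does not start with "[".
    sed -n '/^\['"$1"'\]/,/^\[/{/^\[/!p;}' "$2"
}

# Usage: sed_section CATEGORY_2 file
```

As with the two-step version, blank lines inside the section are kept, and the last section in the file runs to the end of input.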

Educorreia
  • 175
  • 9