Extract text starting at specific category header to next category header from a text file

Question

I have a TOML file in the following format (categories may have any name, the sequential numbering is just an example and not guaranteed):

[CATEGORY_1]
A=1
B=2
[CATEGORY_2]
C=3
D=4
E=5
...
[CATEGORY_N]
Z=26

What I want to achieve is to retrieve the text inside a given category.

So, if I specify, let's say, [CATEGORY_1] I want it to give me the output:

A=1
B=2

I tried using grep with the -z flag, which makes it treat the input as null-terminated records so that a pattern can span newlines, together with this regular expression:

(^\[.*])             # Match the category 
  ((.*\n*)+?         # Match the category content in a non-greedy way
    (?=\[|$))        # Lookahead to the start of other category or end of line

It only worked after I removed the ^ at the beginning of the expression. However, without that anchor it misinterprets loose pairs of brackets as a category.

Is there a way to do it correctly? If not with grep, then with another tool, such as sed or awk?

score 10 · Answer 1 · edited Aug 02 '21 at 10:40

You might consider using tomlq, a TOML wrapper for jq from the yq project, which lets you retrieve the contents of a category with the ordinary jq syntax .name

Ex. given:

$ cat file.toml 
[CATEGORY_1]
A=1
B=2
[CATEGORY_2]
C=3
D=4
E=5
[CATEGORY_N]
Z=26

then

$ tomlq -t '.CATEGORY_1' file.toml
A = 1
B = 2

... and with the section name given on the command line:

$ tomlq -t --arg section 'CATEGORY_1' '.[$section]' file.toml
A = 1
B = 2

The output is in TOML format. If you want tab-delimited output instead:

$ tomlq -r --arg section 'CATEGORY_1' '.[$section] | to_entries[] | [ .key, .value ] | @tsv' file.toml
A       1
B       2

Use @csv in place of @tsv to get CSV output.

Since you originally asked about a grep solution, with pcregrep:

$ pcregrep -Mo '(?s)\[CATEGORY_1\]\n\K.*?(?=\n+\[)' file.toml
A=1
B=2

where (?s) makes . match \n so that .*? matches across multiple lines. You can fake it with GNU grep in PCRE mode using the -z flag:

$ grep -Pzo '(?s)\[CATEGORY_1\]\n\K.*?\n(?=\n*\[)' file.toml
A=1
B=2

Since the header has a fixed length, you could replace \[CATEGORY_1\]\n\K with a lookbehind, (?<=\[CATEGORY_1\]\n), mirroring the lookahead (?=\n+\[), if you prefer the symmetry.
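A minimal sketch of the lookbehind variant described above, assuming GNU grep built with PCRE support (-P). The sample file and the function name grep_section_body are made up for illustration. Because -z makes grep terminate each match with a NUL byte, the output is piped through tr to turn that byte into a newline.

```shell
# Hypothetical sample data, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '[CATEGORY_1]' 'A=1' 'B=2' '[CATEGORY_2]' 'C=3' > "$tmp/sample.toml"

# grep_section_body FILE: print the lines between [CATEGORY_1] and the next "[".
# The fixed-length lookbehind replaces "\[CATEGORY_1\]\n\K" from the answer.
# Assumption: GNU grep with -P; other grep implementations may not support it.
grep_section_body() {
    grep -Pzo '(?s)(?<=\[CATEGORY_1\]\n).*?(?=\n+\[)' "$1" | tr '\0' '\n'
}

grep_section_body "$tmp/sample.toml"
```

Like the original pattern, this only matches when another header follows, so it finds nothing if the requested category is the last one in the file.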

Despite being a fit solution, this is going into an environment in which I can't add any external dependencies on tools that aren't already available — Educorreia, Jul 29 '21 at 12:20
@Educorreia no problem - this site aims to provide answers that others may find useful as well — steeldriver, Jul 29 '21 at 12:21

AdminBee · Answer 2 · 2021-07-29T11:47:35.010

Slightly more complex than pure sed, but with the possibility of more fine-tuning:

$ awk -v catname="[CATEGORY_1]" '/^\[.*\]$/{p=($0==catname)} p' input.toml
[CATEGORY_1]
A=1
B=2

You can specify the desired category name on the command line as the awk variable catname.
Inside the program, the current line is printed whenever the flag p is non-zero: a pattern with no action, such as the bare p at the end, defaults to printing the line.
Whenever we encounter a "category start pattern" (a line that begins with [ and ends with ]), we set p to the result of checking whether $0, the current line, equals the string stored in catname: 1 if the line is exactly the requested header, 0 for any other header.

This way, everything starting from the category header up to the next category header will be printed.
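As a rough illustration, the one-liner can be wrapped in a small shell function so the category name and file become arguments. The function name print_section and the sample file are hypothetical. Note that awk processes backslash escapes in -v values, so this sketch assumes category names without backslashes.

```shell
# Hypothetical sample data, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '[CATEGORY_1]' 'A=1' 'B=2' \
              '[CATEGORY_2]' 'C=3' 'D=4' 'E=5' \
              '[CATEGORY_N]' 'Z=26' > "$tmp/input.toml"

# print_section NAME FILE: print the header [NAME] and every line after it
# up to, but not including, the next header line.
print_section() {
    # On a header line, p becomes 1 only if the line equals "[NAME]" exactly;
    # the bare p pattern then prints every line while p is non-zero.
    awk -v catname="[$1]" '/^\[.*\]$/ { p = ($0 == catname) } p' "$2"
}

print_section CATEGORY_2 "$tmp/input.toml"
```

Because the header is compared with string equality rather than a regular expression, characters such as . or * in a category name are matched literally.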

Stretch goals

If you want to omit the category header, you can change

{p=($0==catname)}

to

{p=($0==catname); next}

This will then skip processing to the next line immediately after setting the flag, thereby bypassing the conditional print instruction.

If in addition you want to exclude empty lines, change the "seemingly stray" p at the end of the program to p&&NF, which will only be true if the flag p is non-zero and there is at least one "field" (i.e. non-whitespace text) on the current line.
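Putting both stretch goals together might look like the sketch below. The function name section_body and the sample file, which deliberately contains blank lines, are assumptions made for the example.

```shell
# Hypothetical sample data with blank lines between and inside sections.
tmp=$(mktemp -d)
printf '%s\n' '[CATEGORY_1]' 'A=1' '' 'B=2' '' \
              '[CATEGORY_2]' 'C=3' > "$tmp/input.toml"

# section_body NAME FILE: print only the non-empty lines inside [NAME].
section_body() {
    # "next" skips the header line itself after updating the flag;
    # "p && NF" prints a line only while the flag is set and the line
    # has at least one field (i.e. it is not empty or whitespace-only).
    awk -v catname="[$1]" '/^\[.*\]$/ { p = ($0 == catname); next } p && NF' "$2"
}

section_body CATEGORY_1 "$tmp/input.toml"
```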

score 5 · Accepted Answer · edited Jul 29 '21 at 14:07

If I understand you correctly, you can use this sed command:

# Print from the category header through the next line starting with [,
# then delete every line starting with [ (the two headers themselves)
$ sed -n '/^\[CATEGORY_2\]/,/^\[/p' file | sed '/^\[/d'
C=3
D=4
E=5
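The same idea can probably be expressed in a single sed invocation by filtering the header lines inside the range instead of piping to a second sed. This is a sketch, not the answerer's command. The sample file and the function name sed_section_body are made up for illustration.

```shell
# Hypothetical sample data, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '[CATEGORY_1]' 'A=1' 'B=2' \
              '[CATEGORY_2]' 'C=3' 'D=4' 'E=5' \
              '[CATEGORY_N]' 'Z=26' > "$tmp/file"

# sed_section_body FILE: print the body of [CATEGORY_2] without header lines.
sed_section_body() {
    # Within the range from the [CATEGORY_2] header to the next line that
    # starts with "[", print only the lines that do not start with "[".
    sed -n '/^\[CATEGORY_2\]/,/^\[/{/^\[/!p;}' "$1"
}

sed_section_body "$tmp/file"
```

If the requested category is the last one in the file, the range simply runs to the end of the input.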

schrodingerscatcuriosity · answered Jul 29 '21 at 11:16 · edited by Educorreia Jul 29 '21 at 14:07

If you add ^ before the \[ in both expressions, I don't think you have to assume that the content doesn't have any [. Besides that, I think it should work, thank you – Educorreia Jul 29 '21 at 11:19
@Educorreia could the content have [, leading or not? – schrodingerscatcuriosity Jul 29 '21 at 11:22
Yes, it should only interpret [ as a category delimiter if it's right in the beginning of the line – Educorreia Jul 29 '21 at 11:24
Simpler (should work): sed -n '/^\[CATEGORY_2\]+,/^\[/-p'. – D. Ben Knoble Jul 30 '21 at 14:18
