
Can awk † find the nth iteration of a "{" and return everything up to the next "}" character?

[EDIT: yes... solution from Ed Morton at bottom]

† I've been assuming awk is the correct tool for this job. Other ideas are welcome.

I need to isolate blocks of text in hundreds of files. Some files have only one block, but others contain dozens.

sample:

$ cat samp2.txt
//////////////////////////////////
// North Carolina office
// satellite branch
//////////////////////////////////
   {
   first   "John"
   last    "Doe"
   address "163 Main Street"
   age     "25"
   gender  "male"
   }

It may be best to redirect (>) the current block into a temp file so the script can operate on it before addressing the next. The blocks will end up in separate files anyway.

I suspect awk can be given an index to find the nth match. The bash script can manage the loop and iteration.

I've gotten close

$ awk '/\{/{flag=1;next}/\}/{flag=0}flag' samp2.txt 
   first   "John"
   last    "Doe"
   address "163 Main Street"
   age     "25"
   gender  "male"

However, since the above operates on the entire file, it doesn't work for files containing more than one block (e.g. below). Irrespective of how many blocks are in a file, I need every block separated so it can be processed individually.

Some files contain comments, but many do not, and there is no standard. I discard them, but the inconsistency means comments can't be relied upon for tracking where we are. The only givens are the curly braces (and the line separation).

The text is always newline-separated, but there is not always a blank line between blocks. The data pairs vary, so this can't be a simple "grep 5 lines and proceed" solution.

$ cat samp3.txt 
//GROUP1
{
first       "John"
address     "124 Main Street"
last    "Jones"
special     "supervisor"
age "35"
gender      "male"
}

//The fourth group
{
first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"
}
{
first       "John"
address     "523 Main Street"
last    "Jingle"
age "40"
gender      "male"
}


My above awk statement runs through all groups, mashing them all into one large paragraph.

$ awk '/\{/{flag=1;next}/\}/{flag=0}flag' samp3.txt
first       "John"
address     "124 Main Street"
last    "Jones"
special     "supervisor"
age "35"
gender      "male"
first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"
first       "John"
address     "523 Main Street"
last    "Jingle"
age "40"
gender      "male"

I need to tell awk to look for the nth "{" and then dump to the nth "}" separately, like this instead:

first       "John"
address     "124 Main Street"
last    "Jones"
special     "supervisor"
age "35"
gender      "male"
 (awk exits, bash script does its thing)

first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"
 (awk exits, bash script does its thing)

first       "John"
address     "523 Main Street"
last    "Jingle"
age "40"
gender      "male"
 (awk exits, bash script does its thing)

[etc]

The intent is similar to a non-greedy regex match of the nth "{ .+ }" .
With that, there may be a perl solution that's smarter?

TIA.

This code got me what I need. Adapted from Ed Morton's answer.

awk -v n="$LoopVariable" -v RS='}' 'NR==n{gsub(/.*\{\r?\n|\n$/,""); print}' "$SourceFile"

EDITS: Input really helped me isolate my question to what I need. Thank you for that.


I've found a few SE questions that seem similar, but if these contain my solution I'm not well-versed enough in awk to see the connection.

Glorfindel
zedmelon
  • 1
    You are being needlessly verbose. After talking so much when the time came to mention the most important thing "(do some stuff)" , you kept quiet on that. Now could you add the missing piece of info in the form of an expected output. – guest_7 Aug 09 '21 at 00:14
  • @guest_7 You're right... sorry. Brevity's never been my strong suit. My first sentence was "isolate the groups." However it got lost in all that garbage. I've reduced a fair amount, and hopefully it's clearer now. I need to either loop an awk statement with an incremental index (probably more likely), or have awk loop and spit out some (temp files?) that can be processed externally to the awk statement. – zedmelon Aug 09 '21 at 01:13
  • @ilkkachu I've trimmed my question and hopefully made it clearer. I need to operate on each group before moving on to the next group. Either by an awk loop, or by looping an awk statement that can use an index to keep advancing. – zedmelon Aug 09 '21 at 01:15
  • 1
    What starts a group - the "{" or the "//GROUP"? What ends the group, the "}" or the blank line? My thought is that you can probably process this in awk in paragraph mode by setting the RS to either the empty string or }. – icarus Aug 09 '21 at 02:24
  • 1
    (1) I see “blocks of code” and “group(s) of data”, and I wonder whether you’re using these terms interchangeably or you have two different concepts.  (2) As guest_7 says, you’re leaving out too much detail.  Does the input contain only blocks, or is there stuff (that should be ignored) between blocks?  (I guess the //GROUP lines should be ignored, right?)  Can there be { and/or } anywhere other than at the beginning or end of a block?  And, yeah, it would help if you gave a *hint* at what ‘processing’ you want to do, and/or show us the code you have written.  … (Cont’d) – Scott - Слава Україні Aug 09 '21 at 04:53
  • 1
    (Cont’d) … (3) Do you need the awk script to count the blocks? Please do not respond in comments; [edit] your question to make it clearer and more complete. – Scott - Слава Україні Aug 09 '21 at 04:53
  • 1
    "me being open to someone saying "hey awk isn' the right tool for this; use X"" -- well, Bash is definitely not the right tool for text processing: https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice – ilkkachu Aug 09 '21 at 08:43
  • 1
    @zedmelon, so, you want to have the awk script print just one of the blocks on each run? Probably with an argument to tell which one...? Which means that you don't want to do anything with the data blocks in the awk script, you don't want to print each block with empty lines in between them, and you don't want the awk script to pause between each block. Instead you want to have a shell loop run that awk script repeatedly, to get one block at a time? Yes, you're right, you are moving goalposts... – ilkkachu Aug 09 '21 at 17:00
  • @ilkkachu apologize--it wasn't my intent. I didn't fully understand what I needed in order to articulate it properly. Once I did, Ed Morton gave a solution that no doubt you could've provided last night. Wish I could buy you a beer for your trouble. :,( – zedmelon Aug 09 '21 at 17:10
  • @ilkkachu, icarus, and scott: I'm catching up on comments. The text files reminded me of code, but I'm really processing as text, so I removed "code" from the question. Thank all of you for helping me consolidate my thoughts and get the question presentable. I think it now describes what I meant to say the first time. I really do appreciate your time...thanks again. – zedmelon Aug 09 '21 at 17:22
  • @ilkkachu Thank you for the link--a lot of info there. I've got it in a new tab for reading tonight. – zedmelon Aug 09 '21 at 17:39

3 Answers

4

I don't see the expected output in your question, so I'm not sure, but you did ask 'Can awk † find the nth iteration of a "{" and return everything up to the next "}" character?', so is this what you're trying to do (using any awk and assuming } and { can't appear anywhere else in your input)?

$ awk -v n=2 -v RS='}' 'NR==n{gsub(/.*\{\n|\n$/,""); print}' samp3.txt
first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"

If you want to call that in a shell loop:

$ for i in {1..3}; do
    awk -v n="$i" -v RS='}' 'NR==n{gsub(/.*\{\n|\n$/,""); print}' samp3.txt
    echo "-----"
done
first       "John"
address     "124 Main Street"
last    "Jones"
special     "supervisor"
age "35"
gender      "male"
-----
first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"
-----
first       "John"
address     "523 Main Street"
last    "Jingle"
age "40"
gender      "male"
-----

but there's almost certainly a better way to do whatever it is you want to do than calling awk multiple times in a loop. For example, call awk once to print each block with a terminating } and then read that into a shell array for further processing (readarray -d needs bash 4.4 or later):

$ readarray -d '}' -t arr < <(awk 'BEGIN{RS=ORS="}"} {gsub(/.*\{\n|\n$/,"")} $0~/[^[:space:]]/' samp3.txt)
$ for i in "${arr[@]}"; do printf '%s\n' "$i"; echo "-----"; done
first       "John"
address     "124 Main Street"
last    "Jones"
special     "supervisor"
age "35"
gender      "male"
-----
first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"
-----
first       "John"
address     "523 Main Street"
last    "Jingle"
age "40"
gender      "male"
-----

In reality, though, whatever it is you're doing in the shell loop should probably also be done inside the one call to awk.
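As a rough sketch of what "one call to awk" could look like, assuming the end goal is one file per block: the script below numbers the blocks itself and writes each to its own file. The directory `$workdir`, the `block_N.txt` naming, and the small sample input are made up for illustration; the `gsub()` is the same one used above, with `\r?` added to tolerate DOS line endings.

```shell
# Hypothetical scratch directory and a small stand-in for samp3.txt
workdir=$(mktemp -d)
printf '%s\n' '//GROUP1' '{' 'first "John"' 'last "Jones"' '}' '' \
              '//The fourth group' '{' 'first "John"' 'last "Jacob"' '}' \
              '{' 'first "John"' 'last "Jingle"' '}' > "$workdir/samp.txt"

awk -v RS='}' -v dir="$workdir" '
    /\{/ {                                  # skip trailing text that holds no block
        gsub(/.*\{\r?\n|\n$/, "")           # drop everything up to "{" and the final newline
        out = dir "/block_" (++n) ".txt"    # assumed naming scheme: block_1.txt, block_2.txt, ...
        print > out
        close(out)                          # avoid running out of open file descriptors
    }
' "$workdir/samp.txt"

ls "$workdir"
```

Any per-block processing could then replace the `print > out` line, or the shell can pick up the `block_N.txt` files afterwards.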

zedmelon
Ed Morton
  • yes, thank you! Your first answer is very close to the testing I did while you made your first edit. Still wrapping my head around the second. Thank you again! – zedmelon Aug 09 '21 at 17:17
  • FYI, the latest edit of your first awk statement doesn't work anymore. awk -v n=$Loop -v RS='{[^}]+}' 'NR==n{print RT}' left the curly braces, but the current version leaves extra lines. – zedmelon Aug 09 '21 at 17:30
  • I'm not sure which versions you're referring to but I just finished tweaking it now I know it is what you wanted and have no plans to make any other changes so try it now and let me know if it doesn't do what you want. Obviously make sure to refresh your browser tab before copy/pasting the script to make sure you really do have the latest version. – Ed Morton Aug 09 '21 at 17:39
  • Here's an image showing two awk commands. The 1st command was your answer when I first saw it (and it's very close to what I need--I can loop that and use tr to strip the braces). The 2nd command is currently in your answer, and it leaves extra lines. https://ibb.co/82Y1VFQ – zedmelon Aug 09 '21 at 18:42
  • 1
    You don't need tr or anything else if you're using awk. Check if your input file has DOS line endings (see https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it?noredirect=1) and if so get rid of them or change \{\n to \{\r?\n in the gsub() to accommodate them. If you can also have other white space after the { then use \{[[:blank:]]*\r?\n. – Ed Morton Aug 09 '21 at 18:46
  • 1
    The reason I removed the first script I posted is that it would only work in GNU awk and that was completely unnecessary for your problem so I posted one that'll work in any awk instead. The problem you're hitting now is there are characters in your input file that weren't shown/stated in your question so we just need to handle them and that's trivial now we have an idea that they exist. – Ed Morton Aug 09 '21 at 19:03
  • 1
    Thank you Ed... that's it! I understand now (both your line of reasoning and the operation of awk). – zedmelon Aug 09 '21 at 19:10
1

My code makes assumptions that may not hold, in which case it will fail. There may also be more efficient solutions.

Assumption 1 Every GROUP block is separated by a blank line

Assumption 2 You want an action taken at each block

Assumption 3 Every GROUP block increments (If not, you may end up with a lot of empty files.)

for i in {1..5}; do
  awk -F"\n" -v RS="" -v inc="GROUP$i" '$0~inc{printf( "%s\n", $0); next}' "$inputfile" | sed '/\/\|{\|}/d' > "output$i.txt"
done

Your example has GROUP1 and GROUP4; I added a GROUP5 and wrote a for loop over the range 1-5. Each number in the range is used as a key when going through the GROUP blocks. If you have more groups, you can increase the range accordingly.

awk will be used within the loop to extract the blocks. sed will clean up (awk can do this all in one but I am still learning) after which each block is written to its own output file matching the number of the GROUP block.
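For what it's worth, here is a sketch of how the sed clean-up might be folded into awk itself. It is untested beyond the tiny stand-in input it creates, and `$outdir` and the shortened sample are assumptions for illustration. Each paragraph is split into lines, and any line containing `/`, `{` or `}` is skipped, which mirrors what the sed expression deletes.

```shell
# Assumed stand-in input: two paragraphs separated by a blank line
inputfile=$(mktemp)
outdir=$(mktemp -d)
printf '%s\n' '//GROUP1' '{' 'first "John"' 'last "Jones"' '}' '' \
              '//GROUP4' '{' 'last "Jacob"' '}' '{' 'last "Jingle"' '}' > "$inputfile"

for i in 1 4; do
  awk -v RS="" -v inc="GROUP$i" '
      $0 ~ inc {                              # paragraph whose text matches GROUP<i>
          n = split($0, line, "\n")           # break the paragraph into its lines
          for (k = 1; k <= n; k++)
              if (line[k] !~ "[/{}]")         # skip comment and brace lines
                  print line[k]
      }
  ' "$inputfile" > "$outdir/output$i.txt"
done

cat "$outdir/output4.txt"
```

Note that the key is matched as a regex, so `GROUP1` would also match a paragraph labelled `GROUP10`.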

Input File

//GROUP1
{
first       "John"
address     "124 Main Street"
last    "Jones"
special     "supervisor"
age "35"
gender      "male"
}

//GROUP4
{
first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"
}
{
first       "John"
address     "523 Main Street"
last    "Jingle"
age "40"
gender      "male"
}

//GROUP5
{
first "Maria"
address "188 John Street"
last "Phones"
special "Supervisors supervisor"
age "35"
gender "Female"
}

Output

cat output1.txt
first       "John"
address     "124 Main Street"
last    "Jones"
special     "supervisor"
age "35"
gender      "male"

cat output4.txt
first       "John"
address     "125 Main Street"
last    "Jacob"
age "30"
gender      "male"
first       "John"
address     "523 Main Street"
last    "Jingle"
age "40"
gender      "male"

cat output5.txt
first "Maria"
address "188 John Street"
last "Phones"
special "Supervisors supervisor"
age "35"
gender "Female"

sseLtaH
  • 1
    Wow, this is really cool--definitely bookmarking it for future reference. I didn't clearly define my request and have edited my question to better describe what I need. – zedmelon Aug 09 '21 at 16:55
  • Assumption 1 is spot on. Assumption 2 is correct, but I only care about isolating blocks between { and }--groups are loosely outlined by comments, which are inconsistent if they even exist. Unfortunately I set up assumption 3 with my sample text (fixed), and it's incorrect. – zedmelon Aug 09 '21 at 16:59
1

You were almost there. Tweaking your code a bit will get you the individual blocks:

awk -v n="$loopVar" '/\{/{f=1;++i;next} /\}/{f=0} i==n&&f' file

Caveats:-

  • /\{/ will match an opening brace anywhere.
  • Somewhat better is: NF==1&&$1=="{"
  • Same for the closing brace as well.
  • Before running awk, run your input file through the dos2unix utility to strip the carriage returns (\r).
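
Putting the caveats above together, a stricter version might look like the sketch below. The helper name `nth_block` and the sample file are made up for illustration, and stripping `\r` inside awk is a stand-in for dos2unix, not part of the original one-liner.

```shell
# Hypothetical sample with DOS line endings and a brace inside a value
sample=$(mktemp)
printf '%s\r\n' '//GROUP1' '{' 'note "a { inside a value"' 'last "Jones"' '}' \
                '{' 'last "Jacob"' '}' > "$sample"

nth_block() {   # usage: nth_block N FILE
    awk -v n="$1" '
        { sub(/\r$/, "") }                      # strip a trailing carriage return, if any
        NF==1 && $1=="{" { f=1; ++i; next }     # a line holding only "{" opens block i
        NF==1 && $1=="}" { f=0 }                # a line holding only "}" closes it
        i==n && f                               # print the lines inside block n
    ' "$2"
}

nth_block 1 "$sample"
nth_block 2 "$sample"
```

Because the brace tests require the brace to be the only field on its line, the `{` inside the quoted value no longer starts a new block.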
guest_7
  • Thanks! I remember waaaaaaay back when I was just learning to script (ksh), the only thing I knew about awk was the '{print $2}' function (plenty of UUoE in those days too). Someone told me awk is a powerful scripting language, which was so far above my head I couldn't even comprehend it. These days I can at least pick apart stuff like your answer and kinda figure out how you arrived at it. Awk is pretty badass--I think I need to look for an O'Reilly book. – zedmelon Aug 10 '21 at 13:50