Re-formatting a text file's columns with awk

Question

OK, since this is a complex question, I will explain it step by step. I have a file with the content shown below:

$ cat File1
ABC Cool Lol POP {MNB}
ABC Cool Lol POP {MNB}
ABC Cool Lol POP {MNB}
ABC Cool Lol POP {TBMKF}
ABC Cool Lol POP {YUKER}
ABC Cool Lol POP {EFEFVD}

The output that I want

-Cool MNB + POP ;
-Cool MNB + POP ;
-Cool MNB + POP ;
-Cool TBMKF + POP ;
-Cool YUKER + POP ;
-Cool EFEFVD + POP ;

First, I tried to take the last column out of File1 and print it with sed 's/[{}]//g' File1 > File3

After that, I copied the whole content of File1 to a new file, File4:

cp File1 File4

Then I replaced the last column in File4 with the data from File3 (that is, the last column of File1 without the braces):

awk 'FNR==NR{a[NR]=$1;next}{$5=a[FNR]}1' File3 File4 >>File5

The output should look like this:

ABC Cool Lol POP MNB
ABC Cool Lol POP MNB
ABC Cool Lol POP MNB
ABC Cool Lol POP TBMKF
ABC Cool Lol POP YUKER
ABC Cool Lol POP EFEFVD

Finally, I tried

awk -F" " '{print - $2,$5 +,$4 ";"}' File5

But the result is not what I want: only MNB is listed on every line, and the other values from the last column of File1 do not show up.

I'm not sure what you mean, but I am just a beginner with awk. This is a task that I need to get done, and I am trying my best to do it one step at a time, based on my understanding of awk. – heng960407, Sep 15 '16 at 13:31
Please change your title to something more specific to your problem. This will make it easier for others who have similar questions in future to find it. At the moment "A question about awk" is very general. – Tom Fenech, Sep 16 '16 at 10:38

score 16 · Accepted Answer · answered Sep 15 '16 at 13:43

I don't know why you are copying things left and right. The simple thing is

awk '{print "-" $2, substr($5,2,length($5)-2), "+", $4, ";"}' File1

I put the - at the beginning and the ; at the end.

In between we print

$2 because we want it as it is.
a substring of $5, which is the string without the first and the last character. We skip the first character by starting at position 2 (awk has always been strange about that) and leave out the last character by selecting a substring that is two characters shorter than the original $5
the + because we want it
and then $4

However, I'm not sure if all these string functions are specific to GNU awk.
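
As a small illustration of the substring step, the following sketch runs the same substr call on one sample value from File1. The shell variable names field and inner are only for illustration and are not part of the command above.

```shell
# Sample value of $5 from File1, used here only for illustration.
field='{TBMKF}'

# substr(s, 2, length(s) - 2): start at the 2nd character (awk counts from 1)
# and keep everything except the first and last characters.
inner=$(printf '%s\n' "$field" | awk '{ print substr($1, 2, length($1) - 2) }')

printf '%s\n' "$inner"   # prints TBMKF
```

Here length($1) is 7, so the call keeps 5 characters starting at position 2.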

Bananguin

substr(string, 2) returns the substring starting from the second character, like cut -c2-, tail -n +2, sed '2,$'... What's so strange about that? – Stéphane Chazelas Sep 15 '16 at 14:30
That command is standard and would even work with the original awk from the 70s. – Stéphane Chazelas Sep 15 '16 at 14:54
@StéphaneChazelas: Ah, I've been waiting for you :-) Usually we start counting at 0 which means index 2 is the third position, but here the second position is at index 2. Thanks for clarifying the remaining GNU question. – Bananguin Sep 15 '16 at 14:58
@Bananguin, in Unix shell and utilities as shown in the few examples above, we start at 1, not 0. Most notable exceptions are ksh's arrays and ${var:offset} (both copied by bash). All other shell arrays start at 1. See also Is there a reason why the first element of a Zsh array is indexed by 1 instead of 0? – Stéphane Chazelas Sep 15 '16 at 15:21
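
The 1-based convention discussed in these comments can be checked with a quick sketch. It assumes a POSIX shell with cut and awk available; the variable names s, with_cut and with_awk are made up for the example.

```shell
s='{MNB}'

# Both tools count characters from 1, so "2" means "from the second character".
with_cut=$(printf '%s\n' "$s" | cut -c2-)
with_awk=$(printf '%s\n' "$s" | awk '{ print substr($0, 2) }')

printf '%s %s\n' "$with_cut" "$with_awk"   # prints MNB} MNB}
```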

Costas · Answer 2 · 2016-09-16T05:51:01.420

7

With sed

sed '
    s/\S\+\s/-/
    s/\(\S\+\s\)\{2\}{\(\S\+\)}/\2 + \1;/
    ' File1

And awk variation

awk -F"[[:blank:]{}]+" '{print "-" $2, $5, "+", $4}' ORS=" ;\n" File1

edited Sep 16 '16 at 05:51

answered Sep 15 '16 at 13:53

Kaz · Answer 3 · 2016-09-15T17:35:34.803

6

Easy TXR job:

$ txr -c '@(repeat)
@a @b @c @d {@e}
@(do (put-line `-@b @e + @d ;`))
@(end)' -
ABC Cool Lol POP {MNB}
ABC Cool Lol POP {MNB}
ABC Cool Lol POP {MNB}
ABC Cool Lol POP {TBMKF}
ABC Cool Lol POP {YUKER}
ABC Cool Lol POP {EFEFVD}
[Ctrl-D][Enter]
-Cool MNB + POP ;
-Cool MNB + POP ;
-Cool MNB + POP ;
-Cool TBMKF + POP ;
-Cool YUKER + POP ;
-Cool EFEFVD + POP ;

Using TXR Lisp awk macro to transliterate Awk solution:

 txr -e '(awk (t (prn `-@[f 1] @{[f 4] [1..-1]} + @[f 3] ;`)))'

Fields are in the f list, and indexing is zero based.

edited Sep 15 '16 at 17:35

answered Sep 15 '16 at 17:27

+1 for the Lisp and the most cryptic look! That language MUST compete in pcg (programming code golf) – Archemar Sep 15 '16 at 18:21
@Archemar TXR doesn't compete in golfing very well because there are specialized languages designed for that which do things like assign functions to individual characters, which can then be strung together to achieve composition. – Kaz Sep 15 '16 at 18:30
@Archemar Put an entry in: http://codegolf.stackexchange.com/questions/68712/output-the-next-kana – Kaz Sep 15 '16 at 20:53
@Kaz Is there a TXR tutorial somewhere ? The man page seems rather huge. How does it perform compared to awk ? – bli Sep 21 '16 at 08:21
@bli GNU Awk is something like at least 30 times faster at basic field splitting through a large file than the TXR awk macro, which is some 220+ lines of interpreted code, including the overall loop for processing input sources into records and fields. – Kaz Sep 21 '16 at 16:11

Ray · Answer 4 · 2016-09-15T23:43:59.747

Using awk is easiest when the $1,$2,... fields already contain the exact strings you want to work with. The field separator, if it contains more than one character, is interpreted as a regular expression. We don't need to do any search and replace or substring operations to get rid of the {curly braces}. We just count them as part of the delimiter.

awk -F'[ {}]+' '{printf("-%s %s + %s ;\n", $2, $5, $4)}'

Using printf instead of print also makes it a bit easier to see how the string will be formatted, but if you want to have print "-"$2,$5" + "$4";" instead of printf("-%s %s + %s ;\n", $2, $5, $4), that's an option.
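
To see what the regular-expression separator does, this sketch lists every field of one sample line. The closing } leaves an empty sixth field, which the command above never uses. The variable names line and fields are illustrative assumptions.

```shell
line='ABC Cool Lol POP {EFEFVD}'

# Split on runs of spaces and braces, then list every field with its index.
fields=$(printf '%s\n' "$line" |
    awk -F'[ {}]+' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

Field 5 already holds EFEFVD without braces, which is why no substr or sed step is needed.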
