6

I'm trying to parse an HTML file using shell scripting.

There are 4 different patterns that I need to match: name= , age= , class= , marks=.

Using

grep "name=\|age=\|class=\|marks=" student.txt

I'm able to get the required lines, but along with these matching lines I also need to print the second line from each match which contains the score.

Referring to the question: Print Matching line and nth line from the matched line.

I modified the code to:

awk '/name=\|age=\|class=\|marks=/{nr[NR]; nr[NR+2]}; NR in nr' student.txt

But this does not seem to work. How do I search for multiple regular expressions in the same awk command?

debal
  • 3,704
  • You can't realistically parse tag-based markup languages like HTML and XML using bash or utilities such as grep, sed or cut. If you just want to dump/render HTML, see (lynx|elinks|links|links2|w3m) -dump, html2text, or vilistextum. For parsing out pieces of data, see tidy + (xmlstarlet|xmlgawk|xpath|xml2), or learn xslt. Ask #xml and #html for more help. See http://xrl.us/bkrxog and http://xrl.us/p0ny

    This is a factoid returned from greybot on irc.freenode.org

    – Valentin Bajrami Aug 29 '13 at 15:34

4 Answers

9

Try with:

awk '/foo/||/bar/' Input.txt
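
As a rough illustration of the above, the sketch below runs both the `||` form and the equivalent single-regexp alternation on a made-up three-line file. The file contents and the variable names (`sample`, `a`, `b`) are assumptions for the example, not part of the question.

```shell
# Build a tiny sample file (contents are hypothetical).
sample=$(mktemp)
printf '%s\n' 'foo 1' 'baz 2' 'bar 3' > "$sample"

# Two patterns joined with the awk logical OR operator.
a=$(awk '/foo/||/bar/' "$sample")

# The same test written as one extended regexp with alternation.
b=$(awk '/foo|bar/' "$sample")

printf '%s\n' "$a"
rm -f "$sample"
```

Both forms should select the same lines here; the `||` form can be easier to read when each pattern is long.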
Rahul Patil
  • 24,711
4

awk regexps are extended regexps, while grep's are basic regexps unless you pass -E. With an extended regexp:

awk '/name=|age=|class=|marks=/{nr[NR]; nr[NR+2]}; NR in nr'
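
A minimal sketch of that command on a made-up input, to show the matching line plus the second line after it being printed. The sample contents and the `sample` and `out` variables are assumptions for illustration only.

```shell
# Build a small sample file (contents are hypothetical).
sample=$(mktemp)
printf '%s\n' 'name=' 'n1' 'n2' 'other' 'age=' 'a1' 'a2' > "$sample"

# On a match, remember the current line number and the one two lines later;
# print any line whose number has been remembered.
out=$(awk '/name=|age=/{nr[NR]; nr[NR+2]}; NR in nr' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

With this input, the lines `n1`, `other` and `a1` should be skipped, since they are neither matches nor two lines after a match.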

Note that standard basic regexps do not have an alternation operator, so

grep 'a\|b'

will not work in every grep (a few, such as GNU grep, support it as an extension).

grep -E 'a|b'
grep -e a -e b
grep 'a
b'

Will work in every grep though.
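
To make the comparison concrete, here is a small sketch that runs the three portable forms against the same made-up file. The file contents and the names `sample`, `r1`, `r2`, `r3` are assumptions for the example.

```shell
# Build a tiny sample file (contents are hypothetical).
sample=$(mktemp)
printf '%s\n' 'apple' 'cherry' 'banana' > "$sample"

# 1. Extended regexp with alternation.
r1=$(grep -E 'apple|banana' "$sample")

# 2. Several patterns, one per -e option.
r2=$(grep -e apple -e banana "$sample")

# 3. Several patterns separated by a newline inside one argument.
r3=$(grep 'apple
banana' "$sample")

printf '%s\n' "$r1"
rm -f "$sample"
```

All three should select `apple` and `banana` and skip `cherry`.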

2

Using grep

What if you used the after context switch to grep (-A) and specified a 1 to get the first line after a match?

$ grep -E -A 1 "name=|age=|class=|marks=" student.txt

Example

Sample file.

$ cat student.txt 
name=
1st line after name
2nd line after name
age=
1st line after age
2nd line after age
class=
1st line after class
2nd line after class
marks=
1st line after marks
2nd line after marks

Then when you execute the above command:

$ grep -E -A 1 "name=|age=|class=|marks=" student.txt
name=
1st line after name
--
age=
1st line after age
--
class=
1st line after class
--
marks=
1st line after marks

Using awk

As @RahulPatil suggested, you can use this construct in awk:

'/string1/||/string2/||...'

Something like this would do what you're looking for.

$ awk '
  /name=/||/age=/||/class=/||/marks=/{nr[NR]; nr[NR+1]}; NR in nr
' student.txt 

Example

$ awk '
  /name=/||/age=/||/class=/||/marks=/{nr[NR]; nr[NR+1]}; NR in nr
' student.txt
name=
1st line after name
age=
1st line after age
class=
1st line after class
marks=
1st line after marks
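
The question asked for the second line after each match rather than the first. One possible variant, sketched here under the assumption that only the match and the line at a fixed offset are wanted, passes the offset in as an awk variable. The variable name `n` and the shortened sample file are hypothetical.

```shell
# Build a shortened version of the sample file (contents are hypothetical).
sample=$(mktemp)
printf '%s\n' 'name=' '1st line after name' '2nd line after name' \
              'age=' '1st line after age' '2nd line after age' > "$sample"

# n is the offset of the extra line to print after each match.
out=$(awk -v n=2 '/name=/||/age=/{nr[NR]; nr[NR+n]}; NR in nr' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Changing `n=2` to `n=1` should give the same behaviour as the command above.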
slm
  • 369,824
1

Have you tried using the "-A" flag with grep? It will print lines of trailing context after the match. For example: grep -A1 foo file.txt will match and print lines with the word foo and also print the line immediately following.