
I need to get the line that follows a matched pattern, and then extract the value between two patterns from that line.

Sample Source file

<h2>Commodity Information</h2>

<dl>
        <dt>Commodity Orgin</dt>
        <dd>uerb45e001.material.com</dd>

        <dt>Commodity Code & Dimension</dt>
        <dd>151151.15 Dim 90 </dd>

        <dt>Commodity Serial #</dt>
        <dd>2009081020</dd>

        <dt>Client Name</dt>
        <dd>Jack</dd>

</dl>

Desired Output:

Commodity Orgin : uerb45e001.material.com
Commodity Code & Dimension : 151151.15 Dim 90
Commodity Serial # : 2009081020
Client Name : Jack
ramp
    Stock advice: use an XML-aware tool, such as xmlstarlet or xsltproc. – Michael Vehrs Jun 09 '16 at 10:00
  • @spasic curl --proxy <proxy_server_details:port> --url http://commodity.co/ | grep -e "Commodity Orgin" -e "Commodity Code & Dimension" -e "Commodity Serial #" -e "Client Name" -A1 > commodity.txt
    awk -F '>' '/Proxy Hostname/{getline; print $2}' commodity.txt
    – ramp Jun 09 '16 at 10:05
  • @MichaelVehrs : Neither xmlstarlet nor xsltproc is available in our env. – ramp Jun 09 '16 at 10:07
  • @ramp add the command you have tried to the question details... I tried a solution with paste, grep with a Perl regex, and sed.. would that work for you? – Sundeep Jun 09 '16 at 10:30

3 Answers


Use lynx -dump to convert the HTML to plain text, then awk to reformat the output, setting the field separator to a newline (\n) and the record separator to a blank line (RS='\n\n').

The sub() function calls in the awk script remove excess spaces before printing the required output.

$ lynx -dump ramp.html | 
    awk -v RS='\n\n' -F'\n' '/^[[:space:]]+/ {
        sub(/^ +/,"",$1);
        sub(/ +/," ",$2);
        print $1":"$2
    }'
Commodity Orgin: uerb45e001.material.com
Commodity Code & Dimension: 151151.15 Dim 90
Commodity Serial #: 2009081020
Client Name: Jack

I really don't like to do this because it's never a good idea to parse XML or HTML with regular expressions. It doesn't work. Even if you can hack it up so that it seems like it works, it is extremely fragile and WILL break as soon as the HTML or XML changes enough from what your regexps are looking for. A real XML or HTML parser is the only thing that can do the job properly.

But, with that said, here's something that uses only sed and fmt, tools which should be available on any unix-like system:

$ sed -e '/<d[td]\|^[[:blank:]]*$/!d
          s/<[^>]*>//g;
          s/^ *//;
          /^\(Commodity\|Client\)/ s/$/:/' ramp.html | 
      fmt |
      sed -e '/^[[:blank:]]*$/d'
Commodity Orgin: uerb45e001.material.com
Commodity Code & Dimension: 151151.15 Dim 90
Commodity Serial #: 2009081020
Client Name: Jack

The first sed script deletes all lines except blank lines and lines containing either a <dt> or <dd> tag, then strips all HTML tags from the input, deletes leading spaces, and adds a : to the end of the field-name lines. The output from sed is then piped into fmt to reformat the lines, then into sed again to delete blank lines.

This is a hack, and is only guaranteed to work on exactly the sample input you provided. Anything substantially different is likely to break the script. That's what happens when you use regular expressions to parse any but the most trivial HTML or XML.

cas
  • In my environment the lynx tool is not installed, and it is a little tedious to get it installed. When I tried using only the awk command, I got the output below:
    Commodity Code & Dimension
    :
    151151.15 Dim 90
    Commodity Serial #
    :
    2009081020
    Client Name
    :
    Jack
    – ramp Jun 09 '16 at 09:44
  • @ramp, parsing HTML is generally dangerous. Unless you are confident that the HTML tags will always be line-based, I encourage you to consider alternatives. – Jeff Schaller Jun 09 '16 at 10:59
  • @ramp You should try to get lynx installed if you can, or links, which is very similar (and can be used interchangeably with lynx in this particular case, but not in most other cases). There is no way that awk script will work without lynx or links - it was written to work with the output from either of them, not with raw HTML. In case it's not obvious, I was using lynx as an HTML parser to convert the input from HTML to plain text, so it could be processed with awk. – cas Jun 09 '16 at 13:13
  • lynx is a very common text mode web browser, and particularly useful for this kind of task because it is very forgiving of broken HTML - as a web browser, it has to be in order to cope with all the truly awful HTML code that millions of sites on the web emit. Most XML & HTML parsers are very strict in what they accept, they expect properly formatted valid XML/HTML and often just won't work with sufficiently broken input. – cas Jun 09 '16 at 13:16
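
Since the comments note that neither lynx nor xmlstarlet is available, here is a minimal awk-only sketch (my addition, not from the thread). It assumes each <dt> and <dd> element sits complete on its own line, exactly as in the OP's sample, and will break on anything less regular:

    # Pair each <dt> label with the following <dd> value.
    # Splitting on '<' and '>' puts the tag text in $3.
    awk -F'[<>]' '
        /<dt>/ { label = $3 }              # remember the field name
        /<dd>/ { sub(/ +$/, "", $3)        # trim trailing spaces
                 print label " : " $3 }    # print "name : value"
    ' ramp.html

On the sample input this prints lines such as Commodity Orgin : uerb45e001.material.com, matching the desired output.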

If you had xmlstarlet, and the input were (massaged into) valid XML, you could do something like this:

xmlstarlet sel --text -t -m //dt -v 'concat(., " : ", following::dd)' -nl input.html
paste -d: <(grep -oP '<dt>\K.*(?=<)' file.html) <(grep -oP '<dd>\K.*(?=<)' file.html) | sed 's/:/ : /'

Commodity Orgin : uerb45e001.material.com
Commodity Code & Dimension : 151151.15 Dim 90 
Commodity Serial # : 2009081020
Client Name : Jack
  • two grep commands extract the text between the <dt> and <dd> tags (assuming the opening and closing tags are on the same line, as in the OP's sample file)
  • paste combines the two files line by line with : as separator
  • the sed command replaces that : separator with ' : ' as per the OP's expected output (this hack won't work if the text between the tags also contains the : character)
  • See this answer for explanation on use of \K and (?=)
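
As a quick toy illustration of how \K and the lookahead behave (my addition, not part of the original answer; requires GNU grep built with PCRE support):

    # \K discards everything matched so far (the <dt> tag) from the
    # output; (?=<) asserts a following '<' without consuming it,
    # so only the text between the tags is printed.
    echo '<dt>Client Name</dt>' | grep -oP '<dt>\K.*(?=<)'
    # prints: Client Name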
Sundeep