lynx -dump to convert the HTML to plain text, and then awk to reformat the output, setting the field separator to a newline (\n) and the record separator to two-or-more newlines (\n\n+). The sub() function calls in the awk script remove excess spaces before printing the required output.
$ lynx -dump ramp.html |
awk -v RS='\n\n+' -F'\n' '/^[[:space:]]+/ {
sub(/^ +/,"",$1);
sub(/ +/," ",$2);
print $1":"$2
}'
Commodity Orgin: uerb45e001.material.com
Commodity Code & Dimension: 151151.15 Dim 90
Commodity Serial #: 2009081020
Client Name: Jack
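The record and field splitting can be tried without lynx. What follows is a minimal sketch, not the command above: it swaps the regex record separator for awk's standard paragraph mode (an empty RS), and it feeds in a made-up two-record sample rather than real lynx -dump output. The function name join_pairs is hypothetical.

```shell
#!/bin/sh
# join_pairs (hypothetical name): read blank-line-separated records whose
# first line is a label and second line is a value; print "label: value".
join_pairs() {
    # An empty RS selects paragraph mode: records are separated by one or
    # more blank lines. Setting FS to a newline makes each line a field.
    awk -v RS= -F'\n' '{
        sub(/^[ \t]+/, "", $1)   # strip leading indentation from the label
        sub(/^[ \t]+/, "", $2)   # strip leading indentation from the value
        print $1 ": " $2
    }'
}

# Made-up input imitating the indented layout of a text dump.
printf '   Client Name\n   Jack\n\n\n   Commodity Serial #\n   2009081020\n' |
    join_pairs
```

Because paragraph mode treats any run of blank lines as one separator, the double blank line in the sample does not produce an empty record.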
I really don't like to do this because it's never a good idea to parse XML or HTML with regular expressions. It doesn't work. Even if you can hack it up so that it seems like it works, it is extremely fragile and WILL break as soon as the HTML or XML changes enough from what your regexps are looking for. A real XML or HTML parser is the only thing that can do the job properly.
But, with that said, here's something that uses only sed and fmt, tools which should be available on any unix-like system:
$ sed -e '/<d[td]\|^[[:blank:]]*$/!d
s/<[^>]*>//g;
s/^ *//;
/^\(Commodity\|Client\)/ s/$/:/' ramp.html |
fmt |
sed -e '/^[[:blank:]]*$/d'
Commodity Orgin: uerb45e001.material.com
Commodity Code & Dimension: 151151.15 Dim 90
Commodity Serial #: 2009081020
Client Name: Jack
The first sed script deletes all lines except blank lines and lines containing either a <DT> or <DD> tag, then it strips all HTML tags from the input, deletes leading spaces and adds a : to the end of the field name lines. The output from sed is then piped into fmt to reformat the lines, then into sed again to delete blank lines.
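To see what the first sed stage does on its own, here is a rough sketch run against a made-up four-line fragment (not the real ramp.html, whose layout may differ). It is written with sed -E so the alternation is a plain |, since \| in basic regular expressions is a GNU extension. The function name strip_dl is hypothetical, and the sketch stops before the fmt stage.

```shell
#!/bin/sh
# strip_dl (hypothetical name): first stage of the pipeline only.
# The four sed commands, in order:
#   1. delete every line that is neither blank nor contains <dt or <dd
#   2. remove all HTML tags
#   3. remove leading spaces
#   4. append ":" to lines that start with Commodity or Client
strip_dl() {
    sed -E '
        /<d[td]|^[[:blank:]]*$/!d
        s/<[^>]*>//g
        s/^ *//
        /^(Commodity|Client)/ s/$/:/
    '
}

# Made-up definition-list fragment; the <dl> wrapper lines get deleted.
printf '%s\n' '<dl>' '  <dt>Client Name</dt>' '  <dd>Jack</dd>' '</dl>' |
    strip_dl
```

The result is the field name and its value on separate lines (Client Name: followed by Jack), which is why the full pipeline still needs fmt to join each pair onto one line.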
This is a hack, and is only guaranteed to work on exactly the sample input you provided. Anything substantially different is likely to break the script. That's what happens when you use regular expressions to parse any but the most trivial HTML or XML.
xmlstarlet or xsltproc. – Michael Vehrs Jun 09 '16 at 10:00
awk -F '>' '/Proxy Hostname/{getline; print $2}' commodity.txt – ramp Jun 09 '16 at 10:05