awk to match and cut out fields with alternating delimiter

Question

I would like to use awk or a similar tool to match patterns in a Chrome bookmarks file and, depending on the match, cut out a specific field using a different field delimiter.

I have attached a sample picture. I still haven't figured out how to attach as a file.

I want the folder names in case the string H3 is matched and the URL in case the string HREF is encountered.

The following two commands do the job for the respective matches:

awk -F'[<>]' '/H3/{print $5}' bookmarks.html
awk -F'"' '/HREF/{print $2}' bookmarks.html

My goal is to combine the two statements above so the output becomes:

UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7

I have tried awk's if/else, but the results were inconclusive.

How do I achieve this as a one-liner? Are there better candidates than awk? Python or Perl would both be great; however, a one-liner is an absolute requirement, as writing a shell script that does the job would be an easy task.

The text is very long and has ugly formatting. It can't be added as it contains URLs, and as a beginner I am not allowed more than 1 URL in my post – HenrikJson, Feb 19 '17 at 21:18
The {} produced something that looks like only a partial code extract -> no good – HenrikJson, Feb 19 '17 at 21:49
You don't need a one-liner to make it easy to script, but here is one anyway: awk -F'[<>]' '/<H3/{print $5} /HREF="/{sub(/[^"]*"/,"");sub(/".*/,"");print}' bookmark.html – dave_thompson_085, Feb 21 '17 at 02:53
dave_thompson_085, if you add that comment as an answer I will mark it as correct. If you have time, please also add annotations on how to read the awk command. Parts of it are clear to me, but not all – HenrikJson, Feb 21 '17 at 08:11
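
The one-liner from the comment above is easier to read when spread over several lines. What follows is a commented sketch of the same logic, not an official annotation by its author; the wrapper name `extract_bookmarks` is made up for illustration.

```shell
# Same logic as the one-liner in the comment, with one rule per block.
# extract_bookmarks is a hypothetical wrapper name, not part of the thread.
extract_bookmarks() {
    awk -F'[<>]' '
        # Folder line: with "<" and ">" as field separators, $1 is the
        # indentation, $2 is "DT", $3 is empty, $4 is "H3 ADD_DATE=...",
        # and $5 is the folder name.
        /<H3/ { print $5 }

        # Link line: delete everything up to and including the first
        # double quote, then everything from the next double quote on.
        # What remains of the line is the URL.
        /HREF="/ {
            sub(/[^"]*"/, "")
            sub(/".*/, "")
            print
        }
    ' "$@"
}
```

Called as `extract_bookmarks bookmarks.html`, it reads the file; without arguments it reads standard input.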

Costas · Accepted Answer · 2017-02-21T10:03:13.717

2

This is the wrong way to process HTML files with sed/awk/… There are a few dedicated parsers, but as a temporary substitute:

sed '
    /\n/{P;d;}
    /<H3/s/[><]/\n/4g
    /HREF/s/"/\n/g
    D
    ' bookmarks.htm
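
The `4g` flag in the script above combines a number with `g`, which is a GNU sed extension: the first three matches are left alone, and the fourth and all later ones are replaced. A small demo on a single line, assuming GNU sed; it substitutes `|` instead of a newline so the effect is visible, and `demo_4g` is a made-up name.

```shell
# Assumes GNU sed: "4g" means "replace the 4th match and every later one".
# demo_4g is a hypothetical helper name used only for this illustration.
demo_4g() {
    printf '%s\n' '<DT><H3 ADD_DATE="1">UNIX</H3>' | sed 's/[<>]/|/4g'
}
demo_4g
```

The first three angle brackets (around `DT` and before `H3`) stay in place; the folder name ends up isolated between the replaced ones.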

For non-GNU versions of sed:

sed '
    /\n/{P;d;}     #if there is more than 1 line, «P»rint the 1st line, then «d»elete everything
    /<\/H3/s//\n/  #replace «</H3» with a «\n»ewline
    /\n/s/">/\n/   #replace «">» with a «\n»ewline if the previous command did a replacement
    /HREF/s/"/\n/g #put «\n»ewlines around the url if «HREF» is in the line
    D              #«D»elete the first line, go back to the start
    ' bookmarks.htm

edited Feb 21 '17 at 10:03

answered Feb 19 '17 at 21:53

Costas


Thanks, that gives the urls but not the headers, trying to adapt the part: /<H3/s/[><]/\n/4g – HenrikJson Feb 20 '17 at 21:33

@HenrikJson That can happen if you use a non-GNU sed: the 4g construction is not recognized. In that case you have to replace it with /<\/H3/s//\n/;/\n/s/">/\n/ – Costas Feb 21 '17 at 06:17
Costas, if you add that comment plus your original command as an answer, I will mark it as correct. If you have time, please also add annotations on how to read the sed command. Parts of it are clear to me, but not all – HenrikJson Feb 21 '17 at 08:11
@HenrikJson see updated – Costas Feb 21 '17 at 10:06

JJoao · Answer 2 · 2017-02-20T15:33:54.623

1

Using an XML/HTML parser or processor has some advantages. XPath expressions are the standard way to select specific parts.

xml + xmlstarlet + xpath

If the input is well-formed XML, we can use xmlstarlet with an XPath expression:

xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html

html + xmllint : xml

If the input is just valid HTML, we can convert it to XML (using xmllint) and then use the previous command:

xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -

xmllint + xpath

We can also use xmllint with an XPath expression directly:

xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html

... but the output format is not the same...

edited Feb 20 '17 at 15:33

answered Feb 20 '17 at 00:17

JJoao


Could you explain what this is? – Feb 20 '17 at 00:53
@DarkHeart, I added some more information. – JJoao Feb 20 '17 at 12:58
On the Cygwin installation I am running, neither xmllint nor xpath is available – HenrikJson Feb 20 '17 at 21:29

@HenrikJson, you can install both xmllint (setup-x86_64 -qP libxml2) and xmlstarlet in cygwin. – JJoao Feb 20 '17 at 23:41

score 1 · Answer 3 · answered Feb 21 '17 at 08:27

One last answer: this time a Perl one-liner

perl -nE 'say $1 if (/<h3.*?>(.*?)<\/h3>/i or /href="(.*?)"/i)' ex.html

(I believe that XML-parser-based solutions are better, but since you have a tool-generated file, the number of surprises should be small.)

score 0 · Answer 4 · answered Feb 20 '17 at 21:59

For now I have dropped the one-liner requirement and written a script instead.

I had to post this as a response as it would have been too long for a comment. Still, feel free to respond.

This script does the job but is too sluggish. Can anyone speed it up or, alternatively, suggest a one-liner?

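#!/bin/sh
# Print folder names (H3 lines) and URLs (HREF lines) from a bookmarks file.
file=$1
while IFS= read -r line
do
    # Quote "$line" so whitespace and glob characters survive unchanged.
    hdr=$(printf '%s\n' "$line" | awk -F'[<>]' '/H3/{print $5}')
    url=$(printf '%s\n' "$line" | awk -F'"' '/HREF/{print $2}')
    if [ -n "$url" ]; then
        printf '%s\n' "$url"
    elif [ -n "$hdr" ]; then
        printf '%s\n' "$hdr"
    fi
done <"$file"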
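
The loop above is slow mainly because it starts two awk processes for every input line. A single awk pass can apply both rules by calling `split()` with a different separator per rule. The following is a minimal sketch under that assumption; `print_bookmarks` is a made-up function name, and the patterns are the ones from the two original commands.

```shell
# One awk process for the whole file instead of two per line.
# print_bookmarks is a hypothetical function name.
print_bookmarks() {
    awk '
        # Link line: split on double quotes; the URL is the second piece.
        # "next" gives HREF precedence, as in the if/elif of the loop.
        /HREF/ { split($0, q, "\""); print q[2]; next }

        # Folder line: split on "<" or ">"; the name is the fifth piece.
        /H3/   { split($0, t, /[<>]/); print t[5] }
    ' "$@"
}
```

Run as `print_bookmarks bookmarks.html`, it should print folder names and URLs in file order, which is the output format requested in the question.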

Here is the file (finally got it):

<html xmlns="http://www.w3.org/1999/xhtml">
<body>
  <h1>Bookmarks</h1>
  <dl>
    <dd>
        <DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
      <dl>
        <dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
        <dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces"        add_date="1484738308">url-2</a></dt>
      </dl>
    </dd>
    <dd>
        <DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
      <dl>
        <dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061"    add_date="1486550866">url-3</a></dt>
        <dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&amp;p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
        <dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
      </dl>
    </dd>
    <dd>
        <DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
      <dl>
        <dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
        <dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
      </dl>
    </dd>
  </dl>
</body>
</html>
