
example bookmarks file

I would like to use awk or a similar tool to match patterns in a Chrome bookmarks file and, depending on the match, extract a specific field using a different field delimiter for each pattern.

I have attached a sample picture. I still haven't figured out how to attach as a file.

I want the folder name when the string H3 is matched and the URL when the string HREF is matched.

The following two commands do the job for the respective matches:

awk -F'[<>]' '/H3/{print $5}' bookmarks.html
awk -F'"' '/HREF/{print $2}' bookmarks.html

My goal is to combine the two statements above so the output becomes:

UNIX
url-1
url-2
OCE
url-3
url-4
url-5
ANDROID
url-6
url-7

I have tried awk's if/else, but without success.

How do I achieve this as a one-liner? Are there better candidates than awk? Python or Perl would both be great. However, a one-liner is a hard requirement, since writing a shell script that does the job would be easy.

Rui F Ribeiro

4 Answers


sed and awk are the wrong tools for processing HTML files; dedicated parsers exist for that. As a temporary substitute, though:

sed '
    /\n/{P;d;}
    /<H3/s/[><]/\n/4g
    /HREF/s/"/\n/g
    D
    ' bookmarks.htm

For non-GNU versions of sed, which do not recognize the combined 4g flag:

sed '
    /\n/{P;d;}     #if there is more than 1 line, «P»rint the 1st line, then «d»elete everything
    /<\/H3/s//\n/  #replace «</H3» with a «\n»ewline
    /\n/s/">/\n/   #replace «">» with a «\n»ewline if the previous command matched
    /HREF/s/"/\n/g #put «\n»ewlines around the URL if the line contains «HREF»
    D              #«D»elete the first line and restart the cycle on the remainder
    ' bookmarks.htm
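
To make the annotations easier to follow, here is a small sketch that prints the pattern space of a single H3 line right after the two substitutions, using sed's l command. It assumes a sed that treats \n in the replacement as a newline (GNU sed does; other implementations may not). The helper name show_pattern_space and the shortened attribute values are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show the pattern space after the two H3 substitutions.
# Assumes a sed that turns \n in the replacement into a newline (e.g. GNU sed).
show_pattern_space() {
    printf '%s\n' "$1" | sed -n '
        /<\/H3/s//\n/
        /\n/s/">/\n/
        l
    '
}

# Shortened sample line; the attribute values are placeholders.
show_pattern_space '<DT><H3 ADD_DATE="1" LAST_MODIFIED="2">UNIX</H3>'
```

With GNU sed, the folder name should show up between two \n markers. That is the state in which D drops the leading tag text, after which P prints the folder name on the next cycle.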
Costas
  • Thanks, that gives the URLs but not the headers. I am trying to adapt this part: /<H3/s/[><]/\n/4g – HenrikJson Feb 20 '17 at 21:33
  • @HenrikJson That can happen if you use a non-GNU sed, where the 4g construct is not recognized. In that case, replace it with /<\/H3/s//\n/;/\n/s/">/\n/ – Costas Feb 21 '17 at 06:17
  • Costas, if you add that comment plus your original command to the answer, I will mark it as correct. If you have time, please also add annotations on how to read the sed command. Parts of it are clear to me, but not all. – HenrikJson Feb 21 '17 at 08:11
  • @HenrikJson see updated – Costas Feb 21 '17 at 10:06

Using an XML/HTML parser has some advantages: XPath expressions are the standard way to select specific parts of a document.

xml + xmlstarlet + xpath

If the input is well-formed XML, we can use xmlstarlet with an XPath expression:

xmlstarlet sel -t -v '//h3|//a/@href' -nl bookmarks.html

html + xmllint : xml

If the input is only valid HTML, we can convert it to XML with xmllint and then use the previous command:

xmllint -html -xmlout ex.html | xmlstarlet sel -t -v '//h3|//a/@href' -nl -

xmllint + xpath

We can also use xmllint with an XPath expression directly:

xmllint -html -xpath '//h3/text()|//a/@href' bookmarks.html

... but the output format is not the same...

JJoao

One last answer, this time a Perl one-liner:

perl -nE 'say $1 if (/<h3.*?>(.*?)<\/h3>/i or /href="(.*?)"/i)' ex.html

(I believe that solutions based on an XML parser are better, but since the file is tool-generated, it should hold few surprises.)

JJoao

For now I have dropped the one-liner requirement and written a script instead.

I had to post this as an answer because it would have been too long for a comment. Still, feel free to respond.

This script does the job but is too slow. Can anyone speed it up, or alternatively suggest a one-liner?

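#!/bin/sh
# Usage: pass the bookmarks file as the first argument.
file=$1
while IFS= read -r line
do
    # Folder name: 5th field when splitting on < and >
    hdr=$(printf '%s\n' "$line" | awk -F'[<>]' '/H3/{print $5}')
    # URL: 2nd field when splitting on double quotes
    url=$(printf '%s\n' "$line" | awk -F'"' '/HREF/{print $2}')
    # Quote the expansions so URLs with spaces or glob characters survive
    if [ -n "$url" ]; then
        printf '%s\n' "$url"
    elif [ -n "$hdr" ]; then
        printf '%s\n' "$hdr"
    fi
done <"$file"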
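
Most of the time in the loop above probably goes into starting two awk processes per input line. A possible way around that, sketched here and not tuned in any way, is a single awk process that applies a different delimiter per match with split() instead of one global -F. The function name extract_bookmarks and the temporary sample file are made up for illustration; the sample lines are copied from the file posted below.

```shell
#!/bin/sh
# Sketch: one awk process handles both patterns. split() applies a
# per-match delimiter, so no single global -F is needed.
extract_bookmarks() {
    awk '
        /<H3/  { split($0, f, "[<>]"); print f[5]; next }  # folder name
        /HREF/ { split($0, f, "\"");   print f[2] }        # URL
    ' "$1"
}

# Small sample taken from the posted bookmarks file.
sample=$(mktemp)
cat >"$sample" <<'EOF'
        <DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
      <dl>
        <dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
        <dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
      </dl>
EOF

extract_bookmarks "$sample"
```

On the sample this prints ANDROID followed by the two URLs, one per line. The awk program on its own also fits on one line, which covers the original one-liner requirement. Like the two original commands, it matches the upper-case H3 and HREF that the export uses and nothing else.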

Here is the file (finally got it):

<html xmlns="http://www.w3.org/1999/xhtml">
<body>
  <h1>Bookmarks</h1>
  <dl>
    <dd>
        <DT><H3 ADD_DATE="1484311924" LAST_MODIFIED="1485532328">UNIX</H3>
      <dl>
        <dt><a HREF="http://unix.stackexchange.com/questions/223182/how-to-replace-spaces-in-all-file-names-with-underscore-in-linux-using-shell-scr" add_date="1484311897">url-1</a></dt>
        <dt><a HREF="http://unix.stackexchange.com/questions/81349/how-do-i-use-find-when-the-filename-contains-spaces"        add_date="1484738308">url-2</a></dt>
      </dl>
    </dd>
    <dd>
        <DT><H3 ADD_DATE="1486550854" LAST_MODIFIED="1487228526">OCE</H3>
      <dl>
        <dt><a HREF="http://www.oraclecertificationprep.com/apex/f?p=OCPSG%3AEXAM_DETAILS%3A%3A%3ANO%3A%3AP2_EXAM%3A1Z0-061"    add_date="1486550866">url-3</a></dt>
        <dt><a HREF="http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=303&amp;p_certName=SQ1Z0_047" add_date="1486550898">url-4</a></dt>
        <dt><a HREF="https://www.quora.com/How-do-you-prepare-for-an-Oracle-Database-SQL-exam" add_date="1486550950">url-5</a></dt>
      </dl>
    </dd>
    <dd>
        <DT><H3 ADD_DATE="1487084050" LAST_MODIFIED="1487228595">ANDROID</H3>
      <dl>
        <dt><a HREF="https://material.io/guidelines/style/color.html#" add_date="1487228526">url-6</a></dt>
        <dt><a HREF="https://developer.android.com/index.html" add_date="1487228539">url-7</a></dt>
      </dl>
    </dd>
  </dl>
</body>
</html>