
I need to extract text from an HTML web page with a bash script. I tried this solution, but it doesn't work well for me because I then have to edit and format the output text. I need the text that follows this tag:

<p><p tabindex="0">

For example, when I browse https://apps.ubuntu.com/cat/applications/clementine/ in Firefox, the page source contains the following:

<p><p tabindex="0">Clementine is a multiplatform music player focusing on a fast and easy-to-use interface for searching and playing your music.</p><p tabindex="0">Summary of included features :</p><ul><li>Search and play your local music library.</li><li>Listen to internet radio from Last.fm, SomaFM and Magnatune.</li><li>Tabbed playlists, import and export M3U, XSPF, PLS and ASX.</li><li>Visualisations from projectM.</li><li>Transcode music into MP3, Ogg Vorbis, Ogg Speex, FLAC or AAC</li><li>Edit tags on MP3 and OGG files, organise your music.</li><li>Download missing album cover art from Last.fm.</li><li>Native desktop notifications using libnotify.</li><li>Supports MPRIS, or remote control using the command-line.</li><li>Remote control using a Wii Remote, MPRIS or the command-line.</li><li>Copy music to your iPod, iPhone, MTP or mass-storage USB player.</li><li>Queue manager. It is largely a port of Amarok 1.4, with some features rewritten to take advantage of Qt4.</li></ul></p>

How can I extract the text that follows <p><p tabindex="0">? Remember that this is only an example; the website and the tag could change.

Output='Clementine is a multiplatform music player focusing on a fast and easy-to-use interface for searching and playing your music.'


Thanks, John1024. Your solution works fine with the Ubuntu site, but with the openSUSE site the text isn't displayed completely. This is the variable for the openSUSE site:

output="$(wget -q http://software.opensuse.org/package/clementine -O - | sed -n 's/.*<p id="pkg-desc">\([^<]*\).*/\1/p')"

echo $output

Clementine is a modern music player and library organiser. Clementine is a (the text is cut off here)

I also tried:

output="$(wget -q http://software.opensuse.org/package/clementine/ -O - | sed -ne '/<p id="pkg-desc">/, /[.</p>]$/p')"

echo $output

<p id="pkg-desc">Clementine is a modern music player and library organiser. Clementine is a port of Amarok 1.4, with some features rewritten to take advantage of Qt4. (the output still includes '<p id="pkg-desc">')

davidva
  • 160

1 Answer


In general, one should use a tool that understands HTML. For limited purposes, though, a simple command may suffice. In this case, sed is sufficient to do what you ask and works well in bash scripts. If you have saved the source HTML as index.html, then:

$ sed -n 's/.*<p><p tabindex="0">\([^<]*\).*/\1/p' index.html 
Clementine is a multiplatform music player focusing on a fast and easy-to-use interface for searching and playing your music.

Or, to capture the html and process it all in one step:

$ wget -q https://apps.ubuntu.com/cat/applications/clementine/ -O - | sed -n 's/.*<p><p tabindex="0">\([^<]*\).*/\1/p' 
Clementine is a multiplatform music player focusing on a fast and easy-to-use interface for searching and playing your music.

To capture that output to a bash variable:

output="$(wget -q https://apps.ubuntu.com/cat/applications/clementine/ -O - | sed -n 's/.*<p><p tabindex="0">\([^<]*\).*/\1/p')"

The -n option is passed to sed. This tells it not to print anything unless we explicitly ask. sed goes through the input line by line, looking for a line that matches .*<p><p tabindex="0">\([^<]*\).*. All the text between <p tabindex="0"> and the next tag is captured in group 1. The s command then replaces everything on that line with just the captured text (\1), and the trailing p flag prints the result.
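
To see the capture group at work without fetching anything, here is a minimal sketch that runs the same sed expression on a sample line. The html variable and its shortened content are made up for illustration, modelled on the page source quoted in the question:

```shell
# Hypothetical, shortened sample of the page source (not fetched from the site).
html='<p><p tabindex="0">Clementine is a multiplatform music player.</p><p tabindex="0">Summary of included features :</p>'

# -n turns off automatic printing; the trailing p prints only lines where
# the substitution succeeded.
# \([^<]*\) captures every character up to the next "<"; \1 refers back to it.
output="$(printf '%s\n' "$html" | sed -n 's/.*<p><p tabindex="0">\([^<]*\).*/\1/p')"

printf '%s\n' "$output"
```

Because [^<]* stops at the first "<", only the text of the first paragraph is kept, and lines that do not contain the tag print nothing.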

John1024
  • 74,655
  • Hi, work fine with ubuntu site, but with opensuse doesn't display correctly. this is the variable with OpenSuse site: output="$(wget -q http://software.opensuse.org/package/clementine -O - | sed -n 's/.*<p id="pkg-desc">\([^<]*\).*/\1/p')". I tried also: output="$(wget -q http://software.opensuse.org/package/clementine/ -O - | sed -ne '/<p id="pkg-desc">/, /[.</p>]$/p')"
    – davidva Mar 16 '14 at 21:22
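
The truncated output suggests that the description on that page continues on a second line. sed works line by line, so [^<]* stops at the end of the first line. A possible sketch for that case, assuming the description sits between <p id="pkg-desc"> and the next </p> and may span several lines, is shown below. The sample html and the position of the line break are assumptions based on the output quoted in the question, not the real page source:

```shell
# Hypothetical sample: the description is assumed to wrap onto a second line.
html='<div>
<p id="pkg-desc">Clementine is a modern music player and library organiser. Clementine is a
port of Amarok 1.4, with some features rewritten to take advantage of Qt4.</p>
</div>'

# 1. The sed range keeps the lines from the opening tag through the next line containing </p>.
# 2. tr joins those lines into a single line.
# 3. The last sed drops everything up to the opening tag and from </p> onward,
#    removes any remaining tags, squeezes repeated spaces, and trims a trailing space.
desc="$(printf '%s\n' "$html" |
  sed -n '/<p id="pkg-desc">/,/<\/p>/p' |
  tr '\n' ' ' |
  sed 's/.*<p id="pkg-desc">//; s/<\/p>.*//; s/<[^>]*>//g; s/  */ /g; s/ $//')"

printf '%s\n' "$desc"
```

To try it against the live page, replace the printf with the wget -q ... -O - command from the question. This is still plain pattern matching, so for anything more involved a tool that understands HTML remains the safer choice.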