I need to extract text from an HTML web page with a bash script. I tried this solution, but it doesn't work well for me because I still have to edit and format the output text. I need the text that follows this tag:
<p><p tabindex="0">
For example, I am browsing https://apps.ubuntu.com/cat/applications/clementine/
Firefox shows the following markup in the page source:
<p><p tabindex="0">Clementine is a multiplatform music player focusing on a fast and easy-to-use interface for searching and playing your music.</p><p tabindex="0">Summary of included features :</p><ul><li>Search and play your local music library.</li><li>Listen to internet radio from Last.fm, SomaFM and Magnatune.</li><li>Tabbed playlists, import and export M3U, XSPF, PLS and ASX.</li><li>Visualisations from projectM.</li><li>Transcode music into MP3, Ogg Vorbis, Ogg Speex, FLAC or AAC</li><li>Edit tags on MP3 and OGG files, organise your music.</li><li>Download missing album cover art from Last.fm.</li><li>Native desktop notifications using libnotify.</li><li>Supports MPRIS, or remote control using the command-line.</li><li>Remote control using a Wii Remote, MPRIS or the command-line.</li><li>Copy music to your iPod, iPhone, MTP or mass-storage USB player.</li><li>Queue manager. It is largely a port of Amarok 1.4, with some features rewritten to take advantage of Qt4.</li></ul></p>
How can I extract the text above that comes right after <p><p tabindex="0">? Remember, this is only an example; the website and the tag could change.
Output='Clementine is a multiplatform music player focusing on a fast and easy-to-use interface for searching and playing your music.'
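A minimal sketch of one way to get that output, assuming the wanted text is always the first <p tabindex="0"> element and contains no nested tags. The function name `extract_first_p` is hypothetical, and the HTML is an inlined, shortened sample rather than the live page:

```shell
#!/bin/bash
# Hypothetical helper: read HTML on stdin and print the text of the
# first <p tabindex="0"> element.
# Assumption: the paragraph text contains no nested tags.
extract_first_p() {
    # grep -o prints every match on its own line, head keeps the first,
    # and sed strips the opening tag from it.
    grep -o '<p tabindex="0">[^<]*' | head -n 1 | sed 's/<p tabindex="0">//'
}

# Shortened sample of the markup shown in the question.
sample='<p><p tabindex="0">Clementine is a multiplatform music player.</p><p tabindex="0">Summary of included features :</p>'

output="$(printf '%s\n' "$sample" | extract_first_p)"
echo "$output"
```

Against the live page, the same function could be fed from `wget -q <url> -O -`. Because the tag is hard-coded in the patterns, it would need adjusting for a different site.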
Thanks, @John1024. Your solution works fine with the Ubuntu site, but with the openSUSE site the text is not displayed correctly. This is the variable for the openSUSE site:
output="$(wget -q http://software.opensuse.org/package/clementine -O - | sed -n 's/.*<p id="pkg-desc">\([^<]*\).*/\1/p')"
echo $output
Clementine is a modern music player and library organiser. Clementine is a
(the text is not displayed completely)
I also tried:
output="$(wget -q http://software.opensuse.org/package/clementine/ -O - | sed -ne '/<p id="pkg-desc">/, /[.</p>]$/p')"
echo $output
<p id="pkg-desc">Clementine is a modern music player and library organiser. Clementine is a port of Amarok 1.4, with some features rewritten to take advantage of Qt4.
(the output still includes the opening '<p id="pkg-desc">' tag)
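Both symptoms are consistent with the description spanning more than one line in the page source, since sed processes its input one line at a time. A minimal sketch under that assumption: join the lines first, then capture everything between the opening tag and the closing </p>. The function name `extract_desc` is hypothetical, and the sample markup, including where the line break falls, is an assumption rather than the real page:

```shell
#!/bin/bash
# Hypothetical helper: read HTML on stdin and print the text inside
# <p id="pkg-desc">...</p>, even when it spans several lines.
# Assumption: the description contains no nested tags.
extract_desc() {
    # tr turns newlines into spaces (-s squeezes repeated spaces), so sed
    # sees a single line and can match from the opening tag to </p>.
    tr -s '\n' ' ' | sed -n 's/.*<p id="pkg-desc">\([^<]*\)<\/p>.*/\1/p'
}

# Assumed sample: the description is broken across two lines.
sample='<div>
<p id="pkg-desc">Clementine is a modern music player and library organiser. Clementine is a
port of Amarok 1.4, with some features rewritten to take advantage of Qt4.</p>
</div>'

output="$(printf '%s\n' "$sample" | extract_desc)"
echo "$output"
```

With the live page, the input would come from `wget -q http://software.opensuse.org/package/clementine -O -` instead of the sample string. Quoting the variable in `echo "$output"` keeps the shell from re-splitting the text.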
– davidva Mar 16 '14 at 07:00