How do I extract specific URLs from an HTML file

Question

I have an HTML file, without any formatting. I want to extract URLs of the form https://sitename.com/*/ending and only those URLs.

What is the best way to go about doing this?

This question is NOT a duplicate. The other question asks about pulling the contents of a specific named DIV. This one asks how to pull a list of URLs that fit a specific format.

No. That's not really enough to answer my question. How would I use a selector to select only URLs of a given format? — Daniel Goldman, Mar 04 '16 at 22:09
The only difference is you want the URL: the approach is exactly the same... — jasonwryan, Mar 04 '16 at 22:10
Okay. So once again, how do I select URLs of a given format?! — Daniel Goldman, Mar 04 '16 at 22:12
You follow the directions in the answer to the duplicate... Of course, some independent thought is required. — jasonwryan, Mar 04 '16 at 22:17
Sorry that I lack the knowledge of the tool in order to perform the task. Your comments are not constructive. #divname is easy enough. How do I say "just URLs" and then filter out URLs that are not of the proper form? — Daniel Goldman, Mar 04 '16 at 22:22
I find the easiest way to extract URLs from HTML files is to use a tool like lynx that already knows how to identify and extract URLs. e.g. lynx -dump -listonly file.html | awk '/https:\/\/sitename\.com\// {print $2}' — cas, Mar 04 '16 at 23:26

Hobadee · Accepted Answer · 2016-03-04T23:09:03.260

1

A simple grep should do this for you:

grep -o "https://sitename.com/.+/ending" somefile.html

(Note: I don't have a *nix machine in front of me right now to test this on.)

Edit: Fired up my Linux box and found this to work:

grep -wEo "https://sitename\.com/[^/]+/ending" somefile.html

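A .+ will be greedy and capture way too much. Using a negated character class ([^/]+) stops the match at the next slash, so it properly finds the end of a sub-directory. The -E flag matters too: without it, grep uses basic regular expressions, where + is a literal character rather than a quantifier. Note this will NOT find nested sub-directories such as https://sitename.com/sub/directory/ending.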
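
For illustration, here is a minimal sketch comparing the patterns on a small made-up file. The file contents and the function names (extract_greedy, extract_single, extract_nested) are assumptions for the example and not part of the original question. extract_nested is just one possible variation for the nested case: it excludes quotes, spaces and angle brackets from each path segment so a match cannot run past the end of an attribute. It assumes a grep that supports -o, -w and -E (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical sample file: two URLs share one line, one URL is nested.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="https://sitename.com/test-1/ending">one</a> <a href="https://sitename.com/test-2/ending">two</a>
<a href="https://notsite.com/ending">other</a>
<a href="https://sitename.com/sub/directory/ending">nested</a>
EOF

# Greedy .+ : runs from the first URL to the last "/ending" on the line.
extract_greedy() {
    grep -Eo 'https://sitename\.com/.+/ending' "$1"
}

# The command from the answer: exactly one path segment before /ending.
extract_single() {
    grep -wEo 'https://sitename\.com/[^/]+/ending' "$1"
}

# Variation (an assumption, not from the answer): one or more path segments,
# none of which may contain a slash, quote, space or angle bracket.
extract_nested() {
    grep -wEo 'https://sitename\.com/([^/"<> ]+/)+ending' "$1"
}

echo "greedy:";  extract_greedy "$sample"
echo "single:";  extract_single "$sample"
echo "nested:";  extract_nested "$sample"
rm -f "$sample"
```

On this sample, extract_greedy returns the two URLs on the first line as one long match, extract_single returns only the two single-level URLs, and extract_nested also returns the sub/directory URL.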

edited Mar 04 '16 at 23:09

answered Mar 04 '16 at 22:22


grep -o "https://sitename.com/" somefile.html I get a bunch of lines of that string, but if I add in the rest, I get nothing. – Daniel Goldman Mar 04 '16 at 22:32
Try egrep (or grep -E) instead. – Hobadee Mar 04 '16 at 22:36
No go. Good try though. I think I'm getting somewhere. The issue, I think, is that these URLs are not necessarily on separate lines. – Daniel Goldman Mar 04 '16 at 22:37
Whoops - forgot you will also have to escape some stuff... grep -eo "https://sitename.com/.+/ending" somefile.html – Hobadee Mar 04 '16 at 22:39
grep: https://sitename.com/.+/ending: No such file or directory test.html:https://sitename.com/test-1/ending https://sitename.com/test-2/ending test.html:https://notsite.com/ending – Daniel Goldman Mar 04 '16 at 22:47
It's a good effort. Maybe wait until you're in front of a terminal? – Daniel Goldman Mar 04 '16 at 22:48
Damn my curiosity! Fired up my Linux boxxen and came up with this one which appears to work: grep -wEo "https://sitename\.com/[^/]+/ending" somefile.html I had also forgotten about regex being greedy, so we use a negated character class to match the middle part of the URL. Note that this will NOT match a URL such as "https://sitename.com/sub/directory/ending" – Hobadee Mar 04 '16 at 22:59
Let us continue this discussion in chat. – Hobadee Mar 04 '16 at 23:06
