I have an HTML file, without any formatting. I want to extract URLs of the form https://sitename.com/*/ending and only those URLs.

What is the best way to go about doing this?

This question is NOT a duplicate. The other question is asking about pulling the contents of a specific named DIV. This is asking how to pull a list of URLs, fitting a specific format.

  • No. That's not really enough to answer my question. How would I use a selector to select only URLs of a given format? – Daniel Goldman Mar 04 '16 at 22:09
  • The only difference is you want the URL: the approach is exactly the same... – jasonwryan Mar 04 '16 at 22:10
  • Okay. So once again, how do I select URLs of a given format?! – Daniel Goldman Mar 04 '16 at 22:12
  • You follow the directions in the answer to the duplicate... Of course, some independent thought is required. – jasonwryan Mar 04 '16 at 22:17
  • Sorry that I lack the knowledge of the tool in order to perform the task. Your comments are not constructive. #divname is easy enough. How do I say "just URLs" and then filter out URLs that are not of the proper form? – Daniel Goldman Mar 04 '16 at 22:22
  • I find the easiest way to extract URLs from HTML files is to use a tool like lynx that already knows how to identify and extract URLs. e.g. lynx -dump -listonly file.html | awk '/https:\/\/sitename\.com\// {print $2}' – cas Mar 04 '16 at 23:26

1 Answer

A simple grep should do this for you:

grep -o "https://sitename.com/.+/ending" somefile.html

(Note: I don't have a *nix machine in front of me right now to test this on.)

Edit: Fired up my Linux box and found this to work:

grep -wEo "https://sitename\.com/[^/]+/ending" somefile.html

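A .+ is greedy and captures far too much when several URLs share a line. Without -E, grep also treats the + as a literal character rather than a quantifier, so the first attempt cannot match. Using a negated bracket expression, [^/]+, stops the match at the next slash, which properly finds the end of a sub-directory. Note this will NOT find nested sub-directories such as https://sitename.com/sub/directory/ending.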
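To illustrate the difference, here is a minimal sketch run against a made-up sample file. The file contents and the variable names sample, single, and nested are assumptions for illustration only, and -w is left out for brevity. The second pattern is one possible way to also pick up nested sub-directories; it assumes each URL is followed by a double quote or whitespace, as in a typical href attribute.

```shell
# Hypothetical sample: several URLs share a line, as in unformatted HTML.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="https://sitename.com/test-1/ending">one</a> <a href="https://sitename.com/test-2/ending">two</a>
<a href="https://notsite.com/ending">other</a> <a href="https://sitename.com/sub/directory/ending">nested</a>
EOF

# Exactly one path segment: [^/]+ cannot cross a slash.
single=$(grep -Eo 'https://sitename\.com/[^/]+/ending' "$sample")

# One or more path segments: the bracket expression stops at a
# double quote or whitespace instead of at a slash.
nested=$(grep -Eo 'https://sitename\.com/[^"[:space:]]+/ending' "$sample")

echo "single segment:"
printf '%s\n' "$single"
echo "nested allowed:"
printf '%s\n' "$nested"

rm -f "$sample"
```

The first pattern prints the test-1 and test-2 URLs only; the second also prints the sub/directory URL. Neither matches the notsite.com URL.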

Hobadee
  • 326
  • grep -o "https://sitename.com/" somefile.html I get a bunch of lines of that string, but if I add in the rest, I get nothing. – Daniel Goldman Mar 04 '16 at 22:32
  • Try egrep (or grep -e) instead. – Hobadee Mar 04 '16 at 22:36
  • No go. Good try though. I think I'm getting somewhere. The issue, I think, is that these URLs are not necessarily on separate lines. – Daniel Goldman Mar 04 '16 at 22:37
  • Whoops - forgot you will also have to escape some stuff... grep -eo "https://sitename.com/.+/ending" somefile.html – Hobadee Mar 04 '16 at 22:39
  • grep: https://sitename.com/.+/ending: No such file or directory test.html:https://sitename.com/test-1/ending https://sitename.com/test-2/ending test.html:https://notsite.com/ending – Daniel Goldman Mar 04 '16 at 22:47
  • It's a good effort. Maybe wait until you're in front of a terminal? – Daniel Goldman Mar 04 '16 at 22:48
  • Damn my curiosity! Fired up my Linux boxxen and came up with this one which appears to work: grep -wEo "https://sitename\.com/[^/]+/ending" somefile.html I had also forgotten about regex being greedy, so we use a negated character class to match the middle part of the URL. Note that this will NOT match a URL such as "https://sitename.com/sub/directory/ending" – Hobadee Mar 04 '16 at 22:59