I have an HTML file, without any formatting. I want to extract URLs of the form https://sitename.com/*/ending and only those URLs.
What is the best way to go about doing this?
This question is NOT a duplicate. The other question is asking about pulling the contents of a specific named DIV. This is asking how to pull a list of URLs, fitting a specific format.
Use lynx, which already knows how to identify and extract URLs, e.g.
lynx -dump -listonly file.html | awk '/https:\/\/sitename\.com\// {print $2}'
– cas Mar 04 '16 at 23:26
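If lynx is not installed, a rough alternative is to pull the URLs straight out of the raw text with grep. This is only a sketch: the function name extract_urls is made up for illustration, it assumes a grep that supports -o (GNU and BSD grep do, but it is not in POSIX), and it assumes URLs in the file are delimited by quotes, angle brackets, or spaces. Unlike the lynx pipeline above, it also requires the /ending part, but it does not resolve relative links and it would accept a longer URL that merely contains /ending partway through.

```shell
# extract_urls FILE
# Print each distinct URL of the form https://sitename.com/<something>/ending
# found anywhere in FILE (hypothetical helper, not part of lynx or grep).
extract_urls() {
    # -o prints only the matching part of each line, -E enables extended regexes.
    # [^\"'<> ]+ matches one or more characters that cannot be part of an
    # attribute value boundary, so the match stays inside a single URL.
    grep -oE "https://sitename\.com/[^\"'<> ]+/ending" "$1" | sort -u
}
```

Calling extract_urls file.html prints one URL per line, with duplicates removed by sort -u.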