While I have seen several questions similar to the one I am going to ask, e.g. How can I convert all the HTML files I get into text files after a wget command?
I also saw a blog post which describes this and confirmed that it works. But for local files, i.e. files residing in, say, /usr/share/doc/$PACKAGENAME/index.html along with the pages linked therein, there should be an easier way to get at least the top page.
I tried doing something like -
html2text file:///usr/share/doc/$PACKAGENAME/html/index.html > packagename-doc.txt
but that didn't work.
I get the output -
Cannot open input file "file:///usr/share/doc/$PACKAGENAME/html/index.html".
I am not giving any package names as it doesn't really matter; many packages nowadays ship their documentation as HTML pages rather than man or info, but that's outside the topic altogether.
Can somebody either tell me why this fails, or give an alternative way of doing it, either via html2text or some other tool that does it in a simple way?
if file_.startswith('http://') or file_.startswith('https://'):
You can see here that the filename is considered a URL only if it starts with http; otherwise it is treated as a regular file. Check the source code on GitHub: https://github.com/aaronsw/html2text/blob/8ddc844b03042faf4875d465580ad122f9292978/html2text.py#L870. So you have to remove file:// at the beginning of your filename. – Kerkouch Apr 26 '18 at 00:47
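Following that comment, a minimal sketch of the corrected invocation, assuming the html2text script from that repository (or your distribution's html2text package) is installed and the documentation path exists on your system:

html2text /usr/share/doc/$PACKAGENAME/html/index.html > packagename-doc.txt

Dropping the file:// prefix makes html2text treat the argument as a plain local path instead of trying (and failing) to open it as a URL. If your particular html2text build behaves differently, feeding the file on stdin should also work:

html2text < /usr/share/doc/$PACKAGENAME/html/index.html > packagename-doc.txt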