I have seen several questions similar to the one I am about to ask, e.g. "How can I convert all the html files I get into text files after a wget command?"

I also saw a blog post which describes this, and I have seen that it works. I even tried it locally and found that it works there too. But for local files, i.e. files residing in, say, /usr/share/doc/$PACKAGENAME/index.html and the pages linked from it, there should be an easier way to get at least the top page.

I tried doing something like -

html2text file:///usr/share/doc/$PACKAGENAME/html/index.html > packagename-doc.txt

but that didn't work.

I get the output -

Cannot open input file "file:///usr/share/doc/$PACKAGENAME/html/index.html".

I am not giving any package names as it doesn't really matter. Many packages nowadays ship their documentation as html pages rather than man or info pages, but that's outside the topic altogether.

Can somebody either tell me why this fails, or suggest a simple alternative, whether via html2text or some other tool?

shirish
  • Did you try html2text < /file.html? – user1133275 Apr 26 '18 at 00:15
  • Because it is hard-coded in the script: if file_.startswith('http://') or file_.startswith('https://'): treats the filename as a URL only if it starts with http; otherwise it is treated as a regular file. See the source code on GitHub: https://github.com/aaronsw/html2text/blob/8ddc844b03042faf4875d465580ad122f9292978/html2text.py#L870. So you have to remove file:// from the beginning of your filename. – Kerkouch Apr 26 '18 at 00:47

1 Answer

@Kerkouch has the right idea: you need to remove the file:// part. Shell tools generally do not understand or expect URLs as parameters.
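As a minimal sketch of stripping the prefix in the shell itself, assuming a POSIX shell and a plain file:// URL with no host part and no percent-encoded characters: the variable names and the package name "example" below are hypothetical, and the html2text call is left commented out since it depends on that file actually existing.

```shell
# Hypothetical URL-style argument; "example" stands in for a real package name.
url='file:///usr/share/doc/example/html/index.html'

# ${url#file://} removes a leading "file://" if present, leaving an
# ordinary absolute path. A string without that prefix is left unchanged.
path=${url#file://}

printf '%s\n' "$path"

# Then pass the plain path to the tool (commented out: the file may not exist):
# html2text "$path" > example-doc.txt
```

This only handles the simple case; a URL such as file://localhost/... or one containing %20 would need more work.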

In fact, file:///[…]/html/index.html is a valid path, but a relative one: it points to a file within a directory named html, and so on up the chain, all inside a directory literally called file: within the current working directory (PWD). Multiple consecutive slashes are simply treated as a single slash, and every single visible character (and most invisible ones) is valid in a *nix path. The only character not valid in a path is NUL.
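The point about file: being an ordinary directory name can be checked with a small experiment. This is only a sketch: the demo/html tree is made up for illustration, and it assumes mktemp -d is available (it is not strictly POSIX, though it exists on most systems).

```shell
# Work in a throwaway directory so nothing outside it is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Create a directory literally named "file:" with a small tree under it.
mkdir -p 'file:/demo/html'
echo 'hello' > 'file:/demo/html/index.html'

# The URL-looking string resolves as a relative path: the run of slashes
# after "file:" is treated as a single separator.
content=$(cat 'file:///demo/html/index.html')
printf '%s\n' "$content"

# Clean up the throwaway directory.
cd / && rm -rf "$tmp"
```

Here cat succeeds only because a file: directory exists under the current directory; in the question no such directory exists, which is why html2text could not open the input file.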

l0b0
  • I dunno why somebody downvoted the question, but yes, removing file: and the extra slashes does exactly as advertised. – shirish Apr 26 '18 at 05:14