
How can I convert all the html files I get into plain text files after a wget command?

I'm thinking of using lynx to convert html files into ".txt" files, getting rid of tags.

I have this snippet to save an entire website, but how can I change it so that my local folder "test" ends up containing only the text files converted from the HTML files on the website "foobar"?

wget -P /test/ --recursive http://foobar.html

I'm not sure how to pipe this to lynx and how to specify applying commands for all the files under some specific directory.

stacko

2 Answers


wget may not be the proper tool here. Lynx can download a page and convert it to plain text in one step, but it writes the result to its standard output, which you then redirect to a file. Since it has no -output option, it is a little awkward to use in a script: you have to construct the output filenames yourself.
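For a single page, that looks something like this (example.com is just a placeholder, not a site from the question):

lynx -dump http://www.example.com/page.html > page.txt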

But supposing that you had a directory full of .html files, then you could use find to traverse that directory and convert the files, e.g.,

#!/bin/sh
# convert each .htm/.html file to a .txt file alongside it
find . -type f -name '*.htm*' | while IFS= read -r path
do
    lynx -dump "$path" >"${path%%.htm*}.txt"
done

to put the ".txt" files in the same tree, or

#!/bin/bash
# same loop, but write the .txt files under "test" instead of "foobar"
find . -type f -name '*.htm*' | while IFS= read -r path
do
    target=${path/foobar/test}
    lynx -dump "$path" >"${target%%.htm*}.txt"
done

in the folder "test" (mapping "foobar" to "test"). The "/" substitution is bash-specific, not in POSIX, which is why the second script uses a bash shebang (but if you choose to stay with POSIX sh, sed works well enough).
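A sketch of that POSIX variant, using sed for the substitution, might look like this (it assumes, as above, that the matching directories under "test" already exist):

#!/bin/sh
# POSIX variant: use sed instead of bash's ${var/old/new} substitution
find . -type f -name '*.htm*' | while IFS= read -r path
do
    target=$(printf '%s\n' "$path" | sed 's/foobar/test/')
    lynx -dump "$path" >"${target%%.htm*}.txt"
done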


Thomas Dickey
  • How can I specify the folder that has html files I want to convert? By making the part ' .htm' into '/test/.htm'? – stacko Apr 09 '16 at 18:52

You could probably just download them as HTML files as planned, then use the command-line utility html2text.

https://stackoverflow.com/questions/30015809/html2text-convert-special-characters
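Combined with the wget command from the question, a rough sketch could look like this (it assumes html2text takes a filename and writes the converted text to standard output, and it keeps the placeholder names "foobar" and "/test/" from the question):

#!/bin/sh
# mirror the site, then convert every downloaded HTML file to .txt
wget -P /test/ --recursive http://foobar/
find /test -type f -name '*.htm*' | while IFS= read -r path
do
    html2text "$path" > "${path%%.htm*}.txt"
done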

tgharold