
How can I convert all the html files I get into plain text files after a wget command?

I'm thinking of using lynx to convert html files into ".txt" files, getting rid of tags.

I have this snippet to save an entire website, but how can I change it so that my local folder "test" ends up containing only the text files converted from the HTML files on the website "foobar"?

wget -P /test/ --recursive http://foobar.html

I'm not sure how to pipe this to lynx and how to specify applying commands for all the files under some specific directory.

stacko

2 Answers


wget may not be the proper tool here. Lynx can download a page and convert it to plain text in one step, but it writes the result to its standard output, which you then redirect to a file. Since it has no -output option, it is a little awkward to use in a script: you have to construct the output filenames yourself.
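For a single page, that looks something like this (example.com is just a placeholder, not a site from the question):

lynx -dump http://www.example.com/page.html > page.txt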

But supposing that you had a directory full of .html files, then you could use find to traverse that directory and convert the files, e.g.,

#!/bin/sh
# convert each .htm/.html file to a .txt file alongside it
find . -type f -name '*.htm*' | while IFS= read -r path
do
    lynx -dump "$path" >"${path%%.htm*}.txt"
done

to put the ".txt" files in the same tree, or

#!/bin/bash
# same loop, but write the .txt files under "test" instead of "foobar"
find . -type f -name '*.htm*' | while IFS= read -r path
do
    target=${path/foobar/test}
    lynx -dump "$path" >"${target%%.htm*}.txt"
done

in the folder "test" (mapping "foobar" to "test"). The "/" substitution is bash-specific, not in POSIX, which is why the second script uses a bash shebang (but if you choose to stay with POSIX sh, sed works well enough).
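A sketch of that POSIX variant, using sed for the substitution, might look like this (it assumes, as above, that the matching directories under "test" already exist):

#!/bin/sh
# POSIX variant: use sed instead of bash's ${var/old/new} substitution
find . -type f -name '*.htm*' | while IFS= read -r path
do
    target=$(printf '%s\n' "$path" | sed 's/foobar/test/')
    lynx -dump "$path" >"${target%%.htm*}.txt"
done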


Thomas Dickey
  • How can I specify the folder that has html files I want to convert? By making the part ' .htm' into '/test/.htm'? – stacko Apr 09 '16 at 18:52

You could probably just download them as HTML files as planned, then use the command-line utility html2text.

https://stackoverflow.com/questions/30015809/html2text-convert-special-characters
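Combined with the wget command from the question, a rough sketch could look like this (it assumes html2text takes a filename and writes the converted text to standard output, and it keeps the placeholder names "foobar" and "/test/" from the question):

#!/bin/sh
# mirror the site, then convert every downloaded HTML file to .txt
wget -P /test/ --recursive http://foobar/
find /test -type f -name '*.htm*' | while IFS= read -r path
do
    html2text "$path" > "${path%%.htm*}.txt"
done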

tgharold