Force lynx or elinks to interpret spaces and line breaks

Question

Consider the following commands and their results:

$ echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>"|lynx -dump -stdin

   a
   b
   c
$ echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>"|elinks -dump
   a
   b
   c

Neither prints the correct number of lines: elinks skips the first whitespace, and both skip blank lines and trailing lines that contain only whitespace.

Is there a way to force lynx or elinks to interpret all spaces and line breaks? I didn't see anything obvious in their man pages.

(I mean, besides inserting a temporary character and stripping it with sed or tr or whatever after the dump.)

Neither of these works for me either. What are you ultimately trying to accomplish with these? Just curious, since they all seem to do it, and I could find no direct reason why they work like this. But rather than fight uphill against these tools, perhaps there is some deeper reason/problem that you're trying to solve that could be solved in a different way. – slm, Nov 30 '14 at 14:22
Well, I solved my problem using the workaround mentioned in parentheses at the end of the question (I chose a character which I was confident was not in my data, in my case ~). – Skippy le Grand Gourou, Nov 30 '14 at 14:26
Yeah, I read that, but was curious whether there was some downstream use you needed the newlines for, or whether you were just looking for a way to convert HTML to text output, or something else entirely. – slm, Nov 30 '14 at 14:28
As for what I'm ultimately trying to accomplish, I'm filling a DB with data from HTML tables. The easiest and most robust way I found is to dump each column into a separate file, then I merge files using paste, and finally I use psql \copy to fill the DB. Usually I use hxselect "td:nth-child()" to get columns, but for one particular column it gives me HTML code which needs to be interpreted, which is why I use lynx — then the issue arises when rows are empty. — Skippy le Grand Gourou, Nov 30 '14 at 14:32
I would use a proper programming language such as Ruby/Perl/Python to do this. You can connect to a URL using one of the many libraries that they offer and then navigate the HTML object that's returned to pull out the tables you want. The approach you're taking here is not going to be very robust or tolerant of changes that may occur. – slm, Nov 30 '14 at 15:21
Nokogiri in Ruby is one such method: http://ruby.bastardsbook.com/chapters/html-parsing/. Another example: http://stackoverflow.com/questions/8749101/parse-html-table-using-nokogiri-and-mechanize — slm, Nov 30 '14 at 15:22
Thanks for the suggestions. However, it's kind of a one-shot, so I prefer using tools I'm confident with rather than programming languages I don't know. In addition, the input is not always well-formed, so I often need e.g. to sed through it. I find bash tools well suited to text processing, so for now I don't see the need for anything else (though I needed perl's HTML::TableExtract once). – Skippy le Grand Gourou, Nov 30 '14 at 15:39

score 4 · Answer 1 · edited Jun 11 '20 at 12:04


Lynx can be configured to modify this behavior using COLLAPSE_BR_TAGS in the configuration file, e.g., lynx.cfg:

If COLLAPSE_BR_TAGS is set FALSE, Lynx will not collapse serial BR tags. If set TRUE, two or more concurrent BRs will be collapsed into a single line break. Note that the valid way to insert extra blank lines in HTML is via a PRE block with only newlines in the block.

Default value for COLLAPSE_BR_TAGS is TRUE
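As a rough sketch of how that setting could be tried without touching the system-wide lynx.cfg: write a one-line config to a temporary file and pass it with -cfg. The temporary file and the `cfg` variable are only for illustration, and a config given this way replaces the default one, so other site settings are not loaded. Judging by the comments below, some collapsing may still happen with this setting alone.

```shell
# Minimal config that only turns off BR collapsing.
cfg=$(mktemp)
printf 'COLLAPSE_BR_TAGS:FALSE\n' > "$cfg"

# Dump the sample HTML from the question with that config,
# but only if lynx is actually installed.
if command -v lynx >/dev/null 2>&1; then
    echo '<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>' \
    | lynx -cfg="$cfg" -dump -stdin
fi

# Remove "$cfg" once you are done experimenting.
```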

answered Oct 14 '15 at 01:14 · Thomas Dickey

Great suggestion. Unfortunately this parameter doesn't seem to have the expected effect, if any. – Skippy le Grand Gourou Oct 14 '15 at 13:35
@SkippyleGrandGourou, AFAICT, lynx -cfg <(echo COLLAPSE_BR_TAGS:FALSE) -dump -stdin has some effect, but still seems to do some br collapsing (seems to introduce only one blank line) – Stéphane Chazelas Oct 18 '15 at 21:23
This solution works perfectly for using lynx to view html in mutt. – Jonas Dahlbæk Sep 18 '17 at 21:02

score 0 · Answer 2 · answered Nov 28 '21 at 03:55

Lynx v2.8.9 (released July 8, 2018) added the trim_blank_lines option for controlling whether blank lines are trimmed.

Setting both the collapse_br_tags and trim_blank_lines options to false¹ will preserve blank lines.

¹ Lynx recognizes “1”, “+”, “on” and “true” for true values, and “0”, “-”, “off” and “false” for false values.
https://www.mankier.com/1/lynx

Example:

echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>d" \
| lynx -stdin -collapse_br_tags=0 -trim_blank_lines=0 -nomargins=1 -dump

Result:


a
b
c
d

Unfortunately, you'll then get spurious trailing newlines, as is evident in the above output.

Fortunately, that's easy to fix with an additional sed pipeline stage to remove multiple newlines at EOF:

echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>d" \
| lynx -stdin -collapse_br_tags=0 -trim_blank_lines=0 -nomargins=1 -dump \
| sed ':loop; /^\n*$/{$d;N;}; /\n$/b loop'
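To see what that sed stage does on its own, here is a small sketch that feeds it a stand-in for the lynx output instead of running lynx. The function name `strip_trailing_blank_lines` and the sample input are made up for the demonstration, and the one-line script syntax (semicolons after the label and the closing brace) assumes GNU sed.

```shell
# Same sed script as above, wrapped in a function (the name is ours).
strip_trailing_blank_lines() {
    sed ':loop; /^\n*$/{$d;N;}; /\n$/b loop'
}

# Stand-in for a dump: "a", a blank line, "b", then three trailing blank lines.
sample() {
    printf 'a\n\nb\n\n\n\n'
}

# Count lines before and after the filter.
before=$(sample | wc -l)
after=$(sample | strip_trailing_blank_lines | wc -l)
echo "lines before: $before, after: $after"
```

With GNU sed the count should drop from 6 to 3: the blank line between a and b is kept, and only the blank lines at the end of the input are removed.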
