
Consider the following commands and their results:

$ echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>"|lynx -dump -stdin

   a
   b
   c
$ echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>"|elinks -dump
   a
   b
   c

Neither prints the expected number of lines: elinks skips the leading whitespace, and both skip blank lines as well as trailing lines that contain only whitespace.

Is there a way to force lynx or elinks to render all spaces and line breaks? I didn't see anything obvious in their man pages.

(I mean, besides inserting a placeholder character and removing it with sed, tr, or similar after the dump.)

  • Neither of these works for me either. What are you ultimately trying to accomplish with these? Just curious, since they all seem to do it, and I could find no direct reason why they work like this. But rather than fight uphill against these tools, perhaps there is some deeper problem that you're trying to solve that could be solved in a different way. – slm Nov 30 '14 at 14:22
  • Well, I solved my problem using the workaround mentioned in parentheses at the end of the question (I chose a character which I was confident was not in my data, in my case ~). – Skippy le Grand Gourou Nov 30 '14 at 14:26
  • Yeah, I read that, but was curious whether there was some downstream use you needed the newlines for, or whether you were just looking for a way to convert HTML to text output, or something else entirely. – slm Nov 30 '14 at 14:28
  • As for what I'm ultimately trying to accomplish, I'm filling a DB with data from HTML tables. The easiest and most robust way I found is to dump each column into a separate file, then I merge files using paste, and finally I use psql \copy to fill the DB. Usually I use hxselect "td:nth-child()" to get columns, but for one particular column it gives me HTML code which needs to be interpreted, which is why I use lynx — then the issue arises when rows are empty. – Skippy le Grand Gourou Nov 30 '14 at 14:32
  • I would use a proper programming language such as Ruby/Perl/Python to do this. You can connect to a URL using one of the many libraries they offer and then navigate the returned HTML object to pull out the tables you want. The approach you're taking here is not going to be very robust or tolerant of changes that may occur. – slm Nov 30 '14 at 15:21
  • Nokogiri in Ruby is one such method: http://ruby.bastardsbook.com/chapters/html-parsing/. Another example: http://stackoverflow.com/questions/8749101/parse-html-table-using-nokogiri-and-mechanize – slm Nov 30 '14 at 15:22
  • Thanks for the suggestions. However, it's kind of a one-shot, so I prefer using tools I'm confident with rather than programming languages I don't know. In addition, the input is not always well-formed, so I often need to, e.g., sed through it. I find bash tools well suited to text processing, so for now I don't see the need for anything else (though I needed perl's HTML::TableExtract once). – Skippy le Grand Gourou Nov 30 '14 at 15:39

2 Answers

Lynx can be configured to modify this behavior using COLLAPSE_BR_TAGS in the configuration file, e.g., lynx.cfg:

If COLLAPSE_BR_TAGS is set FALSE, Lynx will not collapse serial BR tags. If set TRUE, two or more concurrent BRs will be collapsed into a single line break. Note that the valid way to insert extra blank lines in HTML is via a PRE block with only newlines in the block.

Default value for COLLAPSE_BR_TAGS is TRUE
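
As a rough sketch of how that setting could be applied without touching the system-wide file: write a one-line config to a scratch file and point lynx at it with -cfg. The scratch file and the sample input are only illustrations, and a config this minimal leaves every other setting at its compiled-in default, so for regular use it is probably better to change COLLAPSE_BR_TAGS in your real lynx.cfg.

```shell
# Scratch config holding only the one setting (the temp file is an assumption
# made for this sketch; normally you would edit your real lynx.cfg).
cfg=$(mktemp)
trap 'rm -f "$cfg"' EXIT
printf 'COLLAPSE_BR_TAGS:FALSE\n' > "$cfg"

# -cfg=FILE tells lynx to read FILE instead of the default lynx.cfg.
# The run is guarded so the snippet does nothing harmful where lynx is absent.
if command -v lynx >/dev/null 2>&1; then
    echo "a<br/><br/><br/>b" | lynx -cfg="$cfg" -stdin -dump
fi
```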

Thomas Dickey

Lynx v2.8.9 (released July 8, 2018) added the trim_blank_lines option for controlling whether blank lines are trimmed.

Setting both the collapse_br_tags and trim_blank_lines options to false [1] will preserve blank lines.

[1] Lynx recognizes “1”, “+”, “on” and “true” for true values, and “0”, “-”, “off” and “false” for false values (see https://www.mankier.com/1/lynx).

Example:

echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>d" \
| lynx -stdin -collapse_br_tags=0 -trim_blank_lines=0 -nomargins=1 -dump

Result:


a
b
c

d

Unfortunately, you'll then get spurious trailing newlines, as is evident in the above output.

Fortunately, that's easy to fix with an additional sed pipeline stage to remove multiple newlines at EOF:

echo "<br/> <br/>a<br/>b<br/>c<br/><br/> <br/>d" \
| lynx -stdin -collapse_br_tags=0 -trim_blank_lines=0 -nomargins=1 -dump \
| sed ':loop; /^\n*$/{$d;N;}; /\n$/b loop'
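
To see what that sed stage does on its own, here is a small sketch that feeds it hand-written text instead of lynx output. The function name strip_trailing_blank_lines is made up for the example, and the one-line form with `;` after the label assumes GNU sed. Note that it only treats completely empty lines as blank, so trailing lines containing spaces would be left alone.

```shell
# Same sed program as in the pipeline above, wrapped in a hypothetical helper.
# Runs of empty lines are collected in the pattern space; if the run reaches
# the end of input it is deleted, otherwise it is printed unchanged.
strip_trailing_blank_lines() {
    sed ':loop; /^\n*$/{$d;N;}; /\n$/b loop'
}

# Leading and interior blank lines survive; the two trailing ones are dropped.
printf '\n\na\nb\n\nd\n\n\n' | strip_trailing_blank_lines
```

Leading and interior blank lines pass through untouched, which is the point: only the run of empty lines at EOF is removed.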