i have small amounts of text in html that i want to process.

currently they are rendered with shr-render-region so that all html is out of the way, then copied and processed.

this works fine except for the fact that the rendering inserts newlines according to the value of shr-width, and these newlines can't be removed with replace-regexp-in-string or any other function that i have tried. (C-u C-x = reports that they are Line Feed (C-j) newlines, but matching with \n fails.)

is it possible to avoid inserting these when rendering with shr? or is there a way to strip them that i'm missing? perhaps i can cleanly extract the text some other way?

ideally paragraph breaks in the text (single blank lines) would be preserved, but no other newlines would interfere.

the text is currently variable pitch, i.e. shr-use-fonts is non-nil. but i have also tried setting it to nil and the newlines are still inserted.

EDIT:

an example of what i'm working with (it's posts from mastodon, i'm processing them in https://codeberg.org/martianh/mastodon.el):

<p>Thrilled to have coauthored the 1st version of the guidelines for conducting research on the <a href=\"https://mastodon.xyz/tags/Linux\" class=\"mention hashtag\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">#<span>Linux</span></a> kernel: <a href=\"https://github.com/torvalds/linux/commit/f09f6f9b69821c9efcf16e6b5b466ce9e263ca51\" rel=\"nofollow noopener noreferrer\" target=\"_blank\"><span class=\"invisible\">https://</span><span class=\"ellipsis\">github.com/torvalds/linux/comm</span><span class=\"invisible\">it/f09f6f9b69821c9efcf16e6b5b466ce9e263ca51</span></a> This is in the wake of the UMN incident and will hopefully help fellow sw.eng. scholars to enforce <a href=\"https://mastodon.xyz/tags/ethics\" class=\"mention hashtag\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">#<span>ethics</span></a> when studying the <a href=\"https://mastodon.xyz/tags/kernel\" class=\"mention hashtag\" rel=\"nofollow noopener noreferrer\" target=\"_blank\">#<span>kernel</span></a> community.</p>
user27075
  • Please give an example of those texts for reproduction. – Tobias Mar 14 '22 at 14:02
  • I think you may have gone down the wrong path somewhere. `replace-regexp-in-string` can match and replace both "\n" (NB: single backslash only) and the literal newline character. (You can generate the latter with `C-q C-j`.) I just checked in ielm. Both work in either the regex or the string being searched. – Phil Hudson Mar 24 '22 at 19:50
  • @PhilHudson i'm aware that it should match, that's why i'm concerned. btw, i have seen `shr` do this in two unrelated packages, hence my confusion and asking here. – user27075 Mar 24 '22 at 21:07
  • Please add a trivial example of an unexpectedly failing regexp match. – Phil Hudson Mar 25 '22 at 05:44
  • I have posted an answer, but you should double-check if you understand Phil's comments. I could not reproduce the behavior that you are describing here (i.e. matching on "\n" works fine here, like it should) – dalanicolai Mar 25 '22 at 09:15
  • @PhilHudson thanks. it's helpful to know that what i'm seeing is unexpected. – user27075 Mar 25 '22 at 10:21
  • We could still benefit from seeing an example of a regexp search that fails. – Phil Hudson Mar 25 '22 at 11:29
  • @PhilHudson i'm still working out what the issue is. i think it's different from what i said: the regexp search/replace fails because the text is filled -- it doesn't catch the soft newlines introduced by `shr-fill-lines`. i'm trying to unfill it instead of regexp-replacing the newlines – user27075 Mar 25 '22 at 12:38

1 Answer

You can prevent `shr-fill-lines` from having any effect by redefining it so that it does nothing and returns nil:

(defun shr-fill-lines (_start _end)
  "Do nothing, so that shr inserts no fill newlines."
  nil)

(Then you could also just use `shr-render-buffer`.)
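Redefining `shr-fill-lines` globally changes every package that renders with shr (eww, for example). If that is a concern, the override can be limited to a single call. The following is only a sketch of that idea, not something tested against mastodon.el: the function name `my/html-to-unfilled-text` is hypothetical, it assumes an Emacs built with libxml2 (which `shr-render-region` requires), and it assumes that `shr-fill-lines` is the only place where the fill newlines come from.

```elisp
(require 'cl-lib)
(require 'shr)

;; Hypothetical helper name; not part of shr or mastodon.el.
(defun my/html-to-unfilled-text (html)
  "Render the string HTML with shr and return it as plain text.
`shr-fill-lines' is replaced by `ignore' for the duration of the
rendering only, so no fill newlines are inserted and other users
of shr are unaffected."
  (with-temp-buffer
    (insert html)
    ;; Temporarily rebind the function cell; it is restored on exit.
    (cl-letf (((symbol-function 'shr-fill-lines) #'ignore))
      (shr-render-region (point-min) (point-max)))
    ;; Drop text properties (faces, links) and return the bare text.
    (buffer-substring-no-properties (point-min) (point-max))))
```

Calling `(my/html-to-unfilled-text "<p>some long post ...</p>")` should then return the post text without fill newlines inside a paragraph. Breaks between separate `<p>` elements are inserted elsewhere in shr, so they should still be present.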

dalanicolai