I've been slowly converting a blog of mine to Markdown. The final step is to replace all the HTML anchors with Markdown links.

I've come up with this sed regex, which by all appearances should do what I want, but it doesn't.

Source data:

$ cat /tmp/test
on <a href="https://www.reddit.com/" target="_blank" rel="noopener">reddit</a> or <a href="https://lifehacker.com/" target="_blank" rel="noopener">Lifehacker</a>

Sed command:

$ sed -r 's/<a.*?href="(.*?)".*?>(.*?)<\/a>/[\2](\1)/g' /tmp/test
on [Lifehacker](https://lifehacker.com/" target="_blank" rel="noopener)

What I want it to return:

on [reddit](https://www.reddit.com/) or [Lifehacker](https://lifehacker.com/)
  • Is there a line break in /tmp/test or is that wrapping? – Quasímodo Apr 25 '20 at 11:01
  • Sorry, that pasted bad. It's on one line. Fixed it – devilkin Apr 25 '20 at 11:12
  • I don't think GNU sed supports non-greedy patterns. You can replace the 2nd .*? with [^"]*. – meuh Apr 25 '20 at 11:33
  • Using regular expressions to parse HTML is generally not a great idea (see the often quoted Stack Overflow answer on the topic). You may save yourself headaches by using a specialized HTML parser. Here, even pandoc may be a convenient solution - e.g. pandoc -f html -t markdown /tmp/test. – fra-san Apr 25 '20 at 11:45
  • @meuh while that fixes part of the issue, it still doesn't return both - only the last one is matched. – devilkin Apr 25 '20 at 12:04
  • @fra-san while true, most of the syntax has been converted to markdown when I exported it from the old platform (WP), and using pandoc screws up the parts that are already converted. – devilkin Apr 25 '20 at 12:06
  • @devilkin Based on my experience, I wouldn't recommend pandoc in general. But it happens to conveniently work in simple cases. You may just pipe single lines to it. Using regular expression to parse whole html documents is likely to screw things up too. Finally, you can have non-greedy matching in sed, but it requires some work. – fra-san Apr 25 '20 at 12:13
  • @meuh thx for that comment on the fact that sed doesn't do non-greedy. Interesting tidbit I didn't know. – devilkin Apr 25 '20 at 12:14

1 Answer

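sed supports only POSIX basic and extended regular expressions (BRE/ERE). The lazy quantifier .*? belongs to Perl Compatible Regular Expressions (PCRE); sed has no non-greedy matching, so each .* matches as much as it can, which is why only the last anchor was converted.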

To use PCRE, use perl:

$ perl -pe 's/<a.*?href="(.*?)".*?>(.*?)<\/a>/[\2](\1)/g' test
on [reddit](https://www.reddit.com/) or [Lifehacker](https://lifehacker.com/)
  • This is exactly the same expression as the original, but run with perl -p, which reads and prints the file line by line, as sed does

Here is a similar regex using ERE with sed:

$ sed -E 's/<a[^>]*href="([^"]*)[^>]*>([^<]*)[^>]*>/[\2](\1)/g' test
on [reddit](https://www.reddit.com/) or [Lifehacker](https://lifehacker.com/)
  • PCRE uses a ? following a quantifier to match the shortest repetition; POSIX regular expressions (BRE/ERE) do not
  • Negated character classes are used to work around this
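
As a rough illustration of the two points above, the sketch below runs a greedy pattern and a negated-class pattern over the same line with sed -E. The input line, the example domains, and the variable names are made up for the demonstration. The bounded pattern is a slight variant of the one above: it also matches the closing quote and the literal </a>.

```shell
#!/bin/sh
# Hypothetical input with two anchors, modelled on the question's test file.
input='on <a href="https://a.example/" rel="noopener">one</a> or <a href="https://b.example/" rel="noopener">two</a>'

# Greedy: ERE has no lazy quantifier, so each .* grabs as much as it can
# and the whole line collapses into a single match.
greedy=$(printf '%s\n' "$input" | sed -E 's/<a.*href="(.*)".*>(.*)<\/a>/[\2](\1)/g')

# Negated classes: each piece stops at its own delimiter (", > or <),
# so every anchor is matched separately.
bounded=$(printf '%s\n' "$input" | sed -E 's/<a[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>/[\2](\1)/g')

printf '%s\n' "$greedy" "$bounded"
```

The second line printed should contain both links in Markdown form, while the first keeps only the last anchor. Note that the negated-class approach assumes the link text contains no nested tags, since [^<]* stops at the first <.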
guest
  • Thanks :) I was thinking of having a look at perl, but I was breaking my head why sed didn't work. Now I know... TIL :) – devilkin Apr 25 '20 at 12:17