Converting html anchors to markdown using sed regex

Question

I've been slowly converting a blog of mine to markdown. The final thing to do is replacing all the html anchors with markdown.

I've come up with this sed regex, which for all intents and purposes should do what I want, but it doesn't.

Source data:

$ cat /tmp/test
on <a href="https://www.reddit.com/" target="_blank" rel="noopener">reddit</a> or <a href="https://lifehacker.com/" target="_blank" rel="noopener">Lifehacker</a>

Sed command:

$ sed -r 's/<a.*?href="(.*?)".*?>(.*?)<\/a>/[\2](\1)/g' /tmp/test
on [Lifehacker](https://lifehacker.com/" target="_blank" rel="noopener)

What I want it to return:

on [reddit](https://www.reddit.com/) or [Lifehacker](https://lifehacker.com/)

I don't think GNU sed supports non-greedy patterns. You can replace the 2nd .*? with [^"]*. — meuh, Apr 25 '20 at 11:33
Using regular expressions to parse html is generally not a great idea (see the often quoted Stack Overflow answer on the topic). You may save yourself headaches by using a specialized html parser. Here, even pandoc may be a convenient solution - e.g. pandoc -f html -t markdown /tmp/test. — fra-san, Apr 25 '20 at 11:45
@meuh while that fixes part of the issue, it still doesn't return both - only the last one is matched. — devilkin, Apr 25 '20 at 12:04
@fra-san while true, most of the syntax has been converted to markdown when I exported it from the old platform (WP), and using pandoc screws up the parts that are already converted. — devilkin, Apr 25 '20 at 12:06
@devilkin Based on my experience, I wouldn't recommend pandoc in general. But it happens to conveniently work in simple cases. You may just pipe single lines to it. Using regular expressions to parse whole html documents is likely to screw things up too. Finally, you can have non-greedy matching in sed, but it requires some work. — fra-san, Apr 25 '20 at 12:13
@meuh thx for that comment on the fact that sed doesn't do non-greedy. Interesting tidbit I didn't know. — devilkin, Apr 25 '20 at 12:14

guest · Accepted Answer · 2020-04-25T22:42:52.613

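sed uses basic and extended regular expressions (BRE/ERE). The non-greedy quantifier .*? is a Perl Compatible Regular Expressions (PCRE) feature; sed does not support it, so every .* in the question's command matches greedily.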

To use PCRE, use perl:

$ perl -pe 's/<a.*?href="(.*?)".*?>(.*?)<\/a>/[\2](\1)/g' test
on [reddit](https://www.reddit.com/) or [Lifehacker](https://lifehacker.com/)

This is exactly the same expression as the original, but used with perl -p, which reads and prints the file line by line, as sed does.

Here is a similar regex using ERE with sed:

$ sed -E 's/<a[^>]*href="([^"]*)[^>]*>([^<]*)[^>]*>/[\2](\1)/g' test
on [reddit](https://www.reddit.com/) or [Lifehacker](https://lifehacker.com/)

PCRE uses a ? following a quantifier to match the shortest repetition; standard regular expressions do not.
Negated character classes such as [^"]* are used to work around this: they cannot match past the delimiter, so the match stops at the first one.
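The difference is easier to see on a smaller input than the anchors. The following is a minimal sketch, assuming a made-up line with two quoted strings (the variable names are hypothetical), that compares a greedy .* with a negated character class in sed -E:

```shell
# Hypothetical sample line containing two quoted strings.
line='a "one" b "two" c'

# Greedy: .* runs to the LAST quote on the line, so both strings merge into one match.
greedy=$(printf '%s\n' "$line" | sed -E 's/"(.*)"/<\1>/g')

# Negated class: [^"]* cannot cross a quote, so each quoted string matches separately.
negated=$(printf '%s\n' "$line" | sed -E 's/"([^"]*)"/<\1>/g')

printf '%s\n' "$greedy"    # a <one" b "two> c
printf '%s\n' "$negated"   # a <one> b <two> c
```

The greedy result mirrors what happened in the question, where only the last anchor appeared to be converted.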

Thanks :) I was thinking of having a look at perl, but I was breaking my head why sed didn't work. Now I know... TIL :) — devilkin, Apr 25 '20 at 12:17
