
I'm trying to extract data between

<td></td>

tags, but if I use something like

awk -F"<td>" {' $1 ":" $2 '}

the output still includes the remaining HTML after columns 1 and 2. How can I extract only the string between the two tags, without any of the surrounding markup?

slm
  • 369,824
guest
  • 21

1 Answer


This does what you want:

$ awk -F'</*td>' '$2{print $2}' someFile

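This works by setting the field separator to the regular expression </*td>, which matches both the opening <td> and the closing </td> (the /* allows zero or more slashes). The text between the two tags therefore becomes field $2. The pattern $2 runs the print action only on lines where $2 is non-empty, so lines without a <td> cell are skipped. Note that a cell containing just 0 would also be skipped, because awk treats a numeric zero as false.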
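To see how the separator carves up a line, it can help to print every field with its number. The following is only an illustrative sketch; the helper name show_fields is made up for this example and is not part of the answer's command.

```shell
# Hypothetical helper: print each field awk produces when the
# separator regex </*td> is applied to the lines on stdin.
show_fields() {
  awk -F'</*td>' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

printf '%s\n' '<table><td>hello world</td></table>' | show_fields
# $1=[<table>]
# $2=[hello world]
# $3=[</table>]
```

Everything before the opening tag lands in $1, the cell text in $2, and whatever follows the closing tag in $3.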

Example

$ cat someFile
<!DOCTYPE html PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
   <HEAD>
      <TITLE>
         A Small Hello
      </TITLE>
   </HEAD>
<BODY>
   <table><td>hello world</td></table>
   <table><td>hello world</td></table>
   <table><td>hello world</td></table>
   <table>
   <td>hello world</td>
   </table>
   <H1>Hi</H1>
   <P>This is very minimal "hello world" HTML document.</P>
</BODY>
</HTML>

Output:

$ awk -F'</*td>' '$2{print $2}' someFile
hello world
hello world
hello world
hello world
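If a line holds more than one cell, the same separator puts the cell contents in the even-numbered fields ($2, $4, ...), so a loop can join them with ":" as the question attempted. This is a rough sketch, assuming plain <td> tags with no attributes and each row on a single line; the function name td_cells is hypothetical. For anything less regular, a real HTML parser is the safer choice.

```shell
# Hypothetical helper: print all <td> cells of each line, joined by ":".
# Assumes bare <td>...</td> tags (no attributes) that do not span lines.
td_cells() {
  awk -F'</*td>' 'NF > 1 {
    out = $2                                        # first cell
    for (i = 4; i <= NF; i += 2) out = out ":" $i   # remaining cells
    print out
  }'
}

printf '%s\n' '<tr><td>a</td><td>b</td></tr>' | td_cells
# a:b
```

The NF > 1 condition skips lines that contain no <td> tag at all, and it also lets through cells whose content is 0 or empty.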


slm
  • 369,824