2

I am extracting a page from a website using the cURL command.

curl -s "www.google.com" | w3m -dump -T text/html > foo.txt

The w3m command renders the HTML page in a much simpler plain-text format, which makes string manipulation on foo.txt easier.

My foo.txt now contains lines like the following:

Assistant director at Hollywood studios
Student at University of Texas at Arlington

Now I need to extract only the text after the first ' at ' on each line, to store in my database. How can I do it? For example, for the above input, I need these values:

Hollywood studios
University of Texas at Arlington
Ramesh

3 Answers

3

Another way would be to replace the first occurrence of ' at ' with a tab, so that you have a tab-delimited file and can use awk properly (note that \t in the replacement is understood by GNU sed, but not by every sed implementation):

$ sed 's/ at /\t/' foo.txt | awk -F'\t' '{print $1" :: "$2}'
Assistant director :: Hollywood studios
Student :: University of Texas at Arlington
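If \t in the sed replacement is not available, the same split can be sketched in awk alone. This is only an illustration, not part of the answer above: it assumes the separator is always the literal string ' at ', and split_at is a hypothetical helper name.

```shell
# split_at (hypothetical helper): split each line at the FIRST " at "
# and print both halves; lines without " at " are skipped.
split_at() {
    awk '{
        i = index($0, " at ")          # position of the first " at ", 0 if absent
        if (i) print substr($0, 1, i - 1) " :: " substr($0, i + 4)
    }' "$@"
}

# Usage: split_at foo.txt
```

Because index() only finds the first match, any later ' at ' stays in the second half untouched.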

Or, the same thing in Perl:

$ perl -ne 'print "$1 :: $2\n" if /(.+?) at (.+)/' foo.txt

or even

$ perl -lne 'my ($k, $v) = split /\sat\s/, $_, 2; print "$k :: $v"' foo.txt
terdon
2

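You can use ' at ' as the column separator in awk. Since the text after the first ' at ' may itself contain ' at ', print everything after the first field instead of only $2. The following should work:

awk -F' at ' 'NF > 1 {print substr($0, length($1) + 5)}' foo.txt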
Ketan
2

Another option is to pipe your text into grep and cut:

grep -o ' at .*$' foo.txt | cut -c5-

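For each line, this extracts the longest match that starts with ' at ', that is, everything from the first ' at ' to the end of the line. The cut then trims the leading ' at ' (four characters) by keeping everything from the fifth character onward.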
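The same trimming can also be sketched with plain POSIX shell parameter expansion, without grep or cut. This is a different technique from the pipeline above, shown only as an illustration; after_at is a hypothetical helper name, and the sketch assumes the separator is the literal string ' at '.

```shell
# after_at (hypothetical helper): print the text following the first " at "
# on each input line; lines without " at " are skipped.
after_at() {
    while IFS= read -r line; do
        case $line in
            # ${line#* at } strips the shortest prefix ending in " at "
            *' at '*) printf '%s\n' "${line#* at }" ;;
        esac
    done
}

# Usage: after_at < foo.txt
```

The single # removes the shortest matching prefix, so only the first ' at ' is consumed and a later one (as in "University of Texas at Arlington") is preserved.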