Extracting characters after a particular text

Question

I am fetching a page from a website using the curl command:

curl -s "www.google.com" | w3m -dump -T text/html > foo.txt

The w3m command renders the HTML page in a much simpler format, so string manipulation on foo.txt is easier.

My foo.txt now contains values like these:

Assistant director at Hollywood studios
Student at University of Texas at Arlington

I need to extract only the text after the first "at" so that I can store it in my database. How can I do that? For the input above, I need:

Hollywood studios
University of Texas at Arlington

score 3 · Answer 1 · answered Feb 10 '14 at 23:51

Another way would be to replace the first occurrence of " at " with a tab, which gives you a tab-delimited file that awk can split cleanly:

$ sed 's/ at /\t/' foo.txt | awk -F'\t' '{print $1" :: "$2}'
Assistant director :: Hollywood studios
Student :: University of Texas at Arlington

Or, the same thing in Perl:

$ perl -ne 'print "$1 :: $2\n" if /(.+?) at (.+)/' foo.txt

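or even, splitting on " at " and rejoining any later fields with the same separator so that "Texas at Arlington" survives:

$ perl -F'\sat\s' -lane '$" = " at "; print "$F[0] :: @F[1..$#F]"' foo.txt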
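
If you only want the part after " at ", as in the question, and your sed does not understand \t in the replacement (that escape is a GNU extension), here is a minimal sketch of the same tab trick. The function name after_first_at is made up for illustration, and it assumes the input lines contain no tabs of their own:

```shell
#!/bin/sh
# Hypothetical helper: print only the text after the first " at " on each line.
# Assumes the input itself contains no tab characters.
after_first_at() {
    tab=$(printf '\t')    # a literal tab, so we do not rely on sed's \t escape
    # Turn the first " at " into a tab, then print the second tab-separated field.
    # NF > 1 skips lines that had no " at " at all.
    sed "s/ at /$tab/" | awk -F'\t' 'NF > 1 { print $2 }'
}

printf '%s\n' \
    'Assistant director at Hollywood studios' \
    'Student at University of Texas at Arlington' | after_first_at
```

In real use you would run it as after_first_at < foo.txt. Because sed without the g flag replaces only the first match, the second "at" in "Texas at Arlington" stays in the output.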

score 2 · Answer 2 · answered Feb 10 '14 at 23:01

You can use "at" as the field separator in awk. The following should work:

awk -F'at' '{print $2}' foo.txt

Ketan

Some words will include those letters together... Also, you'd need $3 in the case of Texas at Arlington. – jasonwryan Feb 10 '14 at 23:04
aha gotcha! let me think. – Ketan Feb 10 '14 at 23:05
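
One way around both problems raised in the comments is to stop splitting into fields and instead look for the literal string " at " (with the spaces) using awk's index() and substr(). This is only a sketch: the function name after_at_awk and the third sample line are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print what follows the first " at " (spaces included).
after_at_awk() {
    awk '{
        i = index($0, " at ")                # position of the first " at ", 0 if absent
        if (i > 0) print substr($0, i + 4)   # skip the 4 characters of " at "
    }'
}

printf '%s\n' \
    'Assistant director at Hollywood studios' \
    'Student at University of Texas at Arlington' \
    'Data analyst at Seattle' | after_at_awk
```

Words such as "Data" or "Seattle" are left alone because the match needs a space on both sides, and everything after the first match is kept, so there is no need to guess between $2 and $3.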

score 2 · Accepted Answer · answered Feb 10 '14 at 23:07

Another option is to pipe your text into grep and cut:

grep -o ' at .*$' foo.txt | cut -c5-

grep -o prints only the matched part of each line, which here runs from the first ' at ' to the end of the line; lines without ' at ' produce no output. The cut then trims the four leading characters ' at '.
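
For comparison, the same result can be had in plain POSIX shell with parameter expansion instead of grep and cut. This is a sketch rather than a drop-in replacement; the function name after_at_sh is hypothetical, and a read loop like this will be slower than grep on large files:

```shell
#!/bin/sh
# Hypothetical helper: same output as the grep -o / cut pipeline, no external tools.
after_at_sh() {
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            # "#" removes the shortest matching prefix, i.e. up to the first " at ".
            *" at "*) printf '%s\n' "${line#* at }" ;;
        esac
    done
}

printf '%s\n' \
    'Assistant director at Hollywood studios' \
    'Student at University of Texas at Arlington' | after_at_sh
```

As with grep -o, lines that do not contain " at " are skipped.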

Werner Lehmann
