3

A slightly extended version of "cat line x to line y on a huge file":

I have a huge file (2-3 GB). I'd like to cat/print only from the line containing "foo:" to the line containing "goo:". Assume that "foo:" and "goo:" each appear only once in the file, and that "foo:" precedes "goo:".

So far this is my approach:

  • First, find the line numbers of "foo:" and "goo:": grep -nr "foo:" bigfile (and likewise for "goo:")
  • This returns 123456: foo: hello world! and 654321: goo: good bye!
  • Once I know the starting and ending line numbers, and the number of lines in the range (654321-123456+1=530866), I can do a selective cat:
  • tail -n+123456 bigfile | head -n 530866

My question is: how can I effectively replace the line number constants with expressions (e.g., grep ...)?

I could write a simple Python script, but I would like to achieve this by combining commands only.

Nullptr
  • 139

4 Answers

9
sed -n '/foo/,/goo/p;/goo/q' <bigfile

That would print only those lines. If you wanted the line numbers you'd add an =.

sed -n '/foo/=;/goo/=;//q' <bigfile

The q is important because it quits the input when it is called - else sed will continue to read the infile through to the end.
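
As a minimal sketch of the behaviour described above, the following runs the same range-and-quit command against a small, made-up sample file (the file contents and the variable names are assumptions for illustration, standing in for bigfile):

```shell
# Hypothetical sample data standing in for the huge file.
sample=$(mktemp)
printf '%s\n' line1 'foo: hello' line3 line4 'goo: bye' line6 > "$sample"

# Print from the foo: line through the goo: line, then quit
# so that sed does not read anything after the goo: line.
range_out=$(sed -n '/foo:/,/goo:/p;/goo:/q' "$sample")
printf '%s\n' "$range_out"

rm -f "$sample"
```

On a multi-gigabyte file the q only saves time when goo: is not near the end of the file, but it costs nothing to include.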

If you don't want to print foo/goo lines you can do instead:

With GNU sed:

sed -n '/foo/,/goo/!d;//!p;/goo/q
' <<\DATA
line1
foo 
line3
line4
line5
goo 
line7
DATA

OUTPUT

line3
line4
line5

And with any other:

sed -n '/foo/G;/\n/,/goo/!d;//q;/\n/!p 
' <<\DATA
line1
foo 
line3
line4
line5
goo 
line7
DATA

OUTPUT

line3
line4
line5

Either way, though, this also quits its input as soon as it encounters the last line in your search.

mikeserv
  • 58,310
  • What's the advantage of that longer sed command over sed -n '/foo:/!{ /goo:/,/foo:/!p; }' bigFile? (Not being argumentative; genuinely curious.) – HalosGhost Sep 02 '14 at 03:04
  • @HalosGhost - I edited it. Thanks for the inspiration - your own answer gets my vote. – mikeserv Sep 02 '14 at 03:26
  • Thank you! I removed the sed solution in my post considering that what I posted was not actually equivalent (in execution flow) to my awk solution. I'll leave my post to awk :) – HalosGhost Sep 02 '14 at 03:30
  • 1
    sed -n '/foo/,/goo/!d;//!p;/goo/q' won't work with all sed implementations (//!p will only match on goo with some). – Stéphane Chazelas Sep 02 '14 at 06:24
  • @StéphaneChazelas - which are those, if you don't mind? That confuses me - specifically the note here about c: "[2addr]c\ text: Delete the pattern space. With a 0 or 1 address or at the end of a 2-address range, place text on the output and start the next cycle." Maybe I should make that a question, huh? It's just that GNU sed, at least, seems to do all of them. Or else, when used like I do above, only the specifically addressed lines. Is it a bug, do you think? – mikeserv Sep 02 '14 at 06:57
  • In the original one (tested on Unix V7) and probably all the derived commercial Unices (tested Solaris as well) as well as FreeBSD (and probably all the BSDs). ls / | sed -n '/dev/,/lib/!d;//!p;/lib/q' shows dev. Only exception I've found so far is GNU sed. – Stéphane Chazelas Sep 02 '14 at 08:39
  • @StéphaneChazelas - I think it's a GNU bug, maybe. I've been up and down their sed info pages and I've never noticed any reference to that behavior. Not doing so does seem a little out-of-character considering the explicit coverage of N and its last line behavior in the BUGS section. I suppose I should report it. Anyway, I updated the answer to reflect it. – mikeserv Sep 02 '14 at 09:19
  • Both make sense. // matches on the last pattern. For GNU sed, that is the last pattern match run, while on other seds it is the last one lexically in the sed command script (even though it was not run: /goo/ is not run in /foo/,/goo/ if /foo/ matched). I would even say that the GNU sed behaviour is more in line with the POSIX spec. – Stéphane Chazelas Sep 02 '14 at 09:23
  • @StéphaneChazelas - so your point is that it is a bug in the spec for it not being specific enough? But what of the /ran/,/ge/c\\ behavior? They seem related... And besides, in my example - if it really were doing the range again - then shouldn't it !p the whole range? I didn't even see your answer till just now... – mikeserv Sep 02 '14 at 09:26
  • I'd say yes, the spec is not specific enough. I don't see what's the problem with c though. That seems unambiguous to me and I don't see how that's related to //. The ambiguity is about what // matches on when the previous match is part of a /x/,/y/ address range. There's also ambiguity in ls / | sed '/s/s/i/u/;s//<&>/g' for instance. – Stéphane Chazelas Sep 02 '14 at 10:23
5

If you are okay with abandoning your current approach of using something in subshells to get the line numbers and allowing another utility to print the file, this can be accomplished in pure awk with little difficulty:

If you wish to print the lines between foo: and goo: and not the lines themselves, then you can use the following (picked up from here originally):

awk '/goo:/ { exit }; flag; /foo:/ { flag = 1 }' bigFile

The above exits when it sees the end token (goo:), prints if flag is true, and sets flag to true (1, actually) when it reaches the opening token (foo:).
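
To see the flag logic at work, here is a small sketch that runs the same awk program on a made-up sample file (the sample contents and variable names are assumptions for illustration only):

```shell
# Hypothetical sample data standing in for bigFile.
sample=$(mktemp)
printf '%s\n' line1 'foo: hello' line3 line4 'goo: bye' line6 > "$sample"

# Rule order matters: the exit check runs first, then the print,
# and the flag is set last so the foo: line itself is not printed.
between=$(awk '/goo:/ { exit }; flag; /foo:/ { flag = 1 }' "$sample")
printf '%s\n' "$between"

rm -f "$sample"
```

With this sample, only line3 and line4 are printed; neither token line appears in the output.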

If, however, you would prefer to include the token lines in the output, the command is actually even simpler, as @jasonwryan mentioned:

awk '/foo:/,/goo:/' bigFile
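
Note that the bare range pattern keeps reading to the end of the file after goo: has been printed. A possible variant, sketched here on a made-up sample file (the sample contents and variable names are assumptions, not part of the original answer), adds an explicit exit so awk stops at the end token:

```shell
# Hypothetical sample data standing in for bigFile.
sample=$(mktemp)
printf '%s\n' line1 'foo: hello' line3 line4 'goo: bye' line6 > "$sample"

# Print the inclusive range, then stop reading once goo: has been printed.
inclusive=$(awk '/foo:/,/goo:/ { print }; /goo:/ { exit }' "$sample")
printf '%s\n' "$inclusive"

rm -f "$sample"
```

This relies on the question's assumption that foo: comes before goo:; if goo: appeared first, awk would exit before printing anything.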

If you are hell-bent on only getting the line numbers and not actually printing the file with the same utility, then you can get the line numbers of the start and end tokens like so:

awk '/foo:|goo:/ { print NR }' bigFile
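
If the line numbers are what you want, one way to feed them to another utility is sketched below. It is only an illustration under the question's assumptions (one foo: line, followed later by one goo: line); the sample file and the variable names nums, start, and end are hypothetical:

```shell
# Hypothetical sample data standing in for bigFile.
sample=$(mktemp)
printf '%s\n' line1 'foo: hello' line3 line4 'goo: bye' line6 > "$sample"

# One awk pass: remember the foo: line number, print both numbers at goo:, stop.
nums=$(awk '/foo:/ { s = NR } /goo:/ { print s, NR; exit }' "$sample")
start=${nums% *}   # text before the space
end=${nums#* }     # text after the space

# Second pass: print lines start..end and quit at the end line.
picked=$(sed -n "${start},${end}p;${end}q" "$sample")
printf '%s\n' "$picked"

rm -f "$sample"
```

This still reads the file twice (each time only up to the goo: line), so the single-command forms above are likely cheaper on a huge file.
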
HalosGhost
  • 4,790
4

Alternative sed one:

sed '/foo/,$!d;/goo/q'
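
A brief sketch of how this one behaves, run on a made-up sample file (the sample contents and variable names are assumptions for illustration): every line outside the range from foo to the end of the file is deleted, lines inside the range are auto-printed because -n is not used, and q prints the goo line before quitting.

```shell
# Hypothetical sample data standing in for the huge file.
sample=$(mktemp)
printf '%s\n' line1 'foo: hello' line3 line4 'goo: bye' line6 > "$sample"

# Delete everything before the foo line; quit (after auto-printing) at goo.
alt_out=$(sed '/foo/,$!d;/goo/q' "$sample")
printf '%s\n' "$alt_out"

rm -f "$sample"
```

The output includes both token lines, and nothing after the goo line is read.
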
1

To substitute the constants with expressions, you can use command substitution.

To substitute the output of a command into an expression, use $(command)

In this case, the appropriate command line is:

tail -n+$(grep -nr "foo:" bigfile | cut -d':' -f1) bigfile | \
head -n$(($(grep -nr "goo:" bigfile | cut -d':' -f1)-$(grep -nr "foo:" bigfile | cut -d':' -f1)+1))

This will print all lines from the line containing foo: to the line containing goo:, inclusive.
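
The command above runs grep three times over the file. A possible variant, sketched here on a made-up sample file, searches for both tokens in one grep pass and keeps the results in variables (the sample contents and the names nums, first, and last are assumptions for illustration, and the sketch relies on each token appearing exactly once, as the question states):

```shell
# Hypothetical sample data standing in for bigfile.
sample=$(mktemp)
printf '%s\n' line1 'foo: hello' line3 line4 'goo: bye' line6 > "$sample"

# One grep pass for both tokens; cut keeps only the line numbers.
nums=$(grep -nE 'foo:|goo:' "$sample" | cut -d: -f1)
first=$(printf '%s\n' "$nums" | head -n 1)
last=$(printf '%s\n' "$nums" | tail -n 1)

# Same tail | head idea as above, inclusive of both token lines.
sliced=$(tail -n "+$first" "$sample" | head -n "$((last - first + 1))")
printf '%s\n' "$sliced"

rm -f "$sample"
```

This reads the file twice in total: once in grep and once (up to the goo: line) in tail and head.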

HalosGhost
  • 4,790
  • 2
    This reads the input file at least 4 times. You would do better to grep both in one $(( arithmetic )) evaluation statement and save the results into variables. – mikeserv Sep 02 '14 at 03:37