7

I need to determine whether a file contains a certain regex at a certain line, returning true (exit 0) if found and false otherwise. Maybe I'm overthinking this, but my attempts proved a tad unwieldy. I have a solution, but I'm looking for others I hadn't thought of. I could use perl, but I'm hoping to keep this as "lightweight" as possible, since it runs during a puppet execution cycle.

The problem is common enough: in RHEL6, screen was packaged in a way that limited the terminal width to 80 characters unless you un-comment line 132. This command checks whether that line has already been fixed:

 awk 'NR==132 && /^#termcapinfo[[:space:]]*xterm Z0=/ {x=1;nextfile} END {exit 1-x}' /etc/screenrc

Note: if the file has fewer than 132 lines, it must exit with false.
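To make that requirement concrete, here is a quick sanity check of the command against two throwaway files (the file names are made up for illustration; nextfile is a GNU awk extension):

```shell
# Verify the awk check's exit codes for a matching file (132+ lines,
# line 132 is the commented termcapinfo line) and for a short file.
check() {
  awk 'NR==132 && /^#termcapinfo[[:space:]]*xterm Z0=/ {x=1;nextfile} END {exit 1-x}' "$1"
}

seq 131 > long.rc                                                  # lines 1..131
printf '%s\n' '#termcapinfo xterm Z0=\E[?3h:Z1=\E[?3l' >> long.rc  # line 132
seq 50 > short.rc                                                  # only 50 lines

check long.rc  && echo "long.rc:  true"
check short.rc || echo "short.rc: false"
```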

I thought sed would be of help here, but apparently then you have to do weird tricks like null-substitutions and branches. Still, I'd like to see a sed solution just to learn. And maybe there is something else I overlooked.

EDIT 1: Added nextfile to my awk solution

EDIT 2: Benchmarks. EDIT 3: Different host (idle). EDIT 4: Mistakenly used Gilles' awk time for the optimized-perl run. EDIT 5: New bench.

Benchmarks

First, note: wc -l /etc/screenrc is 216. 50k iterations when line not present, measured in wall-time:

  • Null-op: 0.545s
  • My original awk solution: 58.417s
  • My edited awk solution (with nextfile): 58.364s
  • Gilles' awk solution: 57.578s
  • Optimized perl solution: 90.352s. Doh!
  • Sed 132{p;q}|grep -q ... solution: 61.259s
  • Cuonglm's tail | head | grep -q: 70.418s. Ouch!
  • Don_crissti's head -nX |head -n1|grep -q: 116.9s. Brrrrp!
  • Terdon's double-grep solution: 65.127s
  • John1024's sed solution: 45.764s

Thank you John and thank you sed! Perl loads in a bunch of shared libraries on startup, but as long as the OS is caching them all, it comes down to the parser and byte-coder; in the distant past (perl 5.2?) I found it was slower by 20%. I was initially surprised perl seemed on par here, but it turned out perl was indeed slower, as I originally expected; it only appeared to be better due to a copy/paste error on my part.
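A harness of roughly this shape reproduces the setup (a sketch, not the original script; bash, with the candidate check passed as arguments):

```shell
# Benchmark sketch: run a candidate check N times against a file where
# the test always fails, then time the loop with bash's `time` builtin.
bench() {
  n=$1; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" || :      # ignore the check's own exit status
    i=$((i+1))
  done
}

seq 216 > bench.rc   # same line count as the real /etc/screenrc
# The benchmarks above used 50000 iterations; 100 here for a quick demo.
time bench 100 awk \
  'NR==132 && /^#termcapinfo[[:space:]]*xterm Z0=/ {x=1;nextfile} END {exit 1-x}' \
  bench.rc
```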

Benchmarks Part 2

The biggest configuration file with practical value here is /etc/services. So I've re-run these benches for this file, with the line to be checked about 2/3rds of the way in. The total line count is about 11,000, so I picked line 7220 and modified the regex accordingly (so that in one case it fails and in another it succeeds; for the bench it always fails).

  • John's sed solution: 121.4s
  • Don_crissti's {head;head}|grep solution: 138.341s
  • Cuonglm's tail|head|grep solution: 77.948s
  • My awk solution: 175.5s
Otheus
  • 6,138

8 Answers

14

With GNU sed:

sed -n '132 {/^#termcapinfo[[:space:]]*xterm Z0=/q}; $q1'

How it works

  • 132 {/^#termcapinfo[[:space:]]*xterm Z0=/q}

    On line 132, check for the regex ^#termcapinfo[[:space:]]*xterm Z0=. If found, quit (q) with the default exit code of 0. The rest of the file is skipped.

  • $q1

    If we reach the last line, $, then quit with exit code 1: q1.

Efficiency

Since it is not necessary to read past the 132nd line of the file, this version quits as soon as we reach the 132nd line or the end of the file, whichever occurs first:

sed -n '132 {/^#termcapinfo[[:space:]]*xterm Z0=/q; q1}; $q1'

Handling empty files

The version above will return true for empty files. This is because, if the file is empty, no commands are executed and sed exits with the default exit code of 0. To avoid this:

! sed -n '132 {/^#termcapinfo[[:space:]]*xterm Z0=/q1; q}'

Here, the sed command exits with code 0 unless the desired string is found, in which case it exits with code 1. The preceding ! tells the shell to invert this code to get back to the code we want. The ! modifier is supported by all POSIX shells. This version works even for empty files. (Hat tip: G-Man)
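The three cases can be spot-checked with throwaway files (GNU sed is required for q/q1 with an exit code):

```shell
# Spot-check: a file whose line 132 matches, a file shorter than
# 132 lines, and an empty file.
re='^#termcapinfo[[:space:]]*xterm Z0='
{ seq 131; echo '#termcapinfo xterm Z0=test'; seq 10; } > match.rc
seq 50 > short.rc
: > empty.rc

sed -n "132 {/$re/q; q1}; \$q1" match.rc && echo "match.rc: true"
sed -n "132 {/$re/q; q1}; \$q1" short.rc || echo "short.rc: false"
! sed -n "132 {/$re/q1; q}"     empty.rc || echo "empty.rc: false"
```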

John1024
  • 74,655
  • 1
    This will exit with a status of 0 if the file has fewer than 132 lines. I can't figure out how to make sed exit with a non-zero status if there's no input (i.e., immediate EOF), so I suggest reversing the sense of the exit status: exit 1 if the pattern is found and 0 if not. Also, this reads to the end of the file if the pattern isn't present on line 132; I suggest changing it to sed -n '132 {/^#termcapinfo[[:space:]]*xterm Z0=/q1; q}' so it exits after it reads line 132, whether it matches or not. – G-Man Says 'Reinstate Monica' Sep 04 '15 at 01:19
  • John, @G-Man is right. Maybe you can edit your answer to suit. Thanks :) – Otheus Sep 04 '15 at 11:39
  • 1
    To invert the sed's output, just prefix with !. Not "universally portable" but in most cases it works. – Otheus Sep 04 '15 at 11:45
  • @G-Man The original answer does correctly produce a status of 1 for files with 1 to 131 lines. But, you are absolutely right about empty files and your efficiency suggestion is a good one. I updated the answer with your improvements. – John1024 Sep 09 '15 at 20:45
  • D’oh!  I see what I did — I changed the command to sed -n '132 {/pattern/q}; 132q1', and then I tested that on short files. – G-Man Says 'Reinstate Monica' Sep 09 '15 at 21:09
5

With POSIX toolchest:

tail -n +132 </etc/screenrc | head -n 1 | grep -q pattern
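Its behavior for a short file can be checked directly: tail emits nothing, so grep -q reads empty input and exits 1, meaning the pipeline reports false.

```shell
# With fewer than 132 lines, the pipeline's status comes from
# grep -q reading empty input, which exits 1 (false).
seq 50 > short.rc
tail -n +132 <short.rc | head -n 1 | grep -q pattern || echo "short.rc: false"
```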
cuonglm
  • 153,898
  • Does not fail if the file has too few lines. – FelixJN Sep 03 '15 at 13:32
  • 1
    With more than one fork, I had this solution: sed -n '132{p;q}'|grep -q pattern – Otheus Sep 03 '15 at 13:59
  • @Otheus: If your wanted line is big, tail will be much faster. – cuonglm Sep 03 '15 at 15:47
  • What do you mean "if your wanted line is big"? As in line 132? Or if the line number (not 132) is very big (50000 in a huge file)? – Otheus Sep 04 '15 at 11:42
  • @Otheus: Yes, if the line number is very big. Anyway, tail + head + grep seem to be always fastest, since when tail seek to the line instead of processing each line at the beginning like others. – cuonglm Sep 04 '15 at 11:54
  • cuonglm, you misunderstand how tail works. It cannot seek to a particular line because unix files aren't aware of lines. Tail must count the lines like all the other programs do -- it must read in byte-by-byte. The reason tail is more efficient in some cases is because it starts scanning for newlines from the end of the file. So the only advantage is when the line you want is greater than mid-way through the file (in terms of bytes) -- and the file must be large enough to compensate for the triple pipeline – Otheus Sep 04 '15 at 12:31
  • No, it doesn't read byte-by-byte, at least with GNU tail http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/tail.c#n834. I invite you to read http://unix.stackexchange.com/questions/47407/cat-line-x-to-line-y-on-a-huge-file and http://unix.stackexchange.com/questions/102905/does-tail-read-the-whole-file – cuonglm Sep 04 '15 at 13:57
3

You can do it more efficiently in awk: exit as soon as you've hit the relevant line.

awk 'NR==132 {if (/^#termcapinfo[[:space:]]*xterm Z0=/) found=1; exit}
     END {exit !found}' /etc/screenrc

Alternatively, you can use GNU sed (but portable sed doesn't let you specify the exit code).

Alternatively, you can use the Unix philosophy of combining tools together: extract the line you want with head and tail, and pass it to grep.

</etc/screenrc tail -n +132 | head -n 1 |
grep -q '^#termcapinfo[[:space:]]*xterm Z0='

Or you can use sed to extract the desired line:

</etc/screenrc sed -n '132 {p; q;}' |
grep -q '^#termcapinfo[[:space:]]*xterm Z0='

(Both of these rely on the fact that you want the same outcome for an empty line and for a file that's too short.)

For such a small file, the fastest approach is likely to be one that uses a single tool, as the overhead of launching multiple programs will be larger than the performance gain from using special-purpose tools such as head, tail and sed. If you wanted line 132000000, starting off with tail -n +132000000 would likely be faster than anything else.

Gilles
2

Some alternatives with ed:

ed -s infile <<\IN
132s/^#termcapinfo[[:space:]]*xterm Z0=/&/
q
IN

or sed+grep:

sed '132!d;q' infile | grep -q '^#termcapinfo[[:space:]]*xterm Z0='

In both cases, if infile has fewer than 132 lines or if line 132 doesn't match the pattern, the exit code is 1. Both should be quite portable, though ed will read the whole file into memory.

If you're working with huge files, then head may be faster than sed, e.g.:

{ head -n 131 >/dev/null; head -n 1; } <infile | grep -q '^#termcapinfo[[:space:]]*xterm Z0='
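This works because both heads read the same file descriptor, and a standard utility that stops reading a seekable input is required (by POSIX) to leave the offset just past the bytes it consumed; GNU head honors this. A quick check with a throwaway file:

```shell
# The first head consumes lines 1..131 and leaves the shared file
# offset just past them, so the second head reads exactly line 132.
seq 200 > lines.txt
{ head -n 131 >/dev/null; head -n 1; } < lines.txt    # prints 132
```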
don_crissti
  • 82,805
1

I know you said you didn't want to use perl. I think you're operating under a misconception about how 'lightweight' it is.

You could do this:

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $input_fh, '<', "/etc/screenrc" ) or die $!; 
while ( <$input_fh> ) {
   if ( $. == 132 
   and m/^#termcapinfo[[:space:]]*xterm Z0=/ ) {
       exit 0; 
   }
}

exit 1;

Which you can condense to a one liner:

perl -ne '$found = 1, exit if $. == 132 and m/^#termcapinfo[[:space:]]*xterm Z0=/; END { exit($found ? 0 : 1) }' /etc/screenrc
Sobrique
  • 4,424
  • Yep, I did imagine your 2nd perl script, but figured awk was faster. They look just about the same, but yours would continue reading to the end of the file. A slightly different version would be as efficient as awk, not counting perl's small performance overhead. – Otheus Sep 03 '15 at 14:06
  • Turns out perl's performance overhead is quite significant. – Otheus Sep 03 '15 at 15:07
  • Both of these perl scripts continue reading after line 132, which could be avoided quite easily. – Toby Speight Sep 03 '15 at 18:11
  • That isn't a significant overhead. Every one of your solutions takes less time than a disk seek, so relative speed is irrelevant. – Sobrique Sep 04 '15 at 03:11
1

You could always use a couple of greps:

grep -nm 1 "^#termcapinfo[[:space:]]*xterm Z0=" /etc/screenrc | grep -q '^132:'

The -n adds the line number to each matched line in grep's output. For example:

$ seq 11 15 | grep -n 5
5:15

The -m 1 (which, unlike the other two, is not defined by POSIX and might not be available in your grep implementation) makes grep exit after the first match.

So, the first grep looks for lines matching the regex and prints them along with the line number. The second grep will silently (-q) return true if an input line starts with 132:, so it will only be true if the regex matched line 132.


Here's another simple Perl approach:

perl -ne '$.==132 && !/^#termcapinfo\s*xterm Z0=/ && exit(1);'

The idea is to exit with a status of 1 only if line 132 doesn't match the regex. It will, therefore, exit with 0 otherwise. You could make it a bit more efficient (but more complex) by only checking the relevant line:

perl -ne '$.==132 && !/^#termcapinfo\s*xterm Z0=/ && exit(1); exit(0) if $.>132'

You could also simplify your original awk a little:

awk 'NR==132 && /^#termcapinfo[[:space:]]*xterm Z0=/{exit 0} NR>132{exit 1}'
terdon
  • 242,166
  • assuming the file is very large the grep would take too long - especially if no match is given (if so, the -m option would reduce time) – FelixJN Sep 03 '15 at 13:15
  • @Fiximan yeah, I didn't want to add it since it isn't POSIX but OK, edited. Thanks. – terdon Sep 03 '15 at 13:17
  • The awk wrongly exits with 0 if the file is < 132 lines. Same problem with both perl scripts. – Otheus Sep 03 '15 at 14:01
  • @Otheus why is that wrong? I thought the point here is to check whether line 132 matches the regex. Why should it exit with an error if there is no line 132? EDIT: Ah, just noticed that requirement in the OP now. OK, I'll try and fix that. – terdon Sep 03 '15 at 16:13
  • The command is used to determine if another command should run. The subsequent command is the sed -i form that removes the #. In theory I could just do the sed command anyway; but this is running under puppet which means it might run many many times, and I want to avoid unnecessary disk writes. If the file isn't what is expected, the sed command should never run because it will always fail. – Otheus Sep 03 '15 at 16:21
0

A workaround with head, tail, wc, and grep. (The if statement is in bash syntax.)

if [[ $(head -n 132 file | wc -l) -eq 132 &&\
    $( head -n 132 file | tail -n 1 |\
    grep '^#termcapinfo[[:space:]]*xterm Z0=') ]] ; then
  echo success
else
  echo fail
fi
FelixJN
  • 13,566
  • 1
    As long as you're using bash's [[ ... ]] syntax, why not use it's =~ regexp feature as well? :) – Otheus Sep 03 '15 at 14:50
  • @Otheus I'm not really too familiar with that so I cannot safely update the answer - but it makes sense to use it here. – FelixJN Sep 03 '15 at 15:53
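For reference, a sketch of the =~ variant suggested in the comments (bash; the regex is kept in a variable so its literal space needs no escaping, and an empty extraction from a too-short file can't match, so that case falls out correctly):

```shell
# Sketch: build a sample file whose line 132 matches, extract line 132
# with sed, then match it with bash's [[ ... =~ ... ]] operator.
{ seq 131; echo '#termcapinfo xterm Z0=demo'; } > file
re='^#termcapinfo[[:space:]]*xterm Z0='
if [[ $(sed -n '132p' file) =~ $re ]] ; then
  echo success
else
  echo fail
fi
```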
0

For completeness' sake, the ruby solution:

 ruby -e 'while gets do;  if $.==132 ; exit(/^#termcapinfo[[:space:]]*xterm Z0=/?0:1); end; end; exit(1)' /etc/screenrc 

I don't see a way of using ruby -n here without calling at_exit() in every iteration of the file-line read.

The MRI (Matz's Ruby Interpreter, 1.8.7) takes insanely long, clocking in at 139s.

Otheus
  • 6,138