13

I have a file that looks like this:

asd 123 aaa wrqiqirw 123
123 itiewth 123 asno 123
123 132 123 123 123
boagii 123 asdnojaneoienton 123

Expected output is:

123
123
123
123

I will need to search for patterns via regex. Is there any way to implement such a thing?

terdon
  • 242,166
Andrew
  • 151
  • 2
    Most programs that deal with regular expressions print the first match by default. What regex will you be using? The example of 123 is too trivial to be useful. – terdon Mar 15 '17 at 10:01
  • 3
    @terdon strictly speaking, it's not printing the first match, but do something with the first match. – cuonglm Mar 15 '17 at 10:18
  • 1
    What do you expect for lines without a matching pattern? – Philippos Mar 15 '17 at 12:21
  • @terdon From man grep about -o: "Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line" – SuibianP Mar 25 '22 at 10:13
  • 2
    @SuibianP sorry, what? I know what GNU grep's -o does, but why is it relevant here? As you say, that would print all matches on separate lines, which is not what the OP wants (note that they show a file with 12 occurrences of 123 and a desired output with only 4). Also look at the answers already provided. – terdon Mar 25 '22 at 10:19
  • @terdon I am trying to point out that grep does not print the first match by default, but all matches on the input line. Sorry if I misunderstood your comment. – SuibianP Mar 25 '22 at 10:33
  • @SuibianP I... don't know what comment you mean. You're the first to mention grep -o in this comment thread. All I did here was i) edit the question 5 years ago to fix some formatting issues and ii) leave a comment, also 5 years ago, asking the OP for more details since different patterns could affect the answer. I never mentioned -o or grep. If you were to do this with grep, it wouldn't be with a simple grep -o for the reason you mentioned. – terdon Mar 25 '22 at 10:45

7 Answers

11

With pcregrep, with a pattern like 12*3:

pcregrep -o1 '(12*3).*'

With pcregrep or GNU grep -P:

grep -Po '^.*?\K12*3'

(pcregrep works with bytes more than characters, while GNU grep will work on characters as defined in the current locale (and you'd have to make sure the input contains valid text in the current locale)).

Note that GNU grep won't print anything if the pattern matches the empty string.
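As a rough sketch of how the \K form behaves on input like the question's, the command can be wrapped in a small function. The name first_match_pcre and the sample lines are made up for illustration, and it assumes a GNU grep built with -P support. The pattern is wrapped in (?:...) so that an alternation such as a|b stays behind the \K.

```shell
# Hypothetical helper: print the first match of a PCRE on each input line.
# Assumes GNU grep with -P; reads standard input.
first_match_pcre() {
    # ^.*? lazily skips to the leftmost match on the line,
    # \K drops the skipped text from what -o reports.
    grep -Po -- "^.*?\\K(?:$1)"
}

# Three sample lines; the second one contains no match.
printf '%s\n' 'asd 123 aaa wrqiqirw 123' 'no match here' '13 then 123' |
    first_match_pcre '12*3'
```

Lines without a match produce no output at all, so the output can have fewer lines than the input.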

  • 1
    \K Keep the stuff left of \K. Not in []. Keep as in... the wife can keep it, grep doesn't want it. Also relevant in the context of negative look behind: \k{}, \k<>, \k'' Named backreference. Not in []. – Ray Foss Sep 11 '20 at 03:03
5

In Perl, simply

perl -lne 'print $& if /\d+/' inputfile

or from stdin:

echo foo 123 bar 456 doo 789 | perl -lne 'print $& if /\d+/'
123

The regex \d+ matches any run of consecutive digits, and $& refers to the matched string.
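If the pattern has to vary, one possible approach is to hand it to perl through the environment rather than splicing it into the script text. This is only a sketch: the function name first_match_perl and the variable name PATTERN are invented here.

```shell
# Hypothetical helper: print the first match of a Perl regex on each line.
# $1 is the regex; any further arguments are files (standard input if none).
first_match_perl() {
    pattern=$1
    shift
    # The regex reaches perl via the environment, so no shell quoting
    # of the pattern is needed inside the perl script itself.
    PATTERN=$pattern perl -lne 'print $& if /$ENV{PATTERN}/' -- "$@"
}

printf '%s\n' 'foo 123 bar 456' 'no digits' '13 123' | first_match_perl '12*3'
```

As with the one-liner above, lines without a match print nothing.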

ilkkachu
  • 138,973
4

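A plain grep should be enough to find matches of 123 on every line.
It does not matter whether the match is at the start, in the middle, or at the end of the line.
You ask for 123 and you get 123 if it is on the line (unless the question is not expressed precisely and you need something different). Note that with -o, grep prints each match on its own output line, so a line with several matches yields several lines of output.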

$ grep -wo '123' file # -w: word match  -o : return only matched string instead of the whole line (default grep operation)

In case you need to catch with regex the first number of each row (any number - any length) then this will do the job:

cat <<EOF >file1
asd 111 777 aaa wrqiqirw 123
333 123 itiewth 123 asno 123
4444 111 123 123 567
boagii what 666 asdnojaneoienton 123
EOF
grep -Po '^[0-9]+|^.*?\K[0-9]+' file1
#output
111
333
4444
666
3

POSIXLY:

LC_ALL=C sed -e 's/.*\(123\).*/\1/' <file

LC_ALL=C is needed here to prevent sed from failing or producing unexpected results if the file contains characters that are invalid in your current locale.

It also prints one entry per line, but it matches the last occurrence, not the first.

For matching the first, with ast-open's sed whose EREs support the Perl-style *? non-greedy repetition operator:

LC_ALL=C sed -E 's/.*?(123).*/\1/'

(-E for extended RE will be in next version of POSIX)
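Without a non-greedy operator, one way to get the first match in standard sed is to lean on the fact that s without the g flag replaces only the leftmost match. The sketch below fences that match off with newlines and then keeps what lies between them. The function name first_match_sed is hypothetical, and the pattern is assumed to be a BRE that contains no "/".

```shell
# Hypothetical helper: print the first match of a BRE on each input line.
# Assumes the pattern contains no "/" character; reads standard input.
first_match_sed() {
    # First s: wrap the leftmost match in two newlines
    # (backslash followed by a real newline in the replacement).
    # Second s: keep only the text between the two newlines and print it;
    # lines without a match contain no newline and are not printed (-n).
    sed -n 's/'"$1"'/\
&\
/
s/.*\n\(.*\)\n.*/\1/p'
}

printf '%s\n' 'asd 123 aaa 123' 'none' '13 then 123' | first_match_sed '12*3'
```

Because an input line never contains a newline, the two inserted ones are the only ones in the pattern space, so the second substitution is unambiguous.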

cuonglm
  • 153,898
  • Note that this will catch the last 123 on the line and not the first. –  Mar 15 '17 at 10:52
  • @RakeshSharma Yes, of course, not sure why my editing was not updated, Fix it now – cuonglm Mar 15 '17 at 10:57
  • LC_ALL=C changes the meaning of characters though. Even for a pattern like 123, that could have unexpected results, as that 1 for instance could match something that is otherwise part of a character in the user's locale (like Ã that is encoded as 81 30 87 31 in a zh_CN.gb18030 locale, so echo 'Ã23' | LC_ALL=C grep 123 would match there) – Stéphane Chazelas Mar 15 '17 at 11:34
  • 3
    EREs don't have the *? operator. You need perl or compatible regexps for that. – Stéphane Chazelas Mar 15 '17 at 11:42
  • @StéphaneChazelas Ah yes, fixed it. – cuonglm Mar 15 '17 at 13:34
  • 1
    I don't get your last edit... There is no sed that supports PCRE and -E stands for extended regex aka ERE which, as noted above, doesn't support the *? operator. – don_crissti Mar 15 '17 at 13:36
  • 1
    As such, the second solution will still print the last match... – don_crissti Mar 15 '17 at 13:42
  • @don_crissti, ssed (on which GNU sed is based IIRC) supports PCRE with the -R flag. – Stéphane Chazelas Mar 15 '17 at 14:07
  • @StéphaneChazelas - all right (I knew something but wasn't sure so better err on the negative side than on the positive side) but the rest of my comments still stands – don_crissti Mar 15 '17 at 14:09
  • @don_crissti .*? with -E (and .*\? without) appears to be supported by ast-open's sed – Stéphane Chazelas Mar 15 '17 at 14:13
3

With grep in every line:

while IFS= read -r line; do printf '%s\n' "$line" | grep -o 123 | head -1; done < filename

That is:

  • While loop in order to check each line separately.
  • grep -o to get only the match instead of the whole line with matches.
  • head -1 to take only the first match and not the following ones.
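The same idea can be sketched without a shell loop by letting grep tag each match with its line number and keeping only the first match per number. The function name first_match_per_line is made up here. The sketch assumes a grep that supports -o (GNU and BSD grep do, although it is not in POSIX).

```shell
# Hypothetical helper: first match per line, without a shell while loop.
# $1 is the regex; any further arguments are files (standard input if none).
first_match_per_line() {
    pattern=$1
    shift
    # grep -on prints every match as "LINENO:match"; awk keeps only the
    # first entry seen for each line number and strips the prefix.
    grep -on -- "$pattern" "$@" |
        awk '{ n = $0; sub(/:.*/, "", n) }
             !seen[n]++ { sub(/^[0-9]+:/, ""); print }'
}

printf '%s\n' 'asd 123 aaa 123' 'none' '13 then 123' | first_match_per_line '12*3'
```

This starts two processes in total instead of two per input line. When several files are passed, grep also prefixes the file name, which this sketch does not handle.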
cuonglm
  • 153,898
  • 3
    You should avoid using a while loop to process text in a shell script, see http://unix.stackexchange.com/q/169716/38906. Also, you have to double-quote your variable and use printf instead of echo. – cuonglm Mar 15 '17 at 10:39
3
sed -e '
   /\n/{P;d;}
   s/12*3/\n&\n/;D
' < inputfile
2

with awk

re='12*3' awk '{match($0, ENVIRON["re"])}; RSTART{print(substr($0, RSTART, RLENGTH))}' file
iruvar
  • 16,725