Finding string multiple positions in a large text file

Question

I need to find the exact positions of a certain string within a text. For example, given this file:

to be or not to be, that's the question

For the string "to", the output I need is 0,14 (the number of characters from the beginning of the file to each occurrence of my string). I tried:

$ grep -o 'to' myfile.txt | wc -l

This gives me "8597", which I assume is the total number of occurrences, but I need the positions within the text, in characters.

... and if the latter, do you want the positions relative to the start of each line, or the start of the file? and is it a string match or a word match (i.e. should to match in potato for example)? — steeldriver, Aug 01 '18 at 09:49

Kusalananda · Answer 1 · 2018-08-01T07:51:00.130

$ awk -v str='to' '{ off=0; while (pos=index(substr($0,off+1),str)) { printf("%d: %d\n", NR, pos+off); off+=pos+length(str)-1 } }' file
1: 1
1: 14

Or, more nicely formatted:

awk -v str='to' '
    {
        off = 0  # current offset in the line from whence we are searching
        while (pos = index(substr($0, off + 1), str)) {
            # pos is the position within the substring where the string was found
            printf("%d: %d\n", NR, pos + off)
            off += pos + length(str) - 1  # resume right after this match
        }
    }' file

The awk program outputs the line number followed by the position of the string on that line. If the string occurs multiple times on the line, multiple lines of output will be produced.

The program uses the index() function to find the string on the line, and if it's found, prints the position on the line where it is found. It then repeats the process for the rest of the line (using the substr() function) until no more instances of the string can be found.

In the code, the off variable keeps track of the offset from the start of the line from which we need to do the next search. The pos variable contains the position within the substring at offset off where the string was found.

The string is passed on the command line using -v str='to'.
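
The question asks for 0-based offsets from the start of the file rather than per-line positions. The following is a minimal sketch of how the same loop might be adapted for that, not part of the original answer. The variable base is a name introduced here for illustration, and the sketch assumes single-character newlines. Whether length() counts bytes or characters depends on the awk implementation and locale, so the sample is kept to ASCII.

```shell
# Sample input taken from the question (ASCII only, one line).
sample=$(mktemp)
printf '%s\n' "to be or not to be, that's the question" > "$sample"

# base (hypothetical name) holds the length of all previous lines,
# newlines included, so base + off + pos - 1 is a 0-based file offset.
offsets=$(awk -v str='to' '
    {
        off = 0
        while (pos = index(substr($0, off + 1), str)) {
            print base + off + pos - 1
            off += pos + length(str) - 1   # resume right after this match
        }
        base += length($0) + 1             # + 1 for the newline
    }' "$sample" | paste -s -d ',' -)

echo "$offsets"   # prints 0,13
rm -f "$sample"
```

Resuming at off + pos + length(str) - 1 means a match that starts immediately after the previous one (as in toto) is still found, while overlapping matches are not reported.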

Example:

$ cat file
To be, or not to be: that is the question:
Whether ‘tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, ‘tis a consummation
Devoutly to be wish’d. To die, to sleep;

$ awk -v str='the' '{ off=0; while (pos=index(substr($0,off+1), str)) { printf("%d: %d\n", NR, pos+off); off+=pos+length(str)-1 } }' file
1: 30
2: 4
2: 26
5: 21
7: 20

score 1 · Answer 2 · answered Aug 01 '18 at 08:04

Try

grep -b 'to' file

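for the byte offset from the beginning of the file (without -o, this is the offset of the start of each matching line, not of the match itself); or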

grep -nb 'to' file

for line number and offset.
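
To make the difference between line offsets and match offsets concrete, here is a small sketch on a made-up two-line sample (the sample text and variable names are assumptions for illustration). It relies on grep's -b and -o options as implemented by GNU grep.

```shell
sample=$(mktemp)
printf '%s\n' 'first line' 'go to it' > "$sample"

# Without -o: byte offset of the start of each matching line.
line_offsets=$(grep -b 'to' "$sample")
# With -o: byte offset of each match itself.
match_offsets=$(grep -bo 'to' "$sample")

echo "$line_offsets"    # prints 11:go to it
echo "$match_offsets"   # prints 14:to
rm -f "$sample"
```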

JJoao

score 1 · Answer 3 · answered Aug 01 '18 at 09:26

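If your file has multiple lines and you only need the first occurrence of your string, you could use the command below. Note that [^to] is a bracket expression that excludes the single characters t and o, not the string to, so the result is only meaningful when neither letter appears before the first match. The -z option and \w are GNU sed extensions.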

sed -zE 's/^(\w[^to]+)(to)(.*)/\1\2/' YourFile | wc -c
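
As an alternative sketch for the first occurrence only, grep's -b and -o options can report the byte offset directly, keeping just the first result. This is a swapped-in technique rather than the sed approach above. The sample text and the variable name first are assumptions for illustration, and GNU grep is assumed.

```shell
sample=$(mktemp)
printf '%s\n' 'a line without it' 'go to it, then to bed' > "$sample"

# Print offset:match for every match, keep the first, strip the match text.
first=$(grep -aob 'to' "$sample" | head -n 1 | cut -d: -f1)

echo "$first"   # prints 21
rm -f "$sample"
```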

Hossein Vatani

slm · Answer 4 · 2018-08-01T08:52:11.477

You can use grep to do this:

$ grep -aob 'to' file | grep -oE '[0-9]+'
0
13

Incidentally, your math appears to be off when you say you're looking for 0,14: the second to starts at position 13 if you count the first character as 0, which you appear to do given that your coordinates start at 0.

If you want the above output to be a comma separated list of coordinates:

$ grep -aob 'to' file | grep -oE '[0-9]+' | paste -s -d ','
0,13

How does it work?

This method exploits GNU grep's ability to print the byte offset of matches (-b), and we force it to print only the matched parts, rather than whole lines, via the -o switch.

   -b, --byte-offset
          Print the 0-based byte offset within the input file before each
          line of output.  If -o (--only-matching) is specified, print the 
          offset of the matching part itself.
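
To show what the second stage of the pipeline receives, this sketch captures the intermediate offset:match output on the question's sample line. It also uses cut -d: -f1 to keep only the offset field, which is one possible alternative to filtering digits and would not pick up digits from the matched text itself. The variable names raw and offs are introduced here for illustration.

```shell
sample=$(mktemp)
printf '%s\n' "to be or not to be, that's the question" > "$sample"

# Intermediate output: one offset:match pair per line.
raw=$(grep -aob 'to' "$sample")
# Keep only the field before the first colon, then join with commas.
offs=$(printf '%s\n' "$raw" | cut -d: -f1 | paste -s -d ',' -)

echo "$raw"    # prints 0:to and 13:to on separate lines
echo "$offs"   # prints 0,13
rm -f "$sample"
```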

More advanced examples

If your input includes words such as toto, or spans multiple lines, this improved version of the above approach can handle those cases as well.

Sample data

$ cat file
to be or not to be, that's the question
that is the to to question
toto is a dog

Example

$ grep -aob '\bto\b' file | grep -oE '[0-9]+' | paste -s -d ','
0,13,52,55

Here we're using word boundaries \b on either side of the word so that we only count standalone occurrences of to and not words that merely contain it, such as toto.
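
For contrast, this sketch runs the same pipeline on the sample data with and without the word boundaries, so the extra hits inside toto become visible. The variable names are assumptions for illustration, and \b is assumed to be supported as in GNU grep.

```shell
sample=$(mktemp)
printf '%s\n' "to be or not to be, that's the question" \
              'that is the to to question' \
              'toto is a dog' > "$sample"

# Plain substring match: also hits both halves of "toto".
substring_hits=$(grep -aob 'to' "$sample" | grep -oE '[0-9]+' | paste -s -d ',' -)
# Word match: only standalone "to".
word_hits=$(grep -aob '\bto\b' "$sample" | grep -oE '[0-9]+' | paste -s -d ',' -)

echo "$substring_hits"   # prints 0,13,52,55,67,69
echo "$word_hits"        # prints 0,13,52,55
rm -f "$sample"
```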

$ grep -aob 'to' file | grep -oE '[0-9]+' worked perfectly!!! Thanks a lot Slm — Jose, Aug 01 '18 at 09:27
