I am removing stop words from a text, roughly using the code below.

I have the following two files and command:

$ cat file
file
types
extensions

$ cat stopwords
i
file
types

grep -vwFf stopwords file

I am expecting the result: extensions

but I get the (I think incorrect) result:

file
extensions

It is as if the word file has been skipped in the stopwords file. Now here's the cool bit: if I modify the stopwords file by changing the single letter i on the first line to any other ASCII letter apart from f, i, l, or e, then the same grep command gives me a different, and correct, result: extensions.

What is going on here and how do I fix it?

I'm using grep (BSD grep) 2.5.1-FreeBSD on Mac OS X, with GNU bash, version 4.4.12(1).

Tim
  • You may want to use the -x switch for line regex instead of -w for word? However I think the -F switch will cancel either of them out, or vice-versa. – jesse_b Oct 15 '17 at 11:39
  • grep (GNU grep) 3.1 works as you expect. – Hauke Laging Oct 15 '17 at 12:31
  • I have replicated this. Another datum: Making the i pattern the second rather than the first pattern in the stopwords file also alters the behaviour. – JdeBP Oct 15 '17 at 15:06
  • I can't reproduce the behaviour on OpenBSD 6.2 with native grep nor with GNU grep 3.1. – Kusalananda Oct 15 '17 at 15:42
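For what it's worth, the -x suggestion in the first comment can be tried as in the sketch below. As far as I know -x and -F do combine (each fixed string then has to match a whole line), which is enough here because every line of the input holds exactly one word. The scratch directory and the variable names are my own, and I have not checked whether the affected bsdgrep handles -x correctly.

```shell
#!/bin/sh
# Recreate the two files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' file types extensions > "$workdir/file"
printf '%s\n' i file types > "$workdir/stopwords"

# -x: a pattern must match the whole line; -F: patterns are fixed strings.
# Only suitable when each input line is a single word, as in the question.
result=$(grep -vxFf "$workdir/stopwords" "$workdir/file")
echo "$result"

rm -rf "$workdir"
```

This is not a general replacement for -w: with running text, where a line holds several words, -x would no longer drop lines that merely contain a stop word.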

2 Answers

This was a bug in bsdgrep. It keeps a variable that tracks the part of the current line that is still to be scanned, and when multiple patterns are involved that variable is overwritten by successive calls to the regular expression matching engine.
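To check whether a particular grep is affected, something like the following sketch may help. It recreates the two files from the question in a scratch directory and compares the output of the original command with the expected single line. The directory, the variable names and the wording of the verdict are my own choices, not anything from the question.

```shell
#!/bin/sh
# Recreate the two files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' file types extensions > "$workdir/file"
printf '%s\n' i file types > "$workdir/stopwords"

# Run the exact command from the question and capture its output.
actual=$(grep -vwFf "$workdir/stopwords" "$workdir/file")
expected=extensions

# A correct grep removes both "file" and "types"; the buggy one keeps "file".
if [ "$actual" = "$expected" ]; then
    verdict="not affected"
else
    verdict="affected"
fi
echo "grep -vwFf on this system: $verdict"

rm -rf "$workdir"
```

Judging by the comments on the question, GNU grep 3.1 and OpenBSD's grep should report "not affected".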

local fix

You can work around this, to an extent, by not using the -w option, which relies upon this variable for correct operation and therefore fails. Instead, use the regular expression extensions that match the beginnings and ends of words, making your stopwords file look like:

\<i\>
\<file\>
\<types\>

This workaround will also require that you do not use the -F option.
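If you would rather not edit the stopwords file by hand, the wrapping can be generated. The following is only a sketch: the stopwords.re file name and the scratch directory are made up for illustration, and it assumes that the stop words contain no regular expression metacharacters (such as . or *), since they are no longer treated as fixed strings.

```shell
#!/bin/sh
# Recreate the two files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' file types extensions > "$workdir/file"
printf '%s\n' i file types > "$workdir/stopwords"

# Wrap every stop word in \< and \> so each pattern carries its own word
# boundaries. Assumes the words contain no regex metacharacters.
sed 's/.*/\\<&\\>/' "$workdir/stopwords" > "$workdir/stopwords.re"

# No -w and no -F here: the patterns are ordinary regular expressions.
result=$(grep -vf "$workdir/stopwords.re" "$workdir/file")
echo "$result"

rm -rf "$workdir"
```

Keeping the generated patterns in a separate file leaves the original stopwords file usable with -wF on a grep that does not have the bug.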

Note that the documented regular expression components [[:<:]] and [[:>:]] that the re_format manual tells you about will not work here. This is because the regular expression library that is compiled into bsdgrep has GNU regular expression compatibility support turned on. This is another bug, which is reportedly fixed.

service fix

This bug was fixed earlier this year. The fix has not yet made it into the STABLE or RELEASE flavours of FreeBSD, but is reportedly in CURRENT.

To get this into the macOS version of grep, which is derived from FreeBSD's bsdgrep, please consult Apple. ☺

JdeBP
  • Nice, and thanks for reporting this upstream. I would find this answer even more fascinating if it quoted the buggy code. – dhag Oct 15 '17 at 17:57

This code:

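# pl and E are not defined in the original post. The definitions below are
# assumed stand-ins so that the snippet can run on its own.
pl() { printf '\n-----\n%s\n' "$*"; }   # print a ruler line, then the message
E=expected-output.txt                    # hypothetical file with the expected output
printf 'extensions\n' > "$E"

pl " Input data file data1 and stopwords file data2:"
head data1 data2

pl " Expected output:"
cat "$E"

pl " Results, grep:"
# Same as the question's command: grep -vwFf stopwords file
grep -vwFf data2 data1

pl " Results, cgrep:"
cgrep -x1 -vFf data2 data1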

produces:

-----
 Input data file data1 and stopwords file data2:
==> data1 <==
file
types
extensions

==> data2 <==
i
file
types

-----
 Expected output:
extensions

-----
 Results, grep:
file
extensions

-----
 Results, cgrep:
extensions

On a system like:

OS, ker|rel, machine: Apple/BSD, Darwin 16.7.0, x86_64
Distribution        : macOS 10.12.6 (16G29), Sierra
bash GNU bash 3.2.57

More details on cgrep, available via brew, and from sourceforge:

cgrep   shows context of matching patterns found in files (man)
Path    : ~/executable/cgrep
Version : 8.15
Type    : Mach-O 64-bit executable x86_64 ...)
Home    : http://sourceforge.net/projects/cgrep/ (doc)

cheers, drl

drl
  • just got myself a new grep. – Tim Oct 17 '17 at 12:26
  • @Tim -- I hope you find cgrep as useful as I have. The speed on the tests I have done put it roughly on a par with GNU grep, and the "context / windowing" features are very useful. It also builds easily on Linux systems ... cheers, drl – drl Oct 18 '17 at 12:26