
I have a while loop that takes a text file as input (uniq.txt) and uses grep to find duplicates in another file (stage.txt), then writes the number of duplicates and the contents of the line to another file, Output.txt.

For some reason the while loop stops, seemingly at random, around halfway through the file.

while read line; do
    results=$(grep ${line} ./stage.txt | wc -l)
    printf '%s\n' "$line $results" >> Output.txt
done < uniq.txt

This is the part of uniq.txt where the issue occurs; my while loop stops at -b:

apps
archive.
AWACP
awac-pri
-b
backup
bad_file
bak.path
BasicPlu
  • your main issue will probably be gone if you use grep -- ${line}. Other suggestions: paste your code into http://www.shellcheck.net/ to see improvements and avoid pitfalls; see also https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice – Sundeep Sep 13 '17 at 15:54

2 Answers


Your loop stops at -b because at that point ${line} is interpreted as the option -b to grep. To prevent that, you need to add -- to tell grep not to look for further options:

results=$(grep -- "$line" ./stage.txt | wc -l)
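
To see why the loop ends early rather than just printing an error: grep -b ./stage.txt has no pattern operand left, so grep takes ./stage.txt as the pattern and reads its input from stdin, which inside the loop is the redirected uniq.txt. grep swallows the remaining lines, the next read hits end of file, and the loop terminates. A quick way to reproduce both behaviours at the prompt (demo.txt is a made-up file name for illustration):

    $ echo 'find -b here' > demo.txt
    $ grep -b demo.txt       # -b is parsed as an option; grep waits for input on stdin
    ^C
    $ grep -- -b demo.txt    # -- ends option parsing, so -b is the search pattern
    find -b here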
Satō Katsura
  • You can avoid the extra (wc) process by letting grep count the matching lines itself: grep -c -- "$line" – JRFerguson Sep 13 '17 at 17:01
  • @JRFerguson Sure, but that's not the focus of the answer. That piece of code could (and probably should) be improved in a number of ways (read -r, grep -Fc, a better format string for printf, moving >> to the end, possibly other things; a combined sketch follows after this comment thread), but adding them to the answer would IMO make it less clear. The OP's problem was the missing --. Everything else is unrelated optimization. shrug – Satō Katsura Sep 13 '17 at 17:15
  • @SatōKatsura: of course there are multiple optimizations possible. Piping grep output to wc is often a "red flag", just like combining grep and awk in pipelines and the infamous useless uses of cat. I don't see anything wrong with a performance pointer for something that is commonly missed, IMO. – JRFerguson Sep 13 '17 at 17:27
    @JRFerguson It isn't wrong, it's just not the best way to explain it to a beginner. The OP should get the actual answer to his question first. After that there can be further explanations about how to improve the answer, and I probably should have written all that myself (but I'm an old fart and an impatient writer, so I didn't). But the actual answer should stand apart. – Satō Katsura Sep 13 '17 at 17:40
  • I know this isn't by any means an efficient or good way to do this particular job, but it was more of a simple proof of concept. I know there are a million different ways to make this code better, but for my purposes, bad code that outputs what I need is all I require right now. I didn't have time to write a robust script; I just needed to get code into vim and run it. Now I'm writing a much more robust program in Python that suits more of my needs. @SatōKatsura Thank you for the fast and working answer. Saved me a lot of headaches. – cdruckemiller Sep 13 '17 at 18:14
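
Putting the suggestions from this comment thread together, the loop could be tightened along these lines (a sketch only, assuming the lines in uniq.txt are intended as fixed strings rather than regular expressions, hence -F):

    while IFS= read -r line; do
        # -F: fixed-string match, -c: count matching lines, --: end of options
        count=$(grep -Fc -- "$line" ./stage.txt)
        printf '%s %s\n' "$line" "$count"
    done < uniq.txt > Output.txt

Redirecting once after done also avoids reopening Output.txt on every iteration.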

The issue comes from a variable that gets a value that looks like a command-line flag, as Satō Katsura pointed out.

However, what you're doing can also be done with,

awk 'NR==FNR {p[++i]=$0;next} {for (i in p){if (match($0,p[i])){c[i]++}}} END {for (i in p){print p[i],c[i]}}' uniq.txt stage.txt >output.txt

... if the number of patterns in uniq.txt is not in the millions.

The awk script unraveled:

NR==FNR { p[++i] = $0; next     }

        {
            for (i in p) {
                if (match($0, p[i])) {
                    c[i]++
                }
            }
        }

END     {
            for (i in p) {
                print p[i],c[i]
            }
        }

It first reads each line of uniq.txt into the array p, then counts (in the array c) how many lines of input from the second file contain each pattern in p.

At the end, the patterns and their corresponding counts are printed.

This avoids a slow shell loop (executing grep and wc once for each pattern, and opening and writing to the output file that many times), and also avoids having to read the patterns into a shell variable with read.

If you want to do fixed string matching, i.e. not treating the lines in uniq.txt as regular expression patterns but as fixed strings (as with grep -F), just change the match($0, p[i]) function call to index($0, p[i]).
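
For reference, the fixed-string version of the one-liner is a direct substitution (a sketch with the same caveats as above):

    awk 'NR==FNR {p[++i]=$0;next} {for (i in p){if (index($0,p[i])){c[i]++}}} END {for (i in p){print p[i],c[i]}}' uniq.txt stage.txt >output.txt

index() returns the position of the substring or 0, so it works directly as the condition.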

Kusalananda
  • This does make the script considerably faster. I figured AWK would be a better tool for the job but at the time I was looking for a easy-to-write and easily understandable script to prove a point. It was easier for me to explain a while loop than an awk command. – cdruckemiller Sep 13 '17 at 18:16