
I have two directories that both have a couple thousand files each, and I am trying to grep certain IPs from the files. My grep string is:

grep "IP" cdr/173/07/cdr_2018_07*

This grep string returns "grep: Argument list too long". However, when I do the following:

grep "IP" cdr/173/06/cdr_2018_06*

it returns what I am looking for.

Below is the ls -l for the parent directory for each of these. It seems that the difference is only about 45 KB, so I'm not sure that size is really the issue here. Am I missing something?

jeblin@debian:~$ ls -l cdr/173
total 18500
REDACTED
drwxr-xr-x 2 jeblin jeblin 2781184 Jul  2 09:34 06
drwxr-xr-x 2 jeblin jeblin 2826240 Aug  1 07:33 07

If it makes a difference: I wrote a Python script that automates this process (searching for multiple IPs), and it likewise works for 06 but not for 07, which is why I tried the manual grep search first.

3 Answers


The shell is not able to call grep with that many files. More precisely, the length of the command line[1] for calling an external utility has a limit, and you're hitting it when the shell expands the cdr/173/07/cdr_2018_07* globbing pattern and tries to call grep with the result.

What you can do is either to grep each file individually, with

for pathname in cdr/173/07/cdr_2018_07*; do
    grep "IP" "$pathname" /dev/null
done

where the extra /dev/null will force grep to always report the filename of the file that matched, or you can use find:

find cdr/173/07 -maxdepth 1 -type f -name 'cdr_2018_07*' \
    -exec grep "IP" /dev/null {} +

which will be more efficient as grep will be called with as many matching pathnames as possible in batches.
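If you prefer a pipeline over find, a similar batching effect can be had with xargs. A sketch, assuming an xargs that supports -0 (GNU and BSD both do) and a shell whose printf is a built-in, so the expanded glob never passes through execve():

# NUL-delimit the pathnames and let xargs invoke grep in maximal batches
printf '%s\0' cdr/173/07/cdr_2018_07* |
    xargs -0 grep "IP" /dev/null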

It could also be that if you first cd into cdr/173/07 and do

grep "IP" cdr_2018_07*

it may work, since the generated list of filenames would be shorter due to not containing the directory bits. But you're probably very close to the limit with 44.6k files and should seriously consider moving to another way of doing this, especially if you expect the number of files to keep fluctuating around that number.
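For scripting, the cd can be confined to a subshell so that the rest of the script keeps its working directory. A sketch; it only buys you room, as the shorter expansion must still fit under the limit:

( cd cdr/173/07 && grep "IP" cdr_2018_07* /dev/null )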

[1] The limit is on the combined length of the command line and the environment (the sum of the lengths of each argument and of each environment variable's name and value, also accounting for the pointers to them), and it is imposed by the execve() system call, which the shell uses to execute external commands. Built-in commands such as echo do not have this issue.
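As a quick illustration of that last point, a sketch assuming bash, where echo is a built-in:

# /bin/echo is an external command: its arguments are passed through
# execve(), and a huge expansion may exceed the limit
/bin/echo cdr/173/07/cdr_2018_07* >/dev/null

# the shell's built-in echo involves no execve(), so no limit applies
echo cdr/173/07/cdr_2018_07* >/dev/null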

Kusalananda
  • There probably should be a tag for this specific issue by now :) – Sergiy Kolodyazhnyy Aug 01 '18 at 15:52
  • I think I understand what you're saying, but it works when I use the same grep string for the 06 directory, which is very close to the same size as the 07 directory. In fact, when I do ls -lh, the 06 directory is 43,200 files @ 5.6G total, and the 07 directory is 44,640 files @ 5.0G. If I understand you correctly, it shouldn't work for the 06 directory either. – Josh Eblin Aug 01 '18 at 16:03
  • @JoshEblin The size of the files does not matter, but the length of the filenames does matter, and one of the globs expands to something that is too long, while the other doesn't. I will add an extra bit to the answer. – Kusalananda Aug 01 '18 at 16:05
  • I do know for sure that if I manually cd into the directory and then grep, it works, however I haven't been able to script the cd then grep. The length of the filenames is always the same. I have changed my script to use the find command, and am running it now, but it seems to be much slower, assuming it is working. – Josh Eblin Aug 01 '18 at 16:12
  • @JoshEblin Building your script around the fact that you need to cd into the directory for the grep to work is a bit like playing with fire. It'll stop working the day you accumulate even more files. Using find would be safer. It should not be much slower than your original command if you use -exec grep ... {} +. – Kusalananda Aug 01 '18 at 16:39

The issue is the limit on the number of bytes allowed on the command line.

* is expanded to the full list of matching files in the directory, so what matters isn't the file size but the length of the filenames and the number of files.

You can get your machine's limit in bytes by running getconf ARG_MAX. Note that this limit is imposed by the OS/kernel, not by the shell itself.
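To get a rough idea of how close a particular glob comes to that limit, you can let a shell built-in expand it and count the bytes. A rough check only (it ignores the environment and the per-argument pointer overhead), but since printf is a built-in, the expansion itself is safe:

$ printf '%s\n' cdr/173/07/cdr_2018_07* | wc -c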

A way to circumvent this is to use find:

$ find cdr/173/07/ -iname "cdr_2018_07*" -type f -exec grep "IP" {} \;
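Note that -exec ... \; runs a separate grep for every single file. Terminating the -exec with {} + instead makes find batch as many pathnames as will fit into each grep invocation, which is considerably faster over thousands of files (the /dev/null forces grep to print filenames even when a batch contains one file):

$ find cdr/173/07/ -iname "cdr_2018_07*" -type f -exec grep "IP" /dev/null {} +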

confetti

I am answering on the basis of the following assumptions: there are two directories, cdr/173/07 and cdr/173/06, and there can be many files in each of them to search through.

a) A first solution is to search both directories recursively:

grep -r "IP" cdr/173/07 cdr/173/06

b) If there are many other directories you want to search, you can use:

grep -r "IP" cdr/173/*

c) Suppose we have 1000 files and want to search only particular types of files. With GNU grep, the --include option filters by filename inside grep itself, so the shell never has to expand a long glob (note that -r also descends into subdirectories, unlike the original glob):

grep -r --include="cdr_2018_07*" "IP" cdr/173/07
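If both months need the same search, GNU grep accepts several --include options, which are OR'ed together, so a single call can cover both name patterns:

grep -r --include="cdr_2018_06*" --include="cdr_2018_07*" "IP" cdr/173/06 cdr/173/07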