
I have a file that contains more than a hundred thousand IDs. Each ID is composed of 8 to 16 hexadecimal digits:

178540899f7b40a3
6c56068d
8c45235e9c
8440809982cc
6cb8fef5e5
7aefb0a014a448f
8c47b72e1f824b
ca4e88bec
...

I need to find the related files in a directory tree that contains around 2×10⁹ files.

Given an ID like 6c56068d219144dd, I can find its corresponding files with:

find /dir -type f -name '* 6[cC]56068[dD]219144[dD][dD] *'

But that takes at least two days to complete...

What I would like to do is to call find with as many -o -iname GLOB triplets as ARG_MAX allows.

Here's what I've thought of doing:

sed -e 's/.*/-o -iname "* & *"/' ids.txt |
xargs find /dir -type f -name .

My problem is that I can't force xargs to take in only complete triplets.

How can I do it?

Fravadona

3 Answers


That's the wrong approach. If the point is to find all the files whose name contains one of those IDs as one of its space-delimited words, then you could do:

find /dir -type f -print0 |
  gawk '
    !ids_processed {ids[$0]; next}
    {
      n = split(tolower($NF), words, " ")
      for (i = 1; i <= n; i++)
        if (words[i] in ids) {
          print
          break
        }
    }' ids.txt ids_processed=1 RS='\0' FS=/ -

Then you process the file list only once, and looking up the 100k IDs is just a hash-table lookup instead of up to 100k regex/wildcard matches per file name.

  • I've never seen awk being called like that, with some variables being defined after the first file name. Where is this behaviour defined? I couldn't find it in the man page; maybe I missed it. I understand what you're trying to do, which is setting some variables only after the first file has been read completely, but how is this allowed? For instance, how come awk doesn't look for a file named ids_processed=1 and instead treats it as a variable assignment? – aviro Sep 04 '23 at 08:08
  • Ok found it: "If a filename on the command line has the form var=val it is treated as a variable assignment. The variable var will be assigned the value val. (This happens after any BEGIN block(s) have been run.) Command line variable assignment is most useful for dynamically assigning values to the variables AWK uses to control how input is broken into fields and records. It is also useful for controlling state if multiple passes are needed over a single data file." Nice! I never knew you could do that! – aviro Sep 04 '23 at 08:10
  • Note that, contrary to -v var=value, that form was already available in the original awk from the late 70s (-v was added in nawk in the 80s). It means awk can't process files whose name contains = characters if what's left of the first = is a valid awk variable name. That's why you need awk '...' ./*.txt instead of awk -- '...' *.txt for instance (you'll find several answers here mentioning this kind of problem). See also the -E option of gawk to work around it. – Stéphane Chazelas Sep 04 '23 at 10:45
  • Excellent: a much faster approach, with lots of precise, little-known information, as usual for your answers! Thank you. (I learned about the variable interpretation in the argument list, and its workaround. Using ./files* instead of files* is almost always preferable for many usages anyway, and is a good habit to adopt, as it avoids several pitfalls with many commands (e.g. a filename beginning with '-' being interpreted as options by rm, etc.).) You should write a book with all the tips and "good to know" things about the shell and many utilities – Olivier Dulac Sep 04 '23 at 14:54
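A minimal illustration of both points raised in these comments (the file names f1, f2 and x=1 are contrived): mid-list assignments let FS change between input files, and a file whose name contains = can be silently consumed as an assignment unless prefixed with ./ so it no longer parses as one.

```shell
# f1 is comma-separated, f2 is colon-separated; each FS=... assignment
# takes effect when awk reaches that point in the argument list.
printf 'a,b\n' > f1
printf 'x:y\n' > f2
awk '{ print $2 }' FS=, f1 FS=: f2   # prints "b", then "y"

# The pitfall: a file literally named "x=1" is treated as an assignment,
# so awk falls back to (here, empty) stdin instead of reading it.
echo data > 'x=1'
awk '{ print }' 'x=1' </dev/null     # no output: parsed as the assignment x=1
awk '{ print }' './x=1'              # prints "data": ./x is not a valid variable name
```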

What I would do:

Write a script to save all the file names to a temporary:

# maybe run this from cron or behind inotifywait
find dir -type f -print > /tmp/filelist

Then do a lookup as needed using your input file:

fgrep -if hexids /tmp/filelist 

I might suggest using -wif instead of -if, but from the other comments it's not clear that you are providing accurate information in your question. See man grep for more information.
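As a sketch of what -w changes, using grep -F (equivalent to fgrep) on hypothetical sample data: without it, a short ID also matches inside a longer one.

```shell
# One made-up ID and two made-up paths; the second path contains the ID
# only as a prefix of a longer hex string.
printf '%s\n' 6c56068d > hexids
printf '%s\n' '/dir/report 6c56068d.txt' '/dir/other 6c56068d219144dd.txt' > filelist
grep -Fif hexids filelist    # matches both lines
grep -Fwif hexids filelist   # -w: matches only the first line
```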

  • That would look for the ids in the whole file paths, not just their (base) names. – Stéphane Chazelas Sep 04 '23 at 10:51
  • Yes, and...? The original question appears to try to separate on word boundaries, and says all the files are in one directory. I provided the -w option. The sample shows a solution to the problem posted, not any other. – rand'Chris Sep 05 '23 at 12:03

Thanks to @Kusalananda, I thought of one possible solution:

The first step is to make xargs treat each -a -b X triplet as a single argument. Then an inline sh script re-splits those single-argument triplets and calls the utility from there.

... |
awk '{ printf("%s%c", $0, 0) }' |
xargs -0 sh -c '[ "$#" -gt 0 ] && { printf %s\\n "$@" | xargs "$0"; }' my_command
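To check the mechanics, you can substitute echo for my_command (the triplets below are made up): each input line travels as one NUL-delimited argument through the outer xargs, then gets re-split, with its quotes honoured, by the inner xargs.

```shell
printf '%s\n' '-o -iname "* 6c56068d *"' '-o -iname "* ca4e88bec *"' |
  awk '{ printf("%s%c", $0, 0) }' |
  xargs -0 sh -c '[ "$#" -gt 0 ] && { printf "%s\n" "$@" | xargs "$0"; }' echo
# prints: -o -iname * 6c56068d * -o -iname * ca4e88bec *
```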
Fravadona
  • if you're on GNU, xargs -d '\n' would do to treat the lines as units. Or if not, tr '\n' '\0' | xargs -0 is a bit shorter than the awk. And yeah, my idea for the splitting would have been something like sh -c 'set -f; my_command $@' _. – ilkkachu Aug 31 '23 at 21:00
  • Though now I do wonder if it's always safe, with respect to the limit, to transform the single arg a b c into the three args a, b and c. If the pointers to the arg strings count (the ones passed to the main() of the executed program), then the effective size of the string would increase when split. Though with the first xargs filling the available space pretty well, I guess you'd see that issue quickly if it showed up – ilkkachu Aug 31 '23 at 21:01
  • I can't use my_command $@ as there are spaces in the third component of the triplet; that's also the reason for using awk, as I can do further escaping with it – Fravadona Aug 31 '23 at 21:28
  • Well, in that case, I would say your example data isn't representative (I'm also not sure I'd call something with spaces a "word" in general). There's some difference between splitting on all spaces, vs. splitting max N times, vs. handling quotes while doing it. – ilkkachu Sep 01 '23 at 06:08
  • @ilkkachu I simplified the problem on purpose, because it isn't relevant when the splitting is done with xargs; also, I thought that xargs might have an obscure option to do the job easily. – Fravadona Sep 01 '23 at 08:48
  • yes, it's not relevant if the splitting is done with xargs. But while xargs does support quotes and escaping, it does so with a syntax that's (slightly) different than e.g. the shell syntax, and there are way more tools that would lend themselves nicely to whitespace-separated inputs. Also, with questions posted on the site, it's often the case that the easiest tool to use is not the one the poster originally tried. But if the question doesn't represent the real data, it's impossible to know which tools would be valid. – ilkkachu Sep 01 '23 at 09:20
  • Someone could have spent time working on a solution with e.g. Perl, just to have you tell them that a-ha! the data is actually different and they wasted their time. Trying to help you. For free. So yeah, please try to avoid setting up even the opportunity for that to happen. If the data is actually something like -a -b "foo bar", and known to be aimed at xargs, just say so. It only needs one sentence: "The data uses quotes and escapes as interpreted by xargs". – ilkkachu Sep 01 '23 at 09:23
  • @ilkkachu Well, such a solution wouldn't help me, but it would still be useful to a lot of people with simpler requirements. – Fravadona Sep 01 '23 at 09:40