
I have the following files:

Codigo-0275_tdim.matches.tsv  
Codigo-0275_tdim.snps.tsv  
FloragenexTdim_haplotypes_SNp3filter17_single.tsv  
FloragenexTdim_haplotypes_SNp3filter17.tsv  
FloragenexTdim_SNP3Filter17.fas  
S134_tdim.alleles.tsv    
S134_tdim.snps.tsv  
S134_tdim.tags.tsv

I want to count the number of files that have the word snp (case-sensitive) in their name. I tried using

grep -a 'snp' | wc -l   

but then I realized that grep searches within the files. What is the correct command to scan through the file names?

Chris Davies
Lucia O
5 Answers

36

Do you mean you want to search for snp in the file names? That would be a simple shell glob (wildcard), used like this:

ls -dq *snp* | wc -l

The -q flag handles filenames containing "strange" characters (including newlines); omit it if your version of ls doesn't recognise it.
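For example, with just three of the question's files present in an empty directory, the glob matches only the two names containing a lowercase snp (a sketch):

```shell
# Create three of the question's filenames in a scratch directory,
# then count the names the shell glob *snp* matches.
cd "$(mktemp -d)" || exit 1
touch Codigo-0275_tdim.snps.tsv S134_tdim.snps.tsv S134_tdim.tags.tsv
ls -dq *snp* | wc -l    # 2: only the two .snps.tsv names match
```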

Chris Davies
  • Wasn't sure if I could use ls to retrieve file names with specific text in them. That worked though, thanks. – Lucia O May 12 '15 at 22:29
  • @LuciaO re-reading your comment, it's not ls that is matching the file names, it's the shell. ls sees a list of files matching the pattern; it does not see the pattern itself. – Chris Davies May 12 '15 at 22:34
  • 2
    note this might not work if you have too many files returning. – Dennis Nolte Aug 26 '16 at 11:33
  • @qwr there's no need to shout, particularly as there is no parsing going on here. – Chris Davies Sep 29 '22 at 07:07
  • sorry for caps I was always under the impression that ls is output for humans. For example not in this case but it hides files starting with dot by default and shows directories too. – qwr Sep 29 '22 at 14:59
  • You're right - in the general case I also would never recommend parsing ls. However, for this particular example I would consider ls | wc to be sufficient (but not perfect). If the OP then wanted to iterate over that with an anti-pattern of for f in $(ls *snp*) I would replace it with for f in *snp*. Looking at an earlier comment it seems likely that the OP thought it was ls handling the wildcard rather than the shell. Does this help? – Chris Davies Sep 29 '22 at 15:11
6

If you stand quietly in the hallways of Unix&Linux and listen carefully, you’ll hear a ghostly voice, pitifully wailing, “What about filenames that contain newlines?”

ls -d *snp* | wc -l

or, equivalently,

printf "%s\n" *snp* | wc -l

will output all the filenames that contain snp, each followed by a newline, but also including any newlines in the filenames, and then count the number of lines in the output.  If there is a file whose name is

    foosnp\nbar.tsv

then that name will be written out as

foosnp
bar.tsv

which, of course, will be counted as two lines.

There are a few alternatives that do better in at least some cases:

printf "%s\n" * | grep -c snp

which counts the lines that contain snp, so the foosnp(\n)bar.tsv example from above counts only once.  A slight variation on this is

ls -f | grep -c snp

The above two commands differ in that:

  • The ls -f will include files whose names begin with .; the printf … * does not, unless the dotglob shell option is set.
  • printf is a shell builtin; ls is an external command.  Therefore, the ls might use slightly more resources.
  • When the shell processes a *, it sorts the filenames; ls -f does not sort the filenames.  Therefore, the ls might use slightly less resources.

But they have something in common: they will both give wrong results in the presence of filenames that contain newline and have snp both before and after the newline.
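That failure mode is easy to reproduce in an empty directory (a sketch): one ordinary name and one name with an embedded newline, snp on both sides of it.

```shell
# Two files, one of them with an embedded newline in its name.
cd "$(mktemp -d)" || exit 1
touch a_snp.tsv "$(printf 'foosnp\nbar_snp.tsv')"
printf '%s\n' *snp* | wc -l        # 3: the newline name spans two lines
printf '%s\n' *snp* | grep -c snp  # 3: both halves of that name contain snp
```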

Another:

filenamelist=(*snp*)
echo ${#filenamelist[@]}

This creates a shell array variable listing all the filenames that contain snp, and then reports the number of elements in the array.  The filenames are treated as strings, not lines, so embedded newlines are not an issue.  It is conceivable that this approach could have a problem if the directory is huge, because the list of filenames must be held in shell memory.
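A quick sketch with such a hostile name (assuming bash, since arrays are not POSIX):

```shell
# Two files, one with an embedded newline; the array holds each
# filename as a single string, so the newline cannot split it.
cd "$(mktemp -d)" || exit 1
touch a_snp.tsv "$(printf 'foosnp\nbar_snp.tsv')"
filenamelist=(*snp*)
echo "${#filenamelist[@]}"   # 2
```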

Yet another:

Earlier, when we said printf "%s\n" *snp*, the printf command repeated (reused) the "%s\n" format string once for each argument in the expansion of *snp*.  Here, we make a small change in that:

printf "%.0s\n" *snp* | wc -l

This will repeat (reuse) the "%.0s\n" format string once for each argument in the expansion of *snp*.  But "%.0s" means to print the first zero characters of each string — i.e., nothing.  This printf command will output only a newline (i.e., a blank line) for each file that contains snp in its name; and then wc -l will count them.  And, again, you can include the . files by setting dotglob.
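The reuse of the format string is easy to see with plain arguments:

```shell
# printf reuses '%.0s\n' once per argument; '%.0s' prints at most zero
# characters of each argument, leaving one bare newline per argument.
printf '%.0s\n' a b c | wc -l   # 3
```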

  • all this messing with ls and printf and shell arrays when it's all handled by find! – qwr Sep 29 '22 at 15:01
3

Abstract:

Works for files with "odd" names (including newlines).

set -- *snp* ; echo "$#"                             # change positional arguments

count=$(printf 'x%.0s' *snp*); echo "${#count}"      # most shells

printf -v count 'x%.0s' *snp*; echo "${#count}"      # bash


Description

As a simple glob will match every filename with snp in its name, a simple echo *snp* could be enough for this case; but to really show that there are only three files matching, I'll use:

$ ls -Q *snp*
"Codigo-0275_tdim.snps.tsv"  "foo * bar\tsnp baz.tsv"  "S134_tdim.snps.tsv"

The only remaining issue is counting the files. Yes, grep is a usual solution, and yes, counting lines with wc -l is also a usual solution. But note that grep -c counts matching lines, not filenames: a filename containing a newline with a snp string on both sides of it will be counted twice.

We can do better.

One simple solution is to set the positional arguments:

$ set -- *snp*
$ echo "$#"
3
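The same trick can be wrapped in a helper function (a sketch; count is a hypothetical name). Inside the function, $# refers to the function's own arguments, so the shell's positional parameters are left untouched. One caveat: with no matching files, the unexpanded pattern itself is passed and counts as 1, unless nullglob (or equivalent) is set.

```shell
# count: print the number of arguments it receives; the shell expands
# the glob into those arguments before the function runs.
count() { echo "$#"; }

cd "$(mktemp -d)" || exit 1
touch a_snp.tsv b_snp.tsv c.tsv
count *snp*    # 2
```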

To avoid changing the positional arguments we can transform each argument to one character and print the length of the resulting string (for most shells):

$ printf 'x%.0s' *snp*
xxx

$ count=$(printf 'x%.0s' *snp*); echo "${#count}"
3

Or, in bash, to avoid a subshell:

$ printf -v count 'x%.0s' *snp*; echo "${#count}"
3

File list

List of files (from the original question, with one name containing a newline added):

a='
Codigo-0275_tdim.matches.tsv
Codigo-0275_tdim.snps.tsv
FloragenexTdim_haplotypes_SNp3filter17_single.tsv
FloragenexTdim_haplotypes_SNp3filter17.tsv
FloragenexTdim_SNP3Filter17.fas
S134_tdim.alleles.tsv
S134_tdim.snps.tsv
S134_tdim.tags.tsv'
$ touch $a

$ touch $'foosnp\nbar.tsv'

That creates a file with a newline in the middle of its name:

foosnp\nbar.tsv

And to test glob expansion:

$ touch $'foo * bar\tsnp baz.tsv'

That adds a name containing an asterisk which, if left unquoted, would expand to the whole list of files.

-1

In general, don't parse ls. If you want to find files, use find:

find . -name "*snp*" | wc -l 

This will count the number of files (and directories) in the current directory and subdirectories whose names match the glob *snp*. find itself handles filenames containing newlines, but since it prints one name per line, wc -l will still miscount such names.

For more options, you could modify the find command like

find . -maxdepth 1 -type f -name "*snp*" 

This will only search for files in the current directory (and not in subdirectories).
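Counting lines of find output still miscounts names containing newlines; a newline-safe sketch (assuming GNU find, for -maxdepth and -printf):

```shell
# Print one 'x' per matching file instead of its name, then count bytes.
cd "$(mktemp -d)" || exit 1
touch a_snp.tsv "$(printf 'foosnp\nbar_snp.tsv')"
find . -maxdepth 1 -type f -name '*snp*' -printf x | wc -c   # 2
```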

See also https://stackoverflow.com/questions/36933032/count-files-in-a-directory-with-filename-matching-a-string

qwr
  • The second find suggestion returns a warning message about the order of actions. Put -maxdepth 1 first – Chris Davies Sep 29 '22 at 08:24
  • @roaima interesting. something about options before search expressions https://unix.stackexchange.com/questions/328851/warning-find-warning-you-have-specified-the-maxdepth-option-after-a-non-optio – qwr Sep 29 '22 at 08:46
  • Yes. You have to put depth-related options at the beginning of the list - I think it's enforcing semantic rules so you don't accidentally end up with constructs such as -type f -o -depth, where the presence of -depth actually applies to the entire find command but wouldn't necessarily appear to do so from the construction – Chris Davies Sep 29 '22 at 10:08
  • Did you read the explanation of why some of the answers are a little more complex than this? Parsing the output from find isn’t much better than parsing the output from ls. – G-Man Says 'Reinstate Monica' Oct 02 '22 at 02:56
-2

Let's say you wanted to count the number of HTML files:

ls | grep ".html" | wc -l

so if you're counting occurrences of "snp":

ls | grep "snp" | wc -l
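One caveat with the first command: the quoted pattern is a regular expression, so the unescaped dot matches any character. A small sketch of the difference:

```shell
# An unescaped '.' in the grep pattern matches any character, so
# ".html" also matches the 'xhtml' in notes.xhtml.
cd "$(mktemp -d)" || exit 1
touch page.html notes.xhtml
ls | grep -c ".html"     # 2: both names match
ls | grep -c '\.html$'   # 1: escaped and anchored, only page.html matches
```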