
I'm trying to create a script that can be executed to search a list of files, compare the file size field to the specified file size, and then display the files that are less than the specified file size.

I understand that I must use 'ls -l' in order to get a detailed list of files. However, how can I go about searching the list and pulling the files?

JVAN

3 Answers


Your approach is clumsy (and, it's fair to say, wrong). There are dedicated tools for this sort of task; find is one of them.

For example, to recursively find all regular files under the current directory that are smaller than 1 MiB (1048576 bytes):

find . -type f -size -1048576c

Or, with shells such as zsh that provide size-based glob qualifiers, recursively:

print -rl -- **/*(.L-1048576)

Here, unlike with find above, hidden files are excluded. Add the D glob qualifier to include them.
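Since the question asks for a script that takes the size as a parameter, here is a minimal sketch built on the find command above (the script name and the max_bytes parameter are my own, not part of the answer):

```shell
#!/bin/sh
# smaller_than.sh -- recursively list regular files smaller than a given size.
# Usage: smaller_than.sh [max_bytes]   (defaults to 1 MiB)
max_bytes=${1:-1048576}
find . -type f -size "-${max_bytes}c"
```

The c suffix makes find compare against an exact byte count, avoiding the unit-rounding surprise with -size -1M mentioned in the comments.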

heemayl
  • Note that at least GNU find rounds the size up to full units before applying the less-than test if you give a test like -size -1M. Meaning that it will only match files that have a size of zero, unlike -size -1048576c which will match files smaller than a (binary) megabyte. – ilkkachu Mar 15 '17 at 11:16
  • @JVAN, for assignments, I suggest writing a solution with the given assumptions and requirements, and then writing an example input that blows up spectacularly showing those assumptions horribly invalid. And mentioning the fact that you're doing an assignment and forced to use improper tools if you happen to ask on SE... – ilkkachu Mar 15 '17 at 11:18

Things to bear in mind with parsing the output of ls -l:

  • The format depends on the locale. The format is only specified by POSIX in the POSIX/C locale, and even then it allows some variations (like the amount of spacing between the fields, the width of the first field...). For instance, you can't easily and portably detect file names that start with blank characters.
  • Many systems allow blanks in user and group names, making reliable parsing of the output almost impossible there. It's best to use ls -n (which prints numeric user and group IDs) instead of ls -l.
  • It's impossible to parse the output of ls reliably if the file names may contain newline characters (and newline is allowed in a file name on virtually all POSIX systems), unless you use the -q option (but then you only see a quoted representation from which you can't recover the original file name) or non-standard options found in some implementations. Though see also the trick below.
  • The size field is not provided for all types of files (and the meaning of the size field varies between systems for some types of files). You'd probably want to limit to regular files.
  • The above assumes a POSIX ls. Old versions have been known to have different output formats, or missing blanks between fields under some circumstances...
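To see the newline problem concretely, a small demo (the file names here are made up, and this assumes output is piped, so ls doesn't quote for a terminal):

```shell
# In an empty directory, create a file whose name contains a newline:
touch "$(printf 'bad\nname')" good
# "total" line + two lines for the one file + one line for "good" = 4 lines,
# so line-oriented parsing sees one file too many:
ls -l | wc -l
```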

So, with that in mind, and provided you can guarantee that file names don't contain newline characters and don't start with blank characters, to list the regular files whose size is strictly less than 1 MiB, you could do:

(
  export LC_ALL=C
  ls -n | awk '
    /^-/ && $5 < 1048576 {  # regular files whose size field is below 1 MiB
      # strip the first 8 fields (mode, links, uid, gid, size, and the three
      # date/time fields), leaving only the file name; sub() rather than gsub()
      # so a name made of several blank-separated words is not mangled
      sub(/^([^[:blank:]]+[[:blank:]]+){8}/, "")
      print
    }'
)

Add the -a option if you want to include hidden files. Add -L if, for symlinks, you want to consider the file they (eventually) resolve to.

As others have said, the correct solution here would be to use find.
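For completeness, a find equivalent of the non-recursive ls listing above. The ! -name . -prune idiom is the portable way to stop find from descending into subdirectories (GNU find also has -maxdepth 1); note that, unlike the glob-based variants, this includes hidden files by default:

```shell
# Regular files in the current directory only, smaller than 1 MiB:
find . ! -name . -prune -type f -size -1048576c
```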

Trick to avoid the newline and leading blank issue.

If, instead of ls -n, we use ls -nd ./*, we can tell where each file name begins (at ./) and on what line it ends (the line before the next ./), so you could do:

(
  export LC_ALL=C
  ls -nd ./* | awk '
    /\// {  # a line containing "/" starts a new file entry
            # (file names cannot contain "/", so continuation lines never match)
      selected = (/^-/ && $5 < 1048576)  # regular file under 1 MiB?
      sub(/.*\//, "./")  # drop everything up to the last "/", keeping "./name"
    }
    selected'  # print each selected entry, including its continuation lines
)

However, note that it won't work if there's a large number of files in the current directory, as ./* is expanded by the shell, and that could cause the limit on the cumulative size of the arguments (ARG_MAX) to be reached.

To include hidden files, -a won't help here; we'd need to tell the shell to expand them. POSIXly, it's a bit unwieldy:

ls -dn ./..?* ./.[!.]* ./*

(which is likely to cause warning messages about missing ./..?* or ./.[!.]* files).

  • @RakeshSharma, that would assume there's only one blank character between the date and the filename which POSIX doesn't guarantee. Note that perl's \s is more like [[:space:]]. See \h for blanks only. – Stéphane Chazelas Mar 15 '17 at 10:35
  • @RakeshSharma, the difference between \h and \s is not only \n. It's all the vertical spacing ones (vertical tab, form feed, carriage return at least). Those would not be used in the field separators by ls, so matching on them is unneeded. Only space and tab (in the C locale) are needed. In practice, ls implementations use spaces and no tabs, but POSIX doesn't guarantee that. – Stéphane Chazelas Mar 15 '17 at 10:56
  • @S.C.: Oh ok got that. Thanks. That's being really paranoid ;-) –  Mar 15 '17 at 10:58

Since you only want to parse ls's output, you can find the regular files that are smaller than an upper limit (MAX bytes). ASSUMING no whitespace/newlines in your filenames, you could do the following:

/bin/ls -l | awk -v MAX=150 '/^-/ && $5 <= MAX { print $NF }'
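A quick usage sketch (the file names and the 150-byte limit are illustrative; print $NF only works because the names contain no whitespace):

```shell
# In an empty directory:
printf 'hi' > tiny.txt                                  # 2 bytes -- within the limit
dd if=/dev/zero of=big.bin bs=512 count=8 2>/dev/null   # 4096 bytes -- over it
/bin/ls -l | awk -v MAX=150 '/^-/ && $5 <= MAX { print $NF }'
# prints: tiny.txt
```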