1

I have a hierarchy of directories. Some directories do not contain files and they only contain other directories. Some contain files.

For example:

- movies
  - 2022
    - action
      - movie.mp4
      - another-movie.mp4
  - 2023
    - drama
      - movie2.mp4
  - 2024
    - thriller
      - movie3.mp4
    - movie4.mp4

I want a find command that would provide this result:

/movies/2022/action
/movies/2023/drama
/movies/2024
/movies/2024/thriller

I read this and this and this, but I could not figure out a find command.

Update: A directory can contain both files and directories. I updated the question to show this. For example, 2024 contains a movie and another directory. The result should contain both /movies/2024 and /movies/2024/thriller.

  • Please make sure to always mention your operating system. Different systems have different implementations of find with different capabilities, so we need to know what you are working with. – terdon Oct 15 '23 at 11:21
  • "Some directories do not contain files and they only contain other directories." – Strictly these must be empty. In Unix/Linux a directory is a file of the type directory. – Kamil Maciorowski Nov 15 '23 at 01:15

5 Answers

6

Find the regular files and print the directory enclosing each one, then de-duplicate the list so that each directory is listed only once. On a GNU system, that can be done reliably even when file paths contain newline characters:

find /path -type f -print0 |
    LC_ALL=C sed -z 's!/[^/]*$!!' |
    LC_ALL=C sort -zu |
    tr '\0' '\n'

With standard (POSIX) tools, assuming file paths don't contain newline characters:

find /path -type f -print |
    LC_ALL=C sed 's!/[^/]*$!!' |
    LC_ALL=C sort -u
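
A minimal sketch of the portable variant, run against a scratch copy of the tree from the question. The temporary directory and the `root`, `result` variable names are assumptions made for illustration. The sed expression uses `!` as its delimiter and deletes the last `/` together with everything after it, which leaves the enclosing directory:

```shell
# Build a scratch copy of the example tree (names taken from the question).
root=$(mktemp -d)
mkdir -p "$root/movies/2022/action" "$root/movies/2023/drama" "$root/movies/2024/thriller"
touch "$root/movies/2022/action/movie.mp4" \
      "$root/movies/2022/action/another-movie.mp4" \
      "$root/movies/2023/drama/movie2.mp4" \
      "$root/movies/2024/thriller/movie3.mp4" \
      "$root/movies/2024/movie4.mp4"

# List regular files, strip the last path component, de-duplicate.
result=$(cd "$root" && find movies -type f -print |
    LC_ALL=C sed 's!/[^/]*$!!' |
    LC_ALL=C sort -u)

printf '%s\n' "$result"
rm -rf "$root"
```

Because find is started on the relative path movies, the four directories are printed relative to the scratch root (movies/2022/action and so on) rather than with a leading /.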
Chris Davies
  • 1
    ... or -printf '%h\0' to skip the sed step (AFAIK a find with -print0 will also support -printf). – Stephen Kitt Oct 15 '23 at 08:54
  • May I ask why have you used -z? I mean as much as I understand by using -print0 you remove everything and connect the results together. I tried your code without -z and it worked the same. I'm confused about it. – Saeed Neamati Oct 15 '23 at 09:27
  • In general, I would appreciate it if you explain your Regex for sed too. Or your entire command in general :D. – Saeed Neamati Oct 15 '23 at 09:29
  • hey @SaeedNeamati, that's a rather broad question. Could you explain what you specifically couldn't figure out? – Marcus Müller Oct 15 '23 at 10:13
  • @MarcusMüller, basically I didn't get why using -print0 and removing new lines and then using -z flags for all commands. And I didn't understand the s! part of the sed too. – Saeed Neamati Oct 15 '23 at 11:49
  • as man find states, -print0 uses 0-bytes to separate the search results; man sed and man sort will tell you that -z means "use 0 bytes as the separator". So that was easily solved by reading the official documentation! – Marcus Müller Oct 15 '23 at 12:06
  • Regarding s!: that's just the usual s search command in sed, and the delimiter here between pattern, replacement and flags is set to !. You have probably seen s/pattern/replacement/flags, but the choice of / is arbitrary, you can use any character. Here, ! is used, because / is part of the pattern, and having to escape it all the time would be awkward. – Marcus Müller Oct 15 '23 at 12:08
  • @StephenKitt. No, -print0 is supported by most find implementations these days and will even be in the next version of the POSIX standard. -printf AFAIK is GNU-only. – Stéphane Chazelas Oct 15 '23 at 12:16
  • @StéphaneChazelas ah, OK, so “(if your find supports GNU-style -printf)”. – Stephen Kitt Oct 15 '23 at 12:40
  • @StéphaneChazelas why the cast to C locale? I don't particularly care about sort order as long as it's self-consistent, and I don't want to truncate multibyte characters that might have a byte corresponding to the character /, so what am I missing? – Chris Davies Oct 15 '23 at 14:58
  • There is no multibyte character that can contain a / character, POSIX guarantees that. / (0x2F on ASCII based systems) is the directory separator for the Unix kernel API, and the kernel API has no notion of what character encoding programs in user space may be using. – Stéphane Chazelas Oct 15 '23 at 20:42
  • As for LC_ALL=C, compare the output of printf '%s\n' $'\x80\x81' $'\xfe\xff' | sort -u between the C locale and locales using the UTF-8 encoding for instance. – Stéphane Chazelas Oct 15 '23 at 20:43
  • @StéphaneChazelas "sort: string comparison failed: Invalid or incomplete multibyte or wide character" - is that what you're expecting me to see? – Chris Davies Oct 15 '23 at 21:19
  • @ChrisDavies, that's one of the possible outcomes. GNU sort doesn't choke but would treat those strings that can't be decoded as text and that are of same length as sorting the same, and therefore only one of them would survive the -u. The fairy and vampire characters also happen to sort the same in locales that have those characters on GNU systems as their order is not defined. – Stéphane Chazelas Oct 15 '23 at 21:33
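
The locale point from the comments above can be checked with a short sketch. It feeds sort -u two distinct byte sequences that are not valid UTF-8; under LC_ALL=C they are compared byte by byte, so both lines survive. The `count` variable is an assumption for illustration, and octal escapes are used so that a plain POSIX printf suffices:

```shell
# Two different 2-byte lines (0x80 0x81 and 0xFE 0xFF), neither decodable as UTF-8.
# In the C locale every byte is its own character, so the lines sort as distinct.
count=$(printf '\200\201\n\376\377\n' | LC_ALL=C sort -u | wc -l | tr -d ' ')

echo "lines kept by sort -u in the C locale: $count"
```

Dropping LC_ALL=C in a UTF-8 locale may instead produce an error or a single line, depending on the sort implementation, as described in the comments.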
2

Simply,

find movies -type f -print | \
    xargs -r dirname | \
    sort --uniq

Read man find xargs dirname sort.

Here's an explanation:

  • find outputs the path of every regular file under movies, e.g. movies/2022/action/movie.mp4 movies/2022/action/another-movie.mp4 ... to STDOUT.
  • xargs packs as many filenames as will fit on one command line (see xargs --show-limits </dev/null) and repeatedly executes dirname until it runs out of filenames.
  • dirname chops off the rightmost / and the filename. E.g. movies/2022/action/movie.mp4 becomes movies/2022/action, and movies/2022/action/another-movie.mp4 also becomes movies/2022/action.
  • sort --uniq (short for --unique) eliminates duplicate directory names.

This structure (find, xargs, post-process) is useful for many tasks. Put some effort into understanding it.
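
The pipeline above relies on the default xargs parsing, which splits its input on blanks and treats quotes and backslashes specially, so a path containing a space gets broken apart. A hedged sketch of a NUL-delimited variant follows, assuming GNU find, xargs, dirname and sort (the -print0, -0, and -z options are not all available elsewhere). The scratch tree and the `root`, `result` names are illustrative assumptions:

```shell
# Scratch tree with spaces in both a directory name and a file name.
root=$(mktemp -d)
mkdir -p "$root/movies/2022/sci fi" "$root/movies/2024"
touch "$root/movies/2022/sci fi/a b.mp4" "$root/movies/2024/movie4.mp4"

# NUL-delimited end to end; only the final tr converts to newlines for display.
result=$(cd "$root" && find movies -type f -print0 |
    xargs -0 -r dirname -z |
    LC_ALL=C sort -zu |
    tr '\0' '\n')

printf '%s\n' "$result"
rm -rf "$root"
```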

waltinator
1

This should do it:

find /movies -type d -not -empty -links 2

To quote this answer:

The number of links is the number of hard links to the file. For a directory, the number of hard links is the number of (immediate) subdirectories plus the parent directory and itself.

So when the number of links is 2, there is only the parent directory (..) and itself (.), hence no subdirectories.


While this answer worked for the original question, it doesn't work for the later updated question anymore. I'll still leave it, as it might be helpful for others.
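
For the updated question, one possible alternative that does not depend on link counts is to test each directory for a regular file among its immediate entries. This is a sketch only, assuming a POSIX find and sh and directory names without newlines; the scratch tree and the `root`, `result`, `dir`, `entry` names are illustrative assumptions:

```shell
# Scratch copy of the example tree, including 2024 with both a file and a subdirectory.
root=$(mktemp -d)
mkdir -p "$root/movies/2022/action" "$root/movies/2023/drama" "$root/movies/2024/thriller"
touch "$root/movies/2022/action/movie.mp4" \
      "$root/movies/2022/action/another-movie.mp4" \
      "$root/movies/2023/drama/movie2.mp4" \
      "$root/movies/2024/thriller/movie3.mp4" \
      "$root/movies/2024/movie4.mp4"

# Directories are passed to sh as arguments ("$@"), never spliced into the code.
result=$(cd "$root" && find movies -type d -exec sh -c '
    for dir do
        # Visible names, dot files, and names that start with two dots.
        for entry in "$dir"/* "$dir"/.[!.]* "$dir"/..?*; do
            # Regular file that is not a symlink: report the directory once.
            if [ -f "$entry" ] && [ ! -L "$entry" ]; then
                printf "%s\n" "$dir"
                break
            fi
        done
    done' sh {} + | LC_ALL=C sort)

printf '%s\n' "$result"
rm -rf "$root"
```

Both movies/2024 and movies/2024/thriller are reported, while movies and movies/2022 are not, since they hold only directories.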

  • Can you please explain your command? Especially the -links 2 on how it makes it possible. Thanks – Saeed Neamati Oct 15 '23 at 08:55
  • @SaeedNeamati This answer is a good summary. The number of links is the number of hard links to the file. For a directory, the number of hard links is the number of (immediate) subdirectories plus the parent directory and itself. So when the number of links is 2, there is only the parent directory and itself, hence no subdirectories. – Gerald Schneider Oct 15 '23 at 09:09
  • Ah I see, using the fact that subdirectories will contain a .. hardlink. Interesting! – Marcus Müller Oct 15 '23 at 09:12
  • 1
    -links 2 doesn't work with btrfs file systems. Its directories always have 1 hard link. – raf Nov 14 '23 at 14:54
0

With zsh:

print -rC1 -- **/*(NDFe['()(($#)) $REPLY/*(ND.Y1)'])

This prints raw, on 1 Column, the Full (non-empty) directories that contain at least 1 file that can be determined to be a regular file (.), including Dot files (hidden ones).

Replace . with -. to also take into account symlinks eventually resolving to regular files, or ^/ for files of any type except directory or -^/ for that check to be done after symlink resolution. Remove the first D to not consider hidden directories and/or the second to not consider hidden files in those directories.

0

Disclaimer: I'm the current author of rawhide (rh) (see github.com/raforg/rawhide).

With rawhide (rh) you can do:

rh /movies 'd && !empty && "[ -n \"$(rh -ref -- %S)\" ]".sh'

/movies is a path to search.

The rest is the search criteria:

d means it's a directory.

!empty means it's not empty. This isn't needed but it makes it faster by reducing the number of shell processes created by the next bit.

"[ -n \"$(rh -ref -- %S)\" ]".sh runs the shell command [ -n "$(rh -ref -- %S)" ] which checks if there are any regular files in the candidate directory (with a nested use of rh).

rh -ref -- %S is short for rh -r -e f -- %S.

The -r is like find's -mindepth 1 -maxdepth 1 to only search one level down.

The -e f specifies the search criteria expression f which matches regular files.

The -- stops command line option parsing so as to prevent any malicious filenames from being interpreted as options to rh (e.g. -xreboot) (thanks Stéphane).

The %S is the name of the current candidate directory that the nested rh needs to search.

The [ -n ... ] tests that the nested rh command produced some output (i.e., that it found some regular files in the candidate directory).

raf
  • 1
    It seems it escapes newlines as \<newline> in the expansion of %S which is incorrect for sh – Stéphane Chazelas Nov 14 '23 at 17:15
  • 1
    Also, you shouldn't use backticks as inside backticks, there's an extra layer of backslash processing. Use $(...) instead. – Stéphane Chazelas Nov 14 '23 at 17:15
  • 1
    Using backslash for escaping is not advisable in general. See Escape a variable for use as content of another script for details. – Stéphane Chazelas Nov 14 '23 at 17:16
  • 1
    You're also missing a -- which is likely introducing ACE vulnerabilities (like if there's a file called -e"reboot".sh) – Stéphane Chazelas Nov 14 '23 at 17:26
  • 1
    I'd recommend passing the file paths as extra arguments to the shell so they can be referred to as "$1" or "$@" in the sh code (calling sh with sh, -c, code supplied by the user, sh, found-file), rather than embedding them quoted in the shell code as doing it properly in all contexts is tricky – Stéphane Chazelas Nov 14 '23 at 17:47
  • Thanks for all the great feedback. I'll follow your suggestion and implement %s/%S interpolation via extra arguments for sh -c. It'll be simpler and correct. And there's no risk of ACE without -- because there can only be one search criteria (-e f in this case). An additional expression (like -e'"reboot".sh') would be an error and rh wouldn't proceed. – raf Nov 15 '23 at 14:37
  • 1
    Still an ACE via a -xreboot file for instance. In any case, that -- is needed. – Stéphane Chazelas Nov 15 '23 at 14:41
  • I don't think it matters whether backticks or $(...) are used. It's only checking that there is some output. It wouldn't matter if some filenames containing backslashes had those backslashes processed (unless I'm missing something). But I never knew there was that difference between backticks and $(...), so many thanks for the new knowledge. I thought $(...) was just to make it easier to nest, but it makes sense that not processing backslashes contributes to that. And well-spotted with the -xreboot. Thanks. – raf Nov 15 '23 at 21:56
  • 1
    Try after mkdir '`reboot`'; touch "$_/f", where rh adds one backslash before the backticks while it should add 3 when that %S is expanded inside backticks. But in any case, backslashes and backticks are as valid characters as any in file names, and beware that in some locales, the charset has some characters that contain the same byte value as that of those characters (0x5c and 0x60). – Stéphane Chazelas Nov 16 '23 at 07:52
  • Thanks so much for taking the time to demonstrate and explain the difference. It's much appreciated. Your generosity of time is formidable. I tried this with "echo hi" rather than "reboot" and got the error sh: 1: echo hi: not found so it looks like the command can only be a word with no arguments, not that that makes it any less of a problem, but it's interesting/unexpected. I even tried with "reboot" (on a VM) and it refused with Call to Reboot failed: Interactive authentication required. Lucky! :-) This will be a good test after reimplementing %s/%S via extra arguments to sh -c. – raf Nov 16 '23 at 12:39
  • That would be because %S escapes space with a backslash. You can always replace `echo hi` with `$X` and pass X='echo hi' in the environment (%S will likely (hopefully) add a backslash in front of that $, but inside backticks, $ is also special, contrary to \space). – Stéphane Chazelas Nov 16 '23 at 13:47
  • One problem if you change it so the file name is passed as an extra (data) argument to sh (instead of embedding it quoted in the code argument) and have the user refer to it as $1 in the code is that $1 invariably needs to be quoted, so you'll end up with things like rh '"whatever \"$1\"".sh', which will soon get ugly and encourage bad practice by omitting the quotes. Maybe you could introduce {...} as strong quotes like in TCL (and like in TCL allow them to nest). rh '{ { blah "$1" | bligh "$2";} }.sh'. – Stéphane Chazelas Nov 16 '23 at 19:39
  • GNU find has those -printf '%s %p %m' which are great (much greater than the equivalent ones in GNU stat for that matter), but lacks the ability to run commands with the expansion of those arguments. What you could do here is have rh ' {printf "%20d: %s\n" "$RH_SIZE" "$RH_PATH"}.env.sh' and pass all those as environment variables. – Stéphane Chazelas Nov 16 '23 at 19:42
  • See also AT&T's tw for a similar project (which never took off unfortunately, and remains quite experimental) – Stéphane Chazelas Nov 16 '23 at 20:25
  • My intention is to replace %s with "$1" and replace %S with "$2". Will that be OK? It's important that %s and %S continue to not require additional quoting by the user, and I don't want the user to need to change to using \"$1\" and \"$2\". rh also has a -printf equivalent/superset (which also doesn't execute). tw! Wow. I think the original version of rh (1990) was inspired by tw presented at Baltimore 89 Usenix. The author knew of it at least. It's barely documented but its language is very powerful. rh's language is very simple. – raf Nov 17 '23 at 00:23
  • I suppose %s changed to "$1" could be a problem when used inside single/double quotes or backticks (where " in theory needs to be escaped though you can get away without with most modern sh implementations) or heredocs or when preceded with backslash, but I guess it's already a problem atm and at least users would always have the possibility to use $1 (or ${1%.*}...) instead of %s. Having a rh '{for file do...; done}.sh.+' or rh '{cmd "$@"}.sh.+5' like with find -exec {} + could be useful as well. – Stéphane Chazelas Nov 17 '23 at 09:16
  • It looks like it doesn't work properly if there's a file called 'd && !empty && "[ -n \"$(rh -ref -- %S)\" ]".sh' in the current working directory. rh -e 'd && !empty && "[ -n \"$(rh -ref -- %S)\" ]".sh' /movies seems OK. – Stéphane Chazelas Nov 17 '23 at 16:54
  • In any case, that's a great tool! Thanks for reviving and improving it, that's going to be a nice addition to my tool box. Expect me to advertise it here going forward at least wherever the issues mentioned above don't get in the way. – Stéphane Chazelas Nov 17 '23 at 19:11
  • Thanks. I'll document the ability and possible need to use $1 or $2 directly. The .+ makes more sense with -x and -X (final actions that could be grouped for speed) rather than with "cmd".sh (search criteria for each candidate file) but I'm trying to keep it simple. But I'll probably end up adding it. The desire to make -e optional (for streamlined commands) does mean ambiguity, but I think it's worth it, and the heuristic for determining if an argument is a path or an expression seems OK. Thanks for the kind words. I'm really glad you like it. I'm enjoying using it too. – raf Nov 17 '23 at 21:51
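
The recommendation in the comments above, passing found paths to sh as extra arguments rather than embedding them in the code string, can be illustrated with a small sketch. The file name and the `name`, `unsafe`, `safe` variables are hypothetical, and a harmless echo stands in for a malicious command:

```shell
# A hostile-looking file name; the single quotes keep it literal here.
name='$(echo INJECTED)'

# Unsafe: the name is spliced into the code that the inner sh parses,
# so the command substitution inside it is executed.
unsafe=$(sh -c "printf '%s' \"$name\"")

# Safe: the code is a fixed string and the name arrives as data in "$1".
safe=$(sh -c 'printf "%s" "$1"' sh "$name")

printf 'unsafe: %s\nsafe:   %s\n' "$unsafe" "$safe"
```

The unsafe form prints INJECTED because the inner shell runs the embedded command, while the safe form prints the name unchanged.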