21

Unix systems usually just error out if they are confronted with a path that contains a symlink loop or just too many symlinks, because they have a limit to the number of symlinks they will traverse in one path lookup. But is there a way to actually decide if a given path resolves to something or contains a loop, even if it contains more links than a unix is willing to follow? Or is this a formally undecidable problem? And if it can be decided, can it be decided in a reasonable amount of time/memory (e.g. without having to visit all files on a filesystem)?

Some examples:

a/b/c/d
where a/b is a symlink to ../e
and e is a symlink to f
and f is a symlink to a/b

a/b/c/d
where a/b/c is a symlink to ../c

a/b/c/d
where a/b/c is a symlink to ../c/d

a/b/c/d
where a/b/c is a symlink to /a/b/e
where a/b/e is a symlink to /a/b/f
where a/b/f is a symlink to /a/b/g

Edit:

To clarify, I am not asking about finding loops in the file system, I am asking about a decision algorithm that decides of a given path whether it resolves to a definite file/directory or whether it does not resolve at all. For example in the following system, there is a loop, but the given path still resolves fine:

/ -- a -- b
where b is a symlink to /a

This directory tree clearly has a cycle, but the path a/b/b/b/b/b still resolves fine to /a.

JanKanis
  • 1,069
  • What does the command line tool readlink ... say about the above situations? – slm Nov 06 '13 at 23:42
  • 1
    Are you asking if we can tell just from the pathname if there are loops? Or can we do this in a real operating system, using the standard tools and checking what the various components of the pathname resolve to? – Mike Diehn Nov 07 '13 at 02:14
  • @MikeDiehn Obviously one can't tell from just a path if it resolves without doing filesystem operations. But also with an OS environment it is not straightforward to distinguish a path that merely requires traversing a lot of symlinks to resolve from one that does not resolve at all. – JanKanis Nov 07 '13 at 08:57

7 Answers7

13

I don't fully understand what you're asking. If I didn't know any better I think you were asking if there was a way to detect this while in the midst of dealing with a file. I don't believe this is possible.

The only method I can conceive of is doing a find where you specifically start looking through a particular branch in the directory tree.

Example

$ tree 
.
`-- a
    `-- b
        |-- c
        |   `-- d
        |       `-- e -> ../../../../a/b
        `-- e -> e

5 directories, 1 file

The find command will detect this loop but not really tell you a whole lot about it.

$ find -L . -mindepth 15
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links

I arbitrarily picked 15 levels so as to block any output being displayed by the find. You can however drop that switch (-mindepth) if you don't care about the directory tree being displayed. The find command still detects the loop and stops:

$ find -L . 
.
./a
./a/b
./a/b/c
./a/b/c/d
find: File system loop detected; `./a/b/c/d/e' is part of the same file system loop as `./a/b'.
find: `./a/b/e': Too many levels of symbolic links

Incidentally, if you want to override the default MAXSYMLINKS which is apparently 40 on Linux (newer 3.x versions of the kernel) you can see this U&L Q&A titled: How do you increase MAXSYMLINKS.

Using the symlinks command

There is a tool that FTP site maintainers could use called symlinks which will help expose issues with too long or dangling trees that were caused by symbolic links.

In certain cases the symlinks tool could be used to delete offending links too.

Example

$ symlinks -srv a
lengthy:  /home/saml/tst/99159/a/b/c/d/e -> ../../../../a/b
dangling: /home/saml/tst/99159/a/b/e -> e

The glibc library

The glibc library looks to offer some C functions around this, but I don't entirely know their role or how to actually use them. So I can only merely point them out to you.

The man page, man symlink shows the function definition for a function called symlink(). The description goes like this:

symlink() creates a symbolic link named newpath which contains the string oldpath.

One of the error states that this function returns:

ELOOP Too many symbolic links were encountered in resolving newpath.

I'll also direct you to the man page, man path_resolution which discusses how Unix determines paths to items on disk. Specifically this paragraph.

If  the component is found and is a symbolic link (symlink), we first 
resolve this symbolic link (with the current lookup directory as starting 
lookup directory).  Upon error, that error is returned.  If the result is 
not a directory, an ENOTDIR error is returned.  If the resolution of the 
symlink is successful and returns a directory, we set the current lookup
directory to that directory, and go to the next component.  Note that the 
resolution process here involves recursion.  In order  to  protect  the 
kernel against stack overflow, and also to protect against denial of 
service, there are limits on the maximum recursion depth, and on the maximum 
number of symbolic links followed.  An ELOOP error is returned  when  the
maximum is exceeded ("Too many levels of symbolic links").
slm
  • 369,824
  • If possible I would like a way to detect a symlink loop when given a single path, and resolving the symlinks manually in a program instead of letting the OS do it. But I am wondering if this is possible at all. The find solution looks interesting, but do you have any idea /how/ find detects symlink loops, and if the method it uses is complete (i.e. detects all possible loops and doesn't misidentify any non-looping paths)? – JanKanis Nov 07 '13 at 09:04
  • @Somejan - see my updates to the A. Let me know if that makes sense. – slm Nov 07 '13 at 13:14
7

OK, after some more thought I think I have a clear solution.

The critical insight is that if every link that is part of a path resolves to something, then the entire path resolves. Or the other way around, if a path does not resolve then there must be a specific symlink that requires traversing that does not resolve.

While thinking about this problem previously I was using an algorithm that traversed elements of a path starting from the root, and when it encountered a symlink it replaced that path element with the contents of the symlink and then continued traversing. Since this approach doesn't remember which symlink it is currently resolving it cannot detect when it is in a nonresolving loop.

If the algorithm keeps track of which symlink it is currently resolving (or which symlinks in case of recursive links), it can detect if it is attempting to resolve a link again recursively which it is still busy resolving.

Algorithm:

initialize `location` to the current working directory
initialize `link_contents` to the path we want to resolve
initialize `active_symlinks` to the empty set

def resolve_symlink(location, link_contents, active_symlinks) : loop forever: next_location = location / [first element of link_contents] see if next_location is a symlink. if so: if next_location in active_symlinks: abort, we have a loop location = resolve_symlink(location, readlink(next_location), active_symlinks ∪ {next_location}) else: location = next_location strip first element of link_contents if link_contents is empty: return location

edit:

I have a working implementation of this in python at https://bitbucket.org/JanKanis/python-inotify/src/853ed903e870cbfa283e6ce7a5e41aeffe16d4e7/inotify/pathresolver.py?at=pathwatcher.

edit 2: Bitbucket stopped hosting mercurial repos. Here is the full file:

# pathresolver.py - This module contains an iterator that iterates over all
# elements of a path including any symlinks.

Copyright 2012-2013 Jan Kanis

This library is free software; you can redistribute it and/or

modify it under the terms of the GNU Lesser General Public

License as published by the Free Software Foundation; either

version 2.1 of the License, or (at your option) any later version.

""" This module contains a few functions that help traverse paths with symlinks.

resolve_symlinks is a generator that yields pairs of paths, with one yield for each path element that is traversed to reach the final target of a path. This includes path elements and paths from symlinks.

resolve_path is a wrapper around resolve_symlinks that takes a single path as an argument and sets up the other arguments to resolve_symlinks.

get_symlinkmax is a function that determines the maximum number of symlinks the system will traverse in a path.

Note: this module forms part of python-inotify, but is considered an internal module. As such there are no stability guarantees regarding the api's of these functions. """

author = "Jan Kanis"

import sys, os, errno import tempfile, shutil from pathlib import PosixPath

_curdir = PosixPath('.') _root = PosixPath('/') _parentdir = PosixPath('..')

def resolve_path(path): '''Resolve the symlinks in path, yielding all filesystem locations that are traversed.

The yielded value is a tuple, of which the first element is a symlink-free
path, and the second element is a path relative to the first element that
has not yet been traversed. This second element may contain more symlinks.

The resolution implementation will follow an unbounded number of symlinks
but will still detect symlink loops if they prevent a path from resolving.

path can be given as a string or as a pathlib object. The yielded values
are pathlib.PosixPath objects.

'''
linkcache = {}
linkcounter = [0]
for p in resolve_symlink(_curdir, PosixPath(path), set(),
                              linkcache, linkcounter):
    yield p


def resolve_symlink(location, link_contents, active_links, known_links, linkcounter): '''Recursively resolve a symlink (or another path) to the file or directory it ultimately points to. This function handles an unlimited number of symlinks, and correctly detects symlink loops. All path parameters should be given as pathlib.PosixPath instances.

location: The directory in which the currently to be resolved link resides.

link_contents: The path stored in the symlink as returned by readlink().

active_links: a set of symlinks that is currently being resolved.

linkcache: a dictionary of link location -> resolved target paths. This
cache prevents this function from having to resolve the same symlink
twice. (Note that having to traverse the same symlink multiple times
does not necessarily mean that the path does not resolve to anything.)

linkcounter: A list containing a single number. (We use a list so that the
value can be passed by reference.) This number is updated to indicate the
total number of symlinks that has been traversed.

'''

while True:
    if link_contents.is_absolute():
        location = _root
        link_contents = link_contents.relative()

    yield location, link_contents
    if link_contents == _curdir:
        return

    if link_contents.parts[0] == '..':
        # We need to choose here if we allow traversing of a path above
        # the root or above the current directory. Going above CWD
        # should be allowed as long as we don't go above / by doing
        # so. The OS allows going to /.. (which just ends up at /
        # again), so for consistency with that we also allow it,
        # although a path that requires us to do this is probably a bug
        # somewhere.
        if all(p in ('/', '..') for p in location.parts):
            location = location['..']
        else:
            location = location.parent()
        # Strip the first part of link_contents off
        link_contents = link_contents.parts[1:]
        continue

    try:
        nextpath = location[link_contents.parts[0]]
        newlink = PosixPath(os.readlink(str(nextpath)))
    except OSError as e:
        if e.errno == errno.EINVAL:
            # The entry is not a symbolic link, assume it is a normal file
            # or directory. If it is a file and we need it to be a
            # directory that will be detected the next time through the
            # loop in the os.errno.ENOTDIR check. Checking it here would be
            # possible, but keeping the number of system calls at one per
            # loop makes reasoning about race conditions easier.
            location = nextpath
            link_contents = link_contents.parts[1:]
            continue
        if e.errno == errno.ENOENT:
            # The entry does not exist
            raise FileNotFoundError(nextpath)
        if e.errno == errno.ENOTDIR:
            # At this point we know the path is not valid, but we can not
            # determine with certainty what is wrong. If there were no
            # concurrent modifications we can safely make an is_dir()
            # call. If concurrent modifications did happen the is_dir()
            # check may succeed erroneously but we can't detect all
            # concurrent modifications anyway. If the check fails
            # (erroneously or not) that indicates a concurrent modification
            # so we fall through.
            if not location.is_dir():
                raise NotADirectoryError(location)
            # We should not be able to get here, unless there is a bug
            # or some relevant part of the file system was changed
            # concurrently while we were resolving this link.
            raise ConcurrentFilesystemModificationError(nextpath)
        if e.errno == errno.ELOOP:
            # Can only happen if a path component was changed concurrently
            raise ConcurrentFilesystemModificationError(nextpath)
        # For other OSErrors (such as in case of EACCESS) we propagate to
        # the caller. 
        raise

    # It is a symlink!
    if nextpath in active_links:
        raise SymlinkLoopError(nextpath)

    link_contents = link_contents.parts[1:]
    # We have not yet attempted traversing this symlink during the
    # current call or any of its parents.
    if nextpath in known_links:
        # known_links stores fully resolved paths, so we don't need to
        # traverse the cached path and can just continue our traversal from
        # there.
        location = known_links[nextpath]
        continue

    # An unknown link, resolve it recursively
    linkcounter[0] += 1
    # Don't yield the very last result of this recursive call immediately,
    # we still want to process that further. 
    lastloc, lastlink = None, None
    for loc, link in resolve_symlink(location, newlink,
                      active_links.union((nextpath,)), known_links, linkcounter):
        if lastloc:
            yield lastloc, lastlink[link_contents]
        lastloc, lastlink = loc, link
    # The last yielded location is the final resolution of the symlink. The
    # last yielded link_contents is always '.' so we can ignore that.
    known_links[nextpath] = loc
    location = loc
    continue


_symlinkmax = None def get_symlinkmax(): '''Returns the maximum number of symlinks that this system will traverse in resolving a file path.

''' global _symlinkmax if _symlinkmax is not None: return _symlinkmax

try: tempdir = tempfile.mkdtemp(prefix='inotify-symlinkmax-tmpdir-') open(tempdir+'/testfile', 'w').close()

target = 'testfile'
for i in range(1, 60):
  name = 'l'+str(i)
  os.symlink(target, tempdir+'/'+name)
  target = name

  try:
    open(tempdir+'/'+name).close()
  except IOError as e:
    if e.errno == errno.ELOOP:
      _symlinkmax = i - 1
      break
    raise

finally: if tempdir: shutil.rmtree(tempdir) return _symlinkmax

class InvalidPathError (OSError): def init(self, msg, path, errno=None, args): self.filename = path self.errno = errno if errno: self.strerror = os.strerror(errno) OSError.init(self, msg, args)

class SymlinkLoopError (InvalidPathError): def init(self, path, args): msg = "Path not valid: The symlink at '{}' forms a symlink loop".format(path) InvalidPathError.init(self, msg, path, errno=errno.ELOOP, args)

class ConcurrentFilesystemModificationError (InvalidPathError): def init(self, path, args): msg = "Path not valid: A concurrent change was detected while traversing '{}'".format(path) InvalidPathError.init(self, msg, path, errno=None, args)

To be Python 2 and 3 compatible and also inherit from the right exception

types in python 3, we need some boilerplate.

fnf_msg = "Path not valid: '{}' does not exist" nad_msg = "Path not valid: '{}' is not a directory"

if sys.version_info >= (3, 3): class FileNotFoundError (InvalidPathError, FileNotFoundError): def init(self, path, args): InvalidPathError.init(self, fnf_msg.format(path), path, errno=ENOENT, args)

class NotADirectoryError (InvalidPathError, NotADirectoryError):
    def __init__(self, path, *args):
        InvalidPathError.__init__(self, nad_msg.format(path), path,
                                      errno=errno.ENOTDIR, *args)

else: class FileNotFoundError (InvalidPathError): def init(self, path, args): InvalidPathError.init(self, fnf_msg.format(path), path, errno=errno.ENOENT, args)

class NotADirectoryError (InvalidPathError):
    def __init__(self, path, *args):
        InvalidPathError.__init__(self, nad_msg.format(path), path,
                                      errno=errno.ENOTDIR, *args)

JanKanis
  • 1,069
  • Too bad that your link is dead (please just post the code here) because your pseudocode leaves some room for interpretation… Does this work with a directory like a -> . and b -> a/a/a/a/a/c and a file c? – Caesar Sep 21 '20 at 04:18
  • 1
    @Caesar I've inlined the file content – JanKanis Sep 22 '20 at 14:55
6

On a quiescent system (i.e. when no changes are taking place), yes, there is an algorithm. There is a finite number of symbolic links, so they constitute a finite graph, and detecting cycles is a finitary process.

On a live system, there is no way to detect cycles, because symbolic links can change while the cycle detector is running. Reading each symbolic link is atomic, but following a symbolic link is not. If some symlinks keep changing while the kernel is doing the traversal, it could end up on an infinite path involving distinct links.

  • There are ways to mitigate those changes to bring it up to 98-99% accuracy. You could make it pay attention to the time stamps on files and I wouldn't suggesting actually following the links. Since it is recursive from the root it will find the actual directory later. – Back2Basics Nov 13 '13 at 06:06
  • 1
    @Back2Basics These numbers are completely meaningless. This is a kernel interface. If it doesn't work all the time, it doesn't work, period. – Gilles 'SO- stop being evil' Nov 13 '13 at 08:18
4

Python has a function called networkx.simple_cycles() that can be used for this. But yes it would need to read every file on the system.

>>> import networkx as nx
>>> G = nx.DiGraph()
>>> G.add_edge('A', 'B')
>>> G.add_edge('B', 'C')
>>> G.add_edge('C', 'D')
>>> G.add_edge('C', 'A')
>>> nx.simple_cycles(G)
[['A', 'B', 'C', 'A']]
  • I also thought about using some kind of graph algorithm, but I'm not sure if a directory tree with symlinks can be adequately represented in a simple graph. In a directory tree a-b-c where c is a symlink to .., there is a loop, but paths like a/b/c/b/c/b still resolve as they only follow the loop a finite number of times and don't keep looping. – JanKanis Nov 07 '13 at 09:09
  • @Somejan: a filesystem namespace is a graph, and a filename is a path chosen over that graph. – ninjalj Nov 07 '13 at 11:55
  • @ninjalj: Yes a filesystem is a graph, but I don't think a filename is simply a path over that graph. The filename can be seen as a set of instructions on how to traverse the graph. Even if the graph contains cycles that does not mean that a filename that follows that cycle necessarily does not resolve, see my example in my previous comment. – JanKanis Nov 08 '13 at 00:09
3

As near as I can tell from looking at current Linux kernel sources, all the kernel does is keep a count of how many links it's followed, and it errors out if that's bigger than some number. See line 1330 in namei.c for the comment, and the nested_symlink() function. The ELOOP macro (the error number returned from a read(2) system call for this situation) shows up in a number of places in that file, so it may not be as simple as counting links followed, but that's sure what it looks like.

There are a number of algorithms for finding "cycles" in linked lists (Floyd's cycle detection algorithm) or in directed graphs. It's not clear to me which one you'd have to do to detect an actual "loop" or "cycle" in a particular path. In any case, the algorithms could take a long time to run, so I'm guessing that just counting the number of symbolic links followed gets you 90% of the way to your goal.

  • For practical uses, just counting the number of links traversed is fine, especially since that is what the kernel does, so even if you encounter a correctly resolving path that has too many symlinks, you still can't use that path for anything practical (i.e. that does not involve manually resolving symlinks) – JanKanis Nov 08 '13 at 00:13
1

Make a list with all symlinks/files you visit so when you follow the first symlink you add it to the list. If a symlink points to a file in the list (so a file you already visited) you know there is a loop

1

BTW, to any programmers looking here, you should know that the Linux/BSD function fts_open() gives you an iterator for traversing all sub directory contents while also detecting such symlink recursions.

Most command line tools use this function to handle this case for them. Those that don't often have trouble with symlink recursions because doing this "by hand" is difficult (any anyone being aware of it should just use the above function instead).