12

I'm writing a library for manipulating Unix path strings. That being the case, I need to understand a few obscure corners of the syntax that most people wouldn't worry about.

For example, as best as I can tell, it seems that foo/bar and foo//bar both point to the same place.

Also, ~ usually stands for the user's home directory, but what if it appears in the middle of a path? What happens then?

These and several dozen other obscure questions need answering if I'm going to write code which handles every possible case correctly. Does anybody know of a definitive reference which explains the exact syntax rules for this stuff?

(Unfortunately, searching for terms like "Unix path syntax" just turns up a million pages discussing the $PATH variable... Heck, I'm even struggling to find suitable tags for this question!)

2 Answers

14

There are three types of paths:

  • relative paths like foo, foo/bar, ../a, .. They don't start with / and are relative to the current directory of the process making a system call with that path.
  • absolute paths like /, /foo/bar or ///x. They start with one slash, or with three or more. They are not relative: they are looked up starting from the / root directory.
  • POSIX allows //foo to be treated specially, but doesn't specify how. Some systems use that for special cases like network files. It has to be exactly 2 slashes.

Other than at the start, sequences of slashes act like one.
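
For a library, the three cases above could be told apart by counting leading slashes. Here is a minimal sketch in Python; classify_path is a hypothetical helper name, not something from any standard library, and the handling of the empty string is my own assumption rather than part of the answer.

```python
def classify_path(path: str) -> str:
    """Classify a path string by its number of leading slashes."""
    if not path:
        # Assumption: report the empty string separately rather than
        # guessing which of the three categories it belongs to.
        return "empty"
    leading = len(path) - len(path.lstrip("/"))
    if leading == 0:
        return "relative"
    if leading == 2:
        # Exactly two leading slashes: POSIX leaves the meaning
        # up to the implementation.
        return "implementation-defined"
    # One slash, or three or more.
    return "absolute"


for example in ("foo/bar", "../a", "/", "/foo/bar", "///x", "//foo"):
    print(example, "->", classify_path(example))
```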

~ is special only to the shell: the shell expands it, and the system itself gives it no special meaning. How it is expanded depends on the shell. Shells also perform other expansions, such as globbing (*.txt) and variable expansion (/$foo/$bar). As far as the system is concerned, ~foo is just a relative path like _foo or foo.
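
One way to convince yourself of this is to bypass the shell entirely. The sketch below (Python, with a hypothetical function name) creates a file literally named ~foo inside a temporary directory and lists the directory afterwards. No expansion takes place, because the name goes straight to the system calls.

```python
import os
import tempfile


def tilde_is_literal():
    """Create a file named '~foo' without a shell and list the directory."""
    with tempfile.TemporaryDirectory() as tmp:
        literal = os.path.join(tmp, "~foo")
        # open() passes the string to the system unchanged.
        with open(literal, "w") as handle:
            handle.write("hello")
        return os.listdir(tmp)


print(tilde_is_literal())
```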

Things to bear in mind:

  • foo/ is not the same as foo. It's closer to foo/. than foo (especially if foo is a symlink) for most system calls on most systems (foo// is the same as foo/ though).
  • a/b/../c is not necessarily the same as a/c (for instance if a/b is a symlink). Best is not to treat .. specially.
  • it's generally safe to consider a/././././b the same as a/b though.
  • So in summary, if I don't care about shell path manipulation (which is vast and complicated), I only need to care about /, . and .. (?) – MathematicalOrchid Apr 19 '14 at 09:22
  • An example of //foo handling is in Cygwin, where it's used for UNC paths. That is, //server/share/dir/file.txt is a legal path that points off-system by default. Cygwin does fall back to looking at the local system if it cannot find server. – Warren Young Apr 19 '14 at 09:36
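
The point in the answer about a/b/../c and symlinks can be reproduced with a small experiment. This is only a sketch, assuming a POSIX system where symlinks can be created. The directory layout and the function name are made up for illustration. It compares what the system resolves with what a purely textual normalisation (os.path.normpath) produces.

```python
import os
import tempfile


def dotdot_demo():
    """Compare system resolution of a/b/../c with textual normalisation."""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.realpath(tmp)
        os.makedirs(os.path.join(root, "a"))
        os.makedirs(os.path.join(root, "elsewhere", "target"))
        # a/b is a symlink pointing at elsewhere/target.
        os.symlink(os.path.join(root, "elsewhere", "target"),
                   os.path.join(root, "a", "b"))
        # Put a file named c in both candidate directories, each
        # containing the name of the directory it lives in.
        for directory in ("a", "elsewhere"):
            with open(os.path.join(root, directory, "c"), "w") as handle:
                handle.write(directory)
        path = os.path.join(root, "a", "b", "..", "c")
        with open(path) as handle:
            via_system = handle.read()
        # normpath removes "b/.." textually, without looking at the disk.
        with open(os.path.normpath(path)) as handle:
            via_textual = handle.read()
        return via_system, via_textual


print(dotdot_demo())
```

The two results differ: the system follows the symlink before applying .., so it ends up in elsewhere, while the textual version ends up in a.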
3

For example, as best as I can tell, it seems that foo/bar and foo//bar both point to the same place.

Yes. This is common because software sometimes concatenates a path assuming the first part was not terminated with a forward slash, so one is thrown in to make sure (meaning there may end up being two or more). foo///bar and foo/////bar also point to the same place as foo/bar. A nice function for a path manipulation library would be one which reduces any number of sequential slashes to one (except at the beginning of a path, where it may be used in a URL-ish way, or, as Stephane points out, for any unspecified special purpose).
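
Such a function might look roughly like this in Python. It is a sketch, not a definitive implementation: collapse_slashes is a hypothetical name, and I have chosen to preserve a leading run of exactly two slashes and to keep a trailing slash, since both can carry meaning.

```python
import re


def collapse_slashes(path: str) -> str:
    """Reduce runs of slashes to one, keeping a leading '//' intact."""
    leading = len(path) - len(path.lstrip("/"))
    if leading == 2:
        # Exactly two leading slashes may be special, so keep them
        # and only collapse what follows.
        prefix, body = "//", path[2:]
    else:
        prefix, body = "", path
    return prefix + re.sub(r"/{2,}", "/", body)


for example in ("foo//bar", "foo/////bar/", "///x", "//server/share//x"):
    print(example, "->", collapse_slashes(example))
```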

Also, ~ usually stands for the user's home directory

That transformation is done by the shell via tilde expansion, which only works if the tilde is the first character in the path. Whether or not you need to deal with this depends on context. If the library is to be used with normal programs which receive, e.g., command line arguments containing a path, tilde expansion is already done when they see the path. The only situation I can see it being a concern is if you are processing paths directly from a text file.
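
If you do end up reading paths from a text file, Python's os.path.expanduser is one existing function that mimics this behaviour, though it only approximates what a real shell does. In the sketch below, resolve_user_path is a hypothetical wrapper name.

```python
import os


def resolve_user_path(raw: str) -> str:
    """Expand a leading ~ in a path read from text; leave others alone."""
    # expanduser only acts when the tilde is the first character.
    return os.path.expanduser(raw)


for example in ("~/notes.txt", "a/~/b", "backup~"):
    print(example, "->", resolve_user_path(example))
```

Only the first example changes. A tilde in the middle or at the end of the string is left exactly as it was.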

Beyond that, ~ is a legal character in a *nix path and should not be changed to anything else. As per this, the only characters which aren't legal in a Unix filename are / (because it is the path separator) and "null" (i.e., a zero byte), because it terminates the C strings used to pass paths to system calls.
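
In library terms, a check for a single filename (one path component) could be as small as the sketch below. The function name is hypothetical, and rejecting the empty string is my own addition. The check deliberately ignores length limits and the reserved names . and .. since those are a separate matter.

```python
def is_valid_component(name: str) -> bool:
    """Return True if name contains neither '/' nor a zero byte."""
    # Assumption: an empty name is rejected as well.
    return name != "" and "/" not in name and "\0" not in name


for example in ("~", "weird name*?", "a/b", "a\0b"):
    print(repr(example), "->", is_valid_component(example))
```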

goldilocks
  • +1 for the explanation of tilde expansion; I had no idea you could refer to other users with it! – MathematicalOrchid Apr 19 '14 at 09:17
  • As Stephane says, you cannot blindly collapse all repeated forward slashes. Multiple forward slashes at the start of the path have to be treated carefully. – Warren Young Apr 19 '14 at 09:39
  • @WarrenYoung Edited to make this clear. PS. Forward??! O_O – goldilocks Apr 19 '14 at 10:01
  • Better, though I wouldn't say this has anything to do with URLs. UNC goes back to the late 1980s, while URLs didn't appear until years later. – Warren Young Apr 19 '14 at 10:27
  • @WarrenYoung Fair enough, although it would seem that UNCs are specific to MS platforms, so // is technically not that either. Both URLs and the newer, according-to-S.C. freely ambiguous POSIX spec for // may have been derived from such, in which case "URL-ish" seems an apt label for the convention (even if UNCs are older, and even if the semblance is unintentional). I would never say that "they are URLs", only that // or \\ serves a "URL-ish" purpose. – goldilocks Apr 19 '14 at 11:04