
This is a slightly exotic question, but there doesn't seem to be much information on the net about this. I just added an answer to a question about the zip format's external file attribute. As you can see from my answer, I conclude that only the second byte (of 4 bytes) is actually used for Unix. Apparently this contains enough information when unzipping to deduce whether the object is a file or a directory, and also has space for other permission and attribute information. My question is, how does this map to the usual Unix permissions? Do the usual Unix permissions (e.g. below) that ls gives fit into exactly one byte, and if so, can someone describe the layout or give a reference, please?

$ ls -la
total 36
drwxr-xr-x   3 faheem faheem  4096 Jun 10 01:11 .
drwxrwxrwt 136 root   root   28672 Jun 10 01:07 ..
-rw-r--r--   1 faheem faheem     0 Jun 10 01:07 a
drwxr-xr-x   2 faheem faheem  4096 Jun 10 01:07 b
lrwxrwxrwx   1 faheem faheem     1 Jun 10 01:11 c -> b

Let me make this more concrete by asking a specific question. Per the Trac patch quoted in my answer above, you can create a zip file with the snippet of Python below.

The 040755 << 16L value corresponds to the creation of an empty directory with the permissions drwxr-xr-x. (I tested it). I recognize 0755 corresponds to the rwxr-xr-x pattern, but what about the 04, and how does the whole value correspond to a byte? I also recognize << 16L corresponds to a bitwise left shift of 16 places, which would make it end up as the second from top byte.

def makezip1():
    import zipfile
    # Python 3 spelling; in Python 2 the value was written 040755 << 16L
    with zipfile.ZipFile("foo.zip", mode="w") as z:
        zfi = zipfile.ZipInfo("foo/empty/")
        zfi.external_attr = 0o40755 << 16  # permissions drwxr-xr-x
        z.writestr(zfi, "")
        print(z.namelist())

EDIT: On rereading this, I think that my conclusion that the Unix permissions only correspond to one byte may be incorrect, but I'll let the above stand for the present, since I'm not sure what the correct answer is.

EDIT2: I was indeed incorrect about the Unix values only corresponding to 1 byte. As @Random832 explained, it uses both of the top two bytes. Per @Random832's answer, we can construct the desired 040755 value from the tables he gives below. Namely:

__S_IFDIR + S_IRUSR + S_IWUSR + S_IXUSR + S_IRGRP + S_IXGRP + S_IROTH + S_IXOTH
0040000   + 0400    + 0200    + 0100    + 0040    + 0010    + 0004    + 0001
= 0040755

The addition here is in base 8.
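
The same sum can be checked with Python's standard stat module, which exposes these constants under the same names. This is only a small sketch: because each flag occupies distinct bits, bitwise OR gives the same result as the addition above.

```python
import stat

# Combine the directory type flag with the rwxr-xr-x permission bits.
# Each constant occupies distinct bits, so OR is equivalent to the sum above.
mode = (
    stat.S_IFDIR
    | stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR
    | stat.S_IRGRP | stat.S_IXGRP
    | stat.S_IROTH | stat.S_IXOTH
)

print(oct(mode))            # 0o40755
print(stat.filemode(mode))  # drwxr-xr-x
```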

Faheem Mitha
  • 1
    I know nothing about zip permissions, but I know that traditional unix permissions use 12 bits, which is more than one byte. Maybe zip doesn't bother with setxid and sticky, but that still leaves 9 (rwx × ugo). – Gilles 'SO- stop being evil' Jun 10 '11 at 00:03

1 Answer


0040000 is the traditional value of S_IFDIR, the file type flag representing a directory. The type uses the top 4 bits of the 16-bit st_mode value; 0100000 is the value for regular files.

The high 16 bits of the external file attributes seem to be used for OS-specific permissions. The Unix values are the same as on traditional Unix implementations; other OSes use other values. Information about the formats used by a variety of OSes can be found in the Info-ZIP source code (download it, or e.g. on Debian run apt-get source zip or apt-get source unzip). The relevant files are zipinfo.c in unzip and the platform-specific files in zip.
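
As a rough illustration of the "high 16 bits" point, the sketch below writes an archive into memory with Python's zipfile module and reads the Unix mode back by shifting external_attr right by 16. The entry name and mode are taken from the question; setting create_system to 3 (the ZIP "made by Unix" code) is an assumption about what an unzipper would want to see, not something the question's snippet does.

```python
import io
import stat
import zipfile

# Build a small archive in memory containing one empty directory entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as z:
    entry = zipfile.ZipInfo("foo/empty/")
    entry.create_system = 3              # assumption: mark the entry as made on Unix
    entry.external_attr = 0o40755 << 16  # Unix mode goes in the high 16 bits
    z.writestr(entry, "")

# Read the entry back and recover the Unix mode from the high 16 bits.
with zipfile.ZipFile(buf) as z:
    info = z.getinfo("foo/empty/")
    unix_mode = info.external_attr >> 16

print(oct(unix_mode))            # 0o40755
print(stat.filemode(unix_mode))  # drwxr-xr-x
```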

These are conventionally written in octal (base 8). C and Python 2 mark an octal literal with a leading 0; Python 3 requires the prefix 0o instead (for example 0o40755).

These values can all be found in <sys/stat.h> - link to 4.4BSD version. They are not in the POSIX standard (which defines test macros instead) but originate from AT&T Unix and BSD. (In GNU libc / Linux, the values themselves are defined as __S_IFDIR etc. in bits/stat.h, though the kernel header might be easier to read; the values are the same pretty much everywhere.)

#define S_IFIFO  0010000  /* named pipe (fifo) */
#define S_IFCHR  0020000  /* character special */
#define S_IFDIR  0040000  /* directory */
#define S_IFBLK  0060000  /* block special */
#define S_IFREG  0100000  /* regular */
#define S_IFLNK  0120000  /* symbolic link */
#define S_IFSOCK 0140000  /* socket */

And of course, the other 12 bits are for the permissions and setuid/setgid/sticky bits, the same as for chmod:

#define S_ISUID 0004000 /* set user id on execution */
#define S_ISGID 0002000 /* set group id on execution */
#define S_ISTXT 0001000 /* sticky bit */
#define S_IRWXU 0000700 /* RWX mask for owner */
#define S_IRUSR 0000400 /* R for owner */
#define S_IWUSR 0000200 /* W for owner */
#define S_IXUSR 0000100 /* X for owner */
#define S_IRWXG 0000070 /* RWX mask for group */
#define S_IRGRP 0000040 /* R for group */
#define S_IWGRP 0000020 /* W for group */
#define S_IXGRP 0000010 /* X for group */
#define S_IRWXO 0000007 /* RWX mask for other */
#define S_IROTH 0000004 /* R for other */
#define S_IWOTH 0000002 /* W for other */
#define S_IXOTH 0000001 /* X for other */
#define S_ISVTX 0001000 /* save swapped text even after use */
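
For readers who prefer to poke at these fields interactively, here is a minimal sketch using Python's stat module, which mirrors the C names. S_IFMT() extracts the type field and S_IMODE() the remaining 12 bits; the mode value is the one from the question.

```python
import stat

mode = 0o40755  # the value from the question

file_type = stat.S_IFMT(mode)   # top 4 bits: the S_IF* type field
perm_bits = stat.S_IMODE(mode)  # low 12 bits: setuid/setgid/sticky + rwx

print(oct(file_type))             # 0o40000
print(oct(perm_bits))             # 0o755
print(file_type == stat.S_IFDIR)  # True
print(stat.S_ISDIR(mode))         # True, the POSIX-style test macro
print(bool(mode & stat.S_ISVTX))  # False, sticky bit not set
```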

As a historical note, the reason 0100000 is for regular files instead of 0 is that in very early versions of unix, 0 was for 'small' files (these did not use indirect blocks in the filesystem) and the high bit of the mode flag was set for 'large' files which would use indirect blocks. The other two types using this bit were added in later unix-derived OSes, after the filesystem had changed.

So, to wrap up, the overall layout of the external attributes field for Unix is

TTTTsstrwxrwxrwx0000000000ADVSHR
^^^^____________________________ file type as explained above
    ^^^_________________________ setuid, setgid, sticky
       ^^^^^^^^^________________ permissions
                ^^^^^^^^________ This is the "lower-middle byte" your post mentions
                        ^^^^^^^^ DOS attribute bits
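
The layout above can be turned into a few shifts and masks. The helper below is a hypothetical sketch (split_external_attr and DOS_FLAGS are names invented for this example, not part of zipfile or Info-ZIP); it assumes the DOS bits follow the ADVSHR order shown in the diagram, with R as the lowest bit.

```python
import stat

# DOS attribute bits in the low byte, per the "ADVSHR" part of the diagram.
DOS_FLAGS = {0x20: "A", 0x10: "D", 0x08: "V", 0x04: "S", 0x02: "H", 0x01: "R"}

def split_external_attr(attr):
    """Split a 32-bit external attributes value into the fields shown above."""
    unix_mode = (attr >> 16) & 0xFFFF
    return {
        "file_type": stat.S_IFMT(unix_mode),  # TTTT
        "special": (unix_mode >> 9) & 0o7,    # setuid, setgid, sticky
        "permissions": unix_mode & 0o777,     # rwxrwxrwx
        "lower_middle": (attr >> 8) & 0xFF,   # the "lower-middle byte"
        "dos": "".join(name for bit, name in DOS_FLAGS.items() if attr & bit),
    }

# A directory with mode drwxr-xr-x and the DOS "directory" bit set.
fields = split_external_attr((0o40755 << 16) | 0x10)
print(oct(fields["file_type"]), oct(fields["permissions"]), fields["dos"])
```

Applied to a value such as 0x81A44000, the same helper yields a regular-file type (0100000), permissions 0644, and 0x40 in the lower-middle byte.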
Random832
  • 1
    @Random832: Wow, that's impressively complete and thorough. Can you also explain how the value 040755 << 16L is constructed? Specifically, what representation/base is it using (I think possibly octal), and most importantly, how does the language (the Python interpreter in this case) know what the representation is? Hmm, maybe the type is declared in the C code. Also, what file are you getting the "file type" values from? Adding some links/references would be helpful. – Faheem Mitha Jun 10 '11 at 07:52
  • @Random832: I see that zipinfo.c is in the source for unzip on Debian. Alternatively one can use the more convenient apt-get source unzip. You could append that to your answer, or use an upstream source. I usually quote Debian because I have faith they will be around for the long haul. :-) – Faheem Mitha Jun 10 '11 at 07:53
  • @Random832: Ok, I think I see how this works. You just add together all the values for the things that are set in base 8 as per your table, and you get the number 040755. That would be worth mentioning imo for people who don't know or have forgotten. Of course, that still leaves the question of how it knows it is base 8, but maybe the type is declared as base 8. – Faheem Mitha Jun 10 '11 at 07:57
  • 1
    It's base 8 because it begins with a 0. I'll clarify that in an edit – Random832 Jun 10 '11 at 14:29
  • @Random: Thanks for the clarification. I was not aware of the leading 0 convention. The stat.h file on Linux (I'm assuming the correct file is /usr/include/sys/stat.h) doesn't contain the definition of these constants in such a clear way as the file you linked to. Are they hidden away somewhere else? I see you used the term test macros, but I'm not sure what that means. – Faheem Mitha Jun 10 '11 at 16:29
  • "test macros" really isn't relevant to here, i was just talking about the S_ISDIR() macros that posix defines that don't directly have values. For linux/glibc, the values are in /usr/include/bits/stat.h, which is included (maybe indirectly) by sys/stat.h – Random832 Jun 10 '11 at 16:33
  • @Random832: I see. How do these S_ISDIR macros work and what is their function? IMO it is worth mentioning in the question that the values in linux are in /usr/include/bits/stat.h since most people here are probably using linux. – Faheem Mitha Jun 10 '11 at 16:37
  • if you have struct stat sb you use e.g. S_ISDIR(sb.st_mode) to see if the file is a directory. How they work isn't defined by the standard, but conventionally they just internally compare the file type field to 040000. – Random832 Jun 10 '11 at 16:43
  • @Random832: Thanks, that was all very informative. This stuff really ought to be documented somewhere. – Faheem Mitha Jun 10 '11 at 18:11
  • 1
    This was such a great writeup and it really helped me. Thanks! – jterrace Jul 23 '12 at 04:12
  • 3
    I created an account here just to upvote this question and answer. Thanks a lot for the detailed explanation. It helped me a lot. – Srikanth Reddy Lingala Jun 03 '19 at 10:56
  • The ZIP file format is not as well documented when it comes to examples of real-world values in ZIP files as I'd like. This helped. Thanks. I recently encountered a UNIX source ZIP file with 0x4000 for the lower word (0x81A44000). Microsoft Docs say FILE_ATTRIBUTE_ENCRYPTED (0x4000), but that doesn't really make any sense. Probably a lot of strange values floating around as the spec doesn't really clamp down on usage. I guess it makes interoperability more challenging. – CubicleSoft Jun 09 '20 at 06:11
  • 1
    @CubicleSoft The bit in that position is the "have Unix UID/GID info" flag according to the info-zip source comment in the other linked question. – Random832 Jun 11 '20 at 15:36
  • Is there any reference for what the lower bytes are for? – OrangeDog Sep 15 '21 at 16:21
  • @OrangeDog The other post linked in the question goes in some detail about bit fields sometimes used in the 'lower-middle' byte, the low byte is the traditional DOS attributes [readonly, hidden, system, volume, directory, archive] – Random832 Sep 16 '21 at 21:44