6

I am trying to understand why wc and stat report different things for /proc/[pid]/cmdline.

wc says my shell's cmdline file is 6 bytes in size:

$ wc --bytes /proc/$$/cmdline
6 /proc/10425/cmdline

stat says the file is 0 bytes in size:

$ stat --format='%s' /proc/$$/cmdline
0

file agrees with stat:

$ file /proc/$$/cmdline
/proc/10425/cmdline: empty

cat gives this output:

$ cat -vE /proc/$$/cmdline
-bash^@

All of this is on Linux rather than on any other *nix OS.

Do the stat and wc programs have a different algorithm for computing the number of bytes in a file?

2 Answers

11

The files under /proc are not your usual files, but virtual things created on the fly by the kernel. For most (all?) of them, the system doesn't bother calculating a size beforehand; a program reading one just gets whatever data there is to get.

The difference between what your wc does and what stat and, for example, ls do is that wc here opens the file, reads it, and counts what it gets, while stat and ls use the stat() system call to ask the system for the file's metadata, including its size (as well as e.g. the owner and permissions). For virtual files, the two approaches don't give the same result.
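The two approaches can be put side by side in a short sketch. It assumes Linux procfs and GNU stat (for --format, as used in the question); the function names size_by_stat and size_by_read are made up for illustration.

```shell
#!/bin/sh
# Sketch: metadata size vs. bytes actually read.
# Assumes Linux /proc and GNU stat; function names are illustrative.

# Ask the kernel for the file's metadata (st_size) without reading it.
size_by_stat() {
    stat --format='%s' "$1"
}

# Read the file to the end and count the bytes received.
# Going through cat means wc only sees a pipe, so it has to count.
size_by_read() {
    cat "$1" | wc -c
}

f=/proc/$$/cmdline
printf 'stat(): %s bytes\n' "$(size_by_stat "$f")"
printf 'read(): %s bytes\n' "$(size_by_read "$f")"
```

On a regular on-disk file the two numbers should agree; for files under /proc they typically do not.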

If you run e.g. ls -l /proc/$$/, you'll see a lot of files of size 0, even though most of them can be read for data.

Device nodes like /dev/sda are similar, though in their case ls doesn't even bother to show the size, but shows the device numbers instead.

With file in particular, you can use file -s to ask it to just read the data and not care whether it's a special file.
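A minimal sketch of that comparison, assuming the file(1) utility is installed; the function names are hypothetical, and the exact description printed with -s depends on the process and the file version:

```shell
#!/bin/sh
# Sketch: let file(1) classify a procfs file with and without -s.
# Assumes file(1) is installed; function names are illustrative.

# Default behaviour: file relies on the stat() size for regular files.
describe_by_metadata() {
    file "$1"
}

# -s (--special-files): read the data and disregard the stat() size.
describe_by_reading() {
    file -s "$1"
}

if command -v file >/dev/null 2>&1; then
    describe_by_metadata "/proc/$$/cmdline"
    describe_by_reading "/proc/$$/cmdline"
else
    echo 'file(1) is not installed' >&2
fi
```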

ilkkachu
  • 138,973
  • 7
    Many wc implementations with -c resort to stat() as an optimisation to report the number of bytes in regular files. GNU wc has special cases to handle files in /proc and /sys since version 8.24. See https://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=2662702b9e8643f62c670bbf2fa94b1be1ccf9af – Stéphane Chazelas Feb 01 '23 at 20:06
  • 3
    For /dev/sda, as for any other file, stat reports what lstat() returns. On Linux lstat() returns 0 in st_size for devices, but you'll find some other systems return the size of the underlying storage device there. See also "How can I get the size of a file in a bash script?" – Stéphane Chazelas Feb 01 '23 at 20:10
  • 2
    They are "regular files" in the technical meaning of not being fifos, sockets, devices, symlinks, or whatever else. They're not normal files, though, as you say: materialized from kernel data structures only on read, with stat being kept cheaper by not doing that and just showing their existence and permissions. – Peter Cordes Feb 02 '23 at 06:11
  • 2
    @StéphaneChazelas Many wc implementations with -c resort to stat... Oh, dear. Thanks for that note. That was an egregiously wrong "optimization", IMO; but I do need to know that some poor sod went out on a limb and actually implemented it. Me, if I want to know the — possibly wrong — information that stat will give, I use stat(1). And if I have any suspicion that stat's output might be wrong or misleading, I use wc: opening the file as a file, and reading and counting every byte, is what wc is for. Sheesh. – Steve Summit Feb 02 '23 at 14:02
  • @SteveSummit, it's a valid optimisation as long as it's not used on file systems that report incorrect information there. Technically, returning 0 could be seen as being not completely wrong as those files don't contain anything until something starts to read from them (at which point the content is dynamically generated for that reading process). Many /proc files would return different contents depending on who/what requests their contents (and when). – Stéphane Chazelas Feb 02 '23 at 16:12
  • @StéphaneChazelas I understand, and this isn't the place for a long discussion on it, but I do a certain amount of low-level work, including on experimental filesystems that might be broken, so "as long as not used on file systems that report incorrect information" is just not an assumption I'm willing to make. And this is all the more relevant because, if I suspect that st_size has just given me a wrong answer, the very first tool I instinctively reach for is wc! But I guess I'm going to have to modify that instinct, because it's not safe any more. – Steve Summit Feb 02 '23 at 16:35
  • 1
    @SteveSummit, you can always do cat file | wc -c here. cat can't have this kind of optimisation. – Stéphane Chazelas Feb 02 '23 at 16:57
  • @StéphaneChazelas Ah, yes, I thought of that, too, but remember, you can not say cat file | program here on unix.stackexchange.com, because someone will immediately come along and scold you for being a newbie and lecture you that program is perfectly capable of opening the file itself! :-) – Steve Summit Feb 02 '23 at 17:17
  • @SteveSummit Oh, we just as often lecture people that the shell is perfectly capable of opening the file itself, hence <file program. Sure, some programs do try to fstat their stdin, but it's not exactly a common pattern :) – Charles Duffy Feb 02 '23 at 17:38
  • @CharlesDuffy, well <file wc -c doing a fstat() as discussed here is a common pattern. Actually, it should be doing a lseek(SEEK_END) instead or as well. See how { wc -c; wc -c; } < file reports the file size twice with GNU wc. I should probably raise that as a bug. – Stéphane Chazelas Feb 02 '23 at 18:47
  • 1
    Same problem with FreeBSD's wc. I find ast-open's wc does both fstat() and lseek(), presumably fstat() to determine the type and lseek() to seek to the end and get the offset there. In any case, they need to do a lseek(SEEK_CUR) first to know the current position within the file. – Stéphane Chazelas Feb 02 '23 at 18:55
  • 1
    @CharlesDuffy, actually ast-open's optimisation is also not right. It shouldn't seek to the end if the current position is past the end. See for instance { printf 1234 >&0; printf 12 > file; wc -c; printf 56 >&0; } 0<> file outputting -2 with that implementation and leaving file containing 1256 instead of 12^@^@56 – Stéphane Chazelas Feb 02 '23 at 19:04
  • "Premature optimization is the root of all evil". Or well, overzealous optimization here. That stuff is positively priceless. – ilkkachu Feb 02 '23 at 19:36
  • @StéphaneChazelas A related issue I've been thinking of reporting is that if you do (lseek 10000; gzip) < file, you get a false-positive warning from gzip saying "stdin: file size changed while zipping". (Admittedly, most people don't do that and so don't encounter the issue, because most people don't have or use a command-line lseek.) – Steve Summit Feb 02 '23 at 19:43
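The invocation styles discussed in these comments differ in who opens the file and in what wc can learn about it. A minimal sketch on an ordinary temporary file, where all three forms should agree; the function names are made up for illustration, and whether a given wc actually takes a stat()-based shortcut depends on the implementation and version:

```shell
#!/bin/sh
# Sketch: three ways to hand a file to wc -c.
# Function names are illustrative.

# wc opens the file itself and may consult stat() for regular files.
count_by_name() {
    wc -c "$1" | awk '{ print $1 }'
}

# The shell opens the file; wc can still fstat() its stdin.
count_by_redirect() {
    wc -c < "$1"
}

# wc only sees a pipe, so it has to read and count every byte.
count_by_pipe() {
    cat "$1" | wc -c
}

tmp=$(mktemp)
printf 'hello\n' > "$tmp"
count_by_name "$tmp"
count_by_redirect "$tmp"
count_by_pipe "$tmp"
rm -f "$tmp"
```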
2

Yes, wc and stat have different algorithms for computing the number of bytes in a file.

wc counts the number of bytes in the file, which in this case is 6.

stat displays information about a file, including its size in bytes. However, /proc/[pid]/cmdline is not a typical file on a file system, but rather a virtual file in the proc file system. This file contains the command line arguments used to start the process with the given process ID. It is stored in memory and not on the file system, which means its size can be different from the actual number of bytes that have been written to it. This is why stat reports the size as 0.

The file command is used to determine the type of a file based on its contents, and it correctly identifies /proc/[pid]/cmdline as an "empty" file.

Summarised, wc counts the actual number of bytes in the file, stat displays information about the file, and file determines the type of the file based on its contents.
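One way to see that the size reported by stat is not tied to what a read returns is to read a procfs file twice. The sketch below uses /proc/uptime, which is not part of the question and is chosen here only as a convenient assumption because its content changes over time; it also assumes GNU stat.

```shell
#!/bin/sh
# Sketch: a procfs file returns fresh content on every read,
# while its stat() size stays the same.
# Assumes Linux /proc/uptime and GNU stat.

f=/proc/uptime

first=$(cat "$f")
size_first=$(stat --format='%s' "$f")
sleep 1
second=$(cat "$f")
size_second=$(stat --format='%s' "$f")

printf 'first read:  %s (stat size %s)\n' "$first" "$size_first"
printf 'second read: %s (stat size %s)\n' "$second" "$size_second"
```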

Ben
  • 31
  • 2
    Actually, the cmdline file is not "stored in memory", it's not stored at all and that's precisely why the kernel can't easily tell you how long it is. Any time a read() call is done on such a file descriptor, the data is synthesized on the fly and returned to the calling application. If the kernel wanted to tell you how long the file is, it would have to perform a fake internal read and count the bytes in the result, which wouldn't be any faster than having userspace do the counting if it really needs to. – TooTea Feb 02 '23 at 12:21
  • 2
    One can still argue that the data is stored elsewhere in kernel memory; sure, it's not anywhere procfs-specific until the read() call is invoked (which certainly does make it true that it's unavailable for stat() calls to take the size of unless we did the same work needed to generate data to back a read() every time size was requested), but that doesn't mean it's not in memory, just that it's not cheaply accessible to the filesystem layer. – Charles Duffy Feb 02 '23 at 19:23