
I'm studying the ELF specification (http://www.skyfree.org/linux/references/ELF_Format.pdf), and one point that is not clear to me about the program loading process is how the stack is initialized, and how large the initial stack mapping is. Here's the test (on Ubuntu x86-64):

$ cat test.s
.text
  .global _start
_start:
  mov $0x3c,%eax
  mov $0,%edi
  syscall
$ as test.s -o test.o && ld test.o
$ gdb a.out -q
Reading symbols from a.out...(no debugging symbols found)...done.
(gdb) b _start
Breakpoint 1 at 0x400078
(gdb) run
Starting program: ~/a.out 

Breakpoint 1, 0x0000000000400078 in _start ()
(gdb) print $sp
$1 = (void *) 0x7fffffffdf00
(gdb) info proc map
process 20062
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
            0x400000           0x401000     0x1000        0x0 ~/a.out
      0x7ffff7ffa000     0x7ffff7ffd000     0x3000        0x0 [vvar]
      0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0 [vdso]
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0 [vsyscall]

The ELF specification has very little to say about how or why this stack page exists in the first place, but I can find references that say that the stack should be initialized with SP pointing to argc, with argv, envp and the auxiliary vector just above that, and I have confirmed this. But how much space is available below SP? On my system there are 0x1FF00 bytes mapped below SP, but presumably this is counting down from the top of the stack at 0x7ffffffff000, and there are 0x21000 bytes in the full mapping. What influences this number?
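To make that layout concrete, here is a minimal sketch (assuming only the convention just described: argc at the initial SP, then argv[], a NULL, envp[], another NULL, and the auxiliary vector) that reads argc at entry and reports it through the exit status:

.text
  .global _start
_start:
  mov (%rsp),%rdi        # argc sits exactly at the initial SP
  lea 8(%rsp),%rsi       # argv[] begins 8 bytes above it (unused here)
  mov $0x3c,%eax         # exit(argc), so `echo $?` shows the value read
  syscall

Running it as ./a.out foo bar should exit with status 3, consistent with SP pointing at argc.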

I am aware that the page just below the stack is a "guard page" that automatically becomes writable and "grows down the stack" if I write to it (presumably so that naive stack handling "just works"), but if I allocate a huge stack frame then I could overshoot the guard page and segfault, so I want to determine how much space is already properly allocated to me right at process start.

EDIT: Some more data makes me even more unsure what's going on. The test is the following:

.text
  .global _start
_start:
  subq $0x7fe000,%rsp
  movq $1,(%rsp)
  mov $0x3c,%eax
  mov $0,%edi
  syscall

I played with different values of the constant 0x7fe000 here to see what happens, and for this value it is nondeterministic whether I get a segfault or not. According to GDB, the subq instruction on its own will expand the size of the mmap, which is mysterious to me (how does Linux know what's in my register?), but this program will usually crash GDB on exit for some reason. It can't be ASLR causing the nondeterminism because I'm not using a GOT or any PLT section; the executable is always loaded at the same locations in virtual memory every time. So is this some randomness of the PID or physical memory bleeding through? All in all I'm very confused as to how much stack is actually legally available for random access, and how much is requested on changing RSP or on writing to areas "just out of range" of legal memory.
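A sketch of a workaround, assuming the growth mechanism works page by page (this is essentially the stack probing that compilers emit for large frames, e.g. GCC's -fstack-clash-protection): never move RSP more than one page past the last address touched, so the write that triggers growth always lands at the new RSP:

.text
  .global _start
_start:
  mov $0x400,%rcx        # pages to claim; 4 MiB here, a size picked arbitrarily
probe:
  sub $0x1000,%rsp       # move down one page at a time...
  movq $0,(%rsp)         # ...touching each page, so growth is never skipped
  dec %rcx
  jnz probe
  mov $0x3c,%eax
  mov $0,%edi
  syscall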

  • See https://gist.githubusercontent.com/slmingol/d11cc294e9f60464609fdcc32a38a7fc/raw/cba88807307bd05cb1fb28608629f408e12cf4f8/ELF_101_linux_executable_walk-through.png, and for more ideas on tools see https://unix.stackexchange.com/questions/418354/understanding-what-a-linux-binary-is-doing/418357#418357 – Rui F Ribeiro May 30 '19 at 12:56
  • @Rui Thanks for the references. I should clarify that I'm actually fairly far along in understanding the ELF standard. I'm trying to build a formal specification of the process load behavior given the file input, and I've got that mostly down-pat; tools like readelf and objdump have been very useful to this end. This is just one last bit of underspecified behavior that I hope to settle. (Put another way, my ultimate goal isn't "I wonder what's going on under the hood" so much as "I want an exact description of which bytes go where on the following input") – Mario Carneiro May 30 '19 at 13:06
  • I suppose then you are fairly familiar with assembly/M/C and know what an SP is and what the stack (area) is for. – Rui F Ribeiro May 30 '19 at 13:08
  • I have a formal specification of x86 semantics already, but I need some linux stuff for IO. I know what a stack is for and how it is normally used, but I am trying to determine how the linux kernel defines the initial machine state given an ELF file. – Mario Carneiro May 30 '19 at 13:09
  • I know this is for FreeBSD, but one of the finest books about it. It might give you further clues about the subject. https://www.amazon.com/Design-Implementation-FreeBSD-Operating-System/dp/0321968972 – Rui F Ribeiro May 30 '19 at 13:12
  • I should probably have stripped the file with ld -s in the test above but then b _start wouldn't have worked. But we can assume this is a minimal file with one program header and no section headers. The specification says that the text section should be loaded into memory, and it is; but I didn't say anything about a stack segment and yet I've got one. Is there a specification somewhere of how big it is and what's there? – Mario Carneiro May 30 '19 at 13:14
  • Also a possible source is https://elixir.bootlin.com/linux/v3.18/source/fs/binfmt_elf.c#L571 , but I have yet to find where the stack mmap happens and how big it is. – Mario Carneiro May 30 '19 at 13:16
  • All very interesting stuff, but I am afraid I won't have the time to keep up with you atm. At work. – Rui F Ribeiro May 30 '19 at 13:20
  • Given that stack frames are typically allocated with little-to-no knowledge of the frames before them, knowing the initial stack size would only be useful for the very first frame. After that, the risk of overshooting would be the size of the "Guard Page" matched against the size of the stack frame. – Philip Couling May 30 '19 at 15:15
  • @PhilipCouling When you say "frames before them" do you mean other thread stacks or stack frames in physical memory? Because in virtual memory there is nothing else in the entire 64-bit address space, so it should be completely open, and there is no way for the process to "see" when it will actually run out of memory until mmap/sbrk returns a failure. But maybe I'm misunderstanding...? I am only interested in the very first thing that a process loaded from disk sees in its environment (however the process may not be the first program on the system). – Mario Carneiro May 30 '19 at 15:28
  • "I am only interested in the very first thing that a process loaded from disk sees in its environment" In a round about way this is what I'm drawing attention to. Information about the initial stack size is useful for the very first thing to execute and practically useless after that. – Philip Couling May 30 '19 at 15:31
  • After the code in the file starts running properly, I expect it to take over management of the environment itself. So in this case I expect that if it doesn't trust the "guard page" functionality then it can treat the 0x1FF00 bytes as a fixed size allocation which it can expand explicitly via mmap calls when necessary. (Modeling the guard page functionality is tricky because it is IO triggered by a regular write without a syscall. I'm inclined to just say "that page is not writable so UB if you do" and rely on explicit mmap instead.) – Mario Carneiro May 30 '19 at 15:36
  • In particular, doing IO on a write means that the write could fail, which might corrupt the process state. So I guess the kernel must throw an exception in this case? You are right that most functions just assume that stack frame allocation succeeds and have no backup IO handling, which is probably why the guard page mechanism exists, so showing that these sort of functions are valid depends on the stack frame being smaller than a page (or the writes in the frame happening in a certain order). I certainly didn't expect that until I started looking into this. – Mario Carneiro May 30 '19 at 15:46
  • I re-read the question, and there's one thing I'd like to correct: the "guard page" is not just below the page pointed to by the stack pointer, but below the whole stack region (typically 8 MB). Only as much as needed for the stack is mapped to physical memory. More memory is mapped on demand, and this is done by the kernel, transparently to the user process. When the stack region is exhausted and the CPU tries to write to the guard region, the exception is propagated to the user process in form of a segmentation fault signal. – Johan Myréen May 31 '19 at 06:57

2 Answers


I don't believe this question really has to do with ELF. As far as I know, ELF defines a way to "flat pack" a program image into files and then re-assemble it ready for first execution. The definition of what the stack is and how it's implemented sits somewhere between CPU-specific and OS-specific, where the OS behaviour hasn't been elevated to POSIX. Though no doubt the ELF specification makes some demands about what it needs on the stack.

Minimum stack allocation

From your question:

I am aware that the page just below the stack is a "guard page" that automatically becomes writable and "grows down the stack" if I write to it (presumably so that naive stack handling "just works"), but if I allocate a huge stack frame then I could overshoot the guard page and segfault, so I want to determine how much space is already properly allocated to me right at process start.

I'm struggling to find an authoritative reference for this, but I have found enough non-authoritative references to suggest it is incorrect.

From what I've read, the guard page is used to catch access outside the maximum stack allocation, and not for "normal" stack growth. The actual memory allocation (mapping pages to memory addresses) is done on demand. That is: when unmapped addresses between stack-base and stack-base - max-stack-size + 1 are accessed, an exception may be triggered by the CPU, but the kernel will handle the exception by mapping a page of memory, not by cascading a segmentation fault.

So accessing the stack inside the maximum allocation shouldn't cause a segmentation fault, as you've discovered.

Maximum stack allocation

Investigation ought to follow the Linux documentation on thread creation and image loading (fork(2), clone(2), execve(2)). The documentation of execve mentions something interesting:

Limits on size of arguments and environment

...snip...

On kernel 2.6.23 and later, most architectures support a size limit derived from the soft RLIMIT_STACK resource limit (see getrlimit(2))

...snip...

This confirms that the limit requires the architecture to support it and also references where it's limited (getrlimit(2)).

RLIMIT_STACK

This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).

Since Linux 2.6.23, this limit also determines the amount of space used for the process's command-line arguments and environment variables; for details, see execve(2).
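A minimal raw-syscall sketch of reading that limit at process start (the constants are from the x86-64 ABI: SYS_getrlimit is 97, RLIMIT_STACK is 3, and struct rlimit is two 8-byte fields with rlim_cur first); it reports the soft limit in MiB through the exit status:

.text
  .global _start
_start:
  sub $16,%rsp           # room for struct rlimit { rlim_cur; rlim_max; }
  mov $97,%eax           # SYS_getrlimit
  mov $3,%edi            # RLIMIT_STACK
  mov %rsp,%rsi
  syscall
  mov (%rsp),%rdi        # rlim_cur: the soft limit the stack may grow to
  shr $20,%rdi           # scale to MiB so it fits in an exit status
  mov $0x3c,%eax         # exit(rlim_cur >> 20)
  syscall

With the common 8 MiB default, echo $? should print 8.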

Growing the stack by changing the RSP register

I don't know x86 assembler. But I'll draw your attention to the "Stack Fault Exception" which can be triggered by x86 CPUs when the SS register is changed. Please do correct me if I'm wrong, but I believe on x86-64 SS:SP has just become "RSP". So if I understand correctly, a Stack Fault Exception can be triggered by decrementing RSP (subq $0x7fe000,%rsp).

See page 222 here: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce.html

  • The old SP register has become just RSP, and SS has effectively vanished. The x86-64 in long mode, which is the "normal" mode in 64-bit Linux, does not really use segmentation anymore. Only "the FS and GS segments are retained in vestigial form for use as extra-base pointers to operating system structures". Wikipedia. Loading rsp with a non-canonical address can cause an exception, where non-canonical means an address that does not contain all ones or all zeroes in (typically) the upper 16 bits of the 64-bit virtual address. – Johan Myréen May 31 '19 at 11:22
  • @JohanMyréen Thanks that's pretty close to what I thought. The one detail I couldn't find was what this change has done to the Stack Fault Exception. Has the loss of SP removed the exception entirely or can it now be triggered by rsp? – Philip Couling May 31 '19 at 11:27
  • I guess they still call it Stack Fault Exception, since they mention the Violation Exception in the manual. One thing is for sure: you'll get an exception if you tread outside the allowed memory region. – Johan Myréen May 31 '19 at 12:12
  • I was unable to demonstrate any side effects of, for example, setting R10 <- RSP, RSP <- 0xbababa, RSP <- R10 where the bad value of RSP is never used before it is restored to a reasonable value. This probably isn't a very good test, but I have a hard time believing that this would ever cause a fault on its own without significant performance overhead in the hardware. – Mario Carneiro May 31 '19 at 16:34
  • @MarioCarneiro Yes, you are right, even the manual says so. It is not an error to just store the non-canonical address in rsp, you have to reference memory using the invalid address to trigger the exception. I don't know why they mention the non-canonical addresses separately, because they are illegal anyway. – Johan Myréen May 31 '19 at 18:30

Every process memory region (e.g code, static data, heap, stack, etc.) has boundaries, and a memory access outside of any region, or a write access to a read-only region generates a CPU exception. The kernel maintains these memory regions. An access outside of a region propagates up to user space in the form of a segmentation fault signal.

Not all exceptions are generated by accessing memory outside the regions. An in-region access can also generate an exception. For example, if the page is not mapped to physical memory, the page fault handler handles this transparently to the running process.

The process main stack region initially has only a small number of page frames mapped to it, but grows automatically when more data is pushed to it via the stack pointer. The exception handler checks that the access is still within the region reserved for the stack, and allocates a new page frame if it is. This happens automatically from the point of view of the user level code.
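A sketch that makes this visible (assuming, per my reading of the kernel's fault handler, that a faulting access is only treated as stack growth when it lands at or above roughly 64 kB below the current stack pointer):

.text
  .global _start
_start:
  # Works: move rsp first, then touch. The fault lands at rsp itself,
  # and the kernel grows the stack region to cover it.
  sub $0x100000,%rsp
  movq $0,(%rsp)
  # Faults: touching the same address *without* moving rsp first, i.e.
  #   movq $0,-0x100000(%rsp)
  # is too far below the stack pointer to be taken as stack growth,
  # and is delivered as a SIGSEGV instead.
  mov $0x3c,%eax
  mov $0,%edi
  syscall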

A guard page is placed right after the end of the stack region, to detect an overrun of the stack region. Recently (in 2017) some people realized that a single guard page is not sufficient, because a program can potentially be tricked into decrementing the stack pointer by a large amount, which may make the stack pointer point into some other region that permits writes. The "solution" to this problem was to replace the 4 kB guard page with a 1 MB guard region. See this LWN article.

It should be noted that this vulnerability is not entirely trivial to exploit; it requires, for example, that the user can control the amount of memory a program allocates via a call to alloca. Robust programs should check the parameter passed to alloca, especially if it is derived from user input.
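In the spirit of the freestanding examples above, a sketch of such a check (simplified: it compares the request against the whole soft limit and ignores the stack space already in use, which a real check would subtract):

.text
  .global _start
_start:
  sub $16,%rsp           # struct rlimit buffer on the stack
  mov $97,%eax           # SYS_getrlimit
  mov $3,%edi            # RLIMIT_STACK
  mov %rsp,%rsi
  syscall
  mov (%rsp),%rax        # rlim_cur: the soft stack limit
  add $16,%rsp

  mov $0x100000,%rdi     # requested frame size; pretend it came from input
  cmp %rax,%rdi
  jae refuse             # request as big as the whole limit: reject it
  sub %rdi,%rsp          # otherwise allocate; the write below lands at the
  movq $0,(%rsp)         # new rsp, which the kernel accepts as stack growth
refuse:
  mov $0x3c,%eax
  mov $0,%edi
  syscall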

Johan Myréen
  • Suppose I want to be sure I'm not mishandling the stack region. How can I know what memory is available for use? Should I just assume that there are 0 bytes available after RSP and always mmap any bytes I touch? – Mario Carneiro May 30 '19 at 18:20
  • The simple answer is that you can assume the stack can grow big enough if you don't put large objects on the stack. The stack is intended for small objects, like scalar local variables, return addresses, etc. You can store pointers to large objects on the stack, but the objects themselves should be put on the heap. If you run out of stack space anyway, you can't rely on mmap, since you don't know if the addresses are available or occupied by some other region, and it is, in fact, occupied by the guard region. – Johan Myréen May 30 '19 at 18:45
  • I get that as general advice, but my goal is a formal specification, and for that I need actual numbers on how large is large. (Put another way, I'm running arbitrary x86 code from a malicious user and I want to be sure I have sandboxed them entirely.) At least with mmap I know if it fails the system call will return an error and memory will not be allocated; with guard page auto-allocation what happens? Do I have to catch segfaults, because that's really icky. – Mario Carneiro May 30 '19 at 18:53
  • If you are running arbitrary malicious code, you need some other approach to sandboxing. You should be worried about system calls, access to (device) files etc. The process external resources are at risk. You don't need to worry about the stack, the attacker can access the process memory space via other means, or change the value of the stack pointer to anything. – Johan Myréen May 30 '19 at 19:18
  • It's a bit off topic for the present question, but I do actually have (almost) all system calls locked down. It can read files and use mmap but that's about it. It's a good point that file IO is an attack vector, but how else can a standalone ELF file acquire information? The attacker can indeed set the stack pointer to anything. Is that fact on its own sufficient to cause system instability? My investigations have suggested that RSP is a bit magic - I've been assuming that arbitrary modifications to regs is okay but memory access has to be on an existing page - hence the Q. – Mario Carneiro May 30 '19 at 19:27
  • Rather than relying on the guard page to tell you the end of the region, why not ask the kernel for RLIMIT_STACK ? Also see the man page for execve has it comments on the maximum allocation for args and environment. – Philip Couling May 30 '19 at 21:01
  • If the "guest" code runs in a process of its own, and you are able to lock down access to system calls, then you don't need to worry about stacks. You can let the guest code shoot itself in the foot all it wants. But if you are expecting to return to your code in the same process, then all is lost. The guest code can arrange for the execution of any code after returning. You will also need to be prepared for the code never returning or calling exit. To properly sandbox the guest code you need some sort of virtual machine. – Johan Myréen May 31 '19 at 05:39
  • Linux is a multiuser operating system. Nothing that a non-privileged process does is supposed to be able to cause system instability. You can corrupt the memory of your own process, but that's just shooting yourself in the foot. Think of a Linux machine in your university, with dozens of students writing and executing their own programs. These programs are "arbitrary x86 code" from the administrator's and all the other users' point of view. The programs can of course use up system resources, like spinning in an infinite loop, or reserve a lot of memory, but that's normal load. – Johan Myréen May 31 '19 at 11:00
  • There is nothing magical about the stack pointer register (rsp), other than that it is used implicitly by the call, ret, push, and pop instructions. – Johan Myréen May 31 '19 at 11:03