February 22, 2018

New bypass and protection techniques for ASLR on Linux

By Ilya Smith (@blackzert), Positive Technologies researcher

0. Abstract


The Linux kernel is used on systems of all kinds throughout the world: servers, user workstations, mobile platforms (Android), and smart devices. Over the life of Linux, many new protection mechanisms have been added both to the kernel itself and to user applications. These mechanisms include address space layout randomization (ASLR) and stack canaries, which complicate attempts to exploit vulnerabilities in applications.

This whitepaper analyzes ASLR implementation in the current version of the Linux kernel (4.15-rc1). We found problems that allow bypassing this protection partially or in full. Several fixes are proposed. We have also developed and discussed a special tool to demonstrate these issues. Although all issues are considered here in the context of the x86-64 architecture, they are also generally relevant for most Linux-supported architectures.

Many important application functions are implemented in user space. Therefore, when analyzing the ASLR implementation mechanism, we also analyzed part of the GNU Libc (glibc) library, during which we found serious problems with stack canary implementation. We were able to bypass stack canary protection and execute arbitrary code by using ldd.

This whitepaper describes several methods for bypassing ASLR in the context of application exploitation.


1. ASLR


Address space layout randomization is a technology designed to impede exploitation of certain vulnerability types. ASLR, found in most modern operating systems, works by randomizing addresses of a process so that an attacker is unable to know their location. For instance, these addresses are used to:

  • Delegate control to executable code.
  • Make a chain of return-oriented programming (ROP) gadgets (1).
  • Read (overwrite) important values in memory.


The technology was first implemented for Linux in 2005. In 2007, it was introduced in Microsoft Windows and macOS as well. For a detailed description of ASRL implementation in Linux, see (2).


Since the appearance of ASLR, attackers have invented various methods of bypassing it, including:

  • Address leak: certain vulnerabilities allow attackers to obtain the addresses required for an attack, which enables bypassing ASLR (3).
  • Relative addressing: some vulnerabilities allow attackers to obtain access to data relative to a particular address, thus bypassing ASLR (4).
  • Implementation weaknesses: some vulnerabilities allow attackers to guess addresses due to low entropy or faults in a particular ASLR implementation (5).
  • Side channels of hardware operation: certain properties of processor operation may allow bypassing ASLR (6).


Note that ASLR is implemented very differently on different operating systems, which continue to evolve in their own directions. The most recent changes in Linux ASLR involved Offset2lib (7), which was released in 2014. Implementation weaknesses allowed bypassing ASLR because all libraries were in close proximity to the binary ELF file image of the program. The solution was to place the ELF file image in a separate, randomly selected region.
In April 2016, the creators of Offset2lib also criticized the current implementation, pointing out the lack of entropy by ASLR-NG when selecting a region address (8). However, no patch has been published to date.
With that in mind, let's take a look at how ASLR currently works on Linux.

2. ASLR on Linux


First, let us have a look at Ubuntu 16.04.3 LTS (GNU/Linux 4.10.0-40-generic x86_64) with the latest updates installed. The workings are largely the same regardless of Linux distribution or kernel version after 3.18-rc7. If we run "less /proc/self/maps" from the Linux command line, we will see something resembling the following:



  • The base address of the binary application (/bin/less, in our case) is 5627a82bf000.
  • The heap start address is 5627aa2d4000, being the address of the end of the binary application plus a random value, which in our case equals 1de7000 (5627aa2d4000 5627a84ed000). The address is aligned to 2^12 due to the x86-64 architecture.
  • Address 7f3631293000 is selected as mmap_base. The address will serve as the upper boundary when a random address is selected for any memory allocation via the mmap system call.
  • Libraries ld-2.23.so, libtinfo.so.5.9, and libc-2.23.so are located consecutively.


If subtraction is applied to the neighboring memory regions, we will note the following: there is a substantial difference between the binary file, heap, stack, the lowest local-archive address, and the highest ld address. There is not a single free page between the loaded libraries (files).

If we repeat the procedure several times, the picture will remain practically the same: the difference between pages will vary, while libraries and files will remain identical in location relative to one another. This fact will be crucial for our analysis.

3. Memory allocation: inner workings


Now we will look at the mechanism used to allocate virtual memory of a process. The logic is stored in the do_mmap kernel function, which implements memory allocation both on the part of the user (mmap syscall) and on the part of the kernel (when executing execve). In the first stage, an available address is selected ( get_unmapped_area); in the second stage, pages are mapped to that address (mmap_region). We will start with the first stage.

The following options are possible when selecting an address:

  1. If the MAP_FIXED flag is set, the system will return the value of the addr argument as the address.
  2. If the addr argument value is not zero, this value is used as a hint and, in some cases, will be selected. 
  3. The largest address of an available region will be selected as the address, as long as it is suitable in length and lies within the allowed range of selectable addresses.
  4. The address is checked for security-related restrictions. (For details, see Section 7.3.)


If all is successful, the region of memory at the selected address will be allocated.

Details of address selection algorithm


The structure underlying the manager of process virtual memory is vm_area_struct (or vma, for short):




This structure describes the start of the virtual memory region, the region end, and access flags for pages within the region.

vma is organized in a doubly linked list (9) of region start addresses, in ascending order, and also an augmented red-black tree (10) of region start addresses, in ascending order as well. A good rationale for this solution is given by the kernel developers themselves (11).


Example of a vma doubly linked list in the ascending order of addresses



The red-black tree augment is the amount of available memory for a particular node. The amount of available memory for a node is defined as whichever is the highest of:

  • The difference between the start of the current vma and end of the preceding vma in an ascending-ordered doubly linked list 
  • Amount of available memory of the left-hand subtree 
  • Amount of available memory of the right-hand subtree




Example of an augmented vma red-black tree

This structure makes it possible to quickly search (in O(log n) time) for the vma that corresponds to a certain address or select an available range of a certain length.

During the address selection process, two important boundaries are identified as well: the minimum lower boundary and the maximum upper boundary. The lower boundary is determined by the architecture as the minimum allowable address or as the minimum value permitted by the system administrator. The upper boundary—mmap_base—is selected as stackrandom, where stack is the maximum stack address while random is a random value with entropy of 28 to 32 bits, depending on relevant kernel parameters. The Linux kernel cannot choose an address higher than mmap_base. In the address space of the address process, large mmap_base values either correspond to the stack and special system regions (vvar and vdso), or will never be used unless explicitly marked with the MMAP_FIXED flag.

So in this whole scheme, the following values remain unknown: the address of the start of the main thread stack, the base address for loading the application binary file, the start address of the application heap, and mmap_base, which is the starting address for memory allocation with mmap.

4. Problems with current implementation


The memory allocation algorithm just described has a number of weaknesses.

4.1 Close proximity of memory location


An application uses virtual RAM. Common uses of memory by an application include the heap, code, and data (.rodata, .bss) of loaded modules, thread stacks, and loaded files. Any mistake in processing the data from these pages may affect nearby data as well. As more pages with differing types of contents are located in close proximity, the attack area becomes larger and the probability of successful exploitation rises.

Examples of such mistakes include out-of-bounds (4), overflow (integer (12) or buffer (13)), and type confusion (14).

A specific instance of this problem is that the system remains vulnerable to the Offset2lib attack, as described in (7). In short: the base address for program loading is not allocated separately from libraries, yet the kernel selects it as mmap_base. If the application contains vulnerabilities, it becomes easier to exploit them, because library images are located in close proximity to the binary application image.

A good example demonstrating this problem is a PHP vulnerability in (15) that allows reading or altering neighboring memory regions.

Section 5 will provide several examples.

4.2 Fixed method of loading libraries


In Linux, dynamic libraries are loaded practically without calling the Linux kernel. The ld library (from GNU Libc) is in charge of this process. The only way the kernel participates is via the mmap function (we will not yet consider open/stat and other file operations): this is required for loading the code and library data into the process address space. An exception is the ld library itself, which is usually written in the executable ELF file as the interpreter for file loading. As for the interpreter, it is loaded by the kernel.

If ld from GNU Libc is used as the interpreter, libraries are loaded in a way resembling the following:

  1. The program ELF file is added to the file queue for processing.
  2. The first ELF file is taken out of the queue (FIFO).
  3. If the file has not been loaded yet into the process address space, it is loaded with the help of mmap.
  4. Each library needed for the file in question is added to the queue of files for processing.
  5. As long as the queue is not empty, repeat step 2.


This algorithm means that the order of loading is always determinate and can be repeated if all the required libraries (their binary files) are known. This allows recovering the addresses of all libraries if the address of any single library is known:

  1. Assume that the address of the libc library is known.
  2. Add the length of the libc library to the libc loading address—this is the loading address of the library that was loaded before libc.
  3. Continuing in the same manner, we obtain mmap_base values and addresses of the libraries that were loaded before libc.
  4. Subtract from the libc address the length of the library loaded after libc. This is the address of the library loaded after libc.
  5. Iterating in the same manner, we obtain the addresses of all libraries that were loaded at program start with the ld interpreter.


If a library is loaded while the program is running (for instance, via the dlopen function), its position in relation to other libraries may be unknown to attackers in some cases. For example, this may happen if there were mmap calls for which the size of allocated memory regions is unknown to attackers.

When it comes to exploiting vulnerabilities, knowledge of library addresses helps significantly: for instance, when searching for gadgets to build ROP chains. What's more, if any library contains a vulnerability that allows reading or writing values relative to the library address, such a vulnerability will be easily exploited, since the libraries are sequential.

Most Linux distributions contain compiled packages with the most widespread libraries (such as libc). This means that the length of libraries is known, giving a partial picture of the distribution of virtual address space of a process in such a case.

Theoretically, one could build a large database for this purpose. For Ubuntu, it would contain versions of libraries including ld, libc, libpthread, and libm; for each version of a library, multiple versions of libraries necessary for it (dependencies) may be analyzed. So by knowing the address of one library, one can know possible map versions describing the distribution of part of the process address space.

Examples of such databases are libcdb.com and libc.blukat.me, which are used to identify libc versions based on offsets for known functions.

All this means that a fixed method of loading libraries is an application security problem. The behavior of mmap, described in the previous section, compounds the problem. In Android, this problem is solved in version 7 and later (16) (17).


4.3 Fixed order of execution


Programs have an interesting property: there is a pair of certain points in the execution thread between which the program state is predictable. For example, once a client has connected to a network service, the service allocates some resources to the client. Part of these resources may be allocated from the application heap. In this case, the relative position of objects in the heap is usually predictable.

This property is useful for exploiting applications, by "building" the program state required by an attacker. Here we will call this state a fixed order of execution.

In some cases of this property, there is a certain fixed point in the thread of execution. At this point, from the start of execution, from launch to launch, the program state remains identical except for some variables. For example, before the main function is executed, the ld interpreter must load and initialize all the libraries and then initialize the program. As noted in Section 4.2, the relative position of libraries will always be the same. During execution of the main function, the differences will consist in the specific addresses used for program loading, libraries, stack, heap, and objects allocated in memory. These differences are due to the randomization described in Section 6.

As a result, an attacker can obtain information on the relative position of program data. This position is not affected by randomization of the process address space.

At this stage, the only possible source of entropy is competition between threads: if the program creates several threads, their competition in working with the data may introduce entropy to the location of objects. In this example, creating threads before executing the main function is possible with the help of the program global constructors or required libraries.

When the program starts using the heap and allocating memory from it (usually with the help of new/malloc), the mutual position of objects in the heap will remain constant for each launch up to a certain moment.

In some cases, the position of thread and heap stacks will also be predictable in relation to library addresses.

If needed, it is possible to obtain these offsets to use in exploitation. One way is to simply execute "strace -e mmap" for this application twice and compare the difference in addresses.

4.4 Holes


If an application allocates memory with mmap and then frees up part of that memory, this can cause holes—free memory regions that are surrounded by occupied regions. Problems may come up if this free memory (hole) is again allocated for a vulnerable object (a object during whose processing the application demonstrates a vulnerability). This brings us back to the problem of closely located objects in memory.

One illustrative example of such holes was found in the code for ELF file loading in the Linux kernel. When loading the ELF file, the kernel first reads the size of the file and tries to map it in full via do_mmap . Once the file has been fully loaded, the memory after the first segment is freed up. All following segments are loaded at a fixed address ( MAP_FIXED) that is set relative to the first segment. All this is needed in order to load the entire file at the selected address and separate segments by rights and offsets in accordance with their descriptions in the ELF file. This approach can cause memory holes if the holes were present in the ELF file between segments.
In the same situation, during loading of an ELF file, the ld interpreter (GNU Libc) does not call unmap but changes permissions for the free pages (holes) to PROT_NONE, which forbids the process from having any access to these pages. This approach is more secure.

To fix the problem of ELF file loading and related holes, the Linux kernel features a patch implementing the same logic as in ld from GNU Libc (see Section 7.1).

4.5 TLS and thread stack


Thread Local Storage (TLS) is a mechanism whereby each thread in a multithread process can allocate locations for data storage (18). The mechanism is implemented differently on different architectures and operating systems. In our case, this is the glibc implementation under x86-64. For x86, any difference will not be material for the mmap problem in question.

In the case of glibc, mmap is also used to create TLS. This means that TLS is selected in the way described already here. If TLS is close to a vulnerable object, it can be altered.

What is interesting about TLS? In the glibc implementation, TLS is pointed to by the segment register fs (for the x86-64 architecture). Its structure is described by the tcbhead_t type defined in glibc source files:



This type contains the field stack_guard, which contains a so-called canary—a random or pseudorandom number for protecting an application from stack overflows (19).
This protection works in the following way: when a function is entered, a canary obtained from tcbhead_t.stack_guard is placed on the stack. At the end of the function, the stack value is compared to the reference value in tcbhead_t.stack_guard. If the two values do not match, the application will return an error and terminate.

Canaries can be bypassed in several ways:

  • If an attacker does not need to overwrite this value (20).
  • If an attacker has managed to read or anticipate this value, making it possible to perform a successful attack (20).
  • If an attacker can overwrite this value with a known one, making it possible to cause a stack overflow (20).
  • An attacker can take control before the application terminates (21).
  • The listed bypasses highlight the importance of protecting TLS from reading or overwriting by an attacker.

Our research revealed that glibc has a problem in TLS implementation for threads created with the help of pthread_create. Say that it is required to select TLS for a new thread. After allocating memory for the stack, glibc initializes TLS in upper addresses of this memory. On the x86-64 architecture considered here, the stack grows downward, putting TLS at the top of the stack. Subtracting a certain constant value from TLS, we obtain the value used by a new thread for the stack register. The distance from TLS to the stack frame of the function that the argument passed to pthread_create is less than one page. Now a would-be attacker does not need to guess or peek at the canary value—the attacker can just overwrite the reference value alongside with the stack value, bypassing protection entirely. A similar problem was found in Intel ME (22).

4.6 malloc and mmap


When using malloc, sometimes glibc uses mmap for allocating new memory areas if the size of requested memory is larger than a certain value. In such cases, memory will be allocated with the help of mmap, so, after memory allocation, the address will be close to libraries or other data allocated with mmap. Attackers pay close attention to mistakes in handling of heap objects, such as heap overflow, use after free (23), and type confusion (14).

An interesting behavior of the glibc library was found when a program uses pthread_create. At the first call of malloc from the thread created with pthread_create, glibc will call mmap to create a new heap for this stack. So, in this thread, all the addresses called via malloc will be located close to the stack of this same thread. (For details, see Section 5.7.)

Some programs and libraries use mmap for mapping files to the address space of a process. The files may be used as, for example, cache or for fast saving (altering) of data on the drive.

Here is an abstract example: an application loads an MP3 file with the help of mmap. Let us call the load address mmap_mp3. Then the application reads, from the loaded data, the offset to the start of audio data. If the application contains a mistake in its routine for verifying the length of that value, an attacker may specially craft an MP3 file able to obtain access to the memory region located after mmap_mp3.

4.7 MAP_FIXED and loading of ET_DYN ELF files


The mmap manual says the following regarding the MAP_FIXED flag:

MAP_FIXED

Don't interpret addr as a hint: place the mapping at exactly that address. addr must be a multiple of the page size. If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail. Because requiring a fixed address for a mapping is less portable, the use of this option is discouraged.

If the requested region with the MAP_FIXED flag overlaps existing regions, successful mmap execution will overwrite existing regions.

Therefore, if a programmer makes a mistake with MAP_FIXED, existing memory regions may be redefined.

An interesting example of such a mistake has been found both in the Linux kernel and in glibc.
As described in (24), ELF files are subject to the requirement that, in the Phdr header, ELF file segments must be arranged in ascending order of vaddr addresses:

PT_LOAD

The array element specifies a loadable segment, described by p_filesz and p_memsz. The bytes from the file are mapped to the start of the memory segment. If the segment's memory size (p_memsz) is larger than the file size (p_filesz), the "extra" bytes are defined to hold the value 0 and to follow the segment's initialized area. The file size may not be larger than the memory size. Loadable segment entries in the program header table appear in ascending order, sorted on the p_vaddr member.

However, this requirement is not checked. The current code for ELF file loading is as follows:


All segments are processed according to the following algorithm:

  1. Calculate the size of the loaded ELF file: the address of the end of the last segment end, minus the start address of the first segment.
  2. With the help of mmap, allocate memory for the entire ELF file with that size, thus obtaining the base address for ELF file loading.
  3. In the case of glibc, change access rights. If loading from the kernel, release regions that create holes. Here the behavior of glibc and the Linux kernel differ, as described in Section 4.4.
  4. With the help of mmap and the MAP_FIXED flag, allocate memory for remaining segments by using the address obtained by isolating the first segment and adding the offset obtained from the ELF file header.


This enables an intruder to create an ELF file, one of whose segments can fully overwrite an existing memory region—such as the thread stack, heap, or library code.

An example of a vulnerable application is the ldd tool, which is used to check whether required libraries are present in the system. The tool uses the ld interpreter. Taking advantage of the problem with ELF file loading just discussed, we succeeded in executing arbitrary code with ldd:


The issue of MAP_FIXED has also been raised in the Linux community previously (25). However, no patch has been accepted.



For informational purposes, the source code of this example is located in the folder evil_elf.

4.8 Cache of allocated memory


glibc has many different caches, of which two are interesting in the context of ASLR: the cache for the stack of a newly created thread and the heap stack. The stack cache works as follows: on thread termination, stack memory will not be released but will be transferred to the corresponding cache. When creating a thread stack, glibc first checks the cache. If the cache contains a region of the required length, glibc uses that region. In this case, mmap will not be accessed, and the new thread will use the previously used region with the same addresses. If the attacker has successfully obtained the thread stack address, and can control creation and deletion of program threads, the intruder can use knowledge of the address for vulnerability exploitation. Further, if the application contains uninitialized variables, their values can also be subject to the attacker's control, which may lead to exploitation in some cases.

The heap cache works as follows: on thread termination, its heap moves to the corresponding cache. When a heap is created again for a new thread, the cache is checked first. If the cache has an available region, this region will be used. In this case, everything about the stack in the previous paragraph applies here as well.

5. Examples


There may be other cases where mmap is used. This means that this problem leads to a whole class of potentially vulnerable applications.
Here are some examples illustrating these problems.

5.1 Stacks of two threads


Using pthread_create, let us create two threads and calculate the difference between local variables of both threads. Source code:



Output after the first launch:



Output after the second launch:



As we can see, even as the addresses of variables change, the difference between them remains the same. In this example, the difference is marked by the word "Diff"; address values are given it. The example shows that vulnerable code from the stack of one thread may affect another thread or any neighboring memory region, whether or not ASLR is present.

5.2 Thread stack and large buffer allocated with malloc


Now, in the main thread of an application, let us allocate a large amount of memory with the help of malloc and launch a new thread. We then calculate the difference between the address obtained with malloc and the variable in the stack of the new thread. Here is the source code:



Output after the first launch:




Output after the second launch:



Again, the difference does not change. This example shows that when a large buffer allocated with malloc is processed, vulnerable code can affect the stack of a new thread despite ASLR protections.

5.3 mmap and thread stack


Here we will allocate memory with the help of mmap and launch a new thread with pthread_create. Then we calculate the difference between the address allocated with mmap and the address of the variable in the stack of the new thread. Here is the source code:


Output after the first launch:


Output after the second launch:







The difference remains unchanged. This example shows that when a buffer allocated with mmap is processed, vulnerable code can affect the stack of a new thread, regardless of ASLR.

5.4 mmap and TLS of the main thread


Let us allocate memory with the help of mmap to obtain the TLS address of the main thread. Then we calculate the difference between the two and make sure that the canary value on the main thread stack is the same as the TLS value. Here is the source code:



Output after the first launch:



Output after the second launch:





As seen here, the difference remains the same from launch to launch, while the canary values match. So if a corresponding vulnerability is present, it is possible to alter the canary and bypass protection. For example, this could be a buffer overflow vulnerability in the stack and a vulnerability allowing to overwrite memory at an offset from the region allocated with mmap. In this example, the offset equals 0x85c8700. The example shows a method of bypassing ASLR and the stack canary.


5.5 mmap and glibc


A similar example was discussed in Section 4.2, but here is one with a slightly different twist: let us allocate memory with mmap to obtain the difference between this address and the system and execv functions from the glibc library. Here is the source code:


Output after the first launch:



Output after the second launch:




As we can see, the difference between the allocated region and functions remains the same. This example shows a method of bypassing ASLR when vulnerable code interacts with a buffer allocated with mmap. The distances (in bytes) to library functions and data will remain constant, which can also be used for exploiting the application.

5.6 Buffer overflow in child thread stack


Let us create a new thread and overflow the stack buffer up to the TLS value. If there are no arguments in the command line, we will not overwrite the TLS canary—otherwise, we will. This argument logic is simply a way of showing the difference in the program's behavior.
Overwriting will be done with the 0x41 byte. Here is the source code:



In this example, protection was successful in detecting the stack overflow and terminating the application with an error before an attacker could seize control. Now we overwrite the reference canary value:



In the second example, we successfully overwrite the canary and execute the pwn_payload function with launch of the sh interpreter.



This example shows a method of bypassing stack overflow protections. To carry out exploitation successfully, an attacker needs to overwrite a sufficient number of bytes in order to overwrite the canary reference value. In this example, the attacker needs to overwrite at least 0x7b8+0x30, or 2024, bytes.

5.7 Thread stack and small buffer allocated with malloc


Let us now create a thread, allocate some memory with malloc, and calculate the difference from the local variable in this thread. Source code:



The first launch:




And the second launch:





In this case, the difference was not the same. Nor will it remain the same from launch to launch. Let us consider the reasons for this.


The first thing to note: the malloc-derived pointer address does not correspond to the process heap address.

glibc creates a new heap for each new thread created with the help of pthread_create. The pointer to this heap lies in TLS, so any thread allocates memory from its own heap, which increases performance, since there is no need to sync threads in case of concurrent malloc use.
But why then is the address "random"?

When allocating a new heap, glibc uses mmap; the size depends on the configuration. In this case, the heap size is 64 MB. The heap start address must be aligned to 64 MB. So the system first allocates 128 MB and then aligns a piece of 64 MB in this range while unaligned pieces are released and create a "hole" between the heap address and the closest region that was previously allocated with mmap.
Randomness is brought about by the kernel itself when selectingmmap_based: this address is not aligned to 64 MB, as were the mmap memory allocations before the call of the malloc in question.
Regardless of why address alignment is required, this leads to a very interesting effect: bruteforcing becomes possible.

The Linux kernel defines the process address space for x86-64 as "47 bits minus one guard page", which for simplicity we will round to 2^47 (omitting the one-page subtraction in our size calculations). 64 MB is 2^26, so the significant bits equal 47 – 26 = 21, giving us a total of 2^21 various heaps of secondary threads.

This substantially narrows the bruteforcing range.

Because the mmap address is selected in a known way, we can assume that the heap of the first thread created with pthread_create will be selected as 64 MB close to the upper address range. To be more precise, it will be close to all the loaded libraries, loaded files, and so on.
In some cases, it is possible to calculate the total amount of memory allocated before the call to the malloc in question. In our case, we loaded only glibc and ld and created a stack for the thread. So this value is small.

Section 6 will show the way in which the mmap_base address is selected. But for now, here is some additional information: mmap_base is selected with an entropy of 28 to 32 bits depending on kernel settings at compile time (28 bits by default). So some top boundary is set off by that same amount.
Thus, in many cases, the upper 7 bits of the address will equal 0x7f, while in rare cases, they will be 0x7e. That gives us another 7 bits of certainty. There are a total of 2^14 possible options for selecting a heap for the first thread. The more threads are created, the lesser that value is for the next heap selection.

Let us illustrate this behavior with the following C code:



Then let us launch the program a sufficient number of times with Python code for collecting address statistics:




This code launches the simple './t' program, which creates a new thread, a sufficient number of times.
The code allocates a buffer with the help of malloc and displays the buffer address. Once the program has completed this, the address is read and the program calculates how many times the address was encountered during operation of the program. The script collects a total of 16,385 different addresses, which equals 2^14+1. This is the number of attempts an attacker could make, in the worst case, to guess the heap address of the program in question.

There is another option—thread stack and large buffer allocated with malloc—but this is rather similar to the one described above. The only difference is that if the buffer size is too large, mmap is called again, so it is difficult to say where the allocated region will be placed: it may fill a hole or stand in front of the heap.

5.8 Stack cache and thread heaps


In this example, we will create a thread and allocate memory with malloc. Then we record the addresses of the thread stack and the pointer obtained with malloc. Let us then initialize a certain stack variable with the value 0xdeadbeef. Then we terminate the stack and create a new one, allocating memory with malloc. We compare the addresses and values of the variable, which this time is not initialized. Here is the source code:



Output:



As clearly seen, the addresses of local variables in the stack for consecutively created threads remain the same. Also the same are the addresses of variables allocated for them via malloc; some values of the first thread's local variables are still accessible to the second thread. An attacker can use this to exploit vulnerabilities of uninitialized variables (26). Although the cache speeds up the application, it also enables attackers to bypass ASLR and carry out exploitation.

6. Mapping process address space


When a new process is created, the kernel follows the algorithm below to determine its address space:

  1. After a call to execve, the virtual memory of the process is completely cleared.
  2. This creates the very first vma, which describes the process stack (stack_base). Initially, its address is selected as 2^47 – pagesize (where pagesize is the page size, on x86-64 this equals 4096), then it is adjusted by a certain random value random1 not exceeding 16 GB (this happens quite late, after the binary file base is selected, so some interesting effects are possible: if an application binary file occupies the entire memory, the stack will be next to the base address of the binary file).
  3. The kernel selects mmap_base, an address in relation to which the system will later load all the libraries in the process address space. The address is determined as stack_baserandom2 – 128 MB where random2 is a random value whose upper boundary depends on the kernel configuration and has a range of 1 TB to 16 TB.
  4. The kernel tries to load the program binary file. If the file is PIE (regardless of the base loading address), the base address is (2^47 – 1) * 2/3 + random3, where the random3 value is also determined by the kernel configuration and has an upper boundary of 1 TB to 16 TB.
  5. If the file needs dynamically loaded libraries, the kernel tries to load an interpreter to load all the required libraries and perform all initializations. Usually, the interpreter in ELF files is ld from glibc. The address is selected in relation to mmap_base.
  6. The kernel sets the new process heap as the end of the loaded ELF file plus a certain random4 value with an upper boundary of 32 MB.

After these stages, the process is launched. The start address is either the one from the ELF file of the interpreter (ld) or the one from the ELF file of the program if there is no interpreter (a statically linked ELF).

If ASLR is on and if there is a possibility of loading at an arbitrary address, the program file process will look as follows:



Each library, being loaded with the interpreter, will get control if a list of global constructors is defined in it. In this case, library functions for allocating resources (global constructors) required for this library will be called.

Thanks to the known sequence of library loading, it is possible to obtain a certain point in the program execution thread that allows "building" memory regions in terms of their relative locations to one another, regardless of whether ASLR is present. Increasing knowledge about the libraries, their constructors, and program behavior will push this point further away from the point of process creation.

To determine specific addresses, one still needs a vulnerability allowing to obtain the address of some mmap region or to read (write) memory relative to a particular mmap region:

  • If an attacker knows the address of some mmap region that was allocated from the start of the process to the constant execution point (Section 4.3), the attacker can successfully calculate mmap_base and the address of any loaded library or any other mmap region.
  • If it is possible to address relative to a certain mmap region from the point of constant execution, it is not necessary to know any additional address.


To prove the feasibility of mapping process memory in this way, we wrote Python code to simulate kernel behavior when searching for new regions. The method of loading ELF files and the order of library loading were recreated as well. To simulate a vulnerability that allows reading library addresses, the /proc: file system was used: the script reads the ld address (thus recovering mmap_base) and, having the libraries, it repeats the process memory map. Then it compares the result with the original. The script completely repeated the address space of all processes. Script code is available at: https://github.com/blackzert/aslur

6.1 Attack vectors


Let us review some vulnerabilities that have already become "classic" because of their prevalence.
1. Heap buffer overflows: There are various well-known vulnerabilities in application operation with the glibc heap, as well as methods for exploiting them. We can categorize these vulnerabilities under two types: they either allow modifying memory relative to the address of the vulnerable heap, or allow modifying memory addresses known to the attacker. In some cases, it is possible to read arbitrary data from objects on the heap. This fact gives rise to several vectors:

• In case of modifying (reading) memory relative to an object in the heap, we are primarily interested in the heap of the thread created with pthread_create, because the distance from it to any library (stack) of the thread will be less than the distance from the heap of the main thread.
• In case of reading (writing) memory relative to some address, it is first of all necessary to try to read the addresses from the heap itself, as they usually contain pointers to vtable or to libc.main_arena.

Knowledge of the libc.main_arena address yields the glibc address and, subsequently, the mmap_base address. To obtain the vtable address, it is required to know the address of either some library (and hence mmap_base as well) or the program loading address. An attacker who knows the program loading address can read library addresses from the .got.plt section containing addresses for required library functions.

2. Buffer overflow:

• At the stack, it leads to the canary scenario in question.
• At the heap, it leads to the scenario described in #1.
• At the mmap region, it leads to overwriting neighboring regions, depending on the context.

7. Fixes


In this article, we have reviewed several problems; now we can consider fixes for some of them. Let us start with the simplest solutions and then proceed to more complicated ones.

7.1 Hole in ld.so


As shown in Section 4.4, the ELF interpreter loader in the Linux kernel contains an error and allows releasing part of the interpreter library memory. A relevant fix was proposed to the community, but was neglected without action:

https://lkml.org/lkml/2017/7/14/290

7.2 Order of loading ELF file segments

As noted above, in the kernel and in the code of the glibc library, there is no checking of file ELF segments: the code simply trusts that they are in the correct order. Proof-of-concept code, as well as a fix, is enclosed: https://github.com/blackzert/aslur

The fix is quite simple: we go through the segments and ensure that the current one does not overlap the next one, and that the segments are sorted in the ascending order of vaddr.

7.3 Use of mmap_min_addr when searching for mmap allocation addresses

As soon as a fix was written for mmap, in order to return addresses with sufficient entropy, a problem arose: some mmap calls failed with an access permission error. This happened even as root or when requested by the kernel when executing execve.

In the address selection algorithm (described earlier in Section 3), one of the listed options is checking addresses for security restrictions. In the current implementation, this check verifies that the selected address is larger than mmap_min_addr. This is a system variable that can be changed by an administrator through sysctl. The system administrator can set any value, and the process cannot allocate a page at an address less than this value. The default value is 65536.

The problem was that when the address function for mmap was called on x86-64, the Linux kernel used 4096 as the value of the minimal lower boundary, which is less than the value of mmap_min_addr. The function cap_mmap_addr forbids this operation if the selected address falls between 4096 and mmap_min_addr.

cap_mmap_addr is called implicitly; this function is registered as a hook for security checking. This architectural solution raises questions: first, we choose the address without having the ability to test it with external criteria, and then we check its permissibility in accordance with the current system parameters. If the address does not pass the check, then even if the address is selected by the kernel, it can be "forbidden" and the entire operation will end with the EPERM error.

An attacker can use this fact to cause denial of service in the entire system: if the attacker can specify a very large value, no user process can start in the system. Moreover, if the attacker manages to store this value in the system parameters, then even rebooting will not help—all created processes will be terminated with the EPERM error.

Currently, the fix is to use the mmap_min_addr value as the lowest allowable address when making a request to the address search function. Such code is already used for all other architectures.
What will happen if the system administrator starts changing this value on a running machine? This question remains unanswered, since all new allocations after the change may end with the EPERM error; no program code expects such an error and does not know what to do with it. The mmap documentation states the following:

"EPERM The operation was prevented by a file seal; see fcntl (2)."

That is to say, the kernel cannot return EPERM to MAP_ANONYMOUS, although in fact that is not so.

7.4 mmap


The main mmap problem discussed here is the lack of entropy in address choice. Ideally, the logical fix would be to select memory randomly. To select it randomly, one must first build a list of all free regions of appropriate size and then, from that list, select a random region and an address from this region that meets the search criteria (the length of the requested region and the allowable lower and upper boundaries).

To implement this logic, the following approaches can be applied:

1. Keep the list of voids in a descending-order array. In this case, the choice of random element is made in a single operation, but maintaining this array requires many operations for releasing (allocating) the memory when the current virtual address space map of the process changes.
2. Keep the list of voids in a tree and a list, in order to find an outer boundary that satisfies the length requirement, and select a random element from the array. If the element does not fit the minimum/maximum address restrictions, select the next one, and so on until one is found (or none remain). This approach involves complex list and tree structures similar to those already existing for vma with regard to change of address space.
3. Use the existing structure of the augmented red-black vma tree to bypass the list of allowed gap voids and select a random address. In the worst case, each choice will have to bypass all the peaks, but rebuilding the tree does not incur any additional slowdown of performance.

Our choice went to the last approach. We can use the existing vma organizational structure without adding redundancy and select an address using the following algorithm:
1. Use the existing algorithm to find a possible gap void with the largest valid address. Also, record the structure of vma following it. If there is no such structure, return ENOMEM.
2. Record the found gap as the result and vma as the maximum upper boundary.
3. Take the first vma structure from the doubly linked list. It will be a leaf in the red-black tree, because it has the smallest address.
4. Make a left-hand traversal of the tree from the selected vma, checking the permissibility of the free region between the vma in question and its predecessor. If the free region is allowed by the restrictions, obtain another bit of entropy. If the entropy bit is 1, redefine the current value of the gap void.
5. Return a random address from the selected gap void region.
One way to optimize the fourth step of the algorithm is not to enter subtrees whose gap extension size is smaller than the required length.

This algorithm selects an address with sufficient entropy, although it is slower than the current implementation.

As far as obvious drawbacks, it is necessary to bypass all vma structures that have a sufficient gap void length. However, this is offset by the absence of any performance slowdown when changing address space.

8. Testing fixes for ASLR


After applying the described fixes to the kernel, the process /bin/less looks as follows:



As seen in the example:

  1. All the libraries were allocated in random locations and are at a random distance from one another.
  2. The file /usr/lib/locale/locale-archive mapped with mmap is also located at random addresses.
  3. The hole in /lib/x86_64-linux-gnu/ld-2.26.so is not filled with any mmap mapping.


This patch was tested on Ubuntu 17.04 with Google Chrome and Mozilla Firefox running. No problems were found.

9. Conclusion


This research has demonstrated many interesting features of the kernel and glibc in terms of handling of program code. The problem of close memory location was articulated and considered in detail. The following problems were found:

  • The algorithm for choosing the mmap address does not contain entropy.
  • Loading of ELF files in the kernel and interpreter contains a segment processing error.
  • When searching for an address with do_mmap, the kernel does not take into account mmap_min_addr on the x86-64 architecture.
  • Loading an ELF file in the kernel allows creating memory holes in the program ELF file and the ELF file interpreter.
  • Using mmap to allocate memory for libraries, the ELF file interpreter from GNU Libc ld loads libraries at mmap_base-dependent addresses. In addition, libraries are loaded in a fixed order.
  • Using mmap to allocate a stack, heap, and TLS thread, the GNU Libc library places them at mmap_base-dependent addresses also.
  • The GNU Libc library places TLS threads created with pthread_create at the top of the stack, which allows bypassing buffer overflow protections on the stack by overwriting the canary.
  • The GNU Libc library caches previously allocated heaps (stacks) of threads, which, in some cases, allows successful exploitation of a vulnerable application.
  • The GNU Libc library creates a heap for new threads aligned to 2^26, which substantially narrows the bruteforcing range.


These problems help an attacker to bypass ASLR or protections against stack buffer overflows. For some of these problems, fixes (in the form of kernel patches) have been proposed here.
Proof-of-concept code has been presented for all problems mentioned. An algorithm ensuring sufficient entropy for address selection is proposed. The same approach can be used to analyze ASLR on other operating systems such as Windows and macOS.
A number of peculiarities of the GNU Libc implementation were reviewed; in some cases, these peculiarities inadvertently facilitate exploitation of vulnerable applications.

References

1. Erik Buchanan, Ryan Roemer, Stefan Savage, Hovav Shacham. Return-Oriented Programming: Exploits Without Code Injection. [Online] Aug 2008. https://www.blackhat.com/presentations/bh-usa-08/Shacham/BH_US_08_Shacham_Return_Oriented_Programming.pdf.
2. xorl. [Online] https://xorl.wordpress.com/2011/01/16/linux-kernel-aslr-implementation/.
3. Reed Hastings, Bob Joyce. Purify: Fast Detection of Memory Leaks and Access Errors. [Online] December 1992 https://web.stanford.edu/class/cs343/resources/purify.pdf.
4. Improper Restriction of Operations within the Bounds of a Memory Buffer. [Online] https://cwe.mitre.org/data/definitions/119.html.
5. AMD Bulldozer Linux ASLR weakness: Reducing entropy by 87.5%. [Online] http://hmarco.org/bugs/AMD-Bulldozer-linux-ASLR-weakness-reducing-mmaped-files-by-eight.html.
6. Dmitry Evtyushkin, Dmitry Ponomarev, Nael Abu-Ghazaleh. Jump Over ASLR: Attacking Branch Predictors to Bypass ASLR. [Online] http://www.cs.ucr.edu/~nael/pubs/micro16.pdf.
7. Hector Marco-Gisbert, Ismael Ripoll. Offset2lib: bypassing full ASLR on 64bit Linux. [Online] https://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html.
8. Hector Marco-Gisbert, Ismael Ripoll-Ripoll. ASLR-NG: ASLR Next Generation. [Online] 2016 https://cybersecurity.upv.es/solutions/aslr-ng/ASLRNG-BH-white-paper.pdf.
9. Doubly linked list. [Online] https://en.wikipedia.org/wiki/Doubly_linked_list.
10. Bayer, Rudolf. Symmetric binary B-Trees: Data structure and maintenance algorithms. [Online] January 24, 1972 https://link.springer.com/article/10.1007%2FBF00289509.
11. Lespinasse, Michel. mm: use augmented rbtrees for finding unmapped areas. [Online] November 5, 2012 https://lkml.org/lkml/2012/11/5/673.
12. Integer Overflow or Wraparound. [Online] https://cwe.mitre.org/data/definitions/190.html.
13. Classic Buffer Overflow. [Online] https://cwe.mitre.org/data/definitions/120.html.
14. Incorrect Type Conversion or Cast. [Online] https://cwe.mitre.org/data/definitions/704.html.
15. CVE-2014-9427. [Online] https://www.cvedetails.com/cve/CVE-2014-9427/.
16. Security Enhancements in Android 7.0. [Online] https://source.android.com/security/enhancements/enhancements70.
17. Implement Library Load Order Randomization. [Online] https://android-review.googlesource.com/c/platform/bionic/+/178130/2.
18. Thread-Local Storage. [Online] http://gcc.gnu.org/onlinedocs/gcc-3.3/gcc/Thread-Local.html.
19. One, Aleph. Smashing The Stack For Fun And Profit. [Online] http://www.phrack.org/issues/49/14.html#article.
20. Fritsch, Hagen. Buffer overflows on linux-x86-64. [Online] April 16, 2009 http://www.blackhat.com/presentations/bh-europe-09/Fritsch/Blackhat-Europe-2009-Fritsch-Buffer-Overflows-Linux-whitepaper.pdf.
21. Litchfield, David. Defeating the Stack Based Buffer Overflow Prevention. [online] September 8, 2003 https://crypto.stanford.edu/cs155old/cs155-spring05/litch.pdf.
22. Maxim Goryachy, Mark Ermolov. HOW TO HACK A TURNED-OFF COMPUTER, OR RUNNING. [online] https://www.blackhat.com/docs/eu-17/materials/eu-17-Goryachy-How-To-Hack-A-Turned-Off-Computer-Or-Running-Unsigned-Code-In-Intel-Management-Engine-wp.pdf.
23. Use After Free. [online] https://cwe.mitre.org/data/definitions/416.html.
24. [Online] http://www.skyfree.org/linux/references/ELF_Format.pdf.
25. Hocko, Michal. mm: introduce MAP_FIXED_SAFE. [Online] https://lwn.net/Articles/741335/.
26. Use of Uninitialized Variable. [online] https://cwe.mitre.org/data/definitions/457.html.

No comments:

Post a Comment