A surprising amount of internal information about Microsoft software is available even though it is closed-source. For example, library functions are exported by name, which reveals some of the interfaces in use. Debugging symbols used to troubleshoot operating system errors are also publicly available. What we have at hand, however, is only compiled binary modules. In this article, we will try to determine what they looked like before compilation, using only legal methods.
The question is not new: Mark Russinovich and Alex Ionescu have explored it before, but my research goes into more detail. What we need are the publicly available debugging symbol packages, in this case for the most recent release of Windows 10 (64-bit), both free and checked builds.
Debugging symbols are a set of .pdb (program database) files that hold information used for debugging Windows binary modules, including the names of globals, functions, and data structures, sometimes even with field names.
We can also use information from an almost-publicly-available checked build of Windows 10. This kind of build is full of debugging assertions that contain sensitive information about local variable names and even source line numbers.
The example above, while not providing an absolute path, does expose extremely helpful path information.
If we feed the debugging symbols to the "strings" utility by Sysinternals, we get around 13 GB of raw data. Repeating this with all of the Windows installation files is a bad idea, however, because it would generate mostly useless data. We therefore limit the target file types to the following: exe (executable files), sys (drivers), dll (libraries), ocx (ActiveX components), cpl (control panel elements), and efi (EFI applications, in particular the bootloader). This gives us an additional 5.3 GB of raw data.
I was initially surprised by how few programs can open files that are gigabytes in size, and how even fewer can search for specific data inside them. I used 010 Editor for manual operations on the raw and temporary data, and Python scripts for automated data filtering.
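The extraction step can be approximated in Python. The sketch below is not the Sysinternals tool and not the script used for this article. It only illustrates the idea, assuming a minimum string length of 4 characters and treating both ASCII and UTF-16LE runs as strings. The names extract_strings, is_target, and dump_tree are made up for illustration.

```python
import re
from pathlib import Path

# File types from the list above; everything else is skipped.
TARGET_EXTENSIONS = {".exe", ".sys", ".dll", ".ocx", ".cpl", ".efi"}

# Runs of at least 4 printable characters (assumed threshold).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def extract_strings(data: bytes):
    """Yield printable ASCII runs, then UTF-16LE runs, found in data."""
    for match in ASCII_RUN.finditer(data):
        yield match.group().decode("ascii")
    for match in UTF16_RUN.finditer(data):
        yield match.group().decode("utf-16-le")


def is_target(path: Path) -> bool:
    """True if the file extension is one of the selected binary types."""
    return path.suffix.lower() in TARGET_EXTENSIONS


def dump_tree(root: Path):
    """Yield strings from every selected binary below root."""
    for path in root.rglob("*"):
        if path.is_file() and is_target(path):
            # Individual binaries are small enough to read in one go.
            yield from extract_strings(path.read_bytes())
```

Writing the output of dump_tree to a text file, one string per line, gives a data set that the later filtering steps can work on.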
Filtering Symbol Data
The symbol file contains a list of object files used for linking of a corresponding executable image. Object file paths are absolute.
- Filtering clue No. 1: find strings using the mask ":\\".
We are able to get the absolute paths, sort them and remove duplicates, and due to the low volume of junk data, it can be removed manually. These results indicate the source tree structure. The root directory is "d:\th", which may stand for threshold, part of the name of the November release of Windows 10 — Threshold 1. However, we only get a few filenames starting with "d:\th". This is because the linker uses already compiled files as an input. Source files are compiled into the folders "d:\th.obj.amd64fre" for the release or free version of Windows and "d:\th.obj.amd64chk" for the checked or debug version.
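A minimal sketch of this first filter is shown here. It assumes the raw strings are available one per line, and it lower-cases them so that paths differing only in case collapse into one entry. That is an assumption of this sketch, justified by Windows paths being case-insensitive. The function name absolute_paths is hypothetical.

```python
def absolute_paths(lines):
    """Return sorted, de-duplicated lines that contain the ':\\' mask."""
    found = set()
    for line in lines:
        line = line.strip()
        if ":\\" in line:
            # Assumption: case differences are not meaningful.
            found.add(line.lower())
    return sorted(found)
```

For a 13 GB input, the lines can be streamed straight from an open file object, since only the matching lines are kept in memory.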
- Filtering clue No. 2: assuming that source files are stored as the corresponding object files after compilation, we can “decompile” object files back to the source ones. Please note that this step can produce an inaccurate structure in the source tree because we don't know for certain the compilation options used.
As for the file extensions, an object file can be produced from a range of different file types like "c", "cpp", "cxx", etc. and there is no way to identify the type of a source file, so we leave the "c??" extension.
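The mapping from object files back to source files could look roughly like this. The sketch assumes that the relative path below the object root mirrors the source tree exactly, which is the simplification noted above. The path in the usage example is invented for illustration and is not taken from the real tree.

```python
# Object roots named in the text, and the source root they map back to.
OBJ_ROOTS = ("d:\\th.obj.amd64fre", "d:\\th.obj.amd64chk")
SOURCE_ROOT = "d:\\th"


def object_to_source(obj_path):
    """Map an object file path to a guessed source path, or None."""
    lowered = obj_path.lower()
    for root in OBJ_ROOTS:
        prefix = root + "\\"
        if lowered.startswith(prefix):
            source = SOURCE_ROOT + "\\" + lowered[len(prefix):]
            if source.endswith(".obj"):
                # The real extension is unknown, so keep a placeholder.
                source = source[:-len(".obj")] + ".c??"
            return source
    return None  # not under a known object root


# Hypothetical path, for illustration only.
print(object_to_source("d:\\th.obj.amd64fre\\foo\\bar\\baz.obj"))
```

Any intermediate directories that the build inserts into the object tree would survive this mapping unchanged, which is one way the restored structure can end up inaccurate.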
There are many root directories besides "d:\th". Others include "d:\th.public.chk" and "d:\th.public.fre", but we omit these because they are just placeholders for publicly available SDKs. We also note that there are many driver projects, which are seemingly built at developers' workplaces:
There is a standard set of drivers for devices that are compatible with public specifications, such as USB XHCI controllers, and it is part of the Windows source tree, while all vendor-specific drivers are built somewhere else.
- Filtering clue No. 3: remove binary files, because we are only interested in source ones. Remove "pdb", "exp", "lib"; "res" files can be reverted to the original "rc" (resource compiler) files.
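As a sketch, this cleanup amounts to dropping three extensions and rewriting a fourth. The function name clean_paths is hypothetical, and the input is assumed to be a list of path strings.

```python
# Linker inputs and outputs that have no corresponding source file.
BINARY_EXTENSIONS = (".pdb", ".exp", ".lib")


def clean_paths(paths):
    """Drop binary artifacts and revert compiled resources to .rc."""
    result = set()
    for path in paths:
        path = path.lower()
        if path.endswith(BINARY_EXTENSIONS):
            continue
        if path.endswith(".res"):
            # A .res file is produced by the resource compiler from .rc.
            path = path[:-len(".res")] + ".rc"
        result.add(path)
    return sorted(result)
```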
While this output is neat, we cannot get any additional information about source files from this step, so we must work with the next data set.
Filtering Raw Binaries Data
As there are only a few absolute filenames in this data set, we will use the following extensions as a filter:
- "c" — C sources
- "cpp" — C++ sources
- "cxx" — C or C++ sources
- "h" — C header
- "hpp" — C++ header
- "hxx" — C or C++ header
- "asm" — assembly source (MASM)
- "inc" — assembly header (MASM)
- "def" — module definition file
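One way to apply this extension list is a regular expression that pulls filename-like tokens out of each raw string. The pattern below is an assumption of this sketch: it accepts word characters, dots, slashes, backslashes, colons, and hyphens in a path, and it will miss names that contain spaces. The example path is invented.

```python
import re

SOURCE_EXTENSIONS = ("c", "cpp", "cxx", "h", "hpp", "hxx", "asm", "inc", "def")

# A path-like token that ends in one of the extensions above.
SOURCE_NAME = re.compile(
    r"[\w.\\/:\-]+\.(?:%s)\b" % "|".join(SOURCE_EXTENSIONS),
    re.IGNORECASE,
)


def find_source_names(line):
    """Return all source-like file names found in one raw string."""
    return [match.group().lower() for match in SOURCE_NAME.finditer(line)]


# Hypothetical assertion text, for illustration only.
print(find_source_names("Assertion failed at onecore\\base\\foo.cpp line 12"))
```

Because the match is purely textual, the result still needs sorting, de-duplication, and some manual cleanup of junk entries.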
At this stage, there are problems with the filtered data. The first problem is that we are not sure the object file paths were properly reverted to source file paths.
- Filtering clue No. 4: let's check if there are matching filepaths between filtered symbol data and filtered data from binaries.
They do match, which means that we properly restored most of the directory structure of the source tree. Some folders might not be properly restored, but this level of inaccuracy is acceptable. We can also replace the "c??" placeholder with the real extension wherever a matching file path is found.
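The cross-check between the two data sets can be sketched as a suffix match: a path found in the binaries confirms a restored path if, ignoring the extension, the restored path ends with it. This matching rule is an assumption of the sketch and not necessarily the exact rule used here. The paths in the usage example are invented.

```python
import ntpath

COMPILED_EXTENSIONS = {".c", ".cpp", ".cxx"}


def resolve_extensions(symbol_paths, binary_paths):
    """Replace '.c??' with a real extension where the binaries confirm it."""
    by_name = {}
    for path in binary_paths:
        stem, ext = ntpath.splitext(path.lower())
        if ext in COMPILED_EXTENSIONS:
            by_name.setdefault(ntpath.basename(stem), []).append((stem, ext))

    resolved = []
    for path in symbol_paths:
        if path.endswith(".c??"):
            stem = path[:-len(".c??")]
            for cand_stem, ext in by_name.get(ntpath.basename(stem), []):
                # Match on whole path components only.
                if stem == cand_stem or stem.endswith("\\" + cand_stem):
                    path = stem + ext
                    break
        resolved.append(path)
    return resolved


# Hypothetical paths, for illustration only.
print(resolve_extensions(["d:\\th\\foo\\bar.c??"], ["foo\\bar.cpp"]))
```

A bare file name without any directory part matches every directory that contains a file of that name, so such hits are best reviewed by hand.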
The second problem is header files. Although a header file is a very important part of a source tree, it is not compiled into an object file. This means that we can't restore the information about header files from object files, so we can only locate and restore header files that were found in the raw data from binaries.
The third problem is that we still don't know the extensions of most source files.
- Filtering clue No. 5: assume that a directory contains source files of the same type.
This means that if a directory already contains the "cpp" source file, it is likely that all the other files in the same folder will be "cpp" sources.
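This heuristic can be sketched as a per-directory vote. Each directory's already-known source extensions are counted, and the most common one is assigned to the files still carrying the "c??" placeholder. Taking the most common extension is a choice made for this sketch, and directories with no known extension are left untouched.

```python
import ntpath
from collections import Counter, defaultdict

KNOWN_EXTENSIONS = {".c", ".cpp", ".cxx"}


def infer_by_directory(paths):
    """Give '.c??' files the most common known extension in their folder."""
    votes = defaultdict(Counter)
    for path in paths:
        directory, name = ntpath.split(path)
        ext = ntpath.splitext(name)[1]
        if ext in KNOWN_EXTENSIONS:
            votes[directory][ext] += 1

    result = []
    for path in paths:
        if path.endswith(".c??"):
            counter = votes.get(ntpath.dirname(path))
            if counter:
                best_ext = counter.most_common(1)[0][0]
                path = path[:-len(".c??")] + best_ext
        result.append(path)
    return result
```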
- Filtering clue No. 6: use external sources of information to fill in the remaining details.
I used the Windows Research Kernel as a reference for the assembly sources and renamed some assembly sources by hand.
Inspecting the Result Data
A keyword search in the source filenames for "telemetry" resulted in 424 hits, the most interesting of which are listed below.
These results don't reveal additional information about the telemetry internals, but they do provide an interesting starting point for more detailed research.
Next, I looked for PatchGuard, but the source tree contains only one file of an unknown type (most likely binary).
Searching the unfiltered data reveals that PatchGuard is in fact a separate project.
I also searched for random phrases and words. Some interesting results are provided below:
You are invited to check out the Windows 10 source tree on GitHub and share your findings.
Author: Artem Shishkin, Positive Research