RISC-ing it all: How binary data is stored
the beginning
If you are like me, you have written a program, compiled it and run the executable binary file at some point in your life. You did it a couple of times, and then you started wondering how it all works, especially the compiling and execution part. It bothered you enough to want to learn it (or at least read about it), but somebody told you that you were wasting your time, that you would be better off learning to build web apps. And you knew they were not lying. You tried to learn how to write beautiful web apps but realised you didn’t care about building beautiful things as much as you liked learning about how beautiful things get built.
And so you went on the internet and found blogs and even big books (very big ones) dedicated to teaching how programs get compiled (or executed). You tried to read them but they made absolutely no sense to you. You gave up on the idea, but not before you learned that programs, at the end of the day, get executed by something called the CPU, that the CPU performs actions based on something called an instruction, and that instructions are bit patterns: depending on the pattern of the bits, the CPU chooses the appropriate course of action.
You didn’t even know what that meant but it felt like a good thing to know, so you held onto that piece of knowledge. At another time you read somewhere that all the programs we cannot now live without are actually just a bunch of bit patterns that the CPU thing can execute (a bunch of instructions and data, essentially). Your mind was blown and your socks were knocked off but you put them back on and continued reading. You found out that a compiler is responsible for taking the programs you write and turning them into bit patterns that the CPU understands and can execute.
And so it all started for you. You knew there and then that there was no turning back; you knew too much to go back now. So you went further and discovered that the type of bit patterns the CPU understands is dictated by something called the Instruction Set Architecture, which is even harder to understand than everything else because the manuals explaining what ISAs are and what they do are big (a lot bigger than the compiler books) and you didn’t know where to start. But then you found the precious little ISA (RISC-V), beautifully crafted in the open, with much smaller manuals and specs and a loving community, so you read and read, watched videos and read some more. And everything made sense. All the stars aligned and the universe made sense: finally, the glimmer of light in the darkness you were hoping for all along.
Joe, enough! cut the crap and talk about what we are doing here today
Okay, let’s get into why we are here again. We know that when we write code and compile it, it gets translated into instructions and data that the CPU can understand and execute. These instructions get stored on disk in an executable file. The tools and processes involved from source code to the executable file look vaguely like this:
- A text editor is used to create source code in a high-level language (C) or a low-level language (assembly), which are just symbolic representations of instructions.
- The compiler or assembler takes the source code and turns it into object files or executable files.
- A linker takes the object files and (surprise!) links them into one big (or small) executable file we can give to the CPU to run.
As part of the journey of trying to figure out how things work, we are going to try to figure out how data and instructions are stored in our compiled programs. We are going to be experimenting with a lot of RISC-V assembly code and binary data. Let’s get into it!
getting object files from our source code
We have established some common ground in the beginning. Now let’s figure out how to get the instructions and data out of our compiled code and inspect them to see how they are stored. The program we are using for this particular task is just a simple C program that does absolutely nothing but contains data, because data is very important for things to make sense.
/* main.c */
int num;
int number = 0x1234;
const int constant = 0x4567;
int main(int argc, char *argv[])
{
return 0;
}
I have the RISC-V GNU Compiler toolchain installed on my computer, and I am going to be using these tools to compile, disassemble and dump binaries all over the place. Let’s start by compiling our C code into an object file.
joe@debian:../$ riscv64-linux-gnu-gcc -c -g main.c
joe@debian:../$ ls
main.c main.o
We’ve got ourselves a nice little object file, main.o. It looks beautiful, doesn’t it? We now have our object file (containing the compiled code) and we can inspect it.
inspecting the object file
Let’s start by running file on the object file to see what type of stuff we are dealing with.
joe@debian:../$ file main.o
main.o: ELF 64-bit LSB relocatable, UCB RISC-V, version 1 (SYSV), with debug_info, not stripped
Well, it turns out our object file is an ELF (Executable and Linkable Format) file. If you use Linux you’ve used one before and probably know what it is. ELF is the file format for executable files, object files and shared libraries.
Let’s try to make sense of the output of the file command up there.
- ELF: this specifies the file format, in this case an ELF file. There are many other formats for executables and object files out there, each with its own specification document, if you want to play around with them.
- LSB: specifies the endianness of the file. There are two types that I know and care about: little-endian, or Least Significant Byte first (LSB), and big-endian, or Most Significant Byte first (MSB). It indicates how the bytes of a multi-byte value are ordered in memory. LSB means the least significant byte of a word is stored at the lowest memory address and the most significant byte at the highest; MSB means the opposite, with the most significant byte at the lowest address.
- relocatable: Since this is an object file, it is not yet executable even though it contains instructions and data. It is a relocatable file because it needs to be linked (by the linker) with other object files to create an executable.
- UCB RISC-V: specifies the ISA of the instructions in the file. In this case the precious little RISC-V. UCB (UC Berkeley) is where the ISA was developed.
- (SYSV): this specifies the OS and ABI the file targets. The ABI (Application Binary Interface) describes how compiled code interacts with the operating system and with other compiled code, for example how functions are called and how arguments are passed.
- with debug_info, not stripped: we compiled the code with the -g option, which keeps debugging information in the file: mappings back to source lines plus the names of variables and functions (called symbols). "not stripped" means these symbols have not been removed from the object file. As the name suggests, debug information is valuable for debugging programs.
Now that we are done with that, let’s take a closer look at the general structure of an ELF file.
a closer look at the elf file
Let’s take a look at the ELF file header (using readelf).
joe@debian:../$ readelf -h main.o
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 2992 (bytes into file)
Flags: 0x5, RVC, double-float ABI
Size of this header: 64 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 64 (bytes)
Number of section headers: 20
Section header string table index: 19
Well well well! We have most of the same information we got from file here and so much more. We have section header and program header information. Section headers and program headers describe the main parts of an ELF file; together, the sections and segments they point to contain most of the things we care about: instructions and data.
Let’s see what the program headers and the section headers look like. We start with the program headers, which describe what are also called segments.
joe@debian:../$ readelf --segments main.o
There are no program headers in this file.
What! Why? Well, program headers are only needed by the operating system: they tell it where and how to load the contents of the file into memory. Since an object file has not yet been linked (by the linker) into an executable, the toolchain sees no need to include them, so object files have no program headers. Let’s move on.
What about the sections?
joe@debian:../$ readelf --sections main.o
There are 20 section headers, starting at offset 0xbd8:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS 0000000000000000 00000040
000000000000001a 0000000000000000 AX 0 0 2
[ 2] .data PROGBITS 0000000000000000 0000005c
0000000000000004 0000000000000000 WA 0 0 4
[ 3] .bss NOBITS 0000000000000000 00000060
0000000000000004 0000000000000000 WA 0 0 4
[ 4] .rodata PROGBITS 0000000000000000 00000060
0000000000000004 0000000000000000 A 0 0 4
[ 5] .debug_info PROGBITS 0000000000000000 00000064
00000000000000d0 0000000000000000 0 0 1
[ 6] .rela.debug_info RELA 0000000000000000 000007c8
00000000000001f8 0000000000000018 I 17 5 8
... (bunch of stuff we don't care about)
[17] .symtab SYMTAB 0000000000000000 00000330
00000000000003d8 0000000000000018 18 37 8
[18] .strtab STRTAB 0000000000000000 00000708
00000000000000ae 0000000000000000 0 0 1
[19] .shstrtab STRTAB 0000000000000000 00000b00
00000000000000ae 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
p (processor specific)
Yep, we got them! And they are numerous, more than I was expecting. Remember the "with debug_info, not stripped"? We have a lot of debug stuff in the compiled code that is helpful for finding bugs but takes up space.
peeking at the relevant sections
Now let’s try to find out what the different sections mean and what they do. I mean, there has to be a purpose to every single one of them, right? But we will only look at those that contain data and instructions. Let’s get to looking at the different sections and the data each one of them contains. We will be looking at some hexdumps 😬. We will look at the .text, .data, .rodata, .bss and .symtab sections.
.text
this section contains the instructions we love and cherish. It contains the bit patterns that tell the CPU what to do. Essentially our code but as bit patterns.
joe@debian:../$ readelf -x .text main.o
Hex dump of section '.text':
0x00000000 011122ec 0010aa87 2330b4fe 2326f4fe ..".....#0..#&..
0x00000010 81473e85 62640561 8280 .G>.bd.a..
Well, since instructions are just bit patterns, it makes sense that they get stored as a bunch of bytes. We can take a look at the disassembly of the above code (produced with objdump) and see what those bit patterns actually mean.
main.o: file format elf64-littleriscv
Disassembly of section .text:
0000000000000000 <main>:
int num;
int number = 0x1234;
const int constant = 0x4567;
int main(int argc, char *argv[])
{
0: 1101 addi sp,sp,-32
2: ec22 sd s0,24(sp)
4: 1000 addi s0,sp,32
6: 87aa mv a5,a0
8: feb43023 sd a1,-32(s0)
c: fef42623 sw a5,-20(s0)
return 0;
10: 4781 li a5,0
}
12: 853e mv a0,a5
14: 6462 ld s0,24(sp)
16: 6105 addi sp,sp,32
18: 8082 ret
That is a lot of assembly code for a function that does absolutely nothing. The source lines interspersed with the assembly code above (available thanks to the debug info) might make it easier to understand.
Our C program has a main function with signature int main(int argc, char *argv[]) that does nothing and returns zero. According to the RISC-V calling convention, arguments are passed to functions through the a0-a7 registers and the return value of a function is placed in the a0-a1 registers. s0 doubles as the frame pointer and sp is the stack pointer.
The stack grows downwards, towards the lower memory addresses. The way I think about this is: if the stack grows towards lower addresses, allocating space requires subtracting from the stack pointer’s value and deallocating space requires adding to it.
Because we compiled our code to run on a 64-bit machine, memory addresses (pointers) are 64 bits (8 bytes). Memory is byte addressable, so adjacent addresses are one byte apart, but we are able to load and store 64 bits (8 consecutive bytes) at a time with the ld (load doubleword) and sd (store doubleword) instructions respectively. A doubleword is 8 bytes in RISC-V.
Now let’s dissect the assembly and understand what every part of it means. We start with the first three instructions
addi sp, sp, -32
sd s0, 24(sp)
addi s0, sp, 32
Over here it looks like the stack frame is being set up for the function. 32 is subtracted from the stack pointer in the first line, meaning an allocation of 32 bytes on the stack. On the next line, the caller’s frame pointer s0 is saved at offset 24 from the new stack pointer so it can be restored later. addi s0, sp, 32 then sets the frame pointer s0 to the value the stack pointer had on entry, which is the high-address end of the newly allocated space. So a stack frame is the range of memory between the frame pointer and the stack pointer.
Note: the stack pointer and the frame pointer are registers that contain memory addresses, in this case addresses within the stack.
mv a5,a0
sd a1,-32(s0)
sw a5,-20(s0)
Over here, the value of the first argument, int argc, gets copied from a0 to the a5 register, and a1, the second argument register, holding the value of char *argv[], is stored on the stack with sd (note that it is a pointer, so it takes 8 bytes). The integer value in a5 also gets stored on the stack, this time using the sw (store word) instruction. This is because ints in C are 4 bytes here and a word in RISC-V is 4 bytes. The snippet above stores both function arguments on the stack.
li a5,0
mv a0,a5
ld s0,24(sp)
addi sp,sp,32
ret
In the first line of the above snippet, the immediate value 0 is loaded into the a5 register and then the value is copied to the a0 register. The frame pointer saved earlier on the stack is then loaded back into the s0 register. addi sp, sp, 32 deallocates the stack frame, essentially moving the stack pointer back to its state before the main function was called. Note that the 0 moved into a0 is the return value of the function.
.data
contains the global and static variables of our compiled code that have a non-zero initial value.
joe@debian:../$ readelf -x .data main.o
Hex dump of section '.data':
0x00000000 34120000 4...
Let’s take a look at the snippet of our code that might correspond to this section (global and static variables).
/* ... */
int number = 0x1234;
/* ... */
Okay, we have 3412 with a lot of zeroes after it in the hex dump above, but we stored 1234. Remember LSB and MSB? We see the effect over here: 34 comes before 12 in the hexdump because the data is stored in little-endian format (the little end is stored first). The zeroes are there because an int is 4 bytes, so 0x1234 is really 0x00001234. How do we know which one is the little end and which one is the big end? The byte with the lowest place value is the little end (least significant byte) and the byte with the highest place value is the big end (most significant byte).
.bss
this one is just like the .data section but only contains those variables that do not have an initial value or have an initial value of zero.
/* ... */
int num;
/* ... */
joe@debian:../$ readelf -x .bss main.o
Section '.bss' has no data to dump.
Ehmm, what is going on here? We declared a global variable and did not initialize it.
Because the variables in the .bss section do not have initial values, the compiler stores no bytes for the section in the file to save some space (that is what the NOBITS type in the section headers means); only its size is recorded. It is the responsibility of the executable loader (in the operating system) to fill this section with zeroes when loading the file into memory for execution. Let’s look at the disassembly of this section.
joe@debian:../$ riscv64-linux-gnu-objdump -z -S -D -j .bss main.o
main.o: file format elf64-littleriscv
Disassembly of section .bss:
0000000000000000 <num>:
int num;
0: 0000 unimp
2: 0000 unimp
Yeah well, we see the num variable name with a bunch of zeroes in that section. Everything looks good.
.rodata (read only data)
this section contains constant data. If you have an integer constant or string constant, they get stored in this section. If you pass a literal string to a function, it gets stored here.
joe@debian:../$ readelf -x .rodata main.o
Hex dump of section '.rodata':
0x00000000 67450000 gE..
What does the corresponding C code for this section look like?
/* ... */
const int constant = 0x4567;
/* ... */
Well we see the same pattern. The data is stored in the little endian format with a bunch of zeroes after it.
.symtab
this section contains information about all the symbols in the object file: variables, constants, functions etc. Since this section contains the symbols of the program, let’s take a look at the actual symbol table and not the hex dump of it.
joe@debian:../$ readelf --symbols main.o
Symbol table '.symtab' contains 41 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS main.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 1
3: 0000000000000000 0 SECTION LOCAL DEFAULT 2
4: 0000000000000000 0 SECTION LOCAL DEFAULT 3
5: 0000000000000000 0 SECTION LOCAL DEFAULT 4
... (bunch of stuff we don't care about) ...
34: 0000000000000000 0 NOTYPE LOCAL DEFAULT 5 .Ldebug_info0
35: 0000000000000000 0 SECTION LOCAL DEFAULT 13
36: 0000000000000000 0 SECTION LOCAL DEFAULT 15
37: 0000000000000000 4 OBJECT GLOBAL DEFAULT 3 num
38: 0000000000000000 4 OBJECT GLOBAL DEFAULT 2 number
39: 0000000000000000 4 OBJECT GLOBAL DEFAULT 4 constant
40: 0000000000000000 26 FUNC GLOBAL DEFAULT 1 main
Wow, cool stuff indeed. We can see the variable names, the main function and even the filename in the symbol table. That’s wild!
final words
This post spiraled out of control quickly and took on a life of its own. It was fun and I learnt a ton writing it. I might continue and write more stuff on this topic of how data and code get stored in binary files. I would especially like to figure out how arrays, structs and pointers are stored. Ohh, and how data is aligned in memory, and padding, and all the other fancy stuff. And maybe look at control flow and program logic at some other time.
Until then happy learning Campers!
resources
- MIT 6.004 Computation Structures (RISC-V programming)
- The 101 of ELF files on Linux
- ELF Man Page - Linux
- RISC-V calling convention