Code - The Hidden Language of Computer Hardware and Software

This is the sixth installment in a series of posts where I share notes taken while reading an interesting book or article. This post includes the notes made while reading the book “Code - The Hidden Language of Computer Hardware and Software 2nd Edition” by Charles Petzold. Chapter 2: Codes and Combinations Morse Code is a binary language. International Morse Code encodes up to 8 bits of information where each bit is either a dot or a dash. Dots and dashes are the written representation of Morse Code. The dot is a short signal, while the dash is a long signal. The length of the dash is three times that of the dot. As with all binary codes, each time you add a bit, you double the number of possible combinations. Chapter 3: Braille and Binary Codes Braille is a type of binary code where 6 bits represent a character. The 6 bits arrange in a 3x2 grid, where each dot can be either raised or not raised. The 6 bits can represent 64 different characters. Though there are only 64 characters, Braille characters can have additional meanings based on their context. The number indicator marks the beginning of a sequence of numbers. The letter indicator terminates the number sequence. This is an example of a shift code. In Braille, the capital indicator capitalizes the letter that follows. The capital indicator is an example of an escape code. Escape codes change the meaning of the code that immediately follows. Chapter 4: Anatomy of a Flashlight You can characterize electricity as the flow of electrons. In the flashlight example, the battery is the source of electricity. The flashlight bulb is the load. The switch is a control device that opens and closes the circuit. Batteries produce a chemical reaction that creates a flow of electrons. Chemical energy converts to electrical energy. Spare electrons appear on the anode side of the battery. The cathode side of the battery has a deficiency of electrons. Electrons would like to flow from the anode to the cathode. An electrical circuit provides a path for the flow of electrons from the anode to the cathode. Substances composed of atoms that bias towards shedding electrons are conductors. Copper, silver, and gold are good conductors. The opposite of conductance is resistance. Substances with high resistance are called insulators. Rubber, glass, and plastic are good insulators. Current (\(I\)) is the flow of electrons through a circuit. Voltage (\(V\)) is the difference in electric potential between two points in a circuit. Voltage is directly proportional to the amount of current flowing through a circuit. Current is inversely proportional to resistance. Also known as Ohm’s Law: \(V = IR\). Power in Watts is the product of voltage and current: \(P = IV\). Chapter 5: Communicating Around Corners You can make a primitive telegraph by connecting two batteries, four wires, and two light bulbs. You can remove a wire from the circuit by using a common ground. Ground in this case means a connection to the Earth. The Earth acts as a source of electrons. The electrons flow from the Earth, through the circuit, and into the positive terminal of the battery. American Wire Gauge (AWG) is a measure of wire thickness. The lower the number, the thicker the wire. The thicker the wire, the less resistance it has. When covering great distances, you need to use thicker wire or a higher voltage power supply. Chapter 6: Logic with Switches George Boole invented Boolean algebra. Boolean algebra is a system of logic. You can describe boolean algebra in terms of sets.
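To make the set view of Boolean algebra concrete, here is a minimal C++ sketch (the example sets and names are my own, not the book’s) showing that set intersection behaves like AND and set union like OR:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <vector>

int main() {
    // Two illustrative sets drawn from a small "universe" of cats.
    std::set<std::string> black = {"Boots", "Mr. Jinx"};
    std::set<std::string> male  = {"Boots", "Felix"};

    // Intersection behaves like AND: cats that are black AND male.
    std::vector<std::string> both;
    std::set_intersection(black.begin(), black.end(), male.begin(), male.end(),
                          std::back_inserter(both));

    // Union behaves like OR: cats that are black OR male.
    std::vector<std::string> either;
    std::set_union(black.begin(), black.end(), male.begin(), male.end(),
                   std::back_inserter(either));

    for (const auto& c : both)   std::cout << "black AND male: " << c << '\n';
    for (const auto& c : either) std::cout << "black OR male: "  << c << '\n';
}
```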
The usual set operators such as union, intersection, and complement apply as do the concepts of the universe and NULL set. Two switches wired in series are equivalent to an AND gate. Two switches wired in parallel are equivalent to an OR gate. You can configure switches and wires to form complex boolean expressions. This is a primitive form of a computer. Chapter 7: Telegraphs and Relays The invention of the telegraph marks the beginning of modern communication. For the first time, people could communicate over long distances almost instantaneously. The telegraph key is a switch that opens and closes the circuit to send messages. The telegraph operator taps the key to send dots and dashes. The telegraph sounder is a device that converts the electrical signals from the telegraph key into audible sounds. One major impediment to the telegraph was the length of the wires needed to connect the telegraph stations. The longer the wire, the more resistance it has, and the weaker the signal. The invention of the relay solved the problem of long-distance telegraphy. A relay is an electrically operated switch that can amplify the signal. A relay consists of an electromagnet, a set of contacts, and a spring. When the electromagnet gets energized, it attracts the armature, which closes the contacts and completes the circuit. Chapter 8: Relays and Gates Reduced to its essentials, a computer is a synthesis of Boolean algebra and electricity. The crucial components of a computer are the logic gates. Logic gates are electronic circuits that perform boolean operations. Like switches, you can connect relays in series or parallel as logic gates. You can combine logic gates to form more complex circuits. The switches control the input to the relays, and the relays control the output. The output of one relay can be the input to another relay. A normally open relay is a relay that’s open when the electromagnet isn’t energized. A normally closed relay is a relay that’s closed when the electromagnet isn’t energized. An inverter is a logic gate that reverses the input signal. In terms of relays, an inverter is a normally closed relay that opens when the electromagnet gets energized. There are six basic logic gates: AND, OR, NOT, NAND, NOR, and XOR. Additionally, you have buffers. A buffer is a logic gate that passes the input signal to the output without changing it. In real-life circuits, one output must sometimes serve many inputs. That’s called fanout, and it can reduce the power available to each input. Buffers, acting like relays, can help boost that power. You can also use buffers to delay a signal. From a NAND or NOR gate, you can create all other logic gates. The following are De Morgan’s Laws: \(\lnot A \land \lnot B = \lnot(A \lor B)\) \(\lnot A \lor \lnot B = \lnot(A \land B)\) De Morgan’s Laws are useful for simplifying boolean expressions. Chapter 9: Our Ten Digits There are many possible number systems. Roman numerals were common before the introduction of the Hindu-Arabic numeral system. Key features that differentiate the Hindu-Arabic numeral system include: The use of a zero to represent the absence of a value. The positional notation, where the value of a digit depends on its position in the number. Chapter 10: Alternative 10s By convention, humans work in a base 10 number system, also known as decimal. There are many other number systems, such as binary (base 2), octal (base 8), and hexadecimal (base 16).
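A quick way to convince yourself of the De Morgan’s Laws mentioned above is to check every combination of inputs; a small sketch (my own, not from the book):

```cpp
#include <cassert>
#include <iostream>

int main() {
    // Enumerate all four combinations of A and B and check both laws.
    for (int a = 0; a <= 1; ++a) {
        for (int b = 0; b <= 1; ++b) {
            bool A = (a == 1), B = (b == 1);
            assert((!A && !B) == !(A || B));  // first law
            assert((!A || !B) == !(A && B));  // second law
        }
    }
    std::cout << "De Morgan's Laws hold for all inputs\n";
}
```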
The formula for converting from any base to decimal is: \(d = \sum_{i=0}^{n} d_i \cdot b^i\) where \(d\) is the decimal value, \(d_i\) is the digit in base \(b\), and \(n\) is the position of the digit. Binary numbers unite arithmetic and electricity. Switches, wires, and light bulbs can all represent the binary digits 0 and 1, and with the addition of logic gates, you can manipulate these numbers. Chapter 11: Bit by Bit by Bit The binary number system is the simplest number system possible. The bit, a binary digit, is the fundamental unit of information in computing. The meaning of a particular bit or collection of bits is always understood contextually. You can visualize binary codes in many ways. The example given is the Universal Product Code (UPC) barcode. The UPC barcode is a binary code that encodes information about a product. A slice of the barcode has black lines representing 1 and gaps representing 0. The thickness of the lines or gaps dictates the number of bits represented (up to 4 bits per line). In total, the UPC barcode encodes 95 bits of data. The UPC barcode also includes some error checking. The last digit is a checksum that verifies the integrity of the data. Each barcode includes a beginning, middle, and end guard pattern to help scanners identify faulty codes. Quick Response (QR) codes are another example of a binary code. QR codes can encode more information than UPC barcodes, including URLs and text. QR codes consist of black squares arranged on a white grid. The black squares represent 1s, and the white squares represent 0s. Most of the bits in a QR code are for error correction. Chapter 12: Bytes and Hexadecimal Computers often group bits into a quantity called a word with the most common word size being 8 bits, also known as a byte. Modern computers typically use 32-bit or 64-bit words. The hexadecimal number system is a base 16 number system that uses the digits 0-9 and the letters A-F to represent values. You can group the digits in a binary number into sets of four to form a hexadecimal digit. Chapter 13: From ASCII to Unicode Morse code is a variable bit-length code, meaning that different characters can have different numbers of bits. For example, the letter ‘E’ is a single dot (1 bit), while ‘Q’ is dash, dash, dot, dash (4 bits). Braille is a fixed-length code, meaning that each character has the same number of bits (6 bits). ASCII (American Standard Code for Information Interchange) is a character encoding standard that uses 7 bits to represent characters. ASCII can represent 128 different characters, including letters, digits, and control characters. ASCII includes 32 control characters that are not printable, such as the newline character and the tab character. ASCII is also known as plain text. ASCII data does not include any formatting information, such as font size or color. Extended ASCII is an 8-bit character encoding that includes additional characters beyond the standard ASCII set. Extended ASCII can represent 256 different characters. Extended ASCII can’t represent characters from all different languages and scripts. Unicode is a character encoding standard that can represent characters from many different languages and scripts. Unicode started as a 16-bit encoding. Unicode documents start with a Byte Order Mark to indicate endianness. Unicode has more recently extended to 21 bits, allowing it to represent over a million different characters.
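To make the base-conversion formula above concrete, here is a short sketch (the function name is mine) that evaluates \(\sum_i d_i \cdot b^i\) using Horner’s rule:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Convert digits (most significant first) in base `base` to a decimal value,
// i.e. evaluate the sum of d_i * base^i.
std::uint64_t to_decimal(const std::vector<int>& digits, int base) {
    std::uint64_t value = 0;
    for (int d : digits)
        value = value * base + d;  // Horner's rule form of the summation
    return value;
}

int main() {
    std::cout << to_decimal({1, 0, 1, 1}, 2) << '\n';   // binary 1011 -> 11
    std::cout << to_decimal({2, 7}, 8) << '\n';          // octal 27 -> 23
    std::cout << to_decimal({15, 15}, 16) << '\n';       // hex FF -> 255
}
```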
Unicode transformation formats (UTF) encode Unicode characters in a way that’s compatible with existing systems. The most common UTFs are UTF-8, UTF-16, and UTF-32. UTF-8 is the most widely used Unicode encoding. It uses 1 to 4 bytes to represent characters, depending on the character’s code point. How bytes are interpreted is complex enough that you should look this up for more details. Chapter 14: Adding with Logic Gates The summation of two bits produces a sum bit and a carry bit. You create an XOR gate by passing the two inputs through both a OR gate and a NAND gate, and then passing the outputs through a AND gate: An XOR gate takes two input bits and produces a sum bit. An AND gate takes two input bits and produces a carry bit. You can combine XOR and AND gates to create a half adder, which adds two bits and produces a sum bit and a carry bit. A full adder is a circuit that adds three bits: two input bits and a carry bit from a previous addition. A full adder produces a sum bit and a carry bit. You can combine two half adders and an OR gate to create a full adder (see page 177). You can combine two 8 bit full adders to create a 16 bit adder (see page 182). You can combine multiple 16 bit adders to create larger adders. Chapter 15: Is This For Real Relays and vacuum tubes were the first electronic components used in computers. You can build logic gates from vacuum tubes just like you can with relays. Relays suffer slow switching speeds and are susceptible to mechanical wear. Vacuum tubes are faster than relays but are larger, consume more power, generate more heat, and wear more often. The transition from relay technology to vacuum tube technology marked the transition from electromechanical computers to electronic computers. Von Neumann architecture is a design for a computer that uses a single memory space for both data and instructions. This architecture is the basis for most modern computers. You can control a semiconductors’ conductance by applying a voltage. This property makes them ideal for building logic gates and other electronic components. Semiconductor doping is the process of adding impurities to a semiconductor to change its electrical properties. The most common dopants are phosphorus and boron forming n-type and p-type semiconductors. You can make amplifiers out of semiconductors by sandwiching a p-type semiconductor between two n-type semiconductors. This is the basis for a NPN transistor, and the three pieces form the collector, base, and emitter. A small voltage on the base controls the flow of current between the collector and emitter. The transistor introduces solid-state electronics, which means transistors are built from solids, specifically semiconductors and most commonly silicon. Transistors require much less power, generate much less heat, and last longer than vacuum tubes. The invention of the integrated circuit (IC) allowed for the miniaturization of electronic components. An IC is a small chip that contains many transistors and other electronic components. The first ICs were usually packaged in dual in-line packages (DIPs). There are two families of ICs: transistor-transistor logic (TTL) and complementary metal-oxide-semiconductor (CMOS). TTL chips are faster but consume more power than CMOS chips. CMOS chips are slower but consume less power and are more tolerant to variations in voltages. One important fact to know about a particular integrated circuit is the propagation time, the time it takes for a change in the inputs to reflect in the output. 
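The half and full adders described above reduce to a few Boolean operations; here is a minimal sketch of the logic (not the book’s relay diagrams), including a ripple-carry 8-bit adder built from full adders:

```cpp
#include <cstdint>
#include <iostream>

struct BitResult { bool sum; bool carry; };

// Half adder: XOR gives the sum bit, AND gives the carry bit.
BitResult half_add(bool a, bool b) { return {a != b, a && b}; }

// Full adder: two half adders plus an OR gate for the carry.
BitResult full_add(bool a, bool b, bool carry_in) {
    BitResult first = half_add(a, b);
    BitResult second = half_add(first.sum, carry_in);
    return {second.sum, first.carry || second.carry};
}

// Ripple-carry addition of two 8-bit values, one full adder per bit.
std::uint8_t add8(std::uint8_t a, std::uint8_t b) {
    std::uint8_t result = 0;
    bool carry = false;
    for (int i = 0; i < 8; ++i) {
        BitResult r = full_add((a >> i) & 1, (b >> i) & 1, carry);
        result |= static_cast<std::uint8_t>(r.sum ? 1u << i : 0u);
        carry = r.carry;
    }
    return result;
}

int main() {
    std::cout << static_cast<int>(add8(100, 55)) << '\n';  // prints 155
}
```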
You measure propagation time in nanoseconds. The timeline to keep in mind from this chapter is that logic gate components trended from relays, to vacuum tubes, to transistors, and then to integrated circuits. Chapter 16: But What About Subtraction This chapter introduces two’s complement, a method for representing signed integers in binary. Two’s complement makes for easy addition and subtraction of signed integers using the same binary addition circuits previously discussed. To convert a positive integer to two’s complement, you simply represent it in binary. To convert a negative integer to two’s complement, you invert the bits of its positive representation and add 1. Since binary numbers have a fixed number of bits, you can only represent integers within a certain range. For example, with 8 bits, you can represent signed integers from -128 to 127 or unsigned integers from 0 to 255. There’s opportunity for overflow when adding or subtracting signed integers. The circuit on page 210 shows how to detect overflow in a two’s complement addition circuit. Chapter 17: Feedback and Flip-Flops You can create an oscillator by connecting the output of an inverter to its input. The inverter will toggle its output between 0 and 1, creating a square wave. The period of an oscillator is the time it takes for the output to complete one cycle. The frequency is the number of cycles per second. The frequency of an oscillator is inversely proportional to its period. Usually, you measure frequency in Hertz. See page 221 for an illustration of a reset-set flip-flop. The next flip-flop type is the level triggered D-type flip-flop. The D stands for data. The D flip-flop captures the value of the data input when the clock signal is high and holds that value until the next clock cycle. A edge triggered D-type flip-flop captures the value of the data input on the rising or falling edge of the clock signal. You can make an edge triggered D-type flip-flop by combining two level triggered flip-flops (see page 229). If you combine an oscillator with a edge triggered D-type flip-flop, you can create a frequency divider. The output frequency is half the input frequency. See page 235 for an illustration of a frequency divider. Chained frequency dividers create a binary counter called a ripple counter. To find the frequency of the oscillator, you can let the ripple counter run for a certain number of cycles and then divide the number of cycles by the time it took to run them. You can augment a flip-flop with a clear and preset input. You never set both clear and preset at the same time. Having a clear input avoids the issue of the flip-flop being in an indeterminate state when powered on. With the preset input, you can set the flip-flop to a known state without waiting for the clock signal. Chapter 18: Let’s Build a Clock Binary Coded Decimal (BCD) is a way of representing decimal numbers in binary. BCD uses 4 bits to represent each decimal digit, allowing you to represent decimal numbers from 0 to 9. It’s best to read the book to get an idea of how to build the clock. At a high level, you use a ripple counter to count each digit in the seconds, minutes, and hours. The output of the high digit’s counter is the input to the next counter. Additional circuitry makes it possible to display the hours in 12-hour format. Electrical current can only flow in one direction through a diode. A Light Emitting Diode (LED) is a diode that emits light when current flows through it. 
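The invert-and-add-one rule for two’s complement is easy to verify on 8-bit values; a small sketch (my own illustration):

```cpp
#include <cstdint>
#include <iostream>

// Two's complement negation: invert the bits and add 1.
std::uint8_t twos_complement(std::uint8_t positive) {
    return static_cast<std::uint8_t>(~positive + 1);
}

int main() {
    std::uint8_t neg5 = twos_complement(5);
    // The same bit pattern is 251 when read as unsigned...
    std::cout << static_cast<int>(neg5) << '\n';
    // ...and -5 when read as an 8-bit signed value.
    std::cout << static_cast<int>(static_cast<std::int8_t>(neg5)) << '\n';
    // With 8 bits the representable range is -128..127 signed, 0..255 unsigned.
}
```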
A diode matrix is a grid of diodes that displays characters or numbers. The diodes are in rows and columns, and you can turn on specific diodes to create a pattern. Diode matrices are technically a form of Read Only Memory (ROM). Chapter 19: An Assemblage of Memory You can use level triggered D-type flip-flops to create a primitive form of memory. Combining say 8 flip-flops with a 3-to-8 decoder creates a memory cell that can store 8 bits of data. Add a 8-to-1 selector to the circuit and you’re able to read the data from the memory cell. See page 273 for an illustration. This is a form of read/write memory. Since you can address any of the 8 bits at will, this is also known as Random Access Memory (RAM). A tri-state buffer can have one of three states: high, low, or high impedance (Z). The high impedance state is like an open circuit, meaning it doesn’t affect the output. Both static and dynamic RAM are examples of volatile memory, meaning they lose their contents when powered off. Chapter 20: Automating Arithmetic You can build a simple adder using components introduced in previous chapters. This includes RAM, accumulators, and latches. The control signals are the most complex part of the adder. The adder includes simple instructions or opcodes for adding and subtracting, storing results in RAM, and halting the machine. The combination of the hardware and software forms a primitive computer. Byte ordering specifically little versus big endian is important when dealing with multi-byte data types. Chapter 21: The Arithmetic Logic Unit The three components of a computer include memory, the central processing unit (CPU), and input/output (I/O) devices. In memory lives both the data and the CPU instruction codes. The Arithmetic Logic Unit (ALU) is the part of the CPU that performs arithmetic and logic operations. The ALU shown in this chapter can add and subtract 8-bit numbers and perform bitwise logic operations on the same 8-bit numbers. There are opcodes for each ALU function. The ALU performs all logic operations simultaneously outputting each result to a tri-state buffer. Three functions bits select which buffer to output. A compare operation is the same as a subtraction operation, but the result is not stored. Instead, you save a carry flag and a zero flag. You can find the complete ALU circuit on page 332. Chapter 22: Registers and Busses A CPU includes a small number of special purpose registers. The registers can be general purpose or serve a specific task. For example, in the Intel 8080 CPU, the accumulator register is a general purpose register used to store intermediate results of arithmetic and logic operations. The opcodes may use one or more registers as operands. For example, the Intel 8080 CPU has an opcode for moving data to RAM. The opcode uses the H and L registers to form the 16-bit address of the RAM location to write to. Assembly language is a low-level programming language that uses mnemonics to represent opcodes and registers. Each assembly language instruction corresponds to a single opcode. Opcodes for arithmetic, moving data to registers/ram, control flow, and halting the CPU exist. The data bus is a set of wires that carry data between the CPU, memory, and I/O devices. The address bus is a set of wires that carry addresses to memory and I/O devices. Chapter 23: CPU Control Signals Most control signals are of two types (buses here are the data and address busses): Signals that put a value on one of the two busses. 
Signals that save a value from one of the two busses. Signals that put a value on the bus attach to the enable inputs of various tri-state buffers that connect the outputs of the components to the bus. Signals that save a value from the bus usually control the clock inputs of the various latches that connect the busses to the components on the bus. The only exception is when you save a value to memory using the RAM write signal. CPU cycles are the time it takes to execute a single instruction. When optimizing for speed, you want to minimize the number of CPU cycles needed to execute a program. The control signals are arguably the most complex part of the CPU. Chapter 24: Loops, Jumps, and Calls Loops are a fundamental control flow construct in programming. Loops repeat a block of code multiple times. To implement a loop in assembly language, you need to use a combination of jump instructions and conditional flags. Subroutines or functions are groups of instructions. The CALL and RET instructions call and return from subroutines. The stack is a special area of memory used to store temporary data. The stack provides a means to save the state of the program when calling a subroutine and to restore it when returning from the subroutine. You can nest subroutine calls, meaning you can call a subroutine from within another subroutine. It’s possible to nest so many calls that you run out of stack space resulting in a stack overflow. You can also underflow the stack by popping more items than you pushed onto it. Chapter 25: Peripherals Devices such as the mouse, keyboard, video display, and printer fall under the category of peripherals. One of the most important peripherals is the video display. The video display renders pixels on the screen. Each pixel is a small dot of color. Each pixel is a combination of 8 bit red, green, and blue (RGB) values. A display with a 1920x1080 resolution has 2,073,600 pixels, each represented by 3 bytes requiring a total of 6MB of memory to store the pixel data. The frame buffer is an area in main memory that houses the display pixels. The frame buffer is an example of a memory-mapped I/O device. The CPU can read and write to the frame buffer as if it were regular memory. Peripherals often communicate with the CPU using interrupts. An interrupt is a signal that tells the CPU to stop what it’s doing and handle a specific event. Many signals are analog. To convert an analog signal to a digital signal, you need an Analog-to-Digital Converter (ADC). An ADC samples the analog signal at regular intervals and converts each sample to a digital value. Digital-to-Analog Converters (DAC) do the opposite, converting digital values to analog signals. DACs in video displays convert the digital values of the pixels into voltages that govern the intensity of the red, green, and blue components of each pixel. Digital cameras use ADCs to convert the analog signals from the camera sensor into digital values that form a bitmap. With image data such as a bitmap, you can use compression algorithms to reduce the size of the image. Common image formats include JPEG, PNG, and GIF. Some compression algorithms are lossy, meaning they discard some data to reduce the file size, while others are lossless, meaning they preserve all the original data. JPEG is lossy while PNG and GIF are lossless. Microphones take in sound waves and convert them into electrical signals. The electrical signals are analog, so you need an ADC to convert them into digital values. 
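As a sanity check on the frame buffer figure quoted above (1920x1080 pixels at 3 bytes per pixel), a few lines of arithmetic:

```cpp
#include <iostream>

int main() {
    // One byte each for red, green, and blue per pixel.
    constexpr long long pixels = 1920LL * 1080;   // 2,073,600 pixels
    constexpr long long bytes  = pixels * 3;      // 6,220,800 bytes
    std::cout << pixels << " pixels, " << bytes << " bytes (~"
              << bytes / (1024.0 * 1024.0) << " MiB)\n";  // roughly 5.9 MiB, i.e. about 6 MB
}
```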
Compact discs (CDs) store audio data in a digital format. Pulse Code Modulation (PCM) is the technique which encodes the data. PCM samples the audio signal at regular intervals and quantizes the amplitude of the signal to a fixed number of levels. You sample audio at a rate of 44.1 kHz, meaning you take 44,100 samples per second. Each sample is typically 16 bits, resulting in a data rate of 1,411.2 kbps for stereo audio (two channels). The Nyquist Theorem states that you need to sample at least twice the maximum frequency to accurately capture the audio signal. Much like pictures, you can also compress audio data. Common audio formats include MP3, AAC, and FLAC. MP3 and AAC are lossy formats, while FLAC is a lossless format. Chapter 26: The Operating System An operating system (OS) is a collection of software that manages computer hardware and software resources and provides common services for computer programs. An OS exposes an Application Programming Interface (API) that programs interact with to access hardware resources. The OS includes a filesystem that organizes files and directories on storage devices. The filesystem provides a way to read, write, and manage files. A bootloader is a small program that runs when the computer starts up. The bootloader loads the OS into memory and transfers control to it. In the early days, programmers would often bypass the OS and control the hardware directly usually via writing directly to memory. The introduction of operating systems that included a graphical user interface (GUI) made it easier for users to interact with the computer. Chapter 27: Coding Assembly language is a low-level programming language that uses mnemonics to represent machine code instructions. Each assembly language instruction corresponds to a single machine code instruction. Assembly language is specific to a particular CPU architecture. For example, the Intel 8080 assembly language is different from the Motorola 6800 assembly language. In essence, assembly code is non-portable. A program called an Assembler translates assembly language code into machine code. High-level programming languages such as C, Python, and Java provide a more abstract way to write programs. High-level languages are more portable across different CPU architectures. A compiler translates high-level language code into machine code. The compiler performs various optimizations to improve the performance of the generated machine code. A functional programming language is a type of high-level language that treats computation as the evaluation of mathematical functions. Examples include Haskell and Lisp. A procedural programming language is a type of high-level language that uses procedures or routines to structure the code. Examples include C and Pascal. An object-oriented programming language is a type of high-level language that uses objects to structure the code. Examples include Java, C++, and Python. An interpreted language is a type of high-level language that a program called an interpreter executes one line at runtime. Examples include Python and Ruby. The interpreter reads the source code and executes it line by line. Most languages represent floating point numbers using the IEEE 754 standard. The standard defines how to represent floating point numbers in binary, including the sign bit, exponent, and mantissa. Special hardware called a floating point unit (FPU) performs arithmetic operations on floating point numbers. The FPU is often integrated into the CPU. 
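The stereo CD data rate quoted above follows directly from the sample rate, sample size, and channel count; a quick check:

```cpp
#include <iostream>

int main() {
    constexpr int sample_rate_hz  = 44100;  // samples per second per channel
    constexpr int bits_per_sample = 16;
    constexpr int channels        = 2;      // stereo
    constexpr long long bits_per_second =
        1LL * sample_rate_hz * bits_per_sample * channels;
    std::cout << bits_per_second << " bits/s = "
              << bits_per_second / 1000.0 << " kbps\n";  // 1,411,200 bits/s = 1411.2 kbps
}
```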
Chapter 28: The World Brain The Internet is a global network of interconnected computers that communicate with each other using standardized protocols. TCP/IP (Transmission Control Protocol/Internet Protocol) is the fundamental protocol suite that underlies the Internet. The World Wide Web (WWW) is a system of interlinked hypertext documents accessed via the Internet. The WWW uses the HTTP (Hypertext Transfer Protocol) to transfer documents. The Internet is a decentralized network, meaning there is no single point of control. This decentralization makes the Internet resilient to failures. Modems and routers are devices that connect computers to the Internet. A modem converts digital signals from a computer into analog signals sent over telephone lines or cable systems. A router directs data packets between different networks. The Domain Name System (DNS) is a distributed naming system that translates human-readable domain names (like www.example.com) into IP addresses.

May 15, 2025 · 23 min

C++ Design Patterns for Low Latency Applications Including High Frequency Trading

This is the fifth installment in a series of posts where I share notes taken while reading an interesting book or article. This post includes the notes made while reading the article titled “C++ Design Patterns for Low-Latency Applications Including High-Frequency Trading” by Paul Bilokon and Burak Gunduz. Section 2: Background 2.1 HFT High-frequency trading (HFT) is an automated trading strategy that utilises technology and algorithms to execute numerous trades at high speeds. The SEC Concept Release on Equity Market Structure outlines five key elements that define this discipline: High-speed computing. Co-location practices. Short time frames for position establishment and liquidation. The submission of numerous orders, many of which are cancelled shortly after submission. The objective of concluding the trading day with minimal unhedged positions and near-neutral overnight exposure. HFT systems are built from five main components: Data Feed: Responsible for receiving and processing real-time market data from various sources, enabling the algorithms to make buy, sell, or wait decisions. Order Management System (OMS): Handles the submission, routing, and monitoring of trade orders, ensuring efficient execution and management of trading activities. Trading Strategies: Employ automated algorithms to identify market opportunities, make trading decisions, and execute trades at high speeds. Risk management: Implements measures to assess and mitigate potential risks associated with high-speed trading, ensuring the preservation of capital and minimizing losses. Execution Infrastructure: Provides the necessary technological framework for low-latency communication and trade execution in HFT systems. Networking infrastructure falls in this category. HFT firms make strong use of Field Programmable Gate Arrays (FPGAs) to achieve low-latency targets. Four key characteristics of FPGAs are programmability, capacity, parallelism, and determinism. 2.2 C++ HFT shops prefer C++ for low-latency applications due to its compiled nature, close proximity to hardware, and control over resources, which enables you to optimize performance and efficiently manage resources. C++ offers various techniques to shift processing tasks from runtime to compile-time. Examples include: Using templates to move the runtime overhead associated with dynamic polymorphism to compile-time, at the cost of some flexibility. Inline functions, which insert a function’s code at the call site. The constexpr keyword, which evaluates computations at compile time instead of runtime, resulting in quicker and more efficient code execution. Factors such as the compiler (and its version), machine architecture, 3rd party libraries, and build and link flags can also affect latency. It’s often necessary to view the generated assembly when benchmarking C++. Tools such as Matt Godbolt’s Compiler Explorer help you do just that. 2.3 Design Patterns Design patterns in this paper don’t refer to object oriented design patterns. Instead, the authors use the term “design patterns” to refer to a number of programming strategies. Here’s a list of the different programming strategies: ...
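To illustrate the compile-time techniques from section 2.2, here is a small sketch (my own, not code from the paper) showing constexpr evaluation and a templated strategy call that avoids virtual dispatch:

```cpp
#include <iostream>

// constexpr: the computation can happen entirely at compile time.
constexpr long long fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
static_assert(fib(10) == 55, "evaluated by the compiler");

// Templates instead of virtual dispatch: the strategy type is resolved at
// compile time, so there is no vtable lookup on the hot path.
template <typename Strategy>
long long run(Strategy& s, long long price) { return s.on_price(price); }

struct MomentumStrategy {  // hypothetical toy strategy, not from the paper
    long long last = 0;
    long long on_price(long long price) {
        long long signal = price - last;  // "signal" is just the price change
        last = price;
        return signal;
    }
};

int main() {
    MomentumStrategy s;
    std::cout << run(s, 100) << ' ' << run(s, 103) << '\n';  // prints: 100 3
}
```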

October 19, 2024 · 11 min

The Practice of Programming 1st Edition

This is the fourth installment in a series of posts where I share notes taken while reading an interesting book or article. This post includes the notes made while reading the book titled “The Practice of Programming” by Brian Kernighan and Rob Pike. Chapter 1: Style Use descriptive names for globals and short names for locals. Consistency with local coding conventions is key. Use active names for functions. Name boolean functions such that the return value is obvious. Indent to show structure. Be consistent with whatever style you choose. Use parentheses to resolve ambiguity. Don’t expect the reader to be an expert in precedence rules. Break up complex expressions. Don’t cram a bunch into one line just because you can. Clarity isn’t the same as brevity. Ease of understanding is what distinguishes the two. In C/C++, avoid function macros. Just use a function. Inline or use constexpr as appropriate. Give names to magic numbers. Don’t use macros for numeric constants. Use enum, const, and constexpr. Never hardcode sizes. Use features of the language such as sizeof or size members of a container. Comments should avoid stating the obvious, contradicting the code, and should only exist to enhance clarity. Chapter 2: Algorithms and Data Structures Most programs require some form of searching or sorting. ...
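A tiny example of the advice above on named constants and sizes (the identifiers are mine):

```cpp
#include <array>
#include <cstddef>
#include <iostream>

// Give names to magic numbers with constexpr rather than #define.
constexpr int kMaxRetries = 3;
constexpr double kTimeoutSeconds = 2.5;

int main() {
    std::array<int, 8> samples{};  // the size lives in the type...
    // ...so ask the container for it instead of hardcoding 8.
    for (std::size_t i = 0; i < samples.size(); ++i)
        samples[i] = static_cast<int>(i) * kMaxRetries;

    int raw[16] = {};  // for raw arrays, let sizeof do the counting
    std::cout << samples.size() << ' '
              << sizeof(raw) / sizeof(raw[0]) << ' '
              << kTimeoutSeconds << '\n';  // prints: 8 16 2.5
}
```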

August 3, 2024 · 9 min

Linux Kernel Development 3rd Edition

This is the third installment in a series of posts where I share notes taken while reading an interesting book or article. This post includes the notes made while reading the book titled “Linux Kernel Development” by Robert Love. Getting Started The following are key differences between userspace application development and Linux kernel development: No access to the C library or C headers. There are versions of certain libc functions included in the kernel under lib/. The kernel uses GNU C. Kernel code follows ISO C99 and relies on some of GCC’s compiler extensions. The kernel lacks the memory protection you might be used to in userspace. For example, illegal memory accesses in kernel code often lead to a fatal “oops” message. Kernel code can’t perform floating point operations (not in a straightforward way). The kernel has a small, fixed-size stack. The stack is usually about 2-4 pages with a page typically being 4k on 32-bit systems or 8k on 64-bit systems. Synchronization and concurrency are always an issue. This is due to async interrupts, preemption, and SMP support. Portability is important. Kernel developers segregate architecture specific code. General kernel code makes no assumptions about word size, page size, etc. Processes In a Linux system using init, all processes are children of the init process with PID 1. Processes and threads are one and the same in Linux. The task_struct represents a process descriptor. Under the hood, the clone() function constructs a process or thread. The clone() function takes arguments which specify what process attributes get copied from the parent. When cloning, not all process data gets copied right away. Linux implements copy-on-write, meaning pages aren’t copied until the parent or child process writes to them and therefore each needs a separate copy. Linux does a trick where after cloning the child, it lets the child run first. This makes it so if the child calls exec() to load a new program into its address space, it will do so immediately, avoiding the situation where the parent runs, writes to one or more pages triggering a copy, and then the child runs exec() meaning that copy was for nothing. The current macro acquires the task_struct of the running process. current’s implementation varies from platform to platform with some systems storing the process descriptor in a register and others, like x86, placing it at the bottom of the process’s stack. Terminating a process doesn’t necessarily mean it’s gone. Terminated processes enter a zombie state. Once the process’s parent acknowledges the return code of the terminated child process, then the child process gets reaped. Zombie processes whose parent has terminated are automatically re-parented. In the worst case they’re made direct children of the init process and then released. Process Scheduling Linux supports multiple scheduling algorithms via its scheduling classes. The big schedule() function selects the next task to run by running the highest priority scheduler class with a runnable task. The O(1) scheduler and Completely Fair Scheduler (CFS) are the most well known scheduling algorithms in the kernel. nice values are a measure of how nice a process is to the others. nice values range from -20 to 19. A lower value means a process is less nice. The less nice a process is the higher priority it has and vice versa. The default nice value in Linux is 0. CFS is the dominant scheduling algorithm. Also known as SCHED_OTHER.
CFS is unique in that it optimizes for fairness by giving processes a proportion of the CPU’s time. That is, there is no hardcoded timeslice (other than a lower bound on the smallest amount of time any process could be allocated). CFS has two key parameters: target latency and minimum granularity. Target latency is the period over which CFS aims to run every runnable process at least once, approximating the infinitely small scheduling duration each process would get in an ideal system. Minimum granularity is a floor that’s set on the timeslice since in a real system you can’t have infinitely small timeslices. The nice values weight the proportion of CPU time each process gets. Each process runs for a “timeslice” proportional to its weight divided by the total weight of all runnable threads. See pages 50-51 for an explanation of the benefit. At a high level, CFS tracks each process’s virtual runtime (that is, the amount of time a process has spent running). A red-black tree (a height balanced binary search tree) stores the vruntimes. The process chosen to run next is the process with the smallest vruntime (that is, the leftmost process in the tree). Instead of incurring a \(\mathcal{O}(\log n)\) cost to retrieve this process, the kernel caches the leftmost node. sched_entity is the structure used by the kernel to account for a process’s scheduling. sched_entity is a field in the process descriptor task_struct. The vruntime is a time in nanoseconds weighted by the number of runnable processes. A context switch involves two things: switching the memory mapping context, meaning swapping one process’s pages for another’s, and switching the CPU context, meaning loading the registers with the state of the process that’s now set to run. Both of these steps are platform specific! need_resched is a per process flag that tells the kernel whether it’s time to switch processes. It’s checked on a return to userspace after a system call or after an interrupt gets serviced. The Linux kernel is preemptible. The kernel will only preempt kernel tasks if they don’t hold a lock. A task that doesn’t hold a lock is reentrant. Kernel preemption can occur: When an interrupt handler exits, before returning to kernel-space. When kernel code becomes preemptible again. If a task in the kernel explicitly calls schedule(). If a task in the kernel blocks (which results in a call to schedule()). Page 65 gives an overview of how real-time priorities merge with nice values. Real-time priorities range from [0, 99]. Blended in are nice values which go from [100, 139]. System Calls System calls are the only interface to the kernel provided to userspace. The kernel maintains a system call table. The table is architecture specific. If you add a new syscall, you have to add it to each architecture’s syscall table! When a userspace application makes a system call, an exception gets triggered and the system call handler gets executed. The syscall handler is architecture specific. It will typically read the syscall code and parameters directly from CPU registers (meaning the caller loaded the registers before trapping). A syscall return value also gets sent back via a specific register. System calls must be careful to validate all userspace parameters. Use copy_from_user()/copy_to_user() when reading/writing data between address spaces. They both block! Call the capable() function beforehand to check that the caller has the right permissions to do what they’d like to do. It’s rare to add a new system call. Prefer exposing kernel info using files in sysfs or appropriate drivers with read()/write() implemented.
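Returning to the CFS notes above, the pick-the-smallest-vruntime idea can be modeled in userspace with an ordered map; this is only an analogy (the kernel uses a red-black tree of sched_entity nodes and caches the leftmost node), not kernel code, and the task names are invented:

```cpp
#include <iostream>
#include <map>
#include <string>

int main() {
    // Map from vruntime (in nanoseconds) to task name. An ordered map keeps the
    // smallest vruntime first, much like the leftmost node of CFS's red-black tree.
    std::map<long long, std::string> runqueue = {
        {5'000'000, "editor"}, {12'000'000, "compiler"}, {7'500'000, "shell"},
    };

    for (int tick = 0; tick < 3; ++tick) {
        auto leftmost = runqueue.begin();              // task that has run the least
        std::cout << "run " << leftmost->second << '\n';
        long long new_vruntime = leftmost->first + 3'000'000;  // pretend it ran for 3 ms
        std::string name = leftmost->second;
        runqueue.erase(leftmost);
        runqueue[new_vruntime] = name;                 // reinsert with updated vruntime
    }
}
```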
Kernel Data Structures The kernel includes linked lists, queues, maps, and binary trees for use by the developers. No need to write your own. The linked list provided is a circular doubly linked list. The struct list_head type gets embedded in a structure of your own. Then, a head node gets created using LIST_HEAD macros, finally you use the list manipulation functions and macros to add/remove/iterate. You can implement a queue or stack using the list API. Queues called kfifo are actually pretty vanilla and work much like you would expect in say C++. The maps are special. They’re like std::map in that the data gets sorted by key. Keys are always UIDs. The values in the map are void pointers. Under the hood, the map gets implemented using a balanced search tree of some sort. The tree structure provided by the kernel is the red-black self balanced BST. If using this structure, you have to handle search and insertion yourself! Look up examples cause it ain’t trivial. Interrupts and Interrupt Handlers Interrupts are signals generated by hardware routed to the processor. The processor signals the kernel so that it may service the interrupt. Interrupts have a number assigned to them and a interrupt service routine that’s registered in the kernel to service specific IRQ numbers. Interrupts execute in an “atomic” context. Blocking isn’t allowed in an ISR. A interrupt can interrupt another executing interrupt! Interrupts get split into top and bottom halves. The top half does time-critical work. It’s meant to be quick and do just enough to service the HW and then return control to the kernel/process that was previously running. The bottom half is responsible for the actual processing of the received data and does the heavy lifting. request_irq() is the function used to register a interrupt handler. Interrupt handlers register from within the corresponding device driver. Interrupts are often shared. You enable interrupt sharing with the IRQF_SHARED flag. Interrupt handlers in Linux don’t need to be reentrant. When an IRQ line is gets serviced by a handler, that line gets masked out by the processor meaning another interrupt of that type can’t come in over the line. Interrupt handlers return IRQ_NONE or IRQ_HANDLED. IRQ_NONE gets returned when the interrupt handler detects an interrupt for which its device wasn’t the originator. IRQ_HANDLED gets returned if the interrupt handler was correctly invoked, and its device did indeed cause the interrupt. Most modern day devices provide a means for a driver to check whether the received interrupt was theirs. If it wasn’t, the ISR returns IRQ_NONE. Interrupt handlers can’t sleep/block! Interrupt handlers historically shared their stack with the interrupted process. Nowadays, the interrupt handlers have their own stack that’s one page in size. You can enable/disable interrupts on the current processor. This is typically done to support synchronization. Use local_irq_save() and local_irq_restore(). You must call these functions in the same stack frame (that is, within the same function)! You can disable an IRQ number for the entire system. You usually do this to configure a device. These are usually found in legacy ISA devices. Newer PCI devices share interrupts. Disabling interrupts for all devices on a line is not a good idea. Bottom Halves and Deferring Work The reason for deferring work to a bottom half is in large part to reduce the amount of time the system is operating without interrupts. 
When an interrupt gets serviced, interrupts on that line get disabled across all CPUs. Worse yet, interrupt handlers can disable all interrupts on the local processor plus the interrupt of interest on all processors. Separating interrupt handling into two halves minimizes system latency. When to perform tasks in the top half: If the work is time sensitive, perform it in the interrupt handler. If the work relates to the hardware, perform it in the interrupt handler. If the work needs to guarantee that another interrupt (particularly the same interrupt) doesn’t interrupt it, perform it in the interrupt handler. For everything else, consider performing the work in the bottom half. The bottom half facilities provided by the kernel are softirqs/tasklets and workqueues. softirqs are statically allocated bottom halves that can run on any CPU simultaneously. Tasklets are flexible, dynamically created bottom halves built on top of softirqs. Two different tasklets can run concurrently on different processors, but two of the same type of tasklet can’t run simultaneously. Note tasklets have nothing to do with tasks/processes! Workqueues use kthreads under the hood and run in process context. Use workqueues if you need the ability to block/sleep. Prefer softirqs for performance critical applications. They take more care to implement because they can run concurrently. You also must register them statically. Tasklets are more common for bottom half handling. Use a softirq only if you want the bottom half to run on more than one processor simultaneously and are ready to likely deal with per processor variables and what that entails. The kernel enforces a limit of 32 registered softirqs. In reality, only about 9 of those 32 softirqs are in use today. The others are held in reserve and can be taken by a programmer looking to implement a new softirq. A softirq never preempts another softirq. The only event that can preempt a softirq is an interrupt handler. In fact, softirqs get processed in sequence in the do_softirq() function. Pending softirqs get checked for and executed in the following places: In the return from hardware interrupt code path. In the ksoftirqd kernel thread. In any code that explicitly checks for and executes pending softirqs, such as the networking subsystem. Linux builds tasklets on top of the softirq system. The softirq flags differentiate between high and low priority tasklets. If the do_softirq() function finds a pending HI_SOFTIRQ or TASKLET_SOFTIRQ, then it will call the associated softirq action which happens to be one of tasklet_action() or tasklet_hi_action(). Either function will iterate over all of its tasklets, executing each one only if it isn’t already running on another processor. Tasklets with differing types can run concurrently! Tasklets can’t sleep/block. Tasklets run with interrupts enabled so be careful if you share data with an interrupt handler. If the system gets overloaded with softirqs, the kernel will spawn ksoftirqd/n threads where n is the processor number. Idle CPUs will be able to service the softirqs. Work queues defer work as well. The most critical thing is that work queue tasks execute in a process context and can sleep/block. Work queues typically use generic kernel threads called worker threads. Worker threads have the label events/n where n is the CPU number. Work queues create a kernel thread on your behalf. Normal driver writers have two choices. First, do you need a schedulable entity to perform your deferred work? Do you need to sleep for any reason?
Then work queues are your only option. Otherwise, use tasklets. Only if scalability becomes a concern do you investigate softirqs. Bottom halves can get disabled using the local_bh_enable/disable() functions. Bottom halve disabling usually comes up when process context code and bottom half code share data. You’ll need to disable bottom half processing and acquire a lock before accessing the shared data. An Introduction to Kernel Synchronization The kernel provides facilities that support atomic variables. The kernel provides various forms of locks that protect a critical section. Concurrency is the root of all evil. There’s technically two types of concurrency: pseudo and true. Pseudo concurrency is when two processes/tasks get interleaved (perhaps on a single processor) creating the effect of unprotected, concurrent access. True concurrency is when processes/tasks run simultaneously on separate processors and access shared data. The kernel has many causes of concurrency: Interrupts: An interrupt can occur asynchronously at almost any time, interrupting the current executing code. Softirqs and tasklets: The kernel can raise or schedule a softirq or tasklet at almost any time, interrupting the current executing code. Kernel preemption: Because the kernel is preemptive, one task in the kernel can preempt another. Sleeping and synchronization with userspace: A task in the kernel can sleep and thus invoke the scheduler, resulting in the running of a new process. Symmetrical Multiprocessing: Two or more processors can execute kernel code at exactly the same time. Code that’s safe from concurrent access from an interrupt handler is interrupt-safe. Code that’s safe from concurrency on symmetrical multiprocessing machines is SMP-safe. Code that’s safe from concurrency with kernel preemption is preempt-safe. Deadlock either self deadlock (double locking) or the ABBA deadlock is a real problem. The solution is to acquire resources with a fixed order and document that order. Lock contention can ruin performance. You need to balance how coarse/fine locking code is. Doing so can lend itself to making your code more scalable. Kernel Synchronization Methods The kernel offers the atomic_t and atomic64_t types along with a series of inlined functions that initialize, increment, decrement, etc. the values. Each architecture guarantees the atomic_t type. The atomic64_t is only implemented by 64-bit architectures and should be avoided unless writing architecture specific code that relies on 64-bit operations. Atomic, bitwise operations are also provided. The bitwise ops work on raw memory or pointers directly. There are non-atomic versions of the bitwise operations as well. If you don’t have a requirement for atomicity, you should use the non-atomic versions because they’re faster. The kernel implements the classic busy waiting locks: spinlocks. Spinlocks are the only locks that you can use in interrupt handlers since they don’t cause the thread to sleep. When using a lock in an interrupt handler, one must disable interrupts. The spinlock interface in the kernel provides convenience functions that lock/unlock the lock and saves/restores the interrupt context. This prevents the double acquire deadlock from occurring. Because a bottom half might preempt process context code, if data gets shared between a BH process context, you must protect the data in process context with both a lock and the disabling of bottom halves. 
Because an interrupt handler might preempt a BH, if data gets shared between the two, you must both obtain a lock and disable interrupts. With reader-writer spinlocks, one or more readers can concurrently access shared data. When the writer acquires the lock, the writer gets exclusive access and the readers wait. Readers have priority in RW spinlocks! That is, it’s possible to starve the writer with enough readers! The kernel implements counting semaphores. A semaphore with a count of 1 is a binary semaphore (AKA mutex). When downing a semaphore, prefer down_interruptible() because the down() function will place the waiting task in the TASK_UNINTERRUPTIBLE state, which is likely not ideal. RW semaphores, similar to the RW spinlocks, are available. The RW semaphores place waiting tasks in an uninterruptible sleep! The kernel implements the mutex locking mechanism. You can think of the mutex as something separate from the binary semaphore previously mentioned. Unlike the semaphore, you can only unlock a mutex from the context in which you locked it. The kernel mutex doesn’t support recursive locking. There’s a special completion variable type. Completion variables make it possible to signal between threads when an event has occurred. They’re a lightweight alternative to semaphores. Sequential locks are RW locks that give preference to the writer. That is, the readers can never starve the writers. You can enable/disable preemption. See pages 201-202 for the reasoning. Timers and Time Management The hardware implements a system timer whose frequency relates to a digital clock, CPU frequency, etc. When the timer goes off, an interrupt gets sent to the kernel. The kernel knows the preprogrammed tick rate so it knows the time between two successive timer interrupts. This is a tick and is equal to \(1 / \text{tick rate}\). The kernel uses this tick to track both wall clock time and system uptime. The timer interrupt performs the following tasks. Note some of these are executed every tick, others every \(N\) ticks: Update the system uptime. Update the time of day. On an SMP system, ensuring balance in the scheduler runqueues and, if unbalanced, balancing them. Running any dynamic timers that have expired. Update resource usage and processor time statistics. Tick rate is in units of HZ. Never hardcode the tick rate; use the kernel provided APIs for accessing the value. Pros of a higher tick rate: The timer interrupt has a higher resolution and, consequently, all timed events have a higher resolution. The accuracy of timed events improves. System calls such as poll() and select() that optionally employ a timeout execute with improved precision. Measurements, such as resource usage or system uptime, get recorded with a finer resolution. Process preemption occurs more accurately. Cons of a higher tick rate: Increased power consumption. Potential cache thrashing. Increased overhead from the timer interrupt handler getting triggered. The global variable jiffies holds the number of ticks that have occurred since system boot. jiffies prototypes to unsigned long volatile jiffies. On 32-bit architectures it’s 32 bits; on 64-bit architectures, 64 bits. You have to be cautious of the fact that the jiffies value may wrap around!
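The kernel macros listed in the next paragraph sidestep wraparound by comparing a signed difference rather than the raw counters; here is a userspace sketch of that trick (an illustration of the idea, not the kernel’s implementation):

```cpp
#include <cstdint>
#include <iostream>

// The idea behind the kernel's time_after()-style macros: compare the signed
// difference of two unsigned counters so the result stays correct across wraparound.
bool time_after(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(b - a) < 0;
}

int main() {
    std::uint32_t almost_wrapped = 0xFFFFFFF0u;  // counter just before wrapping
    std::uint32_t wrapped        = 0x00000010u;  // counter shortly after wrapping

    std::cout << std::boolalpha
              << (wrapped > almost_wrapped) << '\n'            // false: naive compare is fooled
              << time_after(wrapped, almost_wrapped) << '\n';  // true: wrapped really is later
}
```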
To avoid issues with wrapping, use the kernel provided macros when comparing to jiffies: time_after(unknown, known) time_before(unknown, known) time_after_eq(unknown, known) time_before_eq(unknown, known) Architectures provide two pieces of HW for timekeeping: the system timer and the real-time clock (RTC). The RTC is a nonvolatile device for storing the system time. The RTC continues to track time even when the system is off by way of a small battery. On boot, the kernel reads the RTC value into the xtime variable. This initializes the wall time. The system timer’s key job is to provide a source of timer interrupts. What drives the system timer is platform dependent. Some system timers are programmable to specific rates. The timer interrupt has two parts: architecture dependent and independent routines. The architecture dependent routine gets registered as the ISR. Its tasking is platform specific. That said, they all do share some common functions: Obtain the xtime_lock lock, which protects access to jiffies_64 and the wall time value, xtime. Acknowledge or reset the system timer as required. Periodically save the updated wall time to the RTC. Call the architecture-independent timer routine, tick_periodic(). The architecture independent routine, tick_periodic(), performs much more work: Increment the jiffies_64 count by one. Update the resource usages, such as consumed system and user time, for the currently running process. Run any dynamic timers that have expired. Execute scheduler_tick(). Update the wall time, which gets stored in xtime. Calculate the infamous load average. See page 222 for a description of how to use the timer API. Always use mod_timer() to update an active/inactive timer. If you don’t, races may occur. If deactivating an active timer, prefer del_timer_sync() over del_timer(). As the name suggests, the sync version waits for an associated timer handler running on another CPU to complete before returning. This is a blocking call so don’t use it from an interrupt context! Expired timers’ handlers get run by the BH of the timer interrupt, implemented as the TIMER_SOFTIRQ softirq. If you need a short delay (think microsecond delay) in a busy loop, use the udelay(), ndelay(), and mdelay() functions. A better solution is to call schedule_timeout(), passing in the amount of time you would like to sleep in jiffies. The only guarantees here are that you don’t waste CPU time spinning and that your task will sleep at least as many jiffies as requested. Only call this function from a process context. Memory Management The MMU deals with memory in terms of pages. The page struct tracks all physical pages in the system. It describes physical memory but not its contents. The goal is to indicate to the kernel whether a page is free. If a page is not free, the kernel can query the structure’s fields to know who owns it: userspace processes, dynamically allocated kernel data, static kernel code, etc. The kernel instantiates a page struct per physical page. It’s a tiny bit wasteful of memory. Memory divides into zones. Below are four of the most popular zones: ZONE_DMA: This zone contains pages that can undergo DMA. ZONE_DMA32: Like ZONE_DMA, this zone contains pages that can undergo DMA. Unlike ZONE_DMA, only 32-bit devices can access these pages. On some architectures, this zone is a larger subset of memory. ZONE_NORMAL: This zone contains normal mapped pages. ZONE_HIGHMEM: This zone contains “high memory,” which are pages not permanently mapped into the kernel’s address space.
Memory Management The MMU deals with memory in terms of pages. The page struct tracks all physical pages in the system. It describes physical memory but not its contents. The goal is to indicate to the kernel whether a page is free. If a page is not free, the kernel can query the structure’s fields to know who owns it: userspace processes, dynamically allocated kernel data, static kernel code, etc. The kernel instantiates a page struct per physical page. It’s a tiny bit wasteful of memory. Memory divides into zones. Below are four of the most popular zones: ZONE_DMA: This zone contains pages that can undergo DMA. ZONE_DMA32: Like ZONE_DMA, this zone contains pages that can undergo DMA. Unlike ZONE_DMA, only 32-bit devices can access these pages. On some architectures, this zone is a larger subset of memory. ZONE_NORMAL: This zone contains normal mapped pages. ZONE_HIGHMEM: This zone contains “high memory,” which are pages not permanently mapped into the kernel’s address space. The kernel attempts to allocate memory from the appropriate zone. That said, if memory is constrained, the kernel can pull memory from a different zone. However, the kernel will never grab pages from two separate zones to satisfy a single request. Which zones are available is architecture dependent. Some architectures have only ZONE_DMA and ZONE_NORMAL because they can address the entire physical address space. Others, like x86-32, have all four. struct page* alloc_pages(gfp_t gfp_mask, unsigned int order) is the kernel API for acquiring \(2^{\text{order}}\) contiguous pages. You can call the void* page_address(struct page* page) API to get the logical address of a page. Use unsigned long get_zeroed_page(unsigned int gfp_mask) to get the address of a single, zeroed-out page. Table 12.2 shows all the low-level page allocation functions. There are analogous page-freeing functions for returning the pages that were acquired. Note, page allocation may fail. Try to allocate pages early and always check for allocation failure. The low-level page allocation functions only make sense to use if you need page-sized chunks of memory. kmalloc()/kfree() is appropriate for allocating/freeing byte-sized chunks of memory. It behaves much like malloc()/free(). The only difference is the added gfp_t flags parameter, which controls how the allocation behaves (a minimal usage sketch appears at the end of this section). gfp stands for get free pages. gfp flags fall into three categories: Action Modifiers: Specify how the kernel allocates the memory. Zone Modifiers: Specify which zone the memory will come from. Types: Act as a combination of action and zone flags. There are a couple of these, like GFP_KERNEL, which are defined as an OR of one or more action and zone flags. You only want to deal with type flags. Table 12.6 on Pg. 241 shows the available type flags and their descriptions. vmalloc() acquires logically contiguous memory. It’s not typical that one uses vmalloc() due to its performance overhead and because HW often needs the memory it’s handed to be physically contiguous. The function can sleep, so you may only call it from a process context. Note, kmalloc() provides both physically and logically contiguous memory! The slab layer acts as a generic data structure-caching layer. It builds on the concept of free lists, where programmers maintain one or more lists containing structures of commonly dynamically allocated types. The slab layer divides different objects into groups called caches, each holding a different type of object. There’s one cache per object type. Caches divide into slabs where each slab is one or more contiguous pages of memory. Each slab contains some number of objects of a specific type. Slabs are always in one of three states: full, empty, or partial. Allocations are satisfied from partial slabs before empty slabs. You can make your own slab allocator caches for some custom object type. Kernel stack size is customizable, ranging from 1 to 2 pages of memory. In the past, interrupt handlers shared the running process’s stack. Nowadays, each interrupt handler gets its own page for a stack. This requires one page per processor. Stack overflows occur silently in the kernel. There is no check for it! Keep stack usage to a few hundred bytes. If you need more memory, dynamically allocate it! You can map a limited number of pages from high memory into the kernel’s address space. Do this sparingly. Blocking and nonblocking interfaces are available to do this mapping.
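Circling back to kmalloc()/kfree(), a minimal sketch of the usual pattern; the buffer size and function name are made up, and allocation failure is always checked:

```c
#include <linux/errno.h>
#include <linux/slab.h>

static int example_alloc(void)
{
        char *buf;

        /* GFP_KERNEL: normal allocation from process context; may sleep. */
        buf = kmalloc(128, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;

        /* ... use buf ... */

        kfree(buf);
        return 0;
}
```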
The kernel provides interfaces for statically and dynamically allocating per-cpu variables. There are also APIs for getting/putting per-cpu variables that take care of any preemption issues. Never access a per-cpu variable across CPUs without some form of synchronization. Reasons to use per-cpu variables: A reduction in locking requirements. Improved cache behavior. Access from both interrupt and process context. Never sleep when working on a per-cpu variable! The Virtual Filesystem The virtual filesystem (VFS) provides an interface by which one can use the usual system calls (for example, open(), write(), read(), etc.) to interact with myriad filesystems and devices. You never need to rewrite or recompile your program to work with an ext2 versus ext4 filesystem thanks to the VFS. The VFS provides an abstraction. New filesystems must implement the VFS interface and use its data structures to “plug in” to the kernel. An inode or index node is just file metadata (for example, time of creation, owner, permissions, etc.). The VFS has an OOP architecture. The four primary object types of the VFS are: superblock: Represents a specific mounted filesystem. inode: Represents a specific file. dentry: Represents a directory entry, which is any component of a path (file or directory). file: Represents an open file associated with a process. Each object has a *_operations structure which contains function pointers to specific operations on that object (a rough sketch of one appears at the end of this section). The kernel provides a default implementation of a few of the methods. However, filesystem developers likely have to implement their own operations so that they “make sense” for their specific use case. Each filesystem implements the superblock object and uses it to store information describing that specific filesystem (AKA the filesystem control block). An inode represents each file on the filesystem, but the inode objects are constructed in memory only as files get accessed. inodes can even represent special files like pipes, block devices, or char devices (but only one at a time). Unix inodes typically separate file data from its control information (for example, metadata). dentries get cached in the dcache. When a file path is first resolved, each component in the path is a dentry and gets cached in the dcache. dentry accesses exhibit temporal and spatial locality similar to program instructions and data. This makes the dcache effective in reducing file access times. Each dentry associates with an inode. The dcache also serves as an icache since actively used inodes get pinned along with their dentry in the dcache.
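To make the *_operations idea concrete, here’s a rough sketch of how a driver or filesystem might fill in a file_operations table; the my_* handlers are hypothetical:

```c
#include <linux/fs.h>
#include <linux/module.h>

/* Hypothetical handlers; the VFS calls these when userspace invokes
 * open(), read(), and close() on the associated file. */
static int my_open(struct inode *inode, struct file *filp)
{
        return 0;
}

static ssize_t my_read(struct file *filp, char __user *buf,
                       size_t count, loff_t *ppos)
{
        return 0; /* EOF */
}

static int my_release(struct inode *inode, struct file *filp)
{
        return 0;
}

static const struct file_operations my_fops = {
        .owner   = THIS_MODULE,
        .open    = my_open,
        .read    = my_read,
        .release = my_release,
};
```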
The Block I/O Layer Block devices are hardware devices distinguished by random access of fixed-size chunks of data. Filesystems are mounted on block devices. As a user, you interact with the block device via the filesystem. In contrast to block devices, char devices provide a sequential stream of character data. It doesn’t make sense to randomly access the data of a char device; when you do need random access, that’s where the block device comes in. The smallest addressable unit on a block device is a sector. Typically a power of two, with the most common size being 512 bytes. Although a device is addressable at the sector level, the kernel usually operates on blocks. The block is an abstraction of the filesystem: it can only be a multiple of the sector size, no larger than the page size, and must be a power of two. Each block gets its own buffer. The buffer is an in-memory representation of the block. Each block gets a buffer head, which is essentially a descriptor describing the block (which device owns the buffer, page info, etc.). The buffer head’s bh_state flag (see page 292) tells you the state of the buffer. Within bh_state there are a number of bits reserved for driver authors to use. Block buffers benefit from storing block data on a page. That said, the buffer head approach is a bit wasteful since you need multiple buffers/buffer heads to, for example, write large amounts of data. The bio struct is the block buffer’s replacement. The struct represents an active block IO operation as a list of segments. A segment is a chunk of a buffer that’s contiguous in memory. Segments make scatter/gather IO possible in the kernel. See page 295 for an illustration. With the bio struct, the buffer is now represented as an array of bio_vec structs where each bio_vec includes a page, offset, and length (see the sketch below). The full array of bio_vec structs is the buffer. The way to think of all this is that each block IO request gets represented as a bio struct. Each request is one or more blocks stored in the array of bio_vec structures of the bio struct (these are the segments). As the block IO layer processes the request, the bi_idx field gets updated to point to the next bio_vec struct. Buffer heads are still relevant. They’re required for describing the device’s blocks.
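Roughly, a bio_vec from the kernel of the book’s era looks like the following; treat this as a simplified sketch rather than an exact, current definition:

```c
/* One segment of a block I/O operation: a piece of one page. */
struct bio_vec {
        struct page  *bv_page;    /* page holding the data */
        unsigned int  bv_len;     /* length of the segment in bytes */
        unsigned int  bv_offset;  /* offset of the data within the page */
};

/* A bio then carries an array of these (bi_io_vec), a count (bi_vcnt),
 * and a cursor (bi_idx) pointing at the next segment to process. */
```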
Block devices maintain request queues to store their pending block IO requests. The filesystem adds requests to the queue, and they get dispatched to the block device’s driver for processing. The kernel includes a block IO scheduler to merge and sort requests. The idea here is that IO performance would be terrible if requests were simply serviced in the order received. You want to reduce seek times and optimize the order in which requests are serviced. The IO scheduler does the latter by virtualizing the block devices similar to how the process scheduler virtualizes the CPU. The IO scheduler works by managing a block device’s request queue. It decides the order of requests in the queue and at what time each request gets dispatched to the block device. It optimizes on seeks to improve global throughput. That is, the IO scheduler doesn’t care much for fairness. The IO scheduler performs two primary actions to minimize seeks: Merging is the coalescing of two or more requests into one. Sorting refers to how the IO scheduler keeps requests in the queue sorted sector-wise so that all seeking activity moves as close to sequential as possible. There are many different IO scheduler algorithms supported by the kernel: The Linus Elevator: Performs both the merge and sort operations. Uses an insertion sort to maintain the sector ordering of the request queue and will merge adjacent sectors on insertion. This algorithm can starve requests: if requests cluster around one area of the disk, requests far from that area may never get serviced. The Deadline IO Scheduler: This algorithm uses three queues: a queue sorted on sector just like the previous one, a write FIFO, and a read FIFO. The write/read FIFO queues are essentially sorted on time. Write requests have an expiration of about 5 seconds into the future while read requests have an expiration delta of about 500 milliseconds. When the Deadline scheduler dispatches a request, it first checks if there is an expired request in one of the FIFO queues before issuing a request from the sorted queue. In this way, you avoid starvation. Also note the bias towards read requests. If you delay read requests significantly, application performance degrades notably (imagine all the time spent blocking on read())! The Anticipatory IO Scheduler: This algorithm is identical to the Deadline IO scheduler except there is an added heuristic: the anticipation heuristic. It’s meant to resolve the delay in write-heavy systems that occasionally read. In that scenario with a Deadline IO scheduler, the disk head would bounce back and forth because the infrequent reads get serviced immediately, each triggering a long seek. The trick here is that after servicing a read request, the algorithm waits for a configurable amount of time before returning to the previous request. This makes it such that if another read request comes in during the wait, it can be serviced immediately, reducing the time spent seeking. The algorithm uses per-process block IO statistics to improve its behavior over time. The algorithm avoids starvation, reduces read latency, and increases overall throughput through the reduction of seeks/seek time. The Completely Fair Queueing IO Scheduler: This one’s a bit different than the rest. CFQ gives each process an IO request queue. The queues are serviced round-robin. The number of requests consumed at each queue visit is configurable. CFQ works well with specific workloads, particularly those associated with multimedia. NOOP IO Scheduler: This algorithm does little more than merge adjacent requests. Beyond that, it’s basically a FIFO algorithm. NOOP works well with block devices such as flash memory which have no seek overhead and thus don’t need all the bookkeeping and added overhead of the other algorithms. Debugging Unsurprisingly, printk() is one of the key debug tools. printk() is robust in that you can call it anywhere and anytime within kernel code. printk() supports eight different log level macros. Use the one that’s appropriate for the situation (for example, if debugging, use KERN_DEBUG). You will need to set the console log level accordingly to see the messages in the kernel logs. A kernel oops occurs when the kernel encounters an error condition from which it can’t proceed/recover. The oops message contains info such as the contents of CPU registers, a stack trace, and more. Sometimes the oops that’s printed isn’t decoded (that is, the stack trace is just a bunch of addresses). You can save the oops message in a text file. Then, using the ksymoops program, you can decode the oops message. In place of ksymoops shenanigans, you can enable CONFIG_KALLSYMS_ALL at kernel config time. This decodes the entire oops message at the cost of an increased kernel image size (probably worth it unless you need a minimally sized kernel image). In the “Kernel Hacking” section of the kernel config editor, you can enable many debug options. Enable as many as needed to solve your problem. The BUG_ON(condition) macro triggers an oops purposefully. panic() is another developer tool that halts the kernel at the call site. dump_stack() will do what the name suggests. Useful with an added printk() message to give context to the dump. You can use the kgdb features of the kernel to run gdb on a live kernel. See the StarLabs article for the details. Git bisect is your friend when tracking down kernel bugs.
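A few of the debugging helpers above in action; the function and message text are made up:

```c
#include <linux/bug.h>
#include <linux/kernel.h>

static void debug_examples(int value)
{
        /* Low-priority debug message; raise the console log level (or use
         * dynamic debug) to actually see KERN_DEBUG output. */
        printk(KERN_DEBUG "debug_examples: value=%d\n", value);

        /* Purposefully oops if an invariant is violated. */
        BUG_ON(value < 0);

        /* Print a stack trace with some context. */
        if (value > 1000) {
                printk(KERN_WARNING "unexpectedly large value\n");
                dump_stack();
        }
}
```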
Portability One of the key goals of Linux is portability. The majority of the core/subsystem code in the kernel is portable/platform agnostic. Architecture-specific code lives in arch/. Some code must be platform specific. For example, context switch code for registers and address space switches is platform specific. The kernel has a number of APIs that each platform must implement. For example, each platform implements switch_to() and switch_mm(). Architectures that support both 32 and 64-bit word sizes have their codebases tied together under one architecture. For example, x86 holds x86-32 and x86-64 platform code. The long type always has a size equal to the platform’s word size. Don’t assume the size of long to be 32 bits or 64 bits. Use the BITS_PER_LONG macro to get the word size portably. Only use opaque types such as pid_t and atomic_t as specified by their API. Never assume anything about their size or underlying type. Don’t convert opaque types to some C built-in type. Use the fixed-size types when appropriate (for example, u32, s32, u8, s8, etc.). You can’t export the fixed-size types to userspace. Instead, you must use the userspace-friendly versions that are prefixed with double underscores (for example, __u32). On an N-bit system, data should be (N/8)-byte aligned. For example, a 32-bit system usually has data 4-byte aligned, meaning the bottom two bits of each such address are zero. Alignment is usually handled by the compiler and not a concern to the programmer. One place worth being aware of alignment is in structures, where the compiler adds padding automatically to meet alignment requirements. Sometimes you can avoid the overhead of padding by rearranging the members of the struct (see the sketch at the end of this section). The compiler will never reorder structure members on your behalf! If you have concerns over endianness and need to convert to/from the CPU ordering and LE/BE, use the kernel’s endianness conversion API. Never assume the frequency of the timer interrupt. Always use the HZ macro to compute an estimate for time. Never assume the page size. Use the PAGE_SIZE macro instead. If you need the number of bits by which to right shift an address to derive its page frame number, use the PAGE_SHIFT macro. Always assume and program for an SMP/preempt/highmem system. This keeps you safe in any kernel/HW configuration.
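As referenced above, a hypothetical illustration of struct padding, shown as a tiny userspace program for concreteness; the exact sizes depend on the ABI, the figures below assume a typical LP64 (64-bit) system:

```c
#include <stdio.h>

/* Likely 24 bytes on LP64: the compiler inserts 7 bytes of padding after
 * 'c' and 4 bytes of trailing padding after 'i' so that 'p' and the
 * struct as a whole stay 8-byte aligned. */
struct padded {
        char  c;
        void *p;
        int   i;
};

/* Reordering from largest to smallest member usually shrinks this to
 * 16 bytes: 8 (p) + 4 (i) + 1 (c) + 3 bytes of trailing padding. */
struct repacked {
        void *p;
        int   i;
        char  c;
};

int main(void)
{
        printf("padded=%zu repacked=%zu\n",
               sizeof(struct padded), sizeof(struct repacked));
        return 0;
}
```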

April 4, 2024 · 31 min

Linux Driver Development for Embedded Processors 2nd Edition

This is the second installment in a series of posts where I share notes taken while reading an interesting book or article. This post includes the notes made while reading “Linux Driver Development for Embedded Processors” by Alberto Liberal de los Rios. Notes weren’t taken for every chapter, so keep in mind that the book actually covers more topics than what’s shown here. If you are considering buying the book, you might want to check out this review before buying. ...

February 15, 2024 · 7 min

What Every Programmer Should Know About Memory

This is the first installment in a series of posts where I share notes taken while reading an interesting book or article. This post includes the notes made while reading a series of articles by Ulrich Drepper titled “What Every Programmer Should Know About Memory”. Part 1: Introduction There are a number of different computer architectures, each with their own tradeoffs. The commodity HW setup has the CPUs attached via a Frontside Bus (FSB) to a Northbridge and indirectly via the Northbridge to a Southbridge. The Northbridge typically houses the memory controller and attaches directly to RAM. The Southbridge hosts the various buses such as USB, PCI, PCI-E, SATA, etc. The bottleneck in modern systems is often the time required to access memory. An alternative to the commodity setup is to have a memory controller per RAM module. Then you get additional channels and thus increased bandwidth. However, the bottleneck then becomes the speed of the Northbridge. A third alternative is to have RAM directly attached to each CPU. This removes the Northbridge bottleneck. However, there is additional latency when one CPU requires data from the RAM of another. In this scenario, one or more hops are required to access the data. In a real system, the number of hops can be large and the latency noticeable. This kind of architecture is what’s known as Non-Uniform Memory Access (NUMA). There are two key types of RAM: Static RAM and Synchronous Dynamic RAM. Static RAM (SRAM) is more expensive and typically used for CPU caches and the like. ~6 transistors make up an SRAM cell. The cell state doesn’t require recharge and you read/write the state immediately. Of course, the cell requires constant power. A DRAM cell is a transistor and capacitor combo. DRAM cells require recharge about every 64ms. When reading a DRAM cell, sense amplifying circuits help distinguish between a 0 and a 1. This sensing adds significant delay. The advantage of DRAM is the cost and the fact that the form factor of the cell means you can pack many of them on a single die. Memory cells are individually accessed. To cut down on the number of address lines, you arrange cells in a 2D matrix form. A row address is first selected followed by a column address. The S in SDRAM stands for synchronous and means that the RAM runs at a frequency controlled by the memory controller’s clock. This clock determines the speed of the Frontside Bus. Each SDRAM transfer is 8 bytes (64 bits). There are different SDRAM types, each offering different effective bus rates: Single Data Rate SDRAM: There’s a one-to-one mapping between the bus frequency and the data transfer frequency. For example, a 100MHz bus performs 100M transfers per second, or 800MB/s at 8 bytes per transfer. Double Data Rate 1: Double the data transfers per cycle. Data transfers on both the rising and falling edge of a cycle. Also known as a “double-pumped” bus. DDR modules have their transfer rates quoted in bytes. For example, a 100MHz DDR1 module has a data transfer rate of 100MHz × 64 bits × 2 = 1600MB/s. Double Data Rate 2: Here you increase the frequency of the IO buffer (the same IO buffer used in DDR1). The frequency increase on the IO buffer doesn’t cause a large amount of additional power consumption. This leads to a quad-pumped bus. Following the previous example, the transfer rate of a DDR2 module would be 100MHz × 64 bits × 4 = 3200MB/s. Double Data Rate 3: DDR3 is a further revision of DDR2. No real innovation there?
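Collecting the transfer-rate arithmetic above in one place (assuming a 100MHz bus and a 64-bit, i.e. 8-byte, data path):

\[
\begin{aligned}
\text{SDR:}\quad & 100\,\text{MHz} \times 8\,\text{B} \times 1 = 800\,\text{MB/s} \\
\text{DDR1:}\quad & 100\,\text{MHz} \times 8\,\text{B} \times 2 = 1600\,\text{MB/s} \\
\text{DDR2:}\quad & 100\,\text{MHz} \times 8\,\text{B} \times 4 = 3200\,\text{MB/s}
\end{aligned}
\]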
FB-DRAM: Similar to DDR2 except serial lines connect the modules to the memory controller. Fully Buffered DRAM modules run the serial lines at high frequencies. FB-DRAM modules can have more channels per module, more channels per Northbridge/memory controller, and the serial lines are full duplex. The required pin count also drops from 240 for DDR2 to 69 for FB-DRAM. Direct Memory Access (DMA) is in use with many devices. DMA means more competition for the FSB bandwidth. If there’s a lot of DMA traffic, a CPU might stall more than usual when accessing RAM. Part 2: CPU Caches CPUs can have up to 3 levels of cache made of small, fast SRAM. The caches in ascending size are usually named the L1, L2, and L3 cache. It’s often the case there are two L1 caches. One cache stores instructions, the icache; the other stores data, the dcache. L2 and L3 caches store a mix. Caches divide up into cache lines where each line is typically 64 bytes. The number of cycles required to access data increases dramatically as you go up the levels of cache. The cycles required once you reach main memory are much higher than the L3 cache. The icache often exhibits good behavior in spite of the program since most code has good spatial and temporal locality. The icache on some processors caches not the original instruction but the decoded instruction. This can save a significant number of cycles in the CPU’s instruction processing pipeline. The dcache’s behavior can be more directly controlled by the programmer. TBD exactly how, but you can imagine certain data access patterns are more cache friendly. For example, row-major access of a 2D matrix versus column-major (a small sketch appears at the end of this part). On SMP or multicore systems, each core gets its own set of caches. Within a core, however, threads may get their own L1 I/D cache and will share the L2/L3 cache. The sharing of L2/L3 by the threads can be a source of bottlenecks since, if not careful, they will trample each other’s caches. There’s additional HW to balance cache use between two threads but it doesn’t always work that well. The takeaway of this part is to take a look at the architecture of the HW you are running on. Knowing the memory layout can help you organize your program better, both within the source and at the system level. For example, accessing data in a cache friendly way and segregating processes judiciously across the cores of the machine.
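As promised above, a small userspace C sketch contrasting cache-friendly row-major traversal with column-major traversal of the same matrix; the dimensions are arbitrary:

```c
#include <stddef.h>
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static int m[ROWS][COLS]; /* ~4 MB, statically allocated */

/* Walks memory sequentially: consecutive elements share cache lines,
 * so most accesses hit the dcache (and the prefetcher helps). */
static long sum_row_major(void)
{
        long sum = 0;
        for (size_t r = 0; r < ROWS; r++)
                for (size_t c = 0; c < COLS; c++)
                        sum += m[r][c];
        return sum;
}

/* Strides COLS * sizeof(int) bytes between consecutive accesses: each
 * access likely touches a different cache line, causing far more misses. */
static long sum_col_major(void)
{
        long sum = 0;
        for (size_t c = 0; c < COLS; c++)
                for (size_t r = 0; r < ROWS; r++)
                        sum += m[r][c];
        return sum;
}

int main(void)
{
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
}
```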
Part 3: Virtual Memory The primary benefits of virtual memory include freeing applications from having to manage a shared memory space, the ability to share memory used by libraries between processes, increased security due to memory isolation, and being able to conceptually use more memory than might be physically available using the technique of paging or segmentation. Virtual memory requires HW support for translation of virtual addresses to physical addresses. The Memory Management Unit (MMU) assists with translation. A virtual address splits into up to 5 fields, with 4 of those being page table indices and the 5th being an offset into the page itself. The multiple levels of indirection make it possible for multiple processes to have their own page tables; otherwise, the memory cost would be too high. The process of virtual address to physical address translation is costly (up to 4 memory accesses). In response, HW designers have added specialized caches called Translation Lookaside Buffers (TLBs) to cache recent translations. TLBs are separate from the other caches used by the CPU. However, similar to CPU caches, there are instruction TLBs and data TLBs that leverage the spatial/temporal locality of the addresses used in a program. TLBs have their own costs associated with them. Namely, the TLB must be flushed whenever a context switch occurs or when the kernel acquires/relinquishes control of the CPU. There are HW-specific tricks to avoid this overhead. Part 4: NUMA Support This article describes how the topology in a NUMA architecture affects memory access performance. A hypercube topology is most effective. The topology has CPU nodes in powers of two (for example, nodes = \(2^C\)). The \(C\) value dictates the maximum number of interconnects between the different CPUs. The OS is responsible for acknowledging the NUMA architecture and allocating process memory to account for the hops that occur when process memory usage exceeds the memory available at the host node. Some OSes use the strategy of striping process memory. This means memory spreads across all nodes’ memory. The advantage is that no one node is saturated by the high memory demand of a single process, and it makes migrating a process from one node to another more cost effective. The downside is that you have more hops on average. In Linux on a NUMA system, the programmer can select an allocation strategy different from striping for a given process and its children. Part 5: What Programmers Can Do The theme for all memory access is the same: improve locality (spatial and temporal) and align the code and data. Modern CPUs optimize sequential uncached write and (more recently) read accesses. This comes in handy when accessing large data structures that you use only once. The CPU automatically prefetches relevant cache lines when accessing data sequentially. This is where the performance boost comes from. Part 6: More Things Programmers Can Do This section is incomplete. I took notes only on what I felt was most relevant at the time, which was the tips Drepper provides on concurrent optimization: ...

February 14, 2024 · 9 min