What Every Programmer Should Know About Memory

This is the first installment in a series of posts where I share notes taken while reading an interesting book or article. This post includes the notes I made while reading a series of articles by Ulrich Drepper titled “What Every Programmer Should Know About Memory”.

Part 1: Introduction

There are a number of different computer architectures, each with its own tradeoffs. The commodity HW setup has the CPUs attached via a Frontside Bus (FSB) to a Northbridge and, indirectly via the Northbridge, to a Southbridge. The Northbridge typically houses the memory controller and attaches directly to RAM. The Southbridge hosts the various buses such as USB, PCI, PCI-E, SATA, etc. The bottleneck in modern systems is often the time required to access memory. An alternative to the commodity setup is to have a memory controller per RAM module. You then get additional channels and thus increased bandwidth. However, the bottleneck then becomes the speed of the Northbridge. A third alternative is to attach RAM directly to each CPU. This removes the Northbridge bottleneck, but there is additional latency when one CPU requires data from the RAM of another: one or more hops are needed to reach the data. In a real system, the number of hops can be large and the latency noticeable. This kind of architecture is what's known as Non-Uniform Memory Access (NUMA).

There are two key types of RAM: Static RAM (SRAM) and Dynamic RAM (DRAM). SRAM is more expensive and typically used for CPU caches and the like. About six transistors make up an SRAM cell. The cell state doesn't require recharging and you can read/write the state immediately; of course, the cell requires constant power. A DRAM cell is a transistor and capacitor combo. DRAM cells require a recharge about every 64ms. When reading a DRAM cell, sense amplifier circuits help distinguish between a 0 and a 1, and this sensing adds significant delay. The advantages of DRAM are the cost and the fact that the small form factor of the cell means you can pack many of them onto a single die. Memory cells are individually addressed. To cut down on the number of address lines, the cells are arranged in a 2D matrix: a row address is selected first, followed by a column address.

The S in SDRAM stands for synchronous and means that the RAM runs at a frequency controlled by the memory controller's clock. This clock determines the speed of the Frontside Bus. Each SDRAM transfer moves 8 bytes (a 64-bit data path). There are different SDRAM types, each offering a different effective bus rate:

Single Data Rate SDRAM (SDR): There's a one-to-one mapping between the bus frequency and the data transfer frequency. For example, a 100MHz bus performs 100 million transfers per second, or 800MB/s over a 64-bit data path.

Double Data Rate 1 (DDR1): Double the data transfers per cycle. Data transfers on both the rising and falling edge of a cycle, also known as a “double-pumped” bus. DDR modules have their transfer rates quoted in bytes per second rather than in bus frequency. For example, a 100MHz DDR1 module has a transfer rate of 100MHz × 64 bits × 2 = 1600MB/s.

Double Data Rate 2 (DDR2): Here the frequency of the IO buffer is doubled (the same kind of IO buffer used in DDR1). The frequency increase on the IO buffer doesn't cost much additional power and leads to a quad-pumped bus. Following the previous example, the transfer rate of a 100MHz DDR2 module would be 100MHz × 64 bits × 4 = 3200MB/s.

Double Data Rate 3 (DDR3): A further evolution of DDR2, with a lower operating voltage and a faster IO buffer rather than any fundamentally new design.

FB-DRAM: Similar to DDR2 except serial lines connect the modules to the memory controller. Fully Buffered DRAM modules run the serial lines at high frequencies. FB-DRAM allows more modules per channel and more channels per Northbridge/memory controller, and the serial lines are full duplex. The required pin count also drops from 240 for DDR2 to 69 for FB-DRAM.
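To make the transfer-rate arithmetic above concrete, here are the peak rates for the same 100MHz example with an 8-byte (64-bit) data path, one line per generation:

\( \text{SDR:} \quad 100\,\text{MHz} \times 8\,\text{B} \times 1 = 800\,\text{MB/s} \)
\( \text{DDR1:} \quad 100\,\text{MHz} \times 8\,\text{B} \times 2 = 1600\,\text{MB/s} \)
\( \text{DDR2:} \quad 100\,\text{MHz} \times 8\,\text{B} \times 4 = 3200\,\text{MB/s} \)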
Direct Memory Access (DMA) is in use by many devices. DMA means more competition for FSB bandwidth, so if there's a lot of DMA traffic, a CPU might stall more than usual when accessing RAM.

Part 2: CPU Caches

CPUs can have up to three levels of cache made of small, fast SRAM. In ascending size the caches are usually named L1, L2, and L3. It's often the case that there are two L1 caches: one stores instructions (the icache), the other stores data (the dcache). L2 and L3 caches store a mix. Caches are divided into cache lines, where each line is typically 64 bytes. The number of cycles per access increases dramatically as you go up the cache levels, and the cycles required once you reach main memory are much higher still than for the L3 cache.

The icache often exhibits good behavior in spite of the program since most code has good spatial and temporal locality. On some processors the icache holds not the original instructions but the decoded instructions, which can save a significant number of cycles in the CPU's instruction processing pipeline. The dcache's behavior can be more directly controlled by the programmer: certain data access patterns are more cache friendly than others, for example row-major access of a 2D matrix versus column-major access (see the sketch at the end of this part).

On SMP or multicore systems, each core gets its own L1 caches (and often its own L2), while the larger caches may be shared between cores; hyper-threads within a core share all of that core's caches. This sharing can be a source of bottlenecks: if you're not careful, threads will trample each other's cached data. There's additional HW to balance cache use between two threads, but it doesn't always work that well.

The takeaway of this part is to take a look at the architecture of the HW you are running on. Knowing the memory layout can help you organize your program both within the source and at the system level, for example by accessing data in a cache-friendly way and segregating processes judiciously across the cores of the machine.
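As a small illustration of the row-major versus column-major point above, the two functions in this sketch walk the same 2D array and differ only in loop order. The array and its dimensions are made up for the example; this is illustrative code, not code from Drepper's articles:

    #include <stddef.h>
    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static double grid[ROWS][COLS];

    /* Row-major walk: C lays out grid[r][c] row by row, so the inner loop
     * touches adjacent memory. Whole cache lines get used and the hardware
     * prefetcher can stay ahead of the loop. */
    double sum_row_major(void)
    {
        double sum = 0.0;
        for (size_t r = 0; r < ROWS; r++)
            for (size_t c = 0; c < COLS; c++)
                sum += grid[r][c];
        return sum;
    }

    /* Column-major walk: each inner-loop step jumps COLS * sizeof(double)
     * bytes, so nearly every access lands on a different cache line and
     * little of each fetched line is used before it gets evicted. */
    double sum_col_major(void)
    {
        double sum = 0.0;
        for (size_t c = 0; c < COLS; c++)
            for (size_t r = 0; r < ROWS; r++)
                sum += grid[r][c];
        return sum;
    }

    int main(void)
    {
        printf("row-major sum: %f\n", sum_row_major());
        printf("col-major sum: %f\n", sum_col_major());
        return 0;
    }

On most machines the row-major version is noticeably faster even though both functions perform exactly the same arithmetic; the difference is purely how well each loop order uses cache lines and the prefetcher.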
Part 3: Virtual Memory

The primary benefits of virtual memory include freeing applications from having to manage a shared memory space, the ability to share memory used by libraries between processes, increased security due to memory isolation, and being able to conceptually use more memory than is physically available using the technique of paging or segmentation. Virtual memory requires HW support for the translation of virtual addresses to physical addresses; the Memory Management Unit (MMU) assists with this translation. A virtual address splits into up to five fields, with up to four of those being page table indices and the last being an offset into the page itself. The multiple levels of indirection are what make it affordable for every process to have its own page tables; with a single flat table per process the memory cost would be too high. Translating a virtual address to a physical address is costly (up to four memory accesses), so HW designers added specialized caches called Translation Lookaside Buffers (TLBs) to hold recent translations. TLBs are separate from the other caches used by the CPU. However, similar to CPU caches, there are instruction TLBs and data TLBs that leverage the spatial/temporal locality of the addresses used in a program. TLBs have their own costs associated with them. Namely, the TLB must be flushed whenever a context switch occurs or when the kernel acquires/relinquishes control of the CPU. There are HW-specific tricks to avoid this overhead.

Part 4: NUMA Support

This article describes how the topology of a NUMA architecture affects memory access performance. A hypercube topology is most effective: the node count is a power of two, nodes = \(2^C\), where \(C\) is the number of interconnect interfaces per CPU and also the maximum number of hops between any two nodes. The OS is responsible for recognizing the NUMA architecture and allocating process memory to account for the hops that occur when a process's memory usage exceeds the memory available at its host node. Some OSes use the strategy of striping process memory, meaning the memory is spread across all nodes. The advantage is that no single node is saturated by the high memory demand of one process, and it makes migrating a process from one node to another more cost effective. The downside is more hops on average. In Linux on a NUMA system, the programmer can select an allocation strategy other than striping for a given process and its children (see the sketch at the end of these notes).

Part 5: What Programmers Can Do

The theme for all memory access is the same: improve locality (spatial and temporal) and align the code and data. Modern CPUs optimize sequential uncached write and (more recently) read accesses. This comes in handy when accessing large data structures that you use only once. The CPU automatically prefetches the relevant cache lines when data is accessed sequentially, and that is where the performance boost comes from.

Part 6: More Things Programmers Can Do

This section is incomplete. I took notes only on what I felt was most relevant at the time, which was the tips Drepper provides on concurrent optimization: ...
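To tie the Part 4 point about allocation policies to something concrete, here is a minimal sketch using Linux's libnuma userspace API. Treat it as an illustration of one possible interface rather than something taken from Drepper's text; the buffer size and node number are arbitrary placeholders:

    /* Build with: gcc numa_policy.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return EXIT_FAILURE;
        }

        size_t len = 64UL * 1024 * 1024;

        /* Striping/interleaving: spread the pages across all nodes. */
        void *striped = numa_alloc_interleaved(len);

        /* Alternative policy: keep every page on one node (node 0 here) so a
         * process pinned to that node never pays for a remote hop. */
        void *local = numa_alloc_onnode(len, 0);

        if (striped == NULL || local == NULL) {
            fprintf(stderr, "allocation failed\n");
            return EXIT_FAILURE;
        }

        /* ... touch the buffers here ... */

        numa_free(striped, len);
        numa_free(local, len);
        return EXIT_SUCCESS;
    }

Under the hood these wrappers rest on the set_mempolicy(2) and mbind(2) system calls, which can be used directly if you'd rather avoid the library dependency.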

February 14, 2024 · 9 min

Signing Git Commits With GPG

If you’ve been around the open source community long enough, you’ve probably heard of people signing their VCS commits/tags. This post covers the why and how of signing your Git commits. The focus will be on commits but keep in mind that these tips apply equally to tags.

Why Sign Your Commits

The short answer: signing your commits makes it harder for an attacker to impersonate you. Sure, if you work solo on rinky-dink toy projects, having your commits signed isn’t a big deal. Now consider the case where you make commits to an open source project with sensitive code, or at your day job where you make commits and PRs on a product. It might be worth safeguarding those commits just a bit. ...

January 21, 2024 · 6 min

Dotfile Mgmt With GNU Stow

Do you have a bunch of dotfiles? Do you maintain a GitHub repo with all your dotfiles? Whenever you upgrade your machine, do you find yourself manually placing the dotfiles in the right spots in your home directory? If you answered yes to these questions, read on.

Enter GNU Stow

GNU Stow is a dotfile management utility. Stow has all the makings of a varsity athlete: Stow is small (a 32KB Perl script). Stow is simple to use with a solid manpage. Stow doesn’t get in the way of version controlling dotfiles. Real world Stow usage is pretty simple and best explained with an example. Imagine you had your i3wm and Bash configurations stored in your home directory. The layout might look something like this: ...

January 20, 2024 · 2 min

keylogger: A Cross-Platform Keylogger

If you’re familiar with the kbhell application, you might realize that kbhell is about 90% of the way to being a keylogger. Why not finish the job and write a proper, cross-platform keylogger that captures a victim’s every keystroke (for science reasons, of course)?

The Requirements

So there’s the obvious requirement of capturing user keystrokes. When you consider that keyboards have different layouts, that there are different language sets, etc., the task becomes challenging. ...

November 13, 2023 · 7 min

Linux Driver Development for Embedded Processors

If you’ve already read through “Linux Device Drivers”, it may be worth your time to read a more focused Linux driver development textbook. ARM driver development has been popular for some time now and remains relevant today. “Linux Driver Development for Embedded Processors” (ELDD for short) gives a modern look into the development of ARM drivers on Linux. ELDD has a number of selling points: Labs targeting multiple ARM processors: NXP iMX7D, Microchip SAMA5D2 and Broadcom BCM2837. Excellent Device Tree introduction with many examples. Plenty of labs using real hardware. This article dives into the details starting with processor support. ...

October 7, 2023 · 5 min

A Beginner's Memory Allocator

While reading through the awesome “Operating Systems: Three Easy Pieces” book, I came across the topic of memory allocators. While I always had an inkling of how functions like malloc() and free() work under the hood, I had never considered writing a custom allocator. To help demystify the topic, I decided to write a basic allocator on Linux.

The Interface

What does the API look like? The API is identical to that of malloc()/free() with only two major deviations: ...

September 12, 2023 · 14 min

Linux USB Serial Device Name Binding

Have you worked with USB serial devices on Linux? One annoyance you may have come across is device name changes after each reboot. This problem gets solved by binding a custom /dev name to a USB device. This post shows you how. To be clear, this article walks through assigning USB serial devices persistent names. If you have a USB block device and would like to give it a persistent name, the ArchWiki has you covered. ...

May 6, 2023 · 2 min

Real-time Linux App Development

If you’re an embedded systems programmer, you have likely touched on the topic of real-time operating systems (RTOS). There are plenty of commercial RTOSes available on the market: VxWorks, Integrity, DeOS, Helix, the list goes on. As a hobbyist, you may not have thousands of dollars to spend paying for commercial licenses. Real-time Linux can fill the void by providing a path to a soft real-time system. This post takes a tour through the advice given in John Ogness’s 2020 presentation: “A Checklist for Writing Linux Real-Time Applications”. You’ll explore how to optimize a Linux system and application for real-time execution using John’s real-time Linux development checklist. ...

April 27, 2023 · 15 min

Docker Assisted Driver Dev and LDD3

Where does a newbie start their journey into the Linux kernel? Device drivers are the most common answer. Despite its age, Linux Device Drivers 3rd Edition (LDD3) remains one of the best options for learning about device drivers. There are challenges in using such an old text. LDD3’s code examples target the 2.6.10 kernel. At the time of this writing, the kernel is at version 5.19! That said, fixing API deltas just adds to the fun. This article talks about setting up an environment for LDD3 experimentation and the LDD3 experience itself. ...

September 18, 2022 · 6 min