An educational introduction to Apache Arrow

Apache Arrow is everywhere nowadays, and if you are a data engineer, your favorite tools and frameworks probably have one of its libraries as a dependency. The official list of products that use Apache Arrow is filled with big players in the data ecosystem, and new systems built on top of the Apache Arrow-based query engine Apache DataFusion pop up regularly. It seems that no other project is shaping the future of the data landscape quite like Apache Arrow. So, as a data engineer, trying to understand what all the fuss is about came naturally to me.

Getting a high-level overview of Apache Arrow is easy. There are many good (and quite recent) blog posts on its history and the problems it solves, see this or that one. The Apache Arrow project page is also very well written. But for someone like me, who lacked a lot of basic computer science and systems programming knowledge, Apache Arrow never really "clicked". This blog post aims to fill this gap. I will cover the concepts that are needed to not just believe that a standardized columnar memory layout makes sense, but to actually get that its creation was inevitable. But note that I am not an expert on these topics, and I will omit details for readability. To get the full picture, please refer to the specifications.

What actually is Apache Arrow?

For me, things that are not just installable libraries or frameworks are always harder to grasp. Apache Arrow is just like this, and in addition consists of many different parts. Yes, you can install "Apache Arrow" for example in Python by running pip install pyarrow and then use it like any other library. But this library is merely one implementation of what Apache Arrow is all about.

Apache Arrow is foremost one thing: a language-agnostic in-memory columnar data structure specification. To put this in simpler terms, we can unpack it from the end:

  • Specification: Apache Arrow is not just a single implementation, but a highly detailed description containing all the information needed to implement what it describes.
  • Columnar data structure: Apache Arrow describes a two-dimensional data structure with rows and columns, like a dataframe. It is structured so that values from the same column are stored close to each other, rather than grouping values by row.
  • In-memory: The data structure lives in a region of a process's memory. Apache Arrow describes how this flat region must be interpreted to understand it as the two-dimensional data structure and how specific parts can be accessed by simply jumping to the appropriate location in memory.
  • Language-agnostic: The data structure Apache Arrow describes is not bound to any specific programming language or group of languages. This is because it describes how the data structure is laid out in memory, which is what matters to the running program, i.e. the process. It does not matter which programming language was used to create that program.

While this specification sits at the core of Apache Arrow, it is not its only component. In addition, Apache Arrow is also the implementations of this specification in more than ten programming languages, including C++, Python and Rust. After all, what is a specification worth if there are no ready-to-use implementations?

With the specification and multiple implementations to choose from we would be ready to go to use Apache Arrow for a program that runs in isolation. But what if we want to exchange data between processes? To cover this, Apache Arrow also defines several protocols that specify how its in-memory data structure can be serialized into a stream of binary payloads and reconstructed somewhere else. Here, for simplicity, I will only cover Apache Arrow's inter-process communication (IPC) protocol.

While Apache Arrow includes even more components, and in the following I will ignore things like the C data interface, canonical extension types and Arrow Database Connectivity (ADBC), the components stated above are enough to outline how Apache Arrow revolutionized the data tooling landscape. The first step is to understand what problem led to the creation of Apache Arrow and how it solves it.

The problem Apache Arrow is solving

In short, Apache Arrow solves the problem that our computers spend a lot of time just copying and reconstructing data structures when we pass them around in our programs or send them to others. When we create a data structure in our program, it is backed by some region in memory, and because we created it, our program knows how to interpret this region. We can also reference or pass this data structure to a function without copying any data, by simply using a pointer to this memory region. However, problems arise when our data structure crosses boundaries. By boundaries, I mean situations where data moves in a way that loses the information about how the memory region should be interpreted. Such boundaries can occur within the same process when libraries implemented in different programming languages are used, e.g. using the DuckDB Python library (written in C++) on a pandas DataFrame, or between different processes where the data must be sent via inter-process communication (IPC), e.g. PySpark passing data from the Python runtime to the JVM.

Take for example how lists in Python and NumPy arrays use different memory layouts. In this simple snippet:

import numpy as np

a = [1, 2, 3]
b = np.array(a)

we create the logical data structure of an array of three integers, first as a Python list and then as a NumPy array built from it. Although the two are logically identical, the list and the array do not reference the same memory region: creating the NumPy array allocated a new region and copied the underlying values of the list into it. This is because a list is implemented in CPython as a PyListObject that contains pointers to other PyObjects, whereas a NumPy array uses a contiguous region of memory. These layouts are not compatible and form a sort of boundary where memory must be copied.
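We can verify the copy directly: since the array has its own buffer, mutating the original list leaves it untouched. A small sketch:

```python
import numpy as np

a = [1, 2, 3]
b = np.array(a)

# Mutating the list does not affect the array,
# because the values were copied into the array's own buffer.
a[0] = 99
print(b[0])  # still 1
```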

Another example is IPC, where the information needed to interpret a region of memory is not available to the receiving process, which only sees a stream of bytes. At this kind of boundary, the missing information must be included. This is usually done by serializing a memory object into something that can be transferred and then reconstructed, i.e. deserialized. Think of dumping a Python dictionary into a JSON string and sending it over the network to a server, which reconstructs the dictionary by parsing the JSON, or of using Python's pickle module, which (greatly simplified) encodes instructions for how to reconstruct a Python object. Whatever approach one chooses, the CPU on both the sending and the receiving side has to do work.

Here comes Apache Arrow's time to shine. By defining a standard for how its data structure is laid out in memory, libraries can implement against it. This allows them to share the same data structure using a common memory layout. Also, when sending this data structure as a byte stream between processes, it can be understood directly, because what it represents is described in the specification. In this way, Apache Arrow makes our programs do less boring parsing and reconstructing work, while also saving memory, because no redundant copies have to be made.

The memory layout

The aforementioned two-dimensional columnar data structure is called a RecordBatch. You can view this data structure as part of a table that holds some number of records. It has a schema that defines its columns via fields, i.e. their names, data types, and whether they are nullable. And it holds the actual values of these columns, which are called arrays in Apache Arrow's naming convention. These arrays are made up of one or more buffers, which are just contiguous regions of memory, i.e. the actual memory layout.

And at this point comes a fundamental aspect of Apache Arrow: the split between metadata and actual data. The actual data is what is stored in the buffers, and this is what Apache Arrow specifies. It defines a number of logical data types, e.g. Int, Date, Utf8 or List, that an array can have and associates a physical memory layout with each of them. These layouts state which buffers are needed to store values of a data type. This mapping of data type to buffers is what Apache Arrow standardizes and what every implementation must follow in the same way. How the metadata is represented in memory, on the other hand, is not specified by Apache Arrow and can be handled by every implementation however it wants.

Briefly said, metadata is everything that is not actual data, i.e. not buffers. Therefore an array that represents a column of a logical data type is in the end just a collection of metadata that states what data type the array has and where to find all the buffers that actually store its data. See for example how a PrimitiveArray is implemented in Rust:

pub struct PrimitiveArray<T: ArrowPrimitiveType> {
    data_type: DataType,
    values: ScalarBuffer<T::Native>,
    nulls: Option<NullBuffer>,
}

Here, data_type gives the logical data type, while values and nulls are buffers that store the actual data. The same holds for the RecordBatch:

pub struct RecordBatch {
   schema: Arc<Schema>,
   columns: Vec<Arc<dyn Array>>,
   row_count: usize,
}

This is just a collection of schema and arrays, i.e. metadata objects. In addition, it has a row_count that must be shared by all its arrays, because columns of unequal length would be invalid.

Let's make these relationships clearer with an example. Imagine we want to use Arrow to store the following logical tabular data as a RecordBatch:

age  | name
-----|--------
33   | Alice
null | Bob
67   | Charlie

On the metadata side we need the following:

  • A schema that describes our data. In our case it could simply be written as:
columns:
 - name: "age"
   data_type: "Int"
   nullable: true
 - name: "name"
   data_type: "Utf8"
   nullable: false
  • One array per column that contains the buffers storing the actual data.

For storing the actual data, let's start with the first column, age. This column has the data type Int, and following Apache Arrow's specification, we find that it must be encoded using the fixed-size primitive layout. This is a very simple layout that consists of only two buffers:

  • Validity buffer: Encodes whether a value is null.
  • Value buffer: Stores the actual integers. In our example (with 32-bit integers) they would look like this:

Note here that Arrow uses little-endian bit numbering for the validity buffer, i.e. you must read it from right to left, and that most of this example is just padding to get 64-byte-long buffers. More on the latter later.
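To make the bit numbering concrete, here is how the validity bitmap for the age column would be decoded (a sketch: rows 0 and 2 are valid, row 1 is null, so the byte reads 0b00000101):

```python
# Validity bitmap byte for the age column.
validity = 0b00000101

for row in range(3):
    # Bit i, read from right to left, tells whether row i is valid.
    is_valid = (validity >> row) & 1
    print(row, bool(is_valid))
```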

For encoding the name column we must use the variable-size binary layout. Due to the variable length of the (UTF-8 encoded) strings, this is more complex and needs three buffers:

  • Validity buffer: Same as above, but it can be omitted here as the column is not nullable.
  • Value buffer: The actual values stored right next to each other.
  • Offset buffer: Stores the start position of each value in the value buffer. This allows reconstructing where each value starts and ends, which is what makes variable-sized values possible.

In our example things would look like:
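A sketch of what the two buffers for the name column would contain (ignoring padding and alignment for readability):

```python
names = ["Alice", "Bob", "Charlie"]

# Value buffer: all UTF-8 encoded values stored back to back.
value_buffer = b"".join(n.encode("utf-8") for n in names)

# Offset buffer: the start of each value, plus one final end offset,
# so value i spans value_buffer[offsets[i]:offsets[i + 1]].
offsets = [0]
for n in names:
    offsets.append(offsets[-1] + len(n.encode("utf-8")))

print(value_buffer)  # b'AliceBobCharlie'
print(offsets)       # [0, 5, 8, 15]
```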

In essence, that's all there is to it: define your metadata and use the layout associated with your data types to encode the values. But note that this was a very simple example. Apache Arrow defines more than 25 data types, and there are also nested layouts that are composed of multiple layouts with relationships between them, so things can get complex. But with the basics of the memory layout explained, let's look at a few reasons why it was designed this way.

Why the memory layout is also smart

Apache Arrow not only defines a standardized memory layout that allows for easy sharing of data, it is also designed to be highly efficient. This is because it takes into account the kind of operations typically used on the data and how modern CPUs access memory.

As described above, Apache Arrow stores all values of a column inside one buffer, instead of storing all values of a record contiguously. While the latter could also be used to build a standardized layout, using a columnar format brings multiple benefits. First, it is optimized for typical online analytical processing queries that do projections, i.e. selecting only a subset of columns, and aggregations over large numbers of records. By using a contiguous region of memory for individual columns, a projection can simply select the buffers of interest instead of scanning the entire memory region to find the selected column values of each record. For aggregations, the principle of locality is fully exploited, as large runs of column values are loaded into the CPU caches, making cache hits more likely.

Additionally, the buffers are allocated in such a way that their start and end points are aligned with memory addresses that are multiples of 8 or 64 bytes. If a buffer does not contain enough data, padding is used to over-allocate memory and ensure this alignment. This is similar to how modern computers place data alignment restrictions on the allowable addresses for primitive data types, ensuring that the CPU can access them with the minimum number of instructions. This is also one of the reasons Apache Arrow does it, but in addition, 64-byte alignment ensures efficient use of Single Instruction Multiple Data (SIMD) registers.
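The padding rule itself is simple rounding; a small sketch of how a buffer size would be padded up to a 64-byte boundary:

```python
def padded_length(nbytes: int, alignment: int = 64) -> int:
    """Round a buffer size up to the next multiple of the alignment."""
    return ((nbytes + alignment - 1) // alignment) * alignment

# Three 4-byte integers need only 12 bytes of data,
# but the buffer occupies a full 64-byte slot.
print(padded_length(3 * 4))  # 64
print(padded_length(100))    # 128
```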

Example: Zero-copy

Let's have a look at a concrete example of how Apache Arrow's standardized memory layout can help to avoid making copies, i.e. zero-copy, when passing data between different libraries in the same process.

For this, we will use Python and first define a function to measure the change of memory usage of our program:

import os

import psutil

process = psutil.Process(os.getpid())

def measure(operation_name):
    """Decorator to measure memory usage"""
    def decorator(func):
        def wrapper(*args, **kwargs):
            mem_before = process.memory_info().rss
            result = func(*args, **kwargs)
            mem_after = process.memory_info().rss
            print(f"\n{operation_name}")
            print(f"  Memory allocated: {(mem_after - mem_before) / 1024**2:.2f} MB")

            return result
        return wrapper
    return decorator

Then we generate some random string data and build two pandas DataFrames, one with the standard memory layout and the other with the Apache Arrow memory layout:

import random
import string

import pandas as pd

def generate_random_string(length=10):
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

n_rows = 1_000_000

data = {
    "x": [generate_random_string() for _ in range(n_rows)],
    "y": [generate_random_string() for _ in range(n_rows)],
}

@measure("Traditional pandas DataFrame")
def create_trad():
    return pd.DataFrame(data)
    
@measure("Arrow-backed DataFrame")
def create_arrow():
    return pd.DataFrame(data, dtype="string[pyarrow]")

df_trad = create_trad()
df_arrow = create_arrow()

This prints the following result:

Traditional pandas DataFrame
Memory allocated: 30.10 MB
Arrow-backed DataFrame
Memory allocated: 56.62 MB

Compared to the traditional pandas DataFrame, the one with the Apache Arrow layout has a larger memory footprint, which is probably due to the different string representation, padding, and additional buffers (offsets and validity bitmaps). However, Apache Arrow does not optimize for minimum memory usage but for reusability. So let's have a look at what happens if we create Polars DataFrames from the pandas ones:

import polars as pl

@measure("Traditional: pl.from_pandas(df_trad)")
def to_polars_trad():
    return pl.from_pandas(df_trad)

@measure("Arrow: pl.from_pandas(df_arrow)")
def to_polars_arrow():
    return pl.from_pandas(df_arrow)

Which prints:

Traditional: pl.from_pandas(df_trad)
Memory allocated: 31.62 MB

Arrow: pl.from_pandas(df_arrow)
Memory allocated: 1.88 MB

This shows that we needed to make another copy of the data when using the traditional pandas DataFrame as a source. On the other hand for the DataFrame that uses the Apache Arrow memory layout, the buffers could just be reused and only a small amount of metadata needed to be allocated.

IPC

Now that we know about all the advantages of the RecordBatch memory layout, wouldn't it be great to communicate it efficiently between processes? This is possible by using the inter-process communication (IPC) protocol of Apache Arrow. The key idea of this protocol is that the buffers that make up a RecordBatch can just be sent around as byte streams and the receiving process can use them without having to parse them. But to actually "reconstruct" the RecordBatch the sending process must include metadata that states what the buffers actually represent.

Before going into the details of the IPC protocol, let's first look at one way that data can be exchanged between processes. Imagine I want to send the following Python dictionary from one process to another via TCP:

data = {
    "name": "Alice",
    "age": 30,
    "scores": [10, 20, 30]
}

I could do this by first serializing it into a string, encoding it and sending it over a socket.

import json
payload = json.dumps(data)
sock.send(payload.encode())

The receiving side just sees raw bytes:

b'{"name":"Alice","age":30,"scores":[10,20,30]}'

And to get a dictionary it must first decode them, scan the bytes character by character and recreate the content of the dictionary:

data = json.loads(received_bytes.decode())

Here, the original bytes are copied into the newly created object, and the CPU has to do work. Apache Arrow tackles this differently: first by using a standard binary format that a receiver can always handle in the same way, and second by including metadata as a form of "recreation manual" that can be instantly "deserialized".

The standard binary format of Apache Arrow's IPC is the encapsulated message format. It consists of:

  • A 32-bit continuation indicator. The value 0xFFFFFFFF indicates a valid message. This is useful for knowing when a new message starts.
  • A 32-bit little-endian length prefix indicating the metadata size. This is needed to know how many bytes should be looked at to read the metadata. Remember that we only want to read the metadata and not all the data.
  • The actual metadata encoded as Flatbuffers.
  • Padding to an 8-byte boundary to satisfy data alignment restrictions.
  • The actual data buffers.

This encapsulated message format is the only kind of format that is sent around, so a communication is just a sequence of these messages, one after another. But inside this message format, various message types can be stored, e.g. a Schema or RecordBatch message. What type of message is encapsulated is encoded within the metadata. Therefore, to correctly understand what type of message it receives, a process first has to identify that a new message starts via the continuation indicator and then use the metadata size to read all of the metadata.
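The framing described above can be sketched as a minimal reader (an illustration of the framing only, not a real IPC implementation, and it skips the padding and data block):

```python
import struct

CONTINUATION = 0xFFFFFFFF

def read_message(stream: bytes, pos: int = 0):
    # Check the 32-bit continuation indicator that marks a valid message.
    (cont,) = struct.unpack_from("<I", stream, pos)
    if cont != CONTINUATION:
        raise ValueError("not a valid encapsulated message")
    # Read the 32-bit little-endian metadata length...
    (meta_len,) = struct.unpack_from("<i", stream, pos + 4)
    # ...and then exactly that many metadata bytes (Flatbuffers-encoded).
    metadata = stream[pos + 8 : pos + 8 + meta_len]
    return meta_len, metadata

# A fake message with 4 bytes of placeholder "metadata":
msg = struct.pack("<Ii", CONTINUATION, 4) + b"META"
print(read_message(msg))  # (4, b'META')
```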

The metadata is created using a serialization framework called Flatbuffers. Here, a schema is used to serialize an object into a binary buffer that uses offsets to organize nested structures. This way one can jump directly to the location in the buffer where a desired data field is stored by following an offset. So, to make matters more complex, Apache Arrow's serialization depends on another serialization framework. But by leveraging Flatbuffers, the metadata is "understood" instantly by the receiver, which can use it to build the corresponding Apache Arrow object.

For example for the Schema message type the Flatbuffers schema could look something like this:

table Schema {
  /// List of fields
  fields: [Field];
}

root_type Schema;

Where a field has a name and some associated data type. Other than that, no actual data would be needed and the metadata that was serialized using this schema could be instantly used to construct an Apache Arrow schema.

As another example let's look at a (simplified) Flatbuffers schema for a RecordBatch message type:

table RecordBatch {
  /// Number of records
  length: long;
  /// List of buffers
  buffers: [Buffer];
}
  
struct Buffer {
  /// The relative offset in the actual data binary stream
  offset: long;

  /// The length of the buffer
  length: long;
}  

root_type RecordBatch;

This encodes the buffer structure of the actual data block at the end of a message by stating the offset and length of each buffer. Here the actual data block for a RecordBatch message is just all of its buffers laid out sequentially. This in combination with a schema allows us to understand what kind of data the buffers represent and therefore reconstruct the RecordBatch.
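With this metadata, the receiver's "reconstruction" is just slicing, which can be sketched as follows (hypothetical offsets and lengths, ignoring the Flatbuffers decoding itself):

```python
# Hypothetical metadata: two buffers described by (offset, length) pairs,
# laid out sequentially in the message's data block.
buffers_meta = [(0, 64), (64, 64)]
body = bytes(128)  # the data block that follows the metadata

# Reconstructing the buffers requires no parsing of the values themselves.
buffers = [body[offset : offset + length] for offset, length in buffers_meta]
print([len(b) for b in buffers])  # [64, 64]
```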

To now actually send RecordBatches between processes one uses the IPC streaming protocol. Here, first a Schema message is sent and then one or more RecordBatch messages are sent that each use the same schema.

Example: IPC serialization

As a final example, let's compare how to serialize and deserialize a pandas DataFrame using pickle and Apache Arrow’s IPC streaming protocol.

For this, we will reuse our generate_random_string function and create a new pandas DataFrame using the Apache Arrow memory layout:

n_rows = 1_000_000

data = {
    "x": [generate_random_string() for _ in range(n_rows)],
    "y": [generate_random_string() for _ in range(n_rows)],
}

df_pandas = pd.DataFrame(data, dtype="string[pyarrow]")

We then measure how long it takes to serialize and deserialize this pandas DataFrame using pickle:

import pickle
import time

# Serialize
start = time.perf_counter()
pickled_data = pickle.dumps(df_pandas)
pickle_serialize_time = time.perf_counter() - start

# Deserialize
start = time.perf_counter()
df_pickle_loaded = pickle.loads(pickled_data)
pickle_deserialize_time = time.perf_counter() - start
 
print(f"Serialize time:   {pickle_serialize_time:.3f}s")

print(f"Deserialize time: {pickle_deserialize_time:.3f}s")

print(f"Total time:       {pickle_serialize_time + pickle_deserialize_time:.3f}s")

This prints:

Serialize time:   0.158s
Deserialize time: 0.055s
Total time:       0.214s

Now we do the same using the IPC streaming protocol and compare the results:

import pyarrow as pa

# Create arrow RecordBatch
arrow_record_batch = pa.RecordBatch.from_pydict(data)

# Serialize
start = time.perf_counter()
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, arrow_record_batch.schema)
writer.write(arrow_record_batch)
writer.close()
arrow_data = sink.getvalue()
arrow_serialize_time = time.perf_counter() - start

# Deserialize
start = time.perf_counter()
reader = pa.ipc.open_stream(arrow_data)
arrow_record_batch_loaded = reader.read_all()
arrow_deserialize_time = time.perf_counter() - start

print(f"Serialize time:   {arrow_serialize_time:.3f}s")

print(f"Deserialize time: {arrow_deserialize_time:.3f}s")

print(f"Total time:       {arrow_serialize_time + arrow_deserialize_time:.3f}s")

print("=" * 50)
print("COMPARISON")
print("=" * 50)

print(
    f"Speedup (serialize):   {pickle_serialize_time / arrow_serialize_time:.2f}x faster"
)

print(
    f"Speedup (deserialize): {pickle_deserialize_time / arrow_deserialize_time:.2f}x faster"
)

print(
    f"Speedup (total):       {(pickle_serialize_time + pickle_deserialize_time) / (arrow_serialize_time + arrow_deserialize_time):.2f}x faster"
)

This prints:

Serialize time:   0.054s
Deserialize time: 0.003s
Total time:       0.058s

==================================================
COMPARISON
==================================================

Speedup (serialize):   2.91x faster
Speedup (deserialize): 16.82x faster
Speedup (total):       3.70x faster

While this may not be a fair or complete comparison, it shows that we can deserialize data faster using Apache Arrow. It also highlights how much faster deserialization can be when buffers are reused and metadata is read directly using Flatbuffers.

Parting thoughts

I hope this kind of educational introduction to Apache Arrow can help some of you wrap your heads around it. I learned a lot while writing this blog post, and Apache Arrow feels much less daunting now. But please keep in mind that I left out a lot of details and simplified many aspects to keep this post at a reasonable length. To go deeper, take your time and read the specifications yourself.

How to do a GPU passthrough on Beelink SER 5 - Ryzen 7 5800H with Proxmox on Ubuntu VM

After I installed Proxmox on my Beelink SER 5 - Ryzen 7 5800H, I wanted to start an Ubuntu VM with GPU passthrough that I could use as a daily driver. As I had zero prior experience with something like this, my process was full of following guides I barely understood, frustrating trial and error, and reading up on the things I had just tried. This journal describes that process. It should not be seen as a complete guide, as I am far too inexperienced to verify that I did everything correctly. But maybe it helps someone like me who is lost and googling error messages.

Before describing my journey I want to point out a few guides without which I would not have been able to get it done:

My journey

After I ran head first into following a guide and not getting results I decided to take a step back and first gather more information on what a GPU passthrough even is. I came up with the following description:

GPU Passthrough

GPU Passthrough is a technique in virtualization and a form of PCI passthrough, where the Graphics Processing Unit of the hypervisor host machine (for me Proxmox) is directly assigned to a virtual machine (VM). This way the VM can access the GPU as if it were directly connected to it.

Normally, a hypervisor provides virtual hardware to a VM, and if this is done for a GPU, one speaks of a virtual GPU (vGPU). This way multiple VMs can share access to a single physical GPU, but an individual VM may not be able to use the GPU's full potential. Additionally, the hypervisor adds overhead to the communication, leading to reduced performance. GPU passthrough acts as a solution to this problem by bypassing the virtualization layer, giving the VM near-native performance of the GPU.

Steps to enable it

  1. Enable Input-Output Memory Management Unit (IOMMU): The IOMMU is necessary to isolate the GPU. It maps device-visible I/O virtual addresses to physical memory addresses and can therefore be used to restrict a device's memory access to a single virtual machine.
  2. Blacklist the host driver: By blacklisting the host driver of the GPU, the host machine will no longer load it and will lose access to the GPU. This way the Virtual Function IO (VFIO) can take control of the GPU without timing issues caused by interference from the host machine.
  3. Bind the GPU to the Virtual Function IO: Here the VFIO acts as a driver that then has the control over the GPU and can give direct access to the VM. This way the host machine will no longer have access to it.
  4. Assign the GPU to the VM: After the GPU is bound to the VFIO one can assign it to any VM.
  5. Install the necessary driver on the VM: To ensure that the VM can actually use the assigned GPU, the necessary drivers need to be present.

Naively following the steps

0. Define the VM that you want to launch

Here many guides exist like:

For my case I just copied the download link from: https://ubuntu.com/download/desktop and then downloaded the image to the local storage. Afterwards I did some standard configuration. The only thing out of the ordinary was that I used OVMF for the BIOS.

OVMF stands for Open Virtual Machine Firmware, and it is an open-source implementation of the Unified Extensible Firmware Interface (UEFI) specification. UEFI is a modern firmware interface that replaces the traditional BIOS (Basic Input/Output System) found in older systems. UEFI offers several advantages over BIOS, including support for larger disk sizes, faster boot times, secure boot, and improved system management capabilities.

What I can recommend is to install ssh on the VM so that you can still access it without a GUI.

sudo apt update
sudo apt install openssh-server
sudo systemctl status ssh

1. Enable Input-Output Memory Management Unit (IOMMU)

The Input-Output Memory Management Unit (IOMMU) is a hardware feature of modern CPUs that allows the operating system (OS) to control how I/O devices access memory. It acts as a form of gatekeeper and is therefore similar to how the Memory Management Unit handles the CPU's access to memory.

To enable it I needed to configure the Grand Unified Boot Loader (GRUB) of the host machine. In a normal boot process the GRUB is started by the BIOS and is responsible for loading the operating system. In my case with Proxmox it loads the Linux kernel. The way it behaves can be configured in the /etc/default/grub file, which can be edited with sudo nano /etc/default/grub. After an update the command sudo update-grub must be run to reload it.

To enable the IOMMU I often read that one has to add iommu=on, or in my case with an AMD CPU, amd_iommu=on. In my trial-and-error process it turned out that I did not have to enable the IOMMU this way (on recent kernels the AMD IOMMU is enabled by default). Instead I only set iommu=pt, by changing the line GRUB_CMDLINE_LINUX_DEFAULT="quiet" to GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt". As far as I know this enables the passthrough mode of the IOMMU and enhances performance.

2. Blacklist the host driver

This step was rather straight forward. I just had to execute:

echo "blacklist amdgpu" >> /etc/modprobe.d/blacklist.conf
echo "blacklist radeon" >> /etc/modprobe.d/blacklist.conf

This puts these two drivers in the blacklist, so they won't be automatically loaded in the boot process.

If you wonder about the modprobe in the path, it refers to a Linux command of the same name. The modprobe command loads or unloads kernel modules, in our case drivers. Running modprobe <module name> loads the specified module into the Linux kernel, so the system can use the hardware the module supports. Adding the flag -r unloads the module instead.

3. Bind the GPU to the Virtual Function IO

To bind the GPU to the VFIO I first had to ensure that all necessary modules are loaded into the Linux kernel in the boot process. To do this I executed the following command:

echo "vfio" >> /etc/modules
echo "vfio_iommu_type1" >> /etc/modules
echo "vfio_pci" >> /etc/modules
echo "vfio_virqfd" >> /etc/modules

Afterwards I needed to identify the GPU, which can be done by using lspci -v and looking for a "VGA compatible controller". In my case:

04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5) (prog-if 00 [VGA controller])
04:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller

Here the first column is a combination of bus, device and function number. In my case 04:00.0:

  • 04: Bus number.
  • 00: Device number.
  • 0: Function number.

This value can then be used to access further information by executing lspci -n -s 04:00, which yielded in my case:

04:00.0 0300: 1002:1638 (rev c5)
04:00.1 0403: 1002:1637
04:00.2 1080: 1022:15df
04:00.3 0c03: 1022:1639
04:00.4 0c03: 1022:1639
04:00.5 0480: 1022:15e2 (rev 01)
04:00.6 0403: 1022:15e3

Here the second column is the class code in hexadecimal, which provides information about the type of device. The third column is the vendor and device ID in the format <vendor-id>:<product-id>. In my case 1002 stands for AMD/ATI, and e.g. 1638 uniquely identifies a specific device model from that vendor. The last, optional column is the revision ID, which indicates the specific version of a device.

For binding the GPU the important part is the <vendor-id>:<product-id>. I used the first one (the GPU) and the second one (its audio device) and bound them using echo "options vfio-pci ids=1002:1638,1002:1637" > /etc/modprobe.d/vfio.conf.

After a reboot one can execute lspci -v and check whether the kernel driver in use for the GPU is now vfio-pci instead of the previous one (in my case it was amdgpu).
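To narrow the check down to the GPU itself, the device address from above can be queried directly (04:00.0 is the address on my system; substitute your own):

```shell
# Show only the GPU function, including the driver currently bound to it
lspci -nnk -s 04:00.0
# The interesting line in the output is:
#   Kernel driver in use: vfio-pci
```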

4. Assign the GPU to the VM

This can easily be done via the Proxmox GUI using VM -> Hardware -> Add -> PCI Device.

But I ran into the issue that whenever I started the VM, my host machine would become unresponsive. I found that this did not happen when I left the box "All Functions" unchecked.

5. Install the necessary driver on the VM

Since I used Ubuntu for the VM and an AMD GPU, I didn't have to install anything, as the amdgpu driver is readily available.

Issues I ran into

While the steps described above sound nice and simple, I encountered multiple issues.

Missing BIOS ROM

After naively following the steps I did not receive any output on the display connected to the host machine when I started the VM. After ssh-ing into the VM I could check via sudo lshw -c display that the GPU was detected, but that no driver was in use for it. Further debugging with sudo dmesg to check the kernel ring buffer revealed the following error logs:

amdgpu: Unable to locate a BIOS ROM
amdgpu: Fatal error during GPU init

This means, more or less, that the VM could not initialize the GPU because it cannot access the GPU firmware stored in read-only memory (ROM). I encountered this error because I skipped an important step: making the GPU BIOS available to the VM.

To fix this I wanted to extract the GPU BIOS using the tool amdvbflash on the host machine, but I had no success and only encountered the following error:

root@proxmox:~/flash-gpu# ./amdvbflash -i
AMDVBFLASH version 4.71, Copyright (c) 2020 Advanced Micro Devices, Inc.

Adapter not found

I am unsure, but I think the reason this tool could not detect my GPU is that it is an AMD Cezanne one. Cezanne is the codename for an architecture of chips that combine both a CPU and a GPU on the same die.

But I was in luck: this guide linked me to a description of the BIOS extraction process, see https://github.com/isc30/ryzen-7000-series-proxmox/?tab=readme-ov-file#configuring-the-gpu-in-the-windows-vm.

Additionally I performed the same steps for the audio device, though I do not know whether this was necessary.

Stuck at 800x600 resolution

After extracting the GPU BIOS I finally got output on my display, but I could not adjust the resolution and was stuck at 800x600. The reason for this was probably a bug known as the AMD GPU reset bug. But I was again lucky and found another guide that described how to resolve it: https://www.nicksherlock.com/2020/11/working-around-the-amd-gpu-reset-bug-on-proxmox/ After installing the vendor reset as described in the post I could adjust my resolution.

Redirect USB devices

As a final step I redirected my mouse and keyboard to the VM. For this I first determined the <vendor-id>:<product-id> of the devices using lsusb on the host machine. Then I bound them using qm set <VM-ID> -usb0 host=<vendor-id>:<product-id>.
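Put together, this looks roughly as follows; the VM id 100 and the device id here are purely illustrative placeholders:

```shell
# List USB devices on the host to find the <vendor-id>:<product-id>
lsusb

# Pass a device through to VM 100 as its first USB device
# (046d:c52b is a placeholder id; substitute the one from lsusb).
qm set 100 -usb0 host=046d:c52b
```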

Things I learned about learning

Learning has always been an ambivalent topic for me. In school, it was cumbersome, and I didn’t find much joy in it. On the other hand, I’ve always enjoyed picking up new skills that interest me, like juggling or card magic. While I’ve never deeply developed any particular skill, I’ve become better at learning autonomously.

This autonomy helped me later in life, especially when I did a career switch from physics to software/data engineering three years ago. While I was skilled enough to land a job, I lacked knowledge in several areas and felt behind with a lot of ground to cover. My old, intuitive way of learning had worked before, but it felt inefficient and unsatisfactory. I didn't feel in control of the things I learned, and I reacted more to the things I randomly encountered rather than following a structure. Also, nothing really seemed to stick for long. I often had to relearn things, which felt more like starting from scratch rather than building on the past.

As I believe that learning is one of the most important skills in life, especially in areas that change quickly like software engineering, I decided to try to get better at it. I have read, viewed and studied some resources and tried various approaches and refined the ones that worked best for me. In the following, I present some of my key findings that I believe have been the most important in improving my way of learning. Note that all of this is non-scientific, highly opinionated and just a snapshot of my current thinking.

You will forget things, get used to it

Our brain is not meant for the endless storage and exact retrieval of knowledge[^1]. Accept this. We will all forget things we previously spent a lot of time learning and understanding.

For me this always felt disappointing, like losing something valuable, so why even acquire it? Learning something new also always came with the feeling that I would need to forget something old. The key to solving this dilemma for me was to first change my view on learning. Instead of seeing the goal of learning as being able to remember facts, I shifted to enjoying the process of changing the way I think. So even though I know that I will not be able to remember all of it in the future, I know that new connections have formed in my brain that may make it easier for me to pick up something similar later.

Second, I implemented a knowledge management system in which I write atomic notes on the things I learn. These atomic notes are short summaries of concepts, or fractions of concepts, that are small enough to view and quickly grasp in isolation, even though they may be part of something far bigger. They act like checkpoints in a video game and allow me to pause studying a topic and come back to it later. So even when I forget things it does not feel like I lost something, as I can easily retrieve them by reading my own thoughts. After all, who could teach you a topic better than yourself? This allows me to forget and lowers the pressure of keeping everything in my brain.

In a sense I have used such knowledge management systems, or, as some call them, second brains, unconsciously in the past: remembering only where in a book or website specific things are written instead of remembering the things themselves. So building and using such a system consciously for learning, and separating it from external resources, just made sense.

Another great thing about writing about the things you learn is that you are forced to actively engage with the topics you are studying. You stop being a passive consumer and become a creator just by recapitulating things in your own words or consolidating multiple sources. This idea is so important that it will pop up multiple times throughout this blog post.

To build your own knowledge management system, I would recommend starting as small as possible and just getting things going. There are endless tools and frameworks out there with varying complexity; just choose something that looks fine, as you will not get it perfect on the first try anyway. Stick with it and refine it as you go. You will notice things that work for you and things that bring you no value. Feel free to drop the latter, and don't get fixated on doing it the 'right' way. Building a knowledge management system is very personal, so your way will quickly diverge from the systems postulated by experts.

The one thing I would call a must-have for such a system is that you can trust it. As you will put a lot of time and value into it, the things you write should be safe. I would recommend first implementing a backup strategy that you can trust; then you can fully engage with the system without worry. (But other people have far higher requirements for the systems they use; see this blog post.)

For my system I mainly use markdown files that I edit with neovim. I sync those files with my phone using Syncthing, as it is very important to me to also have access to my notes when I am not at home. On my phone I use the Obsidian app, which in the end is just a nice (but extremely powerful) wrapper around markdown files. For example, it allows you to reference other notes, creating connections between them, which can be very helpful for navigating older notes and putting things in perspective. As I am using Obsidian I also use this vim plugin. To trust my system and make it durable I make daily automated backups to an external hard drive and my Google Drive using rclone. With this, I have relatively recent versions of my notes on four devices at all times.

Here is a graph view in the Obsidian app of some of my notes and how they connect with each other:

Learning is effort and is not supposed to be easy

Learning differs significantly from merely being presented with knowledge. Learning is exhausting and often not pleasant. It involves being challenged and getting frustrated. If something is too pleasant, you should question whether you are actually learning or just consuming.

When I first noticed this I felt discouraged, because my natural instinct is to avoid uncomfortable situations. Feeling frustrated when learning something new made me think that I was doing something wrong and question whether I was on the right track. Perhaps I was indeed challenging myself too hard and should have chosen a different approach. On the other hand, simply doing things that come easily will not lead to actual growth. What remains is a small sweet spot where you feel slightly uncomfortable and may sometimes get frustrated, but are not so far from your area of competency that you easily get discouraged. Finding this spot is not easy, but being sure that it exists, no matter where you are coming from, helps.

For me one example of this was when I was working through the book "The Algorithm Design Manual" by Steven S. Skiena and tried my luck on the suggested LeetCode challenges as an exercise in one of the early chapters. I quickly got stuck on a challenge that required a Fenwick tree, a rather advanced data structure that was just too hard to grasp at my level of understanding at the time. Even though I pushed through with the help of external resources, it took me an enormous amount of time and I thought about skipping these challenges in the following chapters.

But instead I acknowledged that some of the suggested challenges were too hard for me at the time and took a step back. I sought out related, easier challenges before tackling the suggested ones, and started to systematically work through learning paths of easier concepts, like LeetCode 75, to gradually train my problem solving. While this made my progress feel slower, as I had to work through more material, it made it far more enjoyable and helped me build momentum.

On another note, I have become more aware of the big difference between simply consuming knowledge and actually learning by engaging with it. It may sound like a foolish thing, but I simply asked myself: "With deep knowledge on every niche topic available at our fingertips, be it through classic books, YouTube tutorials or the output of some large language model, what stops anyone from becoming an expert by just consuming the essentials as fast as possible?" It just felt like a straightforward path that anyone with enough time on their hands could follow.

But personally I noticed that just consuming excellent work does not lead to excellent understanding. With recommendation algorithms suggesting all kinds of content, I have consumed videos that delve into in-depth art analyses, historical events or the workings of the economy. And while these videos were well structured and covered their topics exhaustively, I don't feel knowledgeable about any of them. I may have come across an idea or concept that felt new and exciting while watching, but actually retrieving this information to explain it to someone else, or putting it in context with other things, felt impossible. The only things that stuck with me are out-of-context fun facts. (Which is still neat.)

The key difference between consuming knowledge and truly learning is the act of actively engaging with new information. Even if a concept is presented to you in organized and bite-sized portions, you must make the effort to "re-invent" it yourself.

  • You have to care about the problem it solves, because, after all, why should you understand it if it brings you no value.
  • You have to understand how it relates to the bigger picture. Concepts do not exist in a vacuum, they are always connected. Finding these connections makes them easier to remember and accelerates learning new ones as you can leverage your existing understanding.
  • You have to challenge your understanding of it. Explain it to yourself or others without using the original source. Look for weak points. Maybe you find out your understanding was incorrect and needs adjustment.
  • You need to use it. Actually using new knowledge can feel so rewarding. Additionally, by using it you will be forced to go through the above steps.

I first encountered these ideas in an excellent YouTube video, which I highly recommend, and I ultimately summarized parts of it to deepen my understanding.

To enforce parts of this in my learning routine I use a popular system, which brings me to my next topic.

Use spaced repetition with effortful retrieval

Spaced repetition is the process of repeating learning exercises at intervals that adjust based on your performance. In practice this means that you repeat a learning exercise more often when it is new to you, and the better you get at it, the less frequently you repeat it. The goal is to repeat an exercise just as you begin to forget it, refreshing the connections in your brain. You may be familiar with this approach from learning vocabulary, and there exist many tools to easily implement it. I use Obsidian Spaced Repetition.

Spaced repetition works great in combination with my personal knowledge management system. Here I treat the effortful retrieval of a note as a learning exercise. In practice this means that I try to recapitulate the gist of a note to myself, figuring out how I would explain the concept to someone unfamiliar with it. This forces me to actively engage with the knowledge, always re-inventing it as described in the previous section.

In action this looks like this:

  • Each note I want to review using spaced repetition is assigned a due date for its next review.
  • For every day I then have a list of notes to review.
  • Depending on how well I did, I rate each note as either easy, good or hard, which moves its next due date later or earlier.
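The scheduling idea behind this can be sketched in a few lines of Python. This is not what the Obsidian plugin actually does internally, just a minimal illustration: each rating stretches or shrinks the interval until the next review, and the scaling factors here are made up for the example.

```python
from datetime import date, timedelta

# Illustrative rating factors, not taken from any particular tool:
# "hard" shrinks the interval, "good" and "easy" grow it.
FACTORS = {"hard": 0.5, "good": 1.5, "easy": 2.5}

def next_review(interval_days: int, rating: str) -> int:
    """Return the new interval (in days) after a review."""
    return max(1, round(interval_days * FACTORS[rating]))

# Example: a note reviewed today with a current interval of 4 days
# that I rated "good" comes back in 6 days.
interval = next_review(4, "good")
due = date.today() + timedelta(days=interval)
```

The essential property is only that well-remembered notes drift further into the future while shaky ones come back soon.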

Doing multiple repetitions of everything I learn sounded daunting at first, and I must admit that it is still tedious sometimes. Additionally, it is tempting to start a new topic as soon as the previous one is completed, as that feels more rewarding. But I have found that spaced repetition is by far the most effective tool for a persistent and sustainable increase in knowledge.

First, you actually remember what you just studied. I notice a great discrepancy in recall between things I just consumed and things I actively engaged with through a few rounds of spaced repetition. I often experience this when I discover a note I have no memory of, realizing I forgot to put it in the spaced repetition loop.

But the most important aspect of spaced repetition, for me, is that everything, even things you could recite in your sleep, comes back to you. This way you can think about it from a new perspective, maybe understanding it on a deeper level: moving from an abstract idea to a concrete example and back (see semantic wave), or taking advantage of the things you have learned in between to find new connections or detect errors in your old reasoning. Additionally, it helps with tackling hard things. Quite often I write a note about a concept I do not fully grasp at the time. I no longer feel bad about this; I just drop the perfectionism and add the note to the loop, knowing that I will figure it out down the line.

Take on and seek out learning opportunities

Learning opportunities arise more often than you might think. The things with the biggest potential to increase your knowledge can sometimes be right in front of you. Take those opportunities and seek them out.

The thing I find most stressful as an engineer is when something breaks, I don't know why, and I have to fix it quickly. I dread this combination of urgency and being out of my comfort zone. Unfortunately, in such situations I often find myself defaulting to inefficient problem-solving techniques, such as extreme loops of trial and error without reasoning, just applying things I found online. I think I do this because not fully grasping the issue at hand is uncomfortable, and taking it slow to actually understand what is happening is challenging. Taking it slow can feel like no progress is being made. So I prefer the easy way of rapid action, while mostly inefficient, to the tougher challenge of actually facing and fixing my blind spots.

As such situations are not a rarity and arise more often than I wish, I have started to view them as learning experiences to take on instead of problems to simply fix. Instead of panicking and solving them as fast as possible, I take my time to understand what is happening. This can lead to frustration and despair, but as mentioned before, while this is uncomfortable, it is often just a sign that learning is going on. I try to use those feelings as a guide and appreciate them. I first came across this idea in Julia Evans' zine The Pocket Guide to Debugging, which I highly encourage you to check out.

Unfortunately this is far easier said than done, but what I have found most useful is to reduce the feeling of urgency as much as possible. Often, things feel far more urgent than they actually are, and being uncomfortable only amplifies this. So when I feel the itch to fall back on fast but inefficient action, I do a reality check to figure out how much time I can actually spend on the issue. In other situations I seek out a quick and dirty mitigation to buy myself time that I can then spend in peace.

What I have found is that such situations, while uncomfortable, can serve as the best learning experiences and greatly enhance one's understanding. Additionally, I do not need to wait for things to actually crash. Often I can find such valuable experiences in the things I have procrastinated on, or in the tasks that no one on the team wants to do. What helps me find and take on such challenges is reflecting on why I don't want to do them. Is it because they are tedious and boring, or am I actually just not confident enough to perform the required tasks? In short: if it is a skill issue, do the task and learn something.

Focus your effort

In learning, like (almost) everything else in life, focusing on concrete goals and making steady progress beats aimless sprints. You should learn with an action in mind: to apply the newfound knowledge or to follow some broader high-level goal.

Don't treat learning like mindless consumption on an endlessly scrolling dopamine app. Don't hop from topic to topic without going deeper than the surface. I have often fallen into this trap and still do. While it is enjoyable for the sake of curiosity and broadens one's perspective, in the end nothing really sticks, and the result feels hollow to me. To combat this, I use two approaches.

The first is that I follow one or more foundational, long-term goals, for example, teaching myself computer science. While I often cannot directly apply anything I learn here, the knowledge foundation I build helps me pick up things faster when I actually need them. It acts as a long-term investment where I am sure it will not quickly fall out of fashion, in contrast to some hyped flavor of the month.

The second approach is that I try to learn similarly to the Toyota way: Toyota only builds car parts when they are needed, so I try to only dig deeper into a topic when I want to actually apply it in the near future. While it can be tempting to experiment with a shiny new framework by skimming through documentation and completing tutorials, I have often found little value in doing so if I don't use the framework to solve real problems and gain practical experience with it. Therefore, I try to avoid getting sidetracked and focus on improving at the things I am actually using. For a deeper insight into this aspect see this great blog post.

This combination of building long-term fundamental knowledge and spiking in detailed areas that I encounter every day, has served me well. While it is still hard not to fall victim to FOMO, it feels great when I notice that I can pick up a new framework more quickly by leveraging knowledge of some underlying concept. Additionally, focusing on a few areas to gain an understanding beyond the advanced level not only tremendously helps with everyday work, but can also yield unexpected transferable skills.

Closing thoughts

Reflecting on what I have written, I come to the conclusion that I now learn more systematically and consciously. I use systems to make hard challenges easier, and therefore feel more confident and secure tackling them. Also, due to repetition and reflection, I have become more aware of how I think while learning (a process I recently learned is called metacognition). This allows me to learn more efficiently and makes it more enjoyable, so I keep doing it. Overall, I am quite happy with the progress I have made and hope that reading about my experiences helps you reflect on the way you learn.