It has been almost 6 months since I started flirting with the idea of basing my thesis on HelenOS somehow, so it is time to report some progress.
Actually, there are two reasons for this post. First, I think that something has been implemented and it is time to tell the community about it. And also I feel guilty after JS and JJ complained about a lack of communication from students participating in HelenOS. So here you have it ;-).
First, about the intended goal and target platform: my work is aimed at supporting ccNUMA (cache coherent NUMA) hardware on the amd64 platform (this is typically represented by servers with Opteron processors on a NUMA capable motherboard).
More precisely, my work is targeted at such (hardware) NUMA configurations where HelenOS is able to run without any NUMA support. This means that the address space is contiguous across individual nodes and the node boundary is transparent to the OS (i.e. information about memory distribution on the nodes is obtained through other channels and is not necessary for correct memory usage and management), the same holds for CPUs, and also that the hardware does not require any special instructions to ensure cache coherency.
And, of course, the main goal is to defend my diploma thesis about
making HelenOS NUMA-aware.
My supervisor for the thesis is Mr. Děcký.
Now, a few words about the simulators I use for development (so far I have not tried to run HelenOS on physical NUMA hardware). Most of the time, I use QEMU, which is able to simulate a NUMA machine using the -numa switch. Because it took me some time to understand the syntax of this switch, I post here an example command that launches a machine with 8 CPUs and 512M of memory, both CPUs and memory distributed evenly across 4 nodes:
qemu-system-x86_64 -m 512 -smp 8 \
    -numa node,mem=128,cpus=0-1 \
    -numa node,mem=128,cpus=2-3 \
    -numa node,mem=128,cpus=4-5 \
    -numa node,mem=128,cpus=6-7 \
    -boot d -cdrom ./image.iso
The second simulator I sometimes use is SimNow, created by AMD, which allows you to interactively build your own machine (including configuration of low level stuff such as northbridges and various other controllers) and run it. The downside is that the simulation is very precise (very detailed) and thus extremely slow.
So, this was a bit of the background of the work, and now about what was actually added to the HelenOS source code. My branch is registered at Launchpad under the HelenOS team as
bzr branch lp:~vojtech-horky/helenos/numa
If you want to try it, go ahead: check out the branch, in the configure dialog select amd64 from the preconfigured defaults, and do not forget to check the NUMA support option (I added several other options which are useful for debugging; I will remove them later).
However, do not be disappointed after the machine boots up. There is nothing shiny and new to be seen - all changes are quite low level and currently the only way to see them is to try the numa demo application.
That was the theoretical background. Now a brief summary of what is actually implemented.
First of all, the hardware detection uses the ACPI tables (SRAT and SLIT) and assigns a node id (node index) to each CPU.
With memory, it is a bit more complicated.
I decided not to add any other layer above the existing zones (because that would mean changing a lot of code and, furthermore, each node is usually formed by a single zone) and only split the zones that run across node boundaries. On amd64, this typically means that only on the first node is the memory split into several zones (some of them reserved) and all other nodes have only a single zone.
And that is actually all.
There were several problems when reading the topology information from ACPI (I might write about that later), but it works pretty well now.
Next, I made some changes to the kernel slab allocator. Here, I decided to add an extra layer to the allocator that groups the magazines of CPUs belonging to the same node (if you want to know the details of the implementation, the allocator in HelenOS is very similar to the one introduced by Jeff Bonwick). This is, of course, enabled only in a NUMA build (#ifdefs are used).
So far, I have not done any benchmarking so I cannot report whether
the change was worth the effort.
To allow more precise thread placement on nodes, I added support for thread affinity. It is nothing special: each thread has a bitmap of CPUs, where setting/clearing a bit allows/forbids the thread to run on the respective CPU. The load balancer respects this setting, and when the scheduler finds out that it is supposed to schedule a thread that cannot run on the current CPU, it migrates it away. By default, a thread can run on all CPUs.
For address space areas, I added a similar thing plus something that could be called an allocation policy. The affinity of an AS area determines from which nodes the frames that back the area may be allocated, and the policy specifies how the allocation shall be done. So far, there is first touch (the default: allocate from the node where the thread that caused the fault-in is currently running) and interleave (allocate from all nodes in the affinity mask evenly, with precise ordering).
The last thing in the kernel is automatic page migration (by the way, this is something I am really looking forward to measuring). What is it about, in a few words? When a thread is migrated on a UMA system, its distance to its memory does not change. However, when it is migrated to a different node on a NUMA system, its memory is suddenly much farther away. So, the obvious remedy is to migrate the memory as well. However, copying a huge amount of memory takes some time and it also may not be the best solution for multithreaded programs. Thus, I implemented something that could be called lazy migration: the pages are simply marked as not present and when they are accessed, they are migrated if necessary. As I stated at the beginning of this paragraph, I am very curious about this and I will definitely write more about it.
And that is all for the kernel side. Still missing is better load balancing that respects the fact that it is more expensive to migrate a thread to a different node. And, of course, I need to do some benchmarking. That will be real fun, because I do not have many ideas how to create fair benchmarks that would measure performance gains and losses when the difference between accessing local and remote memory is almost imperceptible.
For userspace, there is an almost complete port of the libnuma API (the one from Linux). The implementation was written from scratch, but the API shall be compatible with the Linux version. There are, of course, some changes due to the fact that Linux expects a process tree (i.e. the parent/child relation between processes is well defined and widely used), but the core works.
If you want to see it, the already mentioned numa application has several tests and samples to show what is ready. What is still missing is a port of numactl(1), which is used to set policies and affinities of new tasks. The problem is again that it expects an exec() style of launching new tasks.
This will probably be solved by adding some optional methods to the task-launching interface.
Uff, that is all for now. I will try to write this kind of "progress report" on a regular basis.