Introduction to the Go Runtime Scheduler

The Go runtime scheduler is at the heart of what gives Go its strong performance for highly I/O-bound programs: tens or even hundreds of thousands of goroutines can run concurrently. It is not necessary to understand how the runtime scheduler works to take advantage of its power, but understanding it can help you take even further advantage. For instance, it is a common misconception that the number of goroutines should be kept low for higher performance, roughly matching the number of threads you might use in a multithreaded program. This is not the case. Let's dive into it.

Goroutines

A goroutine is a lightweight thread of execution managed by the Go runtime, which is compiled into every Go program. This means goroutines are managed entirely in userspace and not by the operating system. You may have heard other names for this model, such as green threads or M:N threading, where many userspace threads are multiplexed onto a much smaller set of operating system threads. In Go you can think of the "N" in M:N threading as the G (goroutine).
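
To get a feel for how cheap goroutines are, here is a small, self-contained sketch (the count of 100,000 and the squaring "work" are arbitrary) that spawns far more goroutines than any machine has OS threads:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 100_000 // arbitrary; far more goroutines than OS threads

	var wg sync.WaitGroup
	results := make([]int, n)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = i * i // trivial stand-in for real work
		}(i)
	}

	wg.Wait()
	fmt.Println("last result:", results[n-1])
}
```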

Goroutine Scheduler

The scheduling system in Go has three main components: the processor (P), the goroutine (G) and the machine (M). Goroutine scheduling could be more accurately described as M:P:G threading rather than M:N threading, because there is an intermediate abstraction (P) for mapping userspace runtime threads (G) onto real OS threads (M). The machine (M) is an operating system thread and may have at most one goroutine executing on it at a time; the Go runtime manages a pool of Ms. A processor (P) is attached to a specific M. The number of Ps is controlled by GOMAXPROCS (an environment variable, which can also be changed at runtime), and it defaults to the number of CPUs available to the program. This minimises the amount of context switching between real threads done by the OS, thus enabling higher utilisation.

The P holds the state used to decide which G will run on the M, and the main mechanism for this is the local run queue (LRQ) that each P holds. When deciding which G to run next on the M, the P checks which G is next in its LRQ. There is also a global run queue (GRQ): once every 61 scheduler ticks, the P will schedule the next G from the GRQ onto the M instead. This is done to ensure fairness.
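
The value of GOMAXPROCS can also be inspected from inside a program via the runtime package. A quick way to see the numbers on your machine (the output will of course differ between machines):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Passing 0 to GOMAXPROCS queries the current value without changing it.
	fmt.Println("GOMAXPROCS (number of Ps):", runtime.GOMAXPROCS(0))
	fmt.Println("NumCPU:", runtime.NumCPU())
	fmt.Println("NumGoroutine:", runtime.NumGoroutine())
}
```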

The primary way Gs enter a P's LRQ is when the G running on that P's M spawns a new G. The currently running G will be preempted by the runtime after 10ms and put to the back of the queue. Whenever the LRQ becomes empty, the P will attempt to find new Gs to run. At the time of writing, the scheduler tries to find a runnable G by first checking the GRQ, then the network poller, and then resorting to work stealing if neither of the previous steps yields any runnable Gs.
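
The effect of preemption is easy to observe. The following small experiment (which assumes Go 1.14 or newer, where preemption is asynchronous and can interrupt tight loops; the sleep duration is arbitrary) pins the program to a single P and shows that a busy loop cannot starve other goroutines:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Force a single P so both goroutines compete for the same LRQ.
	runtime.GOMAXPROCS(1)

	go func() {
		for {
			// CPU-bound loop with no function calls or blocking operations.
		}
	}()

	// Because the busy loop is preempted by the runtime, this goroutine is
	// still scheduled and the program terminates.
	time.Sleep(50 * time.Millisecond)
	fmt.Println("main goroutine ran despite the busy loop")
}
```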

The work-stealing mechanism has some interesting aspects. First, the work-stealing loop may run up to four times if it is not successful. Second, on the first iteration the P will try to run any ready timers on every other P and then check its own LRQ again, in case those timers caused any ready Gs to be enqueued onto it. After running timers and checking its own LRQ, the P attempts to steal half of the Gs from another P's LRQ. If no Gs are available after four iterations, the scheduler returns to the start of the scheduling process: checking its own LRQ, then the netpoller and the GRQ.
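
To make the order concrete, here is a hypothetical, heavily simplified model of the lookup just described. None of these types, variables or functions (g, p, grq, allP, pop, findRunnable) exist in the real runtime, and the actual scheduler logic is far more involved; this sketch only mirrors the order: GRQ once every 61 ticks, then the LRQ, then the GRQ, then the netpoller, then up to four rounds of stealing half of a victim's LRQ.

```go
package main

import "fmt"

// Hypothetical model only; the real runtime uses different types and logic.
type g struct{ id int }

type p struct {
	tick int
	lrq  []*g // local run queue
}

var (
	grq  []*g // global run queue
	allP []*p // every P; used as steal victims
)

// pop removes and returns the first G in a queue, or nil if it is empty.
func pop(q *[]*g) *g {
	if len(*q) == 0 {
		return nil
	}
	head := (*q)[0]
	*q = (*q)[1:]
	return head
}

// findRunnable mirrors the lookup order described in the article.
func (pp *p) findRunnable() *g {
	pp.tick++
	if pp.tick%61 == 0 { // fairness: check the GRQ every 61 ticks
		if gp := pop(&grq); gp != nil {
			return gp
		}
	}
	if gp := pop(&pp.lrq); gp != nil {
		return gp
	}
	if gp := pop(&grq); gp != nil {
		return gp
	}
	// A netpoller check would happen here in the real runtime.
	for attempt := 0; attempt < 4; attempt++ {
		for _, victim := range allP {
			if victim == pp || len(victim.lrq) == 0 {
				continue
			}
			// Steal half of the victim's LRQ, then run one of the stolen Gs.
			half := (len(victim.lrq) + 1) / 2
			pp.lrq = append(pp.lrq, victim.lrq[:half]...)
			victim.lrq = victim.lrq[half:]
			return pop(&pp.lrq)
		}
	}
	return nil // nothing runnable anywhere; the M would go idle
}

func main() {
	p0 := &p{}
	p1 := &p{lrq: []*g{{id: 1}, {id: 2}, {id: 3}, {id: 4}}}
	allP = []*p{p0, p1}

	for gp := p0.findRunnable(); gp != nil; gp = p0.findRunnable() {
		fmt.Println("p0 runs G", gp.id)
	}
}
```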

When a G is executing on an M, there are several mechanisms that will stop it from running so that another G can run. Gs may be preempted after 10ms, as already mentioned, or a G may simply finish executing and have nothing left to run. There is also a group of operations that cause the G to be dropped by the M: these include syscalls and other blocking operations such as waiting on timers and reading from or writing to channels. Syscalls are treated differently from blocking operations controlled by the runtime. For runtime-controlled blocking operations such as a timer, the G is parked and re-queued when the operation completes. Blocking syscalls are handled in a special way, while non-blocking syscalls (network I/O) are handled by a component called the netpoller, which is discussed in a section of its own.
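
Blocking on channels and timers is handled entirely by the runtime: the G is parked in userspace and no OS thread sits waiting on its behalf. A small illustration of this (the counts and durations are arbitrary):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	block := make(chan struct{})

	for i := 0; i < 10_000; i++ {
		wg.Add(2)
		go func() { // parked by the runtime on a timer
			defer wg.Done()
			time.Sleep(100 * time.Millisecond)
		}()
		go func() { // parked by the runtime on a channel receive
			defer wg.Done()
			<-block
		}()
	}

	time.Sleep(50 * time.Millisecond)
	// Roughly 20,000 goroutines are now parked, yet no OS threads are tied
	// up waiting for them: parking and waking happen in userspace.
	fmt.Println("goroutines:", runtime.NumGoroutine())

	close(block)
	wg.Wait()
}
```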

When a blocking syscall occurs, the processor becomes detached from the machine and is attached to a new machine, while the blocked goroutine stays on the original machine executing the syscall. Once the syscall completes, the G is detached from the M and placed back onto the LRQ of the P it originated from, and the M is added to the list of free Ms. The G may fail to be added back to its old P, in which case the runtime will try to add it to the LRQ of any idle P. If that also fails, it is added to the GRQ.
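
One rough way to observe this is Go's built-in "threadcreate" profile, which counts OS thread creations. The sketch below fires off file reads, which go through blocking syscalls; depending on how long those syscalls block, the runtime may spin up extra Ms (on a fast local filesystem you may see little or no change, so treat this purely as an illustration). The path "/etc/hosts" is just an assumed file that exists on most Unix-like systems.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"runtime/pprof"
	"sync"
)

func main() {
	threads := pprof.Lookup("threadcreate")
	fmt.Println("OS threads created so far:", threads.Count())

	// File reads go through blocking syscalls. If a syscall blocks for long
	// enough, the runtime detaches the P from the blocked M and hands it to
	// another (possibly newly created) M so that other Gs keep running.
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = ioutil.ReadFile("/etc/hosts") // errors ignored for brevity
		}()
	}
	wg.Wait()

	fmt.Println("OS threads created so far:", threads.Count())
}
```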

The Network Poller

The network poller (netpoller) handles non-blocking I/O. Goroutines are handed off to the netpoller, which manages a set of file descriptors to be "polled" for readiness. The mechanism behind the polling is mostly operating-system specific, with some OS-independent code on top.

The netpoller runs as a background thread that periodically asks the underlying OS which file descriptors have finished their I/O. Since this background thread has no P associated with it, any goroutines that become ready during these periodic checks are added to the global run queue. The netpoller is also consulted at special points during scheduling as an optimisation. One such case is when a P's LRQ has run out of goroutines: the netpoller is checked for ready goroutines, and these are put onto that P's LRQ before it attempts to steal work from another P's LRQ.
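
From user code none of this is visible: goroutines simply block on network calls and are parked with the netpoller until their sockets are ready. A minimal sketch (the URL, request count and timeout are arbitrary):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// While the request waits on the network, this G is parked with
			// the netpoller rather than tying up an OS thread.
			resp, err := client.Get("https://example.com")
			if err != nil {
				fmt.Println("request", i, "failed:", err)
				return
			}
			resp.Body.Close()
		}(i)
	}

	wg.Wait()
	fmt.Println("all requests finished")
}
```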

The netpoller uses different mechanisms depending on which OS it is running on. On Linux, the Go runtime uses the "epoll" API, described by the Linux man-pages as an "I/O event notification facility". On macOS the Go runtime uses a similar facility called "kqueue". How these operating system APIs work and how the Go runtime integrates with them is out of scope for this article.

Summary

The Go runtime scheduler is designed so that scheduling decisions can be distributed among a set of what we call "processors". This is done to ensure the scheduler can scale to a large number of CPUs. We have also seen that the scheduler employs a number of mechanisms to ensure fairness, including work stealing, preemption and periodically running Gs from the GRQ instead of always running from the LRQ. The purpose of such a userspace threading mechanism is to maximise CPU utilisation for massively concurrent workloads. The Go runtime's handling of non-blocking I/O allows a large number of Gs to wait on network calls while still leaving the CPUs free for Gs that can execute work. The combination of all these features is a large factor in the rise of Go's use in the Cloud Native space, where programs often need to handle a large number of incoming and outgoing network connections.

Written on January 3, 2021