There is a feature in the Linux kernel, relevant to VMs hosted on Xen servers, called the “steal percentage.” When the guest OS requests CPU time from the host system and the host CPU is tied up with another VM, the Xen server sends an increment to the guest Linux instance, which increases the steal percentage. This is a great feature because it shows how busy the host system is, and it is available on many AWS instances because they are hosted on Xen. It is said that Netflix will terminate an AWS instance when the steal percentage crosses a certain threshold and start it up again, which causes the instance to spin up on a new host server, as a proactive step to ensure they are using their resources to the fullest.
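On a Linux guest, steal time is exposed as one of the cumulative counters on the aggregate cpu line of /proc/stat, and the percentage is derived from the change between two readings. The sketch below shows one way to do that calculation. The function names and the sample numbers are made up for illustration, and this is not the code our agent uses.

```python
def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def steal_percent(before, after):
    """Steal time as a percentage of all CPU time between two samples."""
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user/nice, so only the first
    # eight fields are summed.
    old = parse_cpu_line(before)[:8]
    new = parse_cpu_line(after)[:8]
    deltas = [n - o for o, n in zip(old, new)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line on a live Linux system."""
    with open(path) as stat_file:
        return stat_file.readline()


# Hypothetical samples taken a short interval apart.
sample_before = "cpu  1000 0 500 8000 100 0 50 350 0 0"
sample_after = "cpu  1100 0 550 8700 110 0 55 485 0 0"
example_steal = steal_percent(sample_before, sample_after)
```

On a real instance you would call read_cpu_line() twice, a second or so apart, and pass both results to steal_percent().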
What I want to discuss here is a bug in Linux kernel versions 4.8, 4.9 and 4.10: the steal percentage can be corrupted during a live migration on the physical Xen server, which causes the agent to report CPU utilization as 100%.
When looking at Top you will see something like this:
As you can see in the screenshot of Top, the %st metric on the CPU(s) line shows an obviously incorrect number.
During a live migration on the physical Xen server, the steal time gets slightly out of sync and ends up being decremented. If the time was already at or close to zero, it becomes negative and, due to type conversions in the code, overflows.
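To see why a small negative value turns into an enormous one, here is a minimal sketch of what happens when a negative number is reinterpreted as an unsigned 64-bit integer. The 64-bit width and the sample values are assumptions for illustration; this is not the kernel's actual code path.

```python
# Assumption: the counter ends up in an unsigned 64-bit field.
U64_MASK = (1 << 64) - 1


def as_unsigned_64(value):
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value & U64_MASK


steal_before = 3   # hypothetical steal time, already close to zero
adjustment = -10   # hypothetical out-of-sync decrement after a migration

# 3 - 10 = -7, which wraps around to 2**64 - 7 when treated as unsigned.
corrupted = as_unsigned_64(steal_before + adjustment)
```

Any percentage calculated from a counter that has wrapped like this will be wildly wrong.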
CloudWatch’s CPU Utilization monitor calculates that utilization by adding the System and User percentages together. However, this only gives a partial view into your system. With our agent, we can see what the OS sees.
That is the Steal percentage spiking due to that corruption. Normally this metric could be monitored and actioned as desired, but with this bug it causes noise and false positives. If Steal were legitimately high, then the applications on that instance would be running much slower.
There is some discussion online about how to fix this issue, and there are some kernel patches that effectively say “if the steal time is less than zero, just make it zero.” Eventually this fix will make it through the Linux releases and into the latest OS versions, but until then it needs to be dealt with.
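The idea behind those patches can be sketched in a couple of lines. This is only an illustration of the clamping logic in Python with a hypothetical function name, not the kernel patch itself.

```python
def clamp_steal(steal_time):
    """Treat a negative steal time as zero, leaving valid values untouched."""
    return max(steal_time, 0)


# A negative value (for example after an out-of-sync migration) becomes zero,
# so it can no longer wrap around into a huge number further down the line.
clamped_negative = clamp_steal(-7)
clamped_normal = clamp_steal(25)
```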
We have found that a reboot will clear the corrupted percentage. The other option is to patch the kernel… which also requires a reboot. If a reboot is just not possible at the time, the only impact to the system is that it makes monitoring the steal percentage impossible until the number is reset.
It is not a very common issue, but due to the large number of instances we monitor here at 2nd Watch, it is something that we’ve come across frequently enough to investigate in detail and develop a process around.
If you have any questions as to whether or not your servers hosted in the cloud might be affected by this issue, please contact us to discuss how we might be able to help.
-James Brookes, Product Manager