Adventures in AIX CPU babysitting
In a recent adventure, I became interested in measuring CPU usage on AIX. I don't have the benefit of professional help or third-party monitoring. I also rarely trust either of these things when they're available, because either one can have trouble describing what's actually being measured.
On any system, there are lots of pitfalls to consider when a layman starts wondering about CPU usage. On AIX, there are some unique pitfalls.
What's being measured?
To determine the health of a system, you might quickly check the overall CPU usage to make sure it's within some tolerance. But once a system is unhealthy, you'll often quickly need to focus on individual process usage.
Sometimes, the process-level statistics are presented in ways very different from the system-wide. Sometimes this is simply because different tools are used -- whether the calculations, meanings of fields, or just the sampling intervals are different.
Something as simple as "vmstat" vs. "vmstat 1 1" will give you drastically different results during a casual investigation.
The latter tells you how the CPU was used in the last second; the former tells you how it was used on average since the system booted. Comparing ps -A -o ...,pcpu output to topas output can produce the same counter-intuitive differences -- %CPU in ps is something closer to time spent running on the CPU divided by process lifetime.
On Linux, %CPU in /usr/bin/ps can be very different from the same figure in top for a long-running process on a long-running system. The same relationship can be seen on AIX between /usr/bin/ps and topas (the top alternative for AIX).
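To make the difference concrete, here is a minimal sketch of the two calculations. The function names and the sample numbers are made up for illustration, and the formulas are only my approximation of what ps-style and top-style tools do, not their actual implementations.

```python
def lifetime_pcpu(cpu_seconds, elapsed_seconds):
    """Average %CPU since the process started (roughly the ps-style figure)."""
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_pcpu(cpu_before, cpu_after, interval_seconds):
    """%CPU over a recent sampling window (roughly the top-style figure)."""
    return 100.0 * (cpu_after - cpu_before) / interval_seconds


# Hypothetical process: running for 10 days with 1 hour of CPU time in total,
# but it burned 9 CPU-seconds during the most recent 10-second sample.
elapsed = 10 * 24 * 3600
cpu_total = 3600.0

print(f"lifetime: {lifetime_pcpu(cpu_total, elapsed):.1f}%")
print(f"interval: {interval_pcpu(cpu_total - 9.0, cpu_total, 10.0):.1f}%")
```

The same process looks nearly idle by the lifetime measure and nearly pegged by the interval measure, which is exactly the mismatch you see between ps and top or topas.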
The kissing cousin of this pitfall is memory usage in individual processes vs. system-wide. The latter is notoriously tricky due to how aggressively operating systems use (non-dirty) buffers when there is no memory pressure.
That depends on the definition of 100%
Generally, 100% CPU usage means 1 CPU core tied up. A multi-threaded application, or a handful of processes, can show 400% CPU on a 4-core system.
When you add AIX virtualization to the mix, this not only gets more complicated but might vary from tool to tool.
AIX has a concept of "entitled cores" which is, in more accessible terms, the guaranteed amount of CPU available. vmstat output will tell you both how many physical cores (in fractional increments) were used in the interval and the percent of entitled capacity used.
If an AIX instance is "uncapped", the number of physical cores used at any given time can grow beyond the entitled capacity. vmstat will report the percent of entitled capacity consumed ("ec") as greater than 100% when you're in this territory.
Consider this overloaded server:
$ vmstat 1 1
System configuration: lcpu=8 mem=4096MB ent=0.20
... us sy id wa pc ec
... 17 47 36 0 0.41 204.2
To translate, our server is using 0.41 physical cores (pc), which is over 200% of our entitled capacity (ent=0.20). But notice the user and system CPU usage -- it's clearly relative to the "instantaneous" physical cores being used -- they are well under 100% total.
It's quite easy to misinterpret this as "just" 64% CPU usage if you are looking at the portable fields of the output, or in a similar way the "portable" fields in topas output.
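The arithmetic behind that sample can be sketched as follows. This assumes, as the output above suggests, that us and sy are shares of the physical cores consumed (pc) rather than of the entitlement; the helper names are mine, and only the CPU columns shown above are parsed.

```python
def parse_vmstat_cpu(header, values):
    """Pair vmstat column names with the numbers on a sample line."""
    names = header.split()
    nums = [float(v) for v in values.split()]
    return dict(zip(names, nums))


def busy_cores(sample):
    """Physical cores spent in user+system, assuming us/sy are shares of pc."""
    return (sample["us"] + sample["sy"]) / 100.0 * sample["pc"]


# Only the CPU columns from the sample; the elided leading columns are left out.
sample = parse_vmstat_cpu("us sy id wa pc ec", "17 47 36 0 0.41 204.2")
ent = 0.20  # from the "System configuration" line

cores = busy_cores(sample)
print(f"busy cores: {cores:.2f}")
print(f"busy vs entitlement: {100.0 * cores / ent:.0f}%")
```

Under that assumption, the "64%" works out to roughly 0.26 physical cores, or about 131% of what the partition is actually entitled to, which is a very different story from "64% busy".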
Takeaway:
On AIX at least, "64% CPU" in a vacuum is almost meaningless. This could mean 64% of a capped fraction of a processor (and who decided that fraction was enough?), or it could mean 64% of a system running at double its entitled capacity.
Bluffer's guide:
On AIX, here's a good beginner's guide to getting your head around a CPU problem:
- Run lparstat to determine if we are capped to entitled capacity and how much entitled capacity we have. Is the entitled capacity reasonable for the workload?
- Run vmstat with a short interval (e.g. vmstat 1 10) to get a handle on near-instantaneous system-wide CPU usage and the % of entitled capacity consumed
- Identify high-CPU processes with topas -mP 1 or the handy-dandy aixtop ported from a developerWorks sample:
- https://github.com/covener/ihs-tools/blob/master/aixtop.c
- If the system is unexpectedly CPU starved, profile the entire system with tprof -spex sleep 10 and look for hotspots
- Watch CPU time tick up over time with ps -ef
- One glaring omission is ps -A -o ...,pcpu,.... Since pcpu is CPU time averaged over the whole process lifetime, it is not very useful for correlating with other activity or logs.
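The "watch CPU time tick up" step can be turned into a rough number by diffing the TIME column between two ps samples. A minimal sketch, assuming TIME looks like mm:ss, hh:mm:ss, or dd-hh:mm:ss (check what your ps actually prints); both helper names are hypothetical.

```python
def cpu_time_seconds(field):
    """Convert a ps TIME field such as '3:07', '01:02:03' or '2-01:02:03' to seconds."""
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in field.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad a missing hours (or minutes) field with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_share(time_before, time_after, wall_seconds):
    """Percent of one CPU used between two ps samples taken wall_seconds apart."""
    delta = cpu_time_seconds(time_after) - cpu_time_seconds(time_before)
    return 100.0 * delta / wall_seconds


# Hypothetical samples of the same PID taken 30 seconds apart.
print(cpu_share("12:30", "12:57", 30))
```

Unlike pcpu, this only looks at the window between the two samples, so it can be lined up with whatever else was happening in the logs at that moment.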