Friday, May 23, 2014

Adventures with SQL Server & perfmon Part 2: TokenAndPermUserStore & spinlock suspects

Sick day from work, so I just might finish two blog posts in one day :-)

This is the same system from the previous post about query memory.
Adventures with SQL Server & perfmon Part 1: Resource Governor & Query Memory

It's from the following week, with a very similar query batch executing on the system.  Because the cores are licensed for SQL Server, and part of my job is making sure the system delivers value for its cost, I want to make sure that: 1) the cores are doing meaningful work, and 2) the cores are kept busy.  In the last post, I looked at a few time periods where CPU busy percent slacked off a bit, and it was easy to see that pending query memory grants were limiting query concurrency at those points.  And I put the other concern off to the side... two obvious time periods where SQL Server %CPU busy seriously broke away from the number of logical IOs performed in query execution.

Let's take a look at that now, on this later day on the same system.







CPU (Red) = \Process(sqlservr)\% Processor Time
Buffer Page Lookups (Blue) = \SQLServer:Buffer Manager\Page lookups/sec

Still two extended periods, highlighted in green above, where CPU utilization by SQL Server is trending far above the logical IO performed by query worker threads.

When I see that for an extended period of time, I first try to determine if spinlock contention is the culprit.  I've often heard from others that "spinlock contention is rarely the cause of extended high CPU utilization."  That may be true on most SQL Server systems... maybe even almost all SQL Server systems.  But I'll tell ya... I've had a pretty good run at finding and addressing spinlock issues.  And the first sign I look for is time periods with a significant disconnect between CPU utilization and database logical IO, as indicated in the graph above.

As luck would have it, over the last year many of the spinlock contention situations I've looked at have been accompanied by CMEMTHREAD waits.  Fortunately for me, these can actually be trended via the perfmon "SQLServer:Wait Statistics" object, which refers to them as "Thread-safe memory objects waits".  For lots of good information about CMEMTHREAD waits, see this CSS Engineers Blog Post from Rohit Nayak:
How It Works: CMemThread and Debugging Them
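
By the way, if perfmon isn't handy, the same waits can be sampled from the DMVs.  A minimal sketch (it needs VIEW SERVER STATE, and the numbers are cumulative since instance startup, so you have to difference two samples to get a trend):

-- cumulative CMEMTHREAD waits since startup; difference two samples for a rate
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'CMEMTHREAD';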

This particular system already has trace flag 8048 active, to mitigate query memory allocation spinlock contention by promoting the memory serialization resource to per-CPU rather than per NUMA node/scheduler group. 
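
A quick way to double-check which global trace flags are actually active on an instance, rather than relying on memory of what's in the startup parameters:

-- all globally enabled trace flags
DBCC TRACESTATUS(-1);
-- or just the one in question
DBCC TRACESTATUS(8048, -1);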





CPU (Red) = \Process(sqlservr)\% Processor Time
CMEMTHREAD (Medium Blue) = \SQLServer:Wait Statistics(Waits started per second)\Thread-safe memory objects waits
Requests Completed (Dark Blue) = \SQLServer:Workload Group Stats(RG1)\Requests completed/sec, cumulative
Object Plan Pages (Light Blue) = \SQLServer:Plan Cache(Object Plans)\Cache Pages

So this is the part where I omit a lot of the bumbling and fumbling I do while trying to figure out which measurements might be relevant to the behavior I'm trying to understand.  Trust me, there's a lot of bumbling.  But in this case I eventually settled on including CMEMTHREAD waits started, cumulative requests completed for the Resource Governor workload group, and Plan Cache object plan pages.

In the first time period, from 9:00 am to almost 10:15 am, there are a ton of CMEMTHREAD wait starts.  I know that they often accompany spinlock activity.  And in the recesses of my mind I remember... wasn't this the system I'd always heard rumors about, with server restarts, SQL Server restarts, and sometimes failovers to address an undiagnosed performance concern?  That's why I decided to throw the object plan pages into the graph, after looking at how other components of "\SQLServer:Plan Cache" fit the mold.
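
For what it's worth, the same Plan Cache counters can be pulled from sys.dm_os_performance_counters if you want to poke at the individual cache stores between perfmon collections.  A minimal sketch:

-- Cache Pages for each Plan Cache component (Object Plans, SQL Plans, etc.)
SELECT object_name, instance_name, cntr_value AS cache_pages
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Cache Pages'
  AND object_name LIKE N'%Plan Cache%'
ORDER BY instance_name;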

So a significant drop in object plan pages seems to accompany a better fit (for a while) of CPU utilization to database logical IO.  Hmmmm...

My first thought was that we should monitor the workload after running "dbcc freeproccache" next time around.  But I also did a lot of research.  And what I came up with was the TokenAndPermUserStore cache.

Queries take a longer time to finish running when the size of the TokenAndPermUserStore cache grows in SQL Server 2005
http://support.microsoft.com/kb/927396
Potential workarounds suggested:
FORCE PARAMETERIZATION database option
Add user account to sysadmin server group
Periodic DBCC FREESYSTEMCACHE ('TokenAndPermUserStore')

A gradual increase in memory consumption for the USERSTORE_TOKENPERM cache store occurs in SQL Server 2005
http://support.microsoft.com/kb/933564
This KBa introduces trace flag 4618, which can be used to limit the size of the TokenAndPermUserStore cache.

How to customize the quota for the TokenAndPermUserStore cache store in SQL Server 2005 Service Pack 3
http://support.microsoft.com/kb/959823
This KBa introduces trace flag 4610.  Alone, trace flag 4618 limits TokenAndPermUserStore to 1024 entries.  Add trace flag 4610, and that increases to 8192 entries.
This KBa also introduces trace flag 4621, which works with Windows registry setting "TokenPermQuota" to control the size of TokenAndPermUserStore.  Add trace flag 4610 to 4621, and the "TokenPermQuota" should increase by a factor of 8.

Query Performance issues associated with a large sized security cache
http://blogs.msdn.com/b/psssql/archive/2008/06/16/query-performance-issues-associated-with-a-large-sized-security-cache.aspx
I always love reading what the CSS engineers have to say :-)

How is anyone supposed to know how to use trace flag 4621?
http://www.techurbia.com/2009/03/how-is-anyone-supposed-to-know-how-to-use-trace-flag-4621.html
This is a great blog post... my favorite line being:
"Oh wow is this not easy stuff to wade through.."

Finally... 
Description of the "access check cache bucket count" and "access check cache quota" options that are available in the sp_configure stored procedure
http://support.microsoft.com/kb/955644
OK... this is the most relevant to SQL Server 2008 R2, 2012, & 2014... the versions I spend most of my time with.  Those versions are actually listed in the "Applies to" section.

So what to do?  In this case, we'll start with monitoring the cache size from the DMVs over time.  After getting some baseline data of the cache size and additional perfmon similar to that above, we'll put in some timed executions of the following and see how that changes the mix:
DBCC FREESYSTEMCACHE ('TokenAndPermUserStore')
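
For the monitoring piece, here's a minimal sketch of the kind of DMV query I have in mind.  TokenAndPermUserStore shows up in sys.dm_os_memory_clerks as the USERSTORE_TOKENPERM clerk; on SQL Server 2012 and later the size is in pages_kb, while on 2008 R2 you'd sum single_pages_kb + multi_pages_kb instead:

-- current size of the security token cache (one row per memory node, so sum them)
SELECT GETDATE() AS sample_time,
       name,
       SUM(pages_kb) AS pages_kb
FROM sys.dm_os_memory_clerks
WHERE [type] = N'USERSTORE_TOKENPERM'
GROUP BY name;

Logged to a table on a schedule, that gives the baseline to compare against the perfmon captures.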

Hopefully by doing that, we can find a reasonable size limit for TokenAndPermUserStore.  In the worst case, clearing this cache won't improve performance and I'll have to come up with a new theory.  But if my theory holds and we find a reasonable limit, we should be able to use "access check cache bucket count" and "access check cache quota" to implement the limit, rather than the Windows registry setting and/or any of the trace flags.  Bet you thought you'd never find me looking for a way NOT to implement a trace flag! :-)
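
If it comes to that, the change itself is just sp_configure; both settings are advanced options.  A sketch only, and the values below are placeholders for illustration, not a recommendation:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- bucket count and quota values here are illustrative only
EXEC sp_configure 'access check cache bucket count', 256;
EXEC sp_configure 'access check cache quota', 1024;
RECONFIGURE;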

Let's look at this graph again, because it was a long way up there in this blog post and I've got a bit more to talk about.



All things considered, I feel really good about my hunch for the approximately 9:00 am - 10:05 am query slowdown.  If it turns out I'm wrong I'll fess up and share what I learn.

But what about 11:00 am - 11:15 am?  (For some reason Blogger would move the green box over if I tried to box only 11:00 - 11:15, so the box is a little bigger than the time period I'm concerned with.)  Way fewer CMEMTHREAD waits at that time.  And even though Object Plan pages is really just a proxy for TokenAndPermUserStore, which I suspect is the more relevant cache... I think that when we monitor TokenAndPermUserStore size, we'll see it's not involved in that second slowdown.

My first rule of spinlocks: get the big ones out of the way before setting your heart on diagnosing secondary conditions :)  But I'd better come up with some ideas now that I can pursue once the first spinlock issue is out of the way.  If by some miracle solving the first time period also gets rid of the second, great!  If not, I'll have a hunch to pursue from the get-go.

OK... I added the cumulative "requests completed" to the graph for this reason: from 11:00 am to 11:15 am, the pace of request completion is noticeably higher than it was from 9:00 am to 10:05 am.

To me, this suggests two things: the rate of incoming connections was greater, and the rate of plan compilation was greater.  Unfortunately I didn't capture "Logins/sec".

That's too bad.  It's a pretty good suspect because of this:
Performance problems occur when database lock activity increases in SQL Server
http://support.microsoft.com/kb/2926217
This SQL Server 2012 KBa introduces trace flag 1236 (SQL Server 2012 SP1 CU9 & beyond), which partitions the previously global resource used for SH-DATABASE locks when establishing database context.
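
If we end up going that route once the right CU is in place, my understanding from the KB is that trace flag 1236 goes in as a startup parameter (added via SQL Server Configuration Manager) rather than being turned on with DBCC TRACEON; something like:

-- add to the SQL Server startup parameters, then restart the instance
-T1236
-- after the restart, confirm it took effect
DBCC TRACESTATUS(1236, -1);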

But plan compiles are also a potential suspect, because of this SQL Server 2012 SP1 CU10 fix.
Intense query compilation workload does not scale with growing number of cores on NUMA hardware and results in CPU saturation in SQL Server 2012
http://support.microsoft.com/kb/2928300

So if I get rid of "Object Plan Pages", which was really just a proxy anyways, and "Requests Completed", which was another proxy... I can add "\SQLServer:SQL Statistics\SQL Compilations/sec" to the graph, hopefully without the graph becoming too busy to be useful.

 

Uh-huh.  During 11:00 am to 11:15 am, there certainly is a decrease in CMEMTHREAD wait activity, but it's still present to a degree.  And there is more compile activity.  So on this system, we'll address the TokenAndPermUserStore cache, or end up exonerating it and looking for another suspect.  And once SQL Server 2012 SP1 CU10 is in place, concurrent query compiles will benefit from kb2928300, and trace flag 1236 can be used to reduce spinlock contention during heavy concurrent logins to the same database.

I'll let you know how it goes. 

****
Same day update :-)

Might as well throw these graphs in here, too.  Differential spinlock information is captured every 5 minutes.  These are the only spinlocks of interest numerically in the data.  Note that the primary and secondary axes for the spins graph differ by a factor of 2000, while for the backoffs they differ only by a factor of 40.
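
In case anyone's curious, here's a minimal sketch of how those differentials get captured; the table name spinlock_history is just an illustration, an Agent job runs the insert every 5 minutes, and the LAG() differencing assumes SQL Server 2012 or later:

CREATE TABLE dbo.spinlock_history
    (sample_time datetime NOT NULL,
     name        nvarchar(256) NOT NULL,
     collisions  bigint, spins bigint, sleep_time bigint, backoffs bigint);

-- scheduled every 5 minutes: snapshot the cumulative spinlock counters
INSERT INTO dbo.spinlock_history (sample_time, name, collisions, spins, sleep_time, backoffs)
SELECT GETDATE(), name, collisions, spins, sleep_time, backoffs
FROM sys.dm_os_spinlock_stats;

-- deltas between consecutive samples
SELECT sample_time, name,
       spins    - LAG(spins)    OVER (PARTITION BY name ORDER BY sample_time) AS spins_delta,
       backoffs - LAG(backoffs) OVER (PARTITION BY name ORDER BY sample_time) AS backoffs_delta
FROM dbo.spinlock_history
ORDER BY name, sample_time;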

 





 








