Category Archives: Windows

AlwaysOn 2016 – Encrypted Databases and Performance

–By Ginger Keys

It is common knowledge that encrypting a database degrades performance on a SQL Server instance. In an AlwaysOn environment performance can be even more sluggish because of the extra element of data replication latency. How much difference does it really make? Of course the answer is “it depends” on your environment and your workload. I was curious to see for myself what kind of performance hit encryption would have on one of my test databases, so this post will look at CPU usage of an encrypted vs. non-encrypted database.

Microsoft says that turning on TDE (Transparent Data Encryption) for a database will result in a 2-4% performance penalty, which is actually not too bad given the benefits of having your data more secure. There is even more of a performance hit when enabling cell-level or column-level encryption. When encrypting any of your databases, keep in mind that the tempdb database will also be encrypted. This could have a performance impact on your other non-encrypted databases on the same instance.
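For reference, enabling TDE on a database generally follows the pattern below. This is just a minimal sketch using the GKTestDB database from this post; the certificate name and password are placeholders, and in an AlwaysOn environment the certificate must also be available on each replica (see my previous post for those details).

USE master;
GO
-- The master key password and certificate name are placeholders for illustration only
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'UseAStrongPasswordHere!';
GO
CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE Certificate';
GO
USE GKTestDB;
GO
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TDECert;
GO
ALTER DATABASE GKTestDB SET ENCRYPTION ON;
GO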

In a previous post I demonstrated how to add an encrypted database to an AlwaysOn group in SQL Server 2016. In this article I will demonstrate the performance effects of having an encrypted database in your AlwaysOn group compared to the same database unencrypted.

I have 3 identical test databases I will use to look at performance metrics.

  • GKTestDB is TDE encrypted, and is part of the AlwaysOn group
  • GKTestDB2 is not encrypted, and is not part of the AlwaysOn group
  • GKTestDB3 is not encrypted, but is part of the AlwaysOn group

Blog_20170713_1

There are some good open source tools for performing stress tests on your SQL database. You can find one here that uses the AdventureWorks2014 database: http://www.sqlballs.com/2016/05/sql-server-random-workload-generator.html. There is also one here that you can use on any database: https://github.com/ErikEJ/SqlQueryStress/wiki. I will be using the SQLQueryStress tool for my demonstration.

Step 1 – Test performance on non-encrypted DB not in AG

To measure performance metrics, create a User Defined Data Collector Set in Performance Monitor (Perfmon). There are many metrics that can be measured, but I will only be looking at CPU % Processor Time.

Blog_20170713_2

Download and open the SQLQueryStress tool, and create a statement to run against your database. In my test I will first look at the performance of running a select query for 5000 iterations on a database that has not been added to the AlwaysOn group, and has not been encrypted: (GKTestDB2)

Blog_20170713_3

Be sure to clear the buffers and cache before performing each stress test. Select your database, the server name, the number of iterations, the number of threads and the delay between queries in milliseconds.
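On my test system, clearing the buffers and cache between runs can be done with the commands below (do not run these on a production server):

CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;   -- flush clean pages from the buffer pool
GO
DBCC FREEPROCCACHE;      -- clear cached execution plans
GO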

Blog_20170713_4

Start your user defined Perfmon data collector set, then start the stress test in the SQLQueryStress tool.

Blog_20170713_5

At the end of each stress test you will need to manually stop your Perfmon collection.

Step 2 – Test performance on encrypted DB in the AlwaysOn Group

Now I will perform the same stress test to see performance on the identical but Encrypted database in the AlwaysOn group (GKTestDB). Be sure to clear the buffers and cache, and change the database in the SQLQueryStress tool.

Blog_20170713_6

Blog_20170713_7

Step 3 – Test performance on non-encrypted DB in the AlwaysOn Group

Just for curiosity’s sake, I’m also going to test the identical database that is not encrypted, but is included in the AlwaysOn group (GKTestDB3):

Blog_20170713_8

Blog_20170713_9

Step 4 – Compare results

I set the output files of my Perfmon results to land in a location on my local drive so that I can open up the results of each test and compare.

The CPU usage for the database not encrypted and not in my AlwaysOn group averaged 43% for the duration of the test, as shown by the dark line on the graph below.

Not Encrypted / Not in AG database CPU usage:

Blog_20170713_10

The CPU usage for the identical database that is encrypted and is in the AlwaysOn group averaged 57.5% during the stress test, as shown on the graph below. This is quite a bit more than the non-encrypted/non-AG database, especially given the simple statement that was being run.

TDE Encrypted / Joined to AG Database CPU usage:

Blog_20170713_11

And finally, the CPU usage for the identical database that is not encrypted, but is included in my AlwaysOn group averaged 43.4%, which is not much different than the non-encrypted/non-AG database above.

Not Encrypted / Joined to AG Database CPU usage:

Blog_20170713_12

Blog_20170713_13

Conclusion

Having an encrypted database creates a noticeable CPU performance hit as compared to a non-encrypted database. Microsoft provides many options for protecting your data, transparent data encryption (TDE) being one of them. The benefits and assurance of securing your data outweigh the performance cost; however, it may be useful to see how much of a performance hit your system will encounter prior to deciding which security options your organization will deploy.

For more information about blog posts, concepts and definitions, further explanations, or questions you may have…please contact us at SQLRx@sqlrx.com. We will be happy to help! Leave a comment and feel free to track back to us. Visit us at www.sqlrx.com!


SQL Server Memory In Task Manager – Tell Your Manager to Not Worry!

-by Ginger Keys

Blog_20160929_1

We are going to re-promote this subject because it is still so relevant. All too often we run into people who are in a panic after looking at Task Manager to try to diagnose SQL stress. They seem to become very worried when looking at how much memory SQL uses, not realizing that this is exactly what is expected.

Task Manager is just that, a task manager. If you are looking at it to get accurate values for memory (RAM) usage or CPU time, you will get not-so-accurate information. False alarms. Information that doesn’t exactly mean what it implies.

Blog_20150507_2

What I usually see when I look at Task Manager is that CPU is low, disk usage is low, and memory used is high – which is what we expect on a SQL machine. SQL will use as much memory as it can get, to improve performance by caching tons of data in memory in case it needs it, to avoid having to go to disk. It will immediately release any memory to the OS if needed. SQL is a memory hog, and will hold onto all the memory we allocate to it, whether it is actually using it or not. And this is what we want it to do!

SQL will take all the memory that is allocated to it when it is needed, store data and execution plans in the buffer, and when it is through using it, it doesn’t release it…it caches that data into memory until either 1) SQL needs it again, or 2) Windows OS requests the memory.

This isn’t an indication of performance stress…it’s just showing that SQL has reserved all the memory that we told it to use (through setting min and max memory on the SQL instance properties [1]).  SQL doesn’t let go of memory unless the server comes under memory pressure (like if other apps need memory and Windows sends out a memory pressure notification).  As long as the OS doesn’t need to use it for anything, SQL will hold on to it for future use.

The default memory setting when you install SQL is for it to use all of the memory. When Windows asks SQL to give some back, SQL releases it in huge amounts, and then proceeds to build back up to the maximum amount again. This is not optimum for stability or performance, and it is highly recommended to set the limit in the Max Memory setting as a best practice. That way the Windows OS is always guaranteed to have a certain amount.
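For example, capping the instance with sp_configure might look like this (the 12 GB value is purely illustrative; size it for your own server and workload):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
EXEC sp_configure 'max server memory (MB)', 12288;   -- leave the remainder for the OS and other processes
RECONFIGURE;
GO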

Click here for more info on setting Max Memory: https://sqlserverperformance.wordpress.com/2009/10/29/suggested-max-memory-settings-for-sql-server-20052008/

Another good reference for setting Max Memory for SQL is here: http://www.sqlservercentral.com/blogs/glennberry/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008/

Blog_20150507_3

And when looking at Resource Monitor, it is a mistake to use the Available and Free metrics as the key indicators for determining overall system health [2]. The Available MB of memory is made up of both Standby and Free memory… the stuff in Standby is what Windows has put in the cache (but can be booted out if needed), and Free means it is not being used at all. The combination is what Windows calls Available. So there is actually data in the cache that Windows says is available. Some of the categories in Task Manager/Resource Monitor can be very confusing and even misleading, which is why we shouldn’t use these metrics alone to decide if we are in danger of being out of memory.

Instead of only relying on task manager or resource monitor, we really need to look at other metrics to find out if there is a memory related performance issue on the machine.

Some of the metrics to pay attention to are:

  • Page Life Expectancy
  • Available Bytes
  • Buffer Cache Hit Ratio
  • Target & Total Server Memory
  • Memory Grants Pending
  • Pages/sec (Hard Page Faults)
  • Batch Requests/sec & Compilations/sec

Some of these metrics can be queried from TSQL, or can be observed with Perfmon (Windows Performance Monitor – located in Administrative Tools).

 Page Life Expectancy (PLE)

Found in Perfmon at SQL Server: Buffer Manager – Page Life Expectancy, this counter measures the average amount of time unlocked data pages are staying in the buffer pool.   During peak production hours this number may dip into lower numbers, but we generally want this number to remain above 300 seconds (so data is staying in the cache for 5 minutes or longer). The longer data stays in the buffer, the less likely SQL will have to go to disk for I/O operations.

There is an interesting occurrence with page life expectancy… when SQL really does run out of memory, PLE drops very quickly…but it grows back very slowly. You’re probably not still having memory issues during the coming-back stage, even though it may look like it’s struggling. If PLE stays down however, then you have a problem. So be careful, because PLE can be misleading when it takes a drop… there’s a difference between it going down and staying down (you have a memory problem), and going down once and crawling back up (which is normal). If it stays down below 300 seconds consistently, you may need to add more RAM.

You can monitor this through Perfmon, or run the following query:

SELECT [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
AND [counter_name] = 'Page life expectancy'

Available MBytes

Found in Perfmon as Memory: Available MBytes, this counter detects how much memory is available for use, and whether Windows is experiencing memory pressure. We generally start investigating if this number consistently goes below 500MB, which would indicate low memory, and Windows may start memory management operations. If Available MBytes is consistently low, you may need to add more RAM.

This counter cannot be queried through TSQL; it has to be observed through Perfmon only.

Buffer Cache Hit Ratio

Found in Perfmon as SQL Server: Buffer Manager: Buffer Cache Hit Ratio. This counter averages (since the last restart of your SQL instance) how often SQL Server goes to the buffer pool to get data, and actually finds that data in memory, instead of having to go to disk. We want to see this ratio high in OLTP servers – around 90-95%. The higher the ratio, the less often SQL has to go to disk for I/O operations, which translates into better performance for your end users. If this number is consistently below the 90% mark, you may need to add more RAM to your server to boost performance.

SELECT [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
AND [counter_name] = 'Buffer cache hit ratio'

 Target & Total Server Memory

Found in Perfmon at SQL Server: Memory Manager – Total Server Memory, and SQL Server: Memory Manager – Target Server Memory. The Total Server Memory is the current amount of memory that SQL Server is using.  The Total Server memory will be quite a bit lower than the Target memory during the initial buffer pool ramp up… it is trying to populate the cache and get pages loaded into memory.  Performance might be a little slower during this time since more disk I/O is required, but this is normal.  After it ramps up, and normal operations resume, Total Server Memory should be very close to Target Server Memory. (The ratio should be close to 1). If Total Server Memory does not increase much, but stays significantly less than Target, this could indicate a couple of things…

1) You may have allocated much more memory than SQL can use – SQL could cache entire databases in memory, and if the databases are smaller than the amount of memory on the machine, the data won’t take up all the space allocated to SQL. So in this case Total Memory (the memory actually being used by SQL) will never reach Target Memory (the amount allocated to SQL). Or,

2) SQL cannot grow the buffer pool because of memory pressure from outside of SQL. If this is the case, you need to either increase the Max Server Memory, or add more RAM to boost performance.

SELECT [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
AND [counter_name] IN ('Total Server Memory (KB)','Target Server Memory (KB)')

 Memory Grants Pending

Found in Perfmon as SQL Server: Memory Manager – Memory Grants Pending, this counter measures the total number of SQL processes waiting for a workspace memory grant. The general recommendation is for this measurement to be 1 or less. Anything above 1 indicates there are SQL processes waiting for memory in order to operate.

Memory grants pending could be due to bad queries, missing indexes, sorts or hashes. To investigate this, you can query the sys.dm_exec_query_memory_grants view, which will show which queries (if any) require a memory grant to execute [5].
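For example, a quick look at that DMV along with the statement text might look like the following (the column choices here are just one reasonable starting point, not the only ones available):

SELECT mg.session_id,
       mg.requested_memory_kb,
       mg.granted_memory_kb,
       mg.wait_time_ms,
       st.text AS statement_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS st
ORDER BY mg.requested_memory_kb DESC;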

If the Memory Grants Pending are not due to the above mentioned conditions, then more memory should be allocated to SQL Server by adjusting Max Server Memory. Adding more RAM should be the last resort in this case.

SELECT [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
AND [counter_name] = 'Memory Grants Pending'

Pages/sec (Hard Page Faults)

Found in Perfmon as Memory: Pages/sec, this measures the number of pages read from or written to disk. Pages/sec is the sum of Pages Input/sec and Pages Output/sec and reports the number of requested pages that were not immediately available in RAM and had to be read from the disk (resulting in hard page faults). A hard page fault occurs when Windows has to use the swap file on the disk [3]. It happens when the address in memory that’s part of a program is no longer in main memory, but has instead been swapped out to the paging file, making the system go looking for it on the hard disk. Since disk storage is much slower than RAM, performance can be greatly degraded.

We want to see the Pages/sec number below 50, and closer to 0. If you see a high value of Pages/sec on a regular basis, you might have performance degradation, but not necessarily. A high Pages/sec value can happen while doing database backups or restores, importing or exporting data, or by reading a file mapped in memory [4].

Because a hard fault doesn’t necessarily indicate a critical error condition (it depends upon what’s normal for your environment), it’s a good idea to measure a baseline value and monitor this number against it. If the values are consistently higher than your baseline value, you should consider adding more RAM.

Batch Request & Compilations

SQL Server: SQL Statistics – Batch Requests/sec. This is the number of incoming queries.

SQL Server: SQL Statistics – SQL Compilations/sec. This is the number of new execution plans SQL had to build.

If Compilations/sec is 25% or higher relative to Batch Requests/sec, SQL Server is putting execution plans in the cache, but never actually reusing them. Your valuable memory is being used up to cache query execution plans that will never be used again – instead of caching data. This is bad. We don’t want this to happen. And a high Compilations/sec value (like over 100) indicates there are a lot of ad-hoc (one-hit-wonder) queries being run. You can enable the ‘optimize for ad hoc workloads’ setting if this is the case; with that setting SQL caches only a small plan stub the first time a query runs, and stores the full execution plan only if the query is executed a second time.
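If you decide to turn it on, the setting can be enabled with sp_configure (standard server-level configuration commands, shown here for reference):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
EXEC sp_configure 'optimize for ad hoc workloads', 1;   -- cache only a small plan stub on first execution
RECONFIGURE;
GO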

To query these metrics with TSQL:

SELECT [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%SQL Statistics%'
AND [counter_name] = 'Batch Requests/sec'

SELECT [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%SQL Statistics%'
AND [counter_name] = 'SQL Compilations/sec'

To get the ratio of Compilations to Batch Requests:

SELECT ROUND(100.0 *
    (SELECT [cntr_value]
     FROM sys.dm_os_performance_counters
     WHERE [object_name] LIKE '%SQL Statistics%'
     AND [counter_name] = 'SQL Compilations/sec')
    /
    (SELECT [cntr_value]
     FROM sys.dm_os_performance_counters
     WHERE [object_name] LIKE '%SQL Statistics%'
     AND [counter_name] = 'Batch Requests/sec')
, 2) AS [Ratio]

 

Looking at the counters in Perfmon (Report View) below, we can see that this particular server is in a healthy state, with over 11GB of Available Bytes, no page faults (so I/O is operating from the buffer, not disk), a Buffer Cache Hit Ratio of 100%, a PLE of 21,515 seconds (that’s almost 6 hours that data stays in the buffer), no pending memory grants (so no SQL process is having to wait for memory), ample Total memory, and a low compilation ratio (11%) compared to all the queries being run.

Blog_20150507_4

These measurements should be observed and understood, instead of using Task Manager to make critical decisions about resources on your SQL Server.

If it is decided to make changes based upon only what we see in Task Manager, and we reduce the amount of memory SQL can use, and it happens to need more to execute processes, it will have to read/write from disk, which will degrade the performance.  Reducing the amount of memory SQL can use will slow performance and will most likely increase complaints from end users. The lower you set your Maximum Server Memory, the lower your performance will go.

No matter how much memory you put in a system, SQL Server will use all it can get until it’s caching entire databases in memory and then some. This isn’t an accident. SQL Server caches data in memory. On purpose. That means it is healthy, and enhancing performance for your end users.

Now send a link to this blog post to your favorite person on staff who needs to be educated about Task Manager and SQL Server. You will be glad you did!

References

[1] “Glenn Berry’s SQL Server Performance” https://sqlserverperformance.wordpress.com/2009/10/29/suggested-max-memory-settings-for-sql-server-20052008/

[2] “Use Resource Monitor for Memory Monitoring” by Scott Lowe.   http://www.techrepublic.com/blog/the-enterprise-cloud/use-resource-monitor-for-memory-monitoring/

[3] “Swap File”. TechTarget. http://searchwindowsserver.techtarget.com/definition/swap-file-swap-space-or-pagefile

 

Correlate Performance Spikes with SQL Profiler and Perfmon

–by Ginger Keys

Have you ever watched performance metrics in Performance Monitor (Perfmon) and wondered which SQL Server activities were causing those spikes? SQLRx offers a Microsoft Partner-of-the-Year award-winning service that will correlate all performance metrics on your server with very little overhead, and will pinpoint with precision, right down to a line of code or even a network switch, what is bottlenecking your system. A real live person will then document and advise you exactly how to remedy the problem, and will walk you through each step.

However if you don’t need a comprehensive analysis, and simply want to find what TSQL statements are causing performance spikes, or if you want to observe a single TSQL statement and how it affects server resources, this can be done using SQL Profiler in conjunction with Performance Monitor (Perfmon). Both Profiler and Perfmon provide good information about your server, but being able to correlate these together will help to find the causes of performance spikes.

SQL Profiler will soon be deprecated, but will be available for a while longer; the specific version of SQL Server in which it will be removed has not been determined yet. Profiler is being replaced with Extended Events, and currently there is no straightforward way to use Extended Events to correlate performance spikes with Perfmon.

In order to correlate performance metrics with SQL activities, we will follow these steps:

  1. Create Profiler Trace
  2. Create Perfmon Data Collector Set
  3. Simulate High CPU / or Run your TSQL Statement(s)
  4. Analyze

*Warning: This article is intended to show you how to use the tools…it is highly advised not to do this on your production server.

Create Profiler Trace

Open SQL Server Management Studio (SSMS) and connect to your instance. On the menu bar, go to Tools, and click on SQL Server Profiler to get started. SQL Profiler will put stress on your server, so focus only on the events and columns needed to analyze the activities, in order to minimize the extra workload on your server.

We want to capture TSQL statements that might be causing performance spikes, so we will look at stored procedures, batches of queries, and individual queries. We will then filter the trace for statements that take longer than a certain Duration (ex. 5000 milliseconds), to filter out the normal running statements.

In the example below, I will use a Blank template, specify the events and columns needed, and save the results to file. Make sure you have enough space on your disk to hold the collected data.

Blog_20160519_1

Go to the Events Selection tab, and select these Events:

  • Showplan XML
  • RPC:Completed
  • SP:StmtCompleted
  • SQL:BatchCompleted
  • SQL:BatchStarting

And include these Columns

  • ApplicationName
  • CPU
  • DatabaseName
  • Duration
  • EndTime
  • EventSequence
  • IntegerData
  • LoginName
  • ObjectName
  • Reads
  • SPID
  • StartTime
  • TextData
  • Writes

Blog_20160519_2

Then create a Filter on

  • Duration (experiment with different times: start at 1000ms, and go to 5000 or 10000 if necessary)
  • Database name, if desired
  • Group By : Duration

Blog_20160519_3

Create Perfmon Data Collector Set

While the SQL Profiler is collecting TSQL statements that are executing, we want to also collect data on several performance metrics on the system.   From Administrative Tools, open Performance Monitor (Perfmon). Drill down to Data Collector Sets > User Defined. Right click and select New Data Collector Set.

Blog_20160519_4

Blog_20160519_5

Blog_20160519_6

Add the following Counters:

  • Memory:Available Mbytes
  • PhysicalDisk:% Idle Time (look at this as the opposite of ‘busy’ time)
  • PhysicalDisk:Avg Disk sec/Read
  • PhysicalDisk:Avg Disk sec/Write
  • Processor:% Processor Time (for each processor)
  • SQLServer:Buffer Manager: Page Life Expectancy

You can add as many performance metrics as you deem necessary to view your environment, but these are some basic memory, disk, and CPU metrics.

Set the Sample interval time to 3 seconds, to average out and not collect 1 second spikes. Save your output to a file on your disk with enough space to hold the data collected.

Simulate High CPU Usage

A few pointers before you begin…start SQL Profiler and Perfmon, making sure to start and stop both about the same time. If you are able to run these on a computer other than the server you are monitoring, that is preferred. If not, just be mindful that Profiler creates its own performance stress, so don’t do this if your system is in a critical state. Whatever you do, DO NOT DO THIS ON A PRODUCTION SYSTEM!!

If you already have specific times of the day when you know your server’s resources are peaking, run these tools during that time. Be sure to monitor the size of the data files output so that you don’t consume too much space on your drive.

If you already have some queries you want to test against the performance of your system, you can start and stop the tools during the time you are testing your queries.

If you want to artificially stress resources on your system for practice purposes, there are many TSQL statements available online to stress CPU and memory. Here are a few statements you can run to stress CPU:

-- Query to Keep CPU Busy for 30 Seconds
DECLARE @T DATETIME, @F BIGINT;
SET @T = GETDATE();
WHILE DATEADD(SECOND, 30, @T) > GETDATE()
SET @F = POWER(2, 30);

-- Query to Keep CPU Busy for 60 Seconds
DECLARE @T DATETIME, @F BIGINT;
SET @T = GETDATE();
WHILE DATEADD(SECOND, 60, @T) > GETDATE()
SET @F = POWER(2, 30);

 

-- Query to stress CPU
SELECT SUM(CONVERT(BIGINT, o1.object_id)
         + CONVERT(BIGINT, o2.object_id)
         + CONVERT(BIGINT, o3.object_id)
         + CONVERT(BIGINT, o4.object_id))
FROM sys.objects o1
CROSS JOIN sys.objects o2
CROSS JOIN sys.objects o3
CROSS JOIN sys.objects o4

 

-- Another Query to stress CPU
SELECT * INTO #columns
FROM sys.columns AS C

DECLARE @RCNT INT
SET @RCNT = 0
SET NOCOUNT ON;
WHILE @RCNT = 0
BEGIN
    UPDATE #columns
    SET column_id = 0
    WHERE ISNUMERIC(CONVERT(FLOAT, CONVERT(NVARCHAR(20), CONVERT(VARCHAR(20), column_id)))) = 0
    SELECT @RCNT = @@ROWCOUNT
END

 

-- And another Query to stress CPU using the AdventureWorks database
USE AdventureWorks2014
GO
DECLARE @Flag INT
SET @Flag = 1
WHILE (@Flag < 1000)
BEGIN
    ALTER INDEX [PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID]
    ON [Sales].[SalesOrderDetail] REBUILD
    SET @Flag = @Flag + 1
END
GO

 

I started the SQL Profiler trace and also the Perfmon data collector. Then after executing the above queries, we can see by looking at Task Manager that the CPU has spiked from the 30% range up to the 85% range.

Blog_20160519_7

Analyze and Correlate the Results

In order to analyze the results, both sets of data (the Profiler trace and the Perfmon collection) need to be loaded into SQL Profiler:

Open the Profiler trace (.trc) file from the location where the output was saved.

Blog_20160519_8

Then click on File > Import Performance Data > and select the Perfmon file saved.

Blog_20160519_9

The results below show all TSQL statements run during the time period, along with the Perfmon graph and numerical values. The top section shows the profiler events in the order in which they occurred. The middle section shows the Perfmon graph for the duration of the collection period with the min, max, and average values. The bottom section displays either the text of the SQL statement or an execution plan.

You want to look for spikes or bottlenecks in the performance metrics, and then identify the profiler events that occurred around that time.

Click anywhere on the Perfmon graph and the TSQL statement will be pulled up in the Profiler event list. If I click on the Perfmon graph where I see the CPU was 100% (vertical red line below), the chart takes me to the general location (time period) in the Profiler events. You will need to scroll up or down in the list of events during that time period to find the actual event that caused the CPU spike. Clicking on the CPU spike in the Perfmon graph takes us to a TSQL statement around the time frame of that spike and provides the execution plan as shown below:

Blog_20160519_10

You can also click on a line in the Profiler events to see where the performance counters were at that time in the Perfmon graph (indicated with a red line below). I have clicked on a high CPU duration in the TSQL statement list where SQL: BatchCompleted occurred, and the Perfmon metrics on the graph shows 89% processor time and 2% disk idle (which means the disk was 98% busy).

Blog_20160519_11

In our example, the memory performance metrics were not substantial, but the CPU and disks were being stressed, with the average value for CPU being around 70%, and the average disk idle time at 25% (which means it was 75% busy).

Blog_20160519_12

Beware of some ‘Gotchas’!

Don’t run Profiler in a production environment! Profiler causes much stress on a system, so if you have access to another server from which you can run this tool, it is preferred. If you don’t have another server to run Profiler from, take great care not to do this during critical periods on your production machine, and be sure to set filters so you don’t collect everything that is running.

Clicking on a Perfmon spike will only take you to the general area in the list of Trace events. As shown in the example above, you might or might not be directed to the exact Profiler event that caused the spike in performance. You will need to look around in the general area of that time period in the Trace to find the exact event that caused the spike.

There could be non-SQL processes causing your performance issues. If there are other non-SQL applications running on your production server that are causing performance stress, you won’t see those events by using the SQL Profiler tool.

There could be multiple processes causing performance spikes that individually don’t exceed your Duration filter. Individually these (low to medium resource users) aren’t a problem, but collectively they can cause problems if they are occurring simultaneously. If there are lots of smaller resource users that don’t meet your filter, you won’t see them, but together they could be causing performance issues.

Catching Showplan XML will catch everything happening in SQL because it doesn’t have a Duration column to filter on…which will bloat your trace file. Again, it is better to run these tools on a server apart from your production server.

* Disclaimer: Our SQLRx performance analysis process vanquishes all these ‘gotchas’…just sayin…give us a call

Conclusion

Being able to correlate TSQL statements with performance metrics can help you determine why you are seeing stress on your system, it can pinpoint queries that may need tuning, and it can help you observe how your queries will affect the performance of your hardware and resource usage. This can be done during a period of spikes in performance, especially if you know beforehand when these spikes occur. However keep in mind that running SQL Profiler contributes to the stress on your system, so take care not to do this during a critical performance spike, and as mentioned before it is most ideal to do this from a non-production server.

This article describes the general methods for using the SQL Profiler tool in conjunction with the Windows Performance Monitor. Correlating performance issues with this process is not 100% accurate, but will give you a great starting point to investigate performance stress on your system. Finding performance bottlenecks and knowing how to correct them is an in-depth process that usually requires years of experience. If you need guidance or assistance with this, contact us. We are experts in this, we have over 40 years of experience, and have helped over 700 clients. We can help you!

If you would like assistance configuring, maintaining, or monitoring your SQL Server and instances please contact us at SQLRx@sqlrx.com. We would be happy to answer any question you have, and would be interested to hear about your SQL experiences as well! Visit us at www.sqlrx.com.


Load Windows Event Log Errors into a SQL Table

–By Lori Brown @SQLSupahStah

Part of being a proactive DBA is to get information on what is going on at the server level as well as at the SQL level. We know that the Windows Event Logs are there, but I find that not many DBAs check them regularly. You can make this easier by using a VBScript to load errors and warnings from the Event Logs into a SQL table.

Review or scrape the Event Logs at a minimum to find out if any Windows or hardware-related errors or warnings are being written. Most hardware vendors write warnings to the Event Logs when they anticipate an error is going to occur, so this gives you the opportunity to be proactive and correct the problem during a scheduled down time, rather than having a mid-day emergency.

Build a SQL table to hold Event Log info:

-- table for the Windows Event Log
CREATE TABLE [dbo].[WinEventLog](
    [ID] INT IDENTITY(1,1) NOT NULL,
    [ComputerName] VARCHAR(128) NULL,
    [EventCode] INT NULL,
    [RecordNumber] INT NULL,
    [SourceName] VARCHAR(128) NULL,
    [EventType] VARCHAR(50) NULL,
    [WrittenDate] DATETIME NULL,
    [UserName] VARCHAR(128) NULL,
    [Message] VARCHAR(MAX) NULL
) ON [PRIMARY]
GO

Use the VBScript code below by saving it into a Notepad document with the extension .vbs. Make sure to modify the connection strings to connect to your instance and the correct database. And, set the number of days you want to check as well. I have it defaulted to 2 days but you can set it to whatever makes sense for you.

************************

'VBScript
'Purpose of script to query Application log for errors

dim strConnect, strComputer, strMessage, RoleStr
dim Category, Computer_Name, Event_Code, Message, Record_Number, Source_Name, Time_Written, Event_Type, User
dim dtmStartDate, dtmEndDate, DateToCheck
dim dtTimeWritten

'Connection string for SQL Server database.
strConnect = "DRIVER=SQL Server;" _
    & "Trusted_Connection=Yes;" _
    & "DATABASE=<<Database Name>>;" _
    & "SERVER=<<SQL Instance Name>>"

'Use this string if the SQL Server driver does not work
'strConnect = "Provider=SQLOLEDB;" _
'    & "Data Source=<<SQL Instance Name>>;" _
'    & "Initial Catalog=<<Database Name>>;" _
'    & "Integrated Security=SSPI;"

'Connect to database.
Set adoConnection = CreateObject("ADODB.Connection")
adoConnection.ConnectionString = strConnect
adoConnection.Open

Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
Set dtmEndDate = CreateObject("WbemScripting.SWbemDateTime")
'Set number of days to scrape here
DateToCheck = Date - 2
dtmEndDate.SetVarDate Date, True
dtmStartDate.SetVarDate DateToCheck, True

strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
Set colLoggedEvents = objWMIService.ExecQuery _
    ("Select * from Win32_NTLogEvent Where Logfile = 'Application' and (" & _
    "TimeWritten >= '" & dtmStartDate & _
    "' and TimeWritten < '" & dtmEndDate & _
    "') and (EventType = '1' or EventType = '2')")

For Each objEvent in colLoggedEvents
    Category = objEvent.Category
    Computer_Name = objEvent.ComputerName
    Event_Code = objEvent.EventCode
    Message = objEvent.Message
    Record_Number = objEvent.RecordNumber
    Source_Name = objEvent.SourceName
    Time_Written = objEvent.TimeWritten
    Event_Type = objEvent.type
    User = objEvent.User

    'Fix single quotes in the message string
    strSQ = Chr(39)
    strDQ = Chr(34)
    if len(Message) > 0 then
        strMessage = Replace(Message, strSQ, strDQ)
    else
        strMessage = " "
    end if

    dtTimeWritten = WMIDateStringToDate(Time_Written)

    RoleStr = "SET NOCOUNT ON INSERT INTO WinEventLog (ComputerName, EventCode, RecordNumber, " _
        & "SourceName, EventType, WrittenDate, UserName, Message) VALUES " _
        & "('" & Computer_Name & "', '" & CLng(Event_Code) & "', '" & CLng(Record_Number) _
        & "', '" & Source_Name & "', '" & Event_Type & "', '" & dtTimeWritten _
        & "', '" & User & "', '" & strMessage & "')"
    adoConnection.Execute RoleStr
Next

adoConnection.Close

Function WMIDateStringToDate(Time_Written)
    WMIDateStringToDate = CDate(Mid(Time_Written, 5, 2) & "/" & _
        Mid(Time_Written, 7, 2) & "/" & Left(Time_Written, 4) _
        & " " & Mid(Time_Written, 9, 2) & ":" & _
        Mid(Time_Written, 11, 2) & ":" & Mid(Time_Written, 13, 2))
End Function

************************

To run the VBScript, you can double-click it to run it manually, or you can set up a SQL Agent job that uses the CmdExec subsystem to call the script:

@subsystem = N'CmdExec',
@command = N'cscript E:\SQLRX\ScrapeWindowsEventLog.vbs'
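For context, a complete job step definition might look something like the sketch below (the job name is an example only, and the job itself must already exist):

EXEC msdb.dbo.sp_add_jobstep
     @job_name = N'Scrape Windows Event Log',       -- example job name
     @step_name = N'Run ScrapeWindowsEventLog.vbs',
     @subsystem = N'CmdExec',
     @command = N'cscript E:\SQLRX\ScrapeWindowsEventLog.vbs',
     @on_success_action = 1;                         -- quit the job reporting success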

Hopefully this will help make you more proactive and knowledgeable about your servers in general.

Enjoy!

For more information about blog posts, concepts and definitions, further explanations, or questions you may have…please contact us at SQLRx@sqlrx.com. We will be happy to help! Leave a comment and feel free to track back to us. Visit us at www.sqlrx.com!

Using Ranking Functions to Partition and Order Unfamiliar Data

The Problem

How many times have you had to subdivide, group, rank, and organize data when you knew very little about the data itself except for the meanings of the columns? This dilemma is one that the members of the SQLRx team encounter frequently. When a customer requests help with a performance problem, we often do not know the number of databases, tables, or indices (or how busy any of them are) before we begin our analysis of the performance data. We also do not know beforehand which metrics will provide the most insight into a customer’s problems. The unknown nature of the data necessitates that we make no assumptions and therefore, look at all performance data.

 

To expedite our analyses, we developed Reporting Services routines that enable us to depict the data graphically and dynamically. This is particularly useful for processing data from the SQL Server DMVs, which can produce voluminous amounts of data. For example, a simple 12-hour collection can produce millions of data records that are written periodically and efficiently by our T-SQL scripts into flat files for post-processing off-site. To put the analysis problem into better perspective, it is instructive to consider how one might graph the single-record lookups metric for the 6,035 possible database/table/index combinations contained in the Index Operational DMV data used in this article. If every index were used during the data collection period and each graph contained 14 series, this would result in 432 graphs. Obviously, this is unmanageable. Unless some method is devised to place the most interesting series on the first graph or two, the graphs will be as useless as the number columns themselves. A member of our team, Jeff Schwartz, developed some T-SQL code that grouped the data by metric and any other subdivision entity, e.g., instance, database, file, table, or index, without resorting to cursors or dynamic SQL. The series were also ranked, with the largest N (specified by the user via the Reporting Services UI) maximums on the first graph and the next N series on the second graph, etc. The partitioning and ranking mechanism he devised, which employs the DENSE_RANK and ROW_NUMBER T-SQL ranking functions, is applicable to a variety of data subdivision and organization problems and is the basis for this article.

 

Initial Attempts and Experimentation

Although the NTILE function appeared to be the obvious choice, Jeff found that he could not obtain the desired metric partitioning for the graphs by using this function.  NTILE requires a fixed number of partitions. Some metrics had hundreds of unique series, whereas others had one, seven, or nine (see Table 1).  For some combinations, several graphs were created for a given metric, but each graph only had one or two series on it! For those with an intermediate number of combinations, NTILE evenly divided the total number of series into equal portions. This meant that although the user specified a series per graph value such as seven, each graph would only contain five series on it (illustrated in Table 2 by the Leaf Ghosts records), thereby effectively ignoring the user specification. The query below, whose output is shown in Table 1, determines how many unique column combinations existed in the data for each metric.  Clearly, the number per metric varied greatly.  Please also note that the GroupID was assigned automatically based upon the order by clause within the DENSE_RANK function. This will become an integral part of the final partitioning strategy.

 

SELECT MetricName, DENSE_RANK() OVER (ORDER BY MetricName) AS GroupID, COUNT(*) AS [Count]
FROM MeasureBeforeUpdate
GROUP BY MetricName
ORDER BY MetricName

 

Blog_20160307_1

Method Comparisons

The next task was to reestablish the ordering for all the graphs with the worst ones on the first graph, the next worst on the second, etc. After a fair amount of experimenting, Jeff hit upon the following combination that does the entire grouping, ranking, splitting, etc. all in one pass as shown below (it also uses a CTE to simplify the coding and improve legibility). The example below shows all the building blocks so you can see how the final query was derived. For the sake of comparison, ALL attempted ranking variations are shown in the query. The ones that proved useful, DR and RNDiv, are highlighted, along with the combined item named MeasureRank. To aid with interpretation, @I_GroupNbr specifies the number of series per graph provided by the user, @MaxCount is the largest number of combinations (Count in Table 1), and @MaxGroups specifies the maximum number of possible groups for each metric. This value is computed earlier in the stored procedure to simplify the computations greatly and is established using the following formula: SET @MaxGroups = CEILING(@MaxCount * 1.0 / @I_GroupNbr). The DENSE_RANK function provided the metric groupings, and the ROW_NUMBER function partitioned the columns by metric and ordered them by the maximum value in descending order. The integral division enabled maintaining the relative ranking within a group. The query output is shown below.

WITH RankingList AS (
SELECT MsId,
    @MaxGroupID + (DENSE_RANK() OVER (ORDER BY MetricName) - 1) * @MaxGroups
        + NTILE(@MaxGroups) OVER (PARTITION BY MetricName ORDER BY MetricName) AS [OldMeasureRank],
    DENSE_RANK() OVER (ORDER BY MetricName) AS DR,
    NTILE(@MaxGroups) OVER (PARTITION BY MetricName ORDER BY MetricName) AS NT,
    ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY MaxVal DESC) AS RN,
    ((ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY MaxVal DESC) - 1) / @I_GroupNbr) + 1 AS RNDiv,
    (ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY MaxVal DESC) - 1) % @I_GroupNbr AS RNMod,
    @MaxGroupID + (DENSE_RANK() OVER (ORDER BY MetricName) - 1) * @MaxGroups
        + ((ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY MaxVal DESC) - 1) / @I_GroupNbr) + 1 AS [MeasureRank]
FROM MeasureBeforeUpdate
)
SELECT GroupId = RankingList.MeasureRank, RankingList.*, MeasureBeforeUpdate.*
FROM MeasureBeforeUpdate
INNER JOIN RankingList ON RankingList.MsId = MeasureBeforeUpdate.MsId

Blog_20160307_2
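Before looking at the final code, the following small self-contained sketch (on made-up data, not the MeasureBeforeUpdate table) may make the grouping mechanism easier to see in isolation: DENSE_RANK numbers the metrics, and the integral division of ROW_NUMBER assigns each series to a graph within its metric.

DECLARE @I_GroupNbr INT = 2;   -- series per graph, normally supplied by the user

;WITH SampleData AS (
    SELECT *
    FROM (VALUES
        ('Leaf Inserts', 'IX_A', 900), ('Leaf Inserts', 'IX_B', 700),
        ('Leaf Inserts', 'IX_C', 500), ('Leaf Inserts', 'IX_D', 100),
        ('Leaf Ghosts',  'IX_E',  80), ('Leaf Ghosts',  'IX_F',  60)
    ) AS v (MetricName, SeriesName, MaxVal)
)
SELECT MetricName, SeriesName, MaxVal,
       DENSE_RANK() OVER (ORDER BY MetricName) AS MetricGroup,
       ((ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY MaxVal DESC) - 1) / @I_GroupNbr) + 1 AS GraphNbr
FROM SampleData
ORDER BY MetricName, MaxVal DESC;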

Final Code and Sample Graphs

The final code compares the original partitioning code values with those of the new method. The biggest differences involve the replacement of NTILE with ROW_NUMBER and the addition of integral division. This combination automatically numbers the series for a given group and resets this number once it hits the maximum number of items to be shown on a Reporting Services page. Since this is a parameter specified by the user, it now breaks the output up into N sections depending upon how many series are requested for each graph. Examples of the graphs with the number of series per graph specified to be seven are shown below the query. As highlighted in Table 1, only five Leaf Page Merges combinations existed, whereas there were 105 for Leaf Inserts. As a result, only one graph was needed for the first metric, but 15 were possible for the second. Since each graph set has its own ID (as displayed in the graph titles), it is trivial to terminate the graph generation process early by avoiding the calls for any group greater than a certain value. When returning large numbers of metrics, e.g., 40 for Index Operational statistics, being able to terminate graphing early, e.g., at two graphs per metric, can result in a reduction of more than 93 percent in graph generation duration. With this particular data, the total possible number of graphs was 907, whereas by stopping at only two graphs per metric, the total decreased to 61.

 

WITH RankingList AS (
SELECT MsId,
    @MaxGroupID + (DENSE_RANK() OVER (ORDER BY MetricName) - 1) * @MaxGroups
        + NTILE(@MaxGroups) OVER (PARTITION BY MetricName ORDER BY MetricName) AS [OldMeasureRank],
    @MaxGroupID + (DENSE_RANK() OVER (ORDER BY MetricName) - 1) * @MaxGroups
        + ((ROW_NUMBER() OVER (PARTITION BY MetricName ORDER BY MaxVal DESC) - 1) / @I_GroupNbr) + 1 AS [MeasureRank]
FROM MeasureBeforeUpdate
)
SELECT GroupId = RankingList.MeasureRank, RankingList.*, MeasureBeforeUpdate.*
FROM MeasureBeforeUpdate
INNER JOIN RankingList ON RankingList.MsId = MeasureBeforeUpdate.MsId

Blog_20160307_3

Blog_20160307_4

Blog_20160307_5


Conclusion

This article illustrated the pros and cons of using various ranking functions, especially in combination, which is seldom shown in ranking code examples. It also provides a different perspective regarding the T-SQL window ranking functions and their use in solving everyday partitioning, ranking, and organization problems.

 

For more information about blog posts, concepts and definitions, further explanations, or questions you may have…please contact us at SQLRx@sqlrx.com. We will be happy to help! Leave a comment and feel free to track back to us. Visit us at www.sqlrx.com!

 

Disk Metrics – Why PerfMon’s, SQL Server’s Filestats, and Disk Vendors’ Values Sometimes Appear to Disagree

— By Jeffry Schwartz

Whenever multiple data sources are used, actual or perceived discrepancies can arise. Sometimes this is due to the location of the data collector, e.g., Windows, SQL Server, or the disk array. Other times, it is due to the collection rate and whether the information is a simple recording of continuously incremented data or it is an instantaneous sample. True sampling presents obvious issues because its accuracy is related directly to the sample rate. Clearly, the more often data is sampled, the more precisely it will represent actual behavior. However, this improved accuracy comes with the price of adding pressure to the disk, thereby distorting the picture of what the disk LUN is actually doing.

Disk Vendors

By default some disk array vendors sample disk performance data infrequently unless a more detailed view is required. This is important to know because this kind of sampling can miss spikes in activity completely, e.g., SQL Server checkpoints, which can be extremely write-intensive. As a result, a disk array that performs poorly during database checkpoints, but otherwise is effectively idle, will appear to be functioning well unless one of the samples happens to coincide with a checkpoint. Although the primary difference involves the sampling rate, the fact that the disk array measurements do not include the Windows I/O stack sometimes causes disk vendors to be suspicious of any Windows I/O metrics, and to suggest that either Windows’ I/O handling or its instrumentation is faulty. The author has not observed this to be the case.

Perfmon

Blog_20160107_1

PerfMon and similar programs that rely upon Windows performance counters retrieve the averages for I/O metrics from the disk drivers themselves, i.e., from the code inside Windows that actually handles the I/Os. The “sampling” rate simply determines how often the averages are recorded (and reset for the next interval). Therefore, these are not true samples because the disk driver constantly updates its statistics and the collector periodically retrieves the averages. As a result, the disk driver will detect intermittent activities such as checkpoint behavior. The primary difficulty with these measurements is that any summarization of the data inevitably involves averaging averages. For example, assume that the collector retrieves the I/O data from the driver every 30 seconds and one needs to summarize disk behavior over a 15-minute period. In this scenario, the period average will be the average of 30 averages. The statistical literature is replete with examples of situations when this is sufficiently accurate, as well as those when it is not. Regardless, it is easy to understand why the averaging of averages is preferable to the alternative of highly frequent true sampling.

SQL Server FileStats

Blog_20160107_2

SQL Server captures I/O performance data differently. It records the times at which an I/O request is initiated and completed. It is important to note that in order to complete an I/O, the thread handling the I/O must pass through the processors’ ready queue and become active on a processor so the I/O completion code can be executed. If the processors are sufficiently busy, a delay will occur while the thread waits for a processor. Although SQL Server measures these kinds of delays for an entire instance, it does not expose them via the sys.dm_io_virtual_file_stats DMV. Therefore, differences between PerfMon’s and SQL Server’s measurements can be due to processor delays or the comprehensive nature of the SQL Server data, or the averaging of averages for PerfMon data. Unfortunately, there is no way to be certain without obtaining extremely detailed data, which, as cited previously, will distort any measurements. Note: if the SQL Server I/O metrics are captured and averaged every 30 to 60 seconds, the difficulties with averaging averages will be present in this data as well. Therefore, the only way to avoid this issue is to subtract the first snapshot’s values from those of the second and compute the differences. This method was used to compute the SQL Server metrics in the table below.
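As an illustration of that snapshot-differencing approach (not necessarily the exact script used to build the table below), the following sketch captures sys.dm_io_virtual_file_stats twice and computes interval read and write latencies; the five-minute delay is arbitrary.

SELECT database_id, file_id, num_of_reads, io_stall_read_ms, num_of_writes, io_stall_write_ms
INTO #IOSnap1
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:05:00';   -- measurement interval; adjust as needed

SELECT s2.database_id, s2.file_id,
       (s2.io_stall_read_ms - s1.io_stall_read_ms) * 1.0
           / NULLIF(s2.num_of_reads - s1.num_of_reads, 0) AS avg_read_ms,
       (s2.io_stall_write_ms - s1.io_stall_write_ms) * 1.0
           / NULLIF(s2.num_of_writes - s1.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS s2
INNER JOIN #IOSnap1 AS s1
    ON s1.database_id = s2.database_id AND s1.file_id = s2.file_id;

DROP TABLE #IOSnap1;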

Differences

A simple example, whose data appears in the table below, helps illustrate the differences and similarities between PerfMon and SQL Server. The rightmost two columns illustrate the percentage and absolute differences for each pair of metrics. The Pct Diff SQL Server vs. PerfMon column highlights the percentage differences, whereas the last column lists the absolute differences. The read, write, and I/O times are in milliseconds, so clearly, the two disk LUNs shown below were under extreme duress during the data collection period, and reads and writes performed very poorly. Good read and write times should be 20 milliseconds or less, but the best time shown in the table below is 548 milliseconds. The similarities between the two data sources are notable because all of the rate values are close, although there is more variation in the response times, in particular for writes on LUN 10 O. Closer inspection reveals that LUN 09 N’s response metrics are very close and that the 10 O LUN accounts for most of the differences. However, despite the cited differences, both data sources support the conclusion that the disk LUNs performed very poorly.

Blog_20160107_3

In summary, despite the significant differences in the origins of the PerfMon and SQL Server I/O metrics, both sets of metrics agree sufficiently and they also correctly detect I/O performance issues. Differences can result from the averaging of averages issues for PerfMon data, or processor-related delays and the comprehensive nature of the SQL Server I/O metrics. Regardless, either of these usually provides a more accurate representation of disk behavior than the infrequent sampling methods used by some disk vendors.

For more information about blog posts, concepts and definitions, further explanations, or questions you may have…please contact us at SQLRX@sqlrx.com. We will be happy to help! Leave a comment and feel free to track back to us. We love to talk tech with anyone in our SQL family!

Troubleshooting SQL Server CPU

–by Ginger Keys

Recently, a client of mine contacted me in a panic because CPU was running at 99-100% over an extended period of time, which was affecting everything on the machine.

We’ve all seen it at one time or another. And there are several reasons this could be happening. It could be SQL consuming most of the CPU, or it could be something outside of SQL. As a DBA, you don’t want to spend a lot of time troubleshooting high CPU usage if SQL isn’t the culprit. If the issue is caused by processes outside of SQL, you can take a high level look around, but should probably defer that to your Sys Admin or your IT team. If it is SQL that is causing the issues though, we need to dig in deeper.

This article discusses basic CPU troubleshooting. We will take a very ‘high level’ approach to a topic that can potentially get very in-depth and complex. The goal is not only to provide immediate relief but also to help prevent future stress to your CPU caused by SQL.

Is it SQL or System Processes causing high CPU Usage

The first step is to determine if it is SQL or not causing CPU pressure. There are several ways you can make this determination.

1. Task Manager/Resource Monitor

The quick way is to open Task Manager and/or Resource Monitor, and look to see if sqlservr.exe is the culprit. Be cautious when using Task Manager to troubleshoot, as this tool records data once per second. So Task Manager will record spikes in performance (which is normal), instead of averaging it out over a longer time span (which is a better indicator of stress).

Blog_20151001_1

Blog_20151001_2

2. Perfmon Counters  

You can also determine if SQL is the culprit for high CPU usage by using Performance Monitor (Perfmon). Unlike task manager which records data once/second, you can adjust Perfmon to poll the data every 3-5 seconds to get a better indication of CPU stress. There are four counters you will want to look at:

    • Processor: % Privileged Time (_Total) – measures the amount of time your computer is executing I/O operations and other System services (aka Kernel mode).  If this metric is high (>20%) it’s possible that something like system backups are running, your driver is bad, or large files are being copied into or out of the system.   Either way, it’s time to get your Sys Admin involved.
    • Processor: % User Time (_Total) – measures amount of time your computer is executing user applications, including SQL.  If user time is consuming most of the time, we need to look at what is running on SQL to see what is causing the high CPU usage.
    • Process: % Privileged Time (sqlservr.exe instance) – measures the percentage of elapsed time that the process threads spent executing code in privileged (Kernel/System) mode
    • Process: % User Time (sqlservr.exe instance) – measures the percentage of elapsed time that the process threads spent executing code in user (application/SQL) mode.

Blog_20151001_3

3. Query DMVs

Another method to find out what is causing high CPU usage is to execute the following DMV query:

--Get CPU Utilization % from SQL, NonSQL (System), and Idle
DECLARE @ms_ticks_now BIGINT

SELECT @ms_ticks_now = ms_ticks
FROM sys.dm_os_sys_info;

SELECT TOP 10 record_id
    ,dateadd(ms, -1 * (@ms_ticks_now - [timestamp]), GetDate()) AS EventTime
    ,SQL_CPU_Use
    ,100 - SystemIdle - SQL_CPU_Use AS NonSQL_CPU_Use
    ,SystemIdle
FROM
(
    SELECT record.value('(./Record/@id)[1]', 'int') AS record_id
        ,record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS SystemIdle
        ,record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS SQL_CPU_Use
        ,TIMESTAMP
    FROM
    (
        SELECT TIMESTAMP
            ,convert(XML, record) AS Record
        FROM sys.dm_os_ring_buffers
        WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
        AND record LIKE '%<SystemHealth>%'
    ) AS x
) AS y
ORDER BY record_id DESC

(from: http://sqlblog.com/blogs/ben_nevarez/archive/2009/07/26/getting-cpu-utilization-data-from-sql-server.aspx)

The query results break down CPU usage as SQL usage, NonSQL (or System) usage, and Idle. If the SQL_CPU_Use has predominantly higher percentages than NonSQL or Idle, then we know that SQL is causing the pressure on CPU.

Blog_20151001_4


System Process causing high CPU usage

If after your preliminary investigation you discover that SQL is not causing CPU pressure, but rather it is a system issue, the problem could be any number of things. It’s good to have a general understanding of these other issues, but as a SQL DBA, you may not be able to spend a lot of time troubleshooting them. Some system, or non-SQL, issues that could cause CPU stress are:

1. Architecture

2. Virtual Environment

  • Host CPU settings
  • Host processor usage
  • vCPUs allocated among guests
  • # guests
  • # other machines on host, and how busy are they
  • Resource governor settings

3. CPU Power Settings

  • Should be set to high performance on SQL server.  However, this has had inconsistent results, and may or may not make a significant difference.

4. Hardware & Software Interrupts – these should be looked at over a period of time (at least 5 seconds per sample over 30 minutes or more), and should also be interpreted in relation to a baseline of what's normal for your system

  • Processor: % Privileged Time (Kernel Time + DPC + Interrupts) – over 15-20% consistently is bad.  High Kernel time could indicate Windows backups are running, or could be the result of a bad driver
  • Processor: % Interrupt Time – over 5-10% consistently is a problem.  This shows the percentage of time the processor spent servicing hardware interrupts.  A high rate could indicate disk or network adapter issues.
  • Processor: % DPC Time – over 5-10% consistently is a problem.  This shows the percentage of time the processor handles deferred procedure calls.  These are interrupts with a lower priority that have been deferred to some kind of queue.  Software or hardware could be causing this, and it might be necessary to redistribute interrupts or upgrade to a faster processor.

5. Antivirus Scans – temporarily disable anti-virus software to rule this out, or locate the antivirus process in task manager/resource monitor.

6. SvcHost.exe – this is a generic process name, and includes individual services that Windows uses to perform various tasks.  There can be multiple occurrences of this process running on your computer at the same time.  


SQL causing high CPU usage

If you determine that it is definitely SQL causing the CPU stress, you first need to find the problem and correct or stop it. After you provide some immediate relief to your system, you then need to troubleshoot the issue so that you can make necessary adjustments and prevent it from happening in the future.

1. sys.dm_exec_requests

First you want to find what in SQL is monopolizing the CPU.  To do this, look at the currently executing queries along with the CPU usage.  Run this script to identify the queries using the most CPU time:

SELECT sessions.session_id
    ,requests.status
    ,requests.blocking_session_id
    ,requests.wait_type
    ,requests.wait_time / (1000.0) 'wait_time(sec)'
    ,requests.cpu_time
    ,requests.total_elapsed_time / (1000.0) 'total_elapsed_time(Sec)'
    ,Substring(sqltext.TEXT, (requests.statement_start_offset / 2) + 1,
        ((CASE requests.statement_end_offset
            WHEN -1 THEN Datalength(sqltext.TEXT)
            ELSE requests.statement_end_offset
        END - requests.statement_start_offset) / 2
        ) + 1) AS statement_text
    ,requests.command
    ,sessions.login_name
    ,sessions.host_name
    ,sessions.program_name
FROM sys.dm_exec_sessions AS sessions
INNER JOIN sys.dm_exec_requests AS requests ON requests.session_id = sessions.session_id
CROSS APPLY sys.dm_exec_sql_text(requests.sql_handle) AS sqltext
WHERE requests.session_id != @@SPID
ORDER BY requests.cpu_time DESC

 

When you identify the query consuming the most CPU time, you should be able to observe the TSQL command that SQL is processing, the user issuing the command, and determine whether it can be safely terminated or not.
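
If the offending request turns out to be something that can safely be stopped (a runaway ad-hoc report, for example), the session can be terminated with KILL. The session_id below is just a placeholder; substitute the one returned by the query above, and keep in mind that killing a session rolls back its open transaction, which can itself take time:

-- Placeholder session_id; verify what the session is doing before terminating it
KILL 57;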

2. sys.dm_os_wait_stats

This provides historical (from the time that the SQL instance started), cumulative data and records waits related to the workload of your server.  We can use this DMV to investigate the most prevalent wait types over a period of time, and specifically see CPU waits (signal time) which might indicate CPU pressure.  Since this data is cumulative, current waits might not be evident and adjacent samples should be differenced to determine actual resource and signal wait times.  It is best to track these statistics over time and establish a baseline on which to compare and look for anomalies.
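
As a minimal sketch of that kind of check, the following compares cumulative signal (CPU) wait time to resource wait time since the instance started; a consistently high signal-wait percentage is one indicator of CPU pressure:

-- Signal (CPU) waits vs. resource waits, cumulative since the last instance restart
SELECT SUM(signal_wait_time_ms) AS signal_wait_ms
    ,SUM(wait_time_ms - signal_wait_time_ms) AS resource_wait_ms
    ,CAST(100.0 * SUM(signal_wait_time_ms) / NULLIF(SUM(wait_time_ms), 0) AS DECIMAL(5, 2)) AS signal_wait_pct
FROM sys.dm_os_wait_stats;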

3. Activity Monitor

Activity Monitor in SSMS displays both of the DMVs above in an easy-to-read report, but running it can itself create a performance hit…maybe not a good idea if your CPU is already under extreme stress.

Blog_20151001_5

4. Stored Proc / Query

    After you have taken care of the query that was causing the CPU pressure, you can run the following to get recent CPU utilization by query.  This will show the most expensive queries that have already completed and their execution plans, so that you can check for ways to prevent them from beating up your CPU in the future.

-- Get CPU utilization by query

SELECT TOP 10 st.text
    ,st.dbid
    ,st.objectid
    ,qs.total_worker_time
    ,qs.last_worker_time
    ,qp.query_plan
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY qs.total_worker_time DESC

(from: http://sqlskills.com/blogs/glenn)

Some of the steps you might need to take in order to improve performance for these queries are:

  • Check execution plan and tune query.  For a great explanation click here https://blog.sqlrx.com/2015/03/16/introduction-to-sql-server-execution-plans/
  • Update statistics on tables called – statistics are what SQL uses to decide how to execute your query in the most efficient way.
  • Identify high I/O queries – a sample query for this is shown after this list
  • Create/rebuild indexes – be sure not to create indexes without thoroughly testing first!
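
For the high I/O check mentioned above, one simple approach is to sort sys.dm_exec_query_stats by logical reads. This is only a sketch; the TOP value is arbitrary and the numbers are cumulative since each plan was cached:

-- Top completed queries by cumulative logical reads
SELECT TOP 10 qs.total_logical_reads
    ,qs.execution_count
    ,qs.total_logical_reads / qs.execution_count AS avg_logical_reads
    ,st.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_logical_reads DESC;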

5. Database

    To find out if one or more of your databases is predominantly causing CPU pressure, the following query will help determine which database(s) has used the most CPU resources on your instance:

-- Get CPU utilization by database

WITH DB_CPU_Stats
AS
(SELECT DatabaseID, DB_Name(DatabaseID) AS [Database Name], SUM(total_worker_time) AS [CPU_Time_Ms]
 FROM sys.dm_exec_query_stats AS qs
 CROSS APPLY (SELECT CONVERT(int, value) AS [DatabaseID]
              FROM sys.dm_exec_plan_attributes(qs.plan_handle)
              WHERE attribute = N'dbid') AS F_DB
 GROUP BY DatabaseID)
SELECT ROW_NUMBER() OVER(ORDER BY [CPU_Time_Ms] DESC) AS [CPU Rank],
    [Database Name], [CPU_Time_Ms] AS [CPU Time (ms)],
    CAST([CPU_Time_Ms] * 1.0 / SUM([CPU_Time_Ms]) OVER() * 100.0 AS DECIMAL(5, 2)) AS [CPU Percent]
FROM DB_CPU_Stats
WHERE DatabaseID <> 32767 -- ResourceDB
ORDER BY [CPU Rank] OPTION (RECOMPILE);

(from: http://sqlskills.com/blogs/glenn)

This information should be gathered over a period of time, to determine if it is a regular occurrence. If you see that a database is consistently using most of the CPU resources, you can

  • Update statistics – to ensure SQL is executing queries in the most efficient way
  • Create/rebuild indexes – to help eliminate fragmentation and forwarded records, as these cause SQL to work harder to find records (minimal examples of both steps follow this list)
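
As a minimal illustration of those two steps (the database and table names below are hypothetical; substitute your own objects and run during a maintenance window):

-- Hypothetical object names; substitute your own
USE YourDatabase;
UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;  -- refresh statistics on one table
ALTER INDEX ALL ON dbo.YourTable REBUILD;       -- rebuild all indexes on that table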

6. Excessive compilations/recompilations

When a query is submitted to SQL, it has to create or reuse a query plan. Creating a query plan can be very CPU intensive, so reusing them is easier on the CPU. Query plans may have to be recompiled if the object schema changes, or if the statistics change on the underlying table or database.

You can view these metrics through Perfmon using these counters:

  • SQLServer:SQL Statistics: SQL Compilations/sec
  • SQLServer:SQL Statistics: SQL Re-Compilations/sec

The ideal value for Re-Compilations/sec is zero, and the ideal value for Compilations/sec is less than 25% of SQLServer:SQL Statistics: Batch Requests/sec.

Blog_20151001_6

If your compilations or recompilations are at a higher than optimal number, consider the following:

  • Optimize for ad-hoc – enable the 'optimize for ad hoc workloads' server option (an example follows this list)
  • Enable parameterization – Query plan reuse is critical for optimal performance on highly transactional systems and it is most often achieved by parameterizing your queries or stored procedures.
  • Avoid statistics based recompilations – by using KEEP PLAN hint or KEEPFIXED PLAN hint
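
For example, the 'optimize for ad hoc workloads' option mentioned above is a server-wide setting that can be enabled with sp_configure (test it before enabling in production):

-- Cache only a plan stub on the first execution of an ad-hoc query
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;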

7. SQL Server Configuration

    Making sure your SQL Server is configured for optimal performance and according to best practices can make a huge difference.  Some of the things you can look at are:
    • Max Degree of Parallelism – if you are seeing a lot of CXPACKET wait types, you may need to adjust your MAXDOP (an example follows this list).
    • Affinity Mask – this may or may not make a big difference
    • Priority Boost (don't enable) – this too, may or may not make a big difference
    • Max worker threads – adjusting this could help, but is beyond the scope of this article.
    • File organization on underlying disks – choice of RAID array, and the organization and placement of file types make a big difference in the performance of your server
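
As an example, MAXDOP is also changed through sp_configure. The value 4 below is only a placeholder; the right value depends on the number of cores, the NUMA layout, and your workload:

-- Placeholder MAXDOP value; evaluate for your own hardware and workload
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;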


Conclusion

Troubleshooting CPU issues can be a very painstaking process as there are multiple combinations of issues that can be causing stress. The methods outlined above provide a basic approach to troubleshooting CPU pressure. It could be hardware, software, system, SQL, web, network, and a host of other problems, as well as a complicated combination of these! As a DBA, your main concern is to find and correct any SQL related issues causing this pressure on the CPU.

If you would like assistance configuring, maintaining, or monitoring your SQL Server and instances please contact us at SQLRx@sqlrx.com. We would be happy to answer any question you have, and would be interested to hear about your SQL experiences as well!


For more information about blog posts, concepts and definitions, further explanations, or questions you may have…please contact us at SQLRx@sqlrx.com. We will be happy to help! Leave a comment and feel free to track back to us. We love to talk tech with anyone in our SQL family!

The Pros & Cons of Hyperthreading

— By Jeff Schwartz

Hyperthreading is one of those features that most IT personnel have heard about, but many do not understand exactly what it does or why it may improve or degrade the performance of their servers. Although hyperthreaded processors have been available for many years, the confusion surrounding it is as commonplace today as it was when the technology was introduced in 2002. Server, processor, and motherboard manufacturers have contributed to the confusion by activating the feature by default for several years after it was introduced, then deactivating it by default for a few years, and recently by activating it again by default. In addition, for many years, determining whether it was on or off could be accomplished only by reviewing the BIOS settings during a reboot, which is extremely difficult to schedule for most production servers.

What is Hyperthreading?

Hyperthreading involves having two distinct pipelines into a single processor execution core. Research into code sequences showed that code executing on a processor frequently, and naturally, pauses because of things such as main memory fetches into the processor cache and improper code branch prediction. The rationale is to provide the execution portion of an idling core processor with work to do while these kinds of activities are handled. These two pipelines are presented to the outside world as two distinct processors: one physical and one “logical.” Therefore, as far as any operating system is concerned, twice as many processors are available when hyperthreading is active compared to when it is not.

Why not use it all the time?

In theory, utilizing available processor cycles is a great idea and therefore, this mechanism should be used all the time. However, in practice, some problems arise. Two main issues are involved: processor schedulers and the length of time a unit of work stays on a processor without interruption. Since the execution portion of a processor is still a single entity, scheduling appropriately is critically important, i.e., two threads from the same process should NOT be scheduled on the same core. If they are, each thread will take turns stopping the other and this degrades performance. The difficulty, particularly in the SQL Server environment, is that SQL Server schedules its own threads and depending upon the SQL Server version as well as the Windows version on which it is running, SQL Server may or may not be aware of the physical/logical distinction or know how to handle this distinction properly. Therefore, there is a good chance that it may schedule two dependent threads on the same core. The other issue is that the threads of a large or parallelized SQL Server query usually consume their full time allotment on a processor before they are suspended. Any thread sitting in the other pipe of a hyperthreaded processor will wait the full time before it is allowed to run again. If that thread is from the same query, the two threads will stop each other from running and make the query run substantially longer. Therefore, under normal conditions, most SQL Server customers do not benefit from employing hyperthreading.

One type of workload, high volume small transaction processing, DOES benefit from hyperthreading, but even then it does not benefit as much as one might expect. Benchmarks and actual user experiences have demonstrated that at most, a 30 percent gain can be obtained when transactions are extremely small, e.g., they execute in under 200 milliseconds. Therefore, if a server only handles small high volume transactions regardless of whether they are inquiry, insert, or update, it will benefit from hyperthreading. As soon as any large queries arrive at the server, the benefit is gone and the problems cited in the previous paragraph arise.

How can I tell whether Hyperthreading is active?

Historically, this has been difficult to determine, but in the past few years, Windows and WMI have improved to provide this information without interrogating the BIOS. Older versions did not provide adequate or accurate information, but beginning with Windows 2008, the information is quite useful. This results from the fact that older versions of Windows either did not know or report the physical/logical processor distinction. The following PowerShell (other languages that can access WMI will work also) commands provide the necessary information. The first command returns the number of sockets on the system followed by the number of logical processors, and the second command returns information regarding the processors themselves. The first output was obtained on a quad-core single-socket system with hyperthreading turned off, and the second output was obtained from the same system with hyperthreading turned on. Clearly, the number of logical processors doubles when hyperthreading is turned on.

gwmi win32_computersystem | fl NumberOfProcessors, NumberOfLogicalProcessors

Blog_20150813_1
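
If you would rather check from inside SQL Server, sys.dm_os_sys_info exposes similar information. This is only a rough sketch; hyperthread_ratio reports the number of logical cores per physical processor package, so you still need to compare it to the known core count of your CPUs to tell whether hyperthreading is on:

-- Logical CPU count and logical-cores-per-socket ratio as seen by SQL Server
SELECT cpu_count, hyperthread_ratio
FROM sys.dm_os_sys_info;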

Conclusion

Hyperthreading can be useful in a SQL Server environment if the workload is comprised solely of extremely lightweight transactions. Under these conditions, the most one can expect to obtain from hyperthreading is a 30 percent improvement. If the server is running the latest versions of Windows and SQL Server, it may be possible to obtain some improvement for workloads that are highly transactional with a few small reports mixed in. However, in most SQL Server environments, hyperthreading will most likely degrade performance.

SQL Server Memory in Task Manager – Should I Be Worried?

Blog_20150507_1

–by Ginger Keys

Recently, a client of mine contacted me in a panic because almost all of the memory on the SQL server was used up…according to Task Manager, that is.

Don’t panic…task manager is probably lying.  Well, a little bit anyway.  And Task Manager is just that, a task manager. If you are looking at it to get accurate values for memory (RAM) usage or CPU time, you will get not-so-accurate information.  False alarms.  Information that doesn’t exactly mean what it implies.

 

Blog_20150507_2

After speaking with my very worried client, I checked around on the servers and looked at many performance indicators.  What I usually see is that CPU is low, disk usage is low, and memory used is high – which is what we expect on a SQL machine.  SQL will use as much memory as it can get, to improve performance by caching tons of data in memory in case it needs it, to avoid having to go to disk.  It will immediately release any memory to the OS if needed.  SQL is a memory hog, and will hold onto all the memory we allocate to it, whether it is actually using it or not.  And this is what we want it to do!

 

SQL will take all the memory that is allocated to it when it is needed, store data and execution plans in the buffer, and when it is through using it, it doesn’t release it…it caches that data into memory until either 1) SQL needs it again, or 2) Windows OS requests the memory.

This isn’t an indication of performance stress…it’s just showing that SQL has reserved all the memory that we told it to use (through setting min and max memory on the SQL instance properties [1]).  SQL doesn’t let go of memory unless the server comes under memory pressure (like if other apps need memory and Windows sends out a memory pressure notification).  As long as the OS doesn’t need to use it for anything, SQL will hold on to it for future use.

The default memory setting when you install SQL is for it to use all of the memory. When Windows asks SQL to give memory back, SQL releases it in large amounts, and then proceeds to build back up to the maximum again. This is not optimal for stability or performance, and it is highly recommended to set the limit in the Max Memory setting as a best practice. That way the Windows OS is always guaranteed a certain amount.
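
As an example, Max Server Memory can be set with sp_configure. The 28672 MB value below is purely illustrative; size it based on the total RAM in the box and what the OS and any other applications need:

-- Illustrative value only; leave enough memory for the OS and other processes
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 28672;
RECONFIGURE;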

Click here for more info on setting Max Memory: https://sqlserverperformance.wordpress.com/2009/10/29/suggested-max-memory-settings-for-sql-server-20052008/

Another good reference for setting Max Memory for SQL is here:  http://www.sqlservercentral.com/blogs/glennberry/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008/

Blog_20150507_3

And when looking at Resource Monitor, it is a mistake to use the Available and Free metrics as the key indicators for determining overall system health [2]. The Available MB of memory is comprised of both the Standby and the Free… the stuff in Standby is what Windows has put in the cache (but can be booted out if needed), and Free means it is not even being used at all. The combination is what Windows calls Available. So there is actually data in the cache that Windows says is available. Some of the categories in task manager/resource monitor can be very confusing and even misleading, which is why we shouldn’t use these metrics alone to decide if we are in danger of being out of memory.

Instead of only relying on task manager or resource monitor, we really need to look at other metrics to find out if there is a memory related performance issue on the machine.

Some of the metrics to pay attention to are:

  • Page Life Expectancy
  • Available Bytes
  • Buffer Cache Hit Ratio
  • Target & Total Server Memory
  • Memory Grants Pending
  • Pages/sec (Hard Page Faults)
  • Batch Requests/sec & Compilations/sec

Some of these metrics can be queried from TSQL, or can be observed with Perfmon (Windows Performance Monitor – located in Administrative Tools).

Page Life Expectancy (PLE)

Found in Perfmon at SQL Server: Buffer Manager – Page Life Expectancy, this counter measures the average amount of time unlocked data pages are staying in the buffer pool.   During peak production hours this number may dip into lower numbers, but we generally want this number to remain above 300 seconds (so data is staying in the cache for 5 minutes or longer). The longer data stays in the buffer, the less likely SQL will have to go to disk for I/O operations.

There is an interesting occurrence with page life expectancy… when SQL really does run out of memory, PLE drops very quickly…but it grows back very slowly. You're probably not still having memory issues during the coming-back stage, even though it may look like it's struggling. If PLE stays down, however, then you have a problem.  So be careful, because PLE can be misleading when it takes a drop… there's a difference between it going down & staying down (you have a memory problem), and going down once and crawling back up (which is normal). If it stays down below 300 seconds consistently, you may need to add more RAM.

You can monitor this through Perfmon, or run the following query:

SELECT [cntr_value] FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%' AND [counter_name] = 'Page life expectancy'

Available MBytes

Found in Perfmon as Memory: Available MBytes, this counter detects how much memory is available for use, and whether Windows is experiencing memory pressure. We generally start investigating if this number consistently goes below 500MB, which would indicate low memory, and Windows may start memory management operations. If Available Mbytes is consistently low, you may need to add more RAM.

This specific counter is not exposed through sys.dm_os_performance_counters, so it is normally observed through Perfmon; however, sys.dm_os_sys_memory reports similar information, as sketched below.
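
A minimal sketch of that alternative check:

-- OS-level memory picture as seen by SQL Server
SELECT total_physical_memory_kb / 1024 AS total_physical_mb
    ,available_physical_memory_kb / 1024 AS available_physical_mb
    ,system_memory_state_desc
FROM sys.dm_os_sys_memory;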

 Buffer Cache Hit Ratio

Found in Perfmon as SQL Server: Buffer Manager: Buffer Cache Hit Ratio. This counter averages (since the last restart of your SQL instance) how often SQL Server goes to the buffer pool to get data, and actually finds that data in memory, instead of having to go to disk. We want to see this ratio high in OLTP servers – around 90-95%. The higher the ratio, the less often SQL has to go to disk for I/O operations, which translates into better performance for your end users. If this number is consistently below the 90% mark, you may need to add more RAM to your server to boost performance.

SELECT [cntr_value] FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%' AND [counter_name] = 'Buffer cache hit ratio'

Target & Total Server Memory

Found in Perfmon at SQL Server: Memory Manager – Total Server Memory, and SQL Server: Memory Manager – Target Server Memory. The Total Server Memory is the current amount of memory that SQL Server is using.  The Total Server memory will be quite a bit lower than the Target memory during the initial buffer pool ramp up… it is trying to populate the cache and get pages loaded into memory.  Performance might be a little slower during this time since more disk I/O is required, but this is normal.  After it ramps up, and normal operations resume, Total Server Memory should be very close to Target Server Memory. (The ratio should be close to 1). If Total Server Memory does not increase much, but stays significantly less than Target, this could indicate a couple of things…

1) You may have allocated much more memory than SQL can use – SQL could cache the entire databases into memory, and if the databases are smaller than the amount of memory on the machine, the data won’t take up all the space allocated to SQL. So in this case Total Memory (actually memory being used by SQL) will never reach Target Memory (amount allocated to SQL).   Or,

2) SQL cannot grow the buffer pool because of memory pressure from outside of SQL. If this is the case, you need to either increase the Max Server Memory, or add more RAM to boost performance.

SELECT [cntr_value] FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%' AND [counter_name] IN ('Total Server Memory (KB)','Target Server Memory (KB)')

Memory Grants Pending

Found in Perfmon as SQL Server: Memory Manager – Memory Grants Pending, this counter measures the total number of SQL processes waiting for a workspace memory grant. The general recommendation is that this value should be 1 or less. Anything above 1 indicates there are SQL processes waiting for memory in order to operate.

Memory grants pending could be due to bad queries, missing indexes, sorts or hashes. To investigate this, you can query the sys.dm_exec_query_memory_grants view, which will show which queries (if any) that require a memory grant to execute [5].
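
A minimal sketch of that check (the columns shown are only a subset of what the view returns):

-- Requests currently holding or waiting on workspace memory grants
SELECT session_id
    ,requested_memory_kb
    ,granted_memory_kb  -- NULL while the request is still waiting for its grant
    ,query_cost
    ,dop
FROM sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC;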

If the Memory Grants Pending are not due to the above mentioned conditions, then more memory should be allocated to SQL Server by adjusting Max Server Memory. Adding more RAM should be the last resort in this case.

SELECT [cntr_value] FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%' AND [counter_name] = 'Memory Grants Pending'

Pages/sec (Hard Page Faults)

Found in Perfmon as Memory: Pages/sec, this measures the number of pages read from or written to disk. Pages/sec is the sum of Pages Input/sec and Pages Output/sec and reports the number of requested pages that were not immediately available in RAM and had to be read from the disk (resulting in hard page faults). A Hard Page Fault occurs when windows has to use the swap file on the disk [3]. It’s when the address in memory that’s part of a program is no longer in main memory, but has been instead swapped out to the paging file, making the system go looking for it on the hard disk. Since disk storage is much slower than RAM, performance can be greatly degraded.

We want to see the Pages/sec number below 50, and closer to 0. If you see a high value of Pages/sec on a regular basis, you might have performance degradation, but not necessarily. A high Pages/sec value can happen while doing database backups or restores, importing or exporting data, or by reading a file mapped in memory [4].

Because a hard fault doesn't necessarily indicate a critical error condition depending upon what's normal for your environment, it's a good idea to measure a baseline value, and monitor this number based on that. If the values are consistently higher than your baseline value, you should consider adding more RAM.

Batch Request & Compilations

SQL Server: SQL Statistics – Batch Requests/sec. This is the number of incoming queries.

SQL Server: SQL Statistics – SQL Compilations/sec. This is the number of new execution plans SQL had to build.

If Compilations/sec is 25% or higher relative to Batch Requests/sec, SQL Server is putting execution plans in the cache, but never actually reusing them.  Your valuable memory is being used up to cache query execution plans that will never be used again – instead of caching data.  This is bad. We don't want this to happen. And a high Compilations/sec value (like over 100) indicates there are a lot of Ad-Hoc (one-hit-wonder) queries being run. You can enable the 'optimize for ad hoc workloads' setting if this is the case; with it, SQL caches only a small plan stub on the first execution, and stores the full execution plan only when the query runs a second time.

To query these metrics with TSQL:

SELECT [cntr_value] FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%SQL Statistics%' AND [counter_name] = 'Batch Requests/sec'

SELECT [cntr_value] FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%SQL Statistics%' AND [counter_name] = 'SQL Compilations/sec'

To get the ratio of Compilations to Batch Requests:

SELECT ROUND(100.0 *
    (SELECT [cntr_value] FROM sys.dm_os_performance_counters
     WHERE [object_name] LIKE '%SQL Statistics%' AND [counter_name] = 'SQL Compilations/sec')
    /
    (SELECT [cntr_value] FROM sys.dm_os_performance_counters
     WHERE [object_name] LIKE '%SQL Statistics%' AND [counter_name] = 'Batch Requests/sec')
    , 2) AS [Ratio]

Looking at the counters in Perfmon (Report View) below, we can see that this particular server is in a healthy state, with over 11GB of Available Bytes, no page faults (so I/O is operating from buffer, not disk), Buffer cache with 100% ratio, PLE of 21,515 seconds (that's almost 6 hours that data stays in the buffer), no pending memory grants (so no SQL process is having to wait for memory), ample Total memory, and a low compilation ratio (11%) compared to all the queries being run.

Blog_20150507_4

These measurements should be observed and understood, instead of using Task Manager to make critical decisions about resources on your SQL Server.

If we make changes based only on what we see in Task Manager and reduce the amount of memory SQL can use, then whenever SQL needs more memory to execute processes it will have to read from and write to disk, which will degrade performance.  Reducing the amount of memory SQL can use will slow performance and will most likely increase complaints from end users. The lower you set your Maximum Server Memory, the lower your performance will go.

No matter how much memory you put in a system, SQL Server will use all it can get until it's caching entire databases in memory and then some.  This isn't an accident. SQL Server caches data in memory.  On purpose. That means it is healthy, and enhancing performance for your end users.

References

[1] “Glenn Berry’s SQL Server Performance” https://sqlserverperformance.wordpress.com/2009/10/29/suggested-max-memory-settings-for-sql-server-20052008/

[2] “Use Resource Monitor for Memory Monitoring” by Scott Lowe.   http://www.techrepublic.com/blog/the-enterprise-cloud/use-resource-monitor-for-memory-monitoring/

[3] “Swap File”. TechTarget. http://searchwindowsserver.techtarget.com/definition/swap-file-swap-space-or-pagefile

[4] “SQL Server memory performance metrics – Part 1 – Memory pages/sec and Memory page faults/sec”. Milena Petrovic.  http://www.sqlshack.com/sql-server-memory-performance-metrics-part-1-memory-pagessec-memory-page-faultssec/

[5] “SQL Server memory performance metrics – Part 5 – understanding Lazy Writes, Free List Stalls/sec, and Memory Grants Pending”. M Petrovic. http://www.sqlshack.com/sql-server-memory-performance-metrics-part-5-understanding-lazy-writes-free-list-stallssec-memory-grants-pending/

Top PerfMon Performance Counters for Determining Causes of Slow Response Times

— By Jeff Schwartz

Introduction

Have you ever been placed in the position of being forced to address user complaints about how slow the system is performing without really knowing exactly where to begin looking? Some analysts begin with Task Manager or viewing the Windows event log. Others who are working with SQL Server search the ErrorLog for problems or run Activity Monitor from SQL Server Management Studio (SSMS). Sometimes, when I/Os take WAY too long to complete, entries will appear in the Errorlog, but generally none of these displays much pertinent information. On Windows Server 2012, Task Manager was enhanced greatly so the memory, disk, and network subsystems can be observed better in real-time. Previous versions displayed considerably less detail, particularly with respect to I/O. Unfortunately, Task Manager has two inherent weaknesses: an analyst has to be watching it when the problem occurs and interpret the data before it scrolls off the screen within 60 seconds, and the one-second sampling often gives the viewer the impression that things are horrible when they are actually fine. Both are too fast for anyone to figure out a performance problem!

When confronted with a situation like this, it is often helpful to know about a handful of PerfMon counters that can be used to rule out a given area, if not highlight the problem area(s). The counters can be collected over time and then reviewed once a performance problem has occurred. Although the list below is far from complete, they provide a very nice overview of what is actually going on with the system.

Processor Object

% Processor Time

This metric measures the amount of time that one or more processors are busy performing any kind of work. This metric can be captured for the entire processor complex (same as Task Manager) or for an individual processor. If this value is greater than 70 percent for long periods, the processor(s) is struggling to keep up with the presented workload. Very high spikes are commonplace, but sustained levels that approach saturation are problematic.

% Privileged Time

This metric measures the amount of time that one or more processors are busy performing Windows overhead work, e.g., handling the behind-the-scenes work for scheduling threads on processors, disk and network I/Os, and memory and device management. This metric is comprised of Windows kernel time, % Interrupt Time, and % DPC Time, so to compute the true Windows kernel time, the other two must be subtracted from % Privileged Time. Although this metric can be captured for the entire processor complex or for an individual processor, it is best to collect for each processor, and if its value is consistently greater than 30 percent for any single processor, overhead activities are unusually high and should be investigated using Resource Monitor or Event Tracing for Windows (ETW). The only exception to this rule occurs during system backup periods because this is usually performed in a mode that is charged to the Windows overhead work category.

% Interrupt Time

Windows utilizes two kinds of interrupts: immediate and deferred. This metric measures the amount of time that one or more processors are busy handling immediate interrupts, and contributes to % Privileged Time. Immediate interrupts are those that MUST be handled right away. Interestingly, as Windows has evolved, disk I/Os once were immediate and now they are often deferred. Although this metric can be captured for the entire processor complex or for an individual processor, it is best to collect for each processor, and if its value is consistently greater than 20 percent for any single processor, immediate interrupt handling activities are unusually high and should be investigated using Resource Monitor or Event Tracing for Windows (ETW). The only general exception to this rule occurs during system backup periods because this is usually performed in a mode that is charged to the Windows overhead work category. Expect this value to be higher on the processor(s) that handle network connections.

% DPC Time

This metric measures the amount of time that one or more processors are busy handling deferred interrupts, and contributes to % Privileged Time. The most consistent example of deferred interrupts involves network packets. A system that is handling an enormous amount of network traffic will always exhibit high numbers of deferred interrupts. Beginning with Windows Server 2008, network packets are handled by multiple processors, although only a single processor handles any given network connection. Although this metric can be captured for the entire processor complex or for an individual processor, it is best to collect for each processor, and if its value is consistently greater than 25 percent for any single processor, deferred interrupt handling activities are unusually high and should be investigated using Resource Monitor or Event Tracing for Windows (ETW). The only general exception to this rule occurs during system backup periods because this is usually performed in a mode that is charged to the Windows overhead work category and the network is generally utilized extensively. Expect this value to be higher on the processor(s) that handle network connections.

PhysicalDisk Object (and LogicalDisk)

% Idle Time

This metric measures the amount of time that one or more physical or logical (depending upon the object) disk LUNs are NOT busy performing any kind of work. This metric is only useful when applied to a specific disk LUN; the aggregate value is meaningless except in a capacity planning context. If this value is LESS than 40 percent for long periods, the physical or logical disk LUN is struggling to keep up with the presented workload. Disk response times often increase exponentially as this value approaches zero. In this situation, either the workload should be reduced or the hardware should be upgraded.

Avg. Disk sec/Transfer

This metric measures the amount of time in seconds that the average I/O request requires to complete. This can be considered a true disk LUN response time because the timer is activated when the I/O is extracted from the disk queue and it is terminated when a processor handles the I/O completion. If this value is greater than 0.040 seconds (40 milliseconds) for long periods, the disk LUN is struggling to keep up with the presented workload, so the pressure should be alleviated by workload reduction or hardware upgrade.
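
Perfmon reports this per disk LUN. A comparable per-database-file view can be sketched from inside SQL Server using sys.dm_io_virtual_file_stats; note the numbers are cumulative since the instance started, so sample twice and difference the values to see current behavior:

-- Average I/O latency per database file, cumulative since instance start
SELECT DB_NAME(vfs.database_id) AS database_name
    ,mf.physical_name
    ,vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_ms
    ,vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
    AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;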

Avg. Disk sec/Write

This metric measures the amount of time in seconds that the average write I/O request requires to complete. This value can be especially useful in RAID 5 environments because it will indicate clearly when the disk LUN is asked to handle too many writes. If this value is greater than 0.040 seconds (40 milliseconds) for long periods, the disk LUN is struggling to keep up with the write workload. If this occurs, the write workload should be reduced or a different level of RAID should be used, if appropriate.

Memory Object

Available Bytes

This metric measures the amount of memory available to Windows. Note: this value often includes memory that has been allocated to the Windows cache, so you may observe values that appear to be double-counted as shown in the picture below. Free Memory is the true amount of memory that isn’t being used by anything. Windows always attempts to leave memory that is cached resident as long as possible to reduce the probability of having to perform repeated I/Os. If the available value is less than one GB for long periods, the system is short on physical memory. In SQL Server environments, it may be possible to reduce the amount allocated to SQL Server, but this may create problems with SQL Server performance.

Memory

Page Reads/sec

This metric measures the number of read operations performed outside SQL Server. This can involve reading from files and may highlight a memory shortage only indirectly. When this number is consistently high and files are not being read into memory, the system may be short on memory.

SQLServer:Buffer Manager

Page Life Expectancy

This metric measures the weighted average of the time unlocked buffers remain in the SQL Server buffer pool. This metric exhibits unusual behavior in that when SQL Server runs out of memory, this value will often plummet and even when the problem is gone, it will increase very slowly over time. Therefore, although it is tempting to interpret values below the desired threshold of 300 seconds as representing a memory problem, one has to look at the trend because it may be simply that SQL Server ran out of memory and then recovered. However, it can also indicate a chronic SQL Server memory shortage.

Page Reads/sec

This metric measures the number of read operations performed by SQL Server. Windows reads are not part of this number. This metric can be combined with the physical or logical disk metrics to determine whether SQL Server is causing stress on the disk LUNs.
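
If a Perfmon collection is not available, the same counter can be read from sys.dm_os_performance_counters. The raw value is cumulative, so two samples must be differenced (and divided by the elapsed seconds) to get a true per-second rate:

-- Cumulative Buffer Manager page reads since instance start
SELECT [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%' AND [counter_name] = 'Page reads/sec';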

Conclusion

Using this short list of PerfMon counters, an analyst can quickly look at data that has been captured over any given period to determine whether significant problems occurred. Note: the PerfMon sampling interval should be set to no less than five seconds, because sampling more frequently than that creates so much noise and distraction that it makes accurate and reasonable assessment of performance virtually impossible.