Increase VMBus buffer sizes to increase network throughput to guest VMs
02 February 10 09:22 PM | winsrvperf | 0 Comments   

Under load, the default buffer size used by the virtual switch may provide inadequate buffering and result in packet loss. We recommend increasing the VM bus receive buffer from 1 MB to 2 MB.

Traffic jams happen every day, all across the world. Too many vehicles competing for the same stretch of road, gated by flow control devices like stop signs and traffic lights, conspire to ensnare drivers in a vicious web of metal and plastic and cell phones. In the technology world, networking traffic is notoriously plagued by traffic jams, resulting in all sorts of havoc, including delayed web pages, slow email downloads, robotic VOIP and choppy YouTube videos. (Oh, the humanity!)

Virtualized networking can be complicated, what with the root and child partitions relaying packets across the VM bus to reach the physical NIC. The VM bus, anticipating contention, uses buffers to queue data while the recipient VM is swapped out or otherwise not keeping up with the traffic. The default buffer size for WS08 R2 is 1 MB, which provides 655 packet buffers (1,600 bytes per buffer).

The hypervisor, meanwhile, calculates a scheduling interval, or quantum, derived from the system's interrupt rate. The hypervisor attempts to ensure every VM has a chance to run within that interval, at which time the VM wakes up and does whatever processing it needs to do (including reading packets from the VM bus). At very low interrupt rates, that quantum can be nearly 10ms.

Whereas a native system handles on the order of 260,000 packets/second, virtualized systems can, in the worst case, begin seeing packet loss under traffic loads as low as 65,500 packets/second. This isn’t an inherent tax incurred by virtualizing or a design limit; rather, it’s the result of specific characteristics of server load requiring more VM bus buffer capacity. If the logical processors hosting the guest partitions are receiving very few hardware interrupts, the scheduling quantum grows larger, approaching 10ms. A longer scheduling quantum results in longer idle periods between VM execution slices. If the VM is going to spend almost 10ms asleep, then the VM bus’ packet buffers must be able to hold 10ms worth of data. As the idle time for a VM approaches 10ms, the maximum sustainable networking speed can be calculated as:

655 default packet buffers / ~10ms idle interval = maximum 65,500 packets / second

We can increase throughput, though, by increasing the amount of memory allocated to the buffers. How much should it be increased? On paper, 4 MB is the maximum useful size; a 4 MB buffer provides about 2,600 buffers, which can handle 10ms’ worth of data flowing at approximately 260,000 packets per second (the maximum rate sustainable by native systems). In reality, depending on the workload, the VM’s swapped-out time probably doesn’t approach the maximum 10ms quantum. Therefore, depending on how frugal you want to be with memory, increasing to 2 MB is probably adequate for most scenarios. If you’re living large in the land of RAM, lighting your cigars by burning 4 GB memory sticks, then go for broke and crank the buffers up to 4 MB.
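
If you want to rerun the sizing arithmetic with your own numbers, here is a minimal PowerShell sketch; it assumes the 1,600-byte buffer size and the worst-case ~10 ms idle interval described above:

# Sustainable rate = (buffer size / 1,600 bytes per buffer) / worst-case ~10 ms idle interval
$bufferBytes = 1600
$idleSeconds = 0.010
foreach ($sizeKB in 1024, 2048, 4096) {
    $buffers = [math]::Floor(($sizeKB * 1KB) / $bufferBytes)
    $maxPps  = [math]::Floor($buffers / $idleSeconds)
    "{0,4} KB buffer -> {1,4} packet buffers -> ~{2:N0} packets/second" -f $sizeKB, $buffers, $maxPps
}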

The buffers are allocated from the guest partition’s memory, and updating the buffer size requires adding two registry values in each guest VM. To increase the buffer size, we first need the GUID and index associated with the network adapter. In the guest VM, open Device Manager, expand Network Adapters, right-click Microsoft Virtual Machine Bus Network Adapter and choose Properties. (If you have a driver marked “(emulated)”, you should take a detour to install Integration Services from the VM’s Action menu, then add a new synthetic network adapter through the VM settings; see http://technet.microsoft.com/en-us/library/cc732470(WS.10).aspx , step 3, for instructions.)

On the Network Adapter Properties dialog, select the Details tab. Select Driver Key in the Property pull-down menu, as shown in figure 1:

[Figure 1: Network adapter properties, Details tab, with the Driver Key property selected and its GUID\index shown in the Value box]

Record the GUID\index found in the Value box, as shown in figure 1, above. Open regedit and navigate to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{GUID}\{index} as shown in figure 2:

[Figure 2: Registry Editor open to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{GUID}\{index}]

Right-click the index number and create two new DWORD values named ReceiveBufferSize and SendBufferSize (see figure 3). These values specify the memory allocated to buffers in 1 KB units, so 0x400 equates to 1,024 KB of buffer space (the default, roughly 655 buffers). In this example, we’ve doubled the buffer size to 0x800, or 2,048 KB of memory, as shown in figure 3:

[Figure 3: ReceiveBufferSize and SendBufferSize DWORD values set to 0x800]
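
If you would rather script the change than click through regedit, something like the following should work from an elevated PowerShell prompt in the guest. The network-class GUID shown is the standard one, but the \0007 index is purely a placeholder; substitute the GUID\index you recorded from the Driver Key property in figure 1.

# Placeholder driver key -- replace with the GUID\index recorded from the adapter's Driver Key property.
$driverKey = '{4d36e972-e325-11ce-bfc1-08002be10318}\0007'
$path = "HKLM:\SYSTEM\CurrentControlSet\Control\Class\$driverKey"

# 0x800 = 2,048 KB of buffer space (double the 1,024 KB default).
New-ItemProperty -Path $path -Name ReceiveBufferSize -PropertyType DWord -Value 0x800 -Force | Out-Null
New-ItemProperty -Path $path -Name SendBufferSize    -PropertyType DWord -Value 0x800 -Force | Out-Null

You will likely need to restart the VM (or disable and re-enable the adapter) before the new buffer sizes take effect.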

Your workloads and networking traffic may not need increased buffers; however, these days, 4 MB of RAM isn’t a tremendous amount of memory to invest as an insurance policy against packet loss. Now, if only I could increase a few buffers and alleviate congestion on my daily commute!

Tom Basham

Virtualization Performance PM, Windows Fundamentals Team

NUMA Node Balancing
10 December 09 10:56 PM | winsrvperf | 0 Comments   

For those of you running Hyper-V on NUMA (Non-Uniform Memory Access) based platforms (such as HP’s DL785), this post presents a suggestion for fine-tuning the placement of a virtual machine (VM) onto a specific NUMA node.

 

Every VM in Hyper-V has a default NUMA node preference. Hyper-V uses this NUMA node preference when assigning physical memory to the VM and when scheduling the VM’s virtual processors. A VM runs best when the virtual processors and the memory that backs the VM are on the same NUMA node, because accessing memory on a remote node is significantly slower than accessing local memory.

 

By default, the Hypervisor assigns the NUMA node preference each time the VM is started, choosing the best fit based on the resources the VM needs; preference is given to a NUMA node that can back the VM’s memory entirely with locally available node memory.

 

Obviously, the Hypervisor can’t predict the RAM or CPU needs of VMs that haven’t been created yet. It may have already distributed the existing VMs across the NUMA nodes, fragmenting the remaining NUMA resources such that none of the nodes can fully satisfy the next VM to be created. Under these circumstances, you may want to place specific VMs on specific NUMA nodes to avoid the use of remote NUMA memory or to alleviate CPU contention.

 

Consider the situation with the following 4 VMs:

·         Test-1, configured with 2 virtual processors and 3 GB of memory

·         Test-2, configured with 4 virtual processors and 4 GB of memory

·         Test-3, configured with 2 virtual processors and 3 GB of memory

·         Test-4, configured with 1 virtual processor and 1 GB of memory

 

The partial screen shot below from Performance Monitor (PerfMon) shows how the Hypervisor has nicely balanced the 4 virtual machines across 2 different NUMA nodes (see “Preferred NUMA Node Index”).

 

[Figure: PerfMon screen shot showing Preferred NUMA Node Index and Remote Physical Pages for the four VMs]

Notice how Hyper-V has balanced the VMs across the two NUMA nodes: two VMs on node 0 and two VMs on node 1. In this case the memory requirements of each VM can be met by local node memory (“Remote Physical Pages” is zero for all the VMs).
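
If you would rather check placement from a script than eyeball PerfMon, a quick sketch like this works on my hosts; I’m assuming the counters live under the Hyper-V VM Vid Partition object, which is where they show up on my Windows Server 2008 R2 systems:

# List each VM's preferred NUMA node and any remote physical pages.
Get-Counter '\Hyper-V VM Vid Partition(*)\Preferred NUMA Node Index',
            '\Hyper-V VM Vid Partition(*)\Remote Physical Pages' |
    Select-Object -ExpandProperty CounterSamples |
    Format-Table InstanceName, Path, CookedValue -AutoSize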

 

Now consider the case where I want test-2, my biggest and most memory- and CPU-hungry VM, to have its own dedicated NUMA node. How can this be achieved?

 

The trick is to explicitly set the VM’s NUMA node preference; when the VM is started, Hyper-V will honor that preference.

 

Tony Voellm previously wrote a blog posting about setting and getting the NUMA node preferences for a VM (http://blogs.msdn.com/tvoellm/archive/2008/09/28/Looking-for-that-last-once-of-performance_3F00_-Then-try-affinitizing-your-VM-to-a-NUMA-node-.aspx). 

 

By using Tony’s script, I can explicitly set the preferences to force test-2 to always be on the first NUMA node and all the other machines (test-1, test-3, and test-4) on the second NUMA node (e.g. “numa.ps1 /set test-2 0”, “numa.ps1 /set test-1 1”, “numa.ps1 /set test-3 1”, “numa.ps1 /set test-4 1”).

 

The preference takes effect when the VM is started, so if you set the preference on an already running VM, you will need to restart the VM for it to take effect. The preference also moves with the VM: if the VM is moved to another host, by either Quick Migration or Live Migration, the preference will take effect on the new host.

 

After restarting the VMs (so that my preferences took effect), you can see in the partial screen shot from PerfMon below that test-2 is alone on node 0, and the other VMs are on node 1.

 

[Figure: PerfMon screen shot after setting preferences – test-2 on node 0, the other VMs on node 1]

Now whenever the test-2 VM is started, it will be assigned to NUMA node 0, regardless of what is already running on node 0.

 

Finally, a word of caution: be careful when setting the NUMA node preference of a VM. If all the VMs are assigned to the same NUMA node, they may behave poorly because they cannot get enough CPU time; and if there isn’t enough local memory on the node for a new VM, that VM won’t even start.

 

Tim Litton & Tom Basham

Windows Performance Team 

 

File Server Capacity Tool (FSCT) 1.0
13 November 09 07:06 PM | winsrvperf | 0 Comments   

What is “FSCT”?

New builds of Microsoft Windows are produced almost every day for internal development and testing. In order to detect performance regressions as soon as possible, those builds have to be evaluated and compared to their predecessors as well as previous public releases. A range of performance tests are used for these comparisons, including one called “FSCT” (which stands for File Server Capacity Tool).  FSCT was developed by the Windows Server Performance team as a tool capable of simulating multiple concurrent users accessing a file server using CIFS/SMB/SMB2. Its architecture allows for usage of “workloads”. Each workload consists of a set of basic scenarios (e.g. upload of a file using Explorer, upload of a file using xcopy, Microsoft Word opening a file, etc.), information on how often those scenarios should be performed by users, and what files on the file server under test should be used.  FSCT measures a file server’s capacity for a given workload.  This includes the highest throughput the server can sustain and the maximum number of active users the server can support.  FSCT also reports on server resource utilization, which can help identify performance bottlenecks such as network or storage bandwidth or CPU utilization.

The HomeFolders workload

The publicly available version of “FSCT” comes with a single workload called HomeFolders. It was created by working with Microsoft IT administrators to capture Event Tracing for Windows (ETW) traces from real, heavily accessed file servers and their clients. The data included server-side traces, client API traces, and network traces, capturing the request type distribution, parameter distribution, file set characteristics, and connection characteristics. Finally, the workload was created and tested to achieve approximately the same usage patterns found in the traces.

The workload development effort included creating the scenarios, creating the file sets, and defining the number of runs per user per hour for each scenario. It is important to mention that “FSCT” and the HomeFolders workload were developed independently. The “FSCT” architecture allows for the creation of custom workloads and scenarios, as well as tweaking the existing workloads. However, as mentioned previously, only the HomeFolders workload is included in the initial release of “FSCT”.

You can find more information on FSCT at http://blogs.technet.com/josebda/archive/2009/09/16/file-server-capacity-tool-fsct-1-0-available-for-download.aspx

High Speed Networking Deployment Guide Released
20 October 09 05:51 PM | winsrvperf | 0 Comments   

Loyal readers - many of the performance questions we get in the Windows Server Performance team pertain to networking. We've described several networking performance features and applicable scenarios in previous posts and in our tuning guides.

We wanted to inform you that an incredibly useful white paper was recently published from our friends in the High Speed Networking team titled "Networking Deployment Guide: Deploying High-Speed Networking Features." The document covers concepts, deployment instructions, and diagnostic monitoring tools for high speed networking features available in Windows Server 2008 and 2008 R2 such as TCP Chimney, Receive Side Scaling (RSS), netDMA, and virtual machine queue (VMQ).

If you've ever been curious about the difference between the Automatic and Enabled settings for TCP Chimney, or wanted to know which registry keys to set to enable VMQ interrupt coalescing (tip: coalescing may reduce the overhead of VMQ interrupt handling!), this document (direct link) is where to get the information.

Happy Configuring!

Matthew Robben

Program Manager, Windows Server Performance Team

Interpreting CPU Utilization for Performance Analysis
06 August 09 09:02 PM | winsrvperf | 3 Comments   

CPU hardware and features are rapidly evolving, and your performance testing and analysis methodologies may need to evolve as well. If you rely on CPU utilization as a crucial performance metric, you could be making some big mistakes interpreting the data. Read this post to get the full scoop; experts can scroll down to the end of the article for a summary of the key points.

 

If you’re the type of person who frequents our server performance blog, you’ve probably seen (or watched) this screen more than a few times:

[Figure: Windows Task Manager, Performance tab]

This is, of course, the Performance tab in Windows Task Manager. While confusion over the meaning of the Physical Memory counters is a regular question we field on the perf team, today I’m going to explain how CPU utilization (referred to here as CPU Usage) may not mean what you would expect!

 

[Note: In the screenshot above, CPU utilization is shown as a percentage in the top left. The two graphs on the top right show a short history of CPU usage for two cores. Each core gets its own graph in Task Manager.]

 

CPU utilization is a key performance metric. It can be used to track CPU performance regressions or improvements, and is a useful datapoint for performance problem investigations. It is also fairly ubiquitous; it is reported in numerous places in the Windows family of operating systems, including Task Manager (taskmgr.exe), Resource Monitor (resmon.exe), and Performance Monitor (perfmon.exe).

 

The concept of CPU utilization used to be simple.  Assume you have a single core processor fixed at a frequency of 2.0 GHz. CPU utilization in this scenario is the percentage of time the processor spends doing work (as opposed to being idle). If this 2.0 GHz processor does 1 billion cycles worth of work in a second, it is 50% utilized for that second. Fairly straightforward.

 

Current processor technology is much more complex. A single processor package may contain multiple cores with dynamically changing frequencies, hardware multithreading, and shared caches. These technological advances can change the behavior of CPU utilization reporting mechanisms and increase the difficulty of performance analysis for developers, testers, and administrators. The goal of this post is to explain the subtleties of CPU utilization on modern hardware, and to give readers an understanding of which CPU utilization measurements can and cannot be compared during performance analysis.

 

CPU Utilization’s Uses

For those who are unaware, CPU utilization is typically used to track CPU performance regressions or improvements when running a specific piece of code. Say a company is working on a beta of their product called “Foo.” In the first test run of Foo a few weeks ago, they recorded an average CPU utilization of 25% while Foo was executing. However, in the latest build the average CPU utilization during the test run is measured at 75%. Sounds like something’s gone awry.

 

CPU utilization can also be used to investigate performance problems. We expect this type of scenario to become common as more developers use the Windows Performance Toolkit to assist in debugging applications. Say that Foo gets released for beta. One customer says that when Foo is running, their system becomes noticeably less responsive. That may be a tough bug to root cause. However, if the customer submits an XPerf trace, CPU utilization (and many other nifty metrics) can be viewed per process. If Foo.exe typically uses 25% CPU on their lab test machines, but the customer trace shows Foo.exe is using 99% of the CPU on their system, this could be indicative of a performance bug.

 

Finally, CPU utilization has important implications on other system performance characteristics, namely power consumption. Some may think the magnitude of CPU utilization is only important if you’re bottlenecked on CPU at 100%, but that’s not at all the case. Each additional % of CPU Utilization consumes a bit more juice from the outlet, which costs money. If you’re paying the electricity bill for the datacenter, you certainly care about that!

 

Before I go further, I want to call out a specific caveat for the more architecturally-aware folks. Earlier, I used the phrase “cycles worth of work”. I will avoid defining the exact meaning of “work” for a non-idle processor. That discussion can quickly become contentious. Metrics like Instructions Retired and Cycles per Instruction can be very architecture and instruction dependent and are not the focus of this discussion. Also, “work” may or may not include a plethora of activity, including floating point and integer computation, register moves, loads, stores, delays waiting for memory accesses and IO’s, etc.  It is virtually impossible for every piece of functionality on a processor to be utilized during any given cycle, which leads to arguments about how much functionality must participate during “work” cycles.

 

Now, a few definitions:

Processor Package: The physical unit that gets attached to the system motherboard, containing one or more processor cores. In this blog post “processor” and “processor package” are synonymous.

Processor Core: An individual processing unit that is capable of executing instructions and doing computational work. In this blog post, the terms “CPU” and “core” are intended to mean the same thing. A “Quad-Core” processor implies four cores, or CPUs, per processor package.

Physical Core: Another name for an instance of a processor core.

Logical Core: A special subdivision of a physical core in systems supporting Simultaneous Multi-Threading (SMT). A logical core shares some of its execution path with one or more other logical cores. For example, a processor that supports Intel’s Hyper-Threading technology will have two logical cores per physical core. A “quad-core, Hyper-Threaded” processor will have 8 logical cores and 4 physical cores.

Non Uniform Memory Access (NUMA) – a type of system topology with multiple memory controllers, each responsible for a discrete bank of physical memory. Requests to each memory bank on the system may take different amounts of time, depending on where the request originated and which memory controller services the request.

NUMA node: A topological grouping of a memory controller, associated CPUs, and associated bank of physical memory on a NUMA system.

Hardware thread: A thread of code executing on a logical core.

Affinitization: The process of manually restricting a process or individual threads in a process to run on a specific core, package, or NUMA node.

Virtual Processor: An abstract representation of a physical CPU presented to a guest virtual machine.

 

Comparisons & Pitfalls

CPU utilization data is almost always useful. It is a piece of information that tells you something about system performance. The real problem comes when you try to put one piece of data in context by comparing it to another piece of data from a different system or a different test run. Not all CPU utilization measurements are comparable - even two measurements taken on the same make and model of processor. There are a few sources of potential error for folks using utilization for performance analysis: hardware features and configuration, OS features and configuration, and measurement tools can all affect the validity of the comparison.

 

1.       Be wary of comparing CPU utilization from different processor makes, families, or models.

This seems obvious, but I mentioned a case study above where the Foo performance team got a performance trace back from a customer, and CPU utilization was very different from what was measured in the lab. The conclusion that 99% CPU utilization = a bug is not valid if processors are at all different, because you’re not comparing apples to apples. It can be a useful gut-check, but treat it as such.

 

Key takeaway #1: Processor of type A @ 100% utilization IS NOT EQUAL TO Processor of type B @ 100% utilization

 

2.       Resource sharing between physical cores may affect CPU utilization (for better or worse)

Single-core processors, especially on servers, are uncommon; multi-core chips are now the norm. This complicates a utilization metric for a few reasons. Most significantly, resource sharing between processor cores (logical and physical) in a package makes “utilization” a very hard-to-define concept. L3 caches are almost always shared amongst cores; L2 and L1 might also be shared. When resource sharing occurs, the net effect on performance is workload dependent. Applications that benefit from larger caches could suffer if cache space is shared between cores, but if your workload requires synchronization, it may be beneficial for all threads to be executing with shared cache. Cache misses and other cache effects on performance are not explicitly called out in the performance counter set.  So the reported utilization includes time spent waiting for cache or memory accesses, and this time can grow or shrink based on the amount and kind of resource sharing.  

 

Key takeaway #2: 2 HW threads on the same package @ 100% utilization IS NOT EQUAL TO 2 HW threads on different packages @ 100% utilization (for better or worse)

 

 

3.       Resource sharing between logical cores may affect CPU utilization (for better or worse)

Resource sharing also occurs in execution pipelines when SMT technologies like Intel’s Hyper-threading are present. Logical cores are not the same as physical cores - execution units may be shared between multiple logical cores. Windows considers each logical core a CPU, but seeing the term “Processor 1” in Windows does not imply that the corresponding silicon is a fully functioning, individual CPU.

 

Consider 2 logical cores sharing some silicon on their execution path. If one of the logical cores is idle, and the other is running at full bore, we have 100% CPU utilization for one logical core. Now consider when both logical cores are active and running full bore. Can we really achieve double the “work” of the previous example? The answer is heavily dependent on the workload characteristics and the interaction of the workload with the resources being shared. SMT is a feature that improves performance in many scenarios, but it makes evaluating performance metrics more…interesting. 

 

Key takeaway #3: 2 HW threads on the same logical core @ 100% utilization IS NOT EQUAL TO 2 HW threads on different logical cores @ 100% utilization (for better or worse)

 

 

4.       NUMA latencies may affect CPU utilization (for better or worse)

An increasing percentage of systems have a NUMA topology. NUMA and resource sharing together imply that system topology can have dramatic effects on overall application performance. Similar to the previous two pitfalls, NUMA effects on performance are workload dependent.

 

If you want to see which cores belong to which NUMA nodes, right-click a process on the “Processes” tab of Task Manager and click “Set Affinity…”. You should get a window similar to the one below, which shows the CPU-to-node mapping if a server is NUMA-based. Another way to get this information is to execute the “!NUMA” command in the Windows Debugger (windbg.exe).

[Figure: Processor Affinity dialog showing the CPU-to-NUMA-node mapping]

 

 Key takeaway #4: 2 HW threads on the same NUMA node @ 100% utilization IS NOT EQUAL TO 2 HW threads on different NUMA nodes @ 100% utilization (for better or worse)

 

 

5.       Processor power management (PPM) may cause CPU utilization to appear artificially high

Power management features introduce more complexity to CPU utilization percentages. Processor power management (PPM) matches CPU performance to demand by scaling the frequency and voltage of CPUs. During low-intensity computational tasks like word processing, a core that nominally runs at 2.4 GHz rarely requires all 2.4 billion potential cycles per second. When fewer cycles are needed, the frequency can be scaled back, sometimes significantly (as low as 28% of maximum). This is very prevalent in the market - PPM is present on nearly every commodity processor shipped today (with the exception of some “low-power” processor SKUs), and Windows ships with PPM enabled by default in Vista, Windows 7, and Server 2008 / R2.

 

In environments where CPU frequency is dynamically changing (reminder: this is more likely than not), be very careful when interpreting the CPU utilization counter reported by Performance Monitor or any other current Windows monitoring tool. Utilization values are calculated based on the instantaneous (or possibly mean) operating frequency, not the maximum rated frequency.

 

Example: In a situation where your CPU is lightly utilized, Windows might reduce the operating frequency down to 50% or 28% of its maximum. When CPU utilization is calculated, Windows is using this reference point as the “maximum” utilization. If a CPU nominally rated at 2.0 GHz is running at 500 MHz, and all 500 million cycles available are used, the CPU utilization would be shown as 100%. Extending the example, a CPU that is 50% utilized at 28% of its maximum frequency is using approximately 14% of the maximum possible cycles during the time interval measured, but CPU utilization would appear in the performance counter as 50%, not 14%.

 

You can see instantaneous frequencies of CPUs in the “Performance Monitor” tool. See the “Processor Performance” object and select the “% of Maximum Frequency” counter.
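
To put those two numbers together yourself, here is a rough PowerShell sketch that multiplies reported utilization by the frequency scaling factor from a single instantaneous sample. The counter paths are the ones on my Windows Server 2008 R2 test machine; depending on the OS version the frequency counter may appear under the Processor Performance or Processor Information object, so adjust the paths to whatever your system exposes.

# Approximate "% of maximum possible cycles" = reported utilization x (% of maximum frequency)
$samples = Get-Counter '\Processor Information(_Total)\% Processor Time',
                       '\Processor Information(_Total)\% of Maximum Frequency'
$util = ($samples.CounterSamples | Where-Object { $_.Path -like '*% processor time' }).CookedValue
$freq = ($samples.CounterSamples | Where-Object { $_.Path -like '*% of maximum frequency' }).CookedValue
"{0:N1}% utilization at {1:N0}% of rated frequency is roughly {2:N1}% of the maximum possible cycles" -f $util, $freq, ($util * $freq / 100)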

 

[Side note related to Perfmon and power management: the “Processor Frequency” and “% of Maximum Frequency” counters are instantaneous samples, not averaged samples. Over a sample interval of one second, the frequency can change dozens of times. But the only frequency you’ll see is the instantaneous sample taken each second. Again, ETW or other more granular measurement tools should be used to obtain statistically better data for calculating utilization.]

 

Key takeaway #5: 2 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

 

Key takeaway #6: 4 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

 

6.       Special Perfmon counters should be used to obtain CPU utilization in virtualized environments

Virtualization introduces more complexity, because allocation of work to cores is done by the hypervisor rather than the guest OS. If you want to view CPU utilization information via Performance Monitor, specific hypervisor-aware performance counters should be used. In the root partition of a Windows Server running Hyper-V, the “Hypervisor Root Virtual Processor % Total Runtime” counter can be used to track CPU utilization for the Virtual Processors to which a VM is assigned. For deeper analysis of Hyper-V Performance Counters and Processor Utilization in virtualized scenarios, see blog posts here and here.
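
As a concrete starting point, the sketch below (run in the root partition) lists per-virtual-processor run time for the guests; the counter-set name is what appears on my Hyper-V hosts, so verify the exact path on your build.

# Per-VM virtual processor utilization, busiest first.
Get-Counter '\Hyper-V Hypervisor Virtual Processor(*)\% Total Run Time' |
    Select-Object -ExpandProperty CounterSamples |
    Sort-Object CookedValue -Descending |
    Format-Table InstanceName, CookedValue -AutoSize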

 

Key takeaway #7:  In a virtualized environment, unique Perfmon counters exposed by the hypervisor to the root partition should be used to get accurate CPU utilization information.

 

 

7.       “% Processor Time” Perfmon counter data may not be statistically significant for short test runs

For someone performing performance testing and analysis, the ability to log CPU utilization data over time is critical. A data collector set can be configured via logman.exe to log the “% Processor Time” counter in the “Processor Information” object for this purpose. Unfortunately, counters logged in this fashion have a relatively coarse granularity in terms of time intervals; the minimum interval is one second.  Relatively long sample sizes need to be taken to ensure statistical significance in the utilization data. If you need higher precision, then out-of-band Windows tools like XPerf in the Windows Performance Toolkit can measure and track CPU utilization with a finer time granularity using the Event Tracing for Windows (ETW) infrastructure.
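
For reference, a collector like that can be created and controlled with a couple of logman commands along these lines (a sketch only; the collector name and output path are arbitrary):

# Create a 1-second-interval collector for % Processor Time, run the test, then stop it.
logman create counter CpuUtil -c "\Processor Information(_Total)\% Processor Time" -si 00:00:01 -o "C:\PerfLogs\CpuUtil"
logman start CpuUtil
# ... run the test scenario ...
logman stop CpuUtil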

 

Key takeaway #8: Perfmon is a good starting point for measuring utilization but it has several limitations that can make it less than optimal. Consider alternatives like XPerf in the Windows Performance Toolkit.

 

 

 

Best Practices for Performance Testing and Analysis Involving CPU Utilization

If you want to minimize the chances that hardware and OS features or measurement tools skew your utilization measurements, consider the following few steps:

1.       If you’re beginning to hunt down a performance problem or are attempting to optimize code for performance, start with the simplest configuration possible and add complexity back into the mix later.  

a.       Use the “High Performance” power policy in Windows or disable power management features in the BIOS to avoid processor frequency changes that can interfere with performance analysis.

b.      Turn off SMT, overclocking, and other processor technologies that can affect the interpretation of CPU utilization metrics.  

c.       Affinitize application threads to a core. This will enhance repeatability and reduce run-to-run variations. Affinitization masks can be specified programmatically, from the command line, or using the GUI in Task Manager (see the command-line sketch after this list).

d.      Do NOT continue to test or run in production using this configuration indefinitely. You should strive to test in the out-of-box or planned production configuration, with all appropriate performance and power features enabled, whenever possible.

 

Key Takeaway #9:  When analyzing performance issues or features, start with as simple a system configuration as possible, but be sure to analyze the typical customer configuration at some point as well.

 

2.       Understand the system topology and where your application is running on the system in terms of cores, packages, and nodes when your application is not explicitly affinitized. Performance issues can suddenly appear in complex hardware topologies; ETW and XPerf in the Windows Performance Toolkit can help you to monitor this information.

a.       Rebooting will generally change where unaffinitized work is allocated to CPUs on a machine. This can make topology-related performance issues reproduce intermittently, increasing the difficulty of root-causing and debugging problems. Reboot and rerun tests several times, or explicitly affinitize to specific cores and nodes, to help flush out any issues related to system topology. This does not mean that the final implementation is required to use thread affinity, or that affinity should be used to work around potential issues; it just improves repeatability and clarity when testing and debugging.

3.       Use the right performance sampling tools for the job. If your sample sets will cover a long period of time, Perfmon counters may be acceptable. ETW generally samples system state more frequently and is correspondingly more precise than Perfmon, making it effective even with shorter duration samples. Of course, there is a tradeoff - depending on the number of ETW “hooks” enabled, you may end up gathering significantly more data and your trace files may be large.
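
For step 1c above, here are two quick ways to pin work to a single CPU from the command line; the process name is only a stand-in for whatever you are testing:

# Launch a process pinned to CPU 0 (start's affinity argument is a hex mask; 1 = CPU 0).
cmd /c start /affinity 1 notepad.exe

# Or pin an already-running process from PowerShell (ProcessorAffinity is a bitmask).
$p = Get-Process -Name notepad | Select-Object -First 1
$p.ProcessorAffinity = 0x1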

 

 

 

Finally, keep in mind that these problems are not isolated to the Windows operating system family. The increase in processor features and complexity over the past decade has made performance analysis, testing, and optimization a challenge on all platforms, regardless of OS or processor manufacturer.

And if you are comparing CPU utilization between two different test runs or systems, use the guidance in this post to double-check that the comparison makes sense. Making valid comparisons means you’ll spend more of your time chasing valid performance issues.

 

 

Summary of Key Takeaways

Key takeaway #1: Processor of type A @ 100% utilization IS NOT EQUAL TO Processor of type B @ 100% utilization

Key takeaway #2: 2 HW threads on the same package @ 100% utilization IS NOT EQUAL TO 2 HW threads on different packages @ 100% utilization (for better or worse)

Key takeaway #3: 2 HW threads on the same logical core @ 100% utilization IS NOT EQUAL TO 2 HW threads on different logical cores @ 100% utilization (for better or worse)

Key takeaway #4: 2 HW threads on the same NUMA node @ 100% utilization IS NOT EQUAL TO 2 HW threads on different NUMA nodes @ 100% utilization (for better or worse)

Key takeaway #5: 2 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

Key takeaway #6: 4 HW threads @ 100% utilization and 50% of rated frequency IS NOT EQUAL TO 2 HW threads @ 100% utilization and 100% of rated frequency

Key takeaway #7: In a virtualized environment, unique Perfmon counters exposed by the hypervisor to the root partition should be used to get accurate CPU utilization information.

Key takeaway #8: Perfmon is a good starting point for measuring utilization but it has several limitations that can make it less than optimal. Consider alternatives like XPerf in the Windows Performance Toolkit.

Key Takeaway #9: When analyzing performance issues or features, start with as simple a system configuration as possible, but be sure to analyze the typical customer configuration at some point as well.

 

Feel free to reply with questions or additional (or alternative) perspectives, and good luck!

 

Matthew Robben

Program Manager

Windows Server Performance Team

 

 

Performance Tuning Guidelines for Windows Server 2008 R2 Released
14 July 09 06:34 PM | winsrvperf | 2 Comments   

With Windows Server 2008 R2 almost at RTM, a new tuning guide has been released. It can be found here.

And don't forget the Windows Server 2008 Tuning Guide is still available and has recently been refreshed for SP2.

Windows Server Performance Team

Configuring Windows Server 2008 Power Parameters for Increased Power Efficiency
04 December 08 11:10 PM | winsrvperf | 6 Comments   

Matthew Robben here; I’m a Program Manager on the Windows Server Performance team, and my primary responsibility is Windows Server power management. Server power efficiency is a topic of considerable importance – in today’s difficult economy, IT organizations need to contain and reduce costs. Yet the cost of energy to power and cool a 1U server is now more than the amortized cost of the server (over 3 years). 

 

Energy efficient hardware and software reduces operational costs and directly impacts an organization’s bottom line. We’re in the midst of developing Windows Server 2008 R2, and one of our goals for the product is to build a server operating system that is more power efficient than all of our previous releases. Furthermore, to help IT administrators better understand server power management and optimize their current Windows Server 2008 installations, we’re releasing a comprehensive white paper called “Power In, Dollars Out: Reducing the Flows in the Data Center” today. The white paper gives detailed explanations of many factors affecting server power efficiency, and contains a list of best practices for optimization.

 

One of the stated best practices is to properly configure Windows Server 2008’s power management features. According to the Green Grid, just turning on processor power management (PPM) features in the operating system can reduce power consumption by 20%. In Windows Server, this can be done simply by choosing the Balanced or Power Saver power policies (found in the Power Options applet in the Control Panel). Of course, PPM is a complicated technology, with many more toggles than a simple on/off switch. We’ve done quite a bit of work on the Windows Server PPM algorithms and parameters during R2 development. One of the results of this work was the development of a set of parameters that can boost power efficiency by up to 10% on standard benchmark workloads.

 

Good news - you don’t need to wait until R2 to deploy these new parameters on your servers. This blog post will describe PPM technology, explain the parameters involved, and show benchmark test results for the parameter changes on a commodity server. It will also give you a handy command-line walkthrough of the powercfg.exe commands necessary to implement these changes in your environment.

 

First, some context. Power management requires cooperation from the hardware and the operating system to work efficiently. For example, hardware might support low power states, but the operating system schedules computational work and is in the best position to decide when low power states can be leveraged. The Advanced Configuration and Power Interface (ACPI) defines an interface between the operating system and server hardware to be used for power management purposes.

ACPI Processor Performance States

The processor has traditionally consumed the most power in a server, which makes it a great candidate for power-efficiency optimizations. To add detail and flexibility for processor power management, ACPI defines a few sets of states for processors. Performance states, or P-states, are one such state that can be leveraged to increase power efficiency.

P-States

Processors can transition between multiple performance states, or P-states. P-states define incremental levels of processor performance, from P0 (most performant) to Pn (least performant). The ACPI specification does not specify a maximum number of P-states, so Pn is used to refer to the highest numbered, lowest performant P-state that a processor supports.

Each successively higher numbered P-state consumes less power than the previous P‑state. Processors can dynamically switch between these states during operation to provide only as much computational capacity as is necessary, which saves power during periods of low usage.

Figure 1 below shows a hypothetical set of six P-states that would be available to a processor. Note that the maximum P-state (P0) has the highest frequency, while successively higher numbered P-states reduce in frequency. In this case, the minimum P-state is P5, so the terms Pn and P5 would be interchangeable.


Figure 1. Illustration of P-state number and corresponding frequency

Tuning P-State Parameters for Increased Power Efficiency

Windows Server contains a number of configurable P-state parameters. These can be used to finely tune the power/performance balance of Windows Server PPM. The defaults for these parameters are tuned to deliver excellent power efficiency for most systems and workloads out of the box. However, these are “safe” defaults. They balance performance and power efficiency. Default settings are shown in Table 1. Note that “P-state increase” in this context refers to a transition to a lower numbered, more performant P-state, whereas “P-state decrease” refers to a transition to a higher numbered, less performant P-state. Looking back to Figure 1, an increase would mean moving upward in the chart while a decrease would mean moving downward.

Table 1. Default P-State Parameter Settings in Windows Server 2008

Time Check (default: 100 ms)
The time interval at which the operating system considers a change of the current P-state.

Increase Time (default: 100 ms)
The minimum time period that must expire before considering a P-state increase.

Decrease Time (default: 300 ms)
The minimum time period that must expire before considering a P-state decrease.

Increase Percentage (default: 30%)
The utilization percentage1 that the CPU must exceed to increase P-state.

Decrease Percentage (default: 50%)
The utilization percentage that the CPU must be below to decrease P-state.

Domain Accounting Policy (default: 0 (On))
Determines how the kernel power manager accumulates idle time. Settings:
0 (On): idle time is accumulated only when all processors in an idle state domain2 are idle.
1 (Off): idle time is accumulated and P-states are calculated for each processor without regard to any other processor in the domain.

Increase Policy (default: IDEAL (0)) and Decrease Policy (default: SINGLE (1))
Determine how P-state transition decisions are made. Settings:
IDEAL (0): calculates the target P-state based only on processor utilization and then finds a nearby available P-state on the system.
SINGLE (1): calculates an ideal P-state but only increases or decreases by one P-state per time check interval.
ROCKET (2): transitions to the highest P-state available on increase or the lowest P-state available on decrease.

1 The utilization percentage referenced here is not the same as the CPU usage counter in the Task Manager tool. Without going into more details, this setting is best optimized through empirical experimentation.

2 A “state domain” is a dependency between different processor cores or packages on a server. Often, processor designs require that if one core is at a particular performance or idle state, the other cores or packages in the domain must also be at the same state. The hardware notifies the operating system of this dependency by establishing a domain through the ACPI interface.

During Windows Server 2008 R2 development, our team determined a set of parameters that can boost energy efficiency with a very minor performance cost. Notice in Table 1 that the decrease time default is larger than the increase time default. This setting favors P-state increases over decreases. The default increase and decrease percentage settings of 30 and 50 percent, the default domain accounting policy, and the increase and decrease policy defaults favor P‑state increases as well.

 

To tune the machine for more aggressive power savings, we suggest reducing the decrease time to 100 ms to match the increase time, changing the increase and decrease policies to favor P-state decrease, and switching the domain accounting policy to 0 (off). We left the increase and decrease percentages as their defaults to ensure that the system PPM parameters were not completely biased toward power savings and to reduce negative performance consequences. Table 2 summarizes these changes.

Important:  Modifying any of these parameters changes the behavior of performance state handling from the out-of-box experience. Before you deploy to production servers, validate the effects of any changes in a test environment.

Table 2. Default and New PPM Parameter Values

Setting                     Default value    "Aggressive" value
Time Check                  100 ms           100 ms
Increase Time               100 ms           100 ms
Decrease Time               300 ms           100 ms
Increase Percentage         30%              30%
Decrease Percentage         50%              50%
Domain Accounting Policy    0 (On)           1 (Off)
Increase Policy             0 (Ideal)        1 (Single)
Decrease Policy             1 (Single)       0 (Ideal)

 

These parameters can only be set using the powercfg.exe command-line tool, which is installed by default to the Windows\System32 folder on Windows Server 2008. The commands to change the P-state settings by using powercfg.exe are given at the end of this post.

Energy Efficiency Analysis of P-State Settings

To test the efficiency of these new power settings (henceforth called “Aggressive” settings), we performed a set of benchmark runs on a four-socket quad-core server. Table 3 gives the system configuration.

Table 3. Four-Socket Quad-Core Server Configuration

System configuration

Processors:        4 x quad-core 2.9 GHz
Memory:            32 x 4-GB DDR2 667-MHz DIMMs
Disk:              4 x 72-GB, 15,000-RPM SCSI
Network adapter:   2 x 1-Gbps

 

We ran the SPECPower benchmark with both the default settings and the Aggressive power-saving settings. Figure 2 and Figure 3 show the power usage and power efficiency across different workload levels. The Aggressive settings deliver significantly better power efficiency than the defaults at a majority of the load levels. The maximum power saving is achieved at the 60-percent workload level on this configuration, with approximately a 10-percent improvement in power efficiency compared to the default settings. There is a negligible reduction in overall throughput at utilization levels above 97%.


Figure 2.  System power across varying SPECPower load levels


Figure 3.  System power efficiency across varying SPECPower load levels

These settings were tested on commodity servers with the SPECPower workload. Your particular hardware and workload might deliver different results. Please test any parameter changes before deploying in your production environment.

 

Changing P-State Parameters with Powercfg.exe

If you decide you want to deploy the new P-state parameter settings in your environment, you’ll first need to verify that your Windows Server 2008 installation is configured to use the Balanced power policy. Verify this by going to Power Options in the Control Panel.
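
You can also check which scheme is active from an elevated command prompt:

>powercfg /getactivescheme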

 

Done? Next, you need to start a command prompt with administrator privileges. Get the binary dataset that represents the current Balanced AC power settings for P‑states with the following command line (corrected from earlier versions of this post, thanks to Asmus for the heads up!):

>powercfg /getpossiblevalue sub_processor procperf 2

You should see the following:

Type: BINARY

Value: 640864000000A0860100E09304001E00000032000000

 

This value represents an encoded dataset of power policy parameters. The parameter values for this dataset can be shown with the decode command:

>powercfg /ppmperf /decode 640864000000A0860100E09304001E00000032000000

Verify that your power parameter values match the defaults shown below and in Table 1. If your parameter settings do not match these values, your Windows Server parameters may have already been reconfigured for optimal power efficiency in your environment.

Busy Adjust Threshold: 100

Time Check: 100

Increase Time: 100000

Decrease Time: 300000

Increase Percent: 30

Decrease Percent: 50

Domain Accounting Policy: 0

Increase Policy: 0

Decrease Policy: 1

Next, you need to change the parameter values to match the “Aggressive” settings described in this post. To do so, use the following command:

>powercfg /ppmperf /encode base 640864000000A0860100E09304001E00000032000000 /decreasetime 100000 /domainaccountingpolicy 1 /increasepolicy 1 /decreasepolicy 0

After executing this command, powercfg will print out a binary dataset representing the new values, like the one shown below.

640364000000A0860100A08601001E00000032000000

 

You need to apply the new dataset by using the setpossiblevalue command:

>powercfg /setpossiblevalue /sub_processor /procperf 2 binary 640364000000A0860100A08601001E00000032000000

 

Finally, use the setactive command to enable the new parameter set. No reboot is necessary for these parameters to take effect.

>powercfg /setactive scheme_balanced

 

If you want to restore the default settings, use the setpossiblevalue command with the default dataset value (shown below), and follow it with a setactive command:

>powercfg /setpossiblevalue /sub_processor /procperf 2 binary 640864000000A0860100E09304001E00000032000000

>powercfg /setactive scheme_balanced 

 

That’s it! You’ve taken your first step to increasing energy efficiency in your datacenter. As our white paper explains, there’s even more you can do. It’s a highly recommended read for cost-sensitive administrators.

 

Thanks for reading!

 

Matthew Robben

Program Manager

Windows Server Performance Team

 

Greater than 64 Logical Processor support on Windows Server 2008 R2
22 November 08 01:19 AM | winsrvperf | 1 Comments   

In the past few weeks, there have been a number of new feature announcements around Windows 7 and Windows Server 2008 R2 at the PDC and WinHEC conferences.  From a Server perspective, power management, virtualization, and greater-than-64-processor support are considered the top three features for Windows Server 2008 R2.  I will focus on the greater-than-64-processor support, given that it is a new milestone for Windows and sets the stage for competing on much larger, higher-end servers.  This support enables large-scale database customers to deploy their solutions on Windows and expect good scalability numbers.  Of course, the scalability realized is highly dependent on the applications and drivers being able to scale well beyond 64 processors.  To do so, application and driver developers are strongly encouraged to read up on the greater-than-64-processor work here to see what has changed and what types of code modifications are necessary to take full advantage of this new capability.  The document describes the architecture and terminology, and goes into more detail about the new APIs.

 

Ahmed Talat

Performance Manager

Windows Server Performance Team

Hyper-V and VHD Performance - Dynamic vs. Fixed
19 September 08 07:08 PM | winsrvperf | 10 Comments   

My name is Tim Litton, I work as a Program Manager within the Microsoft Windows Server team, and my particular area of focus is performance optimization for Hyper-V.

 

With the recent release of Hyper-V, customers are starting to ask us how to configure Hyper-V to get the best performance.  It’s generally recognized that there is overhead in running a virtualized environment, but the question that really needs to be answered is: how much?

 

With this in mind, I thought I’d share some of our recent testing of Hyper-V and how disk workloads perform when using Fixed or Dynamic VHDs.  The goal here is to provide some data that backs up the tuning guidance that can be found here: http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx.

 

The following graph shows the relative performances for a number of different scenarios (with Dynamic VHD being the baseline).

 

[Figure: Relative performance of Fixed vs. Dynamic VHDs across scenarios (Dynamic VHD = baseline)]

 

Fixed VHDs outperform Dynamic VHDs in most scenarios by roughly 10% to 15%, with the exception of 4K writes, where Fixed VHDs perform significantly better.

 

We ran 16 virtual machines when performing these tests (see “How We Tested” below) with the goal of evaluating how well Hyper-V performed in the server consolidation scenario.  Being able to consolidate a number of physical machines onto a single machine and have the virtual machines handle the load is a very important design goal for Hyper-V.

 

The exact results a customer will see depend on quite a few variables (e.g. how large and how frequent the reads and writes are, and how many outstanding I/Os there can be at one time), so performing real-world testing is the best way to assess what impact virtualization will have.

 

Recently, QLogic published a benchmark of I/O throughput for storage devices going through Windows Server 2008 Hyper-V (http://www.qlogic.com/promos/products/hyper-v.aspx) that closely matches native performance, demonstrating Hyper-V’s ability to bring the advantages of virtualization to large-scale datacenters.

 

How We Tested

Hardware: HP DL580 G5, 16 x 2.4 GHz (Intel E7340), 16 GB RAM

 

Disk: HP P800, 25 spindles

 

Virtual Machine Setup: 16 Virtual machines, each running Windows Server 2008 Enterprise Edition (64 bit), 1 CPU, 796 MB RAM

 

Testing Software: We used IO Meter (http://www.iometer.org) to generate the workload for the I/O system, with a maximum number of 8 outstanding I/Os per virtual machine to a 100MB file.

Getting system topology information on Windows
13 September 08 12:16 AM | winsrvperf | 2 Comments   

On Windows Server 2008 and later, applications can programmatically get information about how the underlying hardware components relate to one another.  Examples include spatial locality and memory latency.  This article describes how developers can get the system topology information and use it to build scalable solutions on multi-processor and NUMA (Non-Uniform Memory Access) systems.

To start things off, the following is a refresher of some definitions that will be used throughout the article:

Affinity Mask: A bitmask specifying a set of up to 64 processors.

Core: A single physical processing unit.  It can be represented to software as one or more logical processors if simultaneous hardware multithreading (SMT) is enabled.

Hyper-threading: Hyper-Threading Technology (HT) is Intel's trademark for their implementation of simultaneous multithreading technology.

Logical Processor (LP): A single processing unit as it is represented to the operating system.  This may represent a single physical processor or a portion of one.

Node: A set of up to 64 processors that are likely in close proximity and share some common resources (e.g., memory). It is usually faster to access resources from a processor within the node than from a processor outside of the node.

Proximity Domain: A representation of hardware groupings, each with a membership ID.  Associations are made only to processors, I/O devices, and memory.  The HAL provides information about processor and memory proximity domains through its interface.

 

From an application’s perspective, a physical system is composed of three components: processors, memory, and I/O devices.  These components are arranged into one or more nodes interconnected by some mechanism unknown to the application.  The following figure shows one type of configuration:

[Figure: Example system topology – nodes of four processors each, with memory and I/O devices attached to each node]

In this example, each node contains four processors shown by squares:   Node X contains processors X0, X1, X2, and X3.  Other systems may have a different number of processors per node, which may be multiple packages or multiple cores on the same physical package.  Attached to each node is a certain amount of memory and I/O devices.  An application can programmatically identify where the different pieces of hardware are, how they relate to one another, and then partition its I/O processing and storage  to achieve optimum scalability and performance.

Application Knowledge of Processors

Applications can determine the physical relations of processors in a system by calling GetLogicalProcessorInformation with a sufficiently large buffer that will return the requested information in an array of SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures.  Each entry in the array describes a collection of processors denoted by the affinity mask and the type of relation this collection holds to each other.  The following table outlines the type of possible relations:

RelationProcessorCore (0): The specified logical processors share a single processor core, for example via Intel’s Hyper-Threading™ technology.

RelationNumaNode (1): The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask.)

RelationCache (2): The specified logical processors share a cache.

RelationProcessorPackage (3): The specified logical processors share a physical package; for example, multi-core processors share the same package.

 

An application may only be interested in which processors belong to a specific NUMA node so it can schedule work on those processors and improve performance by keeping state in that NUMA node’s memory.  The application may also want to avoid scheduling work on logical processors that belong to the same core (e.g. Hyper-Threading) to avoid resource contention.
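
This isn’t a replacement for calling GetLogicalProcessorInformation, but as a quick script-level sanity check of what the API should report, WMI exposes per-package core and logical-processor counts:

# Cores vs. logical processors per physical package (available on Windows Vista/Server 2008 and later).
Get-WmiObject Win32_Processor |
    Select-Object DeviceID, NumberOfCores, NumberOfLogicalProcessors |
    Format-Table -AutoSize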

Application Knowledge of Memory

An application may also be interested in knowing how much memory is available on a specific NUMA node before deciding to make allocations.  Making a call to GetNumaAvailableMemoryNode will provide this kind of information.  An example might be an application interested in keeping data its threads are working on (e.g. some sort of a software cache) in memory that belongs to the same node hosting the processors to which the threads have their affinities set.  This way, when the data is not resident in the processor’s cache, the cost of reading and writing to the data in local memory is less expensive than accessing remote memory from another node.  When it is time to make the allocation, Windows provides the VirtualAllocExNuma API that takes in a preferred node number as parameter for determining in which node the application would like the memory to reside.  This is an example of an application choosing to allocate memory from a specific node.

Application Knowledge of Devices

Every driver loaded on the system has an associated interface that it supports and registers when the driver starts.  Storage and networking drivers are common examples.

By calling SetupDiGetClassDevs, an application sets up a list of devices supporting a particular interface and calls SetupDiEnumDeviceInfo to enumerate through the list and get a specific device entry.  Once an application knows which devices it is interested in, it can then use the device properties to identify how a particular device relates to other components in the system like processors and memory. 

1)      Call SetupDiGetDeviceRegistryProperty requesting DEVPKEY_Numa_Proximity_Domain.

2)      If a device does not have this property, then move to that device’s parent.  Repeat until either a proximity domain is found or the root device has been reached.

3)      If no proximity domain information is found, then it is possible that the device’s locality information is not exposed.

4)      Given a proximity domain, applications can determine which NUMA node a device belongs to by calling GetNumaProximityNode.
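
A rough sketch of steps 1 through 4 is shown below.  It uses SetupDiGetDevicePropertyW (the Vista-era property API) rather than SetupDiGetDeviceRegistryProperty, assumes the proximity domain is reported as a DWORD, and leaves the parent walk in step 2 as a comment; treat the details as illustrative rather than definitive:

#include <windows.h>
#include <initguid.h>
#include <setupapi.h>
#include <devpkey.h>
#include <stdio.h>
#pragma comment(lib, "setupapi.lib")

// Enumerate present devices exposing 'interfaceGuid' and report their NUMA node.
void ReportDeviceNumaNodes(const GUID* interfaceGuid)
{
    HDEVINFO devs = SetupDiGetClassDevs(interfaceGuid, NULL, NULL,
                                        DIGCF_PRESENT | DIGCF_DEVICEINTERFACE);
    if (devs == INVALID_HANDLE_VALUE)
        return;

    SP_DEVINFO_DATA devInfo = { sizeof(SP_DEVINFO_DATA) };
    for (DWORD i = 0; SetupDiEnumDeviceInfo(devs, i, &devInfo); i++) {
        DEVPROPTYPE type;
        DWORD proximity = 0, cb = 0;
        if (SetupDiGetDevicePropertyW(devs, &devInfo, &DEVPKEY_Numa_Proximity_Domain,
                                      &type, (PBYTE)&proximity, sizeof(proximity), &cb, 0)) {
            UCHAR node;
            if (GetNumaProximityNode(proximity, &node))
                printf("Device %lu is closest to NUMA node %u\n", i, node);
        }
        // If the property is missing, a real implementation would walk up to the
        // device's parent (step 2 above) before concluding locality is unavailable.
    }
    SetupDiDestroyDeviceInfoList(devs);
}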

Brian Railing

Windows Server Performance Team

Tuning Windows Server 2008 for PHP
25 July 08 07:23 AM | winsrvperf | 3 Comments   

Tom Hawthorn, Karthik Mahesh - Windows Server Performance Team

A significant percentage of web sites utilize PHP as a platform for dynamic content.  During the development of Windows 2008, Microsoft included improvements that enable PHP to run more efficiently than previous Windows releases.  This article describes how to tune Windows 2008, IIS 7.0 and PHP for environments with a single site and high concurrent user traffic.

Introduction

Windows 2008 and IIS 7.0 have new features and optimizations that allow PHP to run more efficiently and robustly.   The most significant improvement Microsoft made was to update the CGI support in IIS 7.0 to conform to the Fast CGI standard.  Fast CGI saves a process create/destroy operation per request by pooling and reusing multiple worker processes.  This translates into significant performance improvements for PHP applications on Windows.

By default, the Fast CGI support in IIS 7.0 is configured to be conservative when allocating resources in order to best support the scenario where hundreds of web sites are running on a single physical server.  This was thought to be the most important deployment environment for PHP on Windows.  This article describes the differences between tuning a Windows server for a multi-site scenario versus a single site scenario assuming high overall traffic.

Multi-site versus Single-site Web Servers

Let’s talk about two typical environments for web servers where resource management and performance become top concerns: the “web hosting” environment and the “enterprise” environment.

In web hosting environments servers host hundreds of sites on a single physical machine.  There are dozens of companies that sell pre-packaged web sites for under $20 per month.  They achieve cost-efficiency by deploying hundreds of sites per single server machine and they place limits on the amount of traffic that each site can service.  Perhaps it is not much of a surprise that a huge percentage of web sites on the internet run on shared hardware.  Individually, hosted sites are low traffic but the aggregate traffic adds up to some serious load.  Administrators must take care to isolate the sites from each other for security reasons and must ensure that no single site can consume all the resources on the machine.  The default configuration values in Windows 2008/IIS 7.0 are optimized for the web hosting environment.

The enterprise environment is virtually the polar opposite from the hosted environment from the perspective of how software should manage resources.  Instead of strictly limiting and aggressively reclaiming the resources per-site, an administrator wants to give all of the machine’s resources to a single site.  Administrators achieve scale and robustness by load balancing the internet traffic across multiple machines serving the same web site content.  This article describes how to tune Windows 2008 and IIS 7.0 for the enterprise environment.

Request Concurrency

Web requests are made from the user’s web browser to a web server.  The server receives the request, processes it and sends back some data.  An HTTP request is usually small, maybe a few hundred bytes.  However, the response may be large, and the ephemeral memory required to generate the response can be even larger.  During the period in which the web page is executing code or waiting for a response from somewhere else (i.e. a database, disk, another web site, etc.), the memory associated with the web request cannot be released.  Once the request is completed, the memory can usually be released, with the exception of cached items.

I use the term “request concurrency” to describe the number of requests being processed on a web server at any given moment in time.  As average request concurrency grows on a web server, so does the average memory utilization.  In order to conserve memory, a web server can limit request concurrency by queuing new incoming requests rather than servicing them immediately if the number of in-flight requests exceeds some limit.  This approach has the side effect of increasing latency because user requests may need to wait for in-flight requests to complete before they are handled.

Increasing the Default Concurrency Limits

On Windows 2008, an HTTP request will be handled by multiple software component layers beginning with the network stack and travelling up into IIS and then sometimes into third party technologies such as PHP.  Each layer will perform some work on the HTTP request before handing the request on to the next layer.  Each time a layer receives a new HTTP request, it has the option of queuing it or processing it.  Therefore, increasing concurrency limits involves modifying configuration associated with multiple layers.  This section describes configuration parameters in http.sys, IIS 7.0, FastCGI and PHP.

 

IIS 7/queueLength

Description:

This parameter controls the maximum number of requests that IIS 7.0 will allow to be queued simultaneously.  It allows the system to be more robust in handling spikes in request concurrency beyond the configured limits.

 

Normally, if a web request is received by a web server and its queue is full the web server will return an HTTP error 503 (service unavailable).  Increasing the queue limit value has no impact on a web server that does not exceed its queue limits under normal conditions.  On web servers that experience occasional bursts of requests that would exceed the default queue limits, increasing the limit may allow the server to satisfy all requests without error but with a higher latency.  On web servers that are overloaded during steady-state operation increasing this value may have a detrimental effect.

 

Default Value:

1000

 

Tuned Value:

65535

 

Command Line:

appcmd.exe set apppool "DefaultAppPool" /queueLength:65535

 

 

 

IIS 7/appConcurrentRequestLimit

Description:

This parameter controls the maximum number of in-flight requests in the IIS 7.0 layer.  This includes requests that are being processed or are queued by the CGI layer.

 

Increasing this value on a web server that never experiences more than 5000 concurrent requests should have no impact.  On web servers that receive very large numbers of concurrent requests and that have available resources during steady state load, increasing this setting will allow the server to fully utilize its memory and CPU.  Servers that are already 100% utilized may be negatively impacted by increasing concurrent request limits.

 

Default Value:

5000

 

Tuned Value:

100000

 

Command Line:

appcmd.exe set config /section:serverRuntime /appConcurrentRequestLimit:100000

 

 

 

Http/MaxConnections

Description:

This parameter controls the maximum number of concurrent TCPIP connections that HTTP will allow.

 

By default, only 5000 concurrent TCP/IP connections are allowed by the HTTP driver in Windows.  There is typically only one outstanding HTTP request per connection, therefore increasing any other concurrency limit is pointless unless the maximum number of concurrent connections is also increased.  Each connection maintained by Windows will use some kernel memory and requires some CPU to maintain state.  I don’t recommend increasing this limit on 32-bit machines because of the limited kernel address space.

 

Default Value:

5000

 

Tuned Value:

100000

 

Command Line:

reg add HKLM\System\CurrentControlSet\Services\HTTP\Parameters /v MaxConnections /t REG_DWORD /d 100000

 

 

 

FastCGI/Php Concurrency

Description:

This is actually two parameters, the first is the maximum concurrent requests and the second is the number of requests that can be executed by a fast CGI process before the process is recycled. 

 

The CGI model requires only a single concurrent request per pooled process.  So the max instances parameter tells IIS how many processes to start up.  Each process will consume significant resources on the server so the initial recommendation of 32 is somewhat conservative.  Increasing the number of requests that each process can handle before being recycled merely decreases the rate of process creation/destruction and reduces the average CPU required to process each request.

 

Default Value:

4/200

 

Tuned Value:

32/10000

 

Instructions:

1.       notepad %windir%\system32\inetsrv\config\applicationhost.config

2.       find the "fastCGI" element, change it to the following (assuming php-cgi.exe is in c:\php)

 

<fastCgi>
  <application fullPath="C:\PHP\php-cgi.exe" instanceMaxRequests="10000" maxInstances="32">
    <environmentVariables>
      <environmentVariable name="PHP_FCGI_MAX_REQUESTS" value="10000"/>
    </environmentVariables>
  </application>
</fastCgi>

 

Conclusion

Tuning a Windows 2008 machine for PHP performance in enterprise environments is all about increasing the default concurrency limits.  Remember, if you try out some of the tunings in this article make sure to test the effects of the changes in a controlled environment before deploying them to your front line servers.   Increasing the concurrency limits will generally have the effect of increasing the steady state memory utilization and CPU if concurrency is a bottleneck on your system.  If you don’t have enough memory or your CPU is already fully utilized, don’t increase the concurrency limits!  Finally, the tuned values in this article are values that I found empirically in my own test environment.  They may or may not be the right values for your environment so play around with them to find out what works for you.  Happy tuning!

Designing Applications for High Performance - Part III
26 June 08 12:44 AM | winsrvperf | 14 Comments   
 
Rick Vicik - Architect, Windows Server Performance Team

 

The third, and final, part of this series covers I/O Completions and Memory Management techniques.  I will go through the different ways to handle I/O completions with some recommendations and optimizations introduced in Vista and later releases of Windows.  I will also cover tradeoffs associated with designing single and multiple process applications.  Finally, I will go through memory fragmentation, heap management, and provide a list of the new and expanded NUMA topology APIs.   

 

Some common I/O issues

It is recommended to use asynchronous I/O to avoid switching threads and to maximize performance.  Asynchronous I/O is more complex because it needs to determine which of the many outstanding I/Os completed.  Those I/Os may be to different handles, mixed file and network I/O, etc.  There are many different ways to handle I/O completion, not all of which are suitable for high performance (e.g. WaitForMultipleObjects, I/O Completion Routines).  For highest performance, use I/O Completion Ports.  Prior to Vista, scanning the pending overlapped structures was necessary to achieve the highest performance, but the improvements in Vista have made that technique obsolete.  However, it should be noted that an asynchronous write can block when extending a file and there is no asynchronous OpenFile. 

 

The old DOS SetFilePointer API is an anachronism.  One should specify the file offset in the overlapped structure even for synchronous I/O.  It should never be necessary to resort to the hack of having private file handles for each thread.

 

Overview of I/O Completion Processing

The processor that receives the hardware interrupt runs the interrupt service routine (ISR).  Interrupts are either distributed across all processors or the interrupts from a specific device can be bound to a particular processor or set of processors.  The ISR queues a DPC (usually to the same processor, otherwise an IPI is required) to perform the part of I/O completion processing that doesn’t need access to the user address space of the process that issued the I/O.  The DPC queues a “Special Kernel APC” to the thread that issued the I/O to copy status and byte count information to the overlapped structure in the user process.  In the case of buffered I/O, the APC also copies data from the kernel buffer to the user buffer.  In the case of I/O Completion Routines (not ports), the “Special Kernel APC” queues a user APC to itself to call the user-specified function.  Moreover, prior to Vista every I/O completion required a context-switch to the thread that issued the I/O in order to run the APC.

 

These APCs are disruptive because, as the number of processors and threads increases, so does the probability that the APC will preempt some other thread.  Fully partitioning the application, including using per-processor threads and binding interrupts to specific processors, makes this disruption less likely. 

 

What’s new for I/O in Vista and Above

·         Ability to flag an I/O as low priority.  This reduces the competition between background and foreground tasks, and improves I/O bandwidth utilization.  Low priority IO is exposed via SetPriorityClass PROCESS_MODE_BACKGROUND_BEGIN, also by NtSetInformationProcess(process,ProcessIoPriority,...

·         There are no disruptive APCs running when using I/O Completion Ports.  Also, this can be accomplished for Overlapped structs if the user locks them in memory by using the SetFileIoOverlappedRange call.

·         Ability to retrieve up to ‘n’ I/O completions with a single call to GetQueuedCompletionStatusEx. 

·         The option to skip setting the event in the file handle and skip queuing a dummy completion if the I/O completes in-line (i.e. did not return PENDING status).  These can be done by making a call to SetFileCompletionNotificationModes. 

·         The Dispatcher lock is not taken when a completion is queued to a port and no threads are waiting on that port.  Similarly, no lock gets taken when removing a completion if there are items in the queue when GetQueuedCompletionStatus is called because again the thread does not need to wait for an item to be inserted.  If the call to GetQueuedCompletionStatus was made with zero timeout, then no waiting takes place.  On the other hand, the lock is taken if queuing a completion wakes a waiting thread or if calling GetQueuedCompletionStatus results in a thread waiting.

 

I/O Completion Port Example

Let’s take an example where the main thread loops on GetQueuedCompletionStatus and calls the service function which was specified when the I/O was issued (passed via an augmented Overlapped structure).  The service functions issue only asynchronous I/O and do not wait, therefore the only wait in the main thread is really on the call made to GetQueuedCompletionStatus.  The following are some examples of “events” whose completion we wait on and suggestions on what to do next once they complete:

 

-          If the completion is for a new connection establishment, set up a session structure and issue an asynchronous network receive. 

-          If the completion is for a network receive, parse the request to determine the file name and issue a call to TransmitFile API. 

-          If the completion is for a network send, log the request and issue an asynchronous network receive. 

-          If the completion is for a user signal (from PostQueuedCompletionStatus), call the routine specified in the payload.

 

The timeout parameter on GetQueuedCompletionStatus (GQCS) can cause it to wait forever, return after the specified time, or return immediately.  Completions are queued and processed FIFO, but threads are queued and released LIFO.  That favors the running thread and treats the others as a stack of “filler” threads.  Because in Vista the Completion Ports are integrated with the thread pool and scheduler, when a thread that is associated with a port waits (except on the port) and the active thread limit hasn’t been exceeded, another thread is released from the port to take the place of the one that waited.  When the waiting thread runs again, the active thread count of the port is incremented.  Unfortunately, there is no way to “take back” a thread that is released this way.  If the threads can wait and resume in many places besides GQCS (as is usually the case), it is very common for too many threads to be active.

 

PostQueuedCompletionStatus allows user signals to be integrated with I/O completion handling which allows for a unified state machine.
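
A bare-bones sketch of such a dispatch loop follows.  The SERVICE_CONTEXT structure and its Callback member are illustrative, not part of the Windows API; they simply show how the augmented Overlapped structure can carry the service function:

#include <windows.h>

// An OVERLAPPED augmented with a service-function pointer, as described above.
typedef struct _SERVICE_CONTEXT {
    OVERLAPPED Overlapped;                        // keep first so the cast below is trivial
    void (*Callback)(struct _SERVICE_CONTEXT* ctx, DWORD bytesTransferred);
} SERVICE_CONTEXT;

void DispatchLoop(HANDLE port)
{
    for (;;) {
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED* ov;
        BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
        if (!ok && ov == NULL)
            break;                                // port closed or wait failed

        SERVICE_CONTEXT* ctx = (SERVICE_CONTEXT*)ov;
        ctx->Callback(ctx, bytes);                // issue the next asynchronous I/O; never wait
    }
}

Handles are tied to the port with CreateIoCompletionPort(fileOrSocketHandle, port, completionKey, 0), and user signals are injected with PostQueuedCompletionStatus(port, 0, key, &ctx->Overlapped), which the loop above dispatches exactly like an I/O completion.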

 

Characteristics of I/O Completion Ports

An I/O Completion Port can be associated with many file (or socket) handles, but not the reverse.  The association cannot be changed without closing and reopening the handle.  It is possible to create a port and associate it with a file handle using a single system call, but additional calls are needed to associate a port with multiple handles. 

 

While you don’t need an event in the Overlapped structure when using Completion Ports because the event is never waited on, if you leave it out, the event in the file handle will be set and that incurs extra locking.

 

In Vista, applications that use Completion Ports get the performance benefit of eliminating the IO Completion APC without any code changes or even having to recompile.  This is true even if buffered IO is used.  The other way to get the benefit of IO Completion APC elimination (locking the overlapped structure) requires code changes and cannot be used with buffered IO.

 

Even if the I/O completes in-line (and PENDING is not returned), the I/O completion event is set and a completion is queued to the port unless the SetFileCompletionNotificationModes option is used.
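
Opting in is a single call per handle; a minimal sketch (Vista or later, handle already associated with a completion port):

// Skip queuing a completion when the I/O completes in-line, and skip setting
// the event in the file handle.
SetFileCompletionNotificationModes(fh,
    FILE_SKIP_COMPLETION_PORT_ON_SUCCESS | FILE_SKIP_SET_EVENT_ON_HANDLE);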

 

if( !ReadFile(fh, buf, size, &actual, &ioreq) ) {
    // could be an error or asynchronous I/O successfully queued
    if( GetLastError() == ERROR_IO_PENDING ) {
        // asynchronous I/O queued and did not complete "in-line"
    } else {
        // asynchronous I/O not queued, or was serviced in-line and failed
    }
} else {
    // completed in-line, but still must consume the completion
    // unless the new option is specified
}

 

 Memory Management Issues

When designing an application, developers are often faced with questions like - Should the application be designed with a single process or multiple processes?  Should there be a separate process for each processor or node?  In this section, we try to answer some of these questions while providing the advantages and disadvantages for each approach.

 

Advantages of designing an application with multiple processes include isolation, reliability and security.  First, on 32-bit systems the application as a whole can take advantage of more than 4GB of physical memory because each process gets its own 2GB of user-mode address space.  Second, if memory is corrupted by bad code in one of the processes, the others are unaffected (unless shared memory is corrupted) and the application as a whole does not need to be terminated.  Also, separate address spaces provide isolation that can’t be duplicated with multiple threads in a single process.

 

Disadvantages of using multiple processes include the higher cost of a process context switch compared to a thread switch within the same process, due to the TLB being flushed.  There are also potential performance bottlenecks introduced by the mechanism chosen for Inter-Process Communication (IPC).  Examples of IPC include RPC, pipes, ALPC, and shared memory, so it is important that the right kind of IPC is chosen.  Some estimates for the round-trip cost of sending 100 bytes: RPC about 27,000 cycles, local named pipes about 26,000 cycles, ALPC about 13,000 cycles.  IPC via shared memory is the fastest, but it erodes the isolation benefit of separate processes because bad code can potentially corrupt data in the shared memory.  Also, with shared memory it is not always possible to use the data “in-place”, and copying incurs an added cost of roughly 2.5 cycles per byte copied.

 

Advantages of designing an application with a single process include not needing cross-process communication, cross-process locks, etc.  A single-process application can also approximate some of the advantages associated with multiple processes via workarounds.  For instance, exception handling can trap a failing thread, making it unnecessary to terminate the entire process.  The 2GB user virtual address limit is gone on x64 and can be worked around to some degree on 32-bit systems using Address Windowing Extensions (AWE) or the /3GB switch, which changes the user/kernel split of the 4GB virtual address space from 2:2 to 3:1. 

 

Shared Memory Issues

Shared memory is the fastest IPC but sacrifices some of the isolation that was the justification for using separate processes.  Shared memory is secure to outsiders once set up, but the mechanism by which the multiple processes gain access to it has some vulnerability.  Either a name must be used that is known to all the processes (which is susceptible to “name squatting”) or else a handle must be passed to the processes that didn’t create the shared memory (using some other IPC mechanism). 

 

Managing updates to shared memory:

1. Data is read-write to all processes; use a cross-process lock or lock-free structures to guard the data.

2. Data is read-only to all but one process, which performs all updates (without allowing readers to see inconsistencies).

3. Same as 2, but the kernel performs the updates.

4. Data is read-only; unprotect it briefly to update (suffering a TLB flush due to the page-protection change).

 

Global Data defined in a DLL is normally process-wide but the linker “/SECTION:.MySeg,RWS” option can be used to make it system-wide if that is what is needed.  Just loading the DLL causes it to be set up as opposed to the usual CreateSection/MapViewOfSection APIs.  The downside is that the size is fixed at compile time.
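
For example, a DLL might expose a small shared area like this (the segment name .MySeg and the counter are illustrative):

// In the DLL: place selected, initialized globals in a named section and mark it shared.
#pragma data_seg(".MySeg")
volatile LONG g_SharedCounter = 0;   // must be initialized to land in the named section
#pragma data_seg()
#pragma comment(linker, "/SECTION:.MySeg,RWS")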

 

Memory Fragmentation – What is it?

Fragmentation can occur in the Heap or in the process Virtual Address Space.  It is a consequence of the “best fit” memory allocation policy and a usage pattern that mixes large, short-lived allocations with small, long-lived ones.  It leaves a trail of free blocks (each too small to be used) which cannot be coalesced because of the small, long-lived allocations between them.  It cannot happen if all allocations are the same size or if all are freed at the same time.

Avoid fragmentation by not mixing wildly different sizes and lifetimes.  Large allocations and frees should be infrequent and batched.  Consider rolling your own “zone” heap for frequent, small allocations that are freed at the same time (e.g. constructing a parse tree).  Obtain a large block of memory and claim space in it using InterlockedExchangeAdd (to avoid locking).  If the zone is per-thread, there is no need for even the interlocked instruction.

Use the Low Fragmentation Heap whenever possible.  It is NUMA-aware and lock-free in most cases.  It replaces the heap look-asides and covers allocations up to 16KB.  It combines the management of free and uncommitted space to eliminate linear scans.  It is enabled automatically in Vista or by calling HeapSetInformation on the heap handle.
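
Two small sketches of those last two ideas, with illustrative names and sizes (enabling the LFH explicitly, and claiming space from a zone with InterlockedExchangeAdd):

#include <windows.h>

// Opt a private heap into the Low Fragmentation Heap (the value 2 selects the LFH).
HANDLE CreateLfhHeap(void)
{
    HANDLE heap = HeapCreate(0, 0, 0);
    ULONG lfh = 2;
    HeapSetInformation(heap, HeapCompatibilityInformation, &lfh, sizeof(lfh));
    return heap;
}

// A trivial "zone": grab one large block up front, then claim space lock-free.
typedef struct _ZONE {
    char*         Base;     // start of the large block (e.g. from VirtualAlloc)
    LONG volatile Used;     // bytes claimed so far
    LONG          Size;     // total bytes in the block
} ZONE;

void* ZoneAlloc(ZONE* z, LONG bytes)
{
    LONG offset = InterlockedExchangeAdd(&z->Used, bytes);   // returns the previous value
    if (offset + bytes > z->Size)
        return NULL;        // zone exhausted; real code would grow, fall back, or fail loudly
    return z->Base + offset;
}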

 

Best practices when managing the Heap

·         Don’t use GlobalAlloc, LocalAlloc, HeapWalk, HeapValidate, or HeapLock.  GlobalAlloc and LocalAlloc are old functions which may take an additional lock even if the underlying heap uses lock-free look-asides.  The HeapWalk and HeapValidate functions disable the heap look-asides.

 

·         Don’t “shrink to fit” (i.e. allocate a buffer for largest possible message, then realloc to actual size) or “expand to fit” (i.e. allocate typical size, realloc ‘n’ bytes at a time until it fits).  These are fragmentation-producing allocation patterns and realloc often involves copying the data.

 

·         Don’t use the heap for large buffers (>16KB).  Buffers obtained from the heap are not aligned on a natural boundary.  Use VirtualAlloc for large buffers, but do it infrequently, carve them up yourself and recycle them.

 

·         New in Vista: dynamic kernel virtual address allocation... no longer need to manually juggle the sizes of the various kernel memory pools when one of them runs out of space (e.g. desktop heap).

 

·         New in Vista:  Prefetch API - PreFetchCacheLine(level, address).  The API has a large dependency on the hardware’s support for prefetch.

 

NUMA Support

APIs to get topology information (depends on hardware for the information):

 

  GetNumaHighestNodeNumber

  GetNumaProcessorNode (specified processor is on which node)

  GetNumaNodeProcessorMask (which processors are on the specified node)

  GetNumaAvailableMemoryNode (current free memory on the specified node)

 

Use existing affinity APIs to place threads on desired nodes.

 

Memory is allocated on node where thread is running when the memory is touched for the first time (not at allocation time).  For better control over where memory is allocated, use new “ExNuma” versions of the memory allocation APIs.  The additional parameter specifies node.  It is a preferred, not absolute specification and it is 1-based because 0 signifies no preference.

 

  VirtualAllocExNuma(..., Node)

  MapViewOfFileExNuma(..., Node)

  CreateFileMappingExNuma(..., Node)

 

Large pages and the TLB

The Translation Look-aside Buffer (TLB) is a critical resource on machines with large amounts of physical memory.  Server applications often have large blocks of memory which should be treated as a whole by the memory manager (instead of 4KB or 8KB chunks), e.g. a database cache.  Using large pages for those items reduces the number of TLB entries, improves TLB hit ratio, and decreases CPI.  Also, the data item is either all in memory or all out.  

 

  SIZE_T minsize = GetLargePageMinimum();

  void* p = VirtualAlloc(NULL, n * minsize,
                         MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                         PAGE_READWRITE);   // requires the "Lock pages in memory" privilege

Power and Hyper-V are now part of the Windows Server 2008 Tuning Guide!
17 June 08 08:58 PM | winsrvperf | 3 Comments   

The guide has been updated with sections on Power and Hyper-V guidelines and best practices.  Check out the updated Tuning Guide and tell us what you think by following the feedback link at the top of the Tuning Guide.  We look forward to hearing from you!

Ahmed Talat
Performance Manager
Windows Server Performance Team

Designing Applications for High Performance - Part II
21 May 08 02:48 AM | winsrvperf | 1 Comments   

Rick Vicik - Architect, Windows Server Performance Team

The second part of this series covers Data Structures and Locks. I will provide general guidance on which data structures to use under certain circumstances and how to use locks without having a negative impact on performance.  Finally, there will be examples covering common problems/solutions and a simple cookbook detailing functional requirements and recommendations when using data structures and locks.

 

In order to avoid cache line thrashing and a high rate of lock collisions, the following are suggested guidelines when designing an application:

 

·         Minimize the need for a large number of locks by partitioning data amongst threads or processors. 

·         Be aware of the hardware cache-line size because accessing different data items that fall on the same cache-line is considered a collision by the hardware. 

·         Use the minimum mechanism necessary in data structures.  For example, don’t use a doubly-linked list to implement a queue unless it is necessary to remove from the middle or scan both ways. 

·         Don’t use a FIFO queue for a free list. 

·         Use lock-free techniques when possible.  For example, claim buffer space with InterlockedExchangeAdd and use an S-List for free lists or producer/consumer queues. 

 

The “ABA” problem

The “ABA” problem occurs when InterlockedCompareExchange is used to implement lock-free data structures, because what is really needed is the ability to detect any change, even a set of multiple changes that restores the original value.  If 2 items are removed from an S-List and the first is replaced, just comparing values would not detect that the list has changed (and the local copy of the ‘next’ pointer from the first item on the list is “stale”).  The solution is to add a version# to the comparison so that it fails after any change, even one that restored the old value.  That gets tricky on x64 because a pointer already occupies the maximum operand size of InterlockedCompareExchange, leaving no room for the version#.

 

The “Updating of Shared Data” problem

The primary concern when handling updates to shared data is to be aware of all the ways an item can be reached.  When removing from the middle of a doubly-linked list, an item becomes unreachable when the next and previous pointers of the adjacent items are updated.  The tricky part is that a thread may have made a local copy of the “next” pointer from the previous item before it was updated but hasn’t yet accessed that “next” item.  A “lurking reader” must make its presence known to others by using a refcount or by taking per-item locks in a “crabbing” fashion as it traverses the list.  The refcount can’t be on the target item because the lurking reader hasn’t gotten there yet.  It must be on the pointer that was used to get there or else logically apply to the list as a whole.  The “crabbing” technique is usually too expensive, so it is almost always necessary to have a lock which guards the list.

 

True Lock-Free Operations

The simple test for a true “lock-free” operation is whether or not a thread can die anywhere during the operation and not prevent other threads from doing the operation.  It is commonly thought that replacing a pointer with a “keep out” indicator is better than using a lock, but it is really just folding the lock and pointer into the same data item.  If the thread dies after inserting the “keep out” indicator, no other thread can get in.

 

Guidelines for good locking practices and things to avoid

 

·         Hash Lookup (cache index or “timer wheel”)

A good general practice is to use an array of synonym list heads with a lock per list (or per set of lists), where the locks fall on different cache lines.  This design does not increase the number of lock acquires per lookup compared to the single-lock implementation, while reducing the lock collision rate.  Use a doubly-linked list for synonyms to support removal from the middle (if lookup always precedes removal, a singly-linked list can be used).  If the access pattern is mostly lookups with occasional insert/remove, use a Read/Write lock with a “writer priority” policy to provide fairness.  If the entry is not found, then an exclusive lock is needed to insert it.  If the lock doesn’t support atomic promotion, the code must drop the shared lock, reacquire it exclusively, and rescan.  To avoid memory allocations (and possibly waiting) while holding the lock, allocate the new block between dropping and re-acquiring the lock.  A skeletal sketch of this layout follows.
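
The sketch below uses one SRWLock (covered later in this post) per bucket and pads each bucket to a cache line; the 64-byte pad and bucket count are illustrative, and each list head must be initialized to point to itself before use:

#include <windows.h>

#define BUCKETS 256

// One synonym list head plus its lock, padded so two buckets never share a cache line.
typedef struct _BUCKET {
    SRWLOCK    Lock;        // zero-initialized is the unlocked state
    LIST_ENTRY Synonyms;    // circular doubly-linked list of synonyms
    char       Pad[64 - sizeof(SRWLOCK) - sizeof(LIST_ENTRY)];
} BUCKET;

BUCKET g_Table[BUCKETS];

// Shared (reader) access for the common lookup path; 'match' decides if an entry is the one wanted.
PLIST_ENTRY Lookup(ULONG hash, BOOL (*match)(PLIST_ENTRY))
{
    BUCKET*     b = &g_Table[hash % BUCKETS];
    PLIST_ENTRY e;
    PLIST_ENTRY found = NULL;

    AcquireSRWLockShared(&b->Lock);
    for (e = b->Synonyms.Flink; e != &b->Synonyms; e = e->Flink) {
        if (match(e)) {
            found = e;
            break;
        }
    }
    ReleaseSRWLockShared(&b->Lock);

    // Not found: the caller allocates a new entry first, then takes b->Lock exclusive,
    // rescans, and inserts if still absent.  (A real implementation also needs a
    // refcount or similar scheme so 'found' cannot be freed after the lock is dropped.)
    return found;
}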

 

·         N:M Relationship

Suppose a company allows a many-to-many relationship between employees and projects.  That requires a set of intersecting lists to represent the relationships and a single lock can probably be used to guard it.  That solution might be sufficient if most access is read-only, but will become a bottleneck if updates are frequent.  A finer-grain solution is to have a lock on each instance of project and employee.  Adding or removing a relationship requires taking the 2 intersecting locks, which is slightly more than the single lock implementation.  A number of optimizations are possible to avoid lock-ordering issues:  Deferred removal of intersection blocks, one-at-a-time insertion & removal, InterlockedCompareExchange for removal.

 

·         Lock Convoy

FIFO locks guarantee fairness and forward progress at the expense of causing lock convoys.  The term originally meant several threads executing the same part of the code as a group resulting in higher collisions than if they were randomly distributed throughout the code (much like automobiles being grouped into packets by traffic lights).  The particular phenomenon I’m talking about is worse because once it forms the implicit handoff of lock ownership keeps the threads in lock-step.

 

To illustrate, consider the example where a thread holds a lock and it gets preempted while holding the lock.  The result is all the other threads will pile up on the wait list for that lock.  When the preempted thread (lock owner at this time) gets to run again and releases the lock, it automatically hands ownership of the lock to the first thread on the wait list.  That thread may not run for some time, but the “hold time” clock is ticking.  The previous owner usually requests the lock again before the wait list is cleared out, perpetuating the convoy.

 

·         Producer/Consumer Implementation

The first thing you should ask yourself when setting up a producer/consumer arrangement is what gain is achieved by handing off the work.  The amount of processing done per hand-off must be significantly greater than the cost of a context switch (~3k cycles direct cost, 10x or more indirect cost, depending on cache impact).  The only legitimate reasons for handing off work to another thread are: if the isolation of a separate process is needed, or if preemption is needed rather than cooperative yielding. 

 

The following are code snippets for a Producer/Consumer implementation which will be used to point out things to avoid when doing the design.

 

Producer

WaitForSingleObject(QueueMutex, ...);
InsertTailList(&QueueHead, Item);
SetEvent(WakeupEvent);
ReleaseMutex(QueueMutex);

Consumer

for(;;) {
    WaitForSingleObject(WakeupEvent, ...);
    WaitForSingleObject(QueueMutex, ...);
    item = RemoveHeadList(&QueueHead);
    ReleaseMutex(QueueMutex);
    ... process item ...
}

 

1.       Don’t wake the consumer while holding the lock it will need.  In our example, the Producer is holding the QueueMutex when it made the SetEvent call.

2.       Don’t make multiple system calls to process each item even when no scheduling events occur (3 system calls in the producer, 3 in the consumer in this case). 

3.       Using the WaitForMultiple(ALL) call on both the QueueMutex and WakeupEvent seems like a clever solution because it avoids the extra context switch and combines the two WaitForSingleObject system calls into a single call.  It really isn’t much better because each time an event in the set is signaled, the waiting thread is awakened via APC to check the status of all events in the set (resulting in just as many context switches).

4.       Amortize the hand-off cost by not waking the consumer until ‘n’ items are queued, but then a timeout is needed to cap latency. 

5.       It is better to integrate background processing into the main state machine.   

6.       The consumer should be lower priority than the main thread so it runs only when the main thread has nothing to do.  Consumer should not cause preemption when delivering the “finished” notification.

7.       In the example above, consider using PostQueuedCompletionStatus rather than the SetEvent API.  The latter boosts the target thread’s priority by 1, which may cause it to preempt something else.   

8.       Don’t use the Windows GUI message mechanism for producer/consumer queues or inter-process communication.  Use it only for GUI messages. 

 

Synchronization Resources

 

Events are strictly FIFO, can be used cross-process by passing the name or handle, and they have no spin option or logic to prevent convoys from forming.  When an event is created, its signaling mode can be set to either auto-reset or manual-reset.  Each signaling mode dictates how APIs like SetEvent and PulseEvent interact with threads.

 

·         The SetEvent call on auto-reset events will allow a single thread to pass through if no others are waiting on the event.  For manual-reset events, the call will allow all threads waiting on the event to pass and “keep the door open” until the event is explicitly reset. 

·         The PulseEvent call on auto-reset events will allow a single thread to pass through but only if there is one waiting.  On manual-reset events, the call will allow all threads waiting on the event to pass and “leaves the door closed”.

 

Note:  The SignalObjectAndWait call allows a thread to signal another and wait without being preempted by the signaled thread.  The SetEvent call typically boosts the priority of the signaled thread by 1, so it is possible for the signaled thread to preempt the signaling thread before it waits.  This pattern is a very common cause of excess context switches. 

 

Semaphores are also strictly FIFO, can be used cross-process, and they have no spin option and no logic to prevent convoys from forming.  When it is necessary to release a specific number of threads, but not all of them, using the ReleaseSemaphore call is recommended.

 

The mutex exposed to user mode supports recursion and stores the current owning thread ID, whereas the Event and the Semaphore do not.

 

The ExecutiveResource is a reader/writer lock available in user and kernel mode.  It is FIFO, has no spin or anti-convoy logic, but does have a TryToAcquire option and has anti-priority-inversion logic (it attempts to boost the owner(s) if an acquire takes over 500 milliseconds).  There are options regarding promotion to exclusive, demotion to shared, and whether readers or writers have priority (but not all options are available in both the user and kernel versions).

 

The CriticalSection is the most common lock in user-mode.  It is exclusive, supports recursion, has TryToAcquire and spin options and is convoy-free because it is non-FIFO.  Acquire and Release are very fast and do not enter the kernel if there is no contention.  It cannot be used across processes.  The event is created on first collision and all critical sections in a process are linked together for debugging purposes which requires a process-wide lock on creation and destruction (but there is an option to skip the linking).    

 

Read/Write locks (Slim Read/Write Lock (SRWLock) in user mode and PushLock in kernel mode)

The new, lighter-weight read/write lock is available in Vista and later releases of Windows.  It is similar to the CriticalSection in that it makes no system call if there are no contentions.  It is non-FIFO and therefore convoy-free.  The data structure is just a single pointer and the initialized state is 0, so they are very cheap to create.  It does not support recursion.  To start using SRWLocks, visit MSDN for a list of the supported APIs. 

 

ConditionVariables are also new in Vista.  They are synchronization primitives that allow for threads to wait until a specific condition occurs.  They use SRWLocks or CriticalSections and cannot be shared across processes.  To start using ConditionVariables, visit MSDN for a list of the supported APIs. 
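
A compact sketch of the two primitives working together, revisiting the earlier producer/consumer queue (the names and the raw LIST_ENTRY manipulation are illustrative; the list head must be initialized to point to itself):

#include <windows.h>

SRWLOCK            g_QueueLock     = SRWLOCK_INIT;
CONDITION_VARIABLE g_QueueNotEmpty = CONDITION_VARIABLE_INIT;
LIST_ENTRY         g_Queue;                       // circular list head, initialized elsewhere

// Producer: insert at the tail, then wake a consumer after dropping the lock.
void Produce(PLIST_ENTRY item)
{
    AcquireSRWLockExclusive(&g_QueueLock);
    item->Flink = &g_Queue;
    item->Blink = g_Queue.Blink;
    g_Queue.Blink->Flink = item;
    g_Queue.Blink = item;
    ReleaseSRWLockExclusive(&g_QueueLock);
    WakeConditionVariable(&g_QueueNotEmpty);      // cheap when no consumer is asleep
}

// Consumer: sleep on the condition variable while the queue is empty, then remove from the head.
PLIST_ENTRY Consume(void)
{
    AcquireSRWLockExclusive(&g_QueueLock);
    while (g_Queue.Flink == &g_Queue)
        SleepConditionVariableSRW(&g_QueueNotEmpty, &g_QueueLock, INFINITE, 0);

    PLIST_ENTRY item = g_Queue.Flink;
    g_Queue.Flink = item->Flink;
    item->Flink->Blink = &g_Queue;
    ReleaseSRWLockExclusive(&g_QueueLock);
    return item;
}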

 

 

 

Lock-Free APIs (Recommended when possible)

 

Atomic updates are typically better than acquiring locks, but locking can still happen at the hardware level which causes bus traffic.  It is also possible to have excessive retries (at the hardware or software level).

 

InterlockedCompareExchange compares a 4 or 8 byte data item with a test value and if equal, atomically replaces it with a specified value.  If the comparison fails, the data item is left unchanged and the old value is returned.  The S-List is constructed using InterlockedCompareExchange.  It atomically modifies the head of a singly-linked list to support push, pop and grab operations.  It uses a version# to prevent the “ABA” problem mentioned earlier in the post.  InterlockedCompareExchange can also be used to construct a true FIFO lock-free queue.  It is basically an S-List with a tail pointer that could be slightly out-of-date.  It is not widely used because it has the limitation that elements used in the list can never be re-used for anything else.  This is often worked-around by using surrogates for the list elements.  A common pool of them can be maintained in the application.  It can grow but never shrink.
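
Windows also exposes the S-List directly; a minimal example of using it as a free list (the ITEM structure is illustrative, and entries must be allocated with the required alignment):

#include <windows.h>
#include <malloc.h>

// Keeping the SLIST_ENTRY first makes the cast back to ITEM trivial.
typedef struct _ITEM {
    SLIST_ENTRY Entry;
    int         Payload;
} ITEM;

void SListDemo(void)
{
    SLIST_HEADER head;
    InitializeSListHead(&head);

    ITEM* item = (ITEM*)_aligned_malloc(sizeof(ITEM), MEMORY_ALLOCATION_ALIGNMENT);
    item->Payload = 42;
    InterlockedPushEntrySList(&head, &item->Entry);    // lock-free push

    PSLIST_ENTRY e = InterlockedPopEntrySList(&head);  // lock-free pop, NULL if the list is empty
    if (e != NULL)
        _aligned_free((ITEM*)e);                       // Entry is the first member of ITEM

    // InterlockedFlushSList(&head) "grabs" the entire list in one atomic operation.
}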

 

InterlockedIncrement, InterlockedDecrement, and InterlockedExchangeAdd are similar, but the new value is derived from the old rather than specified (add or subtract 1, add specified value... use negative to subtract).  These can be better performing than InterlockedCompareExchange because they cannot fail, so retries are eliminated in most cases.  To add variety, InterlockedIncrement and InterlockedDecrement return the new value while InterlockedExchangeAdd returns the old value.

 

The Locking Cookbook

 

The following table provides a list of proposed recommendations which shouldn’t negatively impact performance when working towards meeting a specific set of functional requirements.

 

Functional Requirement: Performance Recommendation

Maintaining a reference count, driving a circular buffer, constructing a barrier: InterlockedIncrement / InterlockedDecrement

Claiming space in a buffer, rolling up per-CPU counts, constructing complex locks: InterlockedExchangeAdd

8-byte mailbox, S-List, Queue: InterlockedCompareExchange

Free list or producer/consumer work queue: S-List

A true FIFO with no need to reverse: Lock-free Queue

List that supports traversal and/or removal from the middle: Conventional lock (if >70% of accesses are reads, use a Reader/Writer lock)

Least Recently Used (LRU) list: Use a “clock” algorithm or deferred removal

Tree: Lock a sub-tree, or “crab” downward as when traversing a linked list

Table 1:  A locking cookbook

 

Conclusion

 

In conclusion, the following are common guidelines for designing applications for high performance while using locks and data structures to ensure data integrity through the different available synchronization mechanisms.

 

·         Minimize the frequency of lock acquires and the hold time.  Don’t wait on objects, do I/O, call SetEvent, or make an RPC call while holding a lock.  Don’t call anything that may allocate memory while holding a lock.  Note that taking a lock while holding a lock (i.e. nesting locks) inflates the hold time on the outer lock.

  

·         Use a Reader/Writer lock if >70% of operations take the lock shared.  It is incorrect to assume that any amount of shared access will be an improvement over an exclusive lock.  Exclusive operations can be delayed by multiple shared operations even if alternating shared/exclusive fairness is implemented. 

          

·         Break up locks but not so that typical operations need most of them.  This is the tricky trade-off.

 

·         Taking multiple locks can cause deadlocks.  The typical solution is to always take locks in a pre-defined order, but in practice, different parts of the application may start with a different lock.  Use TryToAcquire first (it helps eliminate deadlocks because you don’t wait).  If that fails, drop the lock that is held and re-acquire both locks in the pre-defined order (see the sketch after this list).  Things can change between the drop and reacquire, so it may be necessary to recheck the data.  Not all locks have a try-to-acquire option.  WaitForMultipleObjects(All) is another way to solve the deadlock problem without defining a locking order (deadlocks are impossible if all locks are obtained atomically).  The expense of WaitForMultipleObjects(All) is the downside.

 

·         If you find there is a need to use recursion on locks, it means you don’t know when the lock is held.  That lack of knowledge makes it impossible to minimize the lock hold time, because you don’t know where the lock is acquired and released.  This is a common problem with Object-Oriented design.
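
A sketch of the try-then-back-off pattern mentioned above (the lock names are illustrative, and InitializeCriticalSection must have been called on both locks):

#include <windows.h>

CRITICAL_SECTION g_LockA, g_LockB;    // pre-defined order: A before B

// A code path that already holds B but now also needs A (the "wrong" order).
void AcquireAWhileHoldingB(void)
{
    if (!TryEnterCriticalSection(&g_LockA)) {
        // A is contended; blocking here while holding B could deadlock.
        LeaveCriticalSection(&g_LockB);
        EnterCriticalSection(&g_LockA);   // blocking acquires, in the pre-defined order
        EnterCriticalSection(&g_LockB);
        // Anything read under B before the drop may be stale; the caller must recheck it.
    }
    // Both locks are held here.
}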

 

NT... TTCP! Network Performance Test Tool Available
03 May 08 08:19 PM | winsrvperf | 1 Comments   

NTttcp (a Windows port of Berkeley's TTCP Winsock-based test tool) has officially gone live (http://www.microsoft.com/whdc/device/network/TCP_tool.mspx) on Microsoft.com.  NTttcp is a useful tool to help measure overall Windows networking performance with a multitude of networking adapters in different configurations.  I encourage you to install the tool today and start measuring your network throughput and efficiency.

 

Ahmed Talat
Performance Manager
Windows Server Performance Team
