What's Going Wrong at 3:00 AM?

knockNrod

Rate me:

0.00/5 (No votes)

21 Nov 2011CPOL

9.8K

Describes one effort to run performance monitor to identify what's wrong with a counter log, then set up an event trace to see who the culprit is.

Introduction

I've returned to South Korea to investigate a potential problem with a customer's computer system, and I'm exploring how Performance Monitor's Traces and Logs work in conjunction to identify what is going on inside the system when it shows a problem, and which programs might be creating bottlenecks. This is a Windows system, but you can find a similar tool, the System Analysis Report (sar) tool, on most Unix systems.

Some introductions are in order. If you're not familiar with Performance Monitor, by all means, simultaneously press your windows key and 'R', type in "perfmon" (sans quotes), and click the OK button. Your first impression will likely be that it looks like the same graph on Windows Task Manager (that program that runs when you simultaneously press Ctrl-Alt-Delete), and you wouldn't be too wrong. The default counters are Pages/sec, Avg. Disk Queue Length, and % Processor Time, with values being displayed from current activity in a short time cycle (one-minute and forty seconds). However, with Performance Monitor, you can configure thousands of different counters to watch; you can record the graphs to disk in a compact binary format, a more transportable text format, or even log the data to an SQL Server database; you can establish schedules of when to start and stop recording; and more. Performance Monitor, once you start to work with it, is a far more powerful tool than the graphs you see in Task Manager.

System Performance Tools Used

Performance Monitor: Performance Monitor (or perfmon) is a Microsoft Windows based tool for collecting performance counter data and displaying it in a graphical manner. A command line interface exists, called LogMan, which allows you to create and control Performance Monitor logs but is mostly used for administrative scripts.
System Analysis Report: System Analysis Report (or sar) is a tool for collecting performance logs on Unix based systems. Although it will not be used in this article, it's mentioned for completeness.

Performance logs are more designed for observing long-term increases in consumption of resources for the purposes of projecting future needs, but they can be used very effectively for troubleshooting complex system problems. Ideally, they would have a few more options for troubleshooting purposes, but they can be very effective as they are, and you can always create custom tools to analyze the raw data in other ways.

For now, though, let's get familiar with how to use the tool and a few of the counters that report data of interest in troubleshooting a system. With Performance Monitor open, click the counters and delete them. Next, right click within the same area and select Add Counters. The counters we'll add, along with descriptions, are given below.

When you right-click and choose "Add Counters ..." you'll be presented with the Add Counters Dialog Box. The first two radio buttons don't matter, since they likely refer to the same computer, so ignore those for now. The Performance Object Dropdown list is a selection of categories or computer subsystems you're looking at. If you know what subsystem is likely causing a problem, you can target more counters from that subsystem, but computers often surprise you. So for this discussion, I'm going with about a dozen counters from a wide range of subsystems so I can find which subsystem might reveal unexpected behavior. In the list of counters below, the name before the colon will be the category where you should find the counter listed.

Memory: % Committed Bytes in Use: This is the percentage of bytes your computer has committed to running its tasks. The higher this number, the more starved the system is for memory to perform tasks.
Memory: Page Reads/sec: This refers to how many reads the system has to perform on the virtual memory system. Any time a new application starts, you would expect it to read from the page file to allocate new memory. However, a lot of reading during usage of a single or small number of applications is a strong indication that the system needs more memory to run efficiently.
Network Interface: Bytes Received/sec: This shows you the number of bytes received over the network. It's not really useful to indicate a problem, but if we see it running at near capacity for long periods, we can dig more. Right now, we're just looking for unusual activity.
Network Interface: Bytes Sent/sec: This shows you the number of bytes sent over the network. It's similar to the counter above, but for sending information, rather than receiving it.
Physical Disk: % Disk Time: This is the percentage time the processor spends managing disk data. This is the total time spent reading and writing. The separated data is more useful in understanding what's happening, but the total is useful in telling you how much effort is spent accessing the disk.
Physical Disk: % Disk Read Time: This is the percentage time the processor spends reading from the disk. It's similar to the counter above, but it tells you whether the disk is being read from (not written to).
Physical Disk: % Disk Write Time: This is the percentage time the processor spends writing to the disk. It's similar to the previous two counters, but is limited to time writing to disk.
Processor: % Processor Time: This shows the amount of time the processor is processing (not idle).
Processor: Interrupts/sec: This counter isn't very useful by itself. It shows how many requests the processor is receiving. Interrupts are usually raised by devices attached to the system (disks, network cards, and such) to request high-priority CPU time. We include it here to compare it with the next counter.
System: System Calls/sec: This is the number of system calls being serviced by the CPU per second. By comparing the Processor's Interrupts/sec with the System Calls/sec we can get a picture of how much effort the system requires to respond to attached hardware. On a healthy system, the Interrupts per second should be negligible in comparison to the number of System Calls per second. When the system has to repeatedly call interrupts for service, it's indicative of a hardware failure. If we see this, we'll add more counters to look for specific hardware failures, but for now, we're casting a wide net looking for something out of the ordinary.

TCP Counters

The next two counters we'll be adding — TCP: Segments Retransmitted/sec and TCP: Segments Sent/sec — require a more detailed explanation. Since these two counters are not as general as the other counters, I feel some justification is due for their inclusion. The computer that I'm troubleshooting is a dedicated computer that is used to run a specific set of applications for controlling the manufacture of steel as part of an automated control system. A major part of this application is network communications. With few exceptions, most network traffic is TCP, rather than UDP or other IP protocols. However, you should not mistake these counters as a complete picture of network traffic, especially since some process automation applications make heavy use of UDP and can generate extraordinary amounts of traffic. Furthermore, these counters only measure statistics for this machine, and not for the overall traffic on the network. As a result, we're looking at the retransmit rate as a possible indirect indication of network congestion.

If you've ever listened to one half of a telephone conversation, you'll note (with humor) the stretches of conversation that go something like this: "Uh-huh." "Yep." "Oh." "Mmm-hm." "Yep." "Ah-hah." "Mmmm." In TCP/IP networking, these are called acknowledgements or ACKs. And when one isn't received for an extended period, you're likely to hear, "You still there?" or "Hello?" In TCP, this is a retransmit timeout. You don't know why the conversation stopped — it just did. One possible reason for the conversation to stop is because too many other folks were also trying to have telephone conversations, the network was overloaded, and your conversation was interrupted. Of course, there are many other reasons for the conversation to stop, ranging from attention deficit to passing through a tunnel (which happens a lot here in the Burgh). Depending on the length of conversation (and your interest in the topic) a certain amount of retransmits are to be expected. However, when the rate of retransmits increases for a given amount of conversation (Can you hear me now?), this provides an indication that there are network disruptions (or maybe that it's just time to change the topic).

Likewise, if your network is working well, the ratio of segments sent to retransmissions should be low — lower than 4%. If it's higher than that, you should look more closely at your network. By including these two counters, I'm mostly looking for burst disruptions that might account for the periodic disruptions our customer is experiencing. If we need to examine the network traffic more closely, we'll have to obtain a stand-alone tool like Wireshark, MRTG, or Nagios. For now, we're only looking for possible areas of interest, not causes, and these two are sufficient.

Recording Counters

Once you're familiar with the graphics, these settings must be persisted, and I'm afraid there is no easy way to transfer counters between the live graph and a recordable event log definition, so we'll have to set them up again. With Perfmon open, you should see a tree on the far left with a node for "Performance Logs and Alerts". Expand that node, and you should see a node called "Counter Logs". Selecting this node should change the graph on the right to a list of performance event log definitions. By default, Windows comes with an example event log called "System Overview". Right-click on an unused area of the right panel and choose "New Log Settings..." We're going to call this Baseline, so type that into the text box and click the Ok Button. The Event Log Settings Dialog opens to the General Tab. Here, you'll go through the list of counters listed above and add them. Next, click on the Schedule Tab and in the Start Log Group select the radio button to start manually. Click the dialog's Ok button to close and save your settings. From here, you can right click on the event log to start recording at your convenience. A day's worth of recording creates a file of about 3 MB and is stored (by default) in the C:\perfmon\ directory with a BLG extension. I can't imagine the disk space being an issue, but you should be aware of the disk usage and at least be prepared to transfer these log files as soon as you can. Ideally, you would record all of your event logs and (more importantly) trace logs to an external drive or extended partition.

Once the event log has begun recording, you can close perfmon and use your system as usual, blithely unaware that it's recording. When you're done, open perfmon again, browse back to the event log, right-click on it and choose stop. Once a log is created, you can open the file in perfmon on any computer. From the graphing screen (System Monitor node selected), you'll see a tool-strip just above the graph with an icon that looks like a soda-can. This is a universal symbol for data. The icon is supposed to look like an old magnetic memory drum — the precursor to the hard disk drive — which once stood as high as your knees and wide enough to sit on while still only providing storage of a megabyte or less. These drums were the source of all long-term storage and quickly became the symbol of data storage and retrieval.

Viewing the Results

When the System Monitor Properties Dialog opens with the Source Tab selected, click on the Log Files Radio Button, click the (now enabled) Add Button, and browse to your event log file. Select your log file, click Apply, and then switch to the Data Tab. From here, remove any counters defined, then click the Add Button and select the All Counters Radio Button. This setting will remain as you browse each Performance Object in turn, and all the counters you have recorded will be listed. So, go through each Performance object in the drop-down list and click the Add Button. When you're done, click the Close Button of the Add Counters Dialog, then click the OK Button of the System Monitor Properties Dialog, and the graph with all your counters will be displayed. Here's an image of my graph on the computer I'm troubleshooting.

In the tool bar above the graph, you can click the yellow light bulb to toggle highlighting. Then, below the graph, as you select each counter in turn, it will highlight the corresponding line within the graph. The highlight allowed me to quickly identify the dips and spikes above as a drop in Processor Time and Segments Sent along with an unusual spike in Pages/sec and % Disk Time. The spike is almost certainly due to our monitoring program that kills and restarts tasks which fail to respond to a watch dog timeout. Understanding a little of the system can help to quickly isolate behavior changes of interest.

Understanding the System

We have one task whose sole responsibility is to periodically send an inner process communication (IPC) message to each of the other tasks and request a response. If any of the other tasks fails to respond within a reasonable period of time, the monitor task will attempt to kill and restart that task. When the new task is loaded, there will be a surge in disk access as the operating system allocates space in the page file to run the application. In short, the spike in % Disk Time and Pages/sec are a normal response to the problem, not the source of the problem. The dip in Segments Sent is a pretty clear indication that our programs have stopped communicating over the network, and the dip in Processor Time corroborates that our program is failing, but none of this gives any real indication of why.

Our programs do maintain a running log of what's happening, so the next step is to determine the time range for when the performance dip occurs and see if we can find anything in the log files. Right click on the graph and choose properties. In the System Monitor Properties Dialog, click on the Source Tab, then drag the dialog box away so that it remains visible but does not obstruct your view of the graph behind. Just above the dialog box's Ok Button and Cancel Button, you'll see the Time Range Slider. You can drag the slider from the left and right independently to adjust the beginning and ending time of the epoch. As you adjust these sliders, you'll see a dim grey bar move across the graph to help align the time with graph trends, like our dip. When I sandwich the dip, I notice that it starts remarkably close to Midnight and ends remarkably close to 1 AM.

Midnight is a popular time for automated activities: first because it marks the end of a day, and second because it's often perceived to be a time of reduced activity, ideal for disruptions caused by backup scripts. Of course, if this is an automated task, this behavior will repeat each night at the same time. So, in addition to looking at the log files for anything unusual that might have occurred just prior to Midnight, I also want to record nightly logs to verify that this is a trend, rather than a single occurrence.

Checking the log files did not reveal anything immediately obvious happening at Midnight. If anything, Midnight activity was oddly absent, considering that there should have at least been a record of the monitor task having to kill and restart an unresponsive task. Repeated logs for the second and third day, however, did reveal this as a daily occurrence, always at Midnight. Although our software does close out and open a new log file for each day, it does not perform any daily accounting for reports at Midnight; however there is additional software running on the computer, and it's possible that some other application is taking over the machine to perform accounting. In that scenario, I would have expected to see a spike in the pages/sec as a new thread or spawned application spun up. Furthermore, I would have expected to have seen an increase in disk access during the period in which it was reviewing a record of the daily activities. However, it's entirely possible that these side effects are being masked by the unintended side-effect of stopping our programs running on here. The only way to know for sure is to look more closely at what applications and threads might be starting around this time.

Recording Events

Since our reduced services always occur at midnight, we're going to start recording all the tasks executing, starting and stopping from just before midnight until just after. To perform this recording, we're going to use the trace logs within perfmon. With Performance Monitor open, expand the Performance Logs and Alerts node in the left pane, and select Trace Logs. In the right pane, right-click on an unused portion of the screen, and choose New Log Settings. We're going to name this trace log Pikachu. I'm not especially fond of the character or the animation, but I can't resist the pun; however, if that's too cutesy for you, you're free to name it whatever you like, just click the Ok Button after you give it a name. For now, we're going to record events from the default System Provider, so check the Events Logged by System Provider Radio Button. We don't care about Disk I/O or TCP/IP, so uncheck those to help reduce the size of this log. Then select the Schedule Tab and in the Start Log Controls Group, change the start time to begin at 11:50 PM of the current night. Then under the Stop Log Controls Group, Select the At Radio Button and change the stop time to 12:10 AM the following day.

These events (Process creations/deletions, Thread creations/deletions) will accumulate a lot of data very quickly. Depending on what's running, you can expect this to be measured in gigabytes (or larger), and bad things will happen if you fill the disk drive that is running either Windows or your other applications. Ideally, you would record this to a dedicated location, either a separate partition or external drive. You can change the location of the log file in the General Tab in the Current Log File Name Textbox (at the top). Alternatively, you could change the schedule to stop when the log reaches a certain size, rather than at a specific time, but their is no way to know in advance how much time will elapse for a given size on disk — this depends on how many events are actually triggered.

Microsoft's Log Parser Tool

The event trace log file (ETL) created by the trace isn't easy to read directly. You can use the OS tool TraceRpt.exe to generate a couple of standard reports (Summary.txt and Workload.txt) as well as a Comma-Separated-Variable (CSV) file that can be examined with Excel or a text editor, but the information we're after isn't easy to glean from these files. Instead, you can download the latest copy of Microsoft's LogParser tool. You can go to www.Microsoft.com and click on the Download Center link from the Downloads & Trials menu, then search on LogParser to find the most recent release. I'm using LogParser Version 2.2.

After installing the LogParser utility, you'll see the Log Parser Program Group with a Log Parser application under your Start Menu. Running the application opens a command terminal (DOS) with the command help for LogParser. To export our data to XML, you should run the command:

LogParser -i:ETW "SELECT * INTO C:\PerfLogs\Pikachu_000004.XML 
FROM C:\PerfLogs\Pikachu_000004.etl" -o:XML

As shown, this command will read and parse the fourth trace log file Pikachu_000004.etl and export the results to the file Pikachu_000004.xml. The specific input file and output file should be modified in the command above to match your circumstances. From here, you should see each event as a separate ROW node. These events are originally recorded in time-order and are consequently in time-order in the output file. You can ensure this by adding a sort order to the query by appending "ORDERBY Timestamp ASC" at the end of the query. Time-ordered data is important for locating a time specific event; from here we can look forward through the data until we find events a few minutes before our performance dip.

Findings

Unfortunately, there were no threads or processes starting prior to the performance anomalies, so I couldn't use my results for this article. The problem turned out to be a bug in the logging task which did not account for hour zero. This was found after investigating why the application logs did not record the task restarts observed in the initial counter log recordings. In your own troubleshooting efforts, however, if you find something worth tracing back, you can find a ProcessId for each thread shown in the XML output. This can be used to trace back the thread to a specific process. Most of the processes have an ImageFileName under UserData which can be used to identify a specific application.

Here's an example of a Thread being started at 13:35:53.

XML

<eventnumber>  2464 </eventnumber>
  <eventname>
  Thread  </eventname>
  <eventtypename>
  Start  </eventtypename>
  <timestamp>
  2010-12-13 13:35:53  </timestamp>
  <userdata>
  ProcessId=856|TThreadId=3020|StackBase=0xA9B0B000|StackLimit=0xA9B08000|
  UserStackBase=0x00940000|UserStackLimit=0x0093A000|StartAddr=0x7C8106F9|
  Win32StartAddr=0x75738D6B|WaitMode=ÿ
</userdata> </row>

The ProcessId here is 856. Searching the XML file for this Process ID reveals several additional threads, but at 13:12:00, DCStart event for the process is recorded. In this row, you can see the name of the application that's spawning all these threads.

<row><eventnumber>  233
</eventnumber>  <eventname>
  Process
</eventname>  <eventtypename>
  DCStart
</eventtypename>  <timestamp>
  2010-12-13 13:12:00
</timestamp>  <userdata>
  PageDirectoryBase=0x0000C1EA|ProcessId=856|ParentId=796|SessionId=0|
  ExitStatus=0|UserSID=S-1-5-18|ImageFileName=lsass.exe</userdata>

Counters and traces provide a whole new world of opportunity for quickly identifying complex bugs. As software complexity continues to grow, tools such as these are becoming far more critical to everyday programming. I encourage you to spend a little time looking at other uses for these tools. If you do, you'll find as did I that they are far more attuned to administration of a long-running server, than to short-term debugging efforts. However, as their usefulness continues to be demonstrated, more tools will be developed which can better correlate cause and effect events. The data is there, but combing through a raw XML file is hardly a preferred method for associating a thread with a change in performance event, let alone tracing the thread back to a specific application. Nevertheless, I hope you are encouraged to try these tools in your own debugging efforts. Their potential is extraordinary.

History

Initial release. 12 December 2010
Images added. 27 December 2010
Grammatical and spelling corrections. 28 December 2010

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

knockNrod

Software Developer (Senior) TenovaCore

United States

Rod currently works in The Burgh using .Net technologies to create Human-Machine Interfaces for steel-making furnaces. He has 7 years in the steel industry, and 12 years of experience working in the nuclear industry at Savannah River Site in Aiken, SC. He enjoys riding his Honda Goldwing through the winding Pennsylvania farmlands, and taking his Jeep offroad.