Server 2008 Reliability & Performance Monitor – Part 3
Part 3 — Reliability Monitor: What’s Working and What’s Not
Systems administrators can use Windows Reliability and Performance Monitor to gather baseline information for reviewing system performance. This allows admins to review their server installations and tune up their Server 2008 systems.
In this series of articles we’ve been reviewing the most common uses and major functions of the tool.
Part 1 – Introduction to the Reliability and Performance Monitor gave an overview of the tool, introducing its basic functions, the interface, and some of the initial features and default settings.
Part 2 – Performance Monitor Demystified was a review of some of the features and functions of the Performance part of the tool, including some of the best practices with respect to collecting and working with Performance Logs.
In this final installment, Reliability Monitor: What’s Working and What’s Not, we’ll go over some of the features and best practices for troubleshooting problems found in the reports the tool produces.
What Can You Do With The Reliability Monitor?
The Reliability Monitor part of the Reliability and Performance Monitor MMC allows you to review a computer’s stability details with respect to the events that impact the reliability of the system.
These events may be positive, such as the successful installation of an update, or negative, such as a software operation that stops working or otherwise fails.
The tool calculates a Stability Index from all of these events and plots it on the System Stability Chart over the span of system uptime (up to a maximum of the past rolling year).
The screen shot below shows the reference system chart and subsequent events at the time of first start up of the operating system.
[NOTES FROM THE FIELD] – If I had expanded the details of the Software (Un)Installs section, it would be readily apparent that this was a freshly completed installation of the operating system, as there were over 100 successful software updates and driver installs.
You may also see that within a couple of working days some events negatively affected system reliability, which caused the “perfect” index of 10 to begin decreasing.
Over time, normal system use will invariably produce events that take a system’s index rating down from 10.00. The main use of the tool, other than a quick review of recent and historical events, is to let the system administrator quickly gauge what is going on and how it is impacting the system. Additionally, reviewing a series of reliability drops might be a good starting point for troubleshooting efforts.
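As a rough illustration of reviewing a series of reliability drops, the sketch below scans a list of daily index readings and reports the days on which the index fell. The readings, the `find_index_drops` helper, and the `min_drop` threshold are all made up for the example; the Reliability Monitor itself shows this information graphically, and this is not how the tool computes or exposes its index.

```python
from datetime import date


def find_index_drops(readings, min_drop=0.1):
    """Return (day, previous, current) for each day the index fell by at least min_drop.

    `readings` is a list of (date, index) pairs in chronological order.
    """
    drops = []
    # Compare each reading with the one from the previous day.
    for (_, prev_value), (day, value) in zip(readings, readings[1:]):
        if prev_value - value >= min_drop:
            drops.append((day, prev_value, value))
    return drops


# Hypothetical daily readings, not taken from a real system.
readings = [
    (date(2008, 6, 2), 10.00),
    (date(2008, 6, 3), 10.00),
    (date(2008, 6, 4), 9.52),
    (date(2008, 6, 5), 9.60),
    (date(2008, 6, 6), 8.91),
]

for day, before, after in find_index_drops(readings):
    print(f"{day}: {before:.2f} -> {after:.2f}")
```

The days this flags are the ones worth opening in the System Stability Report first.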
[NOTES FROM THE FIELD] – There’s a very good reason why above I indicated “a good starting point for troubleshooting efforts.”
On this particular system I had daily events where Outlook was crashing. This was peculiar to me because I had performed no recent updates to the operating system nor to any of the applications so I wasn’t sure what was now suddenly causing an issue. And the system’s performance was off in that it was responding a little slowly. This was shown daily in the System Stability Chart as OUTLOOK.EXE with a Failure Type of “stopped working.”
I continued to think the issue was with Outlook over the next couple of days, as it was the only application on my system showing a significant impact (nothing else was showing in the Reliability Monitor). In short order, the system performance got really bad and I had the time to do some additional troubleshooting. I turned to the Event Viewer, which revealed a slew of errors in the System log with Event ID 55.
The issue I was actually experiencing was a file system structure corruption, which was making it difficult for the operating system and the Outlook application to make needed reads and writes to the file system in the application volume. After I ran the chkdsk utility on the volume and fixed the issues, my problems disappeared and the System Stability Chart began to show an improvement in the index rating.
The moral of the story here is that the trouble you’re having with an application is not always the direct fault of the application itself. Always check seemingly unrelated events and determine whether they are in fact related to the issue you’re having.
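The Event Viewer check described in this note can also be done from the command line. The following is a minimal sketch, assuming Python is available on the server: it builds a `wevtutil qe` query for Event ID 55 in the System log and runs it only when executed on Windows. The `build_event_query` helper and its parameter names are illustrative and not part of any Microsoft tool.

```python
import subprocess
import sys


def build_event_query(log_name, event_id, max_events=10):
    """Build a wevtutil command line that returns the newest matching events as text."""
    # XPath filter selecting events with the given Event ID.
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", log_name,
        f"/q:{xpath}",
        f"/c:{max_events}",  # limit the number of events returned
        "/rd:true",          # read newest events first
        "/f:text",           # human-readable output instead of XML
    ]


command = build_event_query("System", 55)
print(" ".join(command))

# Only attempt to run the query on Windows, where wevtutil is available.
if sys.platform == "win32":
    result = subprocess.run(command, capture_output=True, text=True)
    print(result.stdout or result.stderr)
```

If the query returns entries, compare their dates with the application failures shown in the System Stability Chart before deciding where the fault lies.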
The System Stability Report
The information in the System Stability Chart is available not only in graphical form but also as table data at the bottom, which is part of the System Stability Report.
In the report you will find information organized by date, section, and event for the following categories:
- Software (Un)Installs
- Application Failures
- Hardware Failures
- Windows Failures
- Miscellaneous Failures
The information provided in the different sections of details of the events will be highlighted by the regular icon family that was standardized for use under Windows Vista. These icons highlight the results as they occur and are reflective of the event — informational, warning, and error. Additional details for these standard icons are available at Microsoft MSDN.
[NOTES FROM THE FIELD] – You won’t generally see the question mark icon used to indicate a Help entry point within the Reliability Monitor.
• Software (Un)Installs
In the Software (Un)Installs section you will see details regarding:
- software being reported out (the name of the application being installed or removed)
- version of that software
- activity (system update install, driver install, application install, application configuration change, etc.)
- activity status (success/failure)
- date the activity took place
• Application Failures
In the Application Failures section you can see information pertaining to:
- listed application that experienced the reported problem
- version of that application as reported
- failure type
- date the event occurred
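To make the columns above concrete, here is a small sketch that models Application Failures rows as records and counts failures per application, which is one way to spot a repeat offender. The `ApplicationFailure` class, the `repeat_offenders` helper, and all sample values (including the version strings) are hypothetical; the Reliability Monitor does not export data in this form.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ApplicationFailure:
    application: str   # application that experienced the problem
    version: str       # version as reported
    failure_type: str  # for example "Stopped working"
    occurred_on: date  # date the event occurred


def repeat_offenders(failures, min_count=2):
    """Return (application, count) pairs for applications that failed at least min_count times."""
    counts = Counter(f.application for f in failures)
    return [(app, n) for app, n in counts.most_common() if n >= min_count]


# Made-up sample rows for illustration only.
failures = [
    ApplicationFailure("OUTLOOK.EXE", "x.y.z", "Stopped working", date(2008, 6, 4)),
    ApplicationFailure("OUTLOOK.EXE", "x.y.z", "Stopped working", date(2008, 6, 5)),
    ApplicationFailure("example.exe", "1.0", "Stopped working", date(2008, 6, 5)),
    ApplicationFailure("OUTLOOK.EXE", "x.y.z", "Stopped working", date(2008, 6, 6)),
]

for app, count in repeat_offenders(failures):
    print(f"{app}: {count} failures")
```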
[NOTES FROM THE FIELD] – It’s important to note that most of the details within the System Stability Report will normally also show up somewhere in the Event Viewer logs with the same reported information.
Generally there will be additional information within the Event Viewer logs (event IDs, error codes, etc.).
• Hardware Failures
In the Hardware Failures section you will see details regarding:
- reported component type
- failure type
- date the event occurred
The information commonly shown within this section of the tool is going to be limited to memory and disk failures.
[NOTES FROM THE FIELD] – Earlier in the article I mentioned I had an issue where corruption of my file system was causing an issue with Outlook. The problem didn’t show up in the Hardware Failures section because the failure was not a physical disk problem (bad sectors, controller failure, etc) but a problem with the NTFS file system.
• Windows Failures
The next set of details within the Reliability Monitor is shown in the Windows Failures section. The Failure Type section will detail a boot failure (in circumstances where a subsequent attempt to start the system succeeds and the original failure can actually be logged).
More often than a boot failure event, you will see information as a result of failures from the operating system processes. The Device section will report out which device is failing and the Failure Type will be outlined in the next column. Finally the date of the event will be shown in the Date column.
• Miscellaneous Failures
All events that do not fit into the above categories will show up in the Miscellaneous Failures section. One of the major events that will show up in the Failure Type section is a scenario where the system shutdown was unexpected.
The Version column will indicate the version of the software that failed (in the example above this would be the operating system version and the installed service pack, if applicable). The Failure Detail column will show the information pertinent to the type of failure, and the Date column will show when the event took place.
[NOTES FROM THE FIELD] – There is an additional reliability event that may be recorded in special circumstances when a significant change to the system time is tracked. The System Clock Changes category will show information on the day that a significant clock change occurs and will be headed by the Information icon.
The Old Time section will contain the date and time prior to the clock change and the New Time section will contain the date and time selected during the clock change.
The Date column will reflect the date and time when the clock change occurred. The entry is recorded using the newly applied time, so it is consistent with the new system time in use.
And with that we are at the end of the Reliability and Performance Monitor series.
I hope you found this article series informative and a good investment of your time. I welcome any feedback that you might have on it. Additionally, I welcome any input on topics of interest that you would like to see, and based on demand and column space I’ll do what I can to deliver them to you.
Best of luck in your studies.