Lately I have been troubleshooting a problem that occurred after installing Update Rollup 2 in a development environment.
My first indication of a problem was when I asked a colleague to take a look at a new request offering I had published. When he tried to log on to the portal he got access denied, something I guessed had to do with him not being in the CMDB. I searched for him under Configuration Items > Users, but he was not there. So I checked where he is placed in Active Directory and found that he should have been picked up by the AD connector. That was odd. But the connector had not run for a week, so I thought he might have been moved there recently, and I triggered a new synchronization. After a while it looked like it got stuck at 17%.
So I restarted the SCSM services, but that did not help. It did not help to restart the server either.
Time to troubleshoot
Windows Event Log
Service Manager logs a fair amount of information to the Windows Event Log. You will find the events under Application and Services Logs > Operations Manager.
In my case I saw a pattern of events.
Source: HealthService (the System Center Management service)
Event id: 1103
Level: Warning
Summary: 155 rule(s)/monitor(s) failed and got unloaded, 155 of them reached the failure limit that prevents automatic reload. Management group "DEVELOPMENT". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).
Source: HealthService
Event id: 4000
Level: Error
A monitoring host is unresponsive or has crashed. The status code for the host failure was 2164195371.
And a lot of these. I never counted, but I guess there were 155.
Source: HealthService
Event id: 1206
Level: Information
Rule/Monitor "Microsoft.SystemCenter.CollectDiscoveryData", running for instance "Microsoft.ServiceManager.InternalDiscoveryCollectorTarget" with id:"{DB18B9A2-0117-0D4F-B484-C3060D1C31F0}" failed, got unloaded and reached the failure limit that prevents automatic reload. Management group "DEVELOPMENT".
This pattern reappeared each time I restarted the services or the server.
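To actually count the events instead of guessing, one option is to export the log from Event Viewer to a CSV file and tally the event ids. The following is only a minimal sketch: the function name is made up, the column names "Source" and "Event ID" are an assumption about how the export is laid out, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_event_ids(csv_text, source="HealthService"):
    """Tally event ids for one event source in an exported event log.

    Assumes (not verified against a real export) that the CSV has
    header columns named "Source" and "Event ID".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        int(row["Event ID"]) for row in reader if row["Source"] == source
    )


# Invented sample data, shaped like the pattern described above.
sample = """Level,Date and Time,Source,Event ID
Warning,2014-01-01 10:00:00,HealthService,1103
Error,2014-01-01 10:00:05,HealthService,4000
Information,2014-01-01 10:00:06,HealthService,1206
Information,2014-01-01 10:00:07,HealthService,1206
"""

counts = count_event_ids(sample)
for event_id, count in sorted(counts.items()):
    print(event_id, count)
```

If the number of 1206 events matches the number in the 1103 summary, that supports reading the 1206 events as the per-rule details behind the summary.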
Disabling the connector
Suspecting that the AD connector workflow was timing out and causing trouble for the other workflows, I disabled the AD connector. It still showed up as Running at 17%, even after restarting the console, the Service Manager services and even the server. The pattern in the event log was still there.
Troubleshooting the database / moving the watermark
I searched for event id 4000 and found references to MonitoringHost.exe. In my case it did not run for more than a couple of minutes before disappearing. This is an essential process: it is needed to run workflows, and you will see one or more instances running on the Service Manager server under different workflow accounts.
I did find a related blog post by Nathan Lasnoski, "SCSM A monitoring host is unresponsive or has crashed Error 4000", which pointed me to the database for troubleshooting. When running the SQL query from that post I found several workflows that were more than 4500 minutes behind, many more than I would bother to disable one by one in a test/dev environment. So I tried a quick fix from the System Center blog and moved the watermark. See the blog post "Troubleshooting Workflow Performance and Delays":
If the workflows are hopelessly behind and you want them to catch up immediately, you can update the State column on the CmdbInstanceSubscriptionState table by running this query:
-- Move the watermark for one rule forward to the newest transaction
DECLARE @MaxTransactionId Int, @RuleId uniqueidentifier
SET @RuleId = '6AA6B941-375B-E3AD-7FB647FC7B3E' --<-- set this to your rule id!!
SET @MaxTransactionId = (SELECT MAX(EntityTransactionLogID) FROM EntityTransactionLog)
UPDATE CmdbInstanceSubscriptionState
SET State = @MaxTransactionId
WHERE RuleId = @RuleId
(Except that I did not filter by RuleId. I ran the update for every rule.)
The blog also states something to keep in mind:
Keep in mind that running this query will move the watermark forward. Any transactions that would have triggered the subscription criteria and any resulting actions (notifications, workflows, etc) for those transactions that are being skipped over WILL NOT HAPPEN. Be very careful using this update query. Also, keep in mind that the EntityChangeLog table has a grooming routine that grooms rows from that table which are no longer needed. One of the criteria for determining whether or not a row is needed is the position of the watermarks for the workflows. We recommend slowly moving up the watermark so as not to trigger a massive grooming job that would slow down the overall system. We also recommend moving the watermark during non-peak hours in order to minimize the impact of grooming on people using the system.
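The advice to move the watermark up slowly is just arithmetic on the State value: instead of jumping straight to the newest transaction id, you raise it in bounded steps and let grooming catch up in between. Here is a minimal sketch of that idea. The function name and the step size are my own assumptions, not something the quoted blog prescribes, and the sketch only computes the intermediate values; it does not touch the database.

```python
def watermark_steps(current, target, step):
    """Yield successive watermark values from `current` up to `target`.

    Each value is at most `step` transactions ahead of the previous one,
    and the last value is exactly `target`. Nothing is yielded when the
    watermark is already at or past the target.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    value = current
    while value < target:
        value = min(value + step, target)
        yield value


# Hypothetical numbers: watermark at 100, newest transaction id 1000,
# moved forward by at most 400 transactions per run.
for state in watermark_steps(100, 1000, 400):
    print(state)
```

Each value produced would be used as the State in one run of the update query above, ideally spaced out over non-peak hours.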
In my case this did not solve the problem. I had thought it might help, since I had also imported quite a few management packs from the production environment and suspected that this might have been the issue. But the workflows still fell "Minutes behind" quickly, and MonitoringHost.exe kept crashing.
Tracing & Log files
Trying to dig deeper, I started tracing. StartSMTracing.cmd is a tool found in the Service Manager\Tools folder that lets you trace what is happening within Service Manager. It takes a couple of parameters that let you select trace level and trace area. After recreating the issue you stop the trace with StopSMTrace.cmd. You will then have an .etl file, which you can convert to a readable file with FormatTracing.cmd.
The tracing gave me a lot of information, but I was not able to find the reason MonitoringHost.exe was crashing.
The fix for me
Part 1:
Further searches and articles led me to a hotfix for .NET 3.5: "A .NET Framework 3.0-based WCF service may crash with a System.ServiceModel.CommunicationException exception if the service uses the netTcpBinding binding".
Thinking it might be a timeout issue, I hoped this would help, so I installed the hotfix on my server and rebooted. But MonitoringHost.exe still crashed after a little while, as before.
Part 2:
Looking in the Service Manager directory, I saw that there was no config file for MonitoringHost.exe. It had been renamed to .bak during the upgrade to UR2. I had previously tried renaming it back to .config, but it did not help at that point. Still, I thought I would give it a shot and renamed it back once more.
Restarting the System Center Management service now proved successful. It started the MonitoringHost process(es), and they kept running longer than they did before. I have also re-enabled the AD Connector, and it has now completed successfully.
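The check I ended up doing by hand can be sketched in a few lines. This is only an illustration under assumptions: the function name is made up, and I am assuming the config file is called MonitoringHost.exe.config and the renamed copy MonitoringHost.exe.config.bak, so verify the actual file names in your own Service Manager folder before running anything like this.

```python
from pathlib import Path


def restore_config(folder, exe_name="MonitoringHost.exe", backup_suffix=".bak"):
    """Rename a backed-up config file into place if the config is missing.

    The file names are assumptions: "<exe_name>.config" for the config
    and "<exe_name>.config<backup_suffix>" for the renamed copy.
    Returns "present", "restored" or "missing".
    """
    config = Path(folder) / f"{exe_name}.config"
    backup = Path(folder) / f"{exe_name}.config{backup_suffix}"
    if config.exists():
        return "present"  # nothing to do
    if backup.exists():
        backup.rename(config)  # put the backup in place as the config
        return "restored"
    return "missing"  # neither file found
```

Call it with the path to the Service Manager installation folder, then restart the System Center Management service, since renaming alone does not restart anything.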
It was a long way, and it took a bit of time, but I finally was able to solve the problem.
I should of course have updated the servers to R2, but we have not been able to do that yet. Hopefully this might help someone else struggling with similar problems.
/Michael