Busy… Starting Role… Repeat
So you deployed a Cloud Service and its status is stuck on “Busy (Starting role… Sites were deployed.)” By now you’ve probably checked the SDK assembly versions, verified that your Cloud Service runs in the Azure Emulator, and tried everything else you could think of. So what’s next?
In August 2013 Kevin Williamson published an awesome blog series titled Windows Azure PaaS Compute Diagnostics Data. It contains a wealth of priceless information that ended my 24-hour debugging session.
Since my Cloud Service already had a Remote Desktop Connection configured at publish time, I went ahead with the proposed path to diagnose the issue. This adventure ended abruptly when I was presented with this prompt.
I decided to remove the Remote Desktop configuration from my Cloud Service and redeploy it to Azure. This allowed me to try to dynamically enable Microsoft Azure Remote Desktop. I used this DIY Microsoft Azure Troubleshooting Tutorial, which takes you through the steps.
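If you prefer scripting it, roughly the same thing can be done from your development machine with the classic Azure Service Management PowerShell module. Treat this as a sketch; Set-AzureServiceRemoteDesktopExtension is the relevant cmdlet, and the service, slot, and role names below are placeholders for your own:

```powershell
# Sketch: enable Remote Desktop on an already-deployed Cloud Service (classic/ASM module).
# "MyCloudService" and "MyWebRole" are placeholders.
Import-Module Azure
Add-AzureAccount                                   # interactive sign-in
$cred = Get-Credential                             # RDP account to create on the instances
Set-AzureServiceRemoteDesktopExtension -ServiceName "MyCloudService" `
                                       -Slot Production `
                                       -Role "MyWebRole" `
                                       -Credential $cred `
                                       -Expiration (Get-Date).AddDays(7)
```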
After a few minutes, I was finally able to open a Remote Desktop session to one of my Cloud Service Role instances. Finally I got a glimpse of light from the other end of the tunnel!
Troubleshooting effectively and in a timely manner requires you to install a couple of tools. I was happy to find out that Microsoft has made this process easy for us. On the Azure VM hosting the Cloud Service Role, open a PowerShell console, then copy, paste, and run the following script:
md c:\tools; Import-Module BitsTransfer; Start-BitsTransfer http://dsazure.blob.core.windows.net/azuretools/AzureTools.exe c:\tools\AzureTools.exe; c:\tools\AzureTools.exe
* This only works on Guest OS Family 2 or later.
PowerShell will start AzureTools and present you with the following screen. Right-click the tool of your choice and select Download from the context menu.
When I troubleshoot, I usually need a way to persist files to blob storage. The second tab in AzureTools allows you to do just that.
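If you would rather skip the tool for this part, a rough equivalent with the classic Azure storage cmdlets looks like the sketch below; the storage account name, key, and container are placeholders:

```powershell
# Sketch: push a file (a log, a memory dump, etc.) from the VM to blob storage.
$ctx = New-AzureStorageContext -StorageAccountName "mydiagstorage" `
                               -StorageAccountKey "<storage-account-key>"
New-AzureStorageContainer -Name "troubleshooting" -Context $ctx -ErrorAction SilentlyContinue
Set-AzureStorageBlobContent -File "C:\Logs\AppAgentRuntime.log" `
                            -Container "troubleshooting" `
                            -Context $ctx
```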
The third tab is full of great shortcuts and tools that help you focus on finding the root cause you’re after.
From this tab, start by clicking Open Log Files. This opens the logs most likely to shed light on what’s going on. I started with WaAppAgent.log, which showed the Role starting and stopping at a regular interval. Then I looked at AppAgentRuntime.log, which provided a lot more information (and where I got lost for a little while). If something is going wrong, this is a great place to start: it contains a wealth of detailed information about the Roles.
Open Log Files – This will open the current log file for all of the log files most commonly used when troubleshooting on an Azure VM (see Windows Azure PaaS Compute Diagnostics Data). This is useful when you are troubleshooting an issue and you don’t quite know where to begin looking, so you want to see all of the data without having to go searching for it.
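If you prefer to watch these logs from the console instead of opening them in Notepad, here is a quick sketch (assuming PowerShell 3.0 or later is available on the Guest OS):

```powershell
# Tail the guest agent logs; -Wait keeps streaming new entries as they are written.
Get-Content C:\Logs\WaAppAgent.log      -Tail 50 -Wait   # heartbeats / health status
Get-Content C:\Logs\AppAgentRuntime.log -Tail 50 -Wait   # role lifecycle events
```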
Then I navigated to the Event Viewer to get a feel for the exceptions that could be contributing to the Role’s behavior.
From the Windows Azure Event Log I was able to find clues as to what was happening.
WaIISHost Role entrypoint could not be created:
System.TypeLoadException: Unable to load the role entry point due to the following exceptions:
-- System.IO.FileLoadException: Could not load file or assembly 'System.Web.Http, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)
File name: 'System.Web.Http, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'
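You can also pull these entries from PowerShell instead of clicking through Event Viewer. The exact channel name can vary by SDK and Guest OS version, so treat this as a sketch and confirm the name with Get-WinEvent -ListLog first:

```powershell
# Find the Windows Azure event log channel, then dump the latest entries.
Get-WinEvent -ListLog *Azure*
Get-WinEvent -LogName "Windows Azure" -MaxEvents 25 |
    Format-List TimeCreated, LevelDisplayName, Message
```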
The event log entry was helpful, but which library actually depends on this assembly version? To find out, go to the third tab in AzureTools and click Fusion Logging. This toggles verbose .NET Fusion logging on or off. Keep in mind that it slows down the Role, but it gives you insight into bindings and dependencies between assemblies.
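I can’t say exactly what AzureTools does under the hood, but the standard way to enable Fusion logging by hand is through the registry. A sketch, to be reverted once you are done, because verbose binding logs slow everything down:

```powershell
# Enable verbose .NET Fusion (assembly binder) logging via the registry.
$fusion = "HKLM:\SOFTWARE\Microsoft\Fusion"
New-Item "C:\FusionLogs" -ItemType Directory -Force | Out-Null
Set-ItemProperty $fusion -Name EnableLog   -Value 1 -Type DWord
Set-ItemProperty $fusion -Name ForceLog    -Value 1 -Type DWord
Set-ItemProperty $fusion -Name LogFailures -Value 1 -Type DWord
Set-ItemProperty $fusion -Name LogPath     -Value "C:\FusionLogs\" -Type String
# Recycle the role (or restart the host process) so the binder picks up the change.
```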
=== Pre-bind state information ===
LOG: DisplayName = System.Web.Http, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35 (Fully-specified)
LOG: Appbase = file:///E:/approot/bin
LOG: Initial PrivatePath = E:\approot\bin
Calling assembly : WebApi.OutputCache.V2, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null.
===
LOG: This bind starts in default load context.
LOG: Using application configuration file: E:\base\x64\WaIISHost.exe.Config
LOG: Using host configuration file:
LOG: Using machine configuration file from D:\Windows\Microsoft.NET\Framework64\v4.0.30319\config\machine.config.
LOG: Post-policy reference: System.Web.Http, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35
LOG: Attempting download of new URL file:///E:/approot/bin/System.Web.Http.DLL.
WRN: Comparing the assembly name resulted in the mismatch: Minor Version
ERR: Failed to complete setup of assembly (hr = 0x80131040). Probing terminated.
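The log shows that WebApi.OutputCache.V2 is the calling assembly and that the minor version doesn’t match. A quick way to confirm which version of System.Web.Http actually made it into the package (a sketch; adjust the path to your approot or local build output):

```powershell
# Inspect the version of a single assembly...
[System.Reflection.AssemblyName]::GetAssemblyName("E:\approot\bin\System.Web.Http.dll").Version

# ...or list every assembly in the bin folder with its version.
Get-ChildItem "E:\approot\bin\*.dll" | ForEach-Object {
    "{0} -> {1}" -f $_.Name, [System.Reflection.AssemblyName]::GetAssemblyName($_.FullName).Version
}
```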
Having identified the dependency that caused this issue, I decided to contribute to the GitHub project by updating all of its dependencies and its target framework to 4.5.1. The code can be downloaded from my GitHub repository. Subsequently, I created a pull request so that everyone working on the latest version of Web API may benefit from this adventure.
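Updating the package was the right long-term fix, but if you cannot touch the dependency, the usual workaround is an assembly binding redirect in the configuration file that the failing host actually reads (for role entry point code that is the WaIISHost.exe.Config shown in the Fusion log, not just web.config). The versions below are illustrative; match them to what you really ship:

```xml
<!-- Illustrative binding redirect: map older System.Web.Http references
     onto the 5.x assembly that is actually deployed in the bin folder. -->
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="System.Web.Http"
                          publicKeyToken="31bf3856ad364e35" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-5.0.0.0" newVersion="5.0.0.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```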
If you’re curious about Azure Roles and how to troubleshoot them, I recommend browsing through the following resources. I found them very good at pointing me in the right direction when I was out of ideas.
- Windows Azure – Troubleshooting & Debugging
- The Diagnostic Utility used by the Windows Azure Developer Support Team
- Troubleshooting Scenario 1 – Role Recycling
- Troubleshooting Scenario 2 – Role Recycling After Running Fine For 2 Weeks
- Troubleshooting Scenario 3 – Role Stuck in Busy
- Troubleshooting Scenario 4 – Windows Azure Traffic Manager Degraded Status
- Troubleshooting Scenario 5 – Internal Server Error 500 in WebRole
- Troubleshooting Scenario 6 – Role Recycling After Running For Some Time
- Troubleshooting Scenario 7 – Role Recycling
- DIY Microsoft Azure Troubleshooting
- Enabling Microsoft Azure Remote Desktop When Publishing
Diagnostic Data Locations
This list includes the most commonly used data sources when troubleshooting issues in a PaaS VM, roughly ordered by importance (ie. how frequently the log is used to diagnose issues). A small PowerShell sketch at the end of the list shows one way to gather the most common ones in one go.
- Windows Azure Event Logs – Event Viewer –> Applications and Services Logs –> Windows Azure
- Contains key diagnostic output from the Windows Azure Runtime, including information such as Role starts/stops, startup tasks, OnStart start and stop, OnRun start, crashes, recycles, etc.
- This log is often overlooked because it is under the “Applications and Services Logs” folder in Event Viewer and thus not as visible as the standard Application or System event logs.
- This one diagnostic source will help you identify the cause of several of the most common issues with Azure roles failing to start correctly – startup task failures, and crashing in OnStart or OnRun.
- Captures crashes, with callstacks, in the Azure runtime host processes that run your role entrypoint code (ie. WebRole.cs or WorkerRole.cs).
- Application Event Logs – Event Viewer –> Windows Logs –> Application
- This is standard troubleshooting for both Azure and on-premise servers. You will often find w3wp.exe related errors in these logs.
- App Agent Runtime Logs – C:\Logs\AppAgentRuntime.log
- These logs are written by WindowsAzureGuestAgent.exe and contain information about events happening within the guest agent and the VM. This includes information such as firewall configuration, role state changes, recycles, reboots, health status changes, role stops/starts, certificate configuration, etc.
- This log is useful to get a quick overview of the events happening over time to a role since it logs major changes to the role without logging heartbeats.
- If the guest agent is not able to start the role correctly (ie. a locked file preventing directory cleanup) then you will see it in this log.
- App Agent Heartbeat Logs – C:\Logs\WaAppAgent.log
- These logs are written by WindowsAzureGuestAgent.exe and contain information about the status of the health probes to the host bootstrapper.
- The guest agent process is responsible for reporting health status (ie. Ready, Busy, etc) back to the fabric, so the health status as reported by these logs is the same status that you will see in the Management Portal.
- These logs are typically useful for determining what is the current state of the role within the VM, as well as determining what the state was at some time in the past. With a problem description like “My website was down from 10:00am to 11:30am yesterday”, these heartbeat logs are very useful to determine what the health status of the role was during that time.
- Host Bootstrapper Logs – C:\Resources\WaHostBootstrapper.log
- This log contains entries for startup tasks (including plugins such as Caching or RDP) and health probes to the host process running your role entrypoint code (ie. WebRole.cs code running in WaIISHost.exe).
- A new log file is generated each time the host bootstrapper is restarted (ie. each time your role is recycled due to a crash, recycle, VM restart, upgrade, etc) which makes these logs easy to use to determine how often or when your role recycled.
- IIS Logs – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\LogFiles\Web
- This is standard troubleshooting for both Azure and on-premise servers.
- One key problem scenario where these logs are often overlooked is the scenario of “My website was down from 10:00am to 11:30am yesterday”. The natural tendency is to blame Azure for the outage (“My site has been working fine for 2 weeks, so it must be a problem with Azure!”), but the IIS logs will often indicate otherwise. You may find increased response times immediately prior to the outage, or non-success status codes being returned from IIS, which would indicate a problem within the website itself (ie. in the ASP.NET code running in w3wp.exe) rather than an Azure issue.
- Performance Counters – perfmon, or Windows Azure Diagnostics
- This is standard troubleshooting for both Azure and on-premise servers.
- The interesting aspect of these logs in Azure is that, assuming you have setup WAD ahead of time, you will often have valuable performance counters to troubleshoot problems which occurred in the past (ie. “My website was down from 10:00am to 11:30am yesterday”).
- Other than specific problems where you are gathering specific performance counters, the most common use for the performance counters gathered by WAD is to look for regular performance counter entries, then a period of no entries, then a resumption of the regular entries (indicating a scenario where the VM was potentially not running), or 100% CPU (usually indicating an infinite loop or some other logic problem in the website code itself).
- HTTP.SYS Logs – D:\Windows\System32\LogFiles\HTTPERR
- This is standard troubleshooting for both Azure and on-premise servers.
- Similar to the IIS Logs, these are often overlooked but very important when trying to troubleshoot an issue with a hosted service website not responding. Often times it can be the result of IIS not being able to process the volume of requests coming in, the evidence of which will usually show up in the HTTP.SYS logs.
- IIS Failed Request Log Files – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\FailedReqLogFiles
- This is standard troubleshooting for both Azure and on-premise servers.
- This is not turned on by default in Windows Azure and is not frequently used. But if you are troubleshooting IIS/ASP.NET specific issues you should consider turning FREB tracing on in order to get additional details.
- Windows Azure Diagnostics Tables and Configuration – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\Monitor
- This is the local on-VM cache of the Windows Azure Diagnostics (WAD) data. WAD captures the data as you have configured it, stores it in custom .TSF files on the VM, then transfers it to storage based on the scheduled transfer period time you have specified.
- Unfortunately because they are in a custom .TSF format the contents of the WAD data are of limited use, however you can see the diagnostics configuration files which are useful to troubleshoot issues when Windows Azure Diagnostics itself is not working correctly. Look in the Configuration folder for a file called config.xml which will include the configuration data for WAD. If WAD is not working correctly you should check this file to make sure it is reflecting the way that you are expecting WAD to be configured.
- Windows Azure Caching Log Files – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\AzureCaching
- These logs contain detailed information about Windows Azure role-based caching and can help troubleshoot issues where caching is not working as expected.
- WaIISHost Logs – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\WaIISHost.log
- This contains logs from the WaIISHost.exe process which is where your role entrypoint code (ie. WebRole.cs) runs for WebRoles. The majority of this information is also included in other logs covered above (ie. the Windows Azure Event Logs), but you may occasionally find additional useful information here.
- IISConfigurator Logs – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\IISConfigurator.log
- This contains information about the IISConfigurator process which is used to do the actual IIS configuration of your website per the model you have defined in the service definition files.
- This process rarely fails or encounters errors, but if IIS/w3wp.exe does not seem to be setup correctly for your service then this log is the place to check.
- Role Configuration Files – C:\Config\{DeploymentID}.{DeploymentID}.{Rolename}.{Version}.xml
- This contains information about the configuration for your role such as settings defined in the ServiceConfiguration.cscfg file, LocalResource directories, DIP and VIP IP addresses and ports, certificate thumbprints, Load Balancer Probes, other instances, etc.
- Similar to the Role Model Definition File, this is not a log file which contains runtime generated information, but can be useful to ensure that your service is being configured in the way that you are expecting.
- Role Model Definition File – E:\RoleModel.xml (or F:\RoleModel.xml)
- This contains information about how your service is defined according to the Azure Runtime, in particular it contains entries for every startup task and how the startup task will be run (ie. background, environment variables, location, etc). You will also be able to see how your <sites> element is defined for a web role.
- This is not a log file which contains runtime generated information, but it will help you validate that Azure is running your service as you are expecting it to. This is often helpful when a developer has a particular version of a service definition on his development machine, but the build/package server is using a different version of the service definition files.
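As promised above, here is a small sketch that pulls the most commonly needed files from these locations into a single folder, ready to be zipped and copied off the VM. The DiagnosticStore paths are deployment-specific, so the wildcards below are only an approximation:

```powershell
# Collect the most frequently used diagnostic files into one folder.
$dest = "C:\tools\collected-logs"
New-Item $dest -ItemType Directory -Force | Out-Null
Copy-Item C:\Logs\WaAppAgent.log               $dest -ErrorAction SilentlyContinue
Copy-Item C:\Logs\AppAgentRuntime.log          $dest -ErrorAction SilentlyContinue
Copy-Item C:\Resources\WaHostBootstrapper*.log $dest -ErrorAction SilentlyContinue
Get-ChildItem "C:\Resources\Directory\*.DiagnosticStore\LogFiles\Web\*" -ErrorAction SilentlyContinue |
    Copy-Item -Destination $dest -ErrorAction SilentlyContinue
```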
* A note about ETL files
If you look in the C:\Logs folder you will find RuntimeEvents_{iteration}.etl and WaAppAgent_{iteration}.etl files. These are ETW traces which contain a compilation of the information found in the Windows Azure Event Logs, Guest Agent Logs, and other logs. This is a very convenient compilation of all of the most important log data in an Azure VM, but because they are in ETL format it requires a few extra steps to consume the information. If you have a favorite ETW viewing tool then you can ignore several of the above mentioned log files and just look at the information in these two ETL files.
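If you do not have a favorite ETW viewer, the built-in tracerpt utility can convert these traces into something readable. The file name below is only an example; substitute the actual {iteration} suffix you find in C:\Logs:

```powershell
# Convert an ETL trace to CSV with the built-in tracerpt tool.
tracerpt C:\Logs\RuntimeEvents_000001.etl -o C:\tools\RuntimeEvents.csv -of CSV
```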
You can find more details in Windows Azure PaaS Compute Diagnostics Data.