Sunday, August 03, 2014

Office 365 Availability and Performance Monitoring

How well are you monitoring your Office 365 environment for high availability? Office 365 administrators are asked this on many occasions, and the answer is quite different from traditional on-premises monitoring. In a fully hosted scenario we rely on Microsoft cloud services rather than our own resources, and in a Hybrid environment we rely on both.


The Microsoft Office 365 Dashboard gives you updates on service degradations, performance issues and outages as they occur, but most of the time it is not updated immediately. When an issue is reported, the Microsoft team needs to have a proper update in hand before they post to the Dashboard, and they post only issues that impact customers on a large scale. You get more information in scenarios where they have confirmed that the issue is on their service end; otherwise the issue is treated as a false positive.

You can subscribe to the Office 365 Service Health RSS feeds to get service health updates delivered directly to your Inbox for quick review.
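If you would rather pull those updates into your own tooling than read them in the Inbox, a feed can be parsed with a few lines of code. Below is a minimal Python sketch, assuming the feed is plain RSS 2.0. The sample XML and its incident text are made up for illustration, and the function names are my own; the real feed's URL and fields may differ.

```python
import xml.etree.ElementTree as ET

# Made-up sample in RSS 2.0 shape; a real service health feed will differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Service Health (sample)</title>
    <item>
      <title>Exchange Online: service degradation</title>
      <pubDate>Sun, 03 Aug 2014 09:15:00 GMT</pubDate>
      <description>Some users may see delays in mail delivery.</description>
    </item>
    <item>
      <title>Lync Online: service restored</title>
      <pubDate>Sat, 02 Aug 2014 18:40:00 GMT</pubDate>
      <description>Service has returned to normal.</description>
    </item>
  </channel>
</rss>"""


def parse_feed_items(rss_xml):
    """Return one dict per <item> element found in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        })
    return items


def filter_by_keyword(items, keyword):
    """Keep only the items whose title mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [i for i in items if needle in i["title"].lower()]


if __name__ == "__main__":
    for entry in filter_by_keyword(parse_feed_items(SAMPLE_FEED), "exchange"):
        print(entry["published"], "|", entry["title"])
```

Filtering on the title keeps the noise down when you only care about the messaging workload.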

This post is written from the perspective of monitoring a messaging environment and applies to both fully hosted and Hybrid deployments with Office 365. Some components, like Dirsync and ADFS, are common to the other online services as well.

Monitoring Office 365 with synthetic transactions is vital for determining the performance and availability of the service. A few great vendors provide this in the market, such as Exoprise, Agile IT and Mailscape (ENow Software), and all of them are good options with advanced functionality. Still, most customers prefer to use the in-house tools they have already adopted for monitoring their environment, and the most common one is SCOM. The Microsoft team recently released a new management pack for monitoring Office 365 with System Center 2012 R2, along with the Office 365 Service Communications API, which lets you get real-time updates on service availability from Microsoft.

More information is available here: New Office 365 admin tools

The availability of other resources is critical when you use Office 365 in a Hybrid environment, where your monitoring task spans both your on-premises and cloud organizations, unified as one. Most large enterprise customers use Office 365 Hybrid with SSO through ADFS, and some customers prefer Dirsync password hash sync for same sign-on.

As stated earlier, when we go for Hybrid every component involved in the setup is critical. It begins with your network devices, firewall and proxy, which underpin the connectivity between your environment and the cloud. You can refer to my earlier blog post "Troubleshooting Outlook Connectivity and Performance issues with Office 365" and review the few important articles I referenced there to learn how well you need to set up and monitor these key resources. This applies to fully hosted environments with Office 365 too.

Next are your Hybrid servers. Their availability is crucial for your organization relationship and for mail flow between the two environments. Most organizations use two or more Hybrid servers for high availability, and if you use a load balancer, ensure that it is highly available too. Monitoring these servers comes as part of your on-premises Exchange server monitoring. Take additional care over them, and act promptly when you get an alert or alarm, before something breaks and impacts the environment.
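As a rough illustration of a basic availability probe for these servers, here is a minimal Python sketch that checks whether each Hybrid server accepts TCP connections. The host names are hypothetical, and the ports (25 for SMTP, 443 for HTTPS) are my assumption about a typical Hybrid setup. A port check only tells you that something is listening; it does not replace proper Exchange monitoring in a tool like SCOM.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_hybrid_servers(hosts, ports=(25, 443), probe=port_is_open):
    """Probe every host/port pair and return {(host, port): reachable}."""
    return {(h, p): probe(h, p) for h in hosts for p in ports}


def failed_endpoints(results):
    """Return the sorted list of (host, port) pairs that did not respond."""
    return sorted(endpoint for endpoint, ok in results.items() if not ok)


if __name__ == "__main__":
    # Hypothetical server names; replace with your own Hybrid servers.
    status = check_hybrid_servers(["hybrid01.contoso.com", "hybrid02.contoso.com"])
    for host, port in failed_endpoints(status):
        print("ALERT: %s is not answering on port %d" % (host, port))
```

The `probe` parameter is there so the reporting logic can be exercised without touching the network.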

Next comes your Dirsync server. I put this one before ADFS because it is the simplest way of achieving a same sign-on experience for users. With Dirsync password hash sync, users have the same password on premises and in Office 365 without any additional server requirement. For organizations that need same sign-on alone and don't require any complexity or enhanced security beyond password protection, this is the prominent solution. You can have only one active Dirsync server at a time for directory synchronization, but you can still keep a standby server to take up the active role in case the primary server fails. Install Dirsync on a separate, highly available server. Installing it on a domain controller is still supported, but a separate standalone box is advisable. Monitor this server's availability just as you regularly monitor your other servers.

If this server goes down, you will not see a huge impact right away, because objects that are already synchronized continue to work. However, changes made to existing objects and newly created on-premises objects will not sync to the cloud until the server is back online, which will cause issues at some point. This is crucial if you enable password hash sync, because users will not be able to use a new password changed on premises to access cloud applications until it has been synced.
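Because a stopped sync fails silently, it helps to alert on the age of the last successful synchronization rather than only on server uptime. The Python sketch below shows that check. How you obtain the last sync time is environment-specific and left out here, and the six-hour threshold is an assumption you should tune to your own sync schedule.

```python
from datetime import datetime, timedelta, timezone

# Assumed alert threshold; adjust it to your own sync interval.
MAX_SYNC_AGE = timedelta(hours=6)


def sync_age(last_sync, now=None):
    """Return how long ago the last successful sync happened."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_sync


def sync_is_stale(last_sync, now=None, max_age=MAX_SYNC_AGE):
    """Return True when the last successful sync is older than max_age."""
    return sync_age(last_sync, now) > max_age


if __name__ == "__main__":
    # Hypothetical timestamp; in practice, read it from wherever your
    # environment records the last successful directory sync.
    last = datetime(2014, 8, 3, 1, 0, tzinfo=timezone.utc)
    check_time = datetime(2014, 8, 3, 9, 30, tzinfo=timezone.utc)
    if sync_is_stale(last, now=check_time):
        print("ALERT: directory sync is", sync_age(last, now=check_time), "old")
```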

Up next are your ADFS servers. When you decide to adopt federated identity, your ADFS servers become a very critical component of the entire setup. Organizations that require true SSO with complex requirements, enhanced security and control over user access adopt this model. When you go for ADFS you need to implement internal ADFS servers and ADFS proxy servers, and some organizations use TMG in place of ADFS proxy servers. The latest deployments that use ADFS 3.0 bring some new benefits, and ADFS proxy servers are replaced by Web Application Proxy (WAP), which comes as part of Windows Server 2012 R2.

In a traditional ADFS 2.0 based deployment, the ADFS server farm is built with multiple servers, both internal and external, for high availability. When you run multiple ADFS servers in the environment, you can use SQL Server instead of the Windows Internal Database (WID) and apply SQL Server high availability. You can monitor these servers with your own traditional monitoring tools like SCOM. A highly available ADFS infrastructure ensures that your environment stays available to end users most of the time.
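If you want a lightweight check alongside SCOM, one option is to request the federation metadata document that ADFS publishes and treat an HTTP 200 as a sign of life. The Python sketch below does that for each farm member and rolls the results up into a single farm state. The host names are hypothetical, and the three-state summary (healthy, degraded, down) is my own convention, not something ADFS reports.

```python
import urllib.error
import urllib.request

# Federation metadata path published by ADFS.
METADATA_PATH = "/FederationMetadata/2007-06/FederationMetadata.xml"


def probe_adfs(host, timeout=10):
    """Return True if the host serves its federation metadata with HTTP 200."""
    url = "https://" + host + METADATA_PATH
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, TLS failures, timeouts and unreachable hosts.
        return False


def farm_status(hosts, probe=probe_adfs):
    """Probe each farm member and summarize the farm as one state string."""
    if not hosts:
        raise ValueError("at least one ADFS host is required")
    results = {h: probe(h) for h in hosts}
    healthy = [h for h, ok in results.items() if ok]
    if len(healthy) == len(results):
        state = "healthy"
    elif healthy:
        state = "degraded"  # still serving logons, but redundancy is reduced
    else:
        state = "down"
    return state, results


if __name__ == "__main__":
    # Hypothetical farm members; replace with your own server names.
    state, detail = farm_status(["adfs01.contoso.com", "adfs02.contoso.com"])
    print(state, detail)
```

Probing each member directly, rather than only the load-balanced name, lets you notice a failed server while the farm as a whole still answers.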

One more to add is your smart hosts. Some organizations still use smart hosts within the Hybrid mail flow. Microsoft doesn't recommend this per its design guidelines, and if you still use one, ensure that the server is monitored well for high availability.

Use the Remote Connectivity Analyzer tests for Office 365 to check how well your environment is set up with the service. To my knowledge this is the best tool for showing how things are set up, and the first tool that helps identify potential issues in time, with simple steps for corrective action.

Check here: Microsoft Remote Connectivity Analyzer

Finally, ensure that the clients used in your environment to access the service are supported, in line with the Office 365 System Requirements. If you still have unsupported clients in the environment, performance issues are inevitable and you also remain in an unsupported scenario. You can learn more by reviewing my blog post above on troubleshooting Outlook connectivity. To monitor this, you can use the various scripts available in the community that help you analyze the types of clients and operating systems used in the environment. The new reports added to Office 365 reporting also collect this information proactively and provide the data for you to act on.

I have tried to cover, in a nutshell, the critical components that you need to monitor effectively in your environment every day. I will update this post as and when I get new information on the topic. Stay tuned...

Update:

The Microsoft team recently released a few great tools for monitoring Office 365 service health status. Review the Office Blogs post below to know more.

Review here : New Office 365 admin tools

One more below:

Client library for Office 365 admin reporting web service now available
