Over the years VMware support has investigated numerous performance escalations of virtualized tier 1 applications. One of the more challenging aspects of this task is coordinating all the key performance monitoring metrics across the different technology layers from the application down to the hypervisor. This is where VMware vRealize Operations (vROps) with the Blue Medora Management Packs can help to expedite the troubleshooting process. I will show an example here with a virtualized SAP on Oracle system.
The Blue Medora website has links to all the installation documentation for the different application management packs. Once the SAP and Oracle Management packs are installed and configured in vROps to connect to the individual SAP and Oracle systems, the adapters will discover and generate SAP and Oracle objects in vROps which can be accessed via menu Home -> Environment -> All Objects. The following screenshot shows the discovery of the SAP system.
As shown above the different instances of the multi-tier SAP system have been discovered: two application servers; Central Services; database instance with system ID = “TST”. You can then drill down into the SAP metrics for an application server.
This SAP system is running on an Oracle database. The Oracle management pack will discover the Oracle database as shown below.
Now how do we troubleshoot this environment. Let’s show an example.
Performance Escalation Logged with the Helpdesk
SAP end users are complaining of slow response times on the SAP system. Some users are claiming its taking a long time to log into SAP.
Order of Analysis
Analysis will involve monitoring three technology layers. The analysis will start at the virtual layer, then move up to the database and finally to the SAP layer – this is described in the diagram below.
You can access the different metrics in vROps via the menu:
“Home –> Environment –> All Objects –> <Select Adapter> –> <select adapter object> –> Troubleshooting –> All Metrics –> <select object> –> <select counter> …..”
Step 1 Virtual Metrics
We begin at the infrastructure layer. The following table shows some of the key virtual metrics for this example.
From above we can see that there does not appear to be any major resource bottlenecks at the infrastructure layer . Next we move up to the database layer.
Step 2 Oracle Metrics
The following table shows two Oracle metrics for this example (note there are other Oracle metrics that would also need to be considered for an in-depth analysis).
Oracle Logical Reads Per User Call corresponds to the average Oracle blocks read from the buffer cache (part of Oracle’s System Global Area) to service queries from the application server. If the block is not available in the cache it is serviced from disk. A large number of logical reads per user call may be due to expensive SQL statements. Expensive SQL statements can be addressed via SQL tuning. Threshold value and guidelines for the logical reads per user call counter (and other key Oracle metrics ) are documented in the SAP Knowledgebase article 618868 – FAQ: Oracle performance .
The Oracle database wait time ratio counter helps to determine if the database is currently experiencing a high percentage of waits/bottlenecks. A higher database wait time ratio indicates that system performance can be improved using “wait event tuning”. The latter requires more in-depth analysis of Oracle wait events – these wait event counters can be accessed within Oracle by the database administrator or can be available in vROps via the Blue Medora management pack for Oracle Enterprise Manager.
In this example both the logical reads per user call and database wait ratio have increased to levels that requires more in-depth analysis to determine if Oracle or bad SQL statements are the cause of the performance problem. However, it is possible that Oracle is performing as expected to process the SQL statements as submitted by the application server. We now need to move to the SAP layer as ultimately all workload originates from the application tier.
Step 3 SAP Metrics
In the final step we look at the SAP counters which can help explain the workload running on the application server. The following table shows some SAP metrics for this example (note there are other relevant metrics).
The SAP dialog work process utilization shows the percentage of work processes allocated for online user activity that is currently being utilized on the SAP application server. In this example the increase in work process utilization is suspect and requires further inspection by the SAP administrator. So now at this point we would notify the SAP administrator to use SAP tools to troubleshoot further – in this example this step reveals the root cause behind the user complaints.
Root cause: in this performance troubleshooting example the root cause is at the SAP application layer where a batch job was scheduled on the application server competing with the online users. The batch job utilized many of the available work processes thus minimizing the number of free work processes available for the online users.
Potential resolution: reschedule batch job on other application servers or at different time; increase the number of work processes.
I have shown a troubleshooting scenario of an SAP on Oracle system using vROps to analyze metrics from the vSphere, Oracle and SAP layers. vROps with the Blue Medora Management Packs has enabled the required visibility across these layers to expedite root cause analysis. In this example I have accessed the required metrics directly via the menu “Home –> Environment –> All Objects –> <Select Adapter> –> etc”. Alternatively you can navigate to the relevant metrics via the out-of-the-box dashboards provided by the management packs – an example of this is described at http://www.bluemedora.com/blog/advanced-troubleshooting-of-virtualized-sap-environments-with-vrealize-operations/ .
Thanks to my colleagues for their guidance on vROps and Oracle: Cameron Jones; Jeff Godfrey; Ben Todd; John Dias; Sudhir Balasubramanian.