Guidance for procurement of APM tools

Document overview

This document covers the topics that customers in need of an APM (Application Performance Monitoring) solution need to understand in order to procure a suitable one. It provides guidance to enrich the procurement process and so achieve a successful APM deployment. It follows a simple but pragmatic approach that has proven highly successful in APM deployments for large and medium-sized corporates. The document has to cover some technical aspects in order to enrich the discussion and understanding. It covers the following topics:

Intrinsic primary APM objectives
Customer obligations (assumed to be you)
Vendor obligations

1. Monitoring

2. Mappings

3. Support

Measurability and the ultimate proof for APM value
Important issues with APM reference checks
POC risks
Additional required functionality
Extensibility of the solution
Particular technical APM tool aspects to consider during procurement.

Primary monitoring and application management objectives

Application Performance Management (APM) is the process with the following steps:

Know when anything in the IT landscape is about to break or deteriorate (anomaly recognition – preferably predictive)
Know exactly what is causing the problem (root cause)
Automatically cure the problem or notify a support resource with the ability to fix the problem
Manage and track the activities to fix the problem
Notify management of the problem, indicating the affected systems and services, and provide an estimated time to fix
Notify business indicating affected business service and target time for restoration

APM ensures that no incident goes unattended and that incident turnaround time is minimized, which are the key requirements for service assurance. A lack of APM process capability has severe service, financial and resource ramifications. Every application in the organization can be quantifiably validated against the APM process. Any effort beyond the above is merely symptom management. Care should be taken not to generate bogus notifications, as these desensitize support staff, which is counterproductive.

Acronyms and Definitions

The following terms are used in this document:

Customer – A customer is a person/entity/organization in need of an APM solution.
Vendor – A vendor is a supplier of APM solutions and / or tools.

Definitions include:

Business process – Complete process to perform some business activity such as the process of a customer placing an order that contains several products
Business activity – steps of the business process e.g. login, select product, add product to cart, check out, payment, logging off
Application process – group of application services invoked in order to achieve a business activity (in most cases)
Application service – activation of a physically coded service such as a Webservice call etc.
Data service – activation of a service to retrieve or modify data
Infrastructure services – hardware related activities
Support resources – people who are responsible for maintaining the solutions/environments.

Customer obligations

Vendors need to understand the scope of the APM deployment and the technologies involved. Any omission is a cause for concern: either the vendor is misled, or areas of ambiguity remain where it becomes impossible to ring-fence responsibilities, which will most likely lead to implementation failure.

Per application monitoring

The customer should compile the following for each application to be monitored as part of the APM solution:

Operating systems – variants and versions
Databases – variants and versions
VMs – variants and versions
Middleware – variants and versions e.g. Weblogic, Tuxedo, Oracle appsvr etc.
ESB (if any) – variants and versions e.g. Tibco, Websphere etc.
Application (in the case of off the shelf e.g. SAP, SharePoint etc.)
Webservers (if any) – variant and versions e.g.
Load balancers (if any) – variants and versions
Client application e.g. Web based, thick Delphi client, Angular etc.
Client communication back to server – HTTPS (GET, POST), proprietary sockets, Node.js etc.
Client communication protocol – RPC, embedded Java objects, HTML, XML etc.
Interfaces to other systems – technology and protocol used e.g. Webservice using SOAP, REST API using JSON etc.
Technology on top of the application (if any) – variants and versions e.g. Citrix etc.

The above information per target application is important as it defines the technology capabilities required of the APM solution.
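The per-application checklist above could be captured as a simple record so that omissions are visible before the request goes to vendors. The following is a minimal sketch only: the class, the field names and the sample values are assumptions made for illustration, and only a subset of the checklist items is modelled. Items marked "(if any)" may legitimately stay empty, so a reported gap is a prompt to confirm, not an error.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ApplicationInventory:
    """Technology inventory for one application to be monitored.

    Field names are illustrative and mirror part of the checklist.
    """
    name: str
    operating_systems: list[str] = field(default_factory=list)
    databases: list[str] = field(default_factory=list)
    middleware: list[str] = field(default_factory=list)
    client_types: list[str] = field(default_factory=list)
    interfaces: list[str] = field(default_factory=list)


def missing_items(inventory: ApplicationInventory) -> list[str]:
    """Return the names of checklist fields that were left empty."""
    return [
        f.name
        for f in fields(inventory)
        if f.name != "name" and not getattr(inventory, f.name)
    ]


# Hypothetical entry for a 'sales' application with two items not yet supplied.
sales = ApplicationInventory(
    name="sales",
    operating_systems=["Linux (version to be confirmed)"],
    databases=["Oracle (version to be confirmed)"],
    interfaces=["REST API using JSON"],
)
print(missing_items(sales))
```

Running the check per application before issuing the request gives the customer a list of items still to be confirmed.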

Network monitoring

Network monitoring is a key aspect of APM root-cause analysis and the customer should clearly state how network monitoring is, or should be, done. If network monitoring already exists, it should be clearly stated how the APM deployment can re-use the information already monitored. The customer should clearly state whether the vendor is allowed to interrogate network elements such as routers to collect network monitoring metrics. The customer should also indicate whether it is possible to determine how applications traverse network routes. If not, the vendor should be asked to include network monitoring as part of the APM offering, with a detailed explanation of how this will be achieved.

Vendor obligations

Monitoring

The vendor should define for each of the listed applications how the proposed APM solution will handle each of the items (Figure 2):

Figure 2: Components of APM tool capabilities

Mappings

In the APM space several mappings are important in order to resolve relationships between key monitored components. The most prominent relationships are:

Business to Application mapping e.g. Question: Where in the application, infrastructure and network should we look for the root cause of the problem when the end user experience of online umbrella purchases has increased by 50%?
Application to infrastructure mapping e.g. Question: Which applications will fail and which business processes will be affected if we reboot server HB01ND2?
Application to network mapping e.g. Question: After router RT7B24 has failed on subnet xxx.yyy.zzz.www/24, which applications will be affected and which business processes are out of service?
Application, infrastructure and network to operational resource mapping e.g. Question: If a fault is found in the ‘get_product’ REST or SOAP call of the sales application, to which support resource, who can effectively address the problem, should this notification be sent?

The vendors need to clearly state how the APM solution handles all of the above mappings for each of the listed applications. Where mapping discovery is not automated the maintenance effort should be clearly stipulated by the vendor.

The following should be considered pertaining to relationship mapping functionality of an APM tool:

Automated discovery and maintenance of all the mentioned mappings.
Manual mappings cause unnecessary maintenance obligations and should be factored into the operational cost of the APM solution. The exact maintenance procedure for manual mapping should be clearly described.
Inability to do the mappings reduces the maturity of the APM solution significantly and such a tool should not be considered (or the customer should accept the responsibility to customize the proposed APM solution to achieve the functionality derived from the mappings).

For the rest of this document we will assume the above mappings exist (automated or manual).
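To make the mapping questions concrete, the application-to-infrastructure question about rebooting server HB01ND2 can be answered by walking a dependency map. This is a minimal sketch under stated assumptions: the map contents, the `server:`/`app:`/`process:` naming and the function name are hypothetical, and a real APM tool would discover or maintain this data itself.

```python
from collections import deque

# Hypothetical mapping data: each key maps to the components that
# depend directly on it.
DEPENDANTS = {
    "server:HB01ND2": ["app:sales", "app:inventory"],
    "app:sales": ["process:online purchase"],
    "app:inventory": ["process:online purchase", "process:stock take"],
}


def affected_by(component: str) -> list[str]:
    """Breadth-first walk collecting everything downstream of a component."""
    seen: list[str] = []
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dependant in DEPENDANTS.get(current, []):
            if dependant not in seen:
                seen.append(dependant)
                queue.append(dependant)
    return seen


impact = affected_by("server:HB01ND2")
applications = [c for c in impact if c.startswith("app:")]
processes = [c for c in impact if c.startswith("process:")]
print(applications, processes)
```

The same walk answers the network question if routers and subnets are added as keys. The cost of keeping such a map current by hand is the maintenance effort the vendor should be asked to quantify.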

Support obligations

The support obligations of an APM implementation should never be underestimated. These support obligations consist of:

Notification threshold configuration and tuning
Expansion of the monitoring and notifications in order to increase coverage
Disabling and enabling notifications in conjunction with maintenance on the target solutions. It is necessary to emphasize the importance of dependency mapping here. For example, all indirect notifications need to be disabled when a server is rebooted for maintenance purposes, otherwise a storm of false notification events would erupt.
Event and notification correlation needs constant adaptation
Monitoring requirements constantly change both from implementation and presentation view, which should be encouraged as it is a definite strategy to improve the maturity of the APM deployment
Support of the monitoring tooling in terms of maintenance and ensuring the continued monitoring service. This aspect is influenced by the maturity and robustness of the tooling used. Maintainability of the tooling thus needs to be carefully scrutinised during the procurement process. This is also affected by the support capability of the tooling vendor.

The above obligations are often overlooked, which leaves chasms in the value derived from the APM implementation and often leads to complete abandonment of tools. This abandonment is evident from the number of unused monitoring products shelved in most large corporates. A very rough estimate based on several APM implementations shows that in large deployments the yearly support resource cost is equal to the maintenance cost of the tooling. This becomes proportionally higher in smaller organizations due to the skill diversity required for APM support. It is important for APM customers to understand these obligations, otherwise failure of the APM implementation is imminent. The problem is amplified by vendors pushing for as much sales revenue as possible while customers push to spend as little as possible in the quest for profitability. The net result of this conflict is very little funding ‘to make it work’.
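As an illustration of the maintenance obligation listed above, the sketch below suppresses notifications from a component under maintenance and from everything that depends on it, directly or indirectly. It is a sketch only: the dependency data and the function names are assumptions, not the behaviour of any particular product.

```python
# Hypothetical dependency data: component -> components that depend on it.
DEPENDANTS = {
    "server:HB01ND2": ["app:sales"],
    "app:sales": ["process:online purchase"],
}


def downstream(component: str) -> set[str]:
    """Return every component that directly or indirectly depends on `component`."""
    found: set[str] = set()
    stack = [component]
    while stack:
        for dependant in DEPENDANTS.get(stack.pop(), []):
            if dependant not in found:
                found.add(dependant)
                stack.append(dependant)
    return found


def should_notify(event_source: str, under_maintenance: set[str]) -> bool:
    """Suppress events from maintained components and everything downstream."""
    for component in under_maintenance:
        if event_source == component or event_source in downstream(component):
            return False
    return True


# While HB01ND2 is rebooted, events from the sales application are suppressed.
print(should_notify("app:sales", {"server:HB01ND2"}))
```

Without the dependency map only the server's own events could be silenced, and the indirect events would still reach support.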

Ultimate proof of value

Can we measure and evaluate the effectiveness of the APM implementation in terms of the primary objective of monitoring as stated in the initial paragraphs of this document? Fortunately the answer to this is unequivocally ‘Yes’. In the APM primary objective we stated:

.. we need to be notified of any potentially service affecting incident, we need to know exactly what is in the process of failing and where to distribute the notification to. Lastly we need to manage the entire process.

If we classify the service affecting incidents in production into those that were associated with a notification and those that were not, we have a definitive measurement of the coverage of the monitoring.

If we now measure the average turnaround time for remediation in comparison to what it was prior to the APM implementation we have a definitive measurement for the APM effectiveness. Refinement of this measure could be to deduct the support resource time for both cases.

These indicators can be used as service level agreements (SLA) for APM services or as the measure of compliance for the APM tooling.
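The two indicators described above reduce to simple arithmetic over an incident log. The sketch below assumes a hypothetical `Incident` record with a notification flag and a turnaround time in hours; the sample figures are invented purely to show the calculation and are not measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    notified: bool       # did monitoring raise a notification for it?
    hours_to_fix: float  # turnaround time from occurrence to remediation


def coverage(incidents: list[Incident]) -> float:
    """Fraction of service affecting incidents that had a notification."""
    return sum(i.notified for i in incidents) / len(incidents)


def turnaround_improvement(before: list[Incident], after: list[Incident]) -> float:
    """Relative reduction in mean turnaround time after the APM rollout."""
    mean_before = mean(i.hours_to_fix for i in before)
    mean_after = mean(i.hours_to_fix for i in after)
    return (mean_before - mean_after) / mean_before


# Invented sample data: incidents before and after the APM implementation.
before = [Incident(False, 8.0), Incident(False, 4.0)]
after = [
    Incident(True, 2.0),
    Incident(True, 3.0),
    Incident(False, 4.0),
    Incident(True, 3.0),
]

print(coverage(after))                        # 0.75
print(turnaround_improvement(before, after))  # 0.5
```

In this invented case three of four incidents were notified (75% coverage) and the mean turnaround time fell from 6 hours to 3 hours, a 50% reduction. Either figure could serve as the target in an SLA.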

Maximise the value from references

It is important that customers do reference checks at existing sites where the tooling is used, measured as described in the paragraph ‘Ultimate proof of value’. Understandably, vendors have a tendency to overcommit in an effort to gain the business, which makes reference checks non-negotiable. It is equally important for the customer to make sure the essence of APM is evaluated during the reference checks. For example, most companies have Microsoft Silverlight but the author does not know of any company that runs purely on Silverlight. Having it is no measure of the capability against the requirement. Make sure the tool under scrutiny is not just used for a portion of the APM problem but is used in the same way as you intend to use it in your organization. If the requirement is APM, then evaluate the deployment against the previous paragraph. Watch out: PowerPoint can be deceiving.

Proof of concept (POC) risks

Executing a POC on an in-house solution initially might seem to be the correct procedure, but could potentially pose serious risks, for example:

POC does not cover all areas and technologies in the corporate IT landscape.
POCs are time bounded and limited in scope, and these restrictions normally exclude key aspects.
POCs are to some extent similar to PowerPoint, as the vendor shows functionality in his/her comfort zone.
POCs require the involvement of highly skilled APM resources from the customer. Covering the technology diversity makes this costly and cumbersome.
True POC output evaluation is problematic due to scope and time limitations.

Although this is one of the more favourable tool validation methods, a customer needs to be aware of the risks mentioned and mitigate accordingly. If possible, always revert to the core APM principles and test a POC as extensively as possible against the desired outcomes. Examples exist of extensive POCs in the APM space which were disastrous due to some of the risks mentioned.

Additional required functionality

Up to this point the discussion has covered functionality and operability. APM solutions also need to be acceptable to the wider organization, including management, business and stakeholders.

Presentation

It is important that the APM solution has an aesthetically pleasing presentation layer which is:

A single truth (looking at the same thing from different angles provides the same answer and conclusions)
Provide the capability to look at the same information from diverse angles
Intuitive to use
Can be used concurrently by any party irrespective of geographical location
Well organised with the ability to emphasize the essence and hide clutter
Easy to extend with the behavioural properties of your organization
Easy to adapt to suit the liking of your support organization
Functional and pleasing
Extensive reporting orchestration capabilities, which mimic any presentation with scheduling and diverse delivery channels.

Management information

Management need to understand the state of IT service delivery, including:

Simplistically presented as an abstraction of the masses of technical detail
Covering all aspects of the IT landscape
With emphasis on the current operational risk areas
Showing the ramifications of the risks in the various business areas
With a summarised view and state of the remediation control process.

A lack of abstracted full transparency might lead to control anxiety with less favourable outcomes.

Business value

A proper APM implementation has significant value to the business division of the organization. The obvious value proposition is to provide knowledge pertaining to the functioning and availability of the various business services. This enables business to react favourably to IT deficiencies by means of engagement alterations.

Most monitored information contains various levels of business detail. Because of the vast amount of monitored information collected, much of it can easily be mined to provide business intelligence. This has become a prominent aspect of the monitoring value proposition.

Due to the generalization of the APM strategy and maturity of monitoring we are entering the phase where business processes inclusive of human resources can be monitored and managed by the APM implementation. This is expected to become a significant game changer for the effectiveness of the organization.

Data analysis

Monitored data analysis is in its infancy with lots of hype, and the author has yet to see value derived from it that exceeds the implementation cost. Lately most vendors provide the core analytics ingestion functionality together with the tooling to extract information from the repository. However, no extraction algorithms or strategies exist that will complement the APM deployment significantly. The key here is that at sales time you will hear ‘you can ….’ but not ‘it will deliver…’. This leaves the value extraction obligation with you as the customer.

Several research groups and vendors are putting vast amounts of resource effort into this area and surely in due time we should see some movement. Conversely, this could become another empty IT promise such as the artificial intelligence (AI) promise. Many years ago AI promised to solve all complex logical IT problems and yet today we see very few applicable implementations.

Extensibility of the solution

At the time of writing, none of the well-known APM solutions covers all aspects of the IT landscape, nor do they cover the full requirements as stipulated in this document. The rate of technology growth and the complexity of solution integration leave commercial APM products incapable of covering all monitoring needs. It is therefore of extreme importance to ensure that the APM solution under evaluation has the necessary extensibility to cover the gaps. Some extensibility points to ensure are:

Usable, easy, well-known application interfaces like REST APIs, SNMP client and server, Webservice etc.
Monitored metric ingestion
Sharing of monitored information with alternate monitoring solutions
Exposure of pre and post processed monitored information
Ingestion of events from alternate event providers
Emission of events to alternate solutions via various channels and APIs
Inclusion of alternate solution presentations like iFrames
Exposure of presentation information via alternate mechanisms like iFrames
Service account access to role based information
Extensibility of presentation by means of additional graphics and display components
Some standardized development language like JavaScript etc.
Necessary development tools to add monitoring agents for unsupported technologies
Incident delivery channel additions

Particular APM tool aspects to consider

Some APM solutions emerged from acquisitions of diverse products which were not well integrated into a unified APM solution. In these cases essential information is available, but in different areas, and is not collated. This leaves a seemingly complete APM solution whose usability is compromised by disjointed information. In this case sensible presentation of the information becomes impossible.

Monitoring is a service to ensure continued excellence of some primary business service. Due to this it is of extreme importance that the monitoring will not compromise the target service by means of:

CPU hogging
Excessive memory usage
File system depletion
IO hogging
Excessive network utilization
Blocking functionality of the business service in any shape or form
Slowing business service execution
Depleting connection pools, file descriptors or allowed connections.

Monitoring needs to be as non-intrusive as possible and it is important to establish the acceptable monitoring resource footprint upfront. In general 5% of the target system resources are provisioned for monitoring. It is important to ensure that the vendors are aware of, and in agreement with, the expected resource allocation. Customers should be aware of implicit usage such as instrumentation, which adds consumption inside the core business process or application servers and so makes the measurement more complicated. In this case a baseline without instrumentation is required to evaluate the resource influence of the instrumentation. This can be established during performance testing of the target solution.
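The footprint check described above can be sketched as follows. The 5% guideline comes from the text; the function names, the choice of CPU as the measured resource and the sample figures are assumptions for illustration. All values are fractions of total CPU capacity measured under the same load, for example during a performance test.

```python
BUDGET_FRACTION = 0.05  # the 5% guideline mentioned in the text


def monitoring_overhead(baseline_cpu: float, instrumented_cpu: float,
                        agent_cpu: float) -> float:
    """Total monitoring share of CPU: explicit agent usage plus the
    implicit increase caused by instrumentation of the application."""
    implicit = max(instrumented_cpu - baseline_cpu, 0.0)
    return agent_cpu + implicit


def within_budget(overhead: float, budget: float = BUDGET_FRACTION) -> bool:
    """True when the measured overhead fits the agreed budget."""
    return overhead <= budget


# Invented figures: the application uses 40% CPU without instrumentation,
# 43% with it, and the monitoring agent itself uses a further 3%.
overhead = monitoring_overhead(baseline_cpu=0.40, instrumented_cpu=0.43,
                               agent_cpu=0.03)
print(round(overhead, 2), within_budget(overhead))
```

In this invented case the agent alone fits the budget, but agent plus instrumentation comes to roughly 6% and does not, which is why the baseline comparison matters.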

Two different transaction monitoring strategies exist, namely detail per transaction and average across all transactions. The two strategies differ vastly and it is important for an organization to establish its needs upfront. In general, detail per transaction has a larger resource overhead but provides a richer set of information for fault finding. The transaction average strategy is less intrusive but the information set is limited. If the organization has a tendency to lose transactions occasionally, then the detail per transaction strategy is strongly advised, as the average strategy will leave you in the dark as before.
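The difference between the two strategies can be illustrated with a few invented response times. In this sketch `None` is an assumed marker for a transaction that never completed. The average strategy keeps one aggregate figure, while the detail strategy keeps one record per transaction.

```python
from statistics import mean

# Invented response times in milliseconds; None marks a lost transaction.
observed = [120, 130, 125, None, 128]

# Average strategy: only completed transactions contribute, so the lost
# transaction leaves no trace in the stored figure.
completed = [t for t in observed if t is not None]
average_ms = mean(completed)

# Detail per transaction strategy: every transaction keeps its own record,
# so the lost one can be identified (here by its position in the list).
lost = [index for index, t in enumerate(observed) if t is None]

print(average_ms)  # 125.75
print(lost)        # [3]
```

The average looks healthy even though one transaction in five was lost. Only the per-transaction record shows which one went missing.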

Cost models vary vastly. In some cases the operational cost component is much higher while the initial capital cost is lower. In all cases ensure a clear understanding of the cost model for both capital and operational cost. In some cases any expansion after the initial purchase comes at exorbitant cost. Ensure full insight into the cost roadmap, both capital and operational.

Note: This document is the property of APM WorX. Please do not redistribute without consent.


info@apmworx.co.za