Continuous Application Performance Management

Content Copyright © 2009 Bloor. All Rights Reserved.
Also posted on: The Norfolk Punt

I’m pretty keen on designing for user experience, user experience testing—and user experience monitoring in production. So, by way of background, I’ve been talking with Bernd Greifeneder, CTO of dynaTrace software—and once CTO at Segue, which is a good provenance to have. Our conversation started around the current trends towards service orientation, software as a service (SaaS), virtualisation and so on, which are making business applications fundamentally more complex. Software these days runs remotely, sometimes on platforms you don’t control, and this is much harder to manage than the old in-house mainframe applications, which had a real customer-facing person to handle the customer’s “user experience”. Then again, Internet interactions are now loosely coupled and asynchronous, which makes recovery from error situations very much harder (you can’t simply back out an interaction you no longer want, as people may have used its intermediate results already).
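
To make that last point concrete, here’s a minimal sketch in Java (my illustration, nothing to do with dynaTrace) of why a loosely coupled, asynchronous interaction can’t simply be rolled back: once a message has been published, downstream consumers may already have acted on it, so the only “undo” available is a new, explicit compensating message. The MessageBus interface and topic names are purely illustrative.

    // Once "orders.placed" is published, consumers may already have
    // charged a card or shipped goods; there is nothing to roll back.
    public final class OrderFlow {

        interface MessageBus {                       // illustrative bus abstraction
            void publish(String topic, String payload);
        }

        private final MessageBus bus;

        OrderFlow(MessageBus bus) { this.bus = bus; }

        void placeOrder(String orderId) {
            bus.publish("orders.placed", orderId);   // fire and forget
        }

        // The only "undo" is a compensating action, and any intermediate
        // results already consumed downstream must be compensated for too.
        void cancelOrder(String orderId) {
            bus.publish("orders.cancelled", orderId);
        }
    }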

Two dysfunctional effects of this complexity particularly interest Bernd: application problem resolution times, and the emergence of production performance problems which can’t easily be traced back to their root causes using manual techniques.

He is proposing a new approach, called Continuous Application Performance Management, which fits with current trends and enables companies to do more (in the way of fixing production problems) with less (in the way of people). And, most important of all, it helps them to fix problems quickly. dynaTrace, which implements Continuous Application Performance Management, follows individual transactions down to code level, collecting metrics for CPU usage and storing method arguments/returns, SQL invocations, messages, logs, exceptions and so on. Moreover, Bernd claims extremely low overhead: embedded lightweight agents just collect data and send it asynchronously to a centralised Diagnostics Server for real-time, off-line analysis. The proof of this pudding will be in the eating, but I thought that the obvious possible issues seemed to be addressed well—dynaTrace monitors its own overhead and configures itself accordingly; it prioritises production throughput over monitoring (so if something goes wrong in dynaTrace, there may be a gap in the monitoring but production shouldn’t be affected); and all transactions are monitored, not just a subset (which might, of course, not include the transactions with problems).
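
As a sketch of what “prioritise production throughput over monitoring” might look like in practice (this is my illustration, not dynaTrace’s actual agent code), the key idea is a non-blocking hand-off: the instrumented application thread offers events to a bounded queue and simply drops them if it is full, while a background daemon thread ships them off to the server asynchronously.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public final class MonitoringAgent {

        // Bounded queue: if the sender falls behind, events are dropped
        // (a gap in monitoring) rather than stalling application threads.
        private final BlockingQueue<String> events = new ArrayBlockingQueue<>(10_000);

        public MonitoringAgent() {
            Thread sender = new Thread(() -> {
                try {
                    while (true) {
                        ship(events.take());     // blocks only this thread
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "agent-sender");
            sender.setDaemon(true);              // never keeps the JVM alive
            sender.start();
        }

        // Called from instrumented application code; must never block.
        public void record(String event) {
            events.offer(event);                 // returns false (drops) when full
        }

        private void ship(String event) {
            // Placeholder: a real agent would batch and transmit the event
            // to a central Diagnostics Server here.
        }
    }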

What dynaTrace calls PurePath maps the transaction’s precise execution path, containing relevant sequence, timing, resource usage and contextual information for each method/step the transaction executes, across multiple servers, whether running on the same or different machines (although the mainframe isn’t fully supported at present—it runs there when WebSphere does but can’t trace into COBOL code—which is a pity).
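
The names below are mine, not dynaTrace’s API, but a PurePath-style trace implies a data structure something like this: each monitored transaction carries an ID across tier boundaries, and every method/step becomes a node recording timing and context, from which the server can reassemble the full call tree.

    import java.util.ArrayList;
    import java.util.List;

    public final class TraceNode {
        final String transactionId;   // correlates nodes across JVMs/machines
        final String methodName;
        final long startNanos;
        long endNanos;
        String context;               // e.g. SQL text, argument values, exception
        final List<TraceNode> children = new ArrayList<>();

        TraceNode(String transactionId, String methodName) {
            this.transactionId = transactionId;
            this.methodName = methodName;
            this.startNanos = System.nanoTime();
        }

        // Instrumentation calls enter() on method entry...
        TraceNode enter(String childMethod) {
            TraceNode child = new TraceNode(transactionId, childMethod);
            children.add(child);
            return child;
        }

        // ...and exit() on method return, completing the node's timing.
        void exit() {
            endNanos = System.nanoTime();
        }
    }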

What this means, to an organisation, is that its IT staff understand the dynamic behaviour of its applications in both development and production, and can therefore anticipate and correct performance problems before the business is affected. And, if something does go wrong, “time to repair” is reduced because the problem transaction can be quickly reconstructed from the captured data, traced to the underlying “root cause” code and repaired (Bernd claims) in minutes, not hours or days, often reducing cost per defect by as much as 100 times.
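
Reusing the illustrative TraceNode structure above, “root cause” localisation can then be as mechanical as walking the reconstructed tree for the node with the largest self time (its own elapsed time minus its children’s), which points straight at the offending method.

    public final class RootCause {

        // Elapsed time spent in this node itself, excluding child calls.
        // Assumes the trace is complete, i.e. every node has exited.
        static long selfTime(TraceNode n) {
            long own = n.endNanos - n.startNanos;
            for (TraceNode child : n.children) {
                own -= (child.endNanos - child.startNanos);
            }
            return own;
        }

        // Depth-first search for the node contributing the most self time.
        static TraceNode slowest(TraceNode node) {
            TraceNode worst = node;
            for (TraceNode child : node.children) {
                TraceNode candidate = slowest(child);
                if (selfTime(candidate) > selfTime(worst)) {
                    worst = candidate;
                }
            }
            return worst;
        }
    }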

That’s all well and good and you can check it out on dynaTrace’s website, but where next? Well, one sideline is OEMing dynaTrace as part of ALM tool suites from other vendors such as Borland. This provides a real “proof of concept” for dynaTrace monitoring.

However, the really interesting question is whether dynaTrace can address dysfunctional development cultures. Can it reduce the dysfunctional gaps between developers, operations and business users? Unsurprisingly, Bernd says it already does, because the traceability between user experience and code supports communication between the different stakeholders in a problem. dynaTrace makes its information available via business-oriented dashboards—but, more than that, it provides real-time, role-based dashboards for all stakeholders. These facilities are being developed further and should promote increased awareness of business user experience amongst developers, helping them build business-friendly systems “right first time” that meet users’ “working experience” needs as well as their technical ones.

Another possible future, according to Bernd, lies in moving the dynaTrace offering up a level: looking for design “antipatterns” which are likely to result in poor user experience, and suggesting refactorings that will address the potential issue. If root-cause code analysis makes fixing production problems in the code cheaper, then identifying potential design problems and fixing them before you write any code at all will be orders of magnitude cheaper still.
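
One concrete example of the sort of antipattern that shows up in captured transaction data is the classic “N+1 query”: the same SQL statement shape executed once per row instead of as a single joined query. Here is a hedged sketch of how a tool could flag it from the SQL it has already captured (the threshold and the literal-normalisation rule are illustrative, not anything dynaTrace has confirmed).

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public final class AntipatternCheck {

        static final int THRESHOLD = 10;   // illustrative cut-off

        static void reportNPlusOne(List<String> capturedSql) {
            Map<String, Integer> counts = new HashMap<>();
            for (String sql : capturedSql) {
                // Normalise literals so "... WHERE id = 7" and "... = 8"
                // count as the same statement shape.
                String shape = sql.replaceAll("\\d+", "?");
                counts.merge(shape, 1, Integer::sum);
            }
            counts.forEach((shape, n) -> {
                if (n >= THRESHOLD) {
                    System.out.printf("Possible N+1 antipattern: %d x %s%n", n, shape);
                }
            });
        }
    }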