Failing silently

Twice last week I sat in a meeting where someone made a casual remark that seemed trivial. It went something like this: “I have to look at my spam folder. I registered for the community but didn’t get the confirmation email.” Two people saying this within days of each other caught my attention. After a little investigating (helped by the fact that we knew the email addresses of the two people), we discovered that about 12 RiverMuse Community membership applications had arrived by a non-standard path and were sitting in a queue awaiting manual authorization. Of course, we fixed the problem and contacted each of the community members who had been left in limbo.

This is an example of failing silently, a key concept that we set out to address and mitigate when architecting RiverMuse. We thought our registration process was clearly defined, simple and effective, with no human intervention and with frequent automated checks that the application and SSO manager were working correctly. Reporting showed that membership was growing on a daily basis. What could go wrong?
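One way to catch this kind of gap is to reconcile what came in against what went out, rather than only checking that each component is up. The sketch below is a minimal illustration in Python, not a description of how RiverMuse works. The record layout, the path names and the function `find_stuck_applications` are all hypothetical.

```python
def find_stuck_applications(applications, confirmed_emails):
    """Return every application that has no matching confirmation email,
    regardless of the path it arrived by."""
    confirmed = {email.lower() for email in confirmed_emails}
    return [app for app in applications if app["email"].lower() not in confirmed]


# Hypothetical data: two applications took the planned route, one did not.
applications = [
    {"email": "ann@example.com", "path": "web_form"},
    {"email": "bob@example.com", "path": "web_form"},
    {"email": "cy@example.com", "path": "manual_queue"},
]
confirmed_emails = ["ann@example.com", "bob@example.com"]

stuck = find_stuck_applications(applications, confirmed_emails)
for app in stuck:
    print(f"no confirmation sent: {app['email']} (arrived via {app['path']})")
```

A check like this would still pass while membership grows every day, and it would fail only when someone is left waiting, which is the case the component-level checks could not see.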

In the even more complex world of fault management platforms with rules-based workflows, the opportunities to fail silently increase with the scale of the network and systems being monitored. An automation rule on a central server looks for a particular string or identifier, but for an event to get that far it has often passed through another rule engine sitting at the element management or probe level. A simple error in rule creation, or an alteration of the string at this lower layer, may well mean the event is never recognized as significant enough to create an alert or trigger an action. Now move forward to the new challenges of virtualization and grid or cloud infrastructures, where critical events and services can be nomadic in nature, and what was important yesterday is non-critical today, or vice versa.
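The two-layer problem can be shown with a toy example. This is a minimal sketch in Python under stated assumptions: `probe_rule`, `central_rule`, the event strings and the `unmatched` counter are all hypothetical and are not taken from any real fault management product. A probe-level rule alters the string, so the central rule no longer matches, and the only thing that keeps the failure visible is a count of the events that matched nothing.

```python
from collections import Counter


def probe_rule(raw: str) -> str:
    # Hypothetical probe-level rule: an edit at this layer changed the
    # wording of the message before it is forwarded.
    return raw.replace("LINK DOWN", "Link down")


def central_rule(event: str) -> bool:
    # Hypothetical central rule: still looks for the original string.
    return "LINK DOWN" in event


def process(raw_events, unmatched: Counter):
    alerts = []
    for raw in raw_events:
        event = probe_rule(raw)
        if central_rule(event):
            alerts.append(event)
        else:
            # Without this counter the event would simply disappear.
            unmatched[event] += 1
    return alerts


unmatched = Counter()
alerts = process(
    ["router7 LINK DOWN", "router7 LINK DOWN", "switch2 fan ok"],
    unmatched,
)
print("alerts raised:", len(alerts))
print("events that matched no rule:", dict(unmatched))
```

No alert is raised for either link failure, yet nothing reports an error. Reviewing the unmatched counts is what reveals that a significant event is being dropped.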

In my own real-world example above, I had no idea that it was possible to register using an alternative route that we had not planned. Not knowing meant that for those 12 people the process had failed and, but for the chance conversation and a similar statement in an email, we might never have realized, because all appeared to be well. This is failing silently. In my next post on this subject I will talk about the way we have designed RiverMuse to avoid many of the pitfalls that can lead to this type of scenario.

In the meantime, I would be interested to hear about any experiences you have had of failing silently.

