RiverMuse PRO includes a Reusable Business Logic (RBL) engine to streamline the creation of all configuration components in a single reusable package.
The configuration is loaded through a text file; it is then parsed and converted by a back-end engine that updates the configuration of multiple components within the RiverMuse product. This is a vast change from traditional Manager of Managers (MoM) solutions as configurations from multiple components are incorporated in a single configuration.
The RBL engine serves to:
- Map organizational processes into the product through one configuration channel
- Fuel the creation of a community-driven repository (aka App Store) in which configuration packages can be shared or bought. Examples include a configuration package that groups events by event type and performs isolated problem correlation, or a configuration package that interprets Cisco alarms and integrates with Cisco inventory tools to populate RiverMuse dynamic variables.
RiverMuse PRO also incorporates a presence management engine that can discover entities on demand.
This is most useful when new alarms are reported for devices/entities not yet present in a CMDB or inventory management system. A RiverMuse Business Logic Package can first look up Configuration Item information in the CMDB and, if nothing is found, attempt a discovery using the RiverMuse Presence Management System. Discovery provides many additional variables that can be leveraged through correlations, automations, and escalations.
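The lookup-then-discover flow described above can be sketched roughly as follows. This is only an illustration in Python, not RiverMuse code: the function names, the alarm fields, and the two stand-in data sources are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical lookup signature: takes a device name and returns a dict of
# attributes, or None when the device is unknown to that source.
Lookup = Callable[[str], Optional[dict]]


def enrich_alarm(alarm: dict, cmdb_lookup: Lookup, discover: Lookup) -> dict:
    """Return a copy of the alarm enriched with Configuration Item data."""
    device = alarm["device"]
    ci = cmdb_lookup(device)      # first choice: the CMDB
    source = "cmdb"
    if ci is None:
        ci = discover(device)     # fallback: on-demand discovery
        source = "discovery"
    enriched = dict(alarm)
    if ci is None:
        enriched["ci_source"] = "unknown"
    else:
        enriched.update(ci)
        enriched["ci_source"] = source
    return enriched


# Stand-in data sources, for illustration only.
cmdb = {"router-1": {"location": "London", "owner": "network-ops"}}
discovered = {"switch-9": {"vendor": "Cisco"}}

known = enrich_alarm({"device": "router-1", "summary": "link down"},
                     cmdb.get, discovered.get)
new = enrich_alarm({"device": "switch-9", "summary": "fan failure"},
                   cmdb.get, discovered.get)
```

Here `known` is enriched from the CMDB, while `new` falls through to the discovery stand-in because the device is not yet in the CMDB.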
RiverMuse PRO includes a centralized rules management wizard.
Whether there are 1 or 50 remotely deployed collectors, rules are configured in a central location through a GUI. Within a traditional Manager of Managers (MoM) solution, the processes of obtaining events, performing correlation, and providing business context are usually separate and distinct. Legacy MoM architectures typically require business logic rules to be updated at several levels and in multiple components of the system, frequently using different, proprietary scripting languages. This creates a management challenge and makes it hard for operations teams to keep up with infrastructure shifts.
RiverMuse PRO provides the facility to consolidate your Data Center Operations in a single pane of glass, and achieve Operational Excellence by automating tasks and streamlining processes.
RiverMuse Core, the first enterprise-class open source Real-time Consolidated Operations Console system, collects information via SNMP traps and Syslog messages out of the box. Additionally, it supports 8 standards-based APIs to obtain data from virtually any source (gSOAP, Perl, Java, C++, XML, PHP, Python, and Ruby on Rails). RiverMuse PRO builds on top of RiverMuse Core and provides a presence management discovery engine, a powerful enterprise desktop console, dynamic alert enrichment from external systems, enhanced scalability, and additional functionality to streamline organizational processes and dramatically simplify system maintenance.
Perhaps the most compelling reason for pursuing a Consolidated Operations Console solution is being buried in a myriad of tools and, more importantly, having little or no business context mapped to the results. The need for a so-called Manager-of-Managers (MoM) solution becomes more evident the more complex and dynamic an infrastructure becomes, as more tools are required to manage and monitor the environment. While multiple monitoring tools are great for specific tasks such as application monitoring, transaction monitoring, or communication device monitoring, problems that affect more than one silo take longer to identify and isolate.
RiverMuse PRO solves this problem by centralizing data across all the different tools, and retrieving events directly from devices when needed. To do this, RiverMuse PRO includes passive as well as active collectors such as the RiverMuse VMware agent. This active collector natively interprets CIM (Common Information Model) formatted data streams. Events from different environments are consolidated in one repository, where cross-domain correlation can occur. This allows operations personnel to quickly identify the problem and associated symptoms.
RiverMuse PRO incorporates the event processing scalability of leading commercial Manager-of-Managers solutions without sacrificing granularity.
In other words, all events related to a specific alert are kept in our system and made available on demand, including through a launch-in-context tool. Additionally, correlation can occur against both events and alerts. Other leading Manager of Managers tools are also resource intensive and require several install instances to provide value. In contrast, RiverMuse PRO has a small footprint to curb the overhead and maintenance requirements of legacy MoM solutions without sacrificing elegance and functionality.
The concept of Manager of Managers (MoM) has existed for many years but does it mean the same today as it always has?
MoMs were introduced to overcome the problem where lower order management systems, e.g. Element Management Systems (EMS), gave a very fragmented view of the network, leaving it up to operations staff to piece together the puzzle to form a picture of the complete network and its current status. This picture often only existed in the mind of the user, and the detail of that picture depended on the individual’s experience. Correlation of events across the diverse ‘stove pipe’ solutions was primarily a visual correlation on the part of the user.
The introduction of MoMs enabled network operations staff to pull together management information into one central point, providing a single integrated view of the entire network and enabling the introduction of automated correlation systems. This is why MoM is sometimes referred to as the ‘single pane of glass’. While the concept was fine, in the early days the practice was somewhat limited by the lack of integration capabilities supported by the lower order systems. Standard interfaces and APIs were few and far between. It probably wasn’t until the development of standards such as the Simple Network Management Protocol (SNMP) that MoM capability became a reality. The advent of web services, enabling more federated integration between individual management systems, has given this further impetus.
Today we see the adoption of MoM concepts being widely used, although ironically the term itself seems to have faded from our common vocabulary. While the term ‘MoM’ has traditionally been associated with consolidation of lower order systems into the Network Layer, the same principles are now being applied for managing systems, applications, services, customers or business units etc. Effective management at these higher levels still depends on the collection of data and information from the underlying systems. So the deployment of MoMs continues to gather pace and indeed some existing network centric MoMs are being re-positioned for managing at these higher layers with varying degrees of effectiveness. Undoubtedly, we will see wider adoption of Service Management Systems, SLA Management Systems, Business Management Systems and so on, each a new generation of MoM in their own right. For example, the BSM (Business Service Management) dashboard is a type of MoM albeit designed less for real time operations support than for broader business and technology alignment.
The question remains whether any of the current crop of MoM-type technologies is ready to take on the mantle of real-time dynamic infrastructure support. More about that in a later blog.
Ian Best
On this community page you can view the latest RiverMuse slide set offering an overview of the company, the event & fault management landscape, our objectives, architecture and key benefits. This is in Slideshare format and can be shared and copied. You can also post comments and questions directly on the page.
Thursday November 5th 2009 is a landmark for RiverMuse. For those of you who are in the UK today, you may see fireworks and bonfires as we celebrate the commercial launch of RiverMuse. From today we have a shiny new website, www.rivermuse.com, and a commercial support and professional services side to our open source software (OSS). The core software has undergone a major revision over the past 3 months, and the binaries are available for immediate download for RHEL 5 (or CentOS).
Surf the corporate site – see if you recognise anyone in the photos, download our white paper and data sheet, or dive into the community, which still has the same great features, but has also had a bit of a face lift to match the style of dot com.
As ever your feedback is welcome, post on our forums or here on the blog.
Step One: Monitoring free space in a file system
Let’s assume your key corporate documents are all stored on a file system that is on a disk in the corporate SAN. A process is in place to monitor the free space on the file system at regular intervals. When the monitoring process recognises that there is less than 10% (of space) available, it posts an event (to RiverMuse).
If you read my previous blog, using the bird table analogy, this would be the event generated by the web cam creating a video file.
RiverMuse then processes this alert for you, triggering a series of commands causing the disk on the SAN to be enlarged and subsequently, the file system grown to take advantage of this newly allocated space. Once this process is complete the system generates an event to mark the completion.
In the bird table analogy, this relates to the file conversion activity.
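As a rough illustration of the monitoring side of Step One, here is a minimal Python sketch of a free-space check. The 10% threshold comes from the example above; the event fields (`source`, `severity`, `summary`, `free_percent`) are invented for the sketch, and how the event actually reaches RiverMuse (for example as a syslog message or SNMP trap) is deliberately left out. It also returns a positive event when space is fine, since the business logic later in this post makes use of one.

```python
import shutil

LOW_SPACE_THRESHOLD = 10.0  # percent free, taken from the example in the post


def free_percent(total_bytes: int, free_bytes: int) -> float:
    """Free space as a percentage of the file system size."""
    return 100.0 * free_bytes / total_bytes


def space_event(path: str, total_bytes: int, free_bytes: int) -> dict:
    """Build an event describing the free-space state of one file system.

    The field names are hypothetical, not a RiverMuse event format.
    """
    pct = free_percent(total_bytes, free_bytes)
    if pct < LOW_SPACE_THRESHOLD:
        severity, summary = "warning", "space low"
    else:
        severity, summary = "clear", "space ok"
    return {
        "source": path,
        "severity": severity,
        "summary": summary,
        "free_percent": round(pct, 1),
    }


def check_path(path: str) -> dict:
    """Measure a real file system and return the matching event."""
    usage = shutil.disk_usage(path)
    return space_event(path, usage.total, usage.free)
```

Calling `check_path` on a mount point at regular intervals (from cron, say) and forwarding the returned event would give the behaviour described in Step One.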
Step Two: Reporting and recording of events
The completion message causes the following to happen:
1) The space low message is closed.
2) A message notifies the CMDB system that SAN resources have been reallocated.
3) A ticket is created in the helpdesk system, to prompt review of what triggered the expansion.
In the bird table analogy, these are the actions triggered when the conversion program reports that the upload is complete.
If the space warning alert has a count of 2 or more then the external trigger is not fired, as an expansion process has already been initiated.
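To make the flow so far concrete, the mapping from events to actions could be sketched like this in Python. It is only an illustration of the logic described above, not RiverMuse rule syntax; the event kinds and action names are made up for the example.

```python
def actions_for(event: dict) -> list:
    """Map an incoming event to the actions described in Steps One and Two."""
    kind = event["kind"]
    if kind == "space_low":
        # Only the first occurrence fires the external trigger; a count of
        # 2 or more means an expansion has already been initiated.
        if event["count"] == 1:
            return ["expand_san_disk", "grow_file_system"]
        return []
    if kind == "expansion_complete":
        return [
            "close_space_low_alert",
            "notify_cmdb",
            "create_helpdesk_ticket",
        ]
    # Anything else is not covered by this sketch.
    return []


first = actions_for({"kind": "space_low", "count": 1})
repeat = actions_for({"kind": "space_low", "count": 2})
done = actions_for({"kind": "expansion_complete"})
```

The repeated space-low event produces no actions at all, which is the point of the count check: the expansion is never started twice.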
Our real world example has other business logic, including:
· If free space is OK, we still generate an event, but this time it’s a positive event. RiverMuse uses this to reset the clock on its silent failure protection rules. In the previous bird table example this could be an event based on the absence of a video upload, with the time frame based on the time of year (to account for the hours of darkness).
· Likewise, if the space low message has a count that exceeds a predetermined amount, an alarm is triggered to check the file system expand process in case it has failed silently. Again, the analogy on the bird table is that the video conversion process has failed.
· Finally, if the space drops below 5% before the expansion is complete, a further external action is triggered to obtain human interaction, as the file system is growing at a potentially unmanageable rate.
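The three rules above could be sketched as follows. This is a Python illustration under stated assumptions, not how RiverMuse implements its rules: the 5% figure is from the post, while the count limit, the 30 minute window, and all of the names are placeholders chosen for the example.

```python
from datetime import datetime, timedelta

CRITICAL_FREE_PERCENT = 5.0               # from the post
MAX_SPACE_LOW_COUNT = 5                   # assumed "predetermined amount"
HEARTBEAT_WINDOW = timedelta(minutes=30)  # assumed silent failure window


class SilentFailureWatch:
    """Tracks when the last positive event was seen for one file system."""

    def __init__(self, window: timedelta, now: datetime) -> None:
        self.window = window
        self.last_seen = now

    def reset(self, now: datetime) -> None:
        # Called whenever a positive ("space ok") event arrives.
        self.last_seen = now

    def is_silent(self, now: datetime) -> bool:
        # True when no positive event has arrived within the window.
        return now - self.last_seen > self.window


def escalations(free_percent: float, space_low_count: int,
                expansion_done: bool) -> list:
    """Return the extra actions called for by the rules in the list above."""
    actions = []
    if space_low_count > MAX_SPACE_LOW_COUNT:
        # Too many repeats: the expand process may have failed silently.
        actions.append("check_expand_process")
    if free_percent < CRITICAL_FREE_PERCENT and not expansion_done:
        # Space is running out faster than the expansion can complete.
        actions.append("request_human_interaction")
    return actions
```

A periodic check of `is_silent` is what turns the absence of an event into an alarm, which is the core idea behind silent failure protection.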
All of this can be bundled into a package, using RiverMuse’s transportable business logic, and applied to all the individual file systems and disks across the network. As business logic is updated, or new file systems are added, the changes will be seamlessly propagated through the system, giving you the peace of mind that routine events are managed by the system, leaving you to get on with managing the exceptions.
Twice last week I sat in a meeting where there was a casual remark that seemed trivial in nature. The remark went something like this: “I have to look at my spam folder, I registered for the community but didn’t get the confirmation email”. Two people saying this within days of each other caught my attention. After a little investigating (helped by the fact we knew the email addresses of the two people) we discovered that about 12 RiverMuse Community membership applications had arrived by a non-standard path and were sitting in a queue requiring manual authorization. Of course we fixed the problem and contacted each of the community members who had been in limbo.
This is an example of failing silently, a key concept that we set out to address and mitigate when architecting RiverMuse. We thought we had our registration process clearly defined, simple and effective, with no human intervention and with frequent automated checks that the application and SSO manager were working correctly. Reporting shows that membership is growing on a daily basis. What could go wrong?
In the even more complex world of fault management platforms that have rules based workflows, the opportunities to fail silently increase with the scale of the network and systems being monitored. An automation rule on a central server looks for a particular string or identifier but for this event to get this far there is often another rule engine sitting at the element management or probe level. A simple error in rule creation, or an alteration of the string at this layer may well lead to an inability to process or recognize an event as significant enough to create an alert or undertake a trigger action. Now move forwards to the new challenges of virtualization and grid or cloud infrastructures where critical events and services can be nomadic in nature and what was important yesterday is non-critical today or vice versa.
In my own real world example above, I had no idea that it was possible to register using an alternative route that we had not planned. Not knowing meant that for those 12 people the process had failed, and but for the chance conversation and a similar statement in an email we might never have realized, because all appeared to be well. This is failing silently. In my next post on this subject I will talk about the way we have designed RiverMuse to avoid many of the pitfalls that can lead to this type of scenario.
Meantime I would be interested to hear if you have any experiences of failing silently.
We’re running a competition at the moment to come up with the most novel use of RiverMuse, see here. This got me thinking back to some of the early uses of the web, such as ensuring a good supply of coffee. I was wondering what I could come up with, along these lines, to demonstrate the flexibility of the system. So here goes…
In my hypothetical system, a web cam watches a hypothetical bird-feeding table in my garden. Whenever the web cam detects motion, it records this as an AVI file and posts it in a directory on my home PC (the webcam only works with Windows). The network access by the webcam is recorded in the Windows event logs, and this comes into RiverMuse Core as an event. RiverMuse’s omosd processes this event and de-duplicates it, which is important, as the next step in my process can only handle one event at once.
If the webcam alert has a count of 1, then yarpd executes an external command to process the AVI file into a low bandwidth format and upload it to a website. When this is done the conversion program sends a syslog message back to RiverMuse. This causes a number of events to happen:
- The webcam alert is closed.
- I run a shell command to tweet about it so all the fans of my bird table will know there is a new video to watch.
- I fire an alert back into RiverMuse that contains the date; omosd de-duplicates this, so I can have a filtered view of alerts showing how many videos were uploaded in a day.
If the count of the webcam alert is more than 1 then yarpd executes an external script to delete the extra video clips, as my conversion process can’t handle more than one at a time.
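For anyone who prefers code to prose, the de-duplication and count logic in this example can be sketched in a few lines of Python. To be clear, this is not how omosd or yarpd are implemented or configured; it is just an illustration of the idea, and the class, function, and key names are all made up.

```python
from collections import Counter
from datetime import date


class Deduplicator:
    """Collapses repeated events with the same key into one alert plus a count."""

    def __init__(self) -> None:
        self.counts = Counter()

    def receive(self, key: str) -> int:
        # Return the alert's count after taking this event into account.
        self.counts[key] += 1
        return self.counts[key]

    def close(self, key: str) -> None:
        # Closing an alert means the next event starts again from a count of 1.
        self.counts.pop(key, None)


def webcam_action(count: int) -> str:
    """Convert the first clip; delete extras while a conversion is running."""
    return "convert_and_upload" if count == 1 else "delete_extra_clip"


def upload_key(day: date) -> str:
    """Key that includes the date, so uploads de-duplicate per day."""
    return "video-uploaded-" + day.isoformat()


alerts = Deduplicator()
first = webcam_action(alerts.receive("webcam"))    # first motion event
second = webcam_action(alerts.receive("webcam"))   # arrives mid-conversion
alerts.close("webcam")                             # conversion finished
today = upload_key(date(2009, 11, 5))
alerts.receive(today)
uploads_today = alerts.receive(today)              # second upload that day
```

Because the upload alert's key contains the date, its count doubles as the number of videos uploaded that day.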
As a phase 2 I’m going to process the web logs so I can see which videos are the most watched.