My Favourite Features of Enterprise Manager Cloud Control 12c - Today’s topic: Incident Management

Thanks to Oracle Enterprise Manager for this story

Enterprise Manager Cloud Control 12c (EM12c) is a huge release, both in
terms of its adoption rate (that is, its uptake in the market) and the amount
of functionality included in the product. For those of us who have been around for a long time, it’s very
reminiscent of the massive functionality leap from Oracle RDBMS version 6 to
version 7 – a quantum leap that makes it difficult to even grasp the breadth of
the product now.

To make the new features a bit more
understandable, I’ll be writing a number of blog entries over the coming months
to highlight just some of my favourite new features for EM12c. From an administrator’s perspective, one of
those standout features (and the subject of today’s entry) has to be incident
management.

The goal of incident management is to
enable administrators to monitor and resolve service disruptions that may be
occurring in their data centre as quickly and efficiently as possible. Instead of managing the numerous discrete
events that may be raised as the result of any of these service disruptions, we
want to manage a smaller number of more meaningful incidents, and to manage
them based on business priority across the lifecycle of those incidents.

To do this, Enterprise Manager now provides
a centralized incident console called Incident Manager. It enables the
administrator to track, diagnose, and resolve incidents, and it provides
features to help rectify the root causes of recurrent incidents. Incident Manager also directly leverages
Oracle’s own expertise via My Oracle Support knowledge base articles and
documentation to help administrators accelerate the process of diagnosing
and resolving incidents and problems. Finally, Incident Manager lets you perform lifecycle
operations on incidents, so you can assign ownership of an incident to a
specific user, acknowledge an incident, set its priority, track its
status, escalate it, or suppress it so you can defer it to a
later time. You can also raise
notifications on an incident or open a helpdesk ticket via the helpdesk
connectors.

Events

Enterprise Manager continues to be the
primary tool for managing and monitoring the Oracle data centre. It manages
and monitors Oracle applications and the full application stack, from the presentation
layer and middleware through databases to hosts and the operating system, as well as
non-Oracle technology. When Enterprise Manager detects issues in any of this
infrastructure, it raises events. Sample
events might be:

1. Metric alerts (for example, CPU
utilization or tablespace usage alerts) where a critical threshold you set has
been crossed.

2. Job events – events are raised by the job system for job statuses
that you specify. For example, an event is raised to signal the failure of a job.

3. Standards violations – if you
are using compliance standards and any of the targets that are being monitored
violate any of the compliance standards, then a standards violation event could
be raised.

4. Availability events – if a
target is down and Enterprise Manager detects that, an availability event
indicating that the target is down can be raised.

5. Other events – there are other
types of events that occur as well.

All these events signal that particular issues
have occurred in the managed data centre. As an
administrator, you really want to be able to determine which of these events
are significant. From these significant
events, you then want to be able to correlate discrete events that are related
to the same underlying issue, so that you have a smaller number of
significant incidents to manage.

Incidents

An incident could then be defined as an
object containing a significant event (such as a target being down, for
example) or it could be a combination of events that all relate to the same
issue (for example, running out of space could be detected by Enterprise
Manager as separate events raised from the database, host and storage target
types). For example, you may have a
performance incident that amalgamates a number of performance events, another
incident related to space, and a different incident based on availability
problems.

Sound good? OK, so how do we do this? Well, events
are significant occurrences in your IT infrastructure that Enterprise
Manager detects and raises. Each event
has a set of attributes: what type of event it is, the severity (fatal,
critical and so on), the object or entity on which the event is raised
(typically a target but it can also be a job or some other object), the message
associated with the event, the timestamp at which it occurred, as well as the
functional category (such as availability, security etc.).

Some examples of the different types of
events include:

· Target availability: raised
when a target is down or has gone into an agent unreachable state.

· Metric alert: raised when a
metric crosses its threshold.

· Job status change: raised, for
example, when a job fails.

· Compliance standard rule: raised
when a compliance standard rule is violated.

· Metric evaluation: raised when there
is an error with the evaluation of a metric.

· Other events such as SLA Alert,
High Availability and Compliance Standard Score violation can also be raised,
and of course, users can cause an event to be raised.

Associated with these event types are event
severities. The first of these, “Fatal”,
is a new severity level in Enterprise Manager specifically associated with the
target availability event type for when the target is down. Critical and warning events have the same
meaning as they had in previous releases, and then we have the Advisory
level. Typically, this is associated
with non-service-impacting events such as compliance standard violation
events. The informational level is an
event severity used to indicate simply that an event has occurred, but there is
no need to do anything about it.
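
To make the event attributes and severity levels above more concrete, here is a minimal Python sketch of how they could be modelled. It is purely illustrative and is not Enterprise Manager's own data model or API: the class names, field names, the numeric ranking of the severities and the sample timestamp are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels named in the text; the numeric ranking is an assumption."""
    INFORMATIONAL = 1
    ADVISORY = 2
    WARNING = 3
    CRITICAL = 4
    FATAL = 5


@dataclass(frozen=True)
class Event:
    """Hypothetical container for the event attributes listed above."""
    event_type: str      # e.g. "Target availability" or "Metric alert"
    severity: Severity
    entity: str          # typically a target, but could be a job or other object
    message: str
    timestamp: datetime
    category: str        # functional category, e.g. "Availability" or "Security"


# A target-down event carries the Fatal severity (timestamp is made up).
db_down = Event(
    event_type="Target availability",
    severity=Severity.FATAL,
    entity="DB1",
    message="Database DB1 is down",
    timestamp=datetime(2012, 1, 1, 9, 30),
    category="Availability",
)
```

Using an ordered enumeration means severities can be compared directly, which is convenient when deciding which of several events is the worst.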

As we discussed previously, an actual incident
will contain one or more events. Let’s
look at the details of an incident with one event. For example, Figure 1 shows us an
availability event:

Figure 1: Incident with one event

The event signals that the database DB1 is
down and includes a timestamp of when the event was raised. Because this is a target availability event
and the database is down, the severity is marked as Fatal. An incident can be created for that event, so
the incident contains only one event. In
order to manage and track the resolution of the incident, the incident has
other attributes such as owner (the Enterprise Manager user that is working on
the incident), status, incident severity (which is based on the event
severity), priority and a comment field.

Many incidents will instead contain
multiple events, where those events are related and point to the same
underlying cause. In the example shown
in Figure 2, we have two metric alert events on a host target: a memory
utilization metric alert event and a CPU utilization metric alert event, raised because
the host is starting to suffer from heavy load. We have a warning severity memory utilization metric alert event, and a
short time later a critical severity CPU utilization metric alert event.


Figure 2: Incident with multiple events

An incident can be created containing both
events in order to manage and track the resolution of the incident. In the current release, the administrator needs to
manually combine events into an incident in the Enterprise Manager console (the
automatic grouping of related events into an incident is a future enhancement).
Again, we have additional attributes associated with
the incident, as in the previous example. Enterprise Manager automatically assigns the
incident severity, based on the worst-case event severity of all the events
contained in the incident. Since the
worst event severity is Critical, the incident severity is also set to
Critical. Finally, the incident has a
summary, which is a short description of what the incident is about. The individual events indicate that the
machine load is high, so you can set the summary to say that. Alternatively, you can set the incident
summary to be the same as the event messages.
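
The worst-case severity rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not how Enterprise Manager implements it: the class and field names are hypothetical, the event messages are invented, and the numeric ranking of the severities is an assumption.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    """Assumed ranking: a higher value means a worse severity."""
    INFORMATIONAL = 1
    ADVISORY = 2
    WARNING = 3
    CRITICAL = 4
    FATAL = 5


@dataclass
class Event:
    severity: Severity
    message: str


@dataclass
class Incident:
    """Hypothetical incident holding one or more related events."""
    summary: str
    events: List[Event] = field(default_factory=list)
    owner: Optional[str] = None

    @property
    def severity(self) -> Severity:
        # Worst-case severity across all contained events.
        # Raises ValueError if the incident has no events.
        return max(event.severity for event in self.events)


incident = Incident(
    summary="Machine load is high",
    events=[
        Event(Severity.WARNING, "Memory utilization crossed its warning threshold"),
        Event(Severity.CRITICAL, "CPU utilization crossed its critical threshold"),
    ],
)
print(incident.severity.name)  # CRITICAL
```

Deriving the severity from the events, rather than storing it separately, keeps the incident consistent if another event is added to it later.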

If you are using one of the helpdesk
connectors to interface to a helpdesk system, an incident might also result in
a helpdesk ticket, which allows the helpdesk analyst to work on the ticket. Within Enterprise Manager, we’ll be able to
track both the ticket number and the status of that particular ticket.

Problems

A problem is the underlying root cause of
an incident. In Enterprise Manager
terms, a problem is specifically related to either an Automatic Diagnostic
Repository (ADR) incident or Oracle software incident. Enterprise Manager will automatically create
a problem whenever it detects an ADR incident has been raised. An ADR incident can be thought of as a
critical Oracle software problem where the resolution of the software problem
typically involves contacting Oracle Support, opening a service request and
possibly receiving a patch for that problem.

Whenever an ADR incident is raised, we
generate one incident in Enterprise Manager for that ADR incident, and we also
automatically generate a problem. All the ADR incidents that have the same problem signature (that is, the
same root cause) will be linked into a single problem object. You can manage the problem in
Incident Manager in the same way as you would manage an incident, so you can
assign an owner to the problem, track the resolution and so on. In addition, there are in-context links to
Support Workbench functionality, which allows you to package the
diagnostic material, open a service request and view the status of diagnostic
activity, such as the SR number and ultimately the bug number (if one is generated),
within the user interface.

Figure 3 shows a diagrammatic example of
how incidents and problems are related. In this example, two ADR incidents have occurred: two ORA-600 errors
in my database. Both of these
incidents are of critical severity. Enterprise Manager automatically creates a problem containing those
incidents. Within the Incident Manager
interface, you can link to the Support Workbench to open a service request, which
you can then track from Incident Manager.

Figure 3: Incidents and problems
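
The relationship between ADR incidents and problems can also be sketched in Python: incidents that share a problem signature end up under one problem. This is an illustrative sketch only, not Enterprise Manager's implementation. The class, the function and the signature string `ORA-600 [example_arg]` are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class AdrIncident:
    """Hypothetical record of an ADR incident."""
    incident_id: int
    problem_signature: str  # incidents with the same signature share a root cause
    severity: str


def group_into_problems(
    incidents: Iterable[AdrIncident],
) -> Dict[str, List[AdrIncident]]:
    """Link ADR incidents with the same problem signature into one problem."""
    problems: Dict[str, List[AdrIncident]] = defaultdict(list)
    for incident in incidents:
        problems[incident.problem_signature].append(incident)
    return dict(problems)


# Two ORA-600 incidents with the same (made-up) signature form a single problem.
adr_incidents = [
    AdrIncident(1, "ORA-600 [example_arg]", "Critical"),
    AdrIncident(2, "ORA-600 [example_arg]", "Critical"),
]
problems = group_into_problems(adr_incidents)
print(len(problems))  # 1
```

An incident with a different signature would simply start a second problem entry in the returned dictionary.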

Now that you understand the
terminology and the relationships between these terms, what’s next? Well, the next thing to understand is just
how you deal with these incidents. That
will be the topic of my next blog entry, so stay tuned for more!

Contributed by Pete Sharman, Principal Product Manager, Oracle Enterprise Manager
