============
Event Engine
============

Introduction
============

The :program:`Event Engine` is the backend process used by NAV to process the
*event queue*. Whenever a NAV subsystem posts an event to the queue, the
:program:`Event Engine` will pick it up and decide what to do with it.
Typically, the :program:`Event Engine` will generate an alert from the event,
or it may ignore the event entirely, depending on the circumstances. In some
cases, it will delay the alert for a grace period, while waiting for another
corresponding event to resolve the pending problem.

Plugins
=======

Most of the work of the :program:`Event Engine` is done by event handler
plugins from the :py:mod:`nav.eventengine.plugins` namespace. Each event picked
from the queue will be offered to each of the plugins in turn, until one of
them decides to handle the event.

If no plugin wants to handle the event, the :program:`Event Engine` will
perform a very simple default routine to translate the event directly into an
alert (possibly using alert hints given in the event itself).

Configuration
=============

The operation of the :program:`Event Engine` can be customized using
configuration options in :file:`eventengine.conf`. Most of the configuration
concerns itself with configuring the grace periods (timeouts) for various
types of alerts. The default configuration looks somewhat like this:

.. literalinclude:: ../../python/nav/etc/eventengine.conf
   :language: ini

.. _severity_levels:

Alert severity
--------------

All NAV alerts (as generated by :program:`Event Engine`) are assigned a
**severity** value, in the interval *1 through 5*. These values can be used as
part of your users' Alert Profile filters, and should be interpreted roughly
like this:

- **5** = *Information*
- **4** = *Low*
- **3** = *Moderate*
- **2** = *High*
- **1** = *Critical*

Severity values are normally chosen by the NAV program that generates the
event that an alert is based on.

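
Scripts that consume NAV alerts may find it convenient to refer to this scale
by name. The following is a minimal sketch that simply mirrors the table above
as a Python enum; ``Severity`` and ``severity_label`` are hypothetical names
used for illustration, not part of NAV's API.

.. code-block:: python

   from enum import IntEnum


   class Severity(IntEnum):
       """NAV alert severity levels; a LOWER number means a MORE critical alert."""

       CRITICAL = 1
       HIGH = 2
       MODERATE = 3
       LOW = 4
       INFORMATION = 5


   def severity_label(value: int) -> str:
       """Return the human-readable label for a severity value, e.g. 3 -> 'Moderate'."""
       return Severity(value).name.capitalize()

Because ``IntEnum`` members compare as plain integers, ``Severity.CRITICAL <
Severity.LOW`` is true. Sorting alerts by ascending severity therefore puts the
most urgent ones first.
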
However, NAV cannot distinguish what severity level any given alert
constitutes for *your* NOC. Therefore, the :program:`Event Engine` lets you
configure your own severity rules, using YAML_ syntax, in the configuration
file :file:`severity.yml`. Any rules present in this file will be processed to
set or modify the existing severity of any matching alert that is generated.

Configuring ``severity.yml``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example severity configuration:

.. code-block:: yaml
   :caption: severity.yml

   ---
   default-severity: 3
   rules:
     - alert_type: boxDown
       severity: 2
       rules:
         - netbox.category.id: GSW
           severity: 1
         - netbox.category.id: GW
           severity: 1
     - alert_type: boxDownWarning
       severity: 5
     - netbox.organization.id: foobar
       severity: '+2'

This configuration starts off by assigning a *default severity* level of
**3** to every alert that :program:`Event Engine` generates, regardless of
what the original severity value of the event was. Then follows a list of
rules that will be processed *in the order they appear* in the file.

Each rule consists of:

- One or more alert attribute match expressions.
- One severity value modification expression to be applied to an alert that
  matches the attribute expressions.
- Optionally, a sub-list of more rules to further apply to any alert that
  matched the expressions of this rule.

The first example rule will match any alert whose ``alert_type`` value equals
``boxDown`` (NAV's alert type for a lasting "box is unreachable" incident).
Any such alert will be assigned a severity level of **2**. Furthermore, the
rule lists two additional sub-rules to ensure that if the ``boxDown`` alert
was issued for any netbox (IP Device) whose category is a router (a category
id of either ``GSW`` or ``GW``), the severity is set to the most critical
level of **1**.

The second top-level example rule will match any alert whose type is
``boxDownWarning``, and set its severity to the least critical level of
**5**.
This is the stateless early warning the :program:`Event Engine` issues a few
minutes before declaring a stateful ``boxDown``. It is safe to consider this
type of alert as only *informational*.

The final top-level example rule will match any alert whose associated netbox
(IP Device) is owned by the organizational id ``foobar``. This rule uses a
*severity modifier expression* of ``'+2'``, which will add ``2`` to the
alert's current severity value.

In summary, if a ``boxDown`` alert is dispatched for a router in your network,
this rule set will ensure its severity is set to **1**. However, if the router
belongs to your less important ``foobar`` department, the final rule adds
**2** to that value, making the alert two levels less critical, and it comes
out with a severity of **3**.

Modifier expressions
~~~~~~~~~~~~~~~~~~~~

There are two types of supported severity modifier expressions for use in
rules:

1. Absolute values: An absolute integer will *replace* a matching alert's
   current severity level.
2. Relative values: Prefixing an integer with ``+`` (or ``-``) will
   *increase* (or decrease) the existing severity value by the given amount.

:program:`Event Engine` will silently ensure that no assigned or calculated
severity value will ever fall outside the valid range of 1-5.

.. important:: Please note that relative values **must be enclosed in
   quotes**, to avoid confusion with absolute values. YAML interprets ``+2``
   as the absolute value of 2, while ``'+2'`` is a relative value. A good
   practice would be to always quote your values, as that will work as
   intended in all cases.

Available ``event_type`` and ``alert_type`` values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Two of the available alert attributes that can be matched against in severity
rules are ``event_type`` and ``alert_type``. However, ``event_type`` is a
Python object: to match against an event type id/name, you must match against
the object's ``id`` attribute, i.e. ``event_type.id``, as the example
configuration file shows.

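
To make the rule processing described in this section more concrete, here is
a minimal Python sketch of how a rule set like the example above could be
evaluated. It is *not* NAV's actual implementation. It assumes that the YAML
has already been parsed into plain dicts (the ``CONFIG`` dict below mirrors
the example :file:`severity.yml`), that *every* match expression in a rule
must match, that each intermediate result is clamped to 1-5, and that alerts
expose attributes reachable through dotted paths such as
``netbox.category.id``. All function names are hypothetical.

.. code-block:: python

   from types import SimpleNamespace

   MIN_SEVERITY, MAX_SEVERITY = 1, 5


   def resolve(obj, dotted_path):
       """Follow a dotted path like 'netbox.category.id'; None if any step is missing."""
       for name in dotted_path.split("."):
           obj = getattr(obj, name, None)
           if obj is None:
               return None
       return obj


   def apply_modifier(current, modifier):
       """Apply an absolute (3 or '3') or relative ('+2', '-1') modifier, clamped to 1-5."""
       if isinstance(modifier, str) and modifier[:1] in ("+", "-"):
           value = current + int(modifier)  # relative: shift the current value
       else:
           value = int(modifier)  # absolute: replace the current value
       return max(MIN_SEVERITY, min(MAX_SEVERITY, value))


   def evaluate(rules, alert, severity):
       """Process rules in file order; sub-rules only run if their parent matched."""
       for rule in rules:
           matchers = {k: v for k, v in rule.items() if k not in ("severity", "rules")}
           if all(resolve(alert, path) == want for path, want in matchers.items()):
               if "severity" in rule:
                   severity = apply_modifier(severity, rule["severity"])
               severity = evaluate(rule.get("rules", []), alert, severity)
       return severity


   def compute_severity(config, alert):
       """Start from default-severity (if set), then apply the top-level rules."""
       severity = config.get("default-severity")
       if severity is None:
           severity = alert.severity
       return evaluate(config.get("rules", []), alert, severity)


   # The example severity.yml, as a YAML parser would typically return it.
   CONFIG = {
       "default-severity": 3,
       "rules": [
           {"alert_type": "boxDown", "severity": 2, "rules": [
               {"netbox.category.id": "GSW", "severity": 1},
               {"netbox.category.id": "GW", "severity": 1},
           ]},
           {"alert_type": "boxDownWarning", "severity": 5},
           {"netbox.organization.id": "foobar", "severity": "+2"},
       ],
   }

   # A boxDown alert for a GSW router owned by the "foobar" organization.
   alert = SimpleNamespace(
       alert_type="boxDown",
       netbox=SimpleNamespace(
           category=SimpleNamespace(id="GSW"),
           organization=SimpleNamespace(id="foobar"),
       ),
   )
   print(compute_severity(CONFIG, alert))  # 3: set to 2, then 1 (GSW), then +2

Tracing the example reproduces the summary given earlier: the router's
``boxDown`` alert ends up at **1**, and the ``foobar`` rule then shifts it to
**3**.
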
See the :doc:`event- and alert-type reference documentation ` for a detailed
list of available type names to match.

Other matchable attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~

Most alerts generated by the :program:`Event Engine` are associated with a
specific IP device registered in NAV (known as ``netbox`` internally).
Severity rules can be used to match against attributes of IP devices, or even
sub-attributes thereof.

As with the examples above, the ID (or name) of the organizational unit that
is responsible for an IP device can be read from ``netbox.organization.id``.
The ID of the wiring closet this device is located in (as organized by you,
the admin, in SeedDB) can be had from ``netbox.room.id``.

See the reference documentation for the :py:class:`Netbox ` model for a list
of all the available attributes of an IP device.

Exporting alerts from NAV into other systems
============================================

The :program:`Event Engine` can be made to export a stream of alerts. By
setting the ``script`` option in the ``[export]`` section of
:file:`eventengine.conf` to the path of an executable program or script, the
:program:`Event Engine` will start that program and feed a continuous stream
of JSON blobs describing the alerts it generates to that program's
``STDIN``.

Alert JSON format
-----------------

The :program:`Event Engine` will export each alert as a discrete JSON
structure. The receiving script will therefore need to be able to parse the
beginning and end of each such object as it arrives. Each object will be
separated by a newline, but no guarantees are made that the JSON blobs
themselves will not also contain newlines.

.. tip:: Here is a `Stack Overflow comment describing how Python's existing
   JSON library can be used to decode arbitrarily big strings of "stacked"
   JSON `_, such as is the case with the alert export stream.

An exported alert may look like this as JSON:

.. code-block:: json

   {
      "id" : 212310,
      "history" : 196179,
      "time" : "2019-11-05T10:03:10.235877",
      "message" : "box down example-sw.example.org 10.0.1.42",
      "source" : "pping",
      "state" : "s",
      "on_maintenance" : false,
      "netbox" : 138,
      "device_groups" : null,
      "device" : null,
      "subid" : "",
      "subject_type" : "Netbox",
      "subject" : "example-sw.example.org",
      "subject_url" : "/ipdevinfo/example-sw.example.org/",
      "alert_details_url" : "/api/alert/196179/",
      "netbox_history_url" : "/devicehistory/history/%3Fnetbox=138",
      "event_history_url" : "/devicehistory/history/?eventtype=e_boxState",
      "event_type" : {
         "description" : "Tells us whether a network-unit is down or up.",
         "id" : "boxState"
      },
      "alert_type" : {
         "description" : "Box declared down.",
         "name" : "boxDown"
      },
      "severity" : 3,
      "value" : 100
   }

Attributes explained
~~~~~~~~~~~~~~~~~~~~

These are the attributes present in the JSON blob describing an alert:

``id``
   The internal integer ID of this alert in NAV. This number is volatile, as
   the alert object disappears from NAV as soon as the :program:`Alert Engine`
   has completed its processing of the alert.

``history``
   The internal integer ID of NAV's corresponding alert history entry. I.e.,
   if this alert created a new problem state in NAV, this will be a new ID. If
   this alert resolves or otherwise concerns an existing state in NAV, this
   will refer to the pre-existing history ID.

   E.g. if a ``boxDown`` alert is issued for an IP device, and later, a
   ``boxUp`` alert is issued for the same IP device, both of these alerts will
   refer to the same alert history entry.

``time``
   This is the timestamp of the alert, in ISO 8601 format. Usually, this
   corresponds to the timestamp of the originating event. E.g., for
   ``boxState`` type alerts, this corresponds to the exact timestamp the
   :program:`pping` program reported it could no longer receive ICMP echo
   replies from a device.

``message``
   This is a short, human-readable description of what the alert is all about.

``source``
   This is a reference to the NAV subsystem that posted the original event
   that caused this alert.

``state``
   This is NAV's internal moniker for the state represented by this alert:

   ``x``
      This is a stateless alert (e.g. a generic warning or point-in-time
      event).

   ``s``
      This alert starts a new state in the alert history table.

   ``e``
      This alert ends (resolves) an existing state in the alert history table.

``on_maintenance``
   A boolean that tells you whether the subject of this alert is currently on
   active maintenance, according to NAV's schedule. This would typically be
   used to withhold notifications about alerts that occur during a known
   maintenance period for a device.

``netbox``
   A database primary key to the IP device this alert is associated with.

``device_groups``
   A list of NAV device groups that the associated IP device is a member of.

``device``
   A database primary key to the physical device this alert is associated
   with.

``subid``
   If this alert's subject is a sub-component of the IP device referenced in
   the ``netbox`` attribute, this will be some internal sub-ID of this
   component. This reference ID can be interpreted differently, depending on
   the alert type, which is what NAV does when the ``subject`` attribute
   described below is composed.

``subject``
   A string describing the alert's actual subject (or object, if you will,
   since NAV's terminology is grammatically challenged).

``subject_type``
   NAV's internal model name of the subject's data type. This would typically
   be things like ``Netbox``, ``Interface``, ``Module``,
   ``GatewayPeerSession`` etc.

   A ``subject_type`` value combined with the ``subid`` value can be used as a
   unique identifier of a NAV component by a 3rd party tool.

``subject_url``
   A relative canonical URI to a NAV web page (meant for human consumption)
   describing the alert's subject.

``alert_details_url``
   A relative canonical URI to NAV's REST API, where the details of the alert
   state entry can be retrieved.

``netbox_history_url``
   A relative canonical URI to a NAV web page (meant for human consumption)
   detailing the recent alert history of this alert's associated IP device.

``event_history_url``
   A relative canonical URI to a NAV web page (meant for human consumption)
   detailing the recent history of alerts of the same event type (e.g. all
   the recent alerts of the ``boxState`` category, if this is a ``boxDown``
   alert).

``event_type``
   A sub-structure describing the event category of this alert:

   ``id``
      The event category id of this alert.

   ``description``
      A description of said event category.

``alert_type``
   A sub-structure describing the alert type of this alert:

   ``name``
      The alert type name of this alert.

   ``description``
      A description of said alert type.

``severity``
   The severity of this alert. This is usually an integer in the interval
   **1** through **5**, where **1** is the most critical level.

``value``
   The alert value. This is usually an integer in the range *0-100*, but at
   the moment, this carries no specific meaning in NAV.

.. _YAML: https://en.wikipedia.org/wiki/YAML
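
The tip above points to a way of decoding "stacked" JSON with Python's
standard library. As a minimal sketch of that approach, the hypothetical
export script below buffers whatever arrives on ``STDIN``, repeatedly calls
``json.JSONDecoder.raw_decode()`` to peel complete alert objects off the front
of the buffer, and prints a one-line summary of every alert that is not
``on_maintenance``. The function names and output format are illustrative
assumptions, not something NAV provides. It also assumes that a decode error
always means "incomplete object, wait for more data", so genuinely malformed
input would simply be buffered indefinitely.

.. code-block:: python

   import json
   import sys

   DECODER = json.JSONDecoder()


   def iter_alerts(chunks):
       """Yield alert dicts from an iterable of text chunks holding stacked JSON."""
       buffer = ""
       for chunk in chunks:
           buffer += chunk
           while True:
               buffer = buffer.lstrip()  # raw_decode() rejects leading whitespace
               if not buffer:
                   break
               try:
                   alert, end = DECODER.raw_decode(buffer)
               except json.JSONDecodeError:
                   break  # assume the object is incomplete; wait for more input
               yield alert
               buffer = buffer[end:]


   def summarize(alert):
       """Return a one-line summary of an alert, or None if it is on maintenance."""
       if alert.get("on_maintenance"):
           return None
       return "[severity {severity}] {alert_type} {subject}: {message}".format(
           severity=alert["severity"],
           alert_type=alert["alert_type"]["name"],
           subject=alert["subject"],
           message=alert["message"],
       )


   def main():
       """Entry point: read the export stream from STDIN, one summary per alert."""
       for alert in iter_alerts(sys.stdin):  # iterating stdin yields lines
           line = summarize(alert)
           if line:
               print(line, flush=True)

To try it, save the sketch with a ``#!/usr/bin/env python3`` line at the top
and a call to ``main()`` at the bottom, make the file executable, and point
the ``script`` option in the ``[export]`` section of :file:`eventengine.conf`
at it. Because objects are decoded from a growing buffer rather than one line
at a time, JSON blobs that span several lines are handled as well.
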