Originally called Algorithmic IT Operations and later renamed Artificial Intelligence for IT Operations, AIOps is an emerging industry category that was named by Gartner in 2016; its emergence has been credited to the company Moogsoft. Gartner's official description of AIOps is:
AIOps platforms utilize big data, modern machine learning and other advanced analytics technologies to directly and indirectly enhance IT operations (monitoring, automation, and service desk) functions with proactive, personal and dynamic insight. AIOps platforms enable the concurrent use of multiple data sources, data collection methods, analytical (real-time and deep) technologies, and presentation technologies.
As the definition notes, AIOps is the application of artificial intelligence to enhance IT operations. Specifically, it uses big data, analytics, and machine learning capabilities to:
- Collect and aggregate the huge and ever-increasing volumes of operations data generated by multiple IT infrastructure components, applications, and performance-monitoring tools
- Intelligently sift "signals" out of the "noise" to identify significant events and patterns related to system performance and availability issues
- Diagnose root causes and report them to IT for rapid response or remediation, or in some cases automatically resolve issues without human intervention
This allows an AIOps platform to replace multiple separate, manual IT operations tools with a single automated platform, reducing complexity in IT environments and allowing IT operations teams to respond faster, and even proactively.
AIOps tools and platforms differ, but most organizations deploy AIOps as an independent platform capable of ingesting data from all IT monitoring sources and acting as a central system of engagement. To do this, the platform is powered by five types of algorithms that fully automate and streamline the key areas of IT operations monitoring: data selection, pattern discovery, inference, collaboration, and automation.
Data selection involves taking the large volume of redundant and noisy IT data generated by a modern IT environment and selecting only the data that indicates a problem. This can mean filtering out up to 99 percent of the total data generated in an IT environment.
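As a rough illustration of data selection, the sketch below drops low-severity events and exact repeats. The `Event` type, the severity ranking, and the `select_events` function are hypothetical names invented for this example; real AIOps platforms use far richer selection logic.

```python
from dataclasses import dataclass

# Hypothetical severity ranking; real tools define their own levels.
SEVERITY_RANK = {"debug": 0, "info": 1, "warning": 2, "error": 3, "critical": 4}


@dataclass(frozen=True)
class Event:
    source: str
    severity: str
    message: str


def select_events(events, min_severity="warning"):
    """Keep events at or above min_severity, dropping exact repeats."""
    threshold = SEVERITY_RANK[min_severity]
    seen = set()
    selected = []
    for event in events:
        if SEVERITY_RANK[event.severity] < threshold:
            continue  # noise: below the severity threshold
        if event in seen:
            continue  # redundant: an identical event was already kept
        seen.add(event)
        selected.append(event)
    return selected


events = [
    Event("web-1", "info", "request served"),
    Event("web-1", "error", "upstream timeout"),
    Event("web-1", "error", "upstream timeout"),
    Event("db-1", "debug", "checkpoint complete"),
    Event("db-1", "critical", "disk full"),
]
selected = select_events(events)
for event in selected:
    print(event)
```

In this toy input, two of the five events survive: one copy of the timeout error and the disk-full event.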
In pattern discovery, the AIOps platform, having ingested all the data in an IT environment, correlates the selected, meaningful data elements and finds relationships between them. It then groups related elements for more advanced analytics.
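One simple way to picture this kind of correlation is to compare metric time series pairwise. The sketch below uses a Pearson correlation coefficient and a fixed threshold; the metric names, the sample values, and the `correlated_pairs` helper are assumptions made for illustration, not a description of how any particular platform works.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlated_pairs(metrics, threshold=0.9):
    """Return (name_a, name_b, r) for metric pairs with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(sorted(metrics.items()), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical samples taken at the same six points in time.
metrics = {
    "db_latency_ms": [10, 12, 11, 40, 42, 41],
    "api_latency_ms": [100, 105, 102, 300, 310, 305],
    "cpu_fan_rpm": [1200, 1190, 1210, 1195, 1205, 1200],
}
pairs = correlated_pairs(metrics)
print(pairs)
```

Here only the two latency metrics move together, so they form the single group that would be passed on for further analysis.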
Also called root cause analysis, inference refers to the platform's use of data to identify the root causes of problems and recurring issues, so that IT systems and operators can act on what has been discovered.
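Root cause inference can take many forms. One common heuristic, shown here purely as an assumption-laden sketch, is to walk a service dependency map and flag the alerting services whose own dependencies are healthy. The `depends_on` map, the service names, and `root_cause_candidates` are hypothetical.

```python
def root_cause_candidates(depends_on, alerting):
    """Return alerting services none of whose dependencies are alerting."""
    candidates = []
    for service in sorted(alerting):
        upstream = depends_on.get(service, [])
        if not any(dep in alerting for dep in upstream):
            candidates.append(service)
    return candidates


# Hypothetical dependency map: each service lists what it depends on.
depends_on = {
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "database": [],
}
alerting = {"checkout", "payments", "database"}
candidates = root_cause_candidates(depends_on, alerting)
print(candidates)
```

In this example the alerts on `checkout` and `payments` are explained by a failing dependency, leaving `database` as the only candidate.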
For collaboration, AIOps platforms can notify the appropriate operators and teams and facilitate collaboration among them. This is especially important when individuals are geographically dispersed, and it also preserves incident data that can accelerate future diagnosis of similar problems.
For automation, AIOps platforms can automate responses and remediate problems as far as possible, making solutions to problems in an IT environment quicker and more precise.
The main benefit of AIOps is that it helps IT operations identify, address, and resolve slow-downs and outages faster by automatically sifting through alerts from multiple IT operations tools. This can offer specific benefits:
- Achieving faster mean time to resolution (MTTR): By reducing IT operations noise and correlating operations data from multiple IT environments, AIOps can identify root causes and find or propose solutions faster than traditional IT operators. This can enable an organization to set and achieve MTTR goals.
- Going from reactive to proactive or predictive management: As AIOps uses algorithms capable of learning, the platform keeps getting better at identifying less-urgent alerts or signals that correlate with more-urgent situations. This means the platform is capable of providing predictive alerts that let IT teams address potential problems before they lead to slow-downs or outages.
- Modernizing IT operations and IT teams: Instead of being bombarded with alerts from every environment, AIOps operations teams only receive alerts that meet specific service level thresholds or parameters, complete with the context required to make the best possible diagnosis and take the best and fastest corrective action. The more AIOps learns and automates, the more it helps to keep operations running with less human effort, allowing IT operations teams to focus on tasks with greater value to the business.
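To make the MTTR benefit above concrete: MTTR is usually computed as the total time spent resolving incidents divided by the number of incidents. The sketch below assumes each incident is recorded as a (detected, resolved) timestamp pair; the sample data and the function name are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over a list of incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
    (datetime(2024, 1, 3, 22, 0), datetime(2024, 1, 3, 23, 0)),
]
mttr = mean_time_to_resolution(incidents)
print(mttr)
```

Tracking this figure before and after an AIOps rollout is one way an organization could check whether its MTTR goals are being met.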
The primary use cases of AIOps include big data management, performance analysis, anomaly detection, event correlation and analysis, and IT service management.
AIOps can be used for performance analysis, using AI and machine learning to gather and analyze vast amounts of event data to identify the root cause of an issue. A key IT function, performance analysis has become more complex as the volume and variety of data have increased, and it has become increasingly difficult for IT professionals to analyze all that data with traditional IT methods, even as those methods have incorporated machine learning technology. AIOps addresses the increasing volume and complexity of data by applying sophisticated AI techniques to those bigger data sets. It can predict likely issues and perform root-cause analysis and, in some cases, prevent problems before they happen.
Anomaly detection, or outlier detection, identifies anomalous events: events and activities in a data set that stand out enough from historical data to suggest a potential problem. Anomaly detection relies on algorithms. A trending algorithm, for example, monitors a single KPI by comparing its current behavior to its past behavior and scoring the difference; if the score grows anomalously large, the algorithm raises an alert. AIOps makes this detection faster and more effective. Once a behavior has been identified, AIOps can monitor the difference between the actual value of the KPI of interest and the value the machine learning model predicts, and watch for significant deviations.
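A minimal sketch of such a trending check, assuming a rolling z-score as the scoring method (the source does not specify one): each new KPI value is compared with the mean and standard deviation of the preceding window, and values that deviate too far are flagged. The window size, threshold, and sample latencies are illustrative assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Flag (index, value) pairs far from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined, so skip
        score = abs(values[i] - mu) / sigma
        if score > threshold:
            anomalies.append((i, values[i]))
    return anomalies


# Hypothetical latency KPI sampled at regular intervals.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
anomalies = find_anomalies(latency_ms)
print(anomalies)
```

Note that a single spike inflates the window's standard deviation for the next few samples, which is one reason production systems tend to use more robust models than this.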
Event correlation and analysis is the ability to see through an "event storm" of multiple, related warnings to the underlying cause and to determine how to fix it. Traditional IT tools tend to show the storm of warnings without offering insight into the problem. AIOps uses AI algorithms to automatically group related events and focus on key event groups, which reduces unnecessary event traffic and noise and relieves IT teams of having to manage events continuously. It can also perform rule-based actions such as consolidating duplicate events, suppressing alerts, or closing notable events when an event is received.
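The grouping and duplicate consolidation described here can be sketched with a simple time-window rule: events that arrive within a set interval of the previous event join the same group, and repeated (source, message) pairs are counted rather than listed again. This is a deliberately naive stand-in for the AI-driven grouping the text describes; the event tuples and `group_events` are hypothetical.

```python
def group_events(events, window_seconds=60):
    """Group events by time proximity and count duplicates in each group.

    events: iterable of (timestamp_seconds, source, message), in any order.
    """
    groups = []
    for ts, source, message in sorted(events):
        if groups and ts - groups[-1]["last_seen"] <= window_seconds:
            group = groups[-1]  # close enough in time: same storm
        else:
            group = {"first_seen": ts, "last_seen": ts, "counts": {}}
            groups.append(group)
        group["last_seen"] = ts
        key = (source, message)
        group["counts"][key] = group["counts"].get(key, 0) + 1
    return groups


# Hypothetical event stream: a short storm, then an unrelated event.
events = [
    (0, "db-1", "connection refused"),
    (5, "api-1", "upstream timeout"),
    (7, "api-1", "upstream timeout"),
    (20, "web-1", "502 bad gateway"),
    (600, "batch-1", "job failed"),
]
groups = group_events(events)
for group in groups:
    print(group["first_seen"], group["counts"])
```

Five raw events collapse into two groups, with the duplicated timeout recorded once alongside a count of two.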
IT service management is a general term for everything involved in designing, building, delivering, supporting, and managing IT services within an organization. This includes the policies, processes, and procedures for delivering IT services to users within an organization. AIOps benefits IT service management in the same way it helps other IT disciplines: by applying AI to identify issues and help solve them. AIOps for IT service management can help IT departments to:
- Manage infrastructure performance in a multicloud environment
- Make more accurate predictions for capacity planning
- Maximize storage resources by automatically adjusting capacity
- Improve resource utilization based on historical data and predictions
- Identify, predict, and prevent IT service issues
- Manage connected devices across a network
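As one illustration of the capacity-planning item in the list above, the sketch below fits a straight line to historical storage usage and estimates how many days remain before a volume fills up. A linear trend is an assumption chosen for simplicity, and the usage figures, capacity, and function names are invented for this example.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for y over x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(daily_usage_gb, capacity_gb):
    """Estimate days from the last sample until usage reaches capacity."""
    slope, intercept = linear_fit(daily_usage_gb)
    if slope <= 0:
        return None  # usage is flat or shrinking: no projected fill date
    last_day = len(daily_usage_gb) - 1
    return max(0.0, (capacity_gb - intercept) / slope - last_day)


# Hypothetical daily usage samples for one storage volume, in GB.
usage_gb = [500, 510, 520, 530, 540]
remaining = days_until_full(usage_gb, capacity_gb=600)
print(remaining)
```

A forecast like this could feed an automated capacity adjustment or simply raise a predictive alert before the volume fills.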
Many legacy IT tools require information to be cobbled together from multiple sources before an issue can be understood, troubleshot, and resolved. AIOps provides an advantage by automating the collection and correlation of data from these multiple sources, increasing the speed and accuracy of resolutions. The AIOps approach automates functions across an organization's IT operations, including:
- Servers, OS, and networks—where AIOps collects all logs, metrics, configurations, messages, and traps to search, correlate, alert, and report across servers
- Containers—where AIOps collects, searches, and correlates container data with other infrastructure data for better service context, monitoring, and reporting
- Cloud monitoring—where AIOps can monitor the performance, usage, and availability of cloud infrastructure
- Virtualization monitoring—where AIOps can offer visibility across the virtual stack, make faster event correlations, and search transactions spanning virtual and physical components
- Storage monitoring—where AIOps can offer an understanding of storage systems in context with corresponding app performance, server response times, and virtualization overhead
There are three different types of AIOps platforms: domain-agnostic, domain-centric, and do-it-yourself.
Domain-agnostic platforms are useful because they are flexible, general-purpose tools that can ingest large varieties and volumes of data, creating value for enterprises. They can take data from integrated monitoring tools and apply it to a range of use cases.
Domain-centric platforms tend to have a more limited range of use cases in an enterprise context. As the name suggests, domain-centric AIOps revolves around one specific domain, such as a network or endpoint systems. Because they are restricted to a narrow set of data sources and data types, they can be a speed bump that inhibits the optimal performance of AIOps.
As the name suggests, do-it-yourself AIOps is best suited for enterprises that prefer to build their own AIOps platforms from the ground up to address their specific needs and applications. There are open-source tools and projects that provide plug-and-play components engineers can build into an enterprise AIOps platform. However, these platforms are fairly uncommon, as building one correctly requires the right talent and considerable skill.
The history of AIOps arguably began in 2011 with the founding of Moogsoft, which is considered to have developed the first AI-based operations management product and thus the first AIOps product. However, it was not until 2016 that research firm Gartner coined the term AIOps for what was then a developing segment of the IT industry, defining it as the application of AI to IT operations.