SBIR/STTR Award attributes
The digital transformation of enterprises and the emergence of cloud-delivered applications and services have created virtualized, dynamic, and distributed IT infrastructures. Assuring availability, security, and performance in such an environment poses a real challenge to IT departments. Traditional IT Ops has given way to DevOps to speed up IT’s service response to rapidly changing demands from stakeholders. The rate of configuration changes in a DevOps environment is an order of magnitude greater than in a traditional IT Ops environment. Now, IT organizations are trying to leverage machine learning and advanced analytics to further automate infrastructure services and improve their responsiveness. This new trend is referred to as Artificial Intelligence for IT Ops (AIOps). A DevOps environment that uses automated provisioning and software-defined orchestration cannot ignore the impact of frequent configuration changes and updates (manifested in system/server logs) on application infrastructure performance. Information from non-traditional textual data sources (e.g., syslogs, API logs, and outage reports) that record issues on the infrastructure becomes critical in infrastructure performance analytics. Today’s performance-management tools primarily use numerical network-traffic-related data, and use limited textual data such as syslogs only in silos. Mining pertinent information from textual log/event data and correlating it with numerical performance data on the same analytics platform will lead to faster troubleshooting of application/service infrastructure performance issues. Considering these realities, in this Phase II SBIR project, Ennetix will develop a novel, log-driven infrastructure analytics and management service, called LIAM, to enhance the availability, security, and performance of modern IT infrastructures and to greatly accelerate root-cause analysis of issues.
LIAM will mine non-traditional textual data, such as system/server logs, configuration change logs, outage reports, and event reports from other IT management platforms, and correlate them with numerical network trace and server/host performance data. LIAM will feature advanced machine-learning techniques based on topic mining, novelty detection, and clustering, and it will be built on a scalable architecture to accommodate other user-defined categorical data sources. LIAM will bring useful additional context to the analysis of performance anomalies, reducing application/service interruptions and accelerating root-cause identification and service restoration. During Phase I of this SBIR project, requirements analysis and design of the LIAM platform were conducted, a working prototype was developed, and evaluation studies were performed to determine LIAM’s effectiveness in supporting IT operations through faster root-cause analytics and troubleshooting of modern IT infrastructures. These feasibility and performance evaluation studies were carried out using live data gathered from a large campus IT infrastructure (namely, UC Davis). Outcomes of the Phase I R&D efforts and evaluation studies have confirmed the viability of LIAM as a commercial-grade solution. In this Phase II project (a continuation of Phase I), the goal is to significantly expand LIAM with analytical features, AI/ML models, third-party integrations, automation methods, and innovative visualizations. A commercial-grade LIAM solution will be developed with which IT operations teams can proactively manage the performance of distributed infrastructures. Early trials will be conducted to demonstrate the functionality and performance of LIAM on live networks and to pave the way for successful market entry and deployment at premier R&E organizations such as UC Davis.
The proposed solution will greatly benefit IT administrators and managers at DOE and other organizations through a new approach to IT management that considers various data sources (both textual and numerical) along with traffic data and significantly reduces operational expenditures. The wider benefits of this effort will extend well beyond the immediate DOE scientific community to other enterprises, network operators, and cloud-service providers, who will be able to leverage the proposed LIAM solution to proactively manage their cloud-based, distributed, and dynamic application-delivery infrastructures.