SBIR/STTR Award attributes
Powerful tools exist to collect, visualize, and analyze performance data about HPC applications. However, usability issues with traditional HPC programming languages, libraries, and frameworks are pushing users toward newer, higher-level frameworks for specialized purposes such as deep learning and data analytics. These frameworks relieve the user of managing data distribution and communication directly, and HPC systems, including leadership Department of Energy systems, are increasingly being called upon to support such workloads. However, existing performance tools are not well suited to collecting data from these frameworks, and single-purpose visualization tools require users to learn a new tool rather than reuse their knowledge of the general-purpose visualization tools they already know. This problem will be addressed by improving open-source performance tools so that their data collection capabilities are more usable and scalable when applied to emerging data analytics and deep learning frameworks. We will provide new performance data collection, visualization, and analysis tools to help users gain insightful and actionable information from their performance data. The new tools will be built using data analytics technologies, so that users can analyze the performance data of an application written in a data analytics framework using that same framework. Users will then be able to reuse their existing knowledge rather than having to learn new skills specific to one tool. In Phase I, a proof-of-concept tool was developed that collects performance data for data analytics and deep learning applications and enables its analysis and visualization. The proof-of-concept tool is being used by early customers to analyze the performance of research code. In Phase II, the products developed in Phase I will be hardened into a production-ready, “shrink-wrapped” software distribution that automatically provides insightful performance data about deep learning applications.
Software images will be provided for rapid deployment in many environments. The product will integrate with deep learning runtimes to gather performance data that non-integrated tools cannot collect, reducing the time developers spend diagnosing performance issues. The Council on Competitiveness reports that over two-thirds of U.S. industry representatives say their HPC applications could utilize a 10x increase in computing capability, and over one-third could use a 1000x increase. The affordable performance engineering products developed through this SBIR project will fill a crucial need for better utilization of compute capability by improving software scalability and developer productivity, ultimately accelerating the pace of research and development.

