Training Machine Learning for Cyberthreats
In a constantly evolving cyberthreat landscape where firewalls and antivirus software have become old hat, organizations must adopt more technologically advanced ways to protect crucial data. Advanced machine learning algorithms can learn the routine patterns of life for every user and device in a network to detect anomalies and adapt accordingly. The most pressing need for this augmented intelligence is in security operations centers, where teams of analysts search for threats by poring over hundreds of thousands of security events every day.
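As an illustrative sketch only, not any vendor's actual algorithm, a minimal patterns-of-life detector might learn each user's baseline activity level and flag large deviations from it. The sample counts and the three-standard-deviation threshold below are hypothetical:

```python
from statistics import mean, stdev

def learn_baseline(hourly_counts):
    """Learn a user's normal activity level from historical hourly event counts."""
    return mean(hourly_counts), stdev(hourly_counts)

def is_anomalous(count, baseline_mean, baseline_std, threshold=3.0):
    """Flag an observation more than `threshold` standard deviations from normal."""
    if baseline_std == 0:
        return count != baseline_mean
    return abs(count - baseline_mean) / baseline_std > threshold

# Hypothetical user who normally generates 10-14 events per hour
history = [10, 12, 11, 13, 12, 14, 11, 12]
mu, sigma = learn_baseline(history)
print(is_anomalous(12, mu, sigma))   # typical hour -> False
print(is_anomalous(90, mu, sigma))   # sudden burst of activity -> True
```

Real deployments model far richer features per user and device, but the principle is the same: learn what is routine, then surface what is not.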
Machine learning capabilities can train cyberthreat detection systems and measure their performance, including error rates. Efficient training of these systems will reduce error rates and significantly decrease false positives and alert volume, curtailing analyst fatigue. Error rates for human analysis can also be measured and benchmarked, enabling continuous improvement of the entire human-machine system. Cyber applications involving machine learning can now capture error rates in a statistically valid manner.
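When ground truth is known for every record, error rates can be computed directly by comparing detector verdicts to the true labels. A minimal sketch, with hypothetical verdicts and labels:

```python
def error_rates(predictions, ground_truth):
    """Compare detector verdicts against known ground truth labels.

    Returns (false positive rate, false negative rate)."""
    tp = sum(p and g for p, g in zip(predictions, ground_truth))
    fp = sum(p and not g for p, g in zip(predictions, ground_truth))
    fn = sum(not p and g for p, g in zip(predictions, ground_truth))
    tn = sum(not p and not g for p, g in zip(predictions, ground_truth))
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

truth = [True, False, False, True, False, False, False, True]
preds = [True, True, False, True, False, False, False, False]
print(error_rates(preds, truth))  # -> (0.2, 0.3333...)
```

The same scoring applies unchanged to a human analyst's verdicts, which is what makes benchmarking the combined human-machine system possible.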
Training for machine learning and other artificial intelligence (AI) applications requires large amounts of relevant information with associated system response files. New technologies can manufacture customized data using a sophisticated rules engine specifically designed for a machine learning system’s realism, complexity and scale requirements. The fully engineered data has the unique features of longitudinal consistency—consistency over time—and internal consistency within the data record. For example, a user identified as a male will likely have a male first name. The engineered data also offers consistency across disparate data sets and provides perfectly known ground truth with associated system response files. Machine learning system requirements, including patterns-of-life data, are designed and engineered into the simulated data files as well. This enables comprehensive performance measurement, including scoring and testing the entire machine and human environment.
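A rules engine of this kind can be sketched as a pipeline where each rule derives one field from fields already generated, so every record comes out internally consistent. The field names, name lists and e-mail rule below are hypothetical illustrations, not the actual product's rules:

```python
import random

# Hypothetical name pools for the consistency rule described above.
MALE_NAMES = ["James", "Robert", "Michael"]
FEMALE_NAMES = ["Mary", "Linda", "Susan"]

def generate_user(rng):
    """Generate one internally consistent synthetic user record."""
    record = {"gender": rng.choice(["M", "F"])}
    # Rule: first name agrees with the generated gender.
    pool = MALE_NAMES if record["gender"] == "M" else FEMALE_NAMES
    record["first_name"] = rng.choice(pool)
    # Rule: the e-mail address derives from the name, so the same user
    # carries the same identifier across disparate data sets
    # (longitudinal consistency).
    record["email"] = record["first_name"].lower() + "@example.com"
    return record

rng = random.Random(0)
print(generate_user(rng))
```

Because every field is derived by rule rather than sampled independently, contradictions such as a male user with a female first name cannot occur, and the generator's own bookkeeping doubles as perfectly known ground truth.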
Synthetic, high-fidelity automatic data generation replaces the manual process for creating data or accessing existing databases and modifying the data with extract, transform, load (ETL) technology. ETL tools allow information to be pulled from one database and placed into another. The new approach uses a highly automated, rule-based engine to generate virtually unlimited amounts of realistic, fully synthetic data designed for system tests. The simulated data maintain statistical distributions and incorporate all business logic and workflows. Among the benefits of this approach are significant cost savings; faster, more precise development cycles; and elimination of the cost and risk associated with managing confidential information.
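Maintaining statistical distributions means the generator samples each attribute from a specified distribution rather than copying real records. A minimal sketch, with a hypothetical mix of event types:

```python
import random

def generate_events(n, rng, event_mix=None):
    """Generate n synthetic log events whose type mix follows a specified
    statistical distribution (hypothetical proportions)."""
    event_mix = event_mix or {"login": 0.70, "file_access": 0.25,
                              "privilege_change": 0.05}
    types = list(event_mix)
    weights = list(event_mix.values())
    return [{"event_type": rng.choices(types, weights=weights)[0]}
            for _ in range(n)]

rng = random.Random(42)
events = generate_events(10_000, rng)
share = sum(e["event_type"] == "login" for e in events) / len(events)
print(round(share, 2))  # close to the specified 0.70
```

Because the data is fully synthetic, arbitrarily large volumes can be produced on demand with no confidential information ever entering the pipeline.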
One of the problems of working with existing large databases is that their ground truth cannot be determined in a cost-effective manner. Also, sophisticated patterns and use cases cannot be manually interwoven into existing databases. This has led to the undesirable situation where AI tools, including machine learning, are developed and deployed without the ability to measure processing speeds against error rates or to monitor and refine the data used to train them. The solution is to manufacture very large, relevant databases designed for the specific application of machine learning and AI.
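Weaving a use case into manufactured data is straightforward because the generator records exactly where it placed each engineered pattern. The two-step threat pattern and record layout below are hypothetical:

```python
def manufacture_dataset(n_benign, threat_pattern):
    """Manufacture a dataset with a known threat use case woven in,
    returning the records plus a perfectly known ground truth vector."""
    records, truth = [], []
    for i in range(n_benign):
        records.append({"id": i, "event": "routine"})
        truth.append(False)
    # Inject the engineered use case at a known position and record it.
    insert_at = n_benign // 2
    for offset, event in enumerate(threat_pattern):
        records.insert(insert_at + offset,
                       {"id": n_benign + offset, "event": event})
        truth.insert(insert_at + offset, True)
    return records, truth

records, truth = manufacture_dataset(8, ["recon", "exfiltration"])
print(sum(truth), "of", len(records), "records are threat events")  # 2 of 10
```

With an existing database, labeling those same two records would require expensive forensic review; here the labels are a free byproduct of generation.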
Manufactured synthetic data designed for cyberthreat detection systems has the unique ability to measure not only processing speed performance but also the accuracy of system algorithms, human analysts and future-state systems. This capability allows organizations to quantitatively measure and benchmark competing state-of-the-art solutions, implement best-in-class technologies and enable personnel to continually do more with less while effectively managing the cyberthreat. It also inherently builds into organizational cyber solutions the ability to advance along with technology.
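Benchmarking competing solutions against the same synthetic corpus reduces to running each candidate over the data and scoring both speed and accuracy against the known ground truth. A hypothetical harness, with a deliberately naive volume-threshold detector standing in for a real product:

```python
import time

def benchmark(detector, events, ground_truth):
    """Score a candidate detector on both speed and accuracy against
    perfectly known ground truth (hypothetical harness)."""
    start = time.perf_counter()
    verdicts = [detector(e) for e in events]
    elapsed = time.perf_counter() - start
    correct = sum(v == g for v, g in zip(verdicts, ground_truth))
    return {"accuracy": correct / len(events), "seconds": elapsed}

events = [{"bytes_out": b} for b in [120, 90, 110, 50_000, 100, 80_000]]
truth = [False, False, False, True, False, True]
naive_detector = lambda e: e["bytes_out"] > 10_000
print(benchmark(naive_detector, events, truth))
```

Running several detectors through the same harness on identical data yields the apples-to-apples comparison the article describes.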
John Dawson is the president of ExactData LLC, Rochester, New York.