Getting the most out of mountains of log data can be trying, to say the least.
At a conference where many are focused on defeating security, independent researcher Alexandre Pinto wants to find ways to make defending enterprise networks both smarter and easier. At the upcoming Black Hat conference in Las Vegas, Pinto plans to discuss how machine-learning algorithms can be used to help organizations get more value from their logs.
“The amount of security log data that is being accumulated today, be it for compliance or for incident response reasons, is bigger than ever,” said Pinto. “Given a recent push on regulations such as PCI and HIPAA, even small and medium companies have a lot of data stored in log management solutions no one is looking at. So, there is a surplus of data and a shortage of professionals that are capable of analyzing this data and making sense of it.”
SIEM (security information and event management) functionality relies too much on very deterministic rules, he added. For example, a rule might state that if something happens in a network “X” amount of times, it should be flagged as suspicious. The problem is that the “somethings” and the “Xs” change between organizations and evolve over time, he said.
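To make the kind of rule Pinto describes concrete, here is a minimal sketch of a deterministic threshold check. It is illustrative only: the function name `flag_sources`, the `(source_ip, action)` event format and the sample data are assumptions, not the syntax of any particular SIEM product.

```python
from collections import Counter

def flag_sources(events, something, x):
    """Flag every source IP that triggered `something` at least `x` times.

    `events` is an iterable of (source_ip, action) pairs. Both the format
    and the parameter names are hypothetical, chosen for illustration.
    """
    counts = Counter(src for src, action in events if action == something)
    return sorted(src for src, n in counts.items() if n >= x)

# Hypothetical sample data using documentation-reserved IP ranges.
EVENTS = (
    [("203.0.113.5", "login_failed")] * 3
    + [("198.51.100.7", "login_failed"), ("198.51.100.7", "login_ok")]
)

print(flag_sources(EVENTS, "login_failed", 3))
```

The weakness Pinto points to is visible in the signature: someone has to pick `something` and `x` by hand for each network and keep revisiting them as traffic changes.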
“But this is not exclusively a tool problem,” he said. “I have seen really talented and experienced people be able to configure one of these systems to really perform well. But it usually takes a number of months or years and a couple of these SOC [security operations center] supermen to make this happen. I used to run teams like these in my previous position, and I understand the challenges involved.”
After managing security consultants and security monitoring teams for years, he began researching ways to improve the experience for analysts. His answer: machine learning.
“The [Black Hat] talk is about a model I created to help classify malicious behavior from log data and help companies make decisions based on this trove of information they have available,” Pinto explained. “It does not outperform a well-trained analyst. But it can greatly enhance the analyst’s productivity and effectiveness by letting him focus on the small percentage of data that is much more likely to be malicious based on previous happenings on the network.”
Machine learning is designed to infer relationships from large amounts of data, he added. The more data, the better the predictions—making it a “good deal” for security, he said.
Researcher Proposes Using Machine Learning to Improve Network Defense
“These kinds of algorithms are all around us, being used for sales or marketing, to serve us ads, to suggest us products that are similar to the ones we or our friends have bought,” he said. “It is my belief that we can use machine learning to parameterize the ‘somethings’ and the Xs I mentioned before with very little effort on the human side.
“We can either use what is called ‘unsupervised learning’ to try to find patterns in data that could generate new rules and relationships or ‘supervised learning’, where the humans provide examples of good and bad behavior in their networks and the algorithm can suggest other IPs that are likely to be relevant in the same way.”
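The supervised approach Pinto describes can be sketched with a very simple classifier. The example below uses a nearest-centroid rule as a stand-in; it is not the model from his talk, which the article does not detail. The two features (denied connections per hour, distinct ports touched), the labels and all sample values are invented for illustration.

```python
from statistics import mean

def centroid(rows):
    """Average each feature column across the example rows."""
    return tuple(mean(col) for col in zip(*rows))

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(labeled):
    """Map each label to the centroid of its analyst-provided examples."""
    return {label: centroid(rows) for label, rows in labeled.items()}

def predict(model, features):
    """Return the label whose centroid is closest to `features`."""
    return min(model, key=lambda label: sq_dist(model[label], features))

# Hypothetical examples an analyst has labeled:
# (denied connections per hour, distinct ports touched)
LABELED = {
    "bad": [(120, 40), (90, 35), (150, 60)],
    "good": [(2, 1), (5, 2), (1, 1)],
}
# Hypothetical IPs that nobody has reviewed yet.
UNLABELED = {"192.0.2.10": (110, 50), "192.0.2.20": (3, 2)}

model = train(LABELED)
suggestions = {ip: predict(model, feats) for ip, feats in UNLABELED.items()}
print(suggestions)
```

A production system would use richer features and a stronger algorithm, but the division of labor is the same: humans supply a handful of labeled examples and the algorithm proposes which other IPs deserve attention.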
To test and develop new machine learning algorithms, Pinto created the MLSec Project.
“The way it is set up is that individuals and companies submit logs extracted from their SIEMs and network equipment and they receive daily automated reports from the algorithms that help them to pinpoint anomalous behavior on their networks,” he said. “For now, we are able to report on some specific behaviors on network firewalls, IPS and other perimeter facing security tools, but additional insights will be added as the project evolves.”
The service is completely free, but has only been demonstrated to a few companies and individuals so far.
Pinto said that the algorithm he is presenting in his talk estimates how likely a specific event in a log file is to be significant or to come from a malicious source, based on previous occurrences of malicious activity in the “same neighborhood (netblocks and ASN) of the Internet.”
“But instead of ‘let’s block off country X because we know they are bad’, we can have much more granularity, and the rules evolve as we see the malicious behavior changing origins over time. This specific implementation could be compared to a blacklist that changes and tunes itself automatically as the threat landscape changes.
“It has been trained initially on OSINT [open source intelligence] sources available on the Internet, most prominently the SANS Technology Institute, which kindly let me use their data in bulk,” he said. “Without considering companies that already contribute to the service, the algorithm is being fed on average 1.2 million relevant events summarized from over 30 million log entries per day.”
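The “neighborhood” idea and the self-tuning blacklist comparison can be illustrated with a small sketch. This is an assumption-laden toy rather than Pinto's algorithm: it groups addresses by a fixed /24 netblock only (ASN lookups would need external routing data and are left out), and the class name `NeighborhoodScore`, the daily decay factor and the sample addresses are all hypothetical.

```python
import ipaddress

def netblock(ip, prefix=24):
    """Return the network of the given prefix length that contains `ip`."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

class NeighborhoodScore:
    """Toy reputation score per netblock that fades as reports age."""

    def __init__(self, decay=0.5, prefix=24):
        self.decay = decay      # assumed fraction of weight kept each day
        self.prefix = prefix    # assumed netblock size
        self.weights = {}

    def report_malicious(self, ip):
        """Record one malicious event seen from `ip`."""
        block = netblock(ip, self.prefix)
        self.weights[block] = self.weights.get(block, 0.0) + 1.0

    def end_of_day(self):
        """Age all existing evidence so stale neighborhoods fade out."""
        for block in self.weights:
            self.weights[block] *= self.decay

    def score(self, ip):
        """Higher means more recent malicious activity near `ip`."""
        return self.weights.get(netblock(ip, self.prefix), 0.0)

scores = NeighborhoodScore()
scores.report_malicious("203.0.113.5")
scores.report_malicious("203.0.113.77")
print(scores.score("203.0.113.200"))  # neighbor of two reported addresses
print(scores.score("198.51.100.1"))   # no reports in this netblock
```

Because old evidence decays and new reports add weight, the set of high-scoring netblocks shifts on its own as malicious activity changes origin, which is the behavior the blacklist comparison points to.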
The conference will take place at Caesars Palace between July 27 and Aug. 1. Pinto’s talk is scheduled for Aug. 1.