Artificial Intelligence Primer

Artificial Intelligence and Machine Learning are the technologies at the forefront of what is being called the world’s fourth industrial revolution. Since the beginning of the human race, people have strived to improve how efficiently we live and work. At first, humans relied on simple manual labor and ingenuity; this, we believe, is how structures like the Pyramids, the Great Wall of China, and Stonehenge were built. Then came the first industrial revolution, which introduced mechanization, steam, and water power and brought advances in production, travel, and urbanization. The second revolution was sparked by the inventions of mass production and electricity. The introduction of electronic and digital technologies, including computers and the internet, marked the third revolution. Today we are entering a new era enabled by massive advances in, and practical applications of, Artificial Intelligence and Machine Learning.


MAN vs. MACHINE

Artificial intelligence aims to help humans operate more efficiently by dramatically reducing the time, money, and human smarts required to perform routine tasks. In a nutshell, computers are being given self-learning capabilities so that they can accurately predict outcomes, identify patterns, and automatically make adjustments based on both past and current information. In some cases, the machine becomes more efficient than, and even as capable as, its human counterparts.


The potential of computers becoming as smart as (or even smarter than) humans at certain tasks raises the debate of “man vs. machine”. Regardless of one’s beliefs, one thing we can all agree on is that humans have qualities that computers will likely never have: emotion, intuition, and gut feel.


When people debate the topic of artificial intelligence, they often argue about which machine learning categories or algorithms are best. Machine learning algorithms are generally grouped into three categories: supervised learning, which trains on labeled data; unsupervised learning, which finds structure in data without any labels; and reinforcement learning, which sits between the two, learning from feedback (rewards) rather than explicit labels. Each category contains more specific algorithms, such as KNN, k-means, Decision Trees, SVMs, artificial neural networks, and Q-learning. So, which one is better? Well, like anything in life, everything has pros and cons, and when it comes to machine learning, I tend not to debate the model itself, but rather redirect the conversation to the quality of the data. Machine learning models run on top of data, and without the appropriate amount, quality, and types of data, a machine learning model can be rendered useless no matter how good it is in theory. This is not to diminish the impact of selecting the right machine learning algorithm; the data and the algorithm must complement each other to solve specific use cases.
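
To make the distinction concrete, here is a minimal sketch (my illustration, assuming scikit-learn is installed, not tied to any product) that runs one supervised and one unsupervised algorithm from the list above on the same toy data:

```python
# Supervised vs. unsupervised learning on the same toy data (illustrative only).
from sklearn.neighbors import KNeighborsClassifier  # supervised: needs labels
from sklearn.cluster import KMeans                  # unsupervised: no labels

X = [[1, 1], [1, 2], [8, 8], [9, 8]]  # feature vectors
y = [0, 0, 1, 1]                      # labels, seen only by the supervised model

# Supervised: KNN learns from labeled examples, then predicts a label.
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.predict([[2, 1]]))   # -> [0]

# Unsupervised: k-means groups the same points with no labels at all.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)              # two clusters discovered from structure alone
```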


DATA IS PARAMOUNT

At Aella Data we started our company with a priority mission of collecting data – lots of data – and, more importantly, the right types of data to solve the breach detection problem. Once the data is collected, we sanitize it through deduplication, normalization, and a number of other cleanup steps. Next, we correlate the data with other bits of information, such as threat intelligence, the disposition of a file download, the geographic location of an IP address, and more. This enrichment gives better context to the dataset as a whole, and the result is clean data enriched with context. Only after these important tasks are completed do we perform machine learning.
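
As a rough sketch of that flow (the helper names geo_lookup and threat_intel are hypothetical placeholders, not Aella Data’s actual pipeline):

```python
# Collect -> sanitize -> enrich, sketched in the order described above.
def sanitize(events):
    """Deduplicate and normalize raw event records."""
    seen = set()
    for e in events:
        key = (e["src_ip"], e["dst_ip"], e["timestamp"])
        if key in seen:                       # deduplication
            continue
        seen.add(key)
        e["dst_ip"] = e["dst_ip"].strip()     # normalization (illustrative)
        yield e

def enrich(event, geo_lookup, threat_intel):
    """Correlate an event with external context before any ML runs."""
    event["geo"] = geo_lookup(event["dst_ip"])           # e.g., country of the IP
    event["reputation"] = threat_intel(event["dst_ip"])  # e.g., known-bad score
    return event
```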


AI WITH LIMITED VS. COMPLETE DATA

Let’s look closer at an example of how banks perform credit card fraud detection. If a customer normally uses their credit card only in San Jose, California, but travels to Tokyo, Japan, for the first time and tries to use the card there, some banks will flag that as an anomaly and deactivate the card. This often leaves the customer embarrassed and frustrated when a merchant tells them the card is declined. While this truly may be a “machine learned” anomaly, it may not warrant deactivation of the credit card, as this may be legitimate usage.


The root of the above problem is typically that the data itself is singular (the location of the card usage only) and lacks context, like the time the card was last used, where it was used, or how it was used. If a system were to correlate other bits of information, such as the time, location, distance between locations, reputation of a location, or how the card was used (a card terminal or a website, for example), a machine learning algorithm could better determine actual fraud.


Take another example of a card being used in San Jose, California, at 4:00 PM PST, but then used again in a small city in Ukraine at 5:00 PM PST the same day. The probability of this being fraud would be much higher than in the previous example. The correlated pieces of data leading to that conclusion would be the time it would take to travel from San Jose to Ukraine after the card’s use there, and the card’s use in a small, rarely frequented city with little transaction history.
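
A minimal sketch of such an “impossible travel” check, assuming per-transaction coordinates and a roughly airliner-speed cap of 900 km/h (both my assumptions, not a bank’s actual rule):

```python
# Correlate time and distance between two card swipes to test feasibility.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points on Earth, in kilometers.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def plausible_travel(tx1, tx2, max_speed_kmh=900):
    """Return True if the second swipe is physically reachable in time."""
    dist = haversine_km(tx1["lat"], tx1["lon"], tx2["lat"], tx2["lon"])
    hours = abs(tx2["time"] - tx1["time"]) / 3600   # timestamps in seconds
    return hours > 0 and dist / hours <= max_speed_kmh

san_jose = {"lat": 37.34, "lon": -121.89, "time": 0}
ukraine  = {"lat": 49.84, "lon": 24.03,  "time": 3600}   # one hour later
print(plausible_travel(san_jose, ukraine))  # False: ~10,000 km in one hour
```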


CLOSING REMARKS

This illustrates how Artificial Intelligence can be very useful in completing repetitive, data-heavy tasks that humans grow tired of performing, and in analyzing that data to solve problems. But will the technology replace humans? I tend to think not. AI may get you 90% or more of the way there in solving repetitive tasks, but the last portion of the effort – making the final decision on a problem – will always be needed from a human. Furthermore, as with other advances in efficiency, we can reuse our freed-up time to do even more work than before. Is one machine learning algorithm better than another? I believe the answer lies in understanding the problem one is trying to solve, and that the quality of the data is as important as the algorithm itself.


John Peterson

SVP Product Line Management

Aella Data

What Are DGAs and How to Detect Them?

Domain Generation Algorithms (DGAs) are a class of algorithms that periodically and dynamically generate large numbers of domain names. Typically, the domains are used by malware and botnets as rendezvous points to facilitate callbacks to the malicious actor’s Command & Control servers. DGAs allow malware to generate tens of thousands of domains per day, the vast majority of them unregistered. The enormous number of unregistered domains is used to camouflage the registered ones, allowing infected bots to evade detection and deterrence by signature- or IP-reputation-based security detection systems.
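
To illustrate the idea, here is a toy, date-seeded DGA sketch (purely illustrative; real DGAs vary widely in seeding and structure). Both the infected machine and the attacker can independently derive the same daily list from a shared seed:

```python
# A toy DGA: seed a PRNG with the current date so bot and attacker
# compute identical domain lists without ever communicating.
import random
from datetime import date

def daily_domains(day: date, count: int = 10, tld: str = ".com"):
    rng = random.Random(day.toordinal())   # shared, date-based seed
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return ["".join(rng.choice(alphabet) for _ in range(rng.randint(8, 14))) + tld
            for _ in range(count)]

print(daily_domains(date(2018, 1, 1)))   # same output on every machine
```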

The first known malware family to use a DGA was Kraken in 2008. Later that year, the Conficker worm pushed the DGA tactic into notoriety. Even after 10 years, it is still possible to find Conficker or one of its variants on some of today’s networks.

In tandem with the proliferation of malware, the use of DGAs has become more pervasive.

The Objectives of DGA Detection

Because DGA activity is a strong indicator of compromise, it is critical to detect any such activity on your network. There are three levels of DGA detection, with each subsequent level corresponding to a rise in severity. Detection at the later levels is more difficult, but also more critical.

If a DGA is detected, it means that one or more of your systems have been infected by DGA-based malware and have become bots. Action needs to be taken. The first objective is to identify the affected systems and properly clean or quarantine them to prevent escalation.

The next objective is to determine whether a given DGA domain name is registered. If the domain is registered, it has become an active Command & Control server that presents a great risk to your network. Infected systems, now bots, may use these servers to call home and receive commands from the malicious attacker. Therefore, the second component of an effective DGA detection system is the ability to differentiate registered domains from unregistered ones.

For example, a DGA may generate 1,000 domains, from xyzwer1 and xyzwer2 through xyzwer1000. The hacker only needs to register one domain, e.g., xyzwer500, not the other 999. If the registered domain and its associated IP can be identified, that information can be used to block the communication channel between the targeted system and the Command & Control server. Additionally, the intel should be propagated to all other prevention or detection systems in place to obstruct callbacks to that server from any system on the network.

The last but most critical objective of a DGA detection system is to determine whether a callback to a registered domain succeeded and contact was made between the infected system(s) and the Command & Control server. If such activity is detected, some damage may already have been done. Perhaps the malware in your network was updated, or new malware was installed. Sensitive data may have been exfiltrated.

How Does DGA Detection Work?

DGA activity is detected by capturing and analyzing network packets, usually in five general steps, plus an optional sixth step for blocking.

Step 1 – Detect DNS Application
Detection begins with DNS request and/or response messages. DNS is a fundamental Internet protocol, and most firewalls have a policy to allow outgoing DNS traffic on its reserved port 53. However, a hacker may take advantage of port 53 to send traffic that does not adhere to the standard DNS message format; this technique is known as DNS tunneling. A Deep Packet Inspection (DPI) engine is recommended to identify DNS applications more precisely.
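
As a minimal sketch of why the port number alone cannot be trusted, the following checks whether a UDP payload seen on port 53 even has a plausible DNS header (a real DPI engine performs far deeper validation; the count limits below are my own illustrative thresholds):

```python
# Sanity-check that a port-53 payload resembles DNS at all.
import struct

def looks_like_dns(payload: bytes) -> bool:
    if len(payload) < 12:                    # DNS header is 12 bytes
        return False
    _id, flags, qd, an, ns, ar = struct.unpack("!6H", payload[:12])
    opcode = (flags >> 11) & 0xF
    # Standard queries use opcodes 0-2; absurd section counts are a
    # strong hint the traffic is not real DNS.
    return opcode <= 2 and qd <= 16 and an <= 64 and ns <= 64 and ar <= 64
```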

Step 2 – Extract Domain Names
Once a network application is identified as DNS, the domain names in the DNS query and response messages need to be extracted. To extract the right domain name, the DNS message’s content must be parsed carefully, and a DPI engine is required to perform this task.
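
A minimal sketch of that extraction, pulling the query name (QNAME) out of a raw DNS message; it deliberately skips pointer compression, which a real parser must handle:

```python
# Extract the QNAME: length-prefixed labels terminated by a zero byte.
def extract_qname(msg: bytes) -> str:
    pos, labels = 12, []            # question section starts after the header
    while pos < len(msg) and msg[pos] != 0:
        length = msg[pos]
        if length & 0xC0:           # compression pointer: bail in this sketch
            break
        labels.append(msg[pos + 1 : pos + 1 + length].decode("ascii", "replace"))
        pos += 1 + length
    return ".".join(labels)

# A hand-built query for example.com, to show the parser working:
msg = (b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
       b"\x07example\x03com\x00\x00\x01\x00\x01")
print(extract_qname(msg))   # -> example.com
```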

Step 3 – Detect any DGA
Analysis needs to be performed on the domains extracted from DNS messages to determine whether they are DGA-generated. This is perhaps the most complicated step; the challenge is to reduce both false positives and false negatives. Detection mechanisms have evolved dramatically over the last 10+ years. Some mechanisms are based on relatively simple Shannon entropy (see https://www.splunk.com/blog/2015/10/01/random-words-on-entropy-and-dns.html). Others are based on more sophisticated n-grams, as presented by Fyodor at the HITB conference. Lately, with machine learning becoming popular, its methodologies have also been applied to DGA detection: machine learning can combine n-gram features, Shannon entropy, and the length of the domain name to influence decisions. Several machine learning models have been tried; a very good 2014 blog post by Jay Jacobs describes the process. Another open-source DGA detector, based on machine learning with Markov chains, is available at https://github.com/exp0se/dga_detector.
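
For a concrete taste of the simplest approach, here is a minimal Shannon-entropy sketch; the intuition is that random-looking DGA strings score higher than human-chosen names (the example names, and any cutoff you might pick, are illustrative assumptions):

```python
# Shannon entropy of a domain label: higher usually means more random.
from collections import Counter
from math import log2

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    return -sum((c / len(s)) * log2(c / len(s)) for c in counts.values())

for name in ("google", "xkqzwrmtpbvns"):
    print(name, round(shannon_entropy(name), 2))
# Human-readable names tend to score lower than machine-generated ones.
```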

Step 4 – Detect Registered DGA Domains
To detect whether a DGA domain name is registered, DNS responses need to be checked: a query that returns NXDOMAIN indicates an unregistered domain, while a resolved answer reveals a live Command & Control address. Merely tracking DNS requests is not sufficient – the detection system should track the entire transaction to facilitate correlation between pieces of information.
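
A minimal sketch of the registration check, using a blocking standard-library lookup; a production system would instead inspect the already-captured DNS responses (NXDOMAIN vs. an answer) rather than re-query:

```python
# Does a suspected DGA domain actually resolve?
import socket

def is_registered(domain: str) -> bool:
    try:
        socket.gethostbyname(domain)
        return True
    except socket.gaierror:      # NXDOMAIN and similar failures land here
        return False
```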

Step 5 – Detect Traffic to Registered DGA Domains
While most existing DGA detection systems focus on detecting whether a domain name is DGA-generated, they often forget the last and most important question: has any traffic been sent to the registered DGA domains? To detect this in a timely fashion, DGA domain detection must be tightly coupled with network traffic inspection, and the results need to be echoed back to the traffic inspection engine immediately, before any damage is done.
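
A minimal sketch of that coupling: flag any observed flow whose destination IP was previously resolved for a registered DGA domain (the mapping would be populated by the response tracking in Step 4; the addresses are illustrative):

```python
# Correlate traffic inspection with DGA resolution results.
dga_ip_map = {"203.0.113.7": "xyzwer500.com"}   # filled in by Step 4

def check_flow(dst_ip: str):
    domain = dga_ip_map.get(dst_ip)
    if domain:
        print(f"ALERT: traffic to registered DGA domain {domain} ({dst_ip})")

check_flow("203.0.113.7")   # -> ALERT: traffic to registered DGA domain ...
```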

Step 6 – Blocking the Traffic to Registered DGA Domains
While not technically a part of detection, if there is integration with a prevention system such as a firewall or IPS, a rule should be inserted right away to block all traffic to the registered domains.
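
As a hypothetical sketch of such an integration on a Linux gateway, using iptables (real deployments would use the firewall or IPS vendor’s own API):

```python
# Push a DROP rule for a confirmed DGA Command & Control IP (requires root).
import subprocess

def block_ip(ip: str):
    # Insert the rule at the top of the FORWARD chain so it matches first.
    subprocess.run(["iptables", "-I", "FORWARD", "-d", ip, "-j", "DROP"],
                   check=True)
```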

A great DGA detection system should perform all five steps; an excellent one should also include Step 6. Unfortunately, most DGA detection systems today stop at either Step 3 or Step 4.

Conclusion
Because DGAs are difficult to detect with signature- or reputation-based detection or prevention systems, they have become quite popular with malware developers.
An intelligent detection system is required. An excellent DGA detection system must extract domain name information from DNS transactions, perform thorough analytics to determine DGA status, check the registration status of suspected domains, correlate with network traffic inspection to assess the level of compromise, and ideally integrate with prevention systems to avoid further compromise. To reduce both false positives and false negatives, machine learning should be seriously considered. Only with comprehensive and pervasive intelligence at every stage can the threat be truly mitigated.

Resources

The repository on GitHub by Andrey Abakumov contains algorithms for generating domain names, as well as dictionaries of malicious domain names.