Artificial Intelligence Primer
Artificial Intelligence and Machine Learning are at the forefront of what is being called the world’s fourth industrial revolution. Since the beginning of the human race, man has strived to improve how efficiently we live and work. At first, humans relied on simple manual labor and ingenuity. We believe this is how man produced things like the Pyramids, the Great Wall of China, and Stonehenge. Then came the first industrial revolution, which introduced mechanization, steam, and water power and brought advances in production, travel, and urbanization. The second revolution was sparked by mass production and electricity. The third was marked by the introduction of electronic and digital technologies, which brought computers and the internet. Today we are entering a new era enabled by massive advances in, and practical applications of, Artificial Intelligence and Machine Learning.
MAN vs. MACHINE
Artificial intelligence aims to help humans operate more efficiently by dramatically reducing the time, money, and human smarts required to perform routine tasks. In a nutshell, computers are being given self-learning capabilities so that they can accurately predict outcomes, identify patterns, and automatically make adjustments based on both past and current information. As a result, the machine becomes more efficient and, in some cases, as capable as a human at a given task.
The potential of computers becoming as smart as (or even smarter than) humans in carrying out certain tasks raises the debate of “man vs. machine”. Regardless of one’s belief, one thing we can all agree on is that humans have something that computers will likely never have: emotion, intuition, and gut feel.
When people debate the topic of artificial intelligence, they often argue about which machine learning categories or algorithms are best. Machine learning algorithms are generally categorized into three types: unsupervised, which learns without labeled data; supervised, which learns from labeled data; and reinforcement, which learns from reward feedback on its actions rather than from labels. (Semi-supervised learning, which combines a small amount of labeled data with unlabeled data, sits between the first two.) Each category includes more specific algorithms, such as KNN, K-means, Decision Tree, SVM, Artificial Neural Networks, Q-learning, etc. So, which one is better? Well, like anything in life, everything has pros and cons, and when it comes to machine learning, I tend not to debate the model itself, but rather redirect the conversation to the quality of the data. Machine learning models run on top of data, and without the appropriate amount, quality, and types of data, a model can be rendered useless no matter how good it is in theory. This is not to diminish the impact of selecting the right machine learning algorithm. The data and the algorithm must complement each other to solve specific use cases.
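To make the difference between learning with and without labels concrete, here is a minimal sketch in plain Python. It is only an illustration: the transaction amounts, the "normal" and "suspicious" labels, and the function names are all made up for this example. The first function is a one-nearest-neighbor classifier (a simple case of KNN), which needs labeled data; the second is a one-dimensional K-means with two clusters, which groups the same values without ever seeing a label.

```python
from statistics import mean

# Hypothetical labeled data: (transaction amount, label). Values are made up.
labeled = [
    (12.0, "normal"),
    (15.0, "normal"),
    (18.0, "normal"),
    (950.0, "suspicious"),
    (1200.0, "suspicious"),
]


def nearest_neighbor_predict(train, query):
    """Supervised: return the label of the labeled point closest to query."""
    _, closest_label = min(train, key=lambda pair: abs(pair[0] - query))
    return closest_label


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled 1-D values into two clusters (K-means, k=2)."""
    low, high = min(values), max(values)  # initial centroids
    if low == high:
        # All values are identical, so there is only one cluster to find.
        return sorted(values), []
    cluster_low, cluster_high = list(values), []
    for _ in range(iterations):
        # Assign each value to the nearer centroid, then recompute centroids.
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(cluster_low), mean(cluster_high)
    return sorted(cluster_low), sorted(cluster_high)


if __name__ == "__main__":
    print(nearest_neighbor_predict(labeled, 1000.0))
    print(two_means([amount for amount, _ in labeled]))
```

The supervised function can name a new transaction "suspicious" only because someone labeled earlier examples; the unsupervised one finds the two groups on its own but cannot say what they mean. In both cases the result is only as good as the data fed in.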
DATA IS PARAMOUNT
At Aella Data we started our company with a priority mission of collecting data, lots of data, and, more importantly, the right types of data to solve the breach detection problem. Once the data is collected, we sanitize it by performing deduplication, normalization, and a number of other steps. Next, we correlate the data with other bits of information, such as threat intelligence, the disposition of a file download, the geographic location of an IP address, and more. This enrichment gives better context to the dataset as a whole. The result is clean data enriched with context. Only after these important tasks are completed do we perform machine learning.
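The sanitize-then-enrich flow described above can be sketched in a few lines of Python. This is not Aella Data's implementation; it is a toy illustration under stated assumptions. The field names (`src_ip`, `file_hash`), the lookup tables, and their entries are hypothetical stand-ins for real threat-intelligence and geolocation feeds, and normalizing before deduplicating is a choice made here so that records differing only in formatting collapse into one.

```python
# Hypothetical lookup tables standing in for real threat-intelligence and
# geolocation feeds; the entries are made up for illustration.
FILE_DISPOSITION = {"abc123": "malicious"}
IP_COUNTRY = {"203.0.113.7": "JP"}


def normalize(event):
    """Trim whitespace and lower-case the hash so equal records compare equal."""
    return {
        "src_ip": event["src_ip"].strip(),
        "file_hash": event["file_hash"].strip().lower(),
    }


def deduplicate(events):
    """Drop repeated records, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for event in events:
        key = (event["src_ip"], event["file_hash"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def enrich(event):
    """Attach context: file disposition and the country of the source IP."""
    enriched = dict(event)
    enriched["file_disposition"] = FILE_DISPOSITION.get(event["file_hash"], "unknown")
    enriched["country"] = IP_COUNTRY.get(event["src_ip"], "unknown")
    return enriched


def prepare(raw_events):
    """Normalize, deduplicate, then enrich, before any machine learning runs."""
    cleaned = deduplicate([normalize(e) for e in raw_events])
    return [enrich(e) for e in cleaned]


# Made-up raw events; the first two differ only in formatting.
RAW_EVENTS = [
    {"src_ip": " 203.0.113.7 ", "file_hash": "ABC123"},
    {"src_ip": "203.0.113.7", "file_hash": "abc123"},
    {"src_ip": "198.51.100.4", "file_hash": "def456"},
]

if __name__ == "__main__":
    for record in prepare(RAW_EVENTS):
        print(record)
```

The output of `prepare` is what a model would actually train on: fewer, cleaner records, each carrying the context needed to judge it.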
AI WITH LIMITED VS COMPLETE DATA
Let’s look closer at an example of how banks perform credit card fraud detection. If a customer normally uses their credit card only in San Jose, California, but travels to Tokyo, Japan, for the first time and tries to use the card there, some banks will flag that as an anomaly and deactivate the card. This often leaves the customer embarrassed and frustrated when a merchant tells them the card is declined. While this truly may be a “machine learned” anomaly, it may not warrant deactivating the card, as this may be legitimate usage.
The root of the above problem is typically that the data itself is singular (only the location of the card usage) and lacks context, like the time the card was last used, where it was used, or how it was used. If a system were to correlate other bits of information, such as the time, the location, the distance between locations, the reputation of a location, or how the card was used (at a card terminal or on a web site, for example), a machine learning algorithm could better identify actual fraud.
Take another example: a card is used in San Jose, California, at 4:00 PM PST, and then used again in a small city in Ukraine at 5:00 PM PST the same day. The probability of fraud is much higher here than in the previous example. The correlated pieces of data that lead to this conclusion are the time it would take to travel from San Jose to Ukraine, which is far more than the one hour between the two uses, and the reputation of the second location (a small, rarely visited city).
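The travel-time reasoning in the two examples above can be expressed as a simple check: compute the great-circle distance between two card uses and the speed a traveler would need to cover it in the elapsed time. The sketch below is an illustration, not how any particular bank does it. The `CardUse` type, the 1,000 km/h speed ceiling (roughly the speed of an airliner), the dates, and the coordinates (approximate city centers, plus an arbitrary point in central Ukraine) are all assumptions, and timestamps are assumed to share one time zone.

```python
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CardUse:
    """One card transaction; timestamps are assumed to share one time zone."""
    when: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points on Earth."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def is_impossible_travel(previous, current, max_speed_kmh=1000.0):
    """True if moving between the two uses would require exceeding max_speed_kmh.

    max_speed_kmh is an assumed ceiling, not an industry standard.
    """
    distance = haversine_km(previous.lat, previous.lon, current.lat, current.lon)
    hours = (current.when - previous.when).total_seconds() / 3600
    if hours <= 0:
        # Same instant (or out-of-order records): any distance is impossible.
        return distance > 0
    return distance / hours > max_speed_kmh


# Illustrative data: dates and coordinates are assumptions for this sketch.
SAN_JOSE = CardUse(datetime(2024, 1, 10, 16, 0), 37.3382, -121.8863)
UKRAINE = CardUse(datetime(2024, 1, 10, 17, 0), 49.0, 32.0)      # one hour later
TOKYO = CardUse(datetime(2024, 1, 13, 16, 0), 35.6762, 139.6503)  # three days later

if __name__ == "__main__":
    print(is_impossible_travel(SAN_JOSE, UKRAINE))
    print(is_impossible_travel(SAN_JOSE, TOKYO))
```

With location alone, both the Tokyo and the Ukraine transactions look equally anomalous. Adding a single extra piece of context, the elapsed time, is enough to separate a plausible trip from a physically impossible one. A real system would feed a signal like this into a model alongside other features, such as location reputation, rather than act on it alone.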
This illustrates how Artificial Intelligence can be very useful for repetitive tasks that involve lots of data, which humans grow tired of performing, and for analyzing that data to solve problems. But will the technology replace humans? I tend to think not. AI may get you 90% or more of the way there in solving repetitive tasks, but the remaining effort, up to 10%, will always be needed to make the final decision on a problem. Furthermore, as with other advances in efficiency, we can reuse our freed-up time to do even more work than before. Is one machine learning algorithm better than another? I believe the answer lies in understanding the problem one is trying to solve, and I also believe that the quality of the data is as important as the algorithm itself.
SVP Product Line Management