Programmatic Ad Fraud Detection: Machine Learning Against False Clicks

30 Jun 2022

The highly nuanced structure of the programmatic ad chain has given fraudsters a gateway to multi-billion-dollar ad spend. Today they are stealing 20% (or $66 billion) of global ad spend. To do so, they use different techniques, and click fraud is one of the most common methods: fake clicks can sometimes account for 90% of all registered interactions. And the most frustrating part is that rule-based ad fraud detection tools can’t help much, given that fraudsters are continuously inventing new ways to trick them.

Luckily, machine learning will save your good name as an ad traffic provider. And here’s why.

What is click fraud?

Click fraud is a set of techniques to dishonestly generate clicks on a pay-per-click (PPC) advertisement. In most cases, the end goal is either to increase the publisher’s revenue or to drain the advertiser’s budget. Sounds easy, right?

Well, it was easy back in the day when fraudulent publishers registered their sites at Google AdSense and then clicked on the ads themselves. Today, fraudsters use a myriad of ways to click on an ad “illegally.” Here are some of them:

Crowdsourcing involves persuading a website’s visitors into clicking on an advertisement that doesn’t interest them. “Persuasion methods” can be different, from a note saying “click on this banner to support us” to a fake button for, say, a video that redirects a clicker to the advertiser’s site.
Click farms also involve actual human beings. But in this case, they are clicking on different ads from different devices and IPs all day long for financial reward. Aside from clicking, such farms can mimic any user interactions, such as viewing something, scrolling, moving a mouse pointer, installing something, playing games, and beyond.
Hit inflation attack is another way to trick a legitimate user into clicking something. But in this type of fraud, a legitimate user is redirected to the advertiser’s site before landing on the page they intended to visit. Interestingly, the user doesn’t see the ad or the advertiser’s site; the page they want just takes a bit longer to load.
Botnets are large networks of devices that have been infected and are controlled by malware. The compromised computers are instructed to visit websites and click on ads. For the most part, their click patterns are quite generic. But some botnets are built to mimic the behavior of real users and are harder to detect.

Click fraud (as well as any other type of ad fraud) can affect anyone in the digital advertising pipeline, financially or reputationally. So, an efficient programmatic ad fraud detection tool is a must-have solution in your advertising software arsenal, whether you are an advertiser, an ad exchange, a media buyer, or a publisher. And given that click fraud is becoming more and more sophisticated, we strongly recommend you pay attention to machine learning solutions.

How an ML-based ad fraud detection platform can help you identify click fraud

Machine learning (ML) uses historical data to classify new data elements or to predict outcomes. Besides, it can identify patterns, including hidden ones, and detect relationships between data sets. And these are just the tip of the iceberg. Here are some other “superpowers” of ML:

ML can quickly sift through large volumes of data. This is particularly useful in real-time bidding where ad placements are selected within milliseconds.
Many ML models can improve themselves and become more accurate with every new case. It makes them applicable to fast-evolving click fraud patterns.
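Some ML techniques (certain neural networks such as autoencoders, for example) can work with unlabeled data.
Since ML models learn from data rather than from hand-written rules, they can be less prone to an analyst’s assumptions, provided the training data is representative.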
You can combine several ML algorithms for a more accurate outcome.

The potential of ML for ad fraud detection is immense. According to Juniper Research, ML will save the global ad market over $10 billion that would otherwise go into fraudsters’ pockets. But how exactly does ML detect false clicks? Well, there’s no one-size-fits-all answer, since ML is an umbrella term for many different algorithms. Let’s take a closer look at some of the most common algorithms used for click fraud detection.

Logistic regression

Logistic regression is a basic ML algorithm to solve binary classification problems (when you are choosing between two answers). It calculates the probability of a certain outcome, thus helping you to predict whether something is true or not. In terms of programmatic fraud detection, a logistic regression algorithm will help you decide whether click fraud takes place or not.

As shown in the picture, a logistic function looks like a big “S” that tells the probability of click fraud. It goes from “0” to “1” where “0” means that the clicks are legitimate and “1” indicates fraud.

While the sample logistic regression in the picture calculates the ad fraud probability based on one criterion (the number of clicks on the same ad within ten seconds), it can also take multiple variables into consideration, such as the device ID, the location of the clicker, or their behavior before and after clicking on the ad. The hardest part here is to decide on the relevant variables. For instance, the zodiac sign of the clicker doesn’t influence whether they are a fraudster or not.
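To make the “S”-curve more tangible, here is a minimal Python sketch of how a logistic regression turns several variables into a fraud probability. The feature names, the weights, and the bias are all made up for illustration; a real model would learn them from labeled click data.

```python
import math

# Hypothetical weights, chosen by hand for illustration only.
# A trained model would learn these values from historical clicks.
BIAS = -4.0
WEIGHTS = {
    "clicks_in_10s": 0.9,         # clicks on the same ad within ten seconds
    "ip_geo_mismatch": 1.5,       # 1 if the IP does not match the clicker's location
    "seconds_on_landing": -0.05,  # time spent on the advertiser's page after the click
}

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fraud_probability(features: dict) -> float:
    """Weighted sum of the features, passed through the logistic function."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return sigmoid(z)

normal = {"clicks_in_10s": 1, "ip_geo_mismatch": 0, "seconds_on_landing": 40}
suspicious = {"clicks_in_10s": 8, "ip_geo_mismatch": 1, "seconds_on_landing": 1}

print(round(fraud_probability(normal), 3))      # close to 0: looks legitimate
print(round(fraud_probability(suspicious), 3))  # close to 1: looks fraudulent
```

An irrelevant variable such as the zodiac sign would simply end up with a weight near zero after training, which is why picking relevant variables up front matters more than the formula itself.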

Decision trees

The decision tree is another classification algorithm. Similar to logistic regression, it helps you to decide whether something is true or not, but the approach is totally different. Basically, a decision tree consists of a number of “if-then” components, which predict the probability of certain outcomes.

Suppose you want to use a decision tree to determine whether a publisher is fraudulent based on the number of clicks from the same IP address. In this case, we’ll first check whether the number of clicks the publisher had is more than three. If it’s “no”, the publisher is considered legitimate. If “yes”, we’ll then check whether the IP of the clicker matches their location. If it doesn’t, the publisher is fraudulent.

Decision trees are pretty easy to train and can be quite accurate if the provided constraints are true. In other words, you must be certain that if, for example, the number of repetitive clicks exceeds three, then they are fraudulent. Otherwise, be prepared for false positives or false negatives. Even one incorrect parameter is enough to produce a wrong prediction. Because of such sensitivity, this ML algorithm comes in handy only with a limited number of variables.
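The publisher check described above can be sketched as a pair of nested “if-then” rules. The class and field names below are hypothetical, and the branch where the IP does match the location is not spelled out in the example, so treating it as “legitimate” is an assumption.

```python
from dataclasses import dataclass

@dataclass
class ClickBatch:
    clicks_from_same_ip: int   # clicks the publisher received from one IP address
    ip_matches_location: bool  # whether that IP agrees with the clicker's location

MAX_CLICKS = 3  # threshold used in the example above

def classify_publisher(batch: ClickBatch) -> str:
    # First split: is the number of clicks above the threshold?
    if batch.clicks_from_same_ip <= MAX_CLICKS:
        return "legitimate"
    # Second split: does the IP agree with the clicker's location?
    if batch.ip_matches_location:
        return "legitimate"  # assumption: the example leaves this branch open
    return "fraudulent"

print(classify_publisher(ClickBatch(clicks_from_same_ip=2, ip_matches_location=False)))
print(classify_publisher(ClickBatch(clicks_from_same_ip=7, ip_matches_location=False)))
```

In practice the thresholds and the order of the splits are learned from data rather than written by hand, but the resulting model has exactly this shape, which is also why a single wrong threshold flips the prediction.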

The random forest algorithm explained

While decision trees are very sensitive to each constraint, a random forest makes up for this limitation. So, what is a random forest in machine learning? Basically, it’s an ensemble of decision trees, each trained on a random sample of the data and a random subset of its features, and each giving its own estimate of the outcome. Then the estimates of all trees are combined, and the most common one “wins”.

Now, let’s get back to our click fraud case. In the previous algorithm, one decision tree considers everything that affects the probability of click fraud, such as the number of clicks, IP address, and the clicker’s location. The random forest, on the other hand, spreads the logic of our estimations across many individual trees, each of which sees only some of the factors. Thus, one tree may base its estimate mainly on the number of clicks, another one on the IP address, and so on.

Due to its structure, the random forest algorithm can handle a large number of variables while staying far less sensitive to any single one of them. This makes it extremely powerful.
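Here is a deliberately tiny sketch of the idea, using only the Python standard library. To keep it short, every “tree” is a one-split stump that looks at a single randomly chosen feature and is trained on a bootstrap sample of the data. The data set, the feature order, and all function names are invented for illustration; a production system would more likely rely on a library implementation such as scikit-learn’s RandomForestClassifier.

```python
import random
from collections import Counter

# Made-up rows: (clicks_from_same_ip, ip_geo_mismatch, bounce_rate)
ROWS = [(1, 0, 0.2), (2, 0, 0.3), (1, 0, 0.1), (3, 0, 0.4), (2, 1, 0.2),
        (9, 1, 0.9), (7, 1, 0.8), (12, 0, 0.95), (6, 1, 0.7), (8, 1, 0.85)]
LABELS = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 1 = fraudulent, 0 = legitimate

def train_stump(rows, labels, feature):
    """One-split tree: predict fraud when the feature exceeds a threshold.
    Picks the threshold that misclassifies the fewest training rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        errors = sum((row[feature] > threshold) != bool(label)
                     for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return feature, best[1]

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Random feature: each stump looks at only one factor.
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump(sample_rows, sample_labels, feature))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(int(row[feature] > threshold) for feature, threshold in forest)
    return votes.most_common(1)[0][0]

forest = train_forest(ROWS, LABELS)
print(predict(forest, (10, 1, 0.9)))  # a publisher that looks fraudulent
print(predict(forest, (1, 0, 0.1)))   # a publisher that looks legitimate
```

A single stump trained on a noisy feature can easily be wrong, but it is outvoted by the rest of the forest, which is the property that makes the ensemble more robust than any one tree.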

K-nearest neighbors

K-nearest neighbors (kNN) is a widely used ML algorithm that classifies data by similarity. Unlike the random forest and decision tree, it doesn’t build an explicit model: it stores all available data instances, which are grouped into classes based on shared features.

The kNN algorithm relies on the assumption that the more similar two data samples are, the closer they lie in the kNN’s coordinate system. Given that, once a new data instance is fed into kNN, the model finds its k closest neighbors (namely, the data instances that share the most features with this new instance) and assigns it the class most common among them.

For example, a kNN model for fraudulent publisher detection will have two classes: “fraudulent publishers” and “non-fraudulent publishers”. Let’s assume that based on historical data, the following characteristics are considered “fraudulent”:

Their traffic reports show fewer than 3000 unique IP addresses (1)
Every day the ads they publish have at least five consecutive clicks (2)
Consecutive clicks are performed from the same IP address (3)
There are IP addresses that don’t coincide with the clicker’s location (4)
Most clickers leave the publisher’s website shortly after clicking on the ad (5)

These characteristics are true for the majority of illegitimate ad placements. At the same time, based on the historical data, not every nefarious publisher meets all these conditions, so deviations are possible. What’s more, the more typical a certain characteristic is, the more weight it carries. For example, a new publisher is assigned to the “fraudulent” group because it has characteristics 1, 4, and 5, which are the most typical ones. These characteristics make the new publisher closer to its fraudulent counterparts than to the non-fraudulent ones.
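The example above can be sketched in a few lines of Python. Each publisher is encoded as five 0/1 flags, one per characteristic from the list, and the historical rows are invented for illustration. As a simplification, all characteristics carry the same weight here, and the distance is simply the number of flags on which two publishers differ.

```python
from collections import Counter

# (characteristics 1..5 as 0/1 flags, known class); made-up historical data
HISTORY = [
    ((1, 1, 1, 1, 1), "fraudulent"),
    ((1, 0, 1, 1, 1), "fraudulent"),
    ((1, 1, 0, 1, 1), "fraudulent"),
    ((1, 0, 0, 1, 1), "fraudulent"),
    ((0, 0, 0, 0, 0), "non-fraudulent"),
    ((0, 1, 0, 0, 0), "non-fraudulent"),
    ((0, 0, 0, 0, 1), "non-fraudulent"),
    ((1, 0, 0, 0, 0), "non-fraudulent"),
]

def distance(a, b):
    """Hamming distance: number of characteristics on which two publishers differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(sample, history, k=3):
    # Sort stored publishers by distance to the new one and keep the k closest.
    nearest = sorted(history, key=lambda item: distance(sample, item[0]))[:k]
    # The class that is most common among the neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

new_publisher = (1, 0, 0, 1, 1)  # has characteristics 1, 4, and 5
print(knn_classify(new_publisher, HISTORY))  # fraudulent
```

To reflect the point that more typical characteristics matter more, the distance function could multiply each mismatch by a per-characteristic weight; the rest of the algorithm stays the same.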

Conclusion

With its capabilities to rummage through tons of data and find patterns invisible to the naked eye, an ML-based ad fraud detection platform can outsmart even the most resourceful fraudster. But the only problem is, machine learning can work miracles only in capable hands. Luckily, you can find the needed expertise at Postindustria.

We know the ins and outs of programmatic advertising, including the intricacies of ad fraud. Plus, there are numerous ML projects under our belt. We’ll study your case carefully, define a perfect combination of algorithms, and develop a solution that specifically covers your ad fraud detection needs.