- Artificial Intelligence
The highly nuanced structure of the programmatic ad chain has opened a portal for fraudsters into multi-billion-dollar ad spend. Today they steal an estimated 20% (or $66 billion) of global ad spend. To do so, they use a variety of techniques, and click fraud is one of the most common: fake clicks can sometimes account for up to 90% of all registered interactions. The most frustrating part is that rule-based ad fraud detection tools can't help much, since fraudsters are continuously inventing new ways to trick them.
Luckily, machine learning will save your good name as an ad traffic provider. And here’s why.
Click fraud is a set of techniques for dishonestly generating clicks on a PPC (pay-per-click) advertisement. In most cases, the end goal is either to inflate the publisher's revenue or to drain the advertiser's budget. Sounds easy, right?
Well, it was easy back in the day when fraudulent publishers registered their sites at Google AdSense and then clicked on the ads themselves. Today, fraudsters use a myriad of ways to click on an ad “illegally.” Here are some of them:
Click fraud (as well as any other type of ad fraud) can affect anyone in the digital advertising pipeline, financially or reputationally. So, an efficient programmatic ad fraud detection tool is a must-have solution in your advertising software arsenal, whether you are an advertiser, an ad exchange, a media buyer, or a publisher. And given that click fraud is becoming more and more sophisticated, we strongly recommend you pay attention to machine learning solutions.
Machine learning (ML) uses historical data to classify new data elements or to predict an outcome. It can also identify patterns — including hidden ones — and detect relationships between data sets. And that is just the tip of the iceberg. Here are some other "superpowers" of ML:
The potential of ML for ad fraud detection is immense. According to Juniper Research, ML will save the global ad market over $10 billion that would otherwise go into fraudsters' pockets. But how exactly does ML detect false clicks? There's no one-size-fits-all answer, since ML is an umbrella term for countless algorithms. Let's take a closer look at some of the most common algorithms used for click fraud detection.
Logistic regression is a basic ML algorithm to solve binary classification problems (when you are choosing between two answers). It calculates the probability of a certain outcome, thus helping you to predict whether something is true or not. In terms of programmatic fraud detection, a logistic regression algorithm will help you decide whether click fraud takes place or not.
As shown in the picture, a logistic function looks like a big “S” that tells the probability of click fraud. It goes from “0” to “1” where “0” means that the clicks are legitimate and “1” indicates fraud.
While the sample logistic regression in the picture calculates the ad fraud probability based on one criterion (the number of clicks on the same ad within ten seconds), it can also take multiple variables into consideration, such as device ID, location of the clicker, or their behavior before and after clicking on the ad. The hardest part here is deciding on the relevant variables. For instance, the zodiac sign of the clicker doesn't influence whether they are a fraudster or not.
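The single-variable case described above can be sketched in a few lines with scikit-learn. The feature (clicks on the same ad within ten seconds) and the training data below are hypothetical, invented purely for illustration:

```python
# Minimal sketch: scoring click fraud with logistic regression.
# Feature and labels are made up for illustration only.
from sklearn.linear_model import LogisticRegression

# Each row: [clicks on the same ad within 10 seconds]
X = [[1], [2], [2], [3], [8], [12], [15], [20]]
# Labels: 0 = legitimate, 1 = fraudulent
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# The S-shaped logistic function maps a click count
# to a probability between 0 (legitimate) and 1 (fraud).
prob_fraud = model.predict_proba([[10]])[0][1]
print(f"Probability of fraud at 10 clicks: {prob_fraud:.2f}")
```

Adding more variables means adding more columns to each row of `X`; the model then learns a weight for each one.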
The decision tree is another classification algorithm. Similar to logistic regression, it helps you to decide whether something is true or not, but the approach is totally different. Basically, a decision tree consists of a number of “if-then” components, which predict the probability of certain outcomes.
Suppose you want to use a decision tree to determine whether a publisher is fraudulent based on the number of clicks from the same IP address. In this case, we'll first check whether the publisher received more than three such clicks. If "no", the publisher is considered legitimate. If "yes", we'll then check whether the clicker's IP address matches their location. If it doesn't, the publisher is fraudulent.
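The decision tree described above is just a chain of "if-then" checks, which can be written out directly. The three-click threshold and the IP/location check are the hypothetical constraints from the example:

```python
# The example decision tree as explicit "if-then" rules.
# Threshold and checks are hypothetical, taken from the example above.
def is_fraudulent(clicks_from_same_ip: int, ip_matches_location: bool) -> bool:
    if clicks_from_same_ip <= 3:
        return False   # few clicks: publisher looks legitimate
    if ip_matches_location:
        return False   # many clicks, but IP and location agree
    return True        # many clicks and a mismatch: likely fraud

print(is_fraudulent(2, True))    # legitimate
print(is_fraudulent(5, False))   # fraudulent
```

In practice you wouldn't hand-code the rules — a library such as scikit-learn learns the thresholds from labeled data — but the learned tree has exactly this branching structure.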
Decision trees are pretty easy to train and can be quite accurate if the provided constraints are true. In other words, you must be certain that if, for example, the number of repetitive clicks exceeds three, then they are fraudulent. Otherwise, be prepared for false positives or false negatives. Even one incorrect parameter is enough to produce a wrong prediction. Because of such sensitivity, this ML algorithm comes in handy only with a limited number of variables.
While decision trees are very sensitive to each constraint, a random forest is here to make up for this limitation. So, what is a random forest in machine learning? Basically, it’s an ensemble of decision trees, each giving a probability of a certain outcome based on a particular feature of a data set in question. Then, the estimations of each tree are combined, and the most common one “wins”.
Now, let's get back to our click fraud case. In the previous algorithm, one decision tree considers everything that affects the probability of click fraud, such as the number of clicks, IP address, and the clicker's location. The random forest, on the other hand, "splits" the logic of our estimations into multiple individual trees, each focusing on a certain factor. Thus, we'll have one tree that estimates the probability of click fraud based on the number of clicks, another that bases its prediction on the IP address, and so on.
Due to its structure, the random forest algorithm lets you consider a large number of variables without the fragility of a single tree. This makes it extremely powerful.
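A minimal random forest sketch with scikit-learn might look like the following. The features and training rows are invented for illustration: clicks within ten seconds, an IP/location mismatch flag, and seconds spent on the page after the click.

```python
# Sketch: random forest for click fraud, on hypothetical features
# [clicks in 10 s, IP/location mismatch (0/1), seconds on page after click].
from sklearn.ensemble import RandomForestClassifier

X = [
    [2, 0, 45], [1, 0, 120], [3, 0, 60], [2, 0, 90],   # legitimate
    [15, 1, 1], [20, 1, 0], [12, 1, 2], [18, 0, 1],    # fraudulent
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Each of the 100 trees is trained on a random subset of rows and
# features; the forest's final answer is the majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# High click count, IP mismatch, instant bounce -> likely fraud.
print(forest.predict([[16, 1, 1]]))
```

Because each tree sees only part of the data, no single noisy feature or mislabeled constraint can dominate the prediction, which is exactly what compensates for the sensitivity of a lone decision tree.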
K-nearest neighbors (kNN) is a widely used ML algorithm for finding similarities. Unlike the random forest and decision tree, it stores all available data instances, which are segmented into clusters (or classes) based on shared features.
The kNN algorithm relies on the assumption that the more similar two data samples are, the closer they sit in the model's coordinate system. Given that, once a new data instance is fed into kNN, the model finds its closest neighbors (the data instances that share the most features with the new instance) and assigns it their class.
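The "closest neighbors vote" idea can be sketched with scikit-learn as follows. The two features are hypothetical (average clicks per visitor and average seconds on the page after a click), chosen only to give each publisher a point in a two-dimensional space:

```python
# Sketch: kNN classifies a new publisher by majority vote among its
# k nearest neighbors. Features are hypothetical:
# [avg clicks per visitor, avg seconds on page after click].
from sklearn.neighbors import KNeighborsClassifier

X = [
    [1.1, 95], [0.9, 120], [1.3, 80], [1.0, 150],   # non-fraudulent
    [9.5, 2], [12.0, 1], [8.7, 3], [11.2, 0],       # fraudulent
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Many clicks and near-zero dwell time: the point lands among the
# fraudulent neighbors, so the majority vote is "fraudulent".
print(knn.predict([[10.0, 1]]))
```

One practical caveat: because kNN measures raw distances, features on very different scales should be normalized first (e.g. with `StandardScaler`), or the largest-ranged feature will dominate the vote.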
For example, a kNN model for fraudulent publisher detection will have two clusters: "fraudulent publishers" and "non-fraudulent publishers". Let's assume that based on historical data, the following characteristics are considered "fraudulent":
These characteristics hold for the majority of illegitimate ad placements. At the same time, based on the historical data, not every nefarious publisher meets all these conditions — deviations are possible. What's more, the more typical a characteristic is, the more weight it carries. As a result, a new publisher's case is assigned to the "fraudulent" group because it has characteristics 1, 4, and 5, which are the most typical ones. These characteristics place the new publisher closer to its fraudulent counterparts than to the non-fraudulent ones.
With its capability to rummage through tons of data and find patterns invisible to the naked eye, an ML-based ad fraud detection platform can outsmart even the most resourceful fraudster. The only catch is that machine learning works miracles only in capable hands. Luckily, you can find the needed expertise at Postindustria.
We know the ins and outs of programmatic advertising, including the intricacies of ad fraud. Plus, we have numerous ML projects under our belt. We'll study your case carefully, define the right combination of algorithms, and develop a solution that covers your specific ad fraud detection needs.