How to Build a Fraud Detection System In-house

Detecting fraud is important. The amount of data increases year over year, and rule-based fraud detectors cannot keep up with evolving hacker technology. In this article, Igor Kaufman offers a step-by-step guide to building a fraud detection platform.
By Igor Kaufman

Everybody talks about AI-based fraud detection systems, but how do these systems actually work? What are the exact steps to build such a system in-house? How is system decision-making controlled?

You will find all the answers below.

Step 0. Data Preparation

Data is the fuel of machine learning; the better the fuel is, the faster the car can run. I’ll skip the part that describes how to aggregate the data and how to store and transport it, and jump straight to what should be done with the data to make ML magic happen.

It is important to start by cleansing data to get a specified set of features for analysis. Let’s look at payments as an example; the relevant features would be buyer details, seller details, payment amount, the time when the transaction was sent, bank details, IPs, and others. In fact, there could be hundreds of parameters. The more complex the field, the more parameters are necessary. The better we clean the data by removing dependent or correlated features, the better the final algorithm performs. Otherwise, it would be hard to tell which feature caused the prediction. Typically, data preparation and exploration can take as much time as the rest of the ML project, or even more.


Tech image: red squares are highly correlated features. The fewer red squares we have, the more interpretable the model is.
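To make the idea of removing correlated features concrete, here is a minimal sketch in plain Python. It assumes numeric features stored as lists, uses the Pearson coefficient as the correlation measure, and treats 0.9 as the cut-off; the feature names, the sample values, and the threshold are all hypothetical and would be chosen per dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to measure
    return cov / sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical payment features: the two amount columns differ only by a
# fixed exchange rate, so one of them adds no new information.
features = {
    "amount_usd": [10.0, 25.0, 40.0, 55.0, 70.0],
    "amount_eur": [9.2, 23.0, 36.8, 50.6, 64.4],
    "hour_of_day": [3, 14, 9, 22, 11],
}
selected = drop_correlated(features)
print(selected)
```

In this sketch the first feature of a correlated pair wins, so the order of the columns matters. A real project would more likely decide which one to keep based on domain knowledge.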

Step 1. Supervised Learning, or What Can You Leverage from What You Already Know?

Companies often have some type of fraud detection system in place. It might be rule-based, in which case the number of rules required becomes overwhelming and the system still struggles to filter all the suspicious cases. Or it might be AI-based and provided by a third-party vendor that processes sensitive data outside the company infrastructure.

If you decide to level up the protection and build a fraud detection system in-house, is there a way to leverage existing knowledge about fraud? In fact, there is. Supervised machine learning means that a person tells the machine what is wrong and what is right. Using trusted results from existing systems, you can expedite the learning process drastically.

Usually, there are a number of algorithms that could be used. The main goal is to find the one that works best with a specific dataset and tune its parameters to get the best balance of true positives (real fraud), false positives (not fraud marked as fraud) and false negatives (fraud that was not identified as fraud). Data scientists use a table called a confusion matrix, which counts these outcomes together with true negatives (legitimate records correctly left alone), to optimize algorithms.


Tech image: Confusion Matrix - Orange or not?
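As a rough illustration of how the confusion matrix is filled in, the sketch below counts the four outcomes from two lists of labels (1 for fraud, 0 for legitimate) and derives precision and recall from them. The labels are made-up sample data, and the function names are ours, not part of any particular library.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary fraud labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1  # real fraud, flagged
        elif not a and p:
            fp += 1  # legitimate, flagged by mistake
        elif a and not p:
            fn += 1  # real fraud, missed
        else:
            tn += 1  # legitimate, left alone
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

def precision_recall(counts):
    """Precision: share of flagged records that are fraud. Recall: share of fraud that was flagged."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels from a trusted existing system vs. the new model.
actual = [1, 0, 0, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```

Tuning a model usually means trading one of these numbers against the other: blocking more aggressively raises recall but tends to lower precision.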

The better the numbers you get here, the better the new system has learned from the historical data, and the sooner it can replace outdated ones.

Step 2. Anomaly Detection

But what if the existing algorithms miss a number of fraudulent elements? This is why we are here — to find suspicious records that could be fraudulent.

In order to do that, we need all the items, transactions, payments, or whatever we work with, to be grouped. The small and distant clusters are the suspicious elements. It means that the clustering algorithm marked them as non-typical. They may not necessarily be fraudulent, but they are worth a look and a deeper study.


Tech image: 3D representation of clusters. The real data may have hundreds of dimensions.

Later, a subject matter expert can take a look at such clusters and confirm that

  • none of the records appear to be legitimate, so all new transactions that fall into this cluster should be blocked automatically
  • or only some of the records are suspicious, and we should either tweak the cluster parameters or send all these suspicious records to a human specialist for review.
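The grouping step can be sketched with a deliberately simple algorithm. The example below uses a greedy single-pass ("leader") clustering, which is our stand-in and not necessarily what a production system would use, and flags only clusters that are too small; it does not measure how distant a cluster is. The points, the radius, and the minimum cluster size are hypothetical, and the features are assumed to be already scaled to comparable ranges.

```python
from math import dist

def leader_clusters(points, radius):
    """Greedy clustering: a point joins the first cluster whose centroid is within radius."""
    clusters = []   # each cluster is a list of point indices
    centroids = []  # running mean of each cluster's members
    for i, p in enumerate(points):
        for c, centroid in enumerate(centroids):
            if dist(p, centroid) <= radius:
                clusters[c].append(i)
                n = len(clusters[c])
                # Incremental update of the centroid with the new member.
                centroids[c] = tuple(m + (x - m) / n for m, x in zip(centroid, p))
                break
        else:
            clusters.append([i])
            centroids.append(tuple(p))
    return clusters

def suspicious_indices(points, radius, min_size):
    """Indices of points that ended up in clusters smaller than min_size."""
    clusters = leader_clusters(points, radius)
    return sorted(i for cluster in clusters if len(cluster) < min_size for i in cluster)

# Hypothetical, already-scaled transactions in two dimensions.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0),  # typical group A
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),              # typical group B
    (9.5, 0.5),                                      # isolated record
]
flagged = suspicious_indices(points, radius=1.0, min_size=2)
print(flagged)
```

The flagged indices are only candidates for review. Whether they are blocked or passed on remains the expert's call.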

Step 3. Time-series Analytics and Dynamic Limits

In general, clustering can be done on a static dataset, but expected behavior changes over time. Personal or business revenue may grow, thus increasing spending. In addition, expenditures may be seasonal; for example, spending is higher during the holiday season. These behavioral patterns must also be accommodated to make sure the system performs well. It should review abnormal transactions while adjusting the expected range dynamically to accommodate the evolving nature of transactions over time.

Predicting the low-risk corridor is a job for time-series algorithms, starting from simpler autoregression models and moving on to more sophisticated models, such as FBProphet (developed by Facebook), which takes seasonality into account.


Tech image: dark blue line - real and expected values, blue range - expected deviation, black dots - real values. Black dots outside the range are suspicious.
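A dynamic range can be approximated with something much simpler than a full forecasting model. The sketch below is not autoregression or FBProphet; it assumes a rolling window and treats the mean of the previous values plus or minus k standard deviations as the expected corridor. The window size, the multiplier k, and the spending figures are hypothetical.

```python
from statistics import mean, stdev

def dynamic_limits(values, window=5, k=3.0):
    """For each point after the first `window`, return (index, low, high) from the preceding window."""
    limits = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center, spread = mean(history), stdev(history)
        limits.append((i, center - k * spread, center + k * spread))
    return limits

def flag_outliers(values, window=5, k=3.0):
    """Indices of values that fall outside their dynamically computed corridor."""
    return [
        i for i, low, high in dynamic_limits(values, window, k)
        if not low <= values[i] <= high
    ]

# Hypothetical daily spending with one abnormal day.
daily_spend = [100, 104, 98, 102, 101, 103, 99, 250, 105, 102]
outliers = flag_outliers(daily_spend)
print(outliers)
```

Because the corridor is recomputed at every step, it follows gradual growth in spending. Note that in this sketch the abnormal value also enters the following windows and widens their corridors; excluding flagged points from the history is one possible refinement.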

Step 4. Integration

Once we’ve prepared the data and trained and tested the model, how do we integrate it with the existing infrastructure?

Regarding technology, there is nothing complicated. The ML model is wrapped in a service (e.g., REST) to expose an API, so that the rest of the system can interact with it. Then it should be tested and deployed (e.g., as a Docker container) and connected to the data sources and the decision support UI.
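Here is a minimal sketch of such a wrapper using only the Python standard library. The `/score` path, the `fraud_score` field, the port, and the `score_transaction` placeholder rule are all assumptions made for illustration; in a real service the placeholder would call the trained model, and a production deployment would more likely use a dedicated web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score_transaction(tx):
    """Placeholder for a trained model: returns a fraud score between 0 and 1."""
    # Hypothetical rule standing in for a real model's prediction.
    return 0.9 if tx.get("amount", 0) > 10_000 else 0.1

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/score":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            tx = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            tx = None
        if not isinstance(tx, dict):
            self.send_error(400, "Body must be a JSON object")
            return
        body = json.dumps({"fraud_score": score_transaction(tx)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(host="127.0.0.1", port=8080):
    """Build the HTTP server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), ScoreHandler)
```

Calling `make_server().serve_forever()` starts the service locally, after which a client can POST a JSON transaction such as `{"amount": 50000}` to `/score` and read the score from the response.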

From a risk management and validation point of view, it might make sense to run the new fraud detection system and the old one (if one exists) in parallel to make sure the new one is consistent. In addition, human validation could be introduced to mitigate the risk of wrong decisions in real life, as well as improve models through additional supervised training.
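A parallel run boils down to comparing two sets of decisions on the same transactions. As a small sketch, assuming each system's output is available as a mapping from a transaction ID to a flagged/not-flagged decision (the IDs and decisions below are invented), the agreement rate and the cases worth a human look can be computed like this:

```python
def compare_decisions(old, new):
    """Compare two {transaction_id: flagged} mappings on the IDs they share."""
    shared = sorted(old.keys() & new.keys())
    disagreements = [tx for tx in shared if old[tx] != new[tx]]
    agreement = 1 - len(disagreements) / len(shared) if shared else 0.0
    return agreement, disagreements

# Hypothetical decisions from the existing and the new system.
old_system = {"t1": False, "t2": True, "t3": False, "t4": False}
new_system = {"t1": False, "t2": True, "t3": True, "t4": False}

agreement, to_review = compare_decisions(old_system, new_system)
print(agreement, to_review)
```

The disagreements are the natural candidates for human validation, and the reviewed outcomes can later serve as labels for additional supervised training.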


At DataArt, we have solid experience harnessing AI/ML to build fraud detection platforms, as well as predictive and recommendation systems. Contact us now if you are looking for a technical partner to build a fraud detection system.
