You might have stumbled across it: Many banks’ online banking does not recognize transactions between your different accounts as such. A transfer from your credit card to your savings account, for example, will result in higher expenses and income being displayed, even though, of course, it actually neither changes your expenses nor your income.

In this blog we describe one of our algorithms developed for a MVP of a financial management app that solves this problem. The problem is as follows: Each transaction in one of your accounts (e.g. checking accounts, credit cards, savings, etc.) could either be associated with another transaction in different accounts of yours, we then call these two an internal transaction pair, or not (it is then an external transaction).

In some cases, you can easily recognize these internal transaction pairs because, for example, the name of the account holder appears as the sender and recipient. However, it is often more difficult as the name may be spelled differently or no client name may appear, but only e.g. “credit card statement”. Our algorithm must therefore cover such cases and should be robust enough to automatically recognize as many internal transaction pairs as possible.

## Algorithm

Before we describe the algorithm in detail, here is a short overview. From the large number of potential transaction pairs, the algorithm must decide which of them are internal transaction pairs and which are not. The procedure is as follows:

- From all transactions, construct a list of all possible transaction pairs
- Calculate a dissimilarity indicator for the two transactions of each possible transaction pair
- Learn how to distinguish internal transaction pairs from external transactions by the differences in their dissimilarity indicators
- Find the most suitable of the possible transaction pairs and decide whether they are an internal transaction pair or not

### 1.Possible transaction pairs

The algorithm requires the list of transactions of all accounts of a test person. Each transaction is then a row in a table with columns that include the date, the monetary value, the subject line, the sender and recipient names and the account number – a total of $M$ features .

We can greatly reduce the number of potential pairs of transactions by grouping all transactions into groups of equal value. For example, all transactions with a value of +1000 Euro and -1000 Euro fall into a group. All possible transaction pairs then result as all possible combinations of a positive and a negative transaction within this group.

The resulting list of possible transaction pairs is thus much smaller than by a naive formation of all possible combinations of transactions. With the account data of our test persons, for example, the number of possible transaction pairs is reduced from 3 million to around 2500, which simplifies the problem considerably.

### 2. Similarity of transactions

For each possible transaction pair, we calculate a dissimilarity vector from different combinations of the $M$ features of the two transactions. To make that more vivid: For example, one entry in this difference vector describes how far apart the dates of the two transactions are, another describes how much the names of the sender and receiver of the transaction differ, and another describes whether the account numbers of the sender and receiver are the same.

Depending on the type of feature, we use different measures for dissimilarity: For account numbers we test for equality, for text-like features we use measures based either on the longest common substring, or simply test if one text contains the other.

This results in a dissimilarity vector with 12 dimensions for every possible transaction pair — how should you imagine that? Each of these dimensions describes a different aspect of the dissimilarity between the two transactions: One refers to the time interval, differences in the subject line, and so on. To illustrate the 12-dimensional dissimilarity vector more clearly in two dimensions, we use a technique called T-distributed stochastic neighbor embedding or t-SNE for short. We use the implementation of t-SNE in the open source package Scikit-learn.

Figure 1 shows the dissimilarity vectors of the approximately 2500 possible transaction pairs as colored dots. Dissimilarity vectors that differ greatly in their 12-dimensional space are represented by t-SNE as two far apart points in two dimensions, and the other way around the points of two similar dissimilarity vectors are close together. Data points belonging to internal transaction pairs are shown in blue, all others in red. You can see that there are patterns – most blue dots are in a corner – but they don’t seem to be easy to describe or even separate.

### 3. Distinguishing internal and external transaction pairs

To identify the statistical differences between external and internal transaction pairs (i.e. to distinguish the red and blue points in Figure 1), we use methods of supervised statistical learning: we mark the external transaction pairs in a test person’s data as such, and train a classification algorithm to learn the relationship between difference vector and transaction class (internal or external). Once this relationship has been learned, we can decide based on which of the possible transaction pairs are internal and which are external.

For classification into external and internal transaction pairs we use a Support Vector Machine. Put simply, while it is impossible in the two-dimensional representation of Figure 1 to achieve a clear separation of red and blue points by a line, this is often very possible in a high-dimensional space. There, the points of the two classes can often be clearly separated by a simple decision boundary – a support vector machine automates and simplifies these steps.

We use the implementation of a support vector classifier in Scikit-learn with a radial base function kernel, and adjust the parameters until the first and second type errors in the classification are negligible.

Now we can use the support vector classifier to predict for each possible internal transaction pair with the whether this pair is an internal transaction, that is, whether it describes a transaction between my own accounts, or not.

### 4. Selection from the list of possible transaction pairs

In the final step we have to make sure that every transaction is at most part of an internal transaction pair. This is not automatically the case, since the support vector classifier, as described in the last section, could classify several of the possible transaction pairs as internal, which may contain a given transaction more than once.

We want to select out the most likely transaction pairs from the list of potential pairs — or, if none of the pairs looks reasonable, classify the transaction as external. There are several ways to approach this problem:

- Formulation as a binary integer linear programming Optimization problem
- By Simulated Annealing, a Markov Chain Monte Carlo algorithm that randomly traverses the state space of possible transaction pairs until at least a local optimum is found
- A heuristic method that assigns transactions one after another to the “most appropriate” transaction pair ( not necessarily the optimal one )

We try the heuristic method first. The algorithm loops through the list of transactions with a positive value. It either assigns each of these transactions to the most similar negative transaction (as estimated by the support vector classifier), or classifies them as an external transaction if no negative transaction appears appropriate.

Alternatively, we implement a simulated annealing method, which, however, has considerably longer run times. Since we already achieve very good results with the heuristic method, we are content with this for now.

## Performance and validation

We can validate the algorithm by applying it to a list of transactions that we have not used for training and development of the algorithm (cross validation). For this we use another test person’s transactions data — and also achieve very good results here. The first and second type errors are very small (both below 5%). That means that internal and external transaction pairs can reliably be distinguished from each other.