## Overview

In this post we’ll train an autoencoder to detect credit card fraud. We will also demonstrate how to train Keras models in the cloud using CloudML.

The basis of our model will be the Kaggle Credit Card Fraud Detection dataset, which was collected during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

The dataset contains credit card transactions made by European cardholders over a two-day period in September 2013. There are 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

## Reading the data

After downloading the data from Kaggle, you can read it into R with `read_csv()`:
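For example, assuming the Kaggle file was saved as `creditcard.csv` in the working directory:

```
library(readr)

# Read the transactions; Time is parsed as a plain number (seconds since the
# first transaction in the dataset).
df <- read_csv("creditcard.csv", col_types = list(Time = col_number()))
```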

The input variables consist only of numerical values which are the result of a PCA transformation. In order to preserve confidentiality, no additional information about the original features was provided. The features V1, …, V28 were obtained with PCA. There are, however, 2 features (*Time* and *Amount*) that were not transformed. *Time* is the seconds elapsed between each transaction and the first transaction in the dataset. *Amount* is the transaction amount and could be used for cost-sensitive learning. The *Class* variable takes value 1 in case of fraud and 0 otherwise.

## Autoencoders

Since only 0.172% of the observations are frauds, we have a highly unbalanced classification problem. With this kind of problem, traditional classification approaches usually don’t work very well because we have only a very small sample of the rarer class.

An autoencoder is a neural network that is used to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. For this problem we will train an autoencoder to encode non-fraud observations from our training set. Since frauds are supposed to have a different distribution than normal transactions, we expect that our autoencoder will have higher reconstruction errors on frauds than on normal transactions. This means we can use the reconstruction error as a quantity that indicates whether a transaction is fraudulent or not.

If you want to learn more about autoencoders, a good starting point is this video from Larochelle on YouTube and Chapter 14 of the Deep Learning book by Goodfellow et al.

## Visualization

For an autoencoder to work well we have a strong initial assumption: that the distribution of variables for normal transactions is different from the distribution for fraudulent ones. Let’s make some plots to verify this. Variables were transformed to a `[0,1]` interval for plotting.
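One way to produce such plots — a sketch, not necessarily the exact code used for the figures — is to reshape the data to long format, rank-transform each variable to `[0,1]`, and draw one density per variable colored by class (assuming the data frame is named `df`):

```
library(tidyr)
library(dplyr)
library(ggplot2)

df %>%
  gather(variable, value, -Class) %>%            # long format: one row per (variable, value)
  group_by(variable) %>%
  mutate(value = percent_rank(value)) %>%        # rank-transform each variable to [0,1]
  ungroup() %>%
  ggplot(aes(x = value, fill = as.factor(Class))) +
  geom_density(alpha = 0.3) +
  facet_wrap(~variable, scales = "free_y")
```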

We can see that the distributions of variables for fraudulent transactions are very different from those for normal ones, except for the *Time* variable, which seems to have exactly the same distribution.

## Preprocessing

Before the modeling steps we need to do some preprocessing. We will split the dataset into train and test sets and then min-max normalize our data (this is done because neural networks work much better with small input values). We will also remove the *Time* variable since it has exactly the same distribution for normal and fraudulent transactions.

Based on the *Time* variable we will use the first 200,000 observations for training and the rest for testing. This is good practice because when using the model we want to predict future frauds based on transactions that happened before.
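A minimal sketch of this split (assuming the data was read into `df` as above), which also drops the *Time* variable:

```
library(dplyr)

# First 200,000 observations (by Time) for training, the rest for testing.
df_train <- df %>% filter(row_number(Time) <= 200000) %>% select(-Time)
df_test  <- df %>% filter(row_number(Time) >  200000) %>% select(-Time)
```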

Now let’s work on the normalization of inputs. We created 2 functions to help us. The first one gets descriptive statistics about the dataset that are used for scaling. Then we have a function to perform the min-max scaling. It’s important to note that we applied the same normalization constants to the training and test sets.

```
library(purrr)

#' Gets descriptive statistics for every variable in the dataset.
get_desc <- function(x) {
  map(x, ~list(
    min = min(.x),
    max = max(.x),
    mean = mean(.x),
    sd = sd(.x)
  ))
}

#' Given a dataset and normalization constants it will create a min-max
#' normalized version of the dataset.
normalization_minmax <- function(x, desc) {
  map2_dfc(x, desc, ~(.x - .y$min) / (.y$max - .y$min))
}
```

Now let’s create normalized versions of our datasets. We also transformed our data frames to matrices, since this is the format expected by Keras.
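A sketch of this step (assuming the `df_train`/`df_test` split from before), reusing the training-set constants for the test set:

```
# Normalization constants computed on the training set only.
desc <- df_train %>%
  select(-Class) %>%
  get_desc()

x_train <- df_train %>%
  select(-Class) %>%
  normalization_minmax(desc) %>%
  as.matrix()

x_test <- df_test %>%
  select(-Class) %>%
  normalization_minmax(desc) %>%   # same constants as for training
  as.matrix()

y_train <- df_train$Class
y_test  <- df_test$Class
```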

We’ll now define our model in Keras: a symmetric autoencoder with 4 dense layers.

```
library(keras)

model <- keras_model_sequential()
model %>%
  layer_dense(units = 15, activation = "tanh", input_shape = ncol(x_train)) %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = ncol(x_train))

summary(model)
```

```
___________________________________________________________________________________
Layer (type)                         Output Shape                     Param #      
===================================================================================
dense_1 (Dense)                      (None, 15)                       450          
___________________________________________________________________________________
dense_2 (Dense)                      (None, 10)                       160          
___________________________________________________________________________________
dense_3 (Dense)                      (None, 15)                       165          
___________________________________________________________________________________
dense_4 (Dense)                      (None, 29)                       464          
===================================================================================
Total params: 1,239
Trainable params: 1,239
Non-trainable params: 0
___________________________________________________________________________________
```

We’ll then compile our model, using the mean squared error loss and the Adam optimizer for training.

```
model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)
```

## Training the model

We can now train our model using the `fit()` function. Training the model is reasonably fast (~14s per epoch on my laptop). We will only feed our model the observations of normal (non-fraudulent) transactions.

We’ll use `callback_model_checkpoint()` in order to save our model after each epoch. By passing the argument `save_best_only = TRUE` we keep on disk only the epoch with the smallest loss value on the test set. We will also use `callback_early_stopping()` to stop training if the validation loss stops decreasing for 5 epochs.

```
checkpoint <- callback_model_checkpoint(
  filepath = "model.hdf5",
  save_best_only = TRUE,
  period = 1,
  verbose = 1
)

early_stopping <- callback_early_stopping(patience = 5)

model %>% fit(
  x = x_train[y_train == 0,],
  y = x_train[y_train == 0,],
  epochs = 100,
  batch_size = 32,
  validation_data = list(x_test[y_test == 0,], x_test[y_test == 0,]),
  callbacks = list(checkpoint, early_stopping)
)
```

```
Train on 199615 samples, validate on 84700 samples
Epoch 1/100
199615/199615 [==============================] - 17s 83us/step - loss: 0.0036 - val_loss: 6.8522e-04
Epoch 00001: val_loss improved from inf to 0.00069, saving model to model.hdf5
Epoch 2/100
199615/199615 [==============================] - 17s 86us/step - loss: 4.7817e-04 - val_loss: 4.7266e-04
Epoch 00002: val_loss improved from 0.00069 to 0.00047, saving model to model.hdf5
Epoch 3/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.7753e-04 - val_loss: 4.2430e-04
Epoch 00003: val_loss improved from 0.00047 to 0.00042, saving model to model.hdf5
Epoch 4/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.3937e-04 - val_loss: 4.0299e-04
Epoch 00004: val_loss improved from 0.00042 to 0.00040, saving model to model.hdf5
Epoch 5/100
199615/199615 [==============================] - 19s 94us/step - loss: 3.2259e-04 - val_loss: 4.0852e-04
Epoch 00005: val_loss did not improve
Epoch 6/100
199615/199615 [==============================] - 18s 91us/step - loss: 3.1668e-04 - val_loss: 4.0746e-04
Epoch 00006: val_loss did not improve
...
```

After training we can get the final loss for the test set by using the `evaluate()` function.

```
loss <- evaluate(model, x = x_test[y_test == 0,], y = x_test[y_test == 0,])
loss
```

```
loss
0.0003534254
```

## Tuning with CloudML

We may be able to get better results by tuning our model hyperparameters. We can tune, for example, the normalization function, the learning rate, the activation functions and the size of the hidden layers. CloudML uses Bayesian optimization to tune the hyperparameters of models, as described in this blog post.

We can use the cloudml package to tune our model, but first we need to prepare the project by creating a training flag for each hyperparameter and a `tuning.yml` file that will tell CloudML which parameters we want to tune and how.

The full script used for training on CloudML can be found at https://github.com/dfalbel/fraud-autoencoder-example. The most important modifications to the code were adding the training flags:

```
FLAGS <- flags(
  flag_string("normalization", "minmax", "One of minmax, zscore"),
  flag_string("activation", "relu", "One of relu, selu, tanh, sigmoid"),
  flag_numeric("learning_rate", 0.001, "Optimizer Learning Rate"),
  flag_integer("hidden_size", 15, "The hidden layer size")
)
```

We then used the `FLAGS` variable inside the script to drive the hyperparameters of the model, for example:

```
model %>% compile(
  optimizer = optimizer_adam(lr = FLAGS$learning_rate),
  loss = 'mean_squared_error'
)
```

We also created a `tuning.yml` file describing how hyperparameters should be varied during training, as well as what metric we wanted to optimize (in this case it was the validation loss: `val_loss`).

**tuning.yml**

```
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 5
    params:
      - parameterName: normalization
        type: CATEGORICAL
        categoricalValues: [zscore, minmax]
      - parameterName: activation
        type: CATEGORICAL
        categoricalValues: [relu, selu, tanh, sigmoid]
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.000001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
      - parameterName: hidden_size
        type: INTEGER
        minValue: 5
        maxValue: 50
        scaleType: UNIT_LINEAR_SCALE
```

We describe the type of machine we want to use (in this case a `standard_gpu` instance), the metric we want to minimize while tuning, and the maximum number of trials (i.e. the number of combinations of hyperparameters we want to test). We then specify how we want to vary each hyperparameter during tuning.

You can learn more about the tuning.yml file in the TensorFlow for R documentation and in Google’s official documentation on CloudML.

Now we’re ready to send the job to Google CloudML. We can do this by running:

```
library(cloudml)
cloudml_train("train.R", config = "tuning.yml")
```

The cloudml package takes care of uploading the dataset and installing any R package dependencies required to run the script on CloudML. If you are using RStudio v1.1 or higher, it will also allow you to monitor your job in a background terminal. You can also monitor your job using the Google Cloud Console.

After the job is finished we can collect the job results with:
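In the cloudml package this is done with `job_collect()`; by default it collects the most recently submitted job:

```
library(cloudml)

# Download the results of the most recent CloudML job to the local system.
job_collect()
```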

This will copy the files from the job with the best `val_loss` performance on CloudML to your local system and open a report summarizing the training run.

Since we used a callback to save model checkpoints during training, the model file was also copied from Google CloudML. Files created during training are copied to the “runs” subdirectory of the working directory from which `cloudml_train()` is called. You can determine this directory for the most recent run with:
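For example, using the tfruns package (which the cloudml package builds on):

```
library(tfruns)

# Directory of the most recent training run.
latest_run()$run_dir
```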

`[1] runs/cloudml_2018_01_23_221244595-03`

You can also list all previous runs and their validation losses with:

`ls_runs(order = metric_val_loss, decreasing = FALSE)`

```
                    run_dir metric_loss metric_val_loss
1 runs/2017-12-09T21-01-11Z      0.2577          0.1482
2 runs/2017-12-09T21-00-11Z      0.2655          0.1505
3 runs/2017-12-09T19-59-44Z      0.2597          0.1402
4 runs/2017-12-09T19-56-48Z      0.2610          0.1459
Use View(ls_runs()) to view all columns
```

In our case the job downloaded from CloudML was saved to `runs/cloudml_2018_01_23_221244595-03/`, so the saved model file is available at `runs/cloudml_2018_01_23_221244595-03/model.hdf5`. We can now use our tuned model to make predictions.

## Making predictions

Now that we have trained and tuned our model, we are ready to generate predictions with our autoencoder. We are interested in the MSE for each observation, and we expect observations of fraudulent transactions to have higher MSEs.

First, let’s load our mannequin.

```
model <- load_model_hdf5("runs/cloudml_2018_01_23_221244595-03/model.hdf5",
                         compile = FALSE)
```

Now let’s calculate the MSE for the training and test set observations.
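A sketch of this computation (assuming `x_train` and `x_test` are the normalized matrices from before): the reconstruction MSE of an observation is the sum of squared differences between the observation and its reconstruction.

```
pred_train <- predict(model, x_train)
mse_train <- apply((x_train - pred_train)^2, 1, sum)  # one error per observation

pred_test <- predict(model, x_test)
mse_test <- apply((x_test - pred_test)^2, 1, sum)
```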

A good measure of model performance on highly unbalanced datasets is the Area Under the ROC Curve (AUC). AUC has a nice interpretation for this problem: it’s the probability that a fraudulent transaction will have a higher MSE than a normal one. We can calculate this using the Metrics package, which implements a wide variety of common machine learning performance metrics.
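For example, with the `auc()` function from Metrics (which takes the actual classes and the predicted scores — here, the reconstruction MSEs):

```
library(Metrics)

auc(y_train, mse_train)
auc(y_test, mse_test)
```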

```
[1] 0.9546814
[1] 0.9403554
```

To use the model in practice for making predictions we need to find a threshold k for the MSE; if MSE > k we consider that transaction a fraud (otherwise we consider it normal). To define this value it’s useful to look at precision and recall while varying the threshold k.

```
library(ggplot2)

possible_k <- seq(0, 0.5, length.out = 100)
precision <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1) / sum(predicted_class)
})

qplot(possible_k, precision, geom = "line") +
  labs(x = "Threshold", y = "Precision")
```

```
recall <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  sum(predicted_class == 1 & y_test == 1) / sum(y_test)
})

qplot(possible_k, recall, geom = "line") +
  labs(x = "Threshold", y = "Recall")
```

A good starting point would be to choose the threshold with maximum precision, but we could also base our decision on how much money we would lose from fraudulent transactions.

Suppose each manual verification of fraud costs us $1, but if we don’t verify a transaction and it is a fraud, we lose the transaction amount. Let’s find, for each threshold value, how much money we would lose.

```
cost_per_verification <- 1

lost_money <- sapply(possible_k, function(k) {
  predicted_class <- as.numeric(mse_test > k)
  # $1 per flagged transaction, plus the amount of every missed fraud.
  sum(cost_per_verification * predicted_class +
        (predicted_class == 0) * y_test * df_test$Amount)
})

qplot(possible_k, lost_money, geom = "line") + labs(x = "Threshold", y = "Lost Money")
```

We can find the best threshold in this case with:
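The best threshold is the element of `possible_k` at the position of the smallest value of `lost_money`:

```
possible_k[which.min(lost_money)]
```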

`[1] 0.005050505`

If we needed to manually verify all frauds, it would cost us ~$13,000. Using our model we can reduce this to ~$2,500.