Optuna + Kaggle = Winning!

created with Lucidchart

Let the coding begin.

This is the first sprint, version 0.0.1-alpha: a dumb cook’s recipe for automatically submitting Optuna-optimized models to Kaggle contests.

Optuna Setup

I will use Kaggle’s IEEE CIS Fraud competition as my example to set up Optuna and submit models using Kaggle’s Python API. In the code below, we create a Study object using Optuna’s factory function.

Instantiate Optuna Study

Next, we sample and prepare data for induction. Please note I am using a small sample size of 5k to intentionally reduce the performance of my initial model. This way my charts will show progress and my narrative will be compelling. Note my use of feather instead of CSV or HDF. This great article explains it best. XGBoost only likes continuous and boolean features. We filter the non-compliant columns and save their names for future reference.

read, sample, split training data
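A sketch of that preparation step, under my own assumptions: the feather file path and split ratio are placeholders, while `isFraud` is the competition’s actual target column. The column filter is the key piece — XGBoost-compatible columns are numeric or boolean, and we keep the names of everything else:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def non_xgb_columns(df: pd.DataFrame) -> list:
    """Names of columns XGBoost can't ingest directly (not numeric, not bool)."""
    return df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

def prepare(path="train_transaction.feather", n=5_000, seed=42):
    # Feather loads far faster than CSV or HDF for a frame this wide.
    train = pd.read_feather(path).sample(n=n, random_state=seed)
    bad_cols = non_xgb_columns(train)  # saved for the test set later
    y = train["isFraud"]
    X = train.drop(columns=["isFraud"] + bad_cols)
    splits = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)
    return splits, bad_cols
```

The deliberately tiny `n=5_000` sample is what keeps the initial score low, per the narrative above.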

For Optuna or any other optimization system like Hyperopt, we need an objective function. Here is a very simple objective function meant only to be operational.

basic bro objective function

Time to let ‘er rip. Running the trials is straightforward: supply an objective function and how many trials you’d like to run. There are a lot more options, but this is not the place to talk about them yet.

study.optimize(objective, n_trials=5)

If this code is run in a notebook, you should see something like the following. I like how the output keeps track of the best trial, similar to watching a neural network train.

Optuna doing what it does

Time to Submit

At this point we have a CV-5 model that scored 0.735 AUC. If we know anything, we know our Kaggle score will be much worse than this, but we are just making sure the plumbing works right now, not trying to win. To create our submission, we build a fresh XGBoost model using (1) the hyperparameters from Optuna trial #3. (2) We induce it on all the training data. (3) We prepare our test data for inference. (4) We score the test data and save it as CSV, and (5) submit to Kaggle.

Here is what that looks like…

(1) Hyperparameters from best Optuna trial

best hyperparameters

Again, we apply these hyperparameters and score our test data. Good thing we kept track of the column names for non-continuous, non-boolean features.

(2) We induce it on all the training data.

induce (fit) on all training data (X, y)

(3) Prepare Kaggle test data, removing non-continuous, non-boolean features

prepare Kaggle test data
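A sketch of the test-set cleanup, assuming the same list of saved column names (`bad_cols`) recorded during training prep:

```python
import pandas as pd

def drop_non_xgb(test: pd.DataFrame, bad_cols: list) -> pd.DataFrame:
    """Drop the non-continuous, non-boolean columns we recorded earlier,
    so the test frame matches what the model was induced on."""
    return test.drop(columns=[c for c in bad_cols if c in test.columns])
```

Applied as something like `test_df = drop_non_xgb(pd.read_feather("test_transaction.feather"), bad_cols)` — the feather path is my assumption.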

(4) Score Kaggle test data

test data scored
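A sketch of the scoring-and-save step. The two-column `TransactionID` / `isFraud` layout is what this competition’s sample submission uses; the helper name and output path are mine:

```python
import pandas as pd

def score_to_csv(model, test_df, ids, path="submission.csv") -> pd.DataFrame:
    """Probability of fraud per row, written in Kaggle's expected layout."""
    sub = pd.DataFrame({
        "TransactionID": ids,
        "isFraud": model.predict_proba(test_df)[:, 1],  # P(class == 1)
    })
    sub.to_csv(path, index=False)
    return sub
```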

(5) Moment of truth, Kaggle submission

moment of truth, Kaggle submission

What is my leaderboard position?

Using the Kaggle Python API, we can get our Kaggle score and estimate our leaderboard position. Our best Kaggle score so far is 0.6825.

The Kaggle API only allows you to get the top 50 scores without forcing you to download a CSV.


This is a rough start, and some elbow grease will be applied in our next “sprint.” Next time we are going to containerize this thing and make it scale (see number 6).

Oh yes! It will scale with Docker and K8s!





Google Cloud Engineer; where my passion for AutoML, Data, and Coding all happily coexist.
