AutoML
Optuna + Kaggle = Winning!
Building Competitive AutoML

Let the coding begin.
This is the first sprint, version 0.0.1-alpha: a dumb cook’s recipe for automatically submitting Optuna-optimized models to Kaggle contests.
Optuna Setup
I will use Kaggle’s IEEE-CIS Fraud Detection competition as my example to set up Optuna and submit models using Kaggle’s Python API. In the code below, we create a Study object using Optuna’s factory function.
Instantiate Optuna Study
Next, we sample and prepare data for induction. Please note I am using a small sample size of 5k to intentionally reduce the performance of my initial model; this way my charts will show progress and my narrative will be compelling. Note my use of feather instead of CSV or HDF. This great article explains it best. XGBoost only likes continuous and boolean features, so we filter out the non-compliant columns and save their names for future reference.
read, sample, split training data
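A sketch of that step, written as a helper so the logic is reusable. The feather file name and the `isFraud` target follow the competition's transaction table; the seed of 42 is arbitrary:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def sample_and_split(df, target="isFraud", n=5_000, seed=42):
    """Take a small sample, drop non-numeric columns, split 80/20."""
    df = df.sample(n=min(n, len(df)), random_state=seed)
    # XGBoost only likes continuous and boolean features; remember the
    # offenders so we can drop the same columns from the test set later.
    drop_cols = (df.drop(columns=[target])
                   .select_dtypes(exclude=["number", "bool"])
                   .columns.tolist())
    X = df.drop(columns=drop_cols + [target])
    y = df[target]
    return train_test_split(X, y, test_size=0.2, random_state=seed,
                            stratify=y), drop_cols

# train = pd.read_feather("train_transaction.feather")  # feather: fast reloads
# (X_train, X_valid, y_train, y_valid), drop_cols = sample_and_split(train)
```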
For Optuna, or any other optimization system like Hyperopt, we need an objective function. Here is a very simple one, meant only to be operational.
basic bro objective function
Time to let ’er rip. Running the trials is straightforward: supply an objective function and the number of trials you’d like to run. There are a lot more options, but this is not the place to talk about them yet.
study.optimize(objective, n_trials=5)
If this code is run in a notebook, you should see something like the following. I like how the output keeps track of the best trial, similar to watching a neural network train.

Time to Submit
At this point we have a CV-5 model that scored 0.735 AUC. If we know anything, we know our Kaggle score will be much worse than this, but we are just making sure the plumbing works right now, not trying to win. To create our submission, we (1) build a fresh, new XGBoost object using the hyperparameters from Optuna trial #3, (2) induce it on all the training data, (3) prepare our test data for inference, (4) score the test data and save it as a CSV, and (5) submit to Kaggle.
Here is what that looks like…
(1) Hyperparameters from best Optuna trial

Again, we apply these hyperparameters and score our test data. Good thing we kept track of the column names for non-continuous, non-boolean features.
(2) We induce it on all the training data.

(3) Prepare Kaggle test data, removing non-continuous, non-boolean features
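A sketch, assuming the test table was also converted to feather; the whole point is to drop exactly the column names we saved earlier:

```python
import pandas as pd

def prepare_test(test, drop_cols):
    """Drop the same non-continuous, non-boolean columns we dropped from train."""
    return test.drop(columns=[c for c in drop_cols if c in test.columns])

# test = pd.read_feather("test_transaction.feather")
# X_test = prepare_test(test, drop_cols)
```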

(4) Score Kaggle test data
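Scoring plus saving, sketched with stand-in IDs and probabilities; the `TransactionID` and `isFraud` column names match the competition's sample submission file:

```python
import numpy as np
import pandas as pd

# Stand-ins: in the pipeline, scores = final_model.predict_proba(X_test)[:, 1]
# and ids = test["TransactionID"].
ids = np.arange(5)                      # hypothetical IDs
scores = np.array([0.01, 0.02, 0.50, 0.90, 0.03])

submission = pd.DataFrame({"TransactionID": ids, "isFraud": scores})
submission.to_csv("submission.csv", index=False)
```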

(5) Moment of truth, Kaggle submission
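With your `kaggle.json` credentials in place, one CLI call does it. The submission message is arbitrary, and the run line is commented out so nothing fires by accident:

```python
import subprocess

cmd = [
    "kaggle", "competitions", "submit", "ieee-fraud-detection",
    "-f", "submission.csv",
    "-m", "optuna sweep v0.0.1-alpha",   # free-form submission message
]
# subprocess.run(cmd, check=True)  # requires ~/.kaggle/kaggle.json
print(" ".join(cmd))
```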

What is my leaderboard position?
Using the Kaggle Python API, we can get our Kaggle score and estimate our leaderboard position. Here is our best Kaggle score so far: 0.6825.

The Kaggle API only lets you see the top 50 scores without forcing you to download a CSV.
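The same CLI covers both checks: `submissions` lists your own scored entries, and `leaderboard -s` prints the top of the public board (downloading the full CSV is a separate flag). The run lines stay commented out since they need credentials:

```python
import subprocess

check_mine = ["kaggle", "competitions", "submissions", "ieee-fraud-detection"]
check_top = ["kaggle", "competitions", "leaderboard", "ieee-fraud-detection",
             "-s"]  # -s shows the board instead of downloading it
# subprocess.run(check_mine, check=True)  # requires ~/.kaggle/kaggle.json
# subprocess.run(check_top, check=True)
print(" ".join(check_top))
```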


This is a rough start, and some elbow grease will be applied in our next “sprint.” Next time we are going to containerize this thing and make it scale (see number 6).
Oh yes! It will scale with Docker and K8s!
