Increase Your Kaggle Score With a Random Forest

Ryan Swanstrom

12 years ago

Previously, I blogged about submitting your first solution to Kaggle for the Biological Response Competition. Well, that technique used Logistic Regression and the resulting score was not very good. Now, let’s try to improve upon that score. In this example, we will use what is called a Random Forest. Kaggle claims that random forests have performed well in many of the competitions.

Setup

There is no setup required beyond what was done when submitting your first solution. This technique will also use python as the software tool and the same data and directory structure.

The Random Forest Code

Scikit-learn, the machine learning library for python, has a nice implementation of a random forest. Here is some python code to run the random forest. A special thanks to Ben Hamner for supplying the basic code.
#!/usr/bin/env python

from sklearn.ensemble import RandomForestClassifier
import csv_io
import scipy

def main():
#read in the training file
train = csv_io.read_data("train.csv")
#set the training responses
target = [x[0] for x in train]
#set the training features
train = [x[1:] for x in train]
#read in the test file
realtest = csv_io.read_data("test.csv")

# random forest code
rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1)
# fit the training data
print('fitting the model')
rf.fit(train, target)
# run model against test data
predicted_probs = rf.predict_proba(realtest)

predicted_probs = ["%f" % x[1] for x in predicted_probs]
csv_io.write_delimited_file("random_forest_solution.csv", predicted_probs)

print ('Random Forest Complete! You Rock! Submit random_forest_solution.csv to Kaggle')

if __name__=="__main__":
main()

Raw code can be obtained here. (Please use the raw code if you are going to copy/paste). Now save this file as random_forest.py in the directory (c:/kaggle/bioresponse) you previously created.

Running the code

Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.
import os os.chdir('c:/kaggle/bioresponse')
Now you can run the actual random forest python code.
import random_forest random_forest.main()

Results

Now upload random_forest_solution.csv to Kaggle and enjoy moving up the Leaderboard. This score should place you at or near the random forest benchmark. As of today (5/30/2012), that score is about in the middle of the Leaderboard. Note: as the name implies, a random forest has a bit of randomness built into the algorithm, so your results may vary slightly.

Once again if you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.