Yesterday, I wrote a post explaining the Kaggle Biological Response competition. If you don’t know, Kaggle is a website for data science competitions. Now it is time to submit a solution. After this post, you should have a spot on the Leaderboard. Granted, it will not be first place, but it won’t be last place either. If you have not already done so, please create an account at Kaggle.
Setup Python
For this example, we will use the Python programming language. You will need to perform the following steps to get going. These steps are for Windows machines, but they can easily be adapted for a UNIX/Linux/Mac system.
- Install Python 2.7.3 – you need the programming language
- Install numpy – for arrays and linear algebra
- Install scipy – for scientific calculations
- Install setuptools – easier Python package installation
- Install scikit-learn – machine learning for Python
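Once everything is installed, a quick check that the packages can actually be imported saves some head-scratching later. The following is a minimal sketch, not part of the original setup; the helper name check_packages is made up for illustration, and sklearn is the import name of scikit-learn.

```python
import importlib


def check_packages(names):
    """Map each module name to its version string.

    The value is "unknown" if the module imports but exposes no
    __version__ attribute, and None if it cannot be imported at all.
    """
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    # Import names of the packages listed in the steps above.
    wanted = ["numpy", "scipy", "setuptools", "sklearn"]
    for name, version in sorted(check_packages(wanted).items()):
        print("%s: %s" % (name, version if version else "NOT INSTALLED"))
```

Any package reported as NOT INSTALLED needs another look before moving on.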
Setup A File Structure And Get Data
Next, create a directory on your C drive. Call it whatever you want. I recommend C:/kaggle/bioresponse. Then download the file csv_io.py, which handles reading and writing CSV files, and save it in that directory. Thanks to Ben Hamner of Kaggle for that file. Finally, download the train.csv and test.csv files from Kaggle and save them to the same directory.
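If you cannot get hold of csv_io.py, the script below only needs two functions from it: read_data and write_delimited_file. What follows is a hypothetical stand-in, not Ben Hamner's file. It assumes that each CSV starts with a header row, that every other value is numeric, and that the output is one row per line. Those assumptions match how the functions are called in this post, but the real file may behave differently.

```python
def read_data(file_name):
    """Read a numeric CSV into a list of rows, each a list of floats."""
    samples = []
    with open(file_name) as f:
        f.readline()  # assumption: the first line is a header row, so skip it
        for line in f:
            line = line.strip()
            if line:  # ignore blank lines
                samples.append([float(x) for x in line.split(",")])
    return samples


def write_delimited_file(file_path, data, delimiter=","):
    """Write one row per line; a row is either a string or a list of strings."""
    with open(file_path, "w") as f_out:
        for row in data:
            if isinstance(row, str):
                f_out.write(row + "\n")
            else:
                f_out.write(delimiter.join(row) + "\n")
```

Saved as csv_io.py in the same directory, this should be enough for import csv_io to work in the script further down.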
The Default Solution
If you open the test.csv file, you will notice it has 2501 rows of actual data. Thus, a very simple default solution is to create a submission file with 2501 rows and the number 0.5 on each row. Then go to Kaggle and upload the submission file. I will not provide code for creating that file. There are many ways to do it manually or programmatically. This solution will get you on the Leaderboard near the bottom, but not last.
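For anyone who prefers the programmatic route, here is one possible sketch. It assumes the submission format described above, a single value per line with no header. The function name and the output file name default_solution.csv are made up for illustration.

```python
def write_constant_submission(path, n_rows=2501, value=0.5):
    """Write n_rows lines, each holding the same predicted probability."""
    with open(path, "w") as f:
        for _ in range(n_rows):
            f.write("%s\n" % value)
    return n_rows


# Example usage (the file name is hypothetical):
# write_constant_submission("default_solution.csv")
```

The defaults match the 2501 test rows and the 0.5 guess, so the call needs nothing but a file name.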
A Logistic Regression Solution
Now, if you know a little statistics, you will recognize this problem as a classification problem, since the observed responses are either 0 or 1. Thus logistic regression is a decent algorithm to try. Here is the Python code to run logistic regression.
#!/usr/bin/env python
from sklearn.linear_model import LogisticRegression
import csv_io
import math
import scipy

def main():
    #read in the training file
    train = csv_io.read_data("train.csv")
    #set the training responses
    target = [x[0] for x in train]
    #set the training features
    train = [x[1:] for x in train]
    #read in the test file
    realtest = csv_io.read_data("test.csv")

    # code for logistic regression
    lr = LogisticRegression()
    lr.fit(train, target)
    predicted_probs = lr.predict_proba(realtest)

    # write solutions to file
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("log_solution.csv", predicted_probs)

    print ('Logistic Regression Complete! Submit log_solution.csv to Kaggle')

if __name__=="__main__":
    main()
Raw code can be obtained here (Please use the raw code if you are going to copy/paste).
Save this file as log_regression.py in the directory you created above. Then open the Python GUI (IDLE). You may need to run the following commands to navigate to the correct directory.
import os
os.chdir('c:/kaggle/bioresponse')
Now you can run the actual logistic regression.
import log_regression
log_regression.main()
Now upload log_solution.csv to Kaggle, and you are playing the game.
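Before uploading, it can be worth a quick sanity check on the file. The sketch below is an optional extra, not something the competition requires. It assumes the one-probability-per-line format used in this post and the 2501 test rows mentioned earlier. The function name check_submission is made up for illustration.

```python
def check_submission(path, expected_rows=2501):
    """Return a list of problems found in a one-probability-per-line file.

    An empty list means the row count matches and every value parses
    as a number between 0 and 1.
    """
    problems = []
    with open(path) as f:
        # Blank lines are dropped, so row numbers count non-blank lines only.
        rows = [line.strip() for line in f if line.strip()]
    if len(rows) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(rows)))
    for number, text in enumerate(rows, 1):
        try:
            value = float(text)
        except ValueError:
            problems.append("row %d is not a number: %r" % (number, text))
            continue
        if not 0.0 <= value <= 1.0:
            problems.append("row %d is outside [0, 1]: %s" % (number, text))
    return problems


# Example usage:
# print(check_submission("log_solution.csv"))
```

An empty list back from check_submission suggests the file is at least well formed; it says nothing about how good the predictions are.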
Results
If you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.