Your First Kaggle Submission

May 23, 2012

—

Yesterday, I wrote a post explaining the Kaggle Biological Response competition. If you don’t know, Kaggle is a website for data science competitions. Now it is time to submit a solution. By the end of this post, you should have a spot on the Leaderboard. Granted, it will not be first place, but it won’t be last place either. If you have not already done so, please create an account at Kaggle.

Setup Python

For this example, we will use the Python programming language. You will need to perform the following steps to get going. These steps are for Windows machines, but they could very easily be modified for a UNIX/Linux/Mac system.

1. Install Python 2.7.3 – you need the programming language
2. Install numpy – for linear algebra and other stuff
3. Install scipy – for scientific calculations
4. Install setuptools – easier Python package installation
5. Install scikit-learn – machine learning for Python
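
Once everything is installed, a quick sanity check can save some debugging later. The following is a minimal sketch that tries to import each package and reports its version. The check_packages helper is a hypothetical name written for illustration, not part of any of the packages above. Note that scikit-learn is imported under the name sklearn.

```python
import importlib


def check_packages(names):
    """Map each package name to its version string, or None if the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # the package is missing or not on the Python path
            versions[name] = None
        else:
            # not every package exposes __version__, so fall back to a label
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    wanted = ["numpy", "scipy", "setuptools", "sklearn"]
    for name, version in sorted(check_packages(wanted).items()):
        print("%-12s %s" % (name, version if version else "NOT INSTALLED"))
```

If any line reports NOT INSTALLED, fix that package before moving on, since the logistic regression script will fail at its import statements otherwise.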

Setup A File Structure And Get Data

Next, create a directory on your C drive. Call it whatever you want. I recommend C:/kaggle/bioresponse. Then download the file csv_io.py, which handles reading and writing CSV files, and save it in that directory. Thanks to Ben Hamner of Kaggle for that file. Finally, download the test and train files from Kaggle and save them to the same directory.
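
If you are curious what a reader like csv_io.read_data has to do, here is a rough sketch. It is not the csv_io.py file itself. The function name read_numeric_csv is hypothetical, and the sketch assumes the data files have one header row followed by comma-separated numbers.

```python
def read_numeric_csv(file_name, skip_header=True):
    """Read a comma-separated file into a list of rows, each a list of floats."""
    rows = []
    with open(file_name) as f:
        first_data_line = 1
        if skip_header:
            f.readline()  # discard the row of column names
            first_data_line = 2
        for line_number, line in enumerate(f, start=first_data_line):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                rows.append([float(x) for x in line.split(",")])
            except ValueError as err:
                # report where the bad value is, so a damaged file is easy to find
                raise ValueError("%s, line %d: %s" % (file_name, line_number, err))
    return rows
```

A reader like this fails on any value that is not a plain number, for example a field that picked up a quote character after the file was opened and saved in a spreadsheet.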

The Default Solution

If you open the test.csv file, you will notice it has 2501 rows of actual data. Thus, a very simple default solution is to create a submission file with 2501 rows and the number 0.5 on each row. Then go to Kaggle and upload the submission file. I will not provide code for creating that file. There are many ways to do it manually or programmatically. This solution will get you on the Leaderboard near the bottom, but not last.
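
For readers who would rather script it, one possible way is sketched below. The function name write_constant_submission and the output file name are made up for illustration. The sketch assumes the submission format is one probability per line with no header row, which is what the logistic regression script later in this post writes.

```python
def write_constant_submission(file_path, n_rows=2501, value=0.5):
    """Write n_rows lines, each containing the same predicted probability."""
    with open(file_path, "w") as f:
        for _ in range(n_rows):
            f.write("%f\n" % value)
```

Calling write_constant_submission("default_solution.csv") from your competition directory creates a file with 2501 lines of 0.500000.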

A Logistic Regression Solution

Now, if you know a little statistics, you will recognize this problem as a classification problem, since the observed responses are either 0 or 1. Thus logistic regression is a decent algorithm to try. Here is the Python code to run logistic regression.
#!/usr/bin/env python

from sklearn.linear_model import LogisticRegression
import csv_io


def main():
    # read in the training file
    train = csv_io.read_data("train.csv")

    # the first column holds the training responses (0 or 1)
    target = [x[0] for x in train]

    # the remaining columns are the training features
    train = [x[1:] for x in train]

    # read in the test file
    realtest = csv_io.read_data("test.csv")

    # fit the logistic regression model
    lr = LogisticRegression()
    lr.fit(train, target)

    # each row of the result is [P(response = 0), P(response = 1)]
    predicted_probs = lr.predict_proba(realtest)

    # write the probability of a response of 1 for each test row to file
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("log_solution.csv", predicted_probs)

    print('Logistic Regression Complete! Submit log_solution.csv to Kaggle')


if __name__ == "__main__":
    main()
Raw code can be obtained here (Please use the raw code if you are going to copy/paste).
Save this file as log_regression.py in the directory you created above. Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.
import os
os.chdir('c:/kaggle/bioresponse')
Now you can run the actual logistic regression.
import log_regression
log_regression.main()
Now upload log_solution.csv to Kaggle, and you are playing the game.
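
If you want a rough idea of how good a set of predictions is before uploading, you can compute a score yourself on training rows you held back. The sketch below assumes the competition is scored with logarithmic loss (log loss), which is what the logloss script mentioned in the comments suggests. The log_loss function and its eps clipping value are illustrative choices, not Kaggle's official implementation.

```python
import math


def log_loss(actual, predicted, eps=1e-15):
    """Average negative log-likelihood of 0/1 labels given predicted probabilities."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = 0.0
    for y, p in zip(actual, predicted):
        # clip so that a prediction of exactly 0 or 1 never produces log(0)
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(actual)


if __name__ == "__main__":
    labels = [0, 1, 1, 0]
    print(log_loss(labels, [0.5, 0.5, 0.5, 0.5]))  # the default solution
    print(log_loss(labels, [0.1, 0.9, 0.8, 0.2]))  # confident and correct
```

Under this metric, lower is better. Predicting 0.5 for every row always scores ln 2, about 0.693, whatever the true labels are, so that is the number a real model needs to beat.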

Results

If you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.

Comments

13 responses to “Your First Kaggle Submission”

Derek

May 23, 2012

I’m getting an error…

Traceback (most recent call last):
  File "log_regression.py", line 6, in <module>
    import logloss
ImportError: No module named logloss

I’m pretty new to programming so I’m not sure where to look for the logloss module. Just looking through the “log_regression.py” file I see the import statements. Just not sure where “logloss” would be. Any help would be appreciated.

1. Derek
  
  May 23, 2012
  
  Ok, I think I fixed it. Found a script for “logloss” on the Kaggle forums. I just saved that as “logloss.py” in my \kaggle\bioresponse directory. Ran “log_regression.py” on the command line and got the “Logistic Regression Complete!” message. Just wanted to respond for others, like me, who ran into this issue.
  
2. Ryan Swanstrom
  
  May 25, 2012
  
  Derek,
  Thanks a lot for trying and for leaving a comment. Sorry it took so long to respond. I was away from a computer for a couple of days (I know, unheard of these days). Actually, the logloss import is not even needed for this example. I have removed it from the post and from the linked code.
  Thanks,
  Ryan
  
Increase Your Kaggle Score With a Random Forest | Data Science 101

May 31, 2012

[…] I blogged about submitting your first solution to Kaggle for the Biological Response Competition. Well, that technique used Logistic Regression and the […]

Azfar

May 31, 2012

Hi Guys,

csv_io.py doesn’t work for me:

Traceback (most recent call last):
  File "fisher.py", line 14, in <module>
    reader = csv_io.read_data(filename)
  File "/home/azfarl/kaggle/csv_io.py", line 12, in read_data
    sample = [float(x) for x in line]
ValueError: invalid literal for float(): "1

Any ideas? It doesn’t seem to like the float(x) on line 12 in csv_io.py.

Cheers,
F

1. Ryan Swanstrom
  
  May 31, 2012
  
  I am guessing the data file you are reading in is invalid. The code appears to be trying to convert "1 to a number. Is it possible your data file got a quote in it? This could happen if you opened the data file in a spreadsheet and saved it. Without seeing any more of your code, that would be my first guess. Let me know if this helps.
  
Rakesh

June 4, 2012

Worked for me. Submitted one solution myself. Got no ranking though 🙁

1. Ryan Swanstrom
  
  June 6, 2012
  
  Thanks for trying. I am not sure why you did not get a ranking. Hmm?
  
Steve G

February 16, 2013

Regarding step 4 in setting up Python, “Install setuptools – easier python package installation”: this will not work for 64-bit Python if you follow the links and directions. Instead, you will want to install Distribute (a fork and update of setuptools) from here

http://www.lfd.uci.edu/~gohlke/pythonlibs/#distribute

as well as the other 64-bit builds (scipy, numpy, and scikit-learn) on that page.

rakeb

June 3, 2013

Reblogged this on Machine learning blog.

Abhik

August 16, 2013

When I run log_regression, I get the following “ImportError: No module named sklearn.linear_model”

Any pointers to what I might be doing wrong?

1. Ryan Swanstrom
  
  August 16, 2013
  
  I cannot be sure, but it sounds like scikit-learn is either not installed or not installed properly. Here is a link with a similar issue. http://comments.gmane.org/gmane.comp.python.scikit-learn/1828
  
  1. Abhik
    
    August 16, 2013
    
    Yes. That’s what I suspect as well. I am confused only because I am using the Enthought Canopy Express, and it says that it comes with scikit preinstalled.