Ryan Swanstrom

Data Sources for Cool Data Science Projects: Part 2 – Guest Post

Oct 17, 2014


I am excited to share the first-ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing two great posts (see Part 1) about finding data for your next data science project.

At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project:

Data With a Cause:

Environmental Data: Data on household energy usage is available as well as NASA Climate Data.
Medical and Biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to genome data for 1000 individuals.

Miscellaneous:

Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. OpenStreetMap has open data on venues as well.
Twitter Data: In addition to the Twitter API, you can get access to Twitter data used for sentiment analysis, Twitter network data, and Twitter social data.
Games Data: Datasets for games are available, including a large dataset of poker hands, a dataset of online Dominion games, and datasets of chess games.
Web Usage Data: Web usage data is a common type of dataset that companies look at to understand engagement. Available datasets include anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.

Metasources: these are great sources that point to many other datasets.

Stanford Network Data: http://snap.stanford.edu/index.html
Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
UCI maintains archives of data for machine learning.
US Census Data
Amazon is hosting Public Datasets on S3
Kaggle hosts machine-learning challenges and many of their datasets are publicly available
The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
Yahoo maintains a lot of data on its web properties, which can be obtained by writing to them.
The BigML blog maintains a list of public datasets for the machine learning community.
Finally, if there’s a website with data you are interested in, crawl for it!
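
If you do end up crawling, here is a minimal sketch of the first steps in Python, using only the standard library: pull the links out of a page and keep the ones the site's robots.txt allows. Everything specific in it is an assumption made for illustration (the example.com URLs, the sample HTML, the sample robots.txt, and the helper names extract_links and allowed_by_robots); a real crawler would also download pages, throttle its requests, and respect the site's terms of use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in an HTML string (hypothetical helper)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file (hypothetical helper)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


# Hypothetical page and robots.txt, inlined so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <a href="/data/2014.csv">2014 data</a>
  <a href="private/raw.csv">raw dump</a>
</body></html>
"""
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

links = extract_links(SAMPLE_HTML, "http://example.com/")
fetchable = [url for url in links if allowed_by_robots(SAMPLE_ROBOTS, url)]

for url in fetchable:
    print(url)
```

In practice you would fetch the page and its robots.txt over the network instead of using inline strings; they are inlined here only so the example is self-contained.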

While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!

Additional Sources (added via comments since the post was published)

London Datastore – Public sector data from the city of London

Comments

9 responses to “Data Sources for Cool Data Science Projects: Part 2 – Guest Post”

Data Sources for Cool Data Science Projects: Part 2 | The Data Incubator

October 17, 2014

[…] This post was originally published by DataScience101. This, and all things Data Science, can be found here. […]

Philippe Van Impe

October 17, 2014

Reblogged this on The Brussels Data Science Community and commented:
Nice and useful information

Data Sources for Cool Data Science Projects: Part 1 – Guest Post | Data Science 101

October 17, 2014

[…] Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 2) about finding data for your next data science […]

Pasduil

October 17, 2014

London Datastore

http://data.london.gov.uk/

Data Sources for Cool Data Science Projects: Part 2 « Another Word For It

November 5, 2014

[…] Data Sources for Cool Data Science Projects: Part 2 by Ryan Swanstrom. […]

Scott

November 16, 2014

Another meta source: https://www.quandl.com/

Ryan Swanstrom
  
  November 18, 2014
  
  Thanks Scott, Quandl is actually included in the Part 1 list.
  
Open Data Day 2015 | Data Science 101

February 21, 2015

[…] If you are looking for some good datasets to use: try Data Sources for Cool Data Science Projects: Part 1 and Part 2. […]

4 Steps to Finding Your Data – winenapa

February 23, 2018

[…] to augment your existing data with open data. Here are some lists of open data, Open Data, Part 1, Open Data, Part 2. There are also many others […]
