Later today (2-3:45 pm ET), the White House will announce a $200M big data research initiative, appropriately named the “Big Data Research and Development Initiative.”
The announcement will be broadcast live on Science360.
See this PDF for a listing of big data projects within the US Government.
I am excited to see how this will affect the education and training of data scientists.
What are your thoughts? Is this a good idea?
Did they skip the part about providing their existing data in a reasonable format? For example, county-level data is hard to come by in a regular data set. Instead of just downloading such a data set, I scraped >3000 web pages to piece together a coherent county-level data set from the US Census website. If that data set already existed elsewhere on government websites in a clean format, I was unable to find it (if someone spots it somewhere, please point me to it for future reference!). This basic census data is surprisingly hard to find on government websites, and I cannot even imagine all of the other non-sensitive data that the government has collected but failed to adequately distribute.
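For what it’s worth, the scrape-and-stitch step can be sketched with just the Python standard library. This is only an illustration: the table structure and the county values below are hypothetical stand-ins, not the actual Census pages, and in practice each page would be fetched with `urllib.request` rather than embedded inline.

```python
# Minimal sketch: parse per-county pages and stitch them into one data set.
# The HTML layout and values are hypothetical stand-ins for the real pages.
from html.parser import HTMLParser


class CountyTableParser(HTMLParser):
    """Collects the text content of each <td> cell on a page."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.cells.append(data.strip())


def parse_county_page(html):
    parser = CountyTableParser()
    parser.feed(html)
    return parser.cells


# Two inline pages stand in for the thousands of scraped county pages.
pages = [
    "<table><tr><td>Autauga County, AL</td><td>54571</td></tr></table>",
    "<table><tr><td>Baldwin County, AL</td><td>182265</td></tr></table>",
]

# Stitch every page's cells into one coherent table of rows.
rows = [parse_county_page(html) for html in pages]
for county, population in rows:
    print(f"{county}\t{population}")
```

The real scrape is messier, of course (pagination, inconsistent markup across pages, retries), but the core pattern is the same: one small parser applied uniformly, with the results accumulated into a single table.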
I’d really like to see the government lead the way on reproducibility, which includes releasing the underlying data alongside its reports. Roger Peng had an excellent paper on this topic in a recent issue of Science that I think is worth the 10 minutes it takes to read (see http://www.ncbi.nlm.nih.gov/pubmed/22144613).
I have never searched for the Census data myself, but I’ll take your word for it. I am not sure whether any of the initiative is aimed at cleaning data and providing it in a usable format. Data needs to be both available and usable, and hopefully some of the funding will go toward that.
Thanks for the comment.
Also, thanks for the link. I plan to read the paper when I get a few minutes.