A few days ago I posted Data Science is more than just Statistics. I did not feel the post was complete so I am adding a part 2.
Data Collection
I received a couple comments about data scientists being involved with the collection of data. Yes, I would agree that is true. In order for data products to work, the correct data has to be collected. I probably did not explain this very well, but the approach to data collection is different. Typically, a statistics project will run some sort of rigorous experiment to collect data. The experiment will be very controlled and well-understood. In contrast, a data science project will collect data from existing systems, new system, sites on the web, sensors, and various other places. Most of the data does not come from a very controlled environment. By controlled, I mean specific number of users, specific type of users, specific time frame, and/or set constraints on the environment. This conglomeration of data is one of the reasons data scientists deal with such large datasets (there are other reasons too such as cheap hardware). I think it is more common for a data science project to deal with the whole population rather than a certain sampling of the population.
Importance of Statistics
In the previous post, I failed to emphasize the importance of statistics. I never said that statistics is not important to data science. Statistics is a critical element of data science. However, if you only know and study statistics, I feel you are missing other key elements of data science.
Thus, I stand by my previous statement, “if you just want to do statistics, join a statistics graduate program. If you want to data science, join a data science program.” Stay tuned for a post on choosing a data science graduate program.
Do you agree/disagree?
Leave a Reply