Data Science Methodology
- Problem Formulation – First, identify the problem to be solved. This step is easily overlooked. However, many dollars and hours have been spent solving the wrong problems.
- Obtain The Data – Next, collect new data and/or gather the data that already exists. In almost all cases, this data will need to be transformed and cleansed. It is important to note that this stage does not always involve big data or a data lake.
- Analysis – This is the part of the process where insight is to be extracted from the data. Commonly, this step will involve creating and optimizing statistical/machine learning models for prediction, but that is not always necessary. Sometimes, the analysis only contains graphs, charts, and basic descriptions of the data.
- Data Product – The end goal of data science is a data product. The insight from the Analysis phase needs to be conveyed to an end user. The data product might be as simple as a slideshow; more commonly it is a website dashboard, a message, an alert, or a recommendation.
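The four stages above can be sketched as a tiny pipeline. This is only an illustrative sketch, not a prescribed implementation: the sales records, the question being asked, and every function name here are hypothetical, and the "analysis" is just a mean and a least-squares slope computed with the Python standard library.

```python
from statistics import mean

# Hypothetical raw records; the None value stands in for messy real-world data.
RAW_RECORDS = [
    {"day": 1, "sales": "100"},
    {"day": 2, "sales": "110"},
    {"day": 3, "sales": None},
    {"day": 4, "sales": "130"},
    {"day": 5, "sales": "140"},
]


def formulate_problem():
    """Problem Formulation: state the question before touching the data."""
    return "Are daily sales rising or falling?"


def obtain_data(records):
    """Obtain The Data: drop records with missing sales, convert text to numbers."""
    return [(r["day"], float(r["sales"])) for r in records if r["sales"] is not None]


def analyze(data):
    """Analysis: basic description (mean) plus a least-squares slope of sales vs. day."""
    days = [d for d, _ in data]
    sales = [s for _, s in data]
    d_bar, s_bar = mean(days), mean(sales)
    numerator = sum((d - d_bar) * (s - s_bar) for d, s in data)
    denominator = sum((d - d_bar) ** 2 for d in days)
    return {"mean_sales": s_bar, "slope_per_day": numerator / denominator}


def data_product(question, insight):
    """Data Product: convey the insight to an end user, here as a short message."""
    direction = "rising" if insight["slope_per_day"] > 0 else "flat or falling"
    return (
        f"{question} Sales are {direction} "
        f"(about {insight['slope_per_day']:.1f} per day, "
        f"average {insight['mean_sales']:.1f})."
    )


if __name__ == "__main__":
    question = formulate_problem()
    clean = obtain_data(RAW_RECORDS)
    print(data_product(question, analyze(clean)))
```

Here the data product is a one-line message; in practice the same insight could feed a dashboard, an alert, or a recommendation.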
Can you think of anything the methodology is missing?
Note: This post is similar to the Data Scientific Method which I blogged about nearly 2 years ago.
Can you suggest a data science course based on Java? Everywhere I look, I find it in Python.
I think you might not be asking the right question. It depends upon what you want to do. Java is not the preferred language for data analysis and statistical models. Python and R are much easier for that. However, Java can be used with Hadoop.
A better approach is: “What do I want to learn?” Then go take a class about that. Learn to use the best tools for the job.
Hey, what about visualization? Don’t you think that is an important aspect of the overall thing too…
Yes, visualization is important. I may not have specified it very well, but I think it falls into the final 2 stages. The Analysis phase is where you would look at some exploratory visualizations, and the Data Product itself may end up being a visualization.
Thanks for commenting. I hope my response helps clear things up.
One other consideration is that, in a real-world setting, you have to incorporate continuous validation. Your models become stale over time and need to be updated.
That is a very good point. Thanks for commenting.
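One way to picture the continuous validation mentioned above is a periodic check that compares a model's recent error against the error it had when it was first deployed. This is a minimal sketch under assumptions: the function names, the choice of mean absolute error, and the `tolerance` threshold are all hypothetical, not a standard recipe.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def needs_retraining(baseline_error, recent_actual, recent_predicted, tolerance=1.5):
    """Flag the model as stale when recent error exceeds tolerance * baseline error.

    baseline_error: error measured when the model was deployed (assumed known).
    tolerance: hypothetical multiplier; the right value depends on the application.
    """
    recent_error = mean_absolute_error(recent_actual, recent_predicted)
    return recent_error > tolerance * baseline_error
```

Such a check would be run on a schedule against fresh data, and a `True` result would trigger an update of the model.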
Can you help me? I want to apply data science to business cost reduction. That’s the topic of my master’s project, so which data science techniques would you recommend for cost reduction?
I think I would need a bit more information. How are you trying to reduce costs?
solar cost reduction
You could add something like Data Action, where the prediction or inference made affects another process. This could be viewed more in machine learning and machine-to-machine applications.
Data Action would be the phase that ushers in artificial intelligence, where machines go beyond the control and understanding of humans and we could move towards the technological singularity.