Ryan Swanstrom

The Goal is Data Products: Now How Do We Get There?

Jan 12, 2015

—

in Data Science 101, Data-Driven Software Engineering

The primary output of data science is data products. Data products can be anything from a list of recommendations to a dashboard to a single chart or any other product that aids in making a more informed decision. In the end, data science should produce some usable results, and those results are the data product. The process used to create those data products needs a bit more formalization. Call it a methodology, process, lifecycle, or workflow, but it needs to exist.

Dr. Kirk Borne provided some thoughts in July 2014 with his article, Raising the Standard in the Big Data Analytics Profession. Data science needs some standards and possibly even a workflow, but the focus on data products cannot be lost.

Data Science is not Software Engineering

First, data science is often treated as software engineering because code is written. However, they are not the same thing. Agile methods, waterfall, and scrum are not pluggable methodologies that can be used with data science. Data science is more science and less engineering; therefore it should follow a more scientific method.

Existing Data Science Workflows

Luckily, some options already exist for data science. Much like software engineering, there is not a magic workflow that fits every project. The goal is to find a workflow that best fits the needs of the current project.

CRISP-DM

The most popular and oldest method is CRISP-DM. CRISP-DM was designed for data mining projects, which are closer to data science than software engineering is, but still not an exact match. The 6 steps of CRISP-DM are:

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Data Science Project Lifecycle

The Data Science Project Lifecycle is a recent modification/improvement of CRISP-DM with a bit more of an engineering focus. The steps can be seen as:

Data acquisition
Data preparation
Hypothesis and modeling
Evaluation and Interpretation
Deployment
Operations
Optimization

Data Science Workflow

The Data Science Workflow: Overview and Challenges was presented on the ACM blog in 2013. It was part of a dissertation by Philip Guo. Here are the steps:

Preparation
Analysis
Reflection
Dissemination

Those are 3 workflow options for data science. They are not the only options. Feel free to modify the workflows to best suit the project. It will be exciting to see the new workflows for data science that will be created in the near future. It will also be fun to see which ones turn out to be the most beneficial.
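As a rough illustration of comparing and modifying these workflows, here is a minimal Python sketch. The step names are copied from the lists above. The dictionary layout and the helper functions `shared_steps` and `customize` are hypothetical, made up for this example, and do not come from any of the three sources.

```python
# Step names are copied from the three workflows listed in this post.
# The data layout and helper names below are illustrative assumptions.
WORKFLOWS = {
    "CRISP-DM": [
        "Business Understanding",
        "Data Understanding",
        "Data Preparation",
        "Modeling",
        "Evaluation",
        "Deployment",
    ],
    "Data Science Project Lifecycle": [
        "Data acquisition",
        "Data preparation",
        "Hypothesis and modeling",
        "Evaluation and Interpretation",
        "Deployment",
        "Operations",
        "Optimization",
    ],
    "Data Science Workflow": [
        "Preparation",
        "Analysis",
        "Reflection",
        "Dissemination",
    ],
}


def shared_steps(a, b):
    """Return the steps of workflow `a` whose names also appear in
    workflow `b` (case-insensitive exact match), in `a`'s order."""
    b_lower = {step.lower() for step in WORKFLOWS[b]}
    return [step for step in WORKFLOWS[a] if step.lower() in b_lower]


def customize(name, drop=(), append=()):
    """Return a modified copy of a workflow: remove the steps named in
    `drop` (case-insensitive) and add the steps in `append` at the end.
    The original workflow is left unchanged."""
    dropped = {step.lower() for step in drop}
    kept = [step for step in WORKFLOWS[name] if step.lower() not in dropped]
    return kept + list(append)


if __name__ == "__main__":
    print(shared_steps("CRISP-DM", "Data Science Project Lifecycle"))
    print(customize("Data Science Workflow", append=["Deployment"]))
```

Exact name matching is deliberately crude: it treats "Evaluation" and "Evaluation and Interpretation" as different steps, so the overlap it reports is a lower bound on how much the workflows actually share.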

One thing a data product must do is help answer a question. Thus, a logical starting point for data science is a good question. Just don’t let the focus of the workflow come down to the process, which is often the case in software engineering. Let the focus be on data products.

Note:
I have previously written 2 posts on this topic, and I don’t think either post gets the methodology exactly correct.

Comments

5 responses to “The Goal is Data Products: Now How Do We Get There?”

Data Science tools and processes | Models are illuminating and wrong

January 12, 2015

[…] first link is Data Products how do we get there which discusses what methodologies people in the data science world use. I personally use one not […]

62 new external resources and articles about data science, big data – January 23 | Doclens

January 24, 2015

[…] Developing data products […]

Raymond Li

April 23, 2015

Thanks for writing about this, Ryan,

I found the portion about Data Science Workflows really interesting, since I come from a software engineering background. The steps for the 3 workflows described above seem to follow a traditional waterfall approach.

In a Data Science workflow, do you find that it’s necessary to iterate or revisit certain steps?

1. Ryan Swanstrom
  
  April 30, 2015
  
  Raymond,
  Great question! Most software engineering workflows are just improvements or modifications of waterfall. I think the modeling and evaluation steps might need to be iterated. Depending upon those steps, it might be necessary to revisit the preparation step to obtain more data or further clean data.
  
  What do you think? Do you see steps that need iteration?
  
  Thanks,
  Ryan
  
  1. Raymond Li
    
    May 1, 2015
    
    I believe so, but then my experience is mostly with software engineering and not very much with data science.
    
    I find that things never move along in a waterfall manner except for the simplest projects. Instead, it’s iterative — build something, get some feedback, add/change what you built based on the feedback, get some more feedback. Rinse and repeat.
    
    In the limited data analysis I’ve done to help solve engineering problems, I frequently feel it follows a similar iterative approach. Prepare, analyze, reflect, get feedback (usually it’s not exactly what they want) and then repeat.