Fundamentals of Data Mining

Today we are generating more data than ever before. By one widely cited estimate, 90 percent of the data in the world was generated over the last two years. On its own, this data means little until the patterns that relate its parts are identified. Data mining is the process of discovering these patterns and is therefore also known as Knowledge Discovery from Data (KDD).

The book ‘Data Mining: Practical Machine Learning Tools and Techniques’, written by Ian Witten and Eibe Frank, defines data mining as follows:

“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. … Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases…”

Overview

Take the example of a big supermarket that has a large number of records of customer purchases. Conventionally, in many supermarkets, this data is mostly used for inventory management or budgeting. However, by using sophisticated data mining tools and diligently scanning through the data for patterns that were never seen before, the supermarket management can learn which combinations of products their customers buy together most often, and how seasonality and other factors influence purchasing decisions. Patterns like these underpin ‘recommender systems’, which are now being implemented to boost sales by recommending products to frequent customers based on their previous purchase activities. Relevant recommendations can, in turn, increase customer satisfaction.
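To make the supermarket example concrete, here is a minimal sketch of the simplest form of this analysis: counting how often two products show up in the same basket. The baskets and the function name count_pairs are made up for illustration, and real tools use far more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records: each inner list is one customer's basket.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

def count_pairs(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives every pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

pair_counts = count_pairs(baskets)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

A pair that appears in many baskets is a candidate for a "customers who bought X also bought Y" recommendation, although a raw count says nothing yet about seasonality or the other factors mentioned above.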

As you may expect, this process can be extremely arduous and time-consuming and may require a high level of acumen to come up with valuable and actionable insights. Going through such a large amount of data, where trends might not be obvious, could get painful and demotivating. Therefore, learning some useful data mining procedures may prove beneficial in this regard.

You might be wondering what benefit you can get out of these techniques. As taught in Data Science Dojo’s data science bootcamp, the first is improved prediction and forecasting for your product. An in-depth analysis of trends offers managers a much more reliable basis for planning and forecasts. It also strengthens their decision-making, because decisions are based on evidence rather than on conjecture and intuition. Additionally, it enables an organization to utilize resources optimally and enhance the customer’s experience. How mining techniques can be leveraged to fulfill an organization’s goals is discussed ahead.

Data Mining Process

The complexity of the entire data mining mechanism can vary according to the size and kind of data an organization has and the aims that are required to be fulfilled. However, in most cases, there will be a generic process that underlies all such activities.

Domain Knowledge

The foremost step of this process is to possess relevant domain knowledge regarding the problem at hand. To anyone looking at a large pile of data, it may seem like a collection of junk unless the person has the background knowledge and information about the business. Only then will one be able to get a sense of what sort of data is required, which properties of the data are relevant, and how they could be used to solve the problem at hand. Once these questions are answered, it’ll be easier to stay focused, allocate resources properly and eventually attain a productive result. This step will guide the discovery method and allow discovered patterns to be expressed in concise terms and at different levels of abstraction.

Data Collection

After defining the goals in the previous step, it is essential to collect data. This could involve using data that already exists in a company’s database, getting data from external sources, or collecting new data through survey forms filled in by customers. Some experts hold that an organization should collect as much data as possible, even if its value is not yet clear at an early stage.

Data Cleaning and Preprocessing

Following the collection step comes the most onerous step of all: Data Cleaning and Preprocessing. In simple terms, this step involves dealing with missing values and outliers and removing noise or other misleading components that may lead to false conclusions. It also includes transforming data into the form required by the mining procedure. This step may take a lot of resources, effort and patience to perform.
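As a rough illustration of what cleaning can look like in practice, the sketch below normalises a text field and fills missing numbers with the median of the observed values. The records, field names and the clean function are all hypothetical, and median imputation is only one of several reasonable choices; outlier handling is left out to keep the example short.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"customer": " Alice ", "age": 34, "spend": 120.0},
    {"customer": "bob", "age": None, "spend": 80.0},
    {"customer": "CAROL", "age": 29, "spend": None},
    {"customer": "dave", "age": 41, "spend": 100.0},
]

def clean(records, numeric_fields):
    """Return cleaned copies: names normalised, missing numbers set to the median."""
    # Median of the observed (non-missing) values for each numeric field
    medians = {
        field: median(r[field] for r in records if r[field] is not None)
        for field in numeric_fields
    }
    cleaned = []
    for r in records:
        row = dict(r)  # copy so the raw data stays untouched
        row["customer"] = row["customer"].strip().lower()
        for field in numeric_fields:
            if row[field] is None:
                row[field] = medians[field]
        cleaned.append(row)
    return cleaned

cleaned = clean(raw, ["age", "spend"])
for row in cleaned:
    print(row)
```

Working on copies rather than overwriting the raw records makes it easier to revisit a cleaning decision later, which tends to happen more than once in this step.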

Analysis and Interpretation

The most crucial part begins by limiting the data to its most important features and creating new, useful features from combinations of the existing ones. The data is then carefully analyzed, and mining algorithms are used to extract hidden patterns from it. The models created using these algorithms are evaluated against appropriate metrics to verify their credibility. The choice of metric depends on the nature of the problem. For example, a fraud detection problem might use the false-negative rate as its evaluation metric, because a missed fraudulent transaction is usually costlier than a false alarm. The patterns discovered in this step are interpreted using various visualization and reporting techniques so that other team members can understand them.
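For the fraud example, the false-negative rate is the share of truly fraudulent cases that the model failed to flag. The sketch below computes it from two lists of labels; the labels are invented for illustration and the function name false_negative_rate is our own.

```python
def false_negative_rate(actual, predicted):
    """Share of truly positive cases that the model labeled negative."""
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    if fn + tp == 0:
        raise ValueError("no positive cases in the actual labels")
    return fn / (fn + tp)

# Hypothetical labels: 1 = fraudulent, 0 = legitimate
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]

fnr = false_negative_rate(actual, predicted)
print(f"false-negative rate: {fnr:.2f}")
```

Here two of the four fraudulent cases are missed, so the rate is 0.5. Plain accuracy would hide this, since five of the eight predictions are still correct.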

Deployment

Finally, the insights are used to take action and make important business decisions that solve the problem, so that the entire laborious process pays off. The success of this process can be assessed by how much value it brings to your business.

Data Mining Models

The models used for data mining fall into two main types: supervised and unsupervised. Supervised models learn from data that has been labeled, whereas unsupervised models work with unlabeled data. These models can be further classified as specified in the descriptions below.

Classification

Classification is a supervised learning technique in which a known structure is generalized to distinguish instances in new data. Some classifiers, such as decision trees, build sets of discrete rules that split the data so that each resulting group contains the highest possible proportion of records sharing the same target value. Banks use classification to predict whether a client is going to default on a loan payment based on the client’s activities.
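The idea of a rule that splits the data can be sketched with the smallest possible rule-based classifier, a single threshold on one feature (sometimes called a decision stump). The training data, the feature and the function names below are hypothetical; a real credit model would use many features and a proper library.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest rows.

    The learned rule is: predict 1 (default) if value > threshold, else 0.
    """
    best_threshold, best_errors = None, len(values) + 1
    for threshold in sorted(set(values)):
        errors = sum(
            1 for v, y in zip(values, labels)
            if (1 if v > threshold else 0) != y
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict(threshold, value):
    """Apply the learned rule to a new client."""
    return 1 if value > threshold else 0

# Hypothetical training data: missed payments in the last year, and whether
# the client later defaulted (1) or not (0).
missed_payments = [0, 1, 0, 2, 4, 5, 3, 6]
defaulted       = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(missed_payments, defaulted)
print("rule: default if missed payments >", threshold)
print("client with 5 missed payments:", predict(threshold, 5))
```

A decision tree repeats this kind of split recursively on the groups it produces, which is how it ends up with a set of rules rather than a single one.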

Regression

Regression analysis is a statistical method for examining the relationship between two or more variables. It is a supervised learning technique used in predictive analytics to estimate a continuous value based on one or more input variables. For example, companies use regression algorithms to forecast sales in future months based on the sales data of previous months.
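As a small worked example of the sales forecast, the sketch below fits a straight line to five months of made-up sales figures using ordinary least squares and extends it one month ahead. The numbers and the fit_line helper are illustrative assumptions, and a straight line is the simplest possible model; real sales data often needs seasonality terms as well.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly sales (in thousands) for months 1 to 5
months = [1, 2, 3, 4, 5]
sales = [100.0, 112.0, 119.0, 131.0, 140.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted sales for month 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, month 6 forecast={forecast:.1f}")
```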

Anomaly Detection

Also known as outlier detection, anomaly detection is typically an unsupervised learning technique used to find rare occurrences or suspicious events in your data. The unusual data points may point to a problem or rare event that warrants further investigation. Anomaly detection can be used in network security to find external intrusions or suspicious user activity, for instance, a hacker opening connections on uncommon ports or protocols.
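One simple way to flag unusual points, assuming a single numeric measurement, is the modified z-score, which measures how far each value lies from the median in units of the median absolute deviation. The traffic numbers below are invented, the function name is our own, and the 0.6745 constant and 3.5 cutoff are a common rule of thumb rather than a universal setting.

```python
from statistics import median

def find_anomalies(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds the cutoff."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

# Hypothetical connections per minute from one user; 240 is the odd one out
connections_per_minute = [12, 15, 11, 14, 13, 12, 16, 240, 13, 14]

anomalies = find_anomalies(connections_per_minute)
print(anomalies)
```

Using the median instead of the mean matters here: a single extreme value inflates the mean and standard deviation enough that it can hide itself in a small sample.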

Clustering

Another unsupervised learning method, clustering is the practice of assigning labels to unlabeled data using the patterns that exist in it. It helps find structures in the data by grouping similar data points together. For example, clustering is used to group a large set of documents into categories based on their content.
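A minimal sketch of one popular clustering algorithm, k-means, is shown below on one-dimensional points. The points, the starting centers and the function name kmeans_1d are assumptions made for illustration; documents would first have to be turned into numeric feature vectors, and a real implementation would work in many dimensions and pick its starting centers more carefully.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabeled measurements and two rough starting centers
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

Nobody told the algorithm which points belong together; the two groups emerge from the distances alone, which is what "assigning labels to unlabeled data" means in practice.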

Common Applications

Data mining can be effectively used in marketing to create customer segments based on their purchasing patterns that can be extracted from behavioral analysis. This could be used for creating targeted advertising that varies according to the customer profile and as a result, increases the conversion rate.

As mentioned above, financial organizations have been using data mining to detect fraudulent transactions. Previous transaction data can be analyzed to extract patterns and find anomalies that help distinguish such transactions. These methods are becoming more effective at combating fraud and at catching atypical activity that might otherwise go unnoticed.

In the area of Natural Language Processing, data mining, here referred to as text mining, can be extremely useful when analyzing a large volume of news, social media or other text data. Such techniques have evolved to become contextually aware and can reveal the sentiment expressed about a particular product. A firm could use them to understand how the general public feels about its newly launched device. Similarly, text mining helps discover which topics the public is discussing in relation to a particular subject. It is used for detecting fake news on social media as well.

About The Author

Rahim Rasool is an Associate Data Scientist at Data Science Dojo (DSD) where he helps create learning material for DSD’s data science bootcamp. He holds a bachelor’s in electrical engineering from National University of Sciences and Technology. He possesses great interest in machine learning, astronomy and history.

Note: Data Science 101 is proud to have this sponsored post from Data Science Dojo.
