What is Data Mining?
Data Mining is the process of finding previously unknown patterns, correlations, and anomalies in data. Through mining, large datasets can be transformed into meaningful insights. This process can uncover valuable and actionable information contained within data that may otherwise be hidden from the user. Data Mining is especially effective at unveiling insights from messy, unstructured data, sifting through the information within and sorting it by relevance.
Data Mining has a lot in common with Machine Learning: the two share many of the same statistical methods and large-scale data management techniques. But whilst Data Mining concerns itself with finding insights in and between datasets, Machine Learning extends the concept further to incorporate the idea of “updating over time” (automatically finding correlations and feeding them back into the algorithms that process new data).
How does Data Mining work?
There are two main approaches to Data Mining: Supervised and Unsupervised Learning.
Supervised Data Mining: example outputs are labelled beforehand, and the goal is to learn the function that maps inputs to outputs, so that it can be applied to new, unseen data.
Unsupervised Data Mining: outputs are not labelled in advance, and the goal is to deduce natural underlying structures within data (making unsupervised data mining a useful tool for exploratory analysis).
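The difference between the two approaches can be illustrated with a minimal sketch in plain Python. The toy customer data, the label names, and the function names below are assumptions made for illustration only. The supervised half uses a simple nearest-centroid classifier and the unsupervised half a basic k-means loop; many other algorithms would serve equally well.

```python
from math import dist
from statistics import mean

# Hypothetical toy data: (visits_per_week, purchases_per_month) for six customers.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# Labels exist only in the supervised case.
labels = ["casual", "casual", "casual", "loyal", "loyal", "loyal"]


def centroid(group):
    """Mean position of a non-empty group of 2-D points."""
    return (mean(p[0] for p in group), mean(p[1] for p in group))


def fit_nearest_centroid(xs, ys):
    """Supervised: learn one centroid per known label."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: centroid(members) for label, members in groups.items()}


def predict(model, x):
    """Assign an unseen point to the label whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], x))


def k_means(xs, k, iterations=10):
    """Unsupervised: group points into k clusters without using any labels."""
    centroids = list(xs[:k])  # naive deterministic start, chosen for simplicity
    assignments = [0] * len(xs)
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda i: dist(centroids[i], x)) for x in xs
        ]
        # Step 2: move each centroid to the mean of its members.
        for i in range(k):
            members = [x for x, a in zip(xs, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = centroid(members)
    return assignments


model = fit_nearest_centroid(points, labels)
print(predict(model, (2.0, 2.0)))  # casual
print(k_means(points, k=2))        # [0, 0, 0, 1, 1, 1]
```

In the supervised case the output is a named label learned from examples. In the unsupervised case the output is only a cluster number, and it is left to the analyst to interpret what each cluster means.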
A typical Data Mining workflow often looks something like this:
- Business understanding is established: analysts and management with an understanding of an organization’s goals agree what type of information they hope to uncover through Data Mining, as well as how they will organize and use that data. Performance can later be analyzed against these predefined business objectives.
- Data identification and preparation: data from sources of interest are loaded into a data warehouse, or other datastore. Some of these may be “analysis-ready”. Others may require transformation or normalization in order to get them into a state from which they can be analyzed. Data quality/integrity tests may form part of the “pipeline” that this process constitutes. Frameworks like Great Expectations assist greatly in this, and users can make use of them within HASH Flows. At this stage, the number of features in a dataset may be reduced to those most closely aligned with business goals. Some amount of manual human-led data exploration may be performed at this point, in conjunction with the information acquired during step one, to guide later expectations.
- Data mining: test scenarios are generated to gauge the validity and quality of the data, and data mining begins. Various types of mining may take place, including:
  - Automated classification: placing a data point into a predefined category, if an existing structure is known to the user (used by banks to check credit scores when issuing loans/mortgages, by companies looking to gender-stratify their audiences, and by platforms looking to identify the genre or category of user-created content).
  - Regression analysis: identifying the relationship between one dependent variable and one or more independent variables. Frequently utilized in the context of predictive analytics, to forecast future values from the data available today.
  - Association analysis: identifying which variables frequently occur together or are correlated (used by online retailers to recommend similar items).
  - Outlier recognition: identifying the data points that fall outside a predefined category (and do not follow the rules of mutual interdependence between variables). Anomaly detection is often used in security and investigations contexts, as well as by arbitrageurs in financial markets.
  - Cluster identification: extending the idea of ‘automated classification’, Data Mining can be used to identify recurring similarities between data, and suggest new groups into which entities might be clustered (frequently used in marketing to target specific consumer demographics, and in process mining).
- Collation and evaluation: initial results are viewed in light of the original project objectives mapped out before work began. Data visualization tools such as Power BI, Looker or hCore can be used to neatly present findings from research to end-users, so they can decide what to do with the information.
- Application: lessons learned can be applied in-business. Some insights can be shared across functions, through a data and experiment catalog like hIndex, to help improve all levels of business performance. It is up to the user to selectively act upon any insights drawn from data mining processes. This contrasts with Machine Learning, in which results may be fed back into algorithms as part of an automated workflow, powered by a system such as hCloud.
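Three of the mining techniques named in the workflow above (regression analysis, outlier recognition, and association analysis) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function names, the sample figures, and the thresholds (`threshold=2.0`, `min_support=0.5`) are hypothetical, and production systems would normally rely on a dedicated statistics or data mining library.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Regression analysis: ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def find_outliers(values, threshold=2.0):
    """Outlier recognition: values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def frequent_pairs(baskets, min_support=0.5):
    """Association analysis: item pairs found together in at least `min_support` of baskets."""
    counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


# Hypothetical monthly sales figures (month number, units sold).
slope, intercept = fit_line([1, 2, 3, 4, 5], [12, 14, 16, 18, 20])
print(slope * 6 + intercept)  # forecast for month 6: 22.0

# Hypothetical transaction amounts with one suspicious entry.
print(find_outliers([20, 22, 19, 21, 20, 23, 250]))  # [250]

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
print(frequent_pairs(baskets))  # {('bread', 'butter'): 0.75}
```

The z-score rule used here for outliers is only one of many possible definitions. On small samples a single extreme value inflates the standard deviation, so the threshold needs to be chosen with the data size in mind.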
3 Types of Data Modeling
- Descriptive: reveals similarities and differences between data points in a dataset, identifying any anomalies and highlighting relationships between variables.
- Predictive: forecasts what will happen in the future, based on the data available today.
- Prescriptive: builds on an initial predictive model to recommend potential steps that would alter the outcome that model predicts.
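The three types of modeling above can be seen as successive layers over the same data. The following is a minimal sketch, assuming hypothetical advertising spend and revenue figures, a simple linear model, and an invented revenue target. It is one possible illustration rather than a definitive method.

```python
from statistics import mean

# Hypothetical history: advertising spend and resulting revenue (same currency units).
spend = [10, 20, 30, 40]
revenue = [25, 45, 65, 85]


def describe(values):
    """Descriptive: summarize what the data looks like today."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def fit_predictor(xs, ys):
    """Predictive: fit a least-squares line and return a forecasting function."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def recommend_spend(predict, candidates, target):
    """Prescriptive: pick the cheapest candidate action whose forecast meets the target."""
    viable = [c for c in candidates if predict(c) >= target]
    return min(viable) if viable else None


print(describe(revenue))                             # {'min': 25, 'max': 85, 'mean': 55}
forecast = fit_predictor(spend, revenue)
print(forecast(50))                                  # 105.0
print(recommend_spend(forecast, [50, 60, 70], 120))  # 60
```

Here the prescriptive step does nothing more than search a short list of candidate actions using the predictive model. More elaborate approaches (optimization or simulation, for example) follow the same pattern of evaluating possible actions against predicted outcomes.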