Data mining is quickly becoming integral to creating value and business momentum. As a framework for collecting, searching, and filtering raw data in a systematic manner, data mining helps analysts parse data sets and get at the most meaningful, useful information. This ability to detect patterns in the data, create models around groups, and test assumptions about relationships has the collective power to change business paradigms and catalyze more successful outcomes.
But by what methods does data mining make sense of data? With increasing volume, variety, and velocity of data, how does data mining reduce noise, aggregate and simplify?
In this post we’ll explore several popular data mining techniques and briefly examine how they allow analysts to focus on variables, keep distractions at bay, and unearth patterns hidden within their data. Along the way, we’ll also define salient terms to give you a better taste for the vast and exciting field of data mining.
One way to reduce noise is through “dimensionality reduction”, where you reduce the number of random variables under consideration by obtaining a set of principal variables. Essentially, you either select a subset of the existing features (feature selection) or derive a smaller set of new ones from them (feature extraction). Dimensionality reduction is a very popular technique in machine learning because the more variables, or features, a dataset has, the harder it is to visualize the training set and work with it. It pays off most when some features are correlated or redundant, since those can be dropped or combined with little loss of information. It is then possible to group cases into those similar to one another and those that are different, sorting and clustering datasets into dense collections of associated points.
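To make the idea of redundant features concrete, here is a minimal feature-selection sketch in plain Python. It keeps a feature only if it is not strongly correlated with one already kept. The column names, the sample values, and the 0.95 threshold are all assumptions chosen for illustration, and real projects would more likely reach for a library technique such as principal component analysis.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.95):
    """Return the feature names to keep, skipping any feature whose absolute
    correlation with an already-kept feature reaches the threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: height is recorded twice, in different units.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 30, 27],
}

print(drop_redundant(data))
```

Here `height_in` carries almost exactly the same information as `height_cm`, so it is the one dropped, while `age` survives because it is only weakly correlated with height.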
With clusters established, you may then classify new cases as either those that do belong or those that don’t, and put them into the right “bucket”. Algorithms usually do this work. One common choice for forming the groups is k-means clustering, where you choose the number of groups in advance and the algorithm finds how cases group around that many centers in a high-dimensional space; a new case can then be assigned to the nearest center. You could also think of classification as categorization of observations so as to understand taxonomies and sub-groups.
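As a rough illustration of grouping cases and then bucketing a new one, the sketch below implements a bare-bones k-means in plain Python. The sample points are made up, and the initialization (simply taking the first k points as starting centers) is a simplifying assumption; production implementations use smarter starting points and convergence checks.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Very small k-means: returns k centroids for a list of coordinate tuples."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids

def assign(point, centroids):
    """Index of the bucket (nearest centroid) for a new case."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

# Hypothetical two-dimensional cases forming two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids = kmeans(points, k=2)

print(centroids)
print(assign((0, 0), centroids), assign((10, 10), centroids))
```

Once the centroids are fixed, putting a new case into the right bucket is just a nearest-center lookup, which is what `assign` does.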
Association analysis finds relationships between variables so you can see which items tend to occur together. First you define item sets, then generate rules around those item sets. For example, say you have an item set of A, B, and C. You might try to find whether, when A occurs, B occurs as well. Or whether, when B is present, C is generally present, too. Often, association analysis is useful for analyzing purchasing habits. You might, for instance, find that when a customer purchases Product X and Y together, there is a 50% chance that they’ll also purchase Z. As a store manager, it would behoove you to put these items together, perhaps discount one, or promote them as a unit. And so while association analysis can get complex, for example when algorithms such as Apriori have to search a very large number of candidate item sets, in general it helps to calculate the conditional probability, or the “confidence”, of an outcome.
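The confidence of a rule is simple arithmetic once you can count baskets. The sketch below computes support and confidence for the rule "X and Y, therefore Z" on a made-up list of baskets; the basket contents are invented so that the result matches the 50% figure used above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from transaction counts."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante else 0.0

# Hypothetical shopping baskets.
baskets = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X"},
    {"Z"},
]

print(support(baskets, {"X", "Y"}))            # share of baskets with both X and Y
print(confidence(baskets, {"X", "Y"}, {"Z"}))  # 0.5: half of those also contain Z
```

Four of the six baskets contain both X and Y, and two of those four also contain Z, which is where the 0.5 comes from.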
Every dataset has outliers. But anomalies are different from ordinary outliers. They’re unusual events that often signal a problem, such as a broken sensor or a fraudulent transaction. Unlike grouping, where you look for cases that are alike, with anomaly detection you're trying to find cases that are different from all the others, that don't belong in the clusters, so you can investigate or exclude them. Anomaly detection is important because if you don’t exclude the anomalies, they can distort the data and relationships between variables.
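One simple way to flag cases that don't belong is to measure how far each value sits from the median, scaled by the median absolute deviation (MAD). The sketch below does that for a made-up series of sensor readings. The 3.5 cutoff and the 0.6745 scaling constant are a commonly used convention rather than anything specific to this post, and the approach assumes a single numeric variable.

```python
from statistics import median

def find_anomalies(values, threshold=3.5):
    """Return values whose modified z-score (based on the median and MAD)
    exceeds the threshold. Returns an empty list if the MAD is zero."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical temperature readings with one faulty spike.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 85.0, 20.2]

print(find_anomalies(readings))
```

Using the median instead of the mean matters here: a single extreme reading drags the mean and standard deviation toward itself, which can hide the very anomaly you are looking for.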
Score prediction is useful if you have a ton of variables, characteristics, or fields in your data and you want to find the ones among that big collection that best predict a particular outcome. In a classification problem, you typically have historical data (labeled examples) and unlabeled examples. The goal of classification is to construct a model using the historical data that accurately predicts the label (class) of the unlabeled examples.
A classification task begins with build data (also known as training data) for which the target values (or class assignments) are known. A classification model can also be run on held-out data with known target values, to compare its predictions to the known answers; such data is known as test data or evaluation data. This step is called testing a model, and it measures the model's predictive accuracy. The application of a classification model to new data is called applying the model, and the data is called apply data or scoring data. Applying a model is often called scoring the data.
Classification is used in customer segmentation, business modeling, credit analysis, and many other applications. For example, a credit card company may wish to predict which customers will default on their payments. Each customer corresponds to a case; data for each case might consist of a number of attributes that describe the customer's spending habits, income, demographic attributes, etc. These are the predictor attributes. The target attribute indicates whether or not the customer has defaulted; that is, there are two possible classes, corresponding to having defaulted or not.
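The build, test, and apply steps can be sketched with a deliberately tiny classifier. The example below uses a nearest-centroid rule, which is a stand-in chosen for brevity and not a recommendation for real credit scoring. The two predictor attributes (credit utilization and missed payments) and every data point are hypothetical.

```python
from math import dist

def build_model(build_data):
    """Build step: compute one centroid per class from (features, label) pairs."""
    by_class = {}
    for features, label in build_data:
        by_class.setdefault(label, []).append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in by_class.items()
    }

def apply_model(model, features):
    """Apply (score) step: predict the class whose centroid is nearest."""
    return min(model, key=lambda label: dist(features, model[label]))

def evaluate_model(model, test_data):
    """Test step: fraction of held-out cases predicted correctly."""
    correct = sum(apply_model(model, f) == label for f, label in test_data)
    return correct / len(test_data)

# Hypothetical cases: (credit utilization, missed payments last year) -> target.
build_data = [
    ((0.20, 0), "paid"), ((0.30, 1), "paid"), ((0.10, 0), "paid"),
    ((0.90, 4), "default"), ((0.80, 3), "default"), ((0.95, 5), "default"),
]
test_data = [((0.25, 1), "paid"), ((0.85, 4), "default")]

model = build_model(build_data)
print(evaluate_model(model, test_data))  # accuracy on the test data
print(apply_model(model, (0.70, 3)))     # score a new, unlabeled customer
```

Note that the test data is kept separate from the build data, so the accuracy figure reflects how the model handles cases it has not already seen.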
Score prediction offers a plenitude of techniques. One way to make precise, unbiased predictions is by using regression analysis, in which you predict the mean of the dependent variable given specific values of the independent variable(s). One caveat: a well-specified regression model gives predictions that are unbiased on average, but that alone says nothing about their precision. Precision measures how close the predictions are to the observed values. In data mining, you want predictions to be both unbiased and close to the actual values, so that the observed values cluster close to the predicted values. While there are many kinds of regression (linear regression, logistic regression, Poisson regression, and more), and as many methods for choosing which variables should be used to make the prediction, in general regression analysis helps find patterns in a large amount of data.
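The difference between bias and precision is easy to see with numbers. The sketch below fits a one-variable least-squares line to made-up data, then reports the mean residual (a measure of bias) and the root-mean-square error (a measure of precision). The data values and function names are assumptions for illustration only.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def bias_and_rmse(xs, ys, slope, intercept):
    """Mean residual (bias) and root-mean-square error (precision)."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    bias = sum(residuals) / len(residuals)
    rmse = sqrt(sum(r * r for r in residuals) / len(residuals))
    return bias, rmse

# Hypothetical observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
bias, rmse = bias_and_rmse(xs, ys, slope, intercept)
print(slope, intercept)
print(bias, rmse)
```

On the data it was fitted to, a least-squares line with an intercept always has a mean residual of essentially zero, so the bias figure looks perfect no matter how scattered the points are. The RMSE is what tells you how tightly the observed values cluster around the predictions.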
Sequence & Text Mining
For time-ordered data, sequence mining helps you to find events that adhere to an order, so that if one event occurs, you can say which event is likely to occur next.
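A very small version of this idea is to count which event most often follows another. The sketch below does that over a few made-up user sessions. The event names are hypothetical, and counting only adjacent pairs is a simplifying assumption; fuller sequence-mining methods also look for longer and non-adjacent patterns.

```python
from collections import Counter, defaultdict

def next_event_counts(sequences):
    """For each event, count which events immediately follow it."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            transitions[current][following] += 1
    return transitions

def most_likely_next(transitions, event):
    """Return (most frequent follower, its share of cases), or None if unseen."""
    counts = transitions.get(event)
    if not counts:
        return None
    follower, n = counts.most_common(1)[0]
    return follower, n / sum(counts.values())

# Hypothetical time-ordered sessions.
sessions = [
    ["login", "search", "add_to_cart", "checkout"],
    ["login", "search", "search", "add_to_cart"],
    ["login", "browse", "search", "add_to_cart", "checkout"],
]

transitions = next_event_counts(sessions)
print(most_likely_next(transitions, "search"))
```

In these sessions, "search" is followed by "add_to_cart" three times out of four, so that is the event the sketch reports as most likely to come next.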
Unlike sequence mining, which works with structured data, text mining takes blocks of unstructured text data, whether that be from a customer forum, a book, a blog post, or a tweet history, in order to find words or phrases that carry meaning. Text mining draws on algorithms from data mining, machine learning, statistics, and natural language processing.
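The most basic form of finding words that carry meaning is counting terms after discarding filler words. The sketch below does this with the standard library. The stopword list is a tiny, hand-picked assumption and the review text is invented; real text mining would add steps such as stemming, phrase detection, and weighting.

```python
import re
from collections import Counter

# A deliberately short, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "in", "but", "gets"}

def top_terms(text, n=3):
    """Return the n most frequent non-stopword terms as (term, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)

# Hypothetical customer forum post.
review = "The battery died fast. Battery life is poor and the battery gets hot."

print(top_terms(review, n=1))
```

Even on one short post, the count surfaces "battery" as the term worth a closer look.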
When you have an enormous data set, analysis time can grow rapidly, and for some algorithms much faster than the data itself. Data reduction thus not only saves you time and RAM, it also helps you simplify the dataset and focus on the variables or constructs that are most likely to carry meaning and least likely to carry noise. To this end, there are a lot of great tools at your disposal. For example, you can do data reduction in R, Python, Orange, RapidMiner, Knime, or other tools.
As pragmatic groupings, clusters serve a particular purpose for the analyst. You’re grouping a set of objects in such a way that objects in the same group (cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Clustering serves a lot of purposes. You might want to hear certain songs together on a playlist. Maybe you want to serve up the same ads to a certain cohort. Or perhaps you want to ensure that certain kinds of patients get the same medical treatment. The algorithm you use will define what it means to be similar, how to measure distance, and what it means to be a cluster.
How you measure distance may vary (Euclidean distance, distance from a centroid, data density and distribution) and depends on the kind of data at your disposal. There are several different ways to implement this partitioning, based on distinct models. Different algorithms apply to each model, and each produces clusters with different properties. The models are distinguished by how clusters are organized and by the type of relationship between their members. The most important ones are:
- Centralized – each cluster is represented by a single vector mean, and an object value is compared to these mean values
- Distributed – the cluster is built using statistical distributions
- Connectivity – the connectivity on these models is based on a distance function between elements
- Group – algorithms have only group information
- Graph – cluster organization and relationship between members is defined by a graph linked structure
- Density – members of the cluster are grouped by regions where observations are dense and similar
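To show how a different model changes what counts as a cluster, here is a minimal connectivity-style sketch: two points end up in the same cluster if they can be linked by a chain of hops, each no longer than `max_gap`. The function name, the sample points, and the gap of 1.5 are assumptions for illustration, and this naive version compares each new point against every point seen so far, so it only suits small datasets.

```python
from math import dist

def connectivity_clusters(points, max_gap):
    """Group points so that any two in a cluster are connected by a chain
    of neighbors at most max_gap apart (Euclidean distance)."""
    clusters = []
    for p in points:
        linked, rest = [], []
        for cluster in clusters:
            near = any(dist(p, q) <= max_gap for q in cluster)
            (linked if near else rest).append(cluster)
        # Merge the new point with every cluster it touches.
        merged = [p] + [q for cluster in linked for q in cluster]
        clusters = rest + [merged]
    return clusters

# Hypothetical points: a short chain near the origin and a pair far away.
points = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11)]
groups = connectivity_clusters(points, max_gap=1.5)

print(groups)
```

Notice that (0, 0) and (0, 2) are farther apart than the allowed gap, yet they land in the same cluster because (0, 1) links them. A centroid-based method judges membership by distance to a center instead, so the two approaches can disagree on elongated or chained shapes.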
One of the most important points to underscore about data mining is that, despite how useful each of these techniques may be, your data must be in good shape before you can do any real analysis. In fact, data preparation is often estimated to take up 50-80% of a data mining project.
If you want to cut that number down to zero and accelerate your time to analytics, we encourage you to connect your SaaS applications to Fusion. In minutes, you’ll have your very own cloud data warehouse which you can connect to any BI or data mining tool of choice. Getting started is free and easy.