Sunday, August 31, 2014

Association rule learning (slides)

Association rule learning is a popular approach for extracting rules from large databases. Initially intended for transactional data, especially market basket analysis, the method can be applied to any binary or binarized data.

In these slides, we outline the approach. We present a basic algorithm to generate association rules from data. We highlight how the settings (minimum support and minimum confidence) reduce the search space, and thus the amount of computation.
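
To illustrate how the two thresholds prune the search, here is a minimal sketch in Python. It uses a level-wise (Apriori-style) search, which is not necessarily the algorithm presented in the slides, and the toy transactions and function names are hypothetical, chosen only for the example.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: only frequent k-itemsets are extended to size k+1."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s, transactions)
        # Candidates of size k+1 are unions of two frequent k-itemsets.
        k = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return result


def rules(freq, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    out = []
    for itemset, supp in freq.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                c = itemset - a
                confidence = supp / freq[a]   # P(consequent | antecedent)
                if confidence >= min_confidence:
                    lift = confidence / freq[c]
                    out.append((a, c, supp, confidence, lift))
    return out


freq = frequent_itemsets(transactions, min_support=0.5)
for a, c, supp, conf, lift in rules(freq, min_confidence=0.7):
    print(sorted(a), "->", sorted(c), round(supp, 2), round(conf, 2), round(lift, 2))
```

On this toy data, raising the minimum support from 0.2 to 0.5 reduces the number of frequent itemsets from 15 to 6, which in turn limits the number of candidate rules to evaluate.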

Keywords: association rule, association rules, itemset, frequent itemset, eclat algorithm, support, confidence, lift
Slides: Association rule learning
Wikipedia, "Association Rule Learning".
M. Zaki, S. Parthasarathy, M. Ogihara, W. Li, “New Algorithms for Fast Discovery of Association Rules”, in Proc. of KDD’97, p. 283-296, 1997.

Tuesday, August 12, 2014

ROC curve (slides)

The ROC curve is a graphical tool for the evaluation and comparison of binary classifiers. It provides a more complete evaluation than the confusion matrix and the error rate. It remains valid even with a non-representative test set, i.e. when the observed class frequencies are not an estimate of the prior class probabilities. It is especially useful when we deal with class imbalance, and when the misclassification cost matrix is not well established.

In these slides, we show: the ideas underlying the ROC curve; the construction of the curve from a dataset; the calculation of the AUC (area under curve), a synthetic indicator derived from the ROC curve; and the use of the ROC curve for model comparison.
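
As an illustration of the construction, here is a minimal sketch in Python that builds the ROC points from labels and scores, then computes the AUC with the trapezoidal rule. The function names and the small labeled sample are hypothetical, and the slides may organize the calculation differently.

```python
def roc_points(labels, scores):
    """ROC points (false positive rate, true positive rate), obtained by
    lowering the decision threshold from the highest score to the lowest.
    `labels` holds 1 for positive instances and 0 for negative ones."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Instances sharing the same score are processed together,
        # so that ties produce a single point.
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labeled sample: 3 positives, 3 negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```

Here 8 of the 9 (positive, negative) pairs are ranked correctly, so the AUC is 8/9. To compare two models, the same functions can be applied to the scores of each model on the same test set.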

Keywords: receiver operating characteristic, roc curve, auc, area under curve, binary classifier, evaluation, model comparison, class probability estimate, score
Components (Tanagra): SCORING, ROC CURVE
Slides: ROC curve
Wikipedia, "Receiver Operating Characteristic".
T. Fawcett, "An introduction to ROC analysis", Pattern Recognition Letters, 27, 861-874, 2006.

Monday, August 4, 2014

Customer targeting (slides)

Customer targeting is one component of direct marketing. The aim is to identify the customers who are most likely to be interested in a new product. This is a data mining task because we build a classifier from a learning sample. However, we do not want to classify the instances; we want to estimate each individual's probability of buying the product, i.e. their score or propensity to purchase. In this context, we use a specific tool, the gain chart (or cumulative lift curve), to assess the efficiency of the targeting.

In these slides, we detail the overall process. We emphasize the reading of the gain chart, especially the transposition of the reading of the chart from a labeled sample to the customer database (for which we do not know the values of the target attribute).
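
To make the reading of the gain chart concrete, here is a minimal sketch in Python that computes the cumulative gains from a labeled sample. The function name and the ten-customer sample are hypothetical.

```python
def cumulative_gains(labels, scores):
    """Points (fraction of customers targeted, fraction of buyers reached),
    with customers ranked by decreasing score.
    `labels` holds 1 for buyers and 0 for non-buyers."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    n = len(ranked)
    n_pos = sum(ranked)
    points = [(0.0, 0.0)]
    found = 0
    for k, y in enumerate(ranked, start=1):
        found += y
        points.append((k / n, found / n_pos))
    return points


# Hypothetical labeled sample of 10 customers, 4 of whom bought the product.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for targeted, reached in cumulative_gains(labels, scores):
    print(f"{targeted:.0%} targeted -> {reached:.0%} of buyers reached")
```

In this sample, targeting the top 20% of customers by score reaches 50% of the buyers, a lift of 2.5 over random targeting. Transposed to the customer database, and assuming the labeled sample is representative of it, contacting the 20% of customers with the highest scores would be expected to reach about half of the potential buyers.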

Keywords: customer targeting, direct marketing, scoring, score, propensity to purchase
Components (Tanagra): SCORING, LIFT CURVE
Slides: Customer targeting
Microsoft, “Lift chart (Analysis Services – Data Mining)”, SQL Server 2014.
H. Hamilton, “Cumulative Gains and Lift Charts”, in CS 831 – Knowledge Discovery in Databases, 2012.

Saturday, August 2, 2014

Descriptive discriminant analysis (slides)

Descriptive discriminant analysis (DDA), or canonical discriminant analysis, is a statistical approach that provides a multivariate characterization of the differences between groups. It is related to other factorial approaches such as principal component analysis and canonical correlation analysis.

In these slides, we show the main issues of the approach and how to read the results. We also show how descriptive discriminant analysis is related to predictive discriminant analysis (linear discriminant analysis), which, however, relies on restrictive statistical assumptions.
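
The correlation ratio listed in the keywords measures how well a single variable separates the groups: it is the share of the total sum of squares explained by the group means. Here is a minimal sketch in Python for one variable; the function name and the data are hypothetical, and the computation of the discriminant axes themselves is not shown.

```python
def correlation_ratio(values, groups):
    """Between-group sum of squares divided by total sum of squares.
    Ranges from 0 (identical group means) to 1 (no within-group variation)."""
    n = len(values)
    grand_mean = sum(values) / n
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in by_group.values())
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return ss_between / ss_total


# Hypothetical data: one variable measured on two groups of three instances.
values = [1, 2, 3, 7, 8, 9]
groups = ["a", "a", "a", "b", "b", "b"]
print(correlation_ratio(values, groups))
```

For this data the between-group sum of squares is 54 and the total is 58, so the ratio is about 0.93, which indicates well-separated groups.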

Keywords: discriminant analysis, descriptive discriminant analysis, canonical discriminant analysis, predictive discriminant analysis, correlation ratio, R, lda package MASS, sas, proc candisc
Slides: DDA
Dataset: wine_quality.xls
SAS, "CANDISC procedure".