What is lift in data mining?

Lift chart

 

The lift chart provides a visual summary of how useful the information from one or more statistical models is for predicting a binomial (categorical) outcome variable (dependent variable). For multinomial outcome variables, lift charts can be computed for each category. Specifically, the chart summarizes the gain that can be expected from using the respective predictive model compared with using baseline information only.

The lift chart is applicable to most statistical methods that compute predictions (classifications) for binomial or multinomial responses. In STATISTICA, lift charts can be computed in various modules, including General Classification and Regression Trees (GC&RT), GCHAID, Generalized Linear/Nonlinear Models (logit and probit models for binomial responses), General Discriminant Analysis models (for binomial responses), etc. The Rapid Deployment of Predictive Models module computes single and overlaid gains charts (for several predictive models) based on models that have been trained and deployed via PMML. These and similar summary charts (see also gains chart) are often used in data mining projects when the dependent variable or outcome of interest is binomial or multinomial.

Example. The following example illustrates how the lift chart is constructed. Suppose you have a mailing list of previous customers and you want to offer these customers an additional service by sending out a brochure and other material describing the service. During previous campaigns, you collected useful information about your customers (e.g., demographic information, purchase patterns) that you can relate to the response rate, i.e., whether the respective customers responded to your offer and what kind of order they placed. Also, from similar previous campaigns, you were able to estimate the baseline response rate at around 7%; i.e., 7% of all customers who received a similar offer by mail responded (ordered the additional service).

With this baseline response rate (7%) and the costs of the mail campaign, sending the offer to all customers would result in a loss. So you want to use statistical analysis to help identify the customers who are most likely to respond. You can use General Classification and Regression Trees (C&RT) in STATISTICA to build such a model based on the data collected in previous campaigns. You can then select only the 10% of customers from the mailing list that the C&RT model predicts are most likely to respond. If the response rate among these customers (selected by the model) is 14% (as opposed to the 7% baseline), then the relative gain or lift value is 14% / 7% = 2. In other words, using STATISTICA C&RT to select 10% of customers from the mailing list does twice as well as a simple random sample of the same size.
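The lift calculation in this example can be sketched in a few lines of Python. This is only an illustration and not part of STATISTICA: the function name lift_value and the customer counts (a mailing list of 10,000 with 700 responders, and a model-selected top 10% of 1,000 customers with 140 responders) are assumptions chosen to match the 7% and 14% response rates above.

```python
def lift_value(selected_responders, selected_total, all_responders, all_total):
    """Lift = response rate in the selected group / baseline response rate."""
    if selected_total <= 0 or all_total <= 0:
        raise ValueError("group sizes must be positive")
    baseline_rate = all_responders / all_total
    if baseline_rate == 0:
        raise ValueError("baseline response rate is zero; lift is undefined")
    selected_rate = selected_responders / selected_total
    return selected_rate / baseline_rate


# Hypothetical counts matching the example: 7% baseline, 14% in the top 10%.
example_lift = lift_value(
    selected_responders=140, selected_total=1_000,
    all_responders=700, all_total=10_000,
)
print(example_lift)
```

With these assumed counts the result is a lift of 2, the same value as in the example.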

Analogous values can be calculated for each percentile of the population (the customers in the mailing list). You can calculate separate lift values for selecting the top 20% of customers who are most likely to respond to the mail campaign, the top 30%, and so on. The lift values for the different percentiles can be connected by a line that typically descends slowly and merges with the baseline (a lift of 1) when all customers (100%) are selected.

If more than one prediction model is used, several lift charts can be overlaid in order to obtain a graphical comparison of the benefits of the different models.

See also Rapid deployment of predictive models and gains chart.

See also: Lift Charts in Statistica Data Miner.