Predicting Customer Spend and Engagement

Across 11 datasets and 32 prediction targets, revealed preference features add near-zero marginal lift on average. Standard spend and engagement baselines already capture most of the signal.

Setup

We test whether choice-consistency features improve predictions of spend, churn, engagement, and loyalty at the individual user level. Each user’s purchase or click history is split in time: the early portion produces features, and the later portion defines the prediction target. A regularized LightGBM is evaluated by 5-fold cross-validation. An L1-penalized logistic regression is used separately for feature selection and directional interpretation. All runs use a fixed random seed of 42.
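The time split described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the `(timestamp, spend)` record layout, the function name `temporal_split`, and the `train_frac` parameter are assumptions, and the source does not state what fraction of each history is used for features.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp, spend). Not taken from the benchmark code.
Event = Tuple[int, float]


def temporal_split(events: List[Event], train_frac: float = 0.5) -> Tuple[List[Event], List[Event]]:
    """Split one user's history in time.

    The early part is used to build features, the later part to define the target.
    train_frac is an assumed parameter; the source does not give the actual cut point.
    """
    ordered = sorted(events, key=lambda e: e[0])  # oldest first
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


# Example: six events given out of order.
history = [(5, 12.0), (1, 8.0), (3, 20.0), (9, 4.0), (7, 15.0), (11, 6.0)]
early, late = temporal_split(history, train_frac=0.5)
```

Because the cut is made on sorted timestamps, no event in the feature window occurs after any event in the target window.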

Features

The baseline consists of 13 features that capture how much a user spends, how often they buy, and how concentrated their purchases are. These include total spend, number of observations, mean basket size, spend trend, category concentration, and recency. The revealed preference features consist of 42 measures that capture how consistently a user makes decisions. These include rationality scores such as CCEI and MPI, graph-structural measures such as transitivity and preference density, choice diversity measures such as entropy and reversal rate, and utility-recovery measures such as Afriat utility dispersion and efficiency trajectories.
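As a rough sketch of what a few of the baseline features might look like, the snippet below computes total spend, observation count, mean spend, a spend trend, category concentration, and recency for one user. The record layout and the exact formulas (an ordinary least squares slope for the trend, a Herfindahl index for concentration) are assumptions for illustration; the source names these features but does not define them.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical record layout: (day, category, spend).
Purchase = Tuple[int, str, float]


def baseline_features(purchases: List[Purchase], as_of_day: int) -> Dict[str, float]:
    """Compute a handful of spend and frequency features for one user."""
    if not purchases:
        raise ValueError("need at least one purchase")
    ordered = sorted(purchases, key=lambda p: p[0])
    spends = [s for _, _, s in ordered]
    n = len(spends)
    total = sum(spends)

    # Spend trend: OLS slope of spend against observation index (assumed definition).
    x_mean = (n - 1) / 2
    y_mean = total / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(spends))
    slope = sxy / sxx if sxx > 0 else 0.0

    # Category concentration: Herfindahl index of spend shares (assumed definition).
    by_category: Counter = Counter()
    for _, category, spend in ordered:
        by_category[category] += spend
    hhi = sum((v / total) ** 2 for v in by_category.values()) if total > 0 else 0.0

    return {
        "total_spend": total,
        "n_obs": float(n),
        "mean_spend": total / n,
        "spend_slope": slope,
        "category_concentration": hhi,
        "recency_days": float(as_of_day - ordered[-1][0]),
    }


features = baseline_features(
    [(1, "dairy", 10.0), (2, "dairy", 20.0), (4, "produce", 30.0)], as_of_day=10
)
```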

Targets

Each dataset produces one or more prediction targets from the held-out future window. Spend Drop flags users whose average spending fell by more than half. High Spender identifies users in the top third of total expenditure. Future LTV and Spend Change predict continuous spend levels. High Engagement identifies the most active users by session count. Low Loyalty flags users whose choices became more dispersed over time. High CTR identifies users with a high effective click-through rate on news impressions. All classification targets use top-tercile thresholds computed on the training set to prevent leakage.
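The leakage point can be made concrete with a small sketch: the top-tercile cutoff is estimated from training values only and then applied unchanged to held-out values. The function names are hypothetical, and the choice of the "inclusive" quantile method is an assumption.

```python
import statistics
from typing import List


def top_tercile_threshold(train_values: List[float]) -> float:
    """Return the 2/3 quantile of the training values.

    statistics.quantiles with n=3 returns two cut points; the second one
    separates the top third from the rest.
    """
    return statistics.quantiles(train_values, n=3, method="inclusive")[1]


def label_top_tercile(values: List[float], threshold: float) -> List[int]:
    """Label 1 for values strictly above the training-set threshold."""
    return [int(v > threshold) for v in values]


train_spend = [10, 20, 30, 40, 50, 60, 70]
threshold = top_tercile_threshold(train_spend)

# Held-out values are labeled with the training threshold, never their own.
test_labels = label_top_tercile([45, 55, 80], threshold)
```

Recomputing the cutoff on the held-out fold would let information about the test distribution shape the labels, which is the leakage the text refers to.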

Results

The table shows the lift from adding 42 revealed preference features on top of a 13-feature baseline, measured by 5-fold cross-validation with a regularized LightGBM. The standard deviation across folds is shown in parentheses.

Dataset	Target	N	Base AUC-ROC	Δ ROC (pp)	Base AUC-PR	Δ PR (pp)
Dunnhumby	Spend Drop	2,222	.730 (.020)	-2.4	.180	-3.5
Dunnhumby	High Spender	2,222	.958 (.003)	-0.1	.929	-0.4
Dunnhumby	Future LTV (R²)	2,222	.562 (.050)	+0.7
Amazon	Spend Drop	4,694	.776 (.020)	+1.3	.226	+2.2
Amazon	High Spender	4,694	.944 (.007)	0.0	.903	+0.2
H&M	High Spender	46,757	.792 (.004)	-0.1	.683	-0.1
H&M	Spend Change (R²)	46,757	.285 (.013)	+0.1
Instacart	Low Loyalty	50,000	.969 (.002)	0.0	.935	+0.1
REES46	Low Loyalty	8,832	.887 (.008)	0.0	.709	+0.6
Taobao	Engagement	15,806	.930 (.006)	+0.1	.806	+0.1
Taobao BW	Low Loyalty	590	.989 (.006)	+0.2	.868	+0.5
Tenrec	Engagement	50,000	.993 (.000)	0.0	.983	0.0
MIND	High CTR	5,091	.657 (.022)	-0.4	.514	-0.4
FINN	Low Loyalty	46,858	.958 (.001)	0.0	.780	+0.1

All values are 5-fold cross-validated means from a regularized LightGBM. Standard deviations across folds are shown in parentheses next to each baseline. Δ columns show the percentage point change from adding revealed preference features. For R² targets, the baseline column reports R² instead of AUC-ROC, Δ is in R² units, and the AUC-PR columns do not apply. Amazon Spend Drop is the one target where the lift clearly exceeds the fold-to-fold noise. Full results are in output/cv_results.json.
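One way to relate a lift to fold-to-fold noise is to compare per-fold scores pairwise. The sketch below does that with made-up fold scores. Both the numbers and the ratio used here are illustrative assumptions; the source does not state the exact criterion behind "clearly exceeds".

```python
import statistics
from typing import Dict, List


def lift_summary(base_scores: List[float], augmented_scores: List[float]) -> Dict[str, float]:
    """Summarize the per-fold lift in percentage points.

    Scores are paired by fold, so the spread of the differences reflects
    how stable the lift is across folds.
    """
    diffs = [a - b for a, b in zip(augmented_scores, base_scores)]
    delta_pp = 100 * statistics.mean(diffs)
    sd_pp = 100 * statistics.stdev(diffs)
    ratio = delta_pp / sd_pp if sd_pp > 0 else float("inf")
    return {"delta_pp": delta_pp, "sd_pp": sd_pp, "ratio": ratio}


# Made-up AUC-ROC values for five folds, for illustration only.
base = [0.70, 0.72, 0.74, 0.71, 0.73]
augmented = [0.71, 0.74, 0.75, 0.72, 0.75]
summary = lift_summary(base, augmented)
```

A mean lift that is small relative to the spread of the per-fold differences is hard to distinguish from noise, which is the situation most rows of the table are in.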

Findings

Under 5-fold cross-validation the results are stable and the earlier single-holdout lifts of 1 to 20 percent do not replicate. Amazon Spend Drop is the only target where the lift clearly exceeds the fold-to-fold noise. Three revealed preference features rank in the top ten by model importance across all datasets, but they do not improve accuracy over well-constructed baselines.

Suggestive Directions

The following directions are suggestive rather than causal and are based on Lasso coefficient signs under strong L1 regularization across all 11 datasets. Users who explore a wider variety of items tend to be more engaged but less loyal. Users with more fragmented preference graphs tend to spend more, suggesting that impulse buyers who do not optimize may be more valuable than deliberate ones. Users whose purchase efficiency is declining over time are more likely to churn, as if they are mentally disengaging before their spending actually falls. Users with more transitive preference graphs tend to click more but stay more loyal to specific products. These patterns are consistent across independent datasets and domains.

Feature Importance

Rank	Feature	Group	Mean importance	Top-10 in
1	`n_sessions`	Base	0.144	15 of 19
2	`std_menu_size`	Base	0.112	15 of 18
3	`menu_transitivity`	RP	0.101	18 of 19
4	`mean_basket_size`	Base	0.092	12 of 13
5	`spend_slope`	Base	0.080	13 of 13
6	`mean_menu_size`	Base	0.072	16 of 18
7	`choice_entropy_norm`	RP	0.072	17 of 19
8	`menu_pref_density`	RP	0.072	15 of 18
9	`total_spend`	Base	0.070	11 of 13
10	`mean_spend`	Base	0.064	11 of 13

Feature importances are computed from the full-scale single-holdout LGBM run across all datasets. The rankings are consistent with the cross-validated results above because gain-based importance aggregates across all trees and is less sensitive to a single split than prediction accuracy.

The three revealed preference features that appear in the top ten are:

Menu transitivity measures how often a user’s preferences form a consistent ordering. A user who prefers A over B and B over C but then picks C over A has low transitivity. High transitivity means the user knows what they want.
Choice entropy measures how spread out a user’s choices are across items. A user who always picks the same item has zero entropy. A user who picks many different items has high entropy. This captures exploration versus habit.
Menu preference density measures how many pairwise comparisons can be inferred from a user’s choices. A user who has been observed choosing between many different pairs of items has a dense preference graph. This captures how much evidence the data contains about that user’s preferences.
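The three measures described above can be sketched on toy data as follows. These are simplified stand-ins, not the `prefgraph` implementations: the `(menu, choice)` observation layout, the function names, and the exact formulas (entropy normalized by the log of the number of distinct chosen items, transitivity as the share of two-step preference chains that are not reversed, density as the share of item pairs with a revealed comparison) are all assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Set, Tuple

# Hypothetical layout: each observation is (menu shown, item chosen).
Observation = Tuple[Sequence[str], str]


def normalized_entropy(choices: List[str]) -> float:
    """Shannon entropy of chosen items, scaled to [0, 1]."""
    counts = Counter(choices)
    if len(counts) < 2:
        return 0.0  # always the same item means zero entropy
    n = len(choices)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))


def revealed_pairs(observations: List[Observation]) -> Set[Tuple[str, str]]:
    """(a, b) means a was chosen from a menu that also contained b."""
    return {(choice, other) for menu, choice in observations for other in menu if other != choice}


def transitivity(observations: List[Observation]) -> float:
    """Share of chains a>b, b>c that are not contradicted by c>a."""
    pairs = revealed_pairs(observations)
    chains = violations = 0
    for a, b in pairs:
        for b2, c in pairs:
            if b2 == b and c != a:
                chains += 1
                violations += (c, a) in pairs
    return 1.0 if chains == 0 else 1.0 - violations / chains


def preference_density(observations: List[Observation]) -> float:
    """Share of unordered item pairs with at least one revealed comparison."""
    items = {item for menu, _ in observations for item in menu}
    k = len(items)
    if k < 2:
        return 0.0
    compared = {frozenset(p) for p in revealed_pairs(observations)}
    return len(compared) / (k * (k - 1) / 2)


cyclic = [(("A", "B"), "A"), (("B", "C"), "B"), (("A", "C"), "C")]
consistent = [(("A", "B"), "A"), (("B", "C"), "B"), (("A", "C"), "A")]
```

On these toy inputs, the cyclic user (A over B, B over C, C over A) gets the lowest possible transitivity, while the consistent user gets the highest. Both have a fully dense preference graph because every pair of items was compared.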

Replication

pip install prefgraph lightgbm scikit-learn

# 5-fold CV results (main table)
python case_studies/benchmarks/cv_benchmark.py --datasets validated

# Single holdout with both models
python case_studies/benchmarks/runner.py --datasets validated --model both

# Quick smoke test
python case_studies/benchmarks/runner.py --datasets validated --max-users 250 --model both

# Regenerate from cached results
python case_studies/benchmarks/runner.py --replot

All results are deterministic. Per-dataset JSON files are saved to case_studies/benchmarks/output/. For datasets on external drives, set PYREVEALED_DATA_DIR.

Dataset Summary

Dataset	Type	N	Total obs	Median T	K	% T ≥ 10	Repeat %	Unique %	Domain
Dunnhumby	Budget	2,222	55,388	30	10	93	28	24	Grocery
Amazon	Budget	4,694	54,356	31	15	95	73	13	E-commerce
H&M	Budget	46,757	29,132	15	9	97	68	21	Fashion
Instacart	Menu	50,000	59,649	16	18	64	57	50	Grocery
REES46	Menu	8,832	14,922	7	24	15	19	86	E-commerce
Taobao	Menu	15,806	11,061	5	30	2	2	100	E-commerce
Taobao BW	Menu	590	2,593	4	25	3	6	100	E-commerce
RetailRocket	Menu	47	356	5	57	19	0	100	E-commerce
Tenrec	Menu	50,000	12,847	5	24	16	0	100	Video
MIND	Menu	5,091	855	4	86	2	1	100	News
FINN	Menu	46,858	15,850	9	57	39	6	100	Classifieds

Budget datasets have longer histories, with a median of 15 to 31 observations per user and repeat rates between 28 and 73 percent. Menu datasets are thinner, with a median of 4 to 16 observations and near-zero repeat rates on most platforms. Revealed preference features work best where users make repeated choices from overlapping sets.

Feature Correlation

Revealed preference features are largely orthogonal to baseline features with a median cross-correlation of 0.12. Within the 59 revealed preference features on budget data, 45 pairs have absolute correlations above 0.80. These cluster into nine groups.

The largest cluster contains seven VEI and utility features that all measure the same underlying Afriat utility scale. The second largest contains six features linking violations, graph components, and VEI dispersion. These two clusters account for 13 of the 59 features and are heavily redundant. The third cluster is the one that matters for prediction. It contains preference density, strict preference density, and transitivity ratio, which correlate at 0.95 or above with each other. These are the graph-structural features that appear in the top ten importance list. CCEI and MPI form their own two-feature cluster at 0.92.
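A simple way to turn high-correlation pairs into groups is to treat each pair above the cutoff as an edge and take connected components. The sketch below does this with a small union-find. The source does not say which grouping method was used, so this is an assumption, and the feature names and values are made up for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_clusters(features: Dict[str, List[float]], threshold: float = 0.80) -> List[List[str]]:
    """Group features linked by |r| > threshold (connected components)."""
    names = list(features)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if abs(pearson(features[a], features[b])) > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Only groups with at least two members count as redundant clusters.
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Made-up values for three hypothetical features.
toy = {
    "pref_density": [1.0, 2.0, 3.0, 4.0, 5.0],
    "transitivity_ratio": [1.1, 2.0, 3.2, 3.9, 5.1],
    "choice_entropy": [2.0, 5.0, 1.0, 4.0, 3.0],
}
clusters = correlation_clusters(toy)
```

Connected components can chain loosely related features together through intermediaries, so a stricter method such as complete-linkage clustering may give smaller groups.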

The three top-performing revealed preference features are only moderately correlated with the traditional consistency scores. Transitivity correlates with CCEI at about 0.5 and with MPI at about 0.3. Choice entropy is nearly uncorrelated with all traditional scores. This means the features that drive model importance are capturing different information from the classical rationality tests.

In the Lasso model, baseline features have a median standardized coefficient about five times larger than revealed preference features. However, the top revealed preference features reach magnitudes comparable to major baseline features. Choice entropy reaches 3.0 on REES46 Low Loyalty, which is in the same range as session count at 5.4. The number of strongly connected components reaches 3.8 on Instacart. When the Lasso does select a revealed preference feature, it gives it meaningful weight.

Limitations

The validity of revealed preference features depends on how well the observed data approximates true choice sets. Budget datasets in this study use category-level aggregation with median prices across all users, not individual receipt prices. Menu datasets use inferred sessions defined by arbitrary inactivity gaps. If these proxies distort the actual choice environment, the resulting consistency scores may measure data artifacts rather than genuine behavioral patterns.

Spend and frequency baselines capture most of the predictive signal in these tasks. Revealed preference features add measurable lift on only 1 of 32 targets. The three RP features that rank highly by importance are informative to the model but largely redundant with each other and with information the baseline already encodes through different channels.

The 42 revealed preference features are heavily correlated. On budget data, 45 feature pairs exceed 0.80 absolute correlation and cluster into nine groups. The effective dimensionality is far lower than 42, and much of the feature set is redundant by construction.

All results use a single model (regularized LightGBM) with one hyperparameter configuration under 5-fold cross-validation. A different model, tuning scheme, or evaluation metric could shift the picture. We have not tested whether RP features help in ensemble settings, in domains beyond the retail, media, and classifieds platforms studied here, or on targets defined differently from ours.

Null Rates

50 of 59 revealed preference features are always populated on budget data. 25 of 27 are always populated on menu data. The exceptions are utility-recovery features that require a minimum number of intersecting choices. All missing values are imputed with training-set medians.
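Training-set median imputation can be sketched as below: medians are learned from the training rows only and then used to fill gaps in any split. The row layout (dicts with `None` for missing values), the function names, and the feature name `utility_dispersion` are hypothetical.

```python
import statistics
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def fit_medians(train_rows: List[Row]) -> Dict[str, float]:
    """Learn one median per column from non-missing training values."""
    columns = train_rows[0].keys()
    return {
        col: statistics.median([r[col] for r in train_rows if r[col] is not None])
        for col in columns
    }


def impute(rows: List[Row], medians: Dict[str, float]) -> List[Dict[str, float]]:
    """Replace missing values with the training-set medians."""
    return [
        {col: (medians[col] if value is None else value) for col, value in row.items()}
        for row in rows
    ]


train = [
    {"utility_dispersion": 1.0},
    {"utility_dispersion": None},
    {"utility_dispersion": 3.0},
    {"utility_dispersion": 5.0},
]
test = [{"utility_dispersion": None}, {"utility_dispersion": 9.0}]

medians = fit_medians(train)
test_filled = impute(test, medians)
```

This sketch assumes every column has at least one observed training value; a column that is entirely missing would need separate handling.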

Dataset Descriptions

Dataset	Description
Dunnhumby	Household-level grocery scanner data from 2,222 households over two years. An observation is one active household-week across 10 staple commodity groups. Quantities are total units purchased per category. Prices are the global median price for each commodity that week, since individual receipt data is not available. Targets are Spend Drop, High Spender, and Future LTV.
Amazon	Amazon purchase records from 4,694 consumers across 15 product categories. Purchases are grouped by month at the category level. Prices are median monthly prices across all users, carried forward when a category has no sales in a given month. Targets are Spend Drop, High Spender, Spend Change, and Future LTV.
H&M	Fashion transactions from 46,757 customers over two years. Each customer’s purchases are grouped into monthly choice occasions across 9 product groups. Prices are the customer’s own average paid price that month, falling back to period-group medians for months with no purchase. Targets are High Spender, Spend Change, and Future Spend.
Instacart	Grocery reorder data from 50,000 users. Each observation is a single reordered SKU in a given order and aisle. The menu is the set of distinct products the user purchased from that aisle in the trailing three orders. The choice is the product they actually reordered. Targets are High Engagement, Low Loyalty, and High Novelty.
REES46	Multi-category e-commerce sessions from 8,832 users over two months. The platform provides native session boundaries. The menu is all items the user viewed in a single session. The choice is the purchased item. Targets are High Engagement, Low Loyalty, and High Novelty.
Taobao	Click and purchase events from 15,806 users on China’s largest e-commerce platform. Sessions are defined by a 30-minute inactivity gap. The menu is all items viewed in the session. The choice is the purchased item. Target is High Engagement.
Taobao BW	A variant of Taobao using buy-anchored windows. Whenever a user purchases an item, the menu is defined as all items viewed in the preceding 6-hour window. This avoids arbitrary session boundaries. Only 590 users qualify after filtering. Targets are High Engagement, Low Loyalty, High Entropy, Pref Drift, High Click Volume, High Active Time, and Fast Conversion.
RetailRocket	Click-stream data from a large e-commerce platform. Sessions are reconstructed using 30-minute inactivity windows. The menu is all items viewed in a session. The choice is the purchased item. Only 47 users have enough repeat sessions to qualify. Targets are High Engagement and Low Loyalty.
Tenrec	Video recommendation data from 50,000 users on Tencent QQ Browser. The menu is all items clicked since the previous positive feedback event. The choice is the item the user liked or shared. Targets are High Engagement and Low Loyalty.
MIND	News impression logs from 5,091 users on Microsoft News. Each impression is a directly logged slate of candidate articles. Only impressions with exactly one click are retained so the choice is unambiguous. The menu is the full set of articles shown. Targets are High Engagement, Low Loyalty, High Novelty, and High CTR.
FINN	Directly logged recommendation slates from 46,858 users on Norway’s largest classifieds marketplace. Each slate of up to 25 items is recorded by the platform along with which item was clicked. This is the cleanest menu dataset because the choice set is observed rather than reconstructed. Targets are High Engagement, Low Loyalty, and High Search Ratio.
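Several of the menu datasets above (Taobao, RetailRocket) reconstruct sessions from an inactivity gap. A minimal sketch of that rule, assuming timestamps in seconds and a hypothetical function name `sessionize`:

```python
from typing import List


def sessionize(timestamps: List[int], gap_seconds: int = 30 * 60) -> List[List[int]]:
    """Group event timestamps into sessions.

    A new session starts whenever the gap since the previous event
    exceeds gap_seconds (30 minutes by default).
    """
    sessions: List[List[int]] = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Gaps of 600 s and 900 s stay in one session; 2,500 s and 5,900 s start new ones.
events = [0, 600, 1500, 4000, 4100, 10000]
sessions = sessionize(events)
```

The gap length directly determines the inferred menus, which is why the Limitations section treats these boundaries as a possible source of artifacts.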