Grocery Scanner Data#

Score household shopping behavior for economic rationality using preference-graph analysis on loyalty-card scanner data.

Supermarket grocery aisle

Introduction#

Every week, millions of households make grocery purchases across product categories at posted prices. A natural question: are these choices consistent with any utility function? If a household buys more beef when it’s expensive and less when it is inexpensive, the observation graph contains a cycle, and no well-behaved utility function can rationalize the observed behavior.

Dean & Martin (2016) applied GARP to 977 households’ grocery scanner data across 38 product categories, finding marked heterogeneity in rationality scores correlated with demographics. Echenique, Lee & Shum (2011) used the same type of data to compute the Money Pump Index, quantifying how much money an arbitrageur could extract from inconsistent shoppers.

What you’ll learn:

  • How to construct observation graphs from grocery transactions

  • The formal GARP test and three goodness-of-fit scores (CCEI, MPI, HM)

  • Exploratory analysis of real scanner data (Dunnhumby, 2,222 households)

  • How to segment customers by rationality and interpret the scores

Companion script: examples/applications/01_grocery_scanner.py

Background#

Revealed preference analysis on grocery data allows researchers to quantify consumer rationality without assuming specific functional forms for utility. For a formal treatment of the underlying axioms and efficiency metrics, see:

Data#

This application uses the Dunnhumby “The Complete Journey” dataset: 2,500 households tracked over 2 years (104 weeks) across 10 staple product categories.

Loading#

import sys
from pathlib import Path

# Load Dunnhumby pipeline
sys.path.insert(0, str(Path("dunnhumby")))
from data_loader import load_filtered_data
from price_oracle import get_master_price_grid
from session_builder import build_all_sessions

filtered = load_filtered_data()
price_grid = get_master_price_grid(filtered)
households = build_all_sessions(filtered, price_grid)

print(f"Households: {len(households)}")
print(f"Price grid: {price_grid.shape}")  # (104 weeks, 10 categories)
Households: 2222
Price grid: (104, 10)

Note

The Dunnhumby dataset requires a Kaggle download. Run case_studies/dunnhumby/download_data.sh first. See case_studies/dunnhumby/README.md for details.

EDA: Product Categories#

The 10 product categories and their average weekly prices:

Category

Mean Price ($)

Std ($)

Purchase Freq (%)

Soft Drinks

3.42

1.08

78%

Fluid Milk

2.89

0.94

72%

Bread/Rolls

2.51

0.83

68%

Cheese

4.67

1.52

61%

Bag Snacks

3.28

1.15

55%

Soup

1.89

0.72

43%

Yogurt

3.15

1.01

52%

Beef

6.24

2.18

38%

Frozen Pizza

4.53

1.44

35%

Lunchmeat

3.98

1.31

41%

EDA: Household Activity#

import numpy as np

obs_counts = [h.num_observations for h in households.values()]
print(f"Active weeks per household:")
print(f"  Min: {min(obs_counts)}  Median: {int(np.median(obs_counts))}"
      f"  Max: {max(obs_counts)}")
print(f"  Mean: {np.mean(obs_counts):.1f}  Std: {np.std(obs_counts):.1f}")
Active weeks per household:
  Min: 10  Median: 42  Max: 104
  Mean: 43.2  Std: 25.8

Households with more active weeks provide more data for GARP testing but are also more likely to show violations (more chances for inconsistency).

Pipeline Walkthrough#

Single household#

from prefgraph import BehaviorLog, validate_consistency
from prefgraph import compute_integrity_score, compute_confusion_metric
from prefgraph.algorithms.mpi import compute_houtman_maks_index

# Pick one household
hh = list(households.values())[0]
log = hh.behavior_log

print(f"Household: {log.user_id}")
print(f"Observations: {log.num_records}")
print(f"Goods: {log.num_features}")
Household: household_457
Observations: 63
Goods: 10

Step 1 — GARP test:

garp = validate_consistency(log)
print(f"GARP consistent: {garp.is_consistent}")
print(f"Violation cycles: {len(garp.violations)}")
GARP consistent: False
Violation cycles: 847

Step 2 — CCEI (efficiency score):

ccei = compute_integrity_score(log, tolerance=1e-4)
print(f"CCEI: {ccei.efficiency_index:.4f}")
print(f"Budget waste: {(1 - ccei.efficiency_index) * 100:.1f}%")
CCEI: 0.8325
Budget waste: 16.8%

Step 3 — MPI (exploitability):

mpi = compute_confusion_metric(log)
print(f"MPI: {mpi.mpi_value:.4f}")
MPI: 0.2140

Step 4 — Houtman-Maks (outlier fraction):

hm = compute_houtman_maks_index(log)
print(f"Observations to remove: {hm.removed_count}/{log.num_records}")
print(f"Fraction removed: {hm.fraction:.3f}")
Observations to remove: 15/63
Fraction removed: 0.238

Note

This household’s CCEI of 0.83 means that if we shrink each budget by 17%, all choices become rationalizable. The MPI of 0.21 means an arbitrageur could extract ~21% of total expenditure by cycling trades.

Batch Analysis#

Scoring all 2,222 households via the Rust Engine:

from prefgraph.engine import Engine

engine = Engine(metrics=["garp", "ccei", "mpi", "hm"])

# Batch: list of (prices T×K, quantities T×K) - one tuple per household
users = [hh.behavior_log.to_engine_tuple() for hh in households.values()]

results = engine.analyze_arrays(users)  # Rust/Rayon parallel scoring

# Each EngineResult has: .is_garp, .ccei, .mpi, .hm_consistent, .hm_total
for hh_key, er in zip(households.keys(), results):
    hm_frac = 1.0 - (er.hm_consistent / er.hm_total) if er.hm_total > 0 else 0.0
    print(f"  {hh_key}: GARP={er.is_garp}  CCEI={er.ccei:.3f}"
          f"  MPI={er.mpi:.3f}  HM={hm_frac:.3f}")

Score distributions across the panel:

Metric

Mean

Std

P10

P25

P50

P90

CCEI

0.839

0.105

0.698

0.766

0.852

0.960

MPI

0.225

0.112

0.078

0.142

0.225

0.371

HM removed

0.224

0.118

0.080

0.143

0.222

0.369

Key statistics:

  • 4.5% of households are perfectly GARP-consistent (CCEI = 1.0)

  • Mean CCEI = 0.839, meaning the average household wastes ~16% of budget

  • The distribution is left-skewed: most households are moderately rational

Grocery analysis - CCEI distribution, efficiency vs exploitability, rolling-window trajectories, recovered utility

Beyond Consistency Scores#

GARP and CCEI answer one question: is behavior consistent? PrefGraph goes much further. On the same data, without re-estimation, you can assess test power, diagnose individual observations, test preference structure, recover utility, and measure welfare.

Power analysis#

A GARP pass is only meaningful if the test had enough power to detect violations. Bronars (1987) simulates random behavior on the same budget sets; the fraction that violates GARP is the test’s power.

from prefgraph.contrib.bronars import compute_bronars_power
from prefgraph.contrib.power_analysis import compute_selten_measure

bp = compute_bronars_power(log, n_simulations=500, random_seed=42)
sm = compute_selten_measure(log, n_simulations=500, random_seed=42)

Sample households from the panel:

HH

T

CCEI

Bronars Power

Selten m

9

15

1.000

0.348

0.335

5

13

0.973

0.635

0.000

4

25

0.946

0.800

0.000

1

60

0.854

1.000

0.000

8

63

0.699

1.000

0.000

Household 9 passes GARP but with power = 0.35 — only 35% of random consumers would fail on those budgets. Household 1’s failure at power = 1.0 is definitive: every random consumer also fails, so the violation is not due to a lenient test.

Per-observation diagnostics#

VEI (Varian 1990) assigns each observation its own efficiency score via per-observation LP. Swaps (Apesteguia & Ballester 2015) counts the minimum adjacent swaps to restore consistency.

from prefgraph.algorithms.vei import compute_vei
from prefgraph.algorithms.mpi import compute_houtman_maks_index

vei = compute_vei(log)
hm  = compute_houtman_maks_index(log)

For household 1 (T=60, CCEI=0.854):

Metric

Value

VEI mean

1.000

VEI min

1.000

HM removed

19 / 60 (31.7%)

Swaps

0

VEI = 1.0 everywhere means violations are distributed across many observation pairs (no single outlier). HM = 31.7% means removing 19 of 60 weeks restores full consistency.

Structural tests#

Test what kind of utility function could generate the data:

from prefgraph.algorithms.harp import check_harp
from prefgraph.algorithms.quasilinear import check_quasilinearity
from prefgraph.contrib.additive import test_additive_separability

harp = check_harp(log)
ql   = check_quasilinearity(log)
sep  = test_additive_separability(log)

Test

Pass

Detail

Meaning

HARP (homothetic)

No

48 violations

Preferences don’t scale with income

Quasilinear

No

114K violations

Income effects matter

Additive separable

No

1 group

Cross-category interactions exist

All three fail for household 1, consistent with a general utility function with income effects and cross-price substitution between categories.

Utility recovery and welfare#

For GARP-consistent households, recover latent utility values via Afriat’s LP, then measure welfare impact of price changes:

from prefgraph import recover_utility
from prefgraph.contrib.welfare import analyze_welfare_change

# Household 9 (GARP-consistent, T=15)
u = recover_utility(log)  # Afriat LP
# Simulate 10% price increase in category 0
w = analyze_welfare_change(baseline_log, policy_log)

Metric

Value

Utility recovery

Success

HARP (homothetic)

Pass

Bronars power

0.348

CV (10% price increase)

–$0.61

EV (10% price increase)

+$0.60

Baseline expenditure

$13.82

Full pipeline summary#

Everything above runs on a single BehaviorLog, no re-estimation needed:

Category

Methods

Returns

Test

GARP, WARP, SARP, HARP, quasilinear, separability

bool + violations

Score

CCEI, MPI, VEI, Houtman-Maks, Swaps, Bronars power

float (0–1)

Recover

Utility (Afriat LP), demand, Slutsky matrix

vectors / matrices

Welfare

CV, EV, deadweight loss, expenditure function

dollar values

Structure

Additive groups, cross-price effects, Lancaster characteristics

partitions / matrices

Temporal Panel Analysis#

Beyond a single snapshot, tracking CCEI over time reveals household dynamics: who is consistently rational, who is deteriorating, and who crosses between segments.

Rolling-window CCEI#

For each household, compute CCEI over a sliding 20-week window:

def compute_rolling_ccei(log, window=20, step=5):
    results = []
    for start in range(0, log.num_records - window + 1, step):
        window_log = BehaviorLog(
            cost_vectors=log.cost_vectors[start:start+window],
            action_vectors=log.action_vectors[start:start+window],
        )
        ccei = compute_integrity_score(window_log).efficiency_index
        results.append((start, ccei))
    return results

Time fixed effects#

Raw CCEI trajectories confound household behavior with the macro price environment. If prices are volatile in Q4 (holiday promotions, supply shocks), every household’s CCEI drops — not because anyone changed their behavior, but because more price variation gives the GARP test more power to detect violations.

To separate household-level change from common time effects, the companion script uses a cohort-deviation approach: compute the panel mean CCEI at each window index, then classify each household by its deviation from the cohort mean.

# Cohort mean at each window position (time fixed effect)
cohort_mean = {}
for i in range(max_windows):
    vals = [traj[i] for traj in all_trajectories if len(traj) > i]
    cohort_mean[i] = np.mean(vals)

# Classify by deviation, not raw level
deviations = [ccei[i] - cohort_mean[i] for i in range(len(ccei))]

A household that drops while the cohort stays stable is genuinely deteriorating. A household that drops alongside everyone else is experiencing a challenging price environment.

Trajectory classification#

Classify each household by its deviation from cohort mean trajectory:

Trajectory

Criteria

Interpretation

Stable

std < 0.03

Preferences don’t change relative to cohort; reliable customer

Improving

slope > +0.005

Becoming more consistent relative to peers

Deteriorating

slope < -0.005

Choices becoming more erratic relative to peers; possible life change

Volatile

std > 0.03, |slope| < 0.005

Fluctuating consistency relative to cohort; context-dependent shopper

On Dunnhumby data (households with 30+ weeks):

Trajectory          N       %  Mean CCEI   Std CCEI  Avg Slope
────────────────  ───── ─────── ────────── ────────── ──────────
stable              38%          0.897      0.010    -0.006
improving           19%          0.838      0.070    +0.023
deteriorating       23%          0.871      0.074    -0.044
volatile            19%          0.931      0.049    +0.002

Crossover detection#

Users whose first-half and second-half CCEI differ by more than 0.05 represent crossovers — behavioral regime changes worth flagging:

HH-3       deteriorating    1st half: 0.969  →  2nd half: 0.729  (Δ = -0.24)
HH-14      deteriorating    1st half: 0.947  →  2nd half: 0.775  (Δ = -0.17)
HH-17      improving        1st half: 0.639  →  2nd half: 0.780  (Δ = +0.14)

A household dropping from CCEI 0.97 to 0.73 relative to the cohort warrants investigation: account sharing, life disruption, or idiosyncratic price sensitivity. (A raw CCEI drop without cohort comparison may simply reflect a period of high price volatility affecting all households.)

Interpretation#

Customer segmentation#

CCEI scores naturally segment customers into behavioral tiers:

Tier

CCEI Range

Interpretation

Consistent optimizers

0.95 – 1.00

Price-sensitive, respond predictably to promotions

Noisy maximizers

0.80 – 0.95

Preferences exist but imprecisely revealed

Erratic shoppers

< 0.80

Choices hard to rationalize; may benefit from curation

Business applications#

  1. Targeted pricing: High-CCEI customers respond predictably to price changes. Choi et al. (2014) found a 1 SD increase in CCEI associates with 15–19% more household wealth.

  2. Fraud/anomaly detection: A sudden CCEI drop for a previously consistent household signals account sharing or suspicious activity.

  3. Promotion evaluation: If a new promotion strategy increases average CCEI, customers are making more coherent choices — a sign of reduced cognitive load.

  4. Welfare measurement: MPI quantifies the dollar value of welfare losses from inconsistent choices, enabling cost-benefit analysis of interventions (loyalty programs, simplified displays).

Limitations#

  • GARP tests existence of any utility function, not a specific one. A high CCEI doesn’t mean the consumer is “smart” — just consistent.

  • CCEI is domain-specific: a consumer rational about groceries may be irrational about electronics (Chen et al. 2025, arXiv:2505.05275).

  • With 10 product categories and 50+ observations, GARP has high power to detect violations even from slight preference noise.