Preference Graphs

pip install prefgraph                  # core library
pip install "prefgraph[parquet]"       # + Parquet file support
pip install "prefgraph[datasets]"      # + real-world dataset loaders

See the Installation page for all extras and workflow options.

When users make choices, we can represent their decisions as a preference graph. If someone chooses A over B, B over C, and then C over A, they have formed a cycle. Such a cycle can signal an inconsistency in their decision making, because no single ranking of A, B, and C explains all three choices. PrefGraph uses graph algorithms such as Tarjan's strongly connected components (SCC) algorithm to detect these cycles. By identifying and counting these violations, we can score a user's consistency.
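As a rough illustration of the idea, the sketch below finds preference cycles with a plain-Python version of Tarjan's algorithm. It is not PrefGraph's implementation (the engine itself is written in Rust); the function names `strongly_connected_components` and `cyclic_components` and the `(winner, loser)` edge format are assumptions made for this example.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Tarjan's algorithm over (winner, loser) pairs; returns a list of components."""
    graph = defaultdict(list)
    nodes, seen = [], set()
    for winner, loser in edges:
        graph[winner].append(loser)
        for node in (winner, loser):
            if node not in seen:
                seen.add(node)
                nodes.append(node)

    index = {}      # discovery order of each node
    lowlink = {}    # smallest index reachable from the node's DFS subtree
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt not in index:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index[nxt])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop the stack down to it
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in nodes:
        if node not in index:
            visit(node)
    return components

def cyclic_components(edges):
    """Components with more than one item contain at least one preference cycle.
    (Self-loops such as ("A", "A") are ignored by this sketch.)"""
    return [sorted(c) for c in strongly_connected_components(edges) if len(c) > 1]

# A over B, B over C, C over A forms a cycle; C over D does not.
choices = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(cyclic_components(choices))  # [['A', 'B', 'C']]
```

The recursive form is fine for a single user's small graph; a production implementation would more likely be iterative to avoid recursion limits.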

Animated preference graph showing consistency detection

Budgets & Menu Choices

PrefGraph handles two types of choice data. Budgets are purchased quantities at given prices, like retail shopping. Menus are discrete selections from a set of available items, like search clicks or AI agent prompting. Menus come in three flavors: deterministic (MenuChoiceLog), stochastic (StochasticChoiceLog), and risk-based lotteries (RiskChoiceLog).

PrefGraph accepts Polars DataFrames, pandas DataFrames, Parquet files, or raw NumPy arrays. See the Loading Data guide for examples.

Budget-choice example

from prefgraph.datasets import load_demo
from prefgraph.engine import Engine, results_to_dataframe

# load_demo returns list[tuple[prices, quantities]] — synthetic shoppers
users = load_demo(n_users=100_000)

# Engine scores every user in parallel via Rust/Rayon
engine = Engine(metrics=["garp", "ccei"])  # GARP = acyclicity test, CCEI = efficiency score
results = engine.analyze_arrays(users)

# Flatten to a DataFrame for analysis
df = results_to_dataframe(results)
print(df[["is_garp", "n_violations", "ccei"]].head())

Scored 100,000 users in 3.8s (26,165 users/sec)

   is_garp  n_violations      ccei
   True             0  1.000000
   True             0  1.000000
   True             0  1.000000
  False             4  0.972536
  False             2  0.978055
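To make the GARP test concrete, here is a minimal pure-Python sketch of the standard textbook check: build the direct revealed-preference relation from expenditures, take its transitive closure, and look for pairs where a revealed-preferred bundle was strictly cheaper than the one actually chosen. The function name `garp_violations` is hypothetical, and the way it counts violations (ordered pairs) is an assumption that may differ from how the Engine computes `n_violations`.

```python
def garp_violations(prices, quantities):
    """Count ordered pairs (t, s) that violate GARP.

    prices[t][k] and quantities[t][k] are the price and purchased quantity
    of good k at observation t.
    """
    n = len(prices)

    def dot(p, x):
        return sum(pi * xi for pi, xi in zip(p, x))

    # cost[t][s] = cost of bundle s at the prices of observation t
    cost = [[dot(prices[t], quantities[s]) for s in range(n)] for t in range(n)]

    # Direct revealed preference: bundle t was chosen although s was affordable.
    reach = [[cost[t][t] >= cost[t][s] for s in range(n)] for t in range(n)]

    # Warshall's algorithm: transitive closure of the relation.
    for k in range(n):
        for t in range(n):
            if reach[t][k]:
                for s in range(n):
                    if reach[k][s]:
                        reach[t][s] = True

    # Violation: t is revealed preferred to s, yet s was chosen
    # when bundle t was strictly cheaper.
    return sum(
        1
        for t in range(n)
        for s in range(n)
        if t != s and reach[t][s] and cost[s][s] > cost[s][t]
    )

# Two goods, two observations. Each bundle was chosen while the other
# was strictly cheaper, so the choices contradict each other.
inconsistent = garp_violations(prices=[[2, 1], [1, 2]], quantities=[[2, 1], [1, 2]])
consistent = garp_violations(prices=[[1, 2], [2, 1]], quantities=[[2, 1], [1, 2]])
print(inconsistent, consistent)  # 2 0
```

The closure step is cubic in the number of observations in this sketch; it is meant to show the logic, not to match the engine's performance.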

Menu-choice example

from prefgraph import generate_random_menus
from prefgraph.engine import Engine, results_to_dataframe

# Generate discrete-choice data: each user picks one item from a menu
menus_data = generate_random_menus(
    n_users=100_000, n_obs=10, n_items=5,
    menu_size=(2, 5),        # menus contain 2–5 items each
    choice_model="logit",    # logit model with some noise
    rationality=0.7, seed=42
)

# HM = Houtman-Maks: size of the largest consistent subset of choices (the rest must be discarded)
engine = Engine(metrics=["hm"])
results = engine.analyze_menus(menus_data)
df = results_to_dataframe(results)
print(df[["is_sarp", "n_sarp_violations", "hm_consistent", "hm_total"]].head())

Generated + scored 100,000 users in 2.6s (38,895 users/sec)

   is_sarp  n_sarp_violations  hm_consistent  hm_total
  False                  6              3         5
  False                  3              4         5
  False                  6              3         5
  False                  3              4         5
  False                  6              3         5
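The relationship between `hm_consistent` and `hm_total` can be illustrated with a brute-force sketch of the Houtman-Maks idea: find the largest subset of observations that a single strict ranking can explain. This is an exponential-time illustration under assumed names (`is_consistent`, `houtman_maks`) and an assumed `(menu, chosen)` data layout, not the algorithm the Engine uses.

```python
from itertools import combinations

def is_consistent(observations):
    """True if one strict ranking of items explains every (menu, chosen) pair."""
    items = {item for menu, _ in observations for item in menu}
    beaten_by = {item: set() for item in items}  # item -> items chosen over it
    for menu, chosen in observations:
        for other in menu:
            if other != chosen:
                beaten_by[other].add(chosen)

    remaining = set(items)
    while remaining:
        # Items that no remaining item beats can be ranked next.
        top = [i for i in remaining if not (beaten_by[i] & remaining)]
        if not top:
            return False  # every remaining item is beaten: there is a cycle
        remaining -= set(top)
    return True

def houtman_maks(observations):
    """Return (size of the largest consistent subset, total observations)."""
    n = len(observations)
    for size in range(n, 0, -1):
        for subset in combinations(observations, size):
            if is_consistent(subset):
                return size, n
    return 0, n

# Items 0, 1, 2: the third observation (2 chosen over 0) closes a cycle.
observations = [
    ((0, 1), 0),
    ((1, 2), 1),
    ((0, 2), 2),
    ((0, 1, 2), 0),
    ((1, 2), 1),
]
print(houtman_maks(observations))  # (4, 5)
```

Dropping the single observation that picks item 2 over item 0 leaves four choices that the ranking 0 > 1 > 2 explains, hence 4 of 5.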

Before You Trust the Scores

Consistency scores are only meaningful when the input data represents genuine feasible choices. Menus must reflect what the user actually saw, not a retroactive reconstruction from purchase logs. Keep only clean single-choice sessions where the user picked exactly one item. The chosen item must be present in the menu. Item IDs must be remapped to contiguous 0..N-1 indices before scoring. For budget data, prices must be positive and quantities non-negative. The Engine now rejects NaN, Inf, negative prices, out-of-range item IDs, and duplicate menu items with clear error messages, but the harder question is whether your menus and budgets approximate real choice sets at all. If they do not, the scores measure data artifacts, not behavior. See the Loading Data guide for worked examples of building clean inputs from raw event logs.
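The mechanical parts of that checklist can be sketched in a few lines of plain Python. The helpers below (`clean_menu_sessions`, `valid_budget`) are hypothetical names for illustration and are not part of PrefGraph; dropping bad sessions rather than repairing them is also an assumption of this sketch.

```python
import math

def clean_menu_sessions(sessions):
    """Keep sessions whose menu has no duplicates and contains the chosen item,
    then remap item IDs to contiguous 0..N-1 indices.

    `sessions` is a list of (menu, chosen) pairs with sortable item IDs.
    """
    kept = [
        (menu, chosen)
        for menu, chosen in sessions
        if len(set(menu)) == len(menu) and chosen in menu
    ]
    ids = sorted({item for menu, _ in kept for item in menu})
    index = {item: i for i, item in enumerate(ids)}
    remapped = [([index[i] for i in menu], index[chosen]) for menu, chosen in kept]
    return remapped, index

def valid_budget(prices, quantities):
    """Prices must be finite and positive, quantities finite and non-negative."""
    return (
        len(prices) == len(quantities)
        and all(math.isfinite(p) and p > 0 for p in prices)
        and all(math.isfinite(q) and q >= 0 for q in quantities)
    )

sessions = [
    (["sku9", "sku4", "sku7"], "sku4"),  # clean
    (["sku4", "sku4"], "sku4"),          # duplicate menu item: dropped
    (["sku7", "sku9"], "sku1"),          # chosen item not on the menu: dropped
]
remapped, index = clean_menu_sessions(sessions)
print(remapped)  # [([2, 0, 1], 0)]
print(valid_budget([1.5, 2.0], [0, 3]))  # True
```

Checks like these catch malformed rows only; they cannot tell whether a menu reflects what the user actually saw.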

Examples

Inconsistency in AI Agents

Do LLMs have stable action rankings, or does the ranking change when different alternatives are shown? We queried GPT-4o-mini across 5 enterprise scenarios (support triage, alert routing, content moderation, hiring, procurement), each with 10 vignettes, 5 prompt frameworks, and 15 menus per vignette. The deterministic stage collected 3,750 calls at temperature 0; the stochastic stage sampled each menu 20 times at temperature 0.7, adding 75,000 calls, for a total of 78,750 API calls over 15 hours. We built preference graphs from these responses and tested for logical cycles. All vignettes are synthetic and results come from a single model family, so these numbers are a diagnostic demo rather than a general benchmark. Full results: Detecting Inconsistency in AI Agents.

Operational Scenario	Perfect Consistency (%)	Probabilistic Consistency (%)
Support Triage	88	54
Alert Routing	92	74
Content Moderation	82	60
Hiring	74	62
Procurement	84	61
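The call counts quoted for this study follow directly from the experimental design. A short arithmetic check, with variable names chosen only for this example:

```python
scenarios, vignettes, frameworks, menus_per_vignette = 5, 10, 5, 15
samples_per_menu = 20  # stochastic stage, temperature 0.7

# One call per (scenario, vignette, framework, menu) at temperature 0.
deterministic_calls = scenarios * vignettes * frameworks * menus_per_vignette
# The stochastic stage repeats every one of those menus 20 times.
stochastic_calls = deterministic_calls * samples_per_menu
total_calls = deterministic_calls + stochastic_calls

print(deterministic_calls, stochastic_calls, total_calls)  # 3750 75000 78750
```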

Predicting Customer Spend and Engagement

We tested whether revealed preference features improve user-level predictions across 11 datasets and 32 targets. Under 5-fold cross-validation, the median lift across targets is zero, with one exception: Amazon churn prediction. Despite near-zero predictive lift, three revealed preference features rank in the top ten by model importance. Full results: Predicting Customer Spend and Engagement.

Dataset	N	Target	Base AUC-PR	+RP AUC-PR
Amazon	4,694	Spend Drop	.226	.248
REES46	8,832	Low Loyalty	.709	.715
H&M	46,757	High Spender	.683	.682
FINN	46,858	Low Loyalty	.780	.781

Performance

The Rust engine processes users in parallel via Rayon and streams them in fixed-size chunks, so memory stays flat regardless of population size. On a 10-core laptop, scoring 100,000 users across five metrics from a 110 MB Parquet file takes under two minutes. See the Performance Benchmarks page for details.
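The chunked-streaming pattern described here can be sketched in Python with a generator. This is only an illustration of why memory stays flat: the real engine is written in Rust and scores each chunk in parallel, whereas this sketch is sequential, and `iter_chunks`, `score_stream`, and the `score_user` callback are hypothetical names.

```python
from itertools import islice

def iter_chunks(users, chunk_size):
    """Yield successive lists of at most `chunk_size` users from any iterable."""
    iterator = iter(users)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def score_stream(users, score_user, chunk_size=10_000):
    """Yield one score per user while holding at most one chunk in memory."""
    for chunk in iter_chunks(users, chunk_size):
        # A real engine would hand the whole chunk to a parallel worker pool here.
        for user in chunk:
            yield score_user(user)

# The input is a generator, so the full population is never materialized.
population = (i for i in range(25))
scores = list(score_stream(population, score_user=lambda u: u * 2, chunk_size=10))
chunk_sizes = [len(c) for c in iter_chunks(range(25), 10)]
print(len(scores), chunk_sizes)  # 25 [10, 10, 5]
```

Peak memory in this pattern depends on `chunk_size`, not on the number of users.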

Throughput by Metric Configuration (T = 20-100 observations per user, K = 5 goods)
Metrics	Throughput (users/sec)	Latency (per user)
GARP Only (O(T²))	~49,000	20 μs
GARP + CCEI	~2,400	420 μs
Comprehensive (GARP, CCEI, MPI, HARP)	~2,000	500 μs
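The latency column in the throughput table is consistent with an amortized per-user time, that is, the reciprocal of throughput. A small check under that assumption (the helper name `latency_us` is made up for this example):

```python
def latency_us(users_per_sec):
    """Amortized time per user in microseconds for a given throughput."""
    return 1_000_000 / users_per_sec

throughputs = {
    "GARP Only": 49_000,
    "GARP + CCEI": 2_400,
    "Comprehensive": 2_000,
}
latencies = {name: round(latency_us(rate)) for name, rate in throughputs.items()}
print(latencies)  # {'GARP Only': 20, 'GARP + CCEI': 417, 'Comprehensive': 500}
```

The table's 420 μs for GARP + CCEI matches the computed 417 μs after rounding to two significant figures. Because users are scored in parallel, this is time per user averaged over the batch, not the wall-clock time to score any single user.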

Budget — Large-Scale
Configuration	10K users	100K users	1M users
GARP (O(T²))	0.1s	2.0s	~20s
GARP + CCEI	4.2s	39.5s	~6.6 min
Comprehensive Suite	6.8s	67.1s	~11 min

Menu — Large-Scale
Configuration	10K users	100K users	1M users
SARP + WARP + HM	0.3s	5.2s	85.6s

Explore the API Reference and References for more.