Preference Graphs ================== .. code-block:: bash pip install prefgraph # core library pip install "prefgraph[parquet]" # + Parquet file support pip install "prefgraph[datasets]" # + real-world dataset loaders See the :doc:`install` page for all extras and workflow options. .. raw:: html

When users make choices, we can represent their decisions as a preference graph. If someone chooses A over B, B over C, and then C over A, they have formed a cycle. These cycles could represent an inconsistency in their decision making. PrefGraph uses graph algorithms (like Tarjan's SCC) to detect these cycles. By identifying and counting these violations, we can score a user's consistency.

Animated preference graph showing consistency detection
Budgets & Menu Choices ---------------------- PrefGraph handles two types of choice data. **Budgets** are purchased quantities at given prices, like retail shopping. **Menus** are discrete selections from a set of available items, like search clicks or AI agent prompting. Menus come in three flavors: deterministic (``MenuChoiceLog``), stochastic (``StochasticChoiceLog``), and risk-based lotteries (``RiskChoiceLog``). PrefGraph accepts Polars DataFrames, Pandas, Parquet files, or raw NumPy arrays. See the :doc:`Loading Data ` guide for examples. **Budget-choice example** .. code-block:: python from prefgraph.datasets import load_demo from prefgraph.engine import Engine, results_to_dataframe # load_demo returns list[tuple[prices, quantities]] — synthetic shoppers users = load_demo(n_users=100_000) # Engine scores every user in parallel via Rust/Rayon engine = Engine(metrics=["garp", "ccei"]) # GARP = acyclicity test, CCEI = efficiency score results = engine.analyze_arrays(users) # Flatten to a DataFrame for analysis df = results_to_dataframe(results) print(df[["is_garp", "n_violations", "ccei"]].head()) .. code-block:: text Scored 100,000 users in 3.8s (26,165 users/sec) is_garp n_violations ccei 0 True 0 1.000000 1 True 0 1.000000 2 True 0 1.000000 3 False 4 0.972536 4 False 2 0.978055 **Menu-choice example** .. code-block:: python from prefgraph import generate_random_menus from prefgraph.engine import Engine, results_to_dataframe # Generate discrete-choice data: each user picks one item from a menu menus_data = generate_random_menus( n_users=100_000, n_obs=10, n_items=5, menu_size=(2, 5), # menus contain 2–5 items each choice_model="logit", # logit model with some noise rationality=0.7, seed=42 ) # HM = Houtman-Maks: counts how many choices to discard for consistency engine = Engine(metrics=["hm"]) results = engine.analyze_menus(menus_data) df = results_to_dataframe(results) print(df[["is_sarp", "n_sarp_violations", "hm_consistent", "hm_total"]].head()) .. code-block:: text Generated + scored 100,000 users in 2.6s (38,895 users/sec) is_sarp n_sarp_violations hm_consistent hm_total 0 False 6 3 5 1 False 3 4 5 2 False 6 3 5 3 False 3 4 5 4 False 6 3 5 Before You Trust the Scores --------------------------- Consistency scores are only meaningful when the input data represents genuine feasible choices. Menus must reflect what the user actually saw, not a retroactive reconstruction from purchase logs. Keep only clean single-choice sessions where the user picked exactly one item. The chosen item must be present in the menu. Item IDs must be remapped to contiguous ``0..N-1`` indices before scoring. For budget data, prices must be positive and quantities non-negative. The Engine now rejects NaN, Inf, negative prices, out-of-range item IDs, and duplicate menu items with clear error messages, but the harder question is whether your menus and budgets approximate real choice sets at all. If they do not, the scores measure data artifacts, not behavior. See the :doc:`Loading Data ` guide for worked examples of building clean inputs from raw event logs. Examples -------- Inconsistency in AI Agents ~~~~~~~~~~~~~~~~~~~~~~~~~~ Do LLMs have stable action rankings, or does the ranking change when different alternatives are shown? We queried GPT-4o-mini across 5 enterprise scenarios (support triage, alert routing, content moderation, hiring, procurement), each with 10 vignettes, 5 prompt frameworks, and 15 menus per vignette. The deterministic stage collected 3,750 calls at temperature 0; the stochastic stage sampled each menu 20 times at temperature 0.7, adding 75,000 calls — roughly 78,750 API calls in total over 15 hours. We built preference graphs from these responses and tested for logical cycles. All vignettes are synthetic and results come from a single model family, so these numbers are a diagnostic demo rather than a general benchmark. Full results: :doc:`budget/app_llm_benchmark`. .. list-table:: :header-rows: 1 :widths: 30 30 30 * - Operational Scenario - Perfect Consistency (%) - Probabilistic Consistency (%) * - **Support** - 88 - 54 * - **Alert** - 92 - 74 * - **Content Moderation Task** - 82 - 60 * - **Jobs Task** - 74 - 62 * - **Procurement** - 84 - 61 Predicting Customer Spend and Engagement ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We tested whether revealed preference features improve user-level predictions across 11 datasets and 32 targets. Under 5-fold cross-validation, the median lift is zero. The one exception is Amazon churn prediction. Despite near-zero predictive lift, three revealed preference features rank in the top ten by model importance. Full results: :doc:`benchmarks_ecommerce`. .. list-table:: :header-rows: 1 :widths: 18 8 15 12 12 * - Dataset - N - Target - Base AUC-PR - +RP AUC-PR * - Amazon - 4,694 - Spend Drop - .226 - **.248** * - REES46 - 8,832 - Low Loyalty - .709 - .715 * - H&M - 46,757 - High Spender - .683 - .682 * - FINN - 46,858 - Low Loyalty - .780 - .781 Performance ~~~~~~~~~~~ The Rust engine processes users in parallel via Rayon and streams them in fixed-size chunks, so memory stays flat regardless of population size. On a 10-core laptop, scoring 100,000 users across five metrics from a 110 MB Parquet file takes under two minutes. See the :doc:`Performance Benchmarks ` page for details. .. raw:: html
.. list-table:: Throughput by Metric Configuration (T=20-100, K=5) :header-rows: 1 :widths: 40 20 20 * - Metrics - Throughput (users/sec) - Latency (per user) * - **GARP Only** (O(T²)) - ~49,000 - 20 μs * - **GARP + CCEI** - ~2,400 - 420 μs * - **Comprehensive** (GARP, CCEI, MPI, HARP) - ~2,000 - 500 μs .. list-table:: Budget — Large-Scale :header-rows: 1 :widths: 25 15 15 15 * - Configuration - 10K users - 100K users - 1M users * - GARP (O(T²)) - 0.1s - 2.0s - ~20s * - GARP + CCEI - 4.2s - 39.5s - ~6.6 min * - Comprehensive Suite - 6.8s - 67.1s - ~11 min .. list-table:: Menu — Large-Scale :header-rows: 1 :widths: 25 15 15 15 * - Configuration - 10K users - 100K users - 1M users * - SARP + WARP + HM - 0.3s - 5.2s - **85.6s** .. raw:: html
Explore the :doc:`API Reference ` and :doc:`References ` for more. .. Archived: homepage book blurb moved to docs/archive/homepage_extras.rst .. toctree:: :maxdepth: 2 :hidden: install quickstart theory/index identification budget/index menu/index benchmarks algorithms api papers