Performance Benchmarks#

PrefGraph uses a Rust engine (rpt-core) for batch choice analysis. It combines Rayon parallelism with graph algorithms and HiGHS linear programming, streaming users in chunks to keep memory bounded.

Rust vs Python engine throughput comparison

Scalability and Throughput#

Throughput scales roughly linearly with core count since each user is independent. On a 10-12 core CPU, GARP-only processing reaches about 49k users/sec. Adding CCEI yields about 2.4k/sec. The full suite (GARP, CCEI, MPI, HARP) sustains about 2.0k/sec.

Throughput characteristics across user cohorts

Per-user latency is about 20 microseconds for GARP-only, 420 microseconds for GARP+CCEI, and 500 microseconds for the full four-metric suite. These figures vary with core count and clock speed.

Throughput by Metric Configuration (T=20-100, K=5)#

Metrics

Throughput (users/sec)

Latency (per user)

GARP Only (O(T²))

~49,000

20 μs

GARP + CCEI

~2,400

420 μs

Comprehensive (GARP, CCEI, MPI, HARP)

~2,000

500 μs

Computational Complexity by Metric#

Throughput varies by algorithm complexity. See Algorithms for details.

Per-user computational cost by metric and observation count (T)

Memory Management and Streaming#

The engine maintains a flat memory profile by streaming users in fixed-size chunks. By default, batches of 50k users are processed, and intermediate buffers are released between batches, so peak memory depends on chunk size rather than total population.

Memory consumption under streaming conditions

At the default chunk size, peak memory is 100-200 MB. Million-user runs work on a laptop. Adjust chunk size to trade memory for throughput.

Large-Scale Benchmarks#

For budgets with T=20–100 and K=5, GARP alone completes 10k, 100k, and 1M users in roughly 0.1s, 2.0s, and ~20s. Adding CCEI yields about 4.2s, 39.5s, and ~6.6 minutes. Running the full suite (GARP, CCEI, MPI, HARP) takes around 6.8s, 67.1s, and ~11 minutes for the same scales.

On discrete menus (50 items, 20–100 sessions), the SARP+WARP+HM bundle completes ~0.3s at 10k users, ~5.2s at 100k, and ~85.6s at 1M. Measured on Apple M-series hardware. Timings scale with core count.

Budget — Large-Scale#

Configuration

10K users

100K users

1M users

GARP (O(T²))

0.1s

2.0s

~20s

GARP + CCEI

4.2s

39.5s

~6.6 min

Comprehensive Suite

6.8s

67.1s

~11 min

Menu — Large-Scale#

Configuration

10K users

100K users

1M users

SARP + WARP + HM

0.3s

5.2s

85.6s

End-to-End from Disk#

The benchmarks above measure in-memory scoring. Below is the full disk-to-scores pipeline on 100,000 synthetic consumers (T=15, K=5, full 5-metric suite).

Budget: 100K Users, 5 Metrics (GARP, CCEI, MPI, HARP, HM)#

Pipeline

Read

Transform

Score

Total

File Size

CSV + Polars

63 ms

18.0 s

1m 32s

1m 50s

281 MB

Parquet + Polars

68 ms

14.1 s

1m 43s

1m 57s

110 MB

Parquet (streaming)

1m 45s

110 MB

Menu: 1,000 Users (SARP + WARP + HM, 50 items)#

Pipeline

Total

In-memory

131 ms

CSV + reconstruct

200 ms

File I/O adds under 70 ms for 280 MB. The Rust scoring step dominates wall time for anything beyond GARP-only. Parquet streaming via engine.analyze_parquet() skips the Python transformation step and reaches about 950 users/sec for the full suite. Parquet with zstd is 2.6x smaller than CSV.

Complexity Summary#

GARP runs in O(T²), CCEI in O(T² log T), MPI and HARP in O(T³). Houtman-Maks uses greedy FVS with exact ILP for small T. Utility recovery and VEI each solve LPs at O(T²) scale.

Complexity Summary#

Algorithm

Complexity

Implementation Notes

GARP

O(T²)

SCC-based arc-scan; avoids O(T³) transitive closure.

CCEI (AEI)

O(T² log T)

Iterative binary search over T² potential efficiency ratios.

MPI

O(T³)

Karp’s maximum-mean-weight cycle algorithm.

HARP

O(T³)

Max-product path calculation via modified Floyd-Warshall.

Houtman-Maks

O(T²) / ILP

Greedy FVS (approximate); ILP via HiGHS for exact solutions (T ≤ 200).

Utility Recovery

O(T²)

Linear programming with 2T variables and T(T-1) constraints.

VEI

O(T²)

Observation-specific efficiency via constrained optimization.

Hardware Configuration#

All results are from an Apple M-series machine (11 cores). On a 64-core server, throughput is roughly 5x higher.