Performance
This page records benchmark results from the v0.12.x performance cycle and is updated each release cycle. All measurements come from the Go benchmark suite in benchmarks/. See the v0.12.x performance post for the full story behind each number.
Machine: Apple M4, macOS, Go 1.26.
Package installation
go test -bench=BenchmarkInstall -benchmem -benchtime=3s -count=5 ./benchmarks/

| Benchmark | v0.12.1 (sequential) | v0.12.2 (parallel pool) | Improvement |
|---|---|---|---|
| Install 47 wheels (warm cache) | ~65 ms | ~51 ms | −22% |
The parallel pool uses min(len(pins), GOMAXPROCS*2) workers. The gain is larger on a slow disk or network where the sequential path serialised all I/O latency.
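The worker-count rule above can be written as a small function. This is a minimal sketch, not bunpy's source: the name `workerCount` and its two integer parameters are hypothetical, chosen so the rule can be checked without a real lockfile.

```go
package main

import (
	"fmt"
	"runtime"
)

// workerCount returns min(pins, 2*procs), the pool-size rule quoted above.
// The function name and signature are assumptions made for illustration.
func workerCount(pins, procs int) int {
	limit := procs * 2
	if pins < limit {
		return pins
	}
	return limit
}

func main() {
	// GOMAXPROCS(0) queries the current setting without changing it.
	procs := runtime.GOMAXPROCS(0)
	fmt.Println("workers for 47 pins:", workerCount(47, procs))
	fmt.Println("workers for 3 pins: ", workerCount(3, procs))
}
```

Capping at the number of pins avoids starting idle workers for small lockfiles, while the GOMAXPROCS*2 cap keeps large installs from oversubscribing the machine.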
Package resolution
go test -bench=BenchmarkPMLock -benchmem -benchtime=3s -count=5 ./benchmarks/

| Benchmark | v0.12.1 | v0.12.3 (prefetch) | v0.12.4 (concurrency ×4) |
|---|---|---|---|
| pm lock, 47 pkgs, warm (fixture) | ~14 ms | ~11 ms | ~11 ms |
| pm lock, 47 pkgs, cold (real PyPI) | ~4.8 s | ~4.2 s | ~3.1 s |
The fixture benchmark uses an in-process index (no network). It measures pure resolver overhead. The cold-cache number uses a real PyPI request and reflects network latency; your numbers will differ by connection speed.
Test runner
go test -bench=BenchmarkTestRunner -benchmem -benchtime=3s -count=5 ./benchmarks/

| Benchmark | v0.12.1 (unbounded) | v0.12.5 (bounded pool) |
|---|---|---|
| RunParallel, 100 test files | ~14 ms | ~14 ms |
Throughput is the same on a machine with sufficient RAM. The improvement from B-4 is peak goroutine count and GC pressure: the unbounded implementation launched one goroutine per file; the bounded pool holds at GOMAXPROCS*2. On a 200-file suite on a 4-core machine the difference is 200 live interpreter allocations versus 8.
Build cache
go test -bench=BenchmarkBuild -benchmem -benchtime=3s -count=3 ./benchmarks/

| Benchmark | Time |
|---|---|
| BenchmarkBuild_CacheMiss (full build, tiny fixture) | ~14 ms |
| BenchmarkBuild_CacheHit (cache hit, tiny fixture) | ~8 ms |
| BenchmarkCheckCache_Hit (Go-level hash check, 10 files) | ~55 µs |
On a real project, where file collection, minification, and zip writing dominate, a cache hit skips all of that work and leaves only the hash check (~55 µs for the 10-file fixture measured here). The remaining ~8 ms in the CLI benchmark is process startup.
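A hash-based cache check of this kind can be sketched as follows. This is an illustrative assumption, not bunpy's cache-key scheme: the use of SHA-256, the in-memory `map[string][]byte` standing in for source files, and the names `digest` and `cacheHit` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// digest hashes file names and contents in sorted name order, so the
// result does not depend on map iteration order.
func digest(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		// Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
		fmt.Fprintf(h, "%d:%s%d:", len(name), name, len(files[name]))
		h.Write(files[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// cacheHit reports whether the inputs still match the digest stored by
// the previous build (hypothetical storage, passed in as a string here).
func cacheHit(files map[string][]byte, stored string) bool {
	return digest(files) == stored
}

func main() {
	files := map[string][]byte{
		"main.py":  []byte("print('hi')\n"),
		"utils.py": []byte("def f(): pass\n"),
	}
	stored := digest(files)
	fmt.Println("unchanged inputs hit:", cacheHit(files, stored))

	files["utils.py"] = []byte("def f(): return 1\n")
	fmt.Println("edited inputs hit:   ", cacheHit(files, stored))
}
```

Only when the digest differs does the build need to fall through to the full collection, minification, and zip path.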
Startup
go test -bench=BenchmarkStartup -benchmem -benchtime=3s -count=3 ./benchmarks/

| Benchmark | v0.12.1 | v0.12.8 |
|---|---|---|
| BenchmarkStartup (run test fixture file) | ~8 ms | ~8 ms |
| BenchmarkStartup_InlinePass (bunpy -c "pass") | n/a | ~7.2 ms |
bunpy -c "pass" at ~7.2 ms is inside the 10 ms target and below CPython 3.14’s 14 ms cold start on M-series hardware. The lazy module loading in v0.12.8 skips all 40+ bunpy.* factory calls for scripts that never import bunpy. The remaining startup cost is Go runtime init and goipy.New().
Running the benchmarks yourself
# Generate fixtures once
go run ./benchmarks/fixtures/build_fixtures.go
# Run all benchmarks
go test -bench=. -benchmem -benchtime=3s -count=3 ./benchmarks/
# Run a specific benchmark
go test -bench=BenchmarkStartup -benchmem -benchtime=5s -count=5 ./benchmarks/
# Cross-tool comparison (bunpy vs uv vs CPython)
go test -bench=. -benchmem -benchtime=3s -count=3 ./benchmarks/compare/

The scripts/bench.sh script runs all benchmarks and writes a snapshot to benchmarks/baseline.txt.
Environment variables that affect performance
| Variable | Effect |
|---|---|
| BUNPY_TEST_PARALLELISM=N | Override test runner worker count (default: GOMAXPROCS*2) |
| BUNPY_PYPI_CONCURRENCY=N | Override PyPI page fetch concurrency (default: 16) |
| BUNPY_PYPI_INDEX_URL=url | Use an alternate PyPI index (e.g. a local mirror) |
| BUNPY_DEBUG=http2 | Log HTTP/2 negotiation for each PyPI connection |
| BUNPY_PROFILE_STARTUP=1 | Write a pprof CPU profile to /tmp/bunpy-startup.pprof |
| BUNPY_STARTUP_PPROF=path | Override the pprof output path |