Profiling Overview
Summary
Copper profiling is intentionally separate from normal runtime logging and from startup or address-book preparation behavior. Profiling can remain enabled even when normal runtime logs are kept quiet.
Profiling Controls
Copper profiling is controlled by:
-profile_metrics
-profile_top_n <N>
-profile_paths_full
-profile_snapshot_interval_s <seconds>
Wrapper equivalents:
-P
-N <N>
-A
-I <seconds>
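As a minimal sketch, the controls above can be assembled into an argument list programmatically. The helper name `build_profiling_args` and its parameters are hypothetical. The long-to-short mapping assumes the wrapper equivalents are listed in the same order as the long flags, and always emitting the metrics flag first is also an assumption. The sketch only builds flag strings; it does not launch Copper.

```python
from typing import List, Optional

# (long form, assumed wrapper equivalent); the pairing is inferred from
# the order in which the two flag lists are given, not from a spec.
FLAG_METRICS = ("-profile_metrics", "-P")
FLAG_TOP_N = ("-profile_top_n", "-N")
FLAG_PATHS_FULL = ("-profile_paths_full", "-A")
FLAG_SNAPSHOT = ("-profile_snapshot_interval_s", "-I")


def build_profiling_args(
    top_n: Optional[int] = None,
    full_paths: bool = False,
    snapshot_interval_s: Optional[int] = None,
    wrapper: bool = False,
) -> List[str]:
    """Hypothetical helper: return profiling flags as argument strings."""
    pick = 1 if wrapper else 0  # 0 = long flags, 1 = wrapper flags
    args = [FLAG_METRICS[pick]]
    if top_n is not None:
        args += [FLAG_TOP_N[pick], str(top_n)]
    if full_paths:
        args.append(FLAG_PATHS_FULL[pick])
    if snapshot_interval_s is not None:
        args += [FLAG_SNAPSHOT[pick], str(snapshot_interval_s)]
    return args


print(build_profiling_args(top_n=20, full_paths=True))
print(build_profiling_args(top_n=20, full_paths=True, wrapper=True))
```

Keeping the flags as a list rather than a single string avoids quoting problems when the list is later passed to a process launcher.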
Validation Scope
The maintained profiling guidance is grounded in the version4 validation runs
that exercised a real python3 -c "import torch" workload on a small
multi-node allocation. Those runs were used to confirm four practical points:
profiling files are emitted reliably when profiling is enabled
aggregate, top-path, and full-path profiling can be enabled together
pre-destroy snapshots preserve useful profiling information before final teardown
the resulting outputs are useful both for performance interpretation and for path-level diagnosis
Representative launch settings from that evaluation included:
metadata ENOENT TTL enabled at 1000 ms
profiling enabled
top-path limit set to 20
full-path outputs enabled
Profiling Use Cases
The profiling stack is intended to answer practical workload questions such as:
which operations dominate request count
which operations dominate elapsed time
which paths dominate metadata traffic
which paths dominate data traffic
how much reuse is coming from Copper caches
how much reuse is coming from the ENOENT TTL path
Validated Profiling Outputs
The profiling evaluation work validated three output modes on a real
python3 -c "import torch" workload:
aggregate summaries
bounded top-path summaries
full-path forensic outputs
The same evaluation also confirmed that the pre-destroy snapshot path is
working and that its outputs matched the final outputs closely in a clean run.
Representative findings from the 2-node validation were:
all profiling output families were written successfully
getattr dominated total call count
read dominated total measured latency
the profiler surfaced the repeated ENOENT probe paths that motivated the metadata TTL work
the metadata TTL was active and useful, with 1,131 stores and 9,715 serves in the validation run
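The TTL figures above can be turned into two simple ratios. This is a small arithmetic sketch using only the reported store and serve counts. Treating serves divided by stores plus serves as the share of TTL-path events answered from the cache is an interpretation for illustration, not a metric the profiler is stated to report.

```python
# Counts reported for the 2-node validation run.
ttl_stores = 1_131
ttl_serves = 9_715

# Average number of times each stored negative entry was reused.
serves_per_store = ttl_serves / ttl_stores

# Share of store + serve events that were serves (illustrative ratio).
served_fraction = ttl_serves / (ttl_serves + ttl_stores)

print(f"serves per store: {serves_per_store:.2f}")
print(f"served share of store+serve events: {served_fraction:.1%}")
```

On these counts, each stored entry was served roughly 8.6 times on average.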
Representative Iter3 Cluster Totals
The maintained version4 dataset under docs/source/iter3 adds a more
complete cluster summary for the same general import torch workload shape.
Across the two-rank cluster summary:
| Metric | Value |
|---|---|
| Total counted FUSE operations | |
| Total cache hits | |
| Total cache misses | |
| Total negative results | |
| Metadata | |
| Metadata | |
| Metadata | |
| Total measured latency | |
In that same run, the operation totals retained the same overall shape:
| Operation | Total calls | Total latency |
|---|---|---|
| | | |
| | | |
| | | |
This continues to support the same interpretation:
metadata discovery dominates total request count
shared-library and file-content reads dominate total measured latency
the ENOENT TTL continues to avoid repeated negative metadata work at a non-trivial scale even in a small two-rank run
Produced Files
When profiling is enabled, each Copper rank can produce several layers of output. The version4 evaluation confirmed the presence of the following file families:
*-profiling_summary.md
*-profiling_operations.csv
*-profiling_aggregate.csv
*-profiling_top_paths.csv
*-data_table_cache_event.output
*-tree_table_cache_event.output
*-md_table_cache_event.output
*-md_ttl_cache_event.output
The same run also produced the corresponding pre-destroy variants:
*-pre-destroy-profiling_summary.md
*-pre-destroy-profiling_operations.csv
*-pre-destroy-profiling_aggregate.csv
*-pre-destroy-profiling_top_paths.csv
*-pre-destroy-*.output
These outputs serve distinct purposes:
aggregate CSVs summarize cluster-wide workload shape
top-path CSVs preserve the hottest paths without unbounded output growth
full-path and table-event outputs support deeper forensic analysis
pre-destroy outputs provide an earlier preserved checkpoint when final shutdown is noisy
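When collecting results from a run, it can help to sort the produced files by stage and family. The following is a minimal sketch based only on the file-name suffixes listed above. The `classify` helper and the `hostA` prefixes are hypothetical, and the sketch assumes that pre-destroy `.output` files reuse the same family suffixes as the final ones.

```python
from fnmatch import fnmatch
from typing import Tuple

# Family suffixes taken from the file lists above.
FAMILIES = [
    "profiling_summary.md",
    "profiling_operations.csv",
    "profiling_aggregate.csv",
    "profiling_top_paths.csv",
    "data_table_cache_event.output",
    "tree_table_cache_event.output",
    "md_table_cache_event.output",
    "md_ttl_cache_event.output",
]


def classify(name: str) -> Tuple[str, str]:
    """Return (stage, family) for a file name, or ("unknown", "")."""
    # Decide the stage first: a final-output pattern such as
    # "*-profiling_summary.md" would also match the pre-destroy variant.
    stage = "pre-destroy" if fnmatch(name, "*-pre-destroy-*") else "final"
    for family in FAMILIES:
        if name.endswith("-" + family):
            return stage, family
    # Pre-destroy table-event files are only specified as "*.output".
    if stage == "pre-destroy" and name.endswith(".output"):
        return stage, "*.output"
    return "unknown", ""


# Hypothetical file names for illustration.
for example in [
    "hostA-profiling_summary.md",
    "hostA-pre-destroy-profiling_top_paths.csv",
    "hostA-md_ttl_cache_event.output",
    "notes.txt",
]:
    print(example, "->", classify(example))
```

Checking the stage before the family matters because the final-output patterns are also satisfied by the pre-destroy names.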
Final Full-Path Profiling Findings
The final full-path profiling experiment provides the maintained production shape for a profiling-enabled import workload.
Across the four-rank cluster summary:
| Metric | Value |
|---|---|
| Total counted FUSE operations | |
| Total cache hits | |
| Total cache misses | |
| Total negative results | |
| Metadata | |
| Metadata | |
| Total measured latency | |
At the operation level, the same cluster summary showed:
getattr dominated total call count and metadata work
read dominated cumulative measured latency
readdir was present but much smaller than the metadata and data hot paths
Cross-Rank Stability
The four per-rank final summaries were consistent with each other. Across the four ranks, the observed ranges were:
| Metric | Range across ranks |
|---|---|
| Total counted FUSE operations | |
| Total cache hits | |
| Total cache misses | |
| Total negative results | |
| Metadata | |
| Metadata | |
| Total measured latency | |
The pre-destroy summaries matched those same ranges, which is important
operationally: if shutdown becomes noisy, the pre-destroy snapshot still
preserves the profiling signal.
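A cross-rank check of this kind can be sketched as a per-metric minimum and maximum across ranks, computed once for the final summaries and once for the pre-destroy summaries. The function name `metric_ranges`, the metric keys, and all numbers below are placeholders for illustration; they are not values from the validation runs, and the sketch does not assume any particular summary file layout.

```python
from typing import Dict, List, Tuple


def metric_ranges(per_rank: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {metric: (min, max)} across a list of per-rank metric dicts."""
    metrics = per_rank[0].keys()
    return {
        m: (min(rank[m] for rank in per_rank), max(rank[m] for rank in per_rank))
        for m in metrics
    }


# Placeholder per-rank totals (NOT measured values).
final = [
    {"operations": 1000, "cache_hits": 800},
    {"operations": 1010, "cache_hits": 805},
]
pre_destroy = [
    {"operations": 1000, "cache_hits": 800},
    {"operations": 1010, "cache_hits": 805},
]

final_ranges = metric_ranges(final)
pre_ranges = metric_ranges(pre_destroy)
print(final_ranges)
print("pre-destroy matches final:", final_ranges == pre_ranges)
```

A narrow range per metric indicates consistent ranks, and equal ranges between the two stages indicate that the snapshot preserved the same signal.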
What the Profiles Say About the Workload
The full-path analysis also makes the path structure visible:
the environment root and its parent directories dominate total path events
the Python stdlib and site-packages remain very hot
the Torch Python package and native Torch libraries are active immediately
repeated missing shared-library probes are common but are usually expected
The iter3 cluster path heuristics make that same structure more explicit. The largest observed path classes were:
| Path class | Total events | Example |
|---|---|---|
| | | environment-root parent traversal such as |
| | | |
| | | |
| | | |
| | | |
| | | |
That breakdown is useful because it shows that the dominant profile is not a random scatter of unrelated paths. It is concentrated in:
the environment root and its parents
Python stdlib and package-discovery directories
the Torch package tree
a relatively small set of native Torch and ROCm libraries
a still smaller set of repeated probe-miss paths
Concrete Hot-Path Examples
The version4 top-path outputs are most useful when read as a focused hotspot report rather than as an exhaustive dump.
Representative metadata hot paths included:
/lustre with 175,759 metadata events
/lustre/orion with 167,959 metadata events
the active conda_env root with 113,359 metadata events
.../lib/python3.12/site-packages/torch with 62,032 metadata events
Representative data hot paths included:
.../torch/lib/libmagma.so with 9,091 events
.../torch/lib/libMIOpen.so with 3,098 events
.../torch/lib/libtorch_hip.so with 2,519 events
.../torch/lib/librocsparse.so with 1,869 events
.../torch/lib/libtorch_cpu.so with 1,860 events
Representative TTL hot paths included:
.../torch/lib/libhsa-amd-aqlprofile64.so with 256 to 257 TTL events in the version4 runs
.../lib/python312.zip with roughly 74 to 77 TTL events
.../lib/glibc-hwcaps with roughly 64 to 65 TTL events
.../pyvenv.cfg with 16 to 17 TTL events
These are high-signal outputs because they answer three different operational questions directly:
which directories dominate metadata traffic
which shared libraries dominate read traffic
which repeated misses are worth treating as environment-path cleanup targets
Operationally, those findings support two different profiling use cases:
workload measurement, where the main question is which operations or path classes dominate total work
environment diagnosis, where the main question is which duplicated, stale, or optional probes can be reduced without breaking the active runtime
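Reading a top-path output as a hotspot report amounts to ranking paths by event count and keeping the first few. The sketch below shows that ranking on an in-memory CSV. The column names `path` and `events` are assumptions, since the actual layout of the `*-profiling_top_paths.csv` files is not described here; the sample rows reuse the metadata event counts quoted above.

```python
import csv
import io
from typing import List, Tuple


def top_paths(csv_text: str, n: int) -> List[Tuple[str, int]]:
    """Return the n paths with the highest event counts, busiest first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = [(row["path"], int(row["events"])) for row in rows]
    return sorted(counts, key=lambda item: item[1], reverse=True)[:n]


# Assumed column names; counts are the metadata figures quoted above.
SAMPLE = """path,events
/lustre/orion,167959
.../lib/python3.12/site-packages/torch,62032
/lustre,175759
"""

for path, events in top_paths(SAMPLE, 2):
    print(f"{events:>8}  {path}")
```

The same ranking applies to data or TTL events if the corresponding count column is substituted.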
Metric Reading Guide
The version4 evaluation also clarified a set of terms that are worth keeping close to the maintained profiling pages.
Counted FUSE operations: The total number of requests observed by the profiler. This is the broadest measure of workload volume.
Measured latency: Time recorded by the profiler while handling operations. This appears both as per-operation average latency and as cumulative total latency.
Cache hit: A request answered from Copper’s cache path without repeating the full underlying lookup or read path.
Cache miss: A request that required additional work because Copper did not already hold a reusable cached answer.
Negative result: A request that completed with a not-found or similar unsuccessful result. Many such results are expected during Python and dynamic-loader startup.
Top-path profiling: A bounded hotspot view that preserves the busiest paths while avoiding the output growth of a complete per-path dump.
Full-path profiling: A more forensic mode that retains wider path-level evidence and is best reserved for targeted runs.
Pre-destroy snapshot: A profiling checkpoint written before final teardown so that useful measurements survive even when the shutdown path becomes noisy.
Because the same outputs expose both workload shape and repeated misses, the documentation now treats profiling as both:
a performance-measurement tool
a configuration-cleanup tool for environment paths
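The relationship between call counts, average latency, cumulative latency, and cache hit ratio from the reading guide can be sketched as follows. The `OpStats` class, the `hit_ratio` helper, the microsecond unit, and every number are placeholders chosen for illustration; none of them are measured values or profiler field names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class OpStats:
    calls: int
    total_latency_us: float  # cumulative latency; microseconds is an assumed unit

    @property
    def avg_latency_us(self) -> float:
        """Per-operation average latency derived from the cumulative total."""
        return self.total_latency_us / self.calls if self.calls else 0.0


def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of cache lookups that were hits."""
    total = hits + misses
    return hits / total if total else 0.0


# Placeholder numbers (NOT measured values).
ops: Dict[str, OpStats] = {
    "getattr": OpStats(calls=10_000, total_latency_us=50_000.0),
    "read": OpStats(calls=500, total_latency_us=400_000.0),
}

busiest = max(ops, key=lambda name: ops[name].calls)
slowest = max(ops, key=lambda name: ops[name].total_latency_us)
print("most calls:", busiest)
print("most cumulative latency:", slowest)
print("read average latency (us):", ops["read"].avg_latency_us)
print("hit ratio:", hit_ratio(800, 200))
```

The placeholder figures are chosen to mirror the pattern described in the findings: an operation can dominate call count while a different one dominates cumulative latency.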
Startup Timing Versus Profiling
The following timing lines are useful alongside profiling, but they are not profiling outputs:
provider registration completed after <us>
first successful parent rpc_... completed after <us> since provider startup
Those lines describe startup/readiness behavior. Profiling files remain the authoritative source for workload measurements.