Profiling Overview

Summary

Copper profiling is intentionally separate from normal runtime logging and from startup or address-book preparation behavior. Profiling can remain enabled even when normal runtime logs are kept quiet.

Profiling Controls

Copper profiling is controlled by:

-profile_metrics
-profile_top_n <N>
-profile_paths_full
-profile_snapshot_interval_s <seconds>

Wrapper equivalents:

-P
-N <N>
-A
-I <seconds>

Validation Scope

The maintained profiling guidance is grounded in the version4 validation runs that exercised a real python3 -c "import torch" workload on a small multi-node allocation. Those runs were used to confirm four practical points:

profiling files are emitted reliably when profiling is enabled
aggregate, top-path, and full-path profiling can be enabled together
pre-destroy snapshots preserve useful profiling information before final teardown
the resulting outputs are useful both for performance interpretation and for path-level diagnosis

Representative launch settings from that evaluation included:

metadata ENOENT TTL enabled at 1000 ms
profiling enabled
top-path limit set to 20
full-path outputs enabled

Profiling Use Cases

The profiling stack is intended to answer practical workload questions such as:

which operations dominate request count
which operations dominate elapsed time
which paths dominate metadata traffic
which paths dominate data traffic
how much reuse is coming from Copper caches
how much reuse is coming from the ENOENT TTL path

Validated Profiling Outputs

The profiling evaluation work validated three output modes on a real python3 -c "import torch" workload:

aggregate summaries
bounded top-path summaries
full-path forensic outputs

The same evaluation also confirmed that the pre-destroy snapshot path is working and that its outputs matched the final outputs closely in a clean run.

Representative findings from the 2-node validation were:

all profiling output families were written successfully
getattr dominated total call count
read dominated total measured latency
the profiler surfaced the repeated ENOENT probe paths that motivated the metadata TTL work
the metadata TTL was active and useful, with 1,131 stores and 9,715 serves in the validation run

Representative Iter3 Cluster Totals

The maintained version4 dataset under docs/source/iter3 adds a more complete cluster summary for the same general import torch workload shape. Across the two-rank cluster summary:

Metric	Value
Total counted FUSE operations	`3,968,662`
Total cache hits	`3,795,687`
Total cache misses	`2,284`
Total negative results	`68,095`
Metadata `ENOENT` TTL stores	`1,132`
Metadata `ENOENT` TTL serves	`9,715`
Metadata `ENOENT` TTL expires	`6`
Total measured latency	`50.351050 s`

In that same run, the operation totals retained the same overall shape:

Operation	Total calls	Total latency
`getattr`	`3,770,238`	`13.383411 s`
`read`	`93,524`	`35.790869 s`
`readdir`	`2,304`	`0.249776 s`

This continues to support the same interpretation:

metadata discovery dominates total request count
shared-library and file-content reads dominate total measured latency
the ENOENT TTL continues to avoid repeated negative metadata work at a non-trivial scale even in a small two-rank run

Produced Files

When profiling is enabled, each Copper rank can produce several layers of output. The version4 evaluation confirmed the presence of the following file families:

*-profiling_summary.md
*-profiling_operations.csv
*-profiling_aggregate.csv
*-profiling_top_paths.csv
*-data_table_cache_event.output
*-tree_table_cache_event.output
*-md_table_cache_event.output
*-md_ttl_cache_event.output

The same run also produced the corresponding pre-destroy variants:

*-pre-destroy-profiling_summary.md
*-pre-destroy-profiling_operations.csv
*-pre-destroy-profiling_aggregate.csv
*-pre-destroy-profiling_top_paths.csv
*-pre-destroy-*.output

These outputs serve distinct purposes:

aggregate CSVs summarize cluster-wide workload shape
top-path CSVs preserve the hottest paths without unbounded output growth
full-path and table-event outputs support deeper forensic analysis
pre-destroy outputs provide an earlier preserved checkpoint when final shutdown is noisy

Final Full-Path Profiling Findings

The final full-path profiling experiment provides the maintained production shape for a profiling-enabled import workload.

Across the four-rank cluster summary:

Metric	Value
Total counted FUSE operations	`8,003,438`
Total cache hits	`7,592,110`
Total cache misses	`4,401`
Total negative results	`201,735`
Metadata `ENOENT` TTL stores	`2,513`
Metadata `ENOENT` TTL serves	`19,732`
Total measured latency	`37.399760 s`

At the operation level, the same cluster summary showed:

getattr dominated total call count and metadata work
read dominated cumulative measured latency
readdir was present but much smaller than the metadata and data hot paths

Cross-Rank Stability

The four per-rank final summaries were consistent with each other. Across the four ranks, the observed ranges were:

Metric	Range across ranks
Total counted FUSE operations	`2,000,144` to `2,002,372`
Total cache hits	`1,897,747` to `1,898,160`
Total cache misses	`948` to `1,165`
Total negative results	`49,623` to `52,204`
Metadata `ENOENT` TTL stores	`560` to `824`
Metadata `ENOENT` TTL serves	`4,576` to `5,708`
Total measured latency	`9.111560 s` to `9.475459 s`

The pre-destroy summaries matched those same ranges, which is important operationally: if shutdown becomes noisy, the pre-destroy snapshot still preserves the profiling signal.

What the Profiles Say About the Workload

The full-path analysis also makes the path structure visible:

the environment root and its parent directories dominate total path events
the Python stdlib and site-packages remain very hot
the Torch Python package and native Torch libraries are active immediately
repeated missing shared-library probes are common but are usually expected

The iter3 cluster path heuristics make that same structure more explicit. The largest observed path classes were:

Path class	Total events	Example
`environment_prefix`	`2,828,400`	environment-root parent traversal such as `/lustre`
`python_stdlib`	`197,232`	`.../conda_env/lib/python3.12`
`torch_python_package`	`188,608`	`.../site-packages/torch`
`python_site_packages`	`163,484`	`.../lib/python3.12/site-packages`
`torch_native_library`	`57,503`	`.../torch/lib/libmagma.so`
`missing_shared_library_probe`	`1,084`	`.../torch/lib/libhsa-amd-aqlprofile64.so`

That breakdown is useful because it shows that the dominant profile is not a random scatter of unrelated paths. It is concentrated in:

the environment root and its parents
Python stdlib and package-discovery directories
the Torch package tree
a relatively small set of native Torch and ROCm libraries
a still smaller set of repeated probe-miss paths

Concrete Hot-Path Examples

The version4 top-path outputs are most useful when read as a focused hotspot report rather than as an exhaustive dump.

Representative metadata hot paths included:

/lustre with 175,759 metadata events
/lustre/orion with 167,959 metadata events
the active conda_env root with 113,359 metadata events
.../lib/python3.12/site-packages/torch with 62,032 metadata events

Representative data hot paths included:

.../torch/lib/libmagma.so with 9,091 events
.../torch/lib/libMIOpen.so with 3,098 events
.../torch/lib/libtorch_hip.so with 2,519 events
.../torch/lib/librocsparse.so with 1,869 events
.../torch/lib/libtorch_cpu.so with 1,860 events

Representative TTL hot paths included:

.../torch/lib/libhsa-amd-aqlprofile64.so with 256 to 257 TTL events in the version4 runs
.../lib/python312.zip with roughly 74 to 77 TTL events
.../lib/glibc-hwcaps with roughly 64 to 65 TTL events
.../pyvenv.cfg with 16 to 17 TTL events

These are high-signal outputs because they answer three different operational questions directly:

which directories dominate metadata traffic
which shared libraries dominate read traffic
which repeated misses are worth treating as environment-path cleanup targets

Operationally, those findings support two different profiling use cases:

workload measurement, where the main question is which operations or path classes dominate total work
environment diagnosis, where the main question is which duplicated, stale, or optional probes can be reduced without breaking the active runtime

Metric Reading Guide

The version4 evaluation also clarified a set of terms that are worth keeping close to the maintained profiling pages.

Counted FUSE operations: The total number of requests observed by the profiler. This is the broadest measure of workload volume.
Measured latency: Time recorded by the profiler while handling operations. This appears both as per-operation average latency and as cumulative total latency.
Cache hit: A request answered from Copper’s cache path without repeating the full underlying lookup or read path.
Cache miss: A request that required additional work because Copper did not already hold a reusable cached answer.
Negative result: A request that completed with a not-found or similar unsuccessful result. Many such results are expected during Python and dynamic-loader startup.
Top-path profiling: A bounded hotspot view that preserves the busiest paths while avoiding the output growth of a complete per-path dump.
Full-path profiling: A more forensic mode that retains wider path-level evidence and is best reserved for targeted runs.
Pre-destroy snapshot: A profiling checkpoint written before final teardown so that useful measurements survive even when the shutdown path becomes noisy.

That is why the documentation now treats profiling as both:

a performance-measurement tool
a configuration-cleanup tool for environment paths

Startup Timing Versus Profiling

The following timing lines are useful alongside profiling, but they are not profiling outputs:

provider registration completed after <us>
first successful parent rpc_... completed after <us> since provider startup

Those lines describe startup/readiness behavior. Profiling files remain the authoritative source for workload measurements.