Profiling Overview

Summary

Copper profiling is intentionally separate from normal runtime logging and from startup or address-book preparation behavior. Profiling can remain enabled even when normal runtime logs are kept quiet.

Profiling Controls

Copper profiling is controlled by:

  • -profile_metrics

  • -profile_top_n <N>

  • -profile_paths_full

  • -profile_snapshot_interval_s <seconds>

Wrapper equivalents:

  • -P

  • -N <N>

  • -A

  • -I <seconds>

Validation Scope

The maintained profiling guidance is grounded in the version4 validation runs that exercised a real python3 -c "import torch" workload on a small multi-node allocation. Those runs were used to confirm four practical points:

  • profiling files are emitted reliably when profiling is enabled

  • aggregate, top-path, and full-path profiling can be enabled together

  • pre-destroy snapshots preserve useful profiling information before final teardown

  • the resulting outputs are useful both for performance interpretation and for path-level diagnosis

Representative launch settings from that evaluation included:

  • metadata ENOENT TTL enabled at 1000 ms

  • profiling enabled

  • top-path limit set to 20

  • full-path outputs enabled

Profiling Use Cases

The profiling stack is intended to answer practical workload questions such as:

  • which operations dominate request count

  • which operations dominate elapsed time

  • which paths dominate metadata traffic

  • which paths dominate data traffic

  • how much reuse is coming from Copper caches

  • how much reuse is coming from the ENOENT TTL path

Validated Profiling Outputs

The profiling evaluation work validated three output modes on a real python3 -c "import torch" workload:

  • aggregate summaries

  • bounded top-path summaries

  • full-path forensic outputs

The same evaluation also confirmed that the pre-destroy snapshot path is working and that its outputs matched the final outputs closely in a clean run.

Representative findings from the 2-node validation were:

  • all profiling output families were written successfully

  • getattr dominated total call count

  • read dominated total measured latency

  • the profiler surfaced the repeated ENOENT probe paths that motivated the metadata TTL work

  • the metadata TTL was active and useful, with 1,131 stores and 9,715 serves in the validation run

Representative Iter3 Cluster Totals

The maintained version4 dataset under docs/source/iter3 adds a more complete cluster summary for the same general import torch workload shape. Across the two-rank cluster summary:

Metric

Value

Total counted FUSE operations

3,968,662

Total cache hits

3,795,687

Total cache misses

2,284

Total negative results

68,095

Metadata ENOENT TTL stores

1,132

Metadata ENOENT TTL serves

9,715

Metadata ENOENT TTL expires

6

Total measured latency

50.351050 s

In that same run, the operation totals retained the same overall shape:

Operation

Total calls

Total latency

getattr

3,770,238

13.383411 s

read

93,524

35.790869 s

readdir

2,304

0.249776 s

This continues to support the same interpretation:

  • metadata discovery dominates total request count

  • shared-library and file-content reads dominate total measured latency

  • the ENOENT TTL continues to avoid repeated negative metadata work at a non-trivial scale even in a small two-rank run

Produced Files

When profiling is enabled, each Copper rank can produce several layers of output. The version4 evaluation confirmed the presence of the following file families:

  • *-profiling_summary.md

  • *-profiling_operations.csv

  • *-profiling_aggregate.csv

  • *-profiling_top_paths.csv

  • *-data_table_cache_event.output

  • *-tree_table_cache_event.output

  • *-md_table_cache_event.output

  • *-md_ttl_cache_event.output

The same run also produced the corresponding pre-destroy variants:

  • *-pre-destroy-profiling_summary.md

  • *-pre-destroy-profiling_operations.csv

  • *-pre-destroy-profiling_aggregate.csv

  • *-pre-destroy-profiling_top_paths.csv

  • *-pre-destroy-*.output

These outputs serve distinct purposes:

  • aggregate CSVs summarize cluster-wide workload shape

  • top-path CSVs preserve the hottest paths without unbounded output growth

  • full-path and table-event outputs support deeper forensic analysis

  • pre-destroy outputs provide an earlier preserved checkpoint when final shutdown is noisy

Final Full-Path Profiling Findings

The final full-path profiling experiment provides the maintained production shape for a profiling-enabled import workload.

Across the four-rank cluster summary:

Metric

Value

Total counted FUSE operations

8,003,438

Total cache hits

7,592,110

Total cache misses

4,401

Total negative results

201,735

Metadata ENOENT TTL stores

2,513

Metadata ENOENT TTL serves

19,732

Total measured latency

37.399760 s

At the operation level, the same cluster summary showed:

  • getattr dominated total call count and metadata work

  • read dominated cumulative measured latency

  • readdir was present but much smaller than the metadata and data hot paths

Cross-Rank Stability

The four per-rank final summaries were consistent with each other. Across the four ranks, the observed ranges were:

Metric

Range across ranks

Total counted FUSE operations

2,000,144 to 2,002,372

Total cache hits

1,897,747 to 1,898,160

Total cache misses

948 to 1,165

Total negative results

49,623 to 52,204

Metadata ENOENT TTL stores

560 to 824

Metadata ENOENT TTL serves

4,576 to 5,708

Total measured latency

9.111560 s to 9.475459 s

The pre-destroy summaries matched those same ranges, which is important operationally: if shutdown becomes noisy, the pre-destroy snapshot still preserves the profiling signal.

What the Profiles Say About the Workload

The full-path analysis also makes the path structure visible:

  • the environment root and its parent directories dominate total path events

  • the Python stdlib and site-packages remain very hot

  • the Torch Python package and native Torch libraries are active immediately

  • repeated missing shared-library probes are common but are usually expected

The iter3 cluster path heuristics make that same structure more explicit. The largest observed path classes were:

Path class

Total events

Example

environment_prefix

2,828,400

environment-root parent traversal such as /lustre

python_stdlib

197,232

.../conda_env/lib/python3.12

torch_python_package

188,608

.../site-packages/torch

python_site_packages

163,484

.../lib/python3.12/site-packages

torch_native_library

57,503

.../torch/lib/libmagma.so

missing_shared_library_probe

1,084

.../torch/lib/libhsa-amd-aqlprofile64.so

That breakdown is useful because it shows that the dominant profile is not a random scatter of unrelated paths. It is concentrated in:

  • the environment root and its parents

  • Python stdlib and package-discovery directories

  • the Torch package tree

  • a relatively small set of native Torch and ROCm libraries

  • a still smaller set of repeated probe-miss paths

Concrete Hot-Path Examples

The version4 top-path outputs are most useful when read as a focused hotspot report rather than as an exhaustive dump.

Representative metadata hot paths included:

  • /lustre with 175,759 metadata events

  • /lustre/orion with 167,959 metadata events

  • the active conda_env root with 113,359 metadata events

  • .../lib/python3.12/site-packages/torch with 62,032 metadata events

Representative data hot paths included:

  • .../torch/lib/libmagma.so with 9,091 events

  • .../torch/lib/libMIOpen.so with 3,098 events

  • .../torch/lib/libtorch_hip.so with 2,519 events

  • .../torch/lib/librocsparse.so with 1,869 events

  • .../torch/lib/libtorch_cpu.so with 1,860 events

Representative TTL hot paths included:

  • .../torch/lib/libhsa-amd-aqlprofile64.so with 256 to 257 TTL events in the version4 runs

  • .../lib/python312.zip with roughly 74 to 77 TTL events

  • .../lib/glibc-hwcaps with roughly 64 to 65 TTL events

  • .../pyvenv.cfg with 16 to 17 TTL events

These are high-signal outputs because they answer three different operational questions directly:

  • which directories dominate metadata traffic

  • which shared libraries dominate read traffic

  • which repeated misses are worth treating as environment-path cleanup targets

Operationally, those findings support two different profiling use cases:

  • workload measurement, where the main question is which operations or path classes dominate total work

  • environment diagnosis, where the main question is which duplicated, stale, or optional probes can be reduced without breaking the active runtime

Metric Reading Guide

The version4 evaluation also clarified a set of terms that are worth keeping close to the maintained profiling pages.

Counted FUSE operations

The total number of requests observed by the profiler. This is the broadest measure of workload volume.

Measured latency

Time recorded by the profiler while handling operations. This appears both as per-operation average latency and as cumulative total latency.

Cache hit

A request answered from Copper’s cache path without repeating the full underlying lookup or read path.

Cache miss

A request that required additional work because Copper did not already hold a reusable cached answer.

Negative result

A request that completed with a not-found or similar unsuccessful result. Many such results are expected during Python and dynamic-loader startup.

Top-path profiling

A bounded hotspot view that preserves the busiest paths while avoiding the output growth of a complete per-path dump.

Full-path profiling

A more forensic mode that retains wider path-level evidence and is best reserved for targeted runs.

Pre-destroy snapshot

A profiling checkpoint written before final teardown so that useful measurements survive even when the shutdown path becomes noisy.

That is why the documentation now treats profiling as both:

  • a performance-measurement tool

  • a configuration-cleanup tool for environment paths

Startup Timing Versus Profiling

The following timing lines are useful alongside profiling, but they are not profiling outputs:

  • provider registration completed after <us>

  • first successful parent rpc_... completed after <us> since provider startup

Those lines describe startup/readiness behavior. Profiling files remain the authoritative source for workload measurements.