Profiling Overview ================== Summary ------- Copper profiling is intentionally separate from normal runtime logging and from startup or address-book preparation behavior. Profiling can remain enabled even when normal runtime logs are kept quiet. Profiling Controls ------------------ Copper profiling is controlled by: - ``-profile_metrics`` - ``-profile_top_n `` - ``-profile_paths_full`` - ``-profile_snapshot_interval_s `` Wrapper equivalents: - ``-P`` - ``-N `` - ``-A`` - ``-I `` Validation Scope ---------------- The maintained profiling guidance is grounded in the version4 validation runs that exercised a real ``python3 -c "import torch"`` workload on a small multi-node allocation. Those runs were used to confirm four practical points: - profiling files are emitted reliably when profiling is enabled - aggregate, top-path, and full-path profiling can be enabled together - ``pre-destroy`` snapshots preserve useful profiling information before final teardown - the resulting outputs are useful both for performance interpretation and for path-level diagnosis Representative launch settings from that evaluation included: - metadata ``ENOENT`` TTL enabled at ``1000 ms`` - profiling enabled - top-path limit set to ``20`` - full-path outputs enabled Profiling Use Cases ------------------- The profiling stack is intended to answer practical workload questions such as: - which operations dominate request count - which operations dominate elapsed time - which paths dominate metadata traffic - which paths dominate data traffic - how much reuse is coming from Copper caches - how much reuse is coming from the ENOENT TTL path Validated Profiling Outputs --------------------------- The profiling evaluation work validated three output modes on a real ``python3 -c "import torch"`` workload: - aggregate summaries - bounded top-path summaries - full-path forensic outputs The same evaluation also confirmed that the ``pre-destroy`` snapshot path is working and that its outputs matched the final outputs closely in a clean run. Representative findings from the 2-node validation were: - all profiling output families were written successfully - ``getattr`` dominated total call count - ``read`` dominated total measured latency - the profiler surfaced the repeated ``ENOENT`` probe paths that motivated the metadata TTL work - the metadata TTL was active and useful, with ``1,131`` stores and ``9,715`` serves in the validation run Representative Iter3 Cluster Totals ----------------------------------- The maintained version4 dataset under ``docs/source/iter3`` adds a more complete cluster summary for the same general ``import torch`` workload shape. Across the two-rank cluster summary: .. list-table:: :header-rows: 1 * - Metric - Value * - Total counted FUSE operations - ``3,968,662`` * - Total cache hits - ``3,795,687`` * - Total cache misses - ``2,284`` * - Total negative results - ``68,095`` * - Metadata ``ENOENT`` TTL stores - ``1,132`` * - Metadata ``ENOENT`` TTL serves - ``9,715`` * - Metadata ``ENOENT`` TTL expires - ``6`` * - Total measured latency - ``50.351050 s`` In that same run, the operation totals retained the same overall shape: .. list-table:: :header-rows: 1 * - Operation - Total calls - Total latency * - ``getattr`` - ``3,770,238`` - ``13.383411 s`` * - ``read`` - ``93,524`` - ``35.790869 s`` * - ``readdir`` - ``2,304`` - ``0.249776 s`` This continues to support the same interpretation: - metadata discovery dominates total request count - shared-library and file-content reads dominate total measured latency - the ENOENT TTL continues to avoid repeated negative metadata work at a non-trivial scale even in a small two-rank run Produced Files -------------- When profiling is enabled, each Copper rank can produce several layers of output. The version4 evaluation confirmed the presence of the following file families: - ``*-profiling_summary.md`` - ``*-profiling_operations.csv`` - ``*-profiling_aggregate.csv`` - ``*-profiling_top_paths.csv`` - ``*-data_table_cache_event.output`` - ``*-tree_table_cache_event.output`` - ``*-md_table_cache_event.output`` - ``*-md_ttl_cache_event.output`` The same run also produced the corresponding ``pre-destroy`` variants: - ``*-pre-destroy-profiling_summary.md`` - ``*-pre-destroy-profiling_operations.csv`` - ``*-pre-destroy-profiling_aggregate.csv`` - ``*-pre-destroy-profiling_top_paths.csv`` - ``*-pre-destroy-*.output`` These outputs serve distinct purposes: - aggregate CSVs summarize cluster-wide workload shape - top-path CSVs preserve the hottest paths without unbounded output growth - full-path and table-event outputs support deeper forensic analysis - ``pre-destroy`` outputs provide an earlier preserved checkpoint when final shutdown is noisy Final Full-Path Profiling Findings ---------------------------------- The final full-path profiling experiment provides the maintained production shape for a profiling-enabled import workload. Across the four-rank cluster summary: .. list-table:: :header-rows: 1 * - Metric - Value * - Total counted FUSE operations - ``8,003,438`` * - Total cache hits - ``7,592,110`` * - Total cache misses - ``4,401`` * - Total negative results - ``201,735`` * - Metadata ``ENOENT`` TTL stores - ``2,513`` * - Metadata ``ENOENT`` TTL serves - ``19,732`` * - Total measured latency - ``37.399760 s`` At the operation level, the same cluster summary showed: - ``getattr`` dominated total call count and metadata work - ``read`` dominated cumulative measured latency - ``readdir`` was present but much smaller than the metadata and data hot paths Cross-Rank Stability -------------------- The four per-rank final summaries were consistent with each other. Across the four ranks, the observed ranges were: .. list-table:: :header-rows: 1 * - Metric - Range across ranks * - Total counted FUSE operations - ``2,000,144`` to ``2,002,372`` * - Total cache hits - ``1,897,747`` to ``1,898,160`` * - Total cache misses - ``948`` to ``1,165`` * - Total negative results - ``49,623`` to ``52,204`` * - Metadata ``ENOENT`` TTL stores - ``560`` to ``824`` * - Metadata ``ENOENT`` TTL serves - ``4,576`` to ``5,708`` * - Total measured latency - ``9.111560 s`` to ``9.475459 s`` The ``pre-destroy`` summaries matched those same ranges, which is important operationally: if shutdown becomes noisy, the pre-destroy snapshot still preserves the profiling signal. What the Profiles Say About the Workload ---------------------------------------- The full-path analysis also makes the path structure visible: - the environment root and its parent directories dominate total path events - the Python stdlib and ``site-packages`` remain very hot - the Torch Python package and native Torch libraries are active immediately - repeated missing shared-library probes are common but are usually expected The iter3 cluster path heuristics make that same structure more explicit. The largest observed path classes were: .. list-table:: :header-rows: 1 * - Path class - Total events - Example * - ``environment_prefix`` - ``2,828,400`` - environment-root parent traversal such as ``/lustre`` * - ``python_stdlib`` - ``197,232`` - ``.../conda_env/lib/python3.12`` * - ``torch_python_package`` - ``188,608`` - ``.../site-packages/torch`` * - ``python_site_packages`` - ``163,484`` - ``.../lib/python3.12/site-packages`` * - ``torch_native_library`` - ``57,503`` - ``.../torch/lib/libmagma.so`` * - ``missing_shared_library_probe`` - ``1,084`` - ``.../torch/lib/libhsa-amd-aqlprofile64.so`` That breakdown is useful because it shows that the dominant profile is not a random scatter of unrelated paths. It is concentrated in: - the environment root and its parents - Python stdlib and package-discovery directories - the Torch package tree - a relatively small set of native Torch and ROCm libraries - a still smaller set of repeated probe-miss paths Concrete Hot-Path Examples -------------------------- The version4 top-path outputs are most useful when read as a focused hotspot report rather than as an exhaustive dump. Representative metadata hot paths included: - ``/lustre`` with ``175,759`` metadata events - ``/lustre/orion`` with ``167,959`` metadata events - the active ``conda_env`` root with ``113,359`` metadata events - ``.../lib/python3.12/site-packages/torch`` with ``62,032`` metadata events Representative data hot paths included: - ``.../torch/lib/libmagma.so`` with ``9,091`` events - ``.../torch/lib/libMIOpen.so`` with ``3,098`` events - ``.../torch/lib/libtorch_hip.so`` with ``2,519`` events - ``.../torch/lib/librocsparse.so`` with ``1,869`` events - ``.../torch/lib/libtorch_cpu.so`` with ``1,860`` events Representative TTL hot paths included: - ``.../torch/lib/libhsa-amd-aqlprofile64.so`` with ``256`` to ``257`` TTL events in the version4 runs - ``.../lib/python312.zip`` with roughly ``74`` to ``77`` TTL events - ``.../lib/glibc-hwcaps`` with roughly ``64`` to ``65`` TTL events - ``.../pyvenv.cfg`` with ``16`` to ``17`` TTL events These are high-signal outputs because they answer three different operational questions directly: - which directories dominate metadata traffic - which shared libraries dominate read traffic - which repeated misses are worth treating as environment-path cleanup targets Operationally, those findings support two different profiling use cases: - workload measurement, where the main question is which operations or path classes dominate total work - environment diagnosis, where the main question is which duplicated, stale, or optional probes can be reduced without breaking the active runtime Metric Reading Guide -------------------- The version4 evaluation also clarified a set of terms that are worth keeping close to the maintained profiling pages. ``Counted FUSE operations`` The total number of requests observed by the profiler. This is the broadest measure of workload volume. ``Measured latency`` Time recorded by the profiler while handling operations. This appears both as per-operation average latency and as cumulative total latency. ``Cache hit`` A request answered from Copper's cache path without repeating the full underlying lookup or read path. ``Cache miss`` A request that required additional work because Copper did not already hold a reusable cached answer. ``Negative result`` A request that completed with a not-found or similar unsuccessful result. Many such results are expected during Python and dynamic-loader startup. ``Top-path profiling`` A bounded hotspot view that preserves the busiest paths while avoiding the output growth of a complete per-path dump. ``Full-path profiling`` A more forensic mode that retains wider path-level evidence and is best reserved for targeted runs. ``Pre-destroy snapshot`` A profiling checkpoint written before final teardown so that useful measurements survive even when the shutdown path becomes noisy. That is why the documentation now treats profiling as both: - a performance-measurement tool - a configuration-cleanup tool for environment paths Startup Timing Versus Profiling ------------------------------- The following timing lines are useful alongside profiling, but they are not profiling outputs: - ``provider registration completed after `` - ``first successful parent rpc_... completed after since provider startup`` Those lines describe startup/readiness behavior. Profiling files remain the authoritative source for workload measurements. Related Pages ------------- - :doc:`operations_environment_path_analysis` - :doc:`operations_metadata_enoent_ttl_evaluation` - :doc:`operations_registration_and_scaling`