Launch and Analysis Runbook =========================== Purpose ------- This page is a step-by-step runbook for common Copper launch and post-run analysis tasks. It is intentionally procedural, even where some of the same material appears elsewhere in the documentation set. Use this page when the goal is to: - launch Copper with basic runtime settings - enable startup timing visibility - enable profiling outputs - choose between facility and discovery address-book preparation - run the post-processing scripts after a job completes Step 1: Choose a Log Level -------------------------- Copper logging is controlled with ``-l`` in the platform launch wrappers and with ``-log_level`` in direct ``cu_fuse`` runs. The current user-facing levels are: .. list-table:: :header-rows: 1 * - Level - Meaning - Typical use * - ``-l 0`` - no logging - quiet production runs * - ``-l 1`` - fatal only - crash-only visibility * - ``-l 2`` - error and fatal - error-focused troubleshooting * - ``-l 3`` - warning, error, and fatal - warning-focused troubleshooting * - ``-l 4`` - info, warning, error, and fatal - startup timing and readiness studies * - ``-l 5`` - debug-heavy / most logging - deep debugging If the goal is to retain the compact startup timing lines: - ``provider registration completed after `` - ``first successful parent rpc_... completed after since provider startup`` use at least: - ``-l 4`` Step 2: Choose an Address-Book Source ------------------------------------- Copper supports two address-book source modes in the launch wrappers. ``facility`` Filters a provided facility address book down to the current allocation. ``discover`` Runs the discovery helper across the allocation, preserves the raw output, and derives the final ``copper_address_book.txt`` from the selected network column. Practical guidance: - use ``facility`` when the site address book is trusted and current - use ``discover`` when a fresh allocation-derived mapping is needed - when using ``discover``, keep both: - ``logs/copper_address_book.txt`` - ``logs/copper_address_book_full_output.txt`` Step 3: Choose Profiling Options -------------------------------- Copper profiling can be enabled with any combination of the following options: .. list-table:: :header-rows: 1 * - Wrapper option - Effect * - ``-P`` - enable profiling collection * - ``-N `` - enable profiling and keep the hottest ``N`` paths * - ``-A`` - enable fuller path-level outputs * - ``-I `` - enable periodic profiling snapshots while the job is still running The corresponding direct ``cu_fuse`` options are: - ``-profile_metrics`` - ``-profile_top_n `` - ``-profile_paths_full`` - ``-profile_snapshot_interval_s `` Common combinations: - no profiling: - no ``-P``, ``-N``, ``-A``, or ``-I`` - aggregate profiling: - ``-P`` - bounded hotspot profiling: - ``-P -N 20`` - full-path forensic profiling: - ``-P -N 20 -A`` Step 4: Launch Copper --------------------- Aurora ^^^^^^ Basic run with startup timing visibility: .. code-block:: bash launch_copper_aurora.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Facility-mode run with profiling enabled: .. code-block:: bash launch_copper_aurora.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Discovery-mode run: .. code-block:: bash launch_copper_aurora.sh -l 4 -a discover -n cxi -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Aurora notes: - ``cxi`` is the common endpoint family in the Aurora wrappers - ``48,49,50,51`` is the common Copper service-core example - ``-l 4`` is the recommended setting when startup timing matters Frontier ^^^^^^^^ Basic run with startup timing visibility: .. code-block:: bash launch_copper_frontier.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Facility-mode run with profiling enabled: .. code-block:: bash launch_copper_frontier.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Discovery-mode run: .. code-block:: bash launch_copper_frontier.sh -l 4 -a discover -n cxi://cxi1 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Frontier notes: - ``cxi://cxi1`` is a common endpoint example - ``1,2`` or ``1,2,65,66`` are common Copper service-core examples Step 5: Run the Workload ------------------------ After Copper starts, run the workload against the Copper-mounted path. Aurora example: .. code-block:: bash time mpirun --np "${NRANKS}" --ppn "${RANKS_PER_NODE}" \ --cpu-bind=list:4:56:9:61:14:66:19:71:20:74:25:79 \ --genvall \ --genv=PYTHONPATH="${MY_COPPER_MOUNT}${PACKAGE_DIR}/:${PYTHONPATH}" \ python3 -c "import torch; print(torch.__file__);" Frontier example: .. code-block:: bash /usr/bin/time srun --overlap -N "${SLURM_NNODES}" -n $((SLURM_NNODES * 8)) \ --ntasks-per-node=8 --cpu-bind="${CPU_BINDING_MAP}" \ python3 -c "import torch; print(torch.__file__)" Step 6: Stop Copper ------------------- Aurora: .. code-block:: bash stop_copper_aurora.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Frontier: .. code-block:: bash stop_copper_frontier.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}" Important shutdown note: - run the matching ``stop_copper_aurora.sh`` or ``stop_copper_frontier.sh`` before attempting ``aggregate_profiling.py`` or ``path_usage_analysis.py`` - wait for Copper shutdown to complete on all nodes before running the analysis scripts - the final profiling CSV files under ``profiling/final/`` and the final raw table outputs under ``tables/final/`` may not exist until shutdown has completed - if analysis is attempted too early, the common failure mode is: ``failed to find profiling csv files under: .../profiling/final`` Step 7: Locate the Job Output Directory --------------------------------------- The post-processing scripts expect the Copper job root directory, which is the directory that contains: - ``logs/`` - ``tables/`` - ``profiling/`` Typical example: .. code-block:: text /path/to/copper-logs-dir/ Do not point the scripts at: - ``logs/`` directly - ``profiling/final/`` directly - ``profiling/cluster/`` directly Step 8: Run Aggregate Profiling ------------------------------- The aggregate profiling script combines per-rank profiling outputs into cluster-level summaries. Basic usage: .. code-block:: bash cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts python3 aggregate_profiling.py /path/to/copper-logs-dir/ With a custom output prefix: .. code-block:: bash python3 aggregate_profiling.py /path/to/copper-logs-dir/ cluster_usage_test With explicit usage-root analysis: .. code-block:: bash python3 aggregate_profiling.py \ /path/to/copper-logs-dir/ \ cluster_usage_test \ --usage-root /path/to/package-root Outputs are written under: - ``profiling/cluster/-profiling_aggregate.csv`` - ``profiling/cluster/-profiling_operations.csv`` - ``profiling/cluster/-profiling_summary.md`` If the script reports: ``failed to find profiling csv files under: ...`` the most common causes are: - the run was not launched with profiling enabled - the wrong job root directory was passed - Copper had not been stopped yet, so the final outputs had not been flushed - the run failed before profiling files were written Step 9: Inspect Cache Table Usage --------------------------------- The ioctl helper scripts can also capture current table occupancy for Copper's three main cache tables and summarize them into one combined report. Workflow: .. code-block:: bash cd /path/to/copper/scripts/get_copper_stats_ioctl # Step 1: add this to env.sh, or export it in the current shell export VIEW_DIR=/mnt/bb/${USER}/copper_mount # Step 2: while Copper is still running, collect the raw cache-size outputs bash ./get_cache_usage_summary.sh /lustre/orion/proj-shared/ums046/some_output_dir # Step 3: for offline re-summarization of an existing directory python3 ./summarize_cache_usage.py /path/to/dir --csv /path/to/dir/cache_usage_summary.csv Outputs include: - ``data_cache_size.output`` - ``tree_cache_size.output`` - ``md_cache_size.output`` - ``cache_usage_summary.txt`` - ``cache_usage_summary.csv`` Interpretation: - the reports show current used bytes and entry counts - the combined summary shows the total currently occupied bytes across all three tables - the tables are dynamically growing ``unordered_map`` containers rather than fixed-capacity pools - memory usage is bounded by normal process and node limits rather than by a Copper-specific global table budget - ``max_cacheable_byte_size`` is a per-file admission threshold for data caching, not a total cache budget - Copper does not yet report a fixed total budget or remaining available bytes for those tables Sample output: .. code-block:: md | Table | Files Found | Used Bytes | Used Human | Entries | | -------- | ----------: | ---------: | ---------: | ------: | | data | 1 | 70296558 | 67.04 MiB | 1025 | | tree | 1 | 27772 | 27.12 KiB | 140 | | metadata | 1 | 340848 | 332.86 KiB | 2367 | | combined | - | 70665178 | 67.39 MiB | 3532 | Step 10: Run Path Usage Analysis -------------------------------- The path-usage script compares observed paths against the selected filesystem roots and produces used-path, missing-probe, and candidate-not-observed lists. Basic usage: .. code-block:: bash cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts python3 path_usage_analysis.py /path/to/copper-logs-dir/ With an explicit package root: .. code-block:: bash python3 path_usage_analysis.py \ /path/to/copper-logs-dir/ \ --usage-root /path/to/package-root With a custom output directory: .. code-block:: bash python3 path_usage_analysis.py \ /path/to/copper-logs-dir/ \ --usage-root /path/to/package-root \ --output-dir /path/to/copper-logs-dir//paths_dir Outputs include: - ``all_possible_existing_paths.txt`` - ``used_paths_existing.txt`` - ``same_run_candidate_not_observed_existing_paths.txt`` - ``missing_probe_paths.txt`` - ``roots_and_counts.txt`` - ``paths_summary.md`` Sample Outputs from Aurora Run 8405873 -------------------------------------- The examples below come from: - ``/lus/flare/projects/datascience/kaushik/copper-tests/aurora_runs/scale_test/2_nodes/copper-logs-dir/8405873`` Startup log sample ^^^^^^^^^^^^^^^^^^ From ``logs/x4613c3s7b0n0-82226-output.log``: .. code-block:: text 2026-03-27 23-33-57.994 [Info] (cu_hello_main) user-facing log_level: 5 2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_metrics: enabled 2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_top_n: 20 2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_paths_full: enabled 2026-03-27 23-33-57.997 [Info] (start_thallium_engine) starting thallium engine 2026-03-27 23-33-58.078 [Info] (start_thallium_engine) engine started 2026-03-27 23-33-58.082 [Info] (start_thallium_engine) server running at address: ofi+cxi://0x09470400 2026-03-27 23-33-58.082 [Info] (start_thallium_engine) provider registration completed after 85096 us This startup view confirms: - the chosen log level - whether profiling was enabled - whether the server came up cleanly - the local provider registration time Cluster aggregate summary sample ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ From ``profiling/cluster/cluster_usage_test-profiling_summary.md``: .. code-block:: text Ranks with aggregate files: 4 total_cache_hits: 5984151 total_cache_misses: 3010 total_counted_fuse_operations: 6377844 total_measured_latency_seconds: 39.513989 total_negative_results: 168147 metadata_enoent_ttl_serves: 45424 metadata_enoent_ttl_stores: 2372 Selected operation totals from the same summary: .. code-block:: text getattr: 5952916 calls, 31.390461 s total latency read: 195912 calls, 7.008104 s total latency readdir: 6480 calls, 0.535788 s total latency This summary is useful for: - confirming that cluster aggregation succeeded - identifying the hottest operations by call count - identifying the operations that dominated measured latency - checking how heavily Copper served data from cache Path usage summary sample ^^^^^^^^^^^^^^^^^^^^^^^^^ From ``paths_dir/paths_summary.md``: .. code-block:: text All possible existing paths under the selected roots: 22383 Actually used in this run: 1729 Probed but absent: 555 Observed files: 1404 of 20625 (6.81%) Observed directories: 325 of 1758 (18.49%) This path summary supports a staged interpretation: - the selected environment tree was much larger than the set touched by the run - file coverage for this run was 6.81 percent, not full-tree usage - missing probe paths are normal loader or Python search misses and should not be confused with active deletion Path class sample ^^^^^^^^^^^^^^^^^ Also from ``profiling/cluster/cluster_usage_test-profiling_summary.md``: .. code-block:: text environment_prefix: 5075764 events, example /lus shared_library: 126484 events, example .../nvidia/cu13/lib/libcublasLt.so.13 negative_probe_path: 19452 events, example .../aurora_runs/nvidia These classes are useful when planning cleanup experiments: - keep the active environment core paths first - review repeated negative probe paths for stale search-path entries - use the observed set as an allowlist candidate for a cloned environment, rather than as proof that unseen files are permanently unnecessary Practical Reading Order ----------------------- If the goal is fast post-run interpretation, the simplest reading order is: #. check the Copper node logs in ``logs/`` for startup and readiness timing #. check ``profiling/cluster/*.md`` for aggregate workload shape #. check ``profiling/cluster/*.csv`` for totals and per-operation counts #. check ``paths_summary.md`` when pruning or environment minimization is the next question Related Pages ------------- - :doc:`guide_aurora_and_frontier` - :doc:`guide_build_and_run` - :doc:`operations_profiling_overview` - :doc:`operations_profiling_reference` - :doc:`operations_environment_path_analysis`