Launch and Analysis Runbook

Purpose

This page is a step-by-step runbook for common Copper launch and post-run analysis tasks. It is intentionally procedural, even where some of the same material appears elsewhere in the documentation set.

Use this page when the goal is to:

  • launch Copper with basic runtime settings

  • enable startup timing visibility

  • enable profiling outputs

  • choose between facility and discovery address-book preparation

  • run the post-processing scripts after a job completes

Step 1: Choose a Log Level

Copper logging is controlled with -l in the platform launch wrappers and with -log_level in direct cu_fuse runs.

The current user-facing levels are:

Level

Meaning

Typical use

-l 0

no logging

quiet production runs

-l 1

fatal only

crash-only visibility

-l 2

error and fatal

error-focused troubleshooting

-l 3

warning, error, and fatal

warning-focused troubleshooting

-l 4

info, warning, error, and fatal

startup timing and readiness studies

-l 5

debug-heavy / most logging

deep debugging

If the goal is to retain the compact startup timing lines:

  • provider registration completed after <us>

  • first successful parent rpc_... completed after <us> since provider startup

use at least:

  • -l 4

Step 2: Choose an Address-Book Source

Copper supports two address-book source modes in the launch wrappers.

facility

Filters a provided facility address book down to the current allocation.

discover

Runs the discovery helper across the allocation, preserves the raw output, and derives the final copper_address_book.txt from the selected network column.

Practical guidance:

  • use facility when the site address book is trusted and current

  • use discover when a fresh allocation-derived mapping is needed

  • when using discover, keep both: - logs/copper_address_book.txt - logs/copper_address_book_full_output.txt

Step 3: Choose Profiling Options

Copper profiling can be enabled with any combination of the following options:

Wrapper option

Effect

-P

enable profiling collection

-N <N>

enable profiling and keep the hottest N paths

-A

enable fuller path-level outputs

-I <seconds>

enable periodic profiling snapshots while the job is still running

The corresponding direct cu_fuse options are:

  • -profile_metrics

  • -profile_top_n <N>

  • -profile_paths_full

  • -profile_snapshot_interval_s <seconds>

Common combinations:

  • no profiling: - no -P, -N, -A, or -I

  • aggregate profiling: - -P

  • bounded hotspot profiling: - -P -N 20

  • full-path forensic profiling: - -P -N 20 -A

Step 4: Launch Copper

Aurora

Basic run with startup timing visibility:

launch_copper_aurora.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Facility-mode run with profiling enabled:

launch_copper_aurora.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Discovery-mode run:

launch_copper_aurora.sh -l 4 -a discover -n cxi -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Aurora notes:

  • cxi is the common endpoint family in the Aurora wrappers

  • 48,49,50,51 is the common Copper service-core example

  • -l 4 is the recommended setting when startup timing matters

Frontier

Basic run with startup timing visibility:

launch_copper_frontier.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Facility-mode run with profiling enabled:

launch_copper_frontier.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Discovery-mode run:

launch_copper_frontier.sh -l 4 -a discover -n cxi://cxi1 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Frontier notes:

  • cxi://cxi1 is a common endpoint example

  • 1,2 or 1,2,65,66 are common Copper service-core examples

Step 5: Run the Workload

After Copper starts, run the workload against the Copper-mounted path.

Aurora example:

time mpirun --np "${NRANKS}" --ppn "${RANKS_PER_NODE}" \
  --cpu-bind=list:4:56:9:61:14:66:19:71:20:74:25:79 \
  --genvall \
  --genv=PYTHONPATH="${MY_COPPER_MOUNT}${PACKAGE_DIR}/:${PYTHONPATH}" \
  python3 -c "import torch; print(torch.__file__);"

Frontier example:

/usr/bin/time srun --overlap -N "${SLURM_NNODES}" -n $((SLURM_NNODES * 8)) \
  --ntasks-per-node=8 --cpu-bind="${CPU_BINDING_MAP}" \
  python3 -c "import torch; print(torch.__file__)"

Step 6: Stop Copper

Aurora:

stop_copper_aurora.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Frontier:

stop_copper_frontier.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Important shutdown note:

  • run the matching stop_copper_aurora.sh or stop_copper_frontier.sh before attempting aggregate_profiling.py or path_usage_analysis.py

  • wait for Copper shutdown to complete on all nodes before running the analysis scripts

  • the final profiling CSV files under profiling/final/ and the final raw table outputs under tables/final/ may not exist until shutdown has completed

  • if analysis is attempted too early, the common failure mode is: failed to find profiling csv files under: .../profiling/final

Step 7: Locate the Job Output Directory

The post-processing scripts expect the Copper job root directory, which is the directory that contains:

  • logs/

  • tables/

  • profiling/

Typical example:

/path/to/copper-logs-dir/<jobid>

Do not point the scripts at:

  • logs/ directly

  • profiling/final/ directly

  • profiling/cluster/ directly

Step 8: Run Aggregate Profiling

The aggregate profiling script combines per-rank profiling outputs into cluster-level summaries.

Basic usage:

cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts
python3 aggregate_profiling.py /path/to/copper-logs-dir/<jobid>

With a custom output prefix:

python3 aggregate_profiling.py /path/to/copper-logs-dir/<jobid> cluster_usage_test

With explicit usage-root analysis:

python3 aggregate_profiling.py \
  /path/to/copper-logs-dir/<jobid> \
  cluster_usage_test \
  --usage-root /path/to/package-root

Outputs are written under:

  • profiling/cluster/<prefix>-profiling_aggregate.csv

  • profiling/cluster/<prefix>-profiling_operations.csv

  • profiling/cluster/<prefix>-profiling_summary.md

If the script reports:

failed to find profiling csv files under: ...

the most common causes are:

  • the run was not launched with profiling enabled

  • the wrong job root directory was passed

  • Copper had not been stopped yet, so the final outputs had not been flushed

  • the run failed before profiling files were written

Step 9: Inspect Cache Table Usage

The ioctl helper scripts can also capture current table occupancy for Copper’s three main cache tables and summarize them into one combined report.

Workflow:

cd /path/to/copper/scripts/get_copper_stats_ioctl

# Step 1: add this to env.sh, or export it in the current shell
export VIEW_DIR=/mnt/bb/${USER}/copper_mount

# Step 2: while Copper is still running, collect the raw cache-size outputs
bash ./get_cache_usage_summary.sh /lustre/orion/proj-shared/ums046/some_output_dir

# Step 3: for offline re-summarization of an existing directory
python3 ./summarize_cache_usage.py /path/to/dir --csv /path/to/dir/cache_usage_summary.csv

Outputs include:

  • data_cache_size.output

  • tree_cache_size.output

  • md_cache_size.output

  • cache_usage_summary.txt

  • cache_usage_summary.csv

Interpretation:

  • the reports show current used bytes and entry counts

  • the combined summary shows the total currently occupied bytes across all three tables

  • the tables are dynamically growing unordered_map containers rather than fixed-capacity pools

  • memory usage is bounded by normal process and node limits rather than by a Copper-specific global table budget

  • max_cacheable_byte_size is a per-file admission threshold for data caching, not a total cache budget

  • Copper does not yet report a fixed total budget or remaining available bytes for those tables

Sample output:

| Table    | Files Found | Used Bytes | Used Human | Entries |
| -------- | ----------: | ---------: | ---------: | ------: |
| data     | 1           | 70296558   | 67.04 MiB  | 1025    |
| tree     | 1           | 27772      | 27.12 KiB  | 140     |
| metadata | 1           | 340848     | 332.86 KiB | 2367    |
| combined | -           | 70665178   | 67.39 MiB  | 3532    |

Step 10: Run Path Usage Analysis

The path-usage script compares observed paths against the selected filesystem roots and produces used-path, missing-probe, and candidate-not-observed lists.

Basic usage:

cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts
python3 path_usage_analysis.py /path/to/copper-logs-dir/<jobid>

With an explicit package root:

python3 path_usage_analysis.py \
  /path/to/copper-logs-dir/<jobid> \
  --usage-root /path/to/package-root

With a custom output directory:

python3 path_usage_analysis.py \
  /path/to/copper-logs-dir/<jobid> \
  --usage-root /path/to/package-root \
  --output-dir /path/to/copper-logs-dir/<jobid>/paths_dir

Outputs include:

  • all_possible_existing_paths.txt

  • used_paths_existing.txt

  • same_run_candidate_not_observed_existing_paths.txt

  • missing_probe_paths.txt

  • roots_and_counts.txt

  • paths_summary.md

Sample Outputs from Aurora Run 8405873

The examples below come from:

  • /lus/flare/projects/datascience/kaushik/copper-tests/aurora_runs/scale_test/2_nodes/copper-logs-dir/8405873

Startup log sample

From logs/x4613c3s7b0n0-82226-output.log:

2026-03-27 23-33-57.994 [Info] (cu_hello_main) user-facing log_level: 5
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_metrics: enabled
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_top_n: 20
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_paths_full: enabled
2026-03-27 23-33-57.997 [Info] (start_thallium_engine) starting thallium engine
2026-03-27 23-33-58.078 [Info] (start_thallium_engine) engine started
2026-03-27 23-33-58.082 [Info] (start_thallium_engine) server running at address: ofi+cxi://0x09470400
2026-03-27 23-33-58.082 [Info] (start_thallium_engine) provider registration completed after 85096 us

This startup view confirms:

  • the chosen log level

  • whether profiling was enabled

  • whether the server came up cleanly

  • the local provider registration time

Cluster aggregate summary sample

From profiling/cluster/cluster_usage_test-profiling_summary.md:

Ranks with aggregate files: 4
total_cache_hits: 5984151
total_cache_misses: 3010
total_counted_fuse_operations: 6377844
total_measured_latency_seconds: 39.513989
total_negative_results: 168147
metadata_enoent_ttl_serves: 45424
metadata_enoent_ttl_stores: 2372

Selected operation totals from the same summary:

getattr: 5952916 calls, 31.390461 s total latency
read: 195912 calls, 7.008104 s total latency
readdir: 6480 calls, 0.535788 s total latency

This summary is useful for:

  • confirming that cluster aggregation succeeded

  • identifying the hottest operations by call count

  • identifying the operations that dominated measured latency

  • checking how heavily Copper served data from cache

Path usage summary sample

From paths_dir/paths_summary.md:

All possible existing paths under the selected roots: 22383
Actually used in this run: 1729
Probed but absent: 555
Observed files: 1404 of 20625 (6.81%)
Observed directories: 325 of 1758 (18.49%)

This path summary supports a staged interpretation:

  • the selected environment tree was much larger than the set touched by the run

  • file coverage for this run was 6.81 percent, not full-tree usage

  • missing probe paths are normal loader or Python search misses and should not be confused with active deletion

Path class sample

Also from profiling/cluster/cluster_usage_test-profiling_summary.md:

environment_prefix: 5075764 events, example /lus
shared_library: 126484 events, example .../nvidia/cu13/lib/libcublasLt.so.13
negative_probe_path: 19452 events, example .../aurora_runs/nvidia

These classes are useful when planning cleanup experiments:

  • keep the active environment core paths first

  • review repeated negative probe paths for stale search-path entries

  • use the observed set as an allowlist candidate for a cloned environment, rather than as proof that unseen files are permanently unnecessary

Practical Reading Order

If the goal is fast post-run interpretation, the simplest reading order is:

  1. check the Copper node logs in logs/ for startup and readiness timing

  2. check profiling/cluster/*.md for aggregate workload shape

  3. check profiling/cluster/*.csv for totals and per-operation counts

  4. check paths_summary.md when pruning or environment minimization is the next question