Launch and Analysis Runbook

Purpose

This page is a step-by-step runbook for common Copper launch and post-run analysis tasks. It is intentionally procedural, even where some of the same material appears elsewhere in the documentation set.

Use this page when the goal is to:

launch Copper with basic runtime settings
enable startup timing visibility
enable profiling outputs
choose between facility and discovery address-book preparation
run the post-processing scripts after a job completes

Step 1: Choose a Log Level

Copper logging is controlled with -l in the platform launch wrappers and with -log_level in direct cu_fuse runs.

The current user-facing levels are:

| Level  | Meaning                         | Typical use                         |
| ------ | ------------------------------- | ----------------------------------- |
| `-l 0` | no logging                      | quiet production runs               |
| `-l 1` | fatal only                      | crash-only visibility               |
| `-l 2` | error and fatal                 | error-focused troubleshooting       |
| `-l 3` | warning, error, and fatal       | warning-focused troubleshooting     |
| `-l 4` | info, warning, error, and fatal | startup timing and readiness studies |
| `-l 5` | debug-heavy / most logging      | deep debugging                      |
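
The table reads as a cumulative severity ladder. As a minimal sketch (the helper name `min_level_for` is hypothetical and not part of Copper, and treating level 5 as the level that adds debug output is an assumption based on its "debug-heavy" description), the smallest `-l` value that retains a given severity can be looked up like this:

```python
# Severities retained at each user-facing level, transcribed from the table above.
# Assumption: level 5 ("debug-heavy / most logging") is the level that adds debug.
LEVEL_SEVERITIES = {
    0: set(),
    1: {"fatal"},
    2: {"error", "fatal"},
    3: {"warning", "error", "fatal"},
    4: {"info", "warning", "error", "fatal"},
    5: {"debug", "info", "warning", "error", "fatal"},
}


def min_level_for(severity):
    """Return the smallest -l value whose output includes the given severity."""
    for level in sorted(LEVEL_SEVERITIES):
        if severity in LEVEL_SEVERITIES[level]:
            return level
    raise ValueError(f"unknown severity: {severity!r}")


# The startup timing lines are logged at [Info], so this prints the level
# needed to keep them.
print(min_level_for("info"))
```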

If the goal is to retain the compact startup timing lines:

provider registration completed after <us>
first successful parent rpc_... completed after <us> since provider startup

use at least:

-l 4

Step 2: Choose an Address-Book Source

Copper supports two address-book source modes in the launch wrappers.

facility: Filters a provided facility address book down to the current allocation.
discover: Runs the discovery helper across the allocation, preserves the raw output, and derives the final copper_address_book.txt from the selected network column.

Practical guidance:

use facility when the site address book is trusted and current
use discover when a fresh allocation-derived mapping is needed
when using discover, keep both:
  - logs/copper_address_book.txt
  - logs/copper_address_book_full_output.txt

Step 3: Choose Profiling Options

Copper profiling can be enabled with any combination of the following options:

| Wrapper option | Effect |
| -------------- | ------ |
| `-P` | enable profiling collection |
| `-N <N>` | enable profiling and keep the hottest `N` paths |
| `-A` | enable fuller path-level outputs |
| `-I <seconds>` | enable periodic profiling snapshots while the job is still running |

The corresponding direct cu_fuse options are:

-profile_metrics
-profile_top_n <N>
-profile_paths_full
-profile_snapshot_interval_s <seconds>

Common combinations:

no profiling: omit -P, -N, -A, and -I
aggregate profiling: -P
bounded hotspot profiling: -P -N 20
full-path forensic profiling: -P -N 20 -A
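
The relationship between the wrapper options and the direct cu_fuse options can be sketched as a small translation table. This is an illustration only: the pairing is inferred from the order in which the two lists appear above, the function name `translate_profiling_args` is hypothetical, and the launch wrappers may perform the translation differently.

```python
# Wrapper option -> (direct cu_fuse option, takes a value?).
# Pairing inferred from the order of the two lists in this step (assumption).
WRAPPER_TO_CU_FUSE = {
    "-P": ("-profile_metrics", False),
    "-N": ("-profile_top_n", True),
    "-A": ("-profile_paths_full", False),
    "-I": ("-profile_snapshot_interval_s", True),
}


def translate_profiling_args(wrapper_args):
    """Translate wrapper profiling options into the matching cu_fuse options."""
    out = []
    i = 0
    while i < len(wrapper_args):
        flag = wrapper_args[i]
        if flag not in WRAPPER_TO_CU_FUSE:
            raise ValueError(f"not a profiling option: {flag}")
        target, takes_value = WRAPPER_TO_CU_FUSE[flag]
        out.append(target)
        if takes_value:
            if i + 1 >= len(wrapper_args):
                raise ValueError(f"{flag} requires a value")
            out.append(wrapper_args[i + 1])
            i += 1  # skip the value that was just consumed
        i += 1
    return out


# The "full-path forensic profiling" combination from the list above.
print(" ".join(translate_profiling_args(["-P", "-N", "20", "-A"])))
```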

Step 4: Launch Copper

Aurora

Basic run with startup timing visibility:

launch_copper_aurora.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Facility-mode run with profiling enabled:

launch_copper_aurora.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Discovery-mode run:

launch_copper_aurora.sh -l 4 -a discover -n cxi -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Aurora notes:

cxi is the common endpoint family in the Aurora wrappers
48,49,50,51 is the common Copper service-core example
-l 4 is the recommended setting when startup timing matters

Frontier

Basic run with startup timing visibility:

launch_copper_frontier.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Facility-mode run with profiling enabled:

launch_copper_frontier.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Discovery-mode run:

launch_copper_frontier.sh -l 4 -a discover -n cxi://cxi1 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Frontier notes:

cxi://cxi1 is a common endpoint example
1,2 or 1,2,65,66 are common Copper service-core examples
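
When launches are scripted, the wrapper command lines shown above can be assembled programmatically. The sketch below uses only the wrapper names and options that appear in this step; the function `build_launch_command` and its parameter names are hypothetical, and it only builds the argument list rather than running anything.

```python
import shlex


def build_launch_command(platform, logdir, mount, log_level=4,
                         address_book=None, network=None,
                         profiling=False, top_n=None, full_paths=False):
    """Build the argument list for launch_copper_<platform>.sh."""
    if platform not in ("aurora", "frontier"):
        raise ValueError(f"unsupported platform: {platform}")
    cmd = [f"launch_copper_{platform}.sh", "-l", str(log_level)]
    if profiling:
        cmd.append("-P")
    if top_n is not None:
        cmd += ["-N", str(top_n)]
    if full_paths:
        cmd.append("-A")
    # Only "-a discover" is shown in this runbook; leaving address_book unset
    # reproduces the facility-mode examples, which pass no -a option.
    if address_book is not None:
        cmd += ["-a", address_book]
    if network is not None:
        cmd += ["-n", network]
    cmd += ["-d", logdir, "-v", mount]
    return cmd


# Reproduces the Frontier discovery-mode example with placeholder directories.
cmd = build_launch_command("frontier", "/tmp/copper-logs", "/tmp/copper-mount",
                           address_book="discover", network="cxi://cxi1")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems in directory names.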

Step 5: Run the Workload

After Copper starts, run the workload against the Copper-mounted path.

Aurora example:

time mpirun --np "${NRANKS}" --ppn "${RANKS_PER_NODE}" \
  --cpu-bind=list:4:56:9:61:14:66:19:71:20:74:25:79 \
  --genvall \
  --genv=PYTHONPATH="${MY_COPPER_MOUNT}${PACKAGE_DIR}/:${PYTHONPATH}" \
  python3 -c "import torch; print(torch.__file__);"

Frontier example:

/usr/bin/time srun --overlap -N "${SLURM_NNODES}" -n $((SLURM_NNODES * 8)) \
  --ntasks-per-node=8 --cpu-bind="${CPU_BINDING_MAP}" \
  python3 -c "import torch; print(torch.__file__)"

Step 6: Stop Copper

Aurora:

stop_copper_aurora.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Frontier:

stop_copper_frontier.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"

Important shutdown note:

run the matching stop_copper_aurora.sh or stop_copper_frontier.sh before attempting aggregate_profiling.py or path_usage_analysis.py
wait for Copper shutdown to complete on all nodes before running the analysis scripts
the final profiling CSV files under profiling/final/ and the final raw table outputs under tables/final/ may not exist until shutdown has completed
if analysis is attempted too early, the common failure mode is: failed to find profiling csv files under: .../profiling/final
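
One way to avoid the early-analysis failure is to poll for the final CSV files before invoking the analysis scripts. The following is a minimal sketch, assuming the final files carry a `.csv` extension directly under `profiling/final/`; the function names and the timeout values are hypothetical. Finding files does not by itself prove that every node has finished shutting down, so this complements rather than replaces waiting for the stop script to complete.

```python
import time
from pathlib import Path


def final_profiling_csvs(job_root):
    """List CSV files under <job_root>/profiling/final (may be empty)."""
    return sorted((Path(job_root) / "profiling" / "final").glob("*.csv"))


def wait_for_final_csvs(job_root, timeout_s=300.0, poll_s=5.0):
    """Poll until at least one final profiling CSV exists, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        files = final_profiling_csvs(job_root)
        if files:
            return files
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no profiling csv files under {Path(job_root) / 'profiling' / 'final'}"
            )
        time.sleep(poll_s)
```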

Step 7: Locate the Job Output Directory

The post-processing scripts expect the Copper job root directory, which is the directory that contains:

logs/
tables/
profiling/

Typical example:

/path/to/copper-logs-dir/<jobid>

Do not point the scripts at:

logs/ directly
profiling/final/ directly
profiling/cluster/ directly
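
The directory rule above is easy to check before running either script. A minimal sketch, with a hypothetical helper name, that reports which of the three expected subdirectories are missing from a candidate job root:

```python
from pathlib import Path

# Subdirectories that identify a Copper job root, as listed in this step.
REQUIRED_SUBDIRS = ("logs", "tables", "profiling")


def missing_job_root_subdirs(path):
    """Return the required subdirectories that are absent under the given path."""
    root = Path(path)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]
```

An empty result means the path looks like a job root. A result that lists all three names often means a subdirectory such as logs/ or profiling/final/ was passed instead of its parent.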

Step 8: Run Aggregate Profiling

The aggregate profiling script combines per-rank profiling outputs into cluster-level summaries.

Basic usage:

cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts
python3 aggregate_profiling.py /path/to/copper-logs-dir/<jobid>

With a custom output prefix:

python3 aggregate_profiling.py /path/to/copper-logs-dir/<jobid> cluster_usage_test

With explicit usage-root analysis:

python3 aggregate_profiling.py \
  /path/to/copper-logs-dir/<jobid> \
  cluster_usage_test \
  --usage-root /path/to/package-root

Outputs are written under:

profiling/cluster/<prefix>-profiling_aggregate.csv
profiling/cluster/<prefix>-profiling_operations.csv
profiling/cluster/<prefix>-profiling_summary.md

If the script reports:

failed to find profiling csv files under: ...

the most common causes are:

the run was not launched with profiling enabled
the wrong job root directory was passed
Copper had not been stopped yet, so the final outputs had not been flushed
the run failed before profiling files were written

Step 9: Inspect Cache Table Usage

The ioctl helper scripts can also capture current table occupancy for Copper’s three main cache tables and summarize them into one combined report.

Workflow:

cd /path/to/copper/scripts/get_copper_stats_ioctl

# Step 1: add this to env.sh, or export it in the current shell
export VIEW_DIR=/mnt/bb/${USER}/copper_mount

# Step 2: while Copper is still running, collect the raw cache-size outputs
bash ./get_cache_usage_summary.sh /lustre/orion/proj-shared/ums046/some_output_dir

# Step 3: for offline re-summarization of an existing directory
python3 ./summarize_cache_usage.py /path/to/dir --csv /path/to/dir/cache_usage_summary.csv

Outputs include:

data_cache_size.output
tree_cache_size.output
md_cache_size.output
cache_usage_summary.txt
cache_usage_summary.csv

Interpretation:

the reports show current used bytes and entry counts
the combined summary shows the total currently occupied bytes across all three tables
the tables are dynamically growing unordered_map containers rather than fixed-capacity pools
memory usage is bounded by normal process and node limits rather than by a Copper-specific global table budget
max_cacheable_byte_size is a per-file admission threshold for data caching, not a total cache budget
Copper does not yet report a fixed total budget or remaining available bytes for those tables

Sample output:

| Table    | Files Found | Used Bytes | Used Human | Entries |
| -------- | ----------: | ---------: | ---------: | ------: |
| data     | 1           | 70296558   | 67.04 MiB  | 1025    |
| tree     | 1           | 27772      | 27.12 KiB  | 140     |
| metadata | 1           | 340848     | 332.86 KiB | 2367    |
| combined | -           | 70665178   | 67.39 MiB  | 3532    |
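
The combined row is the sum of the three per-table rows, and the human-readable column uses binary units. The short sketch below recomputes the sample values; it illustrates the arithmetic only and is not the code in summarize_cache_usage.py.

```python
def human_bytes(n):
    """Format a byte count with binary units (KiB, MiB, ...) and two decimals."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


# (used bytes, entries) per table, copied from the sample output above.
tables = {
    "data": (70296558, 1025),
    "tree": (27772, 140),
    "metadata": (340848, 2367),
}

combined_bytes = sum(used for used, _ in tables.values())
combined_entries = sum(entries for _, entries in tables.values())

for name, (used, entries) in tables.items():
    print(f"{name:9s} {used:>10d} {human_bytes(used):>11s} {entries:>6d}")
print(f"{'combined':9s} {combined_bytes:>10d} {human_bytes(combined_bytes):>11s} {combined_entries:>6d}")
```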

Step 10: Run Path Usage Analysis

The path-usage script compares observed paths against the selected filesystem roots and produces used-path, missing-probe, and candidate-not-observed lists.

Basic usage:

cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts
python3 path_usage_analysis.py /path/to/copper-logs-dir/<jobid>

With an explicit package root:

python3 path_usage_analysis.py \
  /path/to/copper-logs-dir/<jobid> \
  --usage-root /path/to/package-root

With a custom output directory:

python3 path_usage_analysis.py \
  /path/to/copper-logs-dir/<jobid> \
  --usage-root /path/to/package-root \
  --output-dir /path/to/copper-logs-dir/<jobid>/paths_dir

Outputs include:

all_possible_existing_paths.txt
used_paths_existing.txt
same_run_candidate_not_observed_existing_paths.txt
missing_probe_paths.txt
roots_and_counts.txt
paths_summary.md
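
Conceptually, the three path lists are set operations between the paths that exist under the selected roots and the paths observed during the run. The sketch below illustrates that idea with made-up paths; it is not the implementation of path_usage_analysis.py, and the mapping of each set to an output file is an assumption based on the file names.

```python
def classify_paths(existing, observed):
    """Split paths into used, missing-probe, and candidate-not-observed sets."""
    existing, observed = set(existing), set(observed)
    return {
        # observed during the run and present on disk
        "used": sorted(observed & existing),
        # probed during the run but absent on disk
        "missing_probe": sorted(observed - existing),
        # present on disk but never touched by the run
        "candidate_not_observed": sorted(existing - observed),
    }


# Hypothetical example paths, for illustration only.
existing = ["/env/lib/a.so", "/env/lib/b.so", "/env/site/mod.py"]
observed = ["/env/lib/a.so", "/env/site/mod.py", "/env/site/mod.pyc"]

for label, paths in classify_paths(existing, observed).items():
    print(label, paths)
```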

Sample Outputs from Aurora Run 8405873

The examples below come from:

/lus/flare/projects/datascience/kaushik/copper-tests/aurora_runs/scale_test/2_nodes/copper-logs-dir/8405873

Startup log sample

From logs/x4613c3s7b0n0-82226-output.log:

2026-03-27 23-33-57.994 [Info] (cu_hello_main) user-facing log_level: 5
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_metrics: enabled
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_top_n: 20
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_paths_full: enabled
2026-03-27 23-33-57.997 [Info] (start_thallium_engine) starting thallium engine
2026-03-27 23-33-58.078 [Info] (start_thallium_engine) engine started
2026-03-27 23-33-58.082 [Info] (start_thallium_engine) server running at address: ofi+cxi://0x09470400
2026-03-27 23-33-58.082 [Info] (start_thallium_engine) provider registration completed after 85096 us

This startup view confirms:

the chosen log level
whether profiling was enabled
whether the server came up cleanly
the local provider registration time
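
For timing studies across many nodes, the registration time can be pulled out of each node log with a regular expression. This is a minimal sketch that matches the line format shown in the sample above; the function name is hypothetical, and reading the log files from logs/ is left to the caller.

```python
import re

# Matches the compact timing line as it appears in the sample log.
REGISTRATION_RE = re.compile(r"provider registration completed after (\d+) us")


def provider_registration_us(log_lines):
    """Return the provider registration time in microseconds, or None if absent."""
    for line in log_lines:
        match = REGISTRATION_RE.search(line)
        if match:
            return int(match.group(1))
    return None


sample = [
    "2026-03-27 23-33-58.082 [Info] (start_thallium_engine) server running at address: ofi+cxi://0x09470400",
    "2026-03-27 23-33-58.082 [Info] (start_thallium_engine) provider registration completed after 85096 us",
]

us = provider_registration_us(sample)
print(f"{us} us = {us / 1000:.1f} ms")
```

A return value of None usually means the run used a log level below 4, so the [Info] timing line was never written.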

Cluster aggregate summary sample

From profiling/cluster/cluster_usage_test-profiling_summary.md:

Ranks with aggregate files: 4
total_cache_hits: 5984151
total_cache_misses: 3010
total_counted_fuse_operations: 6377844
total_measured_latency_seconds: 39.513989
total_negative_results: 168147
metadata_enoent_ttl_serves: 45424
metadata_enoent_ttl_stores: 2372

Selected operation totals from the same summary:

getattr: 5952916 calls, 31.390461 s total latency
read: 195912 calls, 7.008104 s total latency
readdir: 6480 calls, 0.535788 s total latency

This summary is useful for:

confirming that cluster aggregation succeeded
identifying the hottest operations by call count
identifying the operations that dominated measured latency
checking how heavily Copper served data from cache
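
Two derived figures make the summary easier to compare across runs: the cache hit ratio and the mean latency per call. The sketch below computes both from the sample values above. It assumes that hits and misses count the same population of lookups, which the summary does not state explicitly, and a mean hides the latency distribution.

```python
# Values copied from the cluster aggregate summary sample above.
cache_hits = 5984151
cache_misses = 3010
total_latency_s = 39.513989

# operation -> (calls, total latency in seconds)
operations = {
    "getattr": (5952916, 31.390461),
    "read": (195912, 7.008104),
    "readdir": (6480, 0.535788),
}

hit_ratio = cache_hits / (cache_hits + cache_misses)
print(f"cache hit ratio: {hit_ratio:.4%}")

mean_latency_us = {}
latency_share = {}
for name, (calls, total_s) in operations.items():
    mean_latency_us[name] = total_s / calls * 1e6
    latency_share[name] = total_s / total_latency_s
    print(f"{name}: {mean_latency_us[name]:.2f} us/call, "
          f"{latency_share[name]:.1%} of measured latency")
```

Read this way, getattr dominates both call count and total latency in the sample, while its per-call cost is the lowest of the three operations.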

Path usage summary sample

From paths_dir/paths_summary.md:

All possible existing paths under the selected roots: 22383
Actually used in this run: 1729
Probed but absent: 555
Observed files: 1404 of 20625 (6.81%)
Observed directories: 325 of 1758 (18.49%)

This path summary supports a staged interpretation:

the selected environment tree was much larger than the set touched by the run
file coverage for this run was 6.81 percent, not full-tree usage
missing probe paths are normal loader or Python search misses and should not be confused with active deletion
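
The percentages in the summary follow directly from the counts, and the counts are internally consistent. A short check using the sample values:

```python
# Counts copied from the path usage summary sample above.
observed_files, total_files = 1404, 20625
observed_dirs, total_dirs = 325, 1758
used_paths, all_existing_paths = 1729, 22383

file_coverage = observed_files / total_files
dir_coverage = observed_dirs / total_dirs
overall_coverage = used_paths / all_existing_paths

# Files plus directories account for every used and every existing path.
assert observed_files + observed_dirs == used_paths
assert total_files + total_dirs == all_existing_paths

print(f"files:       {file_coverage:.2%}")
print(f"directories: {dir_coverage:.2%}")
print(f"overall:     {overall_coverage:.2%}")
```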

Path class sample

Also from profiling/cluster/cluster_usage_test-profiling_summary.md:

environment_prefix: 5075764 events, example /lus
shared_library: 126484 events, example .../nvidia/cu13/lib/libcublasLt.so.13
negative_probe_path: 19452 events, example .../aurora_runs/nvidia

These classes are useful when planning cleanup experiments:

keep the active environment core paths first
review repeated negative probe paths for stale search-path entries
use the observed set as an allowlist candidate for a cloned environment, rather than as proof that unseen files are permanently unnecessary

Practical Reading Order

If the goal is fast post-run interpretation, the simplest reading order is:

check the Copper node logs in logs/ for startup and readiness timing
check profiling/cluster/*.md for aggregate workload shape
check profiling/cluster/*.csv for totals and per-operation counts
check paths_summary.md when pruning or environment minimization is the next question