Launch and Analysis Runbook
Purpose
This page is a step-by-step runbook for common Copper launch and post-run analysis tasks. It is intentionally procedural, even where some of the same material appears elsewhere in the documentation set.
Use this page when the goal is to:
launch Copper with basic runtime settings
enable startup timing visibility
enable profiling outputs
choose between facility and discovery address-book preparation
run the post-processing scripts after a job completes
Step 1: Choose a Log Level
Copper logging is controlled with -l in the platform launch wrappers and
with -log_level in direct cu_fuse runs.
The current user-facing levels are:
| Level | Meaning | Typical use |
|---|---|---|
| 0 | no logging | quiet production runs |
| 1 | fatal only | crash-only visibility |
| 2 | error and fatal | error-focused troubleshooting |
| 3 | warning, error, and fatal | warning-focused troubleshooting |
| 4 | info, warning, error, and fatal | startup timing and readiness studies |
| 5 | debug-heavy / most logging | deep debugging |
If the goal is to retain the compact startup timing lines:
provider registration completed after <us>
first successful parent rpc_... completed after <us> since provider startup
use at least:
-l 4
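To confirm that a run retained the timing lines, a node log can be scanned for them. The sketch below is a minimal example, assuming the line format matches the startup log sample later on this page (a microsecond count followed by `us`). The name `find_registration_us` is hypothetical and not part of Copper.

```python
import re

# Assumed format, taken from the startup log sample on this page:
#   "... provider registration completed after 85096 us"
REGISTRATION_RE = re.compile(r"provider registration completed after (\d+) us")


def find_registration_us(log_text):
    """Return every provider registration time (in microseconds) found in log_text."""
    return [int(match.group(1)) for match in REGISTRATION_RE.finditer(log_text)]


sample = (
    "2026-03-27 23-33-58.082 [Info] (start_thallium_engine) "
    "provider registration completed after 85096 us\n"
)
print(find_registration_us(sample))  # [85096]
```

An empty result for a log that should contain the line usually means the run used a log level below 4.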
Step 2: Choose an Address-Book Source
Copper supports two address-book source modes in the launch wrappers.
facility: filters a provided facility address book down to the current allocation.
discover: runs the discovery helper across the allocation, preserves the raw output, and derives the final copper_address_book.txt from the selected network column.
Practical guidance:
use facility when the site address book is trusted and current
use discover when a fresh allocation-derived mapping is needed
when using discover, keep both:
- logs/copper_address_book.txt
- logs/copper_address_book_full_output.txt
Step 3: Choose Profiling Options
Copper profiling can be enabled with any combination of the following options:
| Wrapper option | Effect |
|---|---|
| -P | enable profiling collection |
| -N <N> | enable profiling and keep the hottest <N> |
| -A | enable fuller path-level outputs |
| -I <seconds> | enable periodic profiling snapshots while the job is still running |
The corresponding direct cu_fuse options are:
-profile_metrics
-profile_top_n <N>
-profile_paths_full
-profile_snapshot_interval_s <seconds>
Common combinations:
no profiling: no -P, -N, -A, or -I
aggregate profiling: -P
bounded hotspot profiling: -P -N 20
full-path forensic profiling: -P -N 20 -A
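For scripted launches, the combinations above can be generated from a few settings. The sketch below is illustrative only: the function `profiling_flags` and its parameter names are hypothetical, and it assumes that `-I` takes a value in seconds, mirroring `-profile_snapshot_interval_s <seconds>`.

```python
def profiling_flags(enabled=False, top_n=None, full_paths=False,
                    snapshot_interval_s=None):
    """Build the wrapper profiling flags (-P, -N, -A, -I) as an argument list."""
    flags = []
    if enabled:
        flags.append("-P")
    if top_n is not None:
        flags += ["-N", str(top_n)]
    if full_paths:
        flags.append("-A")
    if snapshot_interval_s is not None:
        # Assumption: -I takes an interval in seconds.
        flags += ["-I", str(snapshot_interval_s)]
    return flags


print(" ".join(profiling_flags(enabled=True, top_n=20, full_paths=True)))
# -P -N 20 -A
```

Calling `profiling_flags()` with no arguments returns an empty list, which corresponds to the no-profiling case.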
Step 4: Launch Copper
Aurora
Basic run with startup timing visibility:
launch_copper_aurora.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Facility-mode run with profiling enabled:
launch_copper_aurora.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Discovery-mode run:
launch_copper_aurora.sh -l 4 -a discover -n cxi -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Aurora notes:
cxi is the common endpoint family in the Aurora wrappers
48,49,50,51 is the common Copper service-core example
-l 4 is the recommended setting when startup timing matters
Frontier
Basic run with startup timing visibility:
launch_copper_frontier.sh -l 4 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Facility-mode run with profiling enabled:
launch_copper_frontier.sh -l 4 -P -N 20 -A -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Discovery-mode run:
launch_copper_frontier.sh -l 4 -a discover -n cxi://cxi1 -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Frontier notes:
cxi://cxi1 is a common endpoint example
1,2 or 1,2,65,66 are common Copper service-core examples
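When launches are driven from a script, the wrapper command line can be assembled from the same options shown in the examples above. The following is a sketch under stated assumptions rather than a Copper utility: `launch_command` is a hypothetical helper, the `/tmp/...` paths are placeholders, and only the flags that appear on this page (`-l`, `-a`, `-n`, `-d`, `-v`, plus any profiling flags) are used.

```python
import shlex


def launch_command(wrapper, log_level, logdir, mount,
                   address_book=None, network=None, profiling=()):
    """Assemble a launch-wrapper argument list from the options used on this page."""
    argv = [wrapper, "-l", str(log_level)]
    if address_book is not None:
        argv += ["-a", address_book]   # e.g. "discover"
    if network is not None:
        argv += ["-n", network]        # e.g. "cxi" or "cxi://cxi1"
    argv += list(profiling)            # e.g. ["-P", "-N", "20", "-A"]
    argv += ["-d", logdir, "-v", mount]
    return argv


cmd = launch_command(
    "launch_copper_frontier.sh", 4, "/tmp/copper-logs", "/tmp/copper-mount",
    address_book="discover", network="cxi://cxi1",
)
print(shlex.join(cmd))
```

Building the command as a list and quoting it only for display avoids shell-quoting mistakes when the log directory or mount path contains spaces.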
Step 5: Run the Workload
After Copper starts, run the workload against the Copper-mounted path.
Aurora example:
time mpirun --np "${NRANKS}" --ppn "${RANKS_PER_NODE}" \
--cpu-bind=list:4:56:9:61:14:66:19:71:20:74:25:79 \
--genvall \
--genv=PYTHONPATH="${MY_COPPER_MOUNT}${PACKAGE_DIR}/:${PYTHONPATH}" \
python3 -c "import torch; print(torch.__file__);"
Frontier example:
/usr/bin/time srun --overlap -N "${SLURM_NNODES}" -n $((SLURM_NNODES * 8)) \
--ntasks-per-node=8 --cpu-bind="${CPU_BINDING_MAP}" \
python3 -c "import torch; print(torch.__file__)"
Step 6: Stop Copper
Aurora:
stop_copper_aurora.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Frontier:
stop_copper_frontier.sh -d "${LOGDIR}" -v "${MY_COPPER_MOUNT}"
Important shutdown note:
run the matching stop_copper_aurora.sh or stop_copper_frontier.sh before attempting aggregate_profiling.py or path_usage_analysis.py
wait for Copper shutdown to complete on all nodes before running the analysis scripts
the final profiling CSV files under profiling/final/ and the final raw table outputs under tables/final/ may not exist until shutdown has completed
if analysis is attempted too early, the common failure mode is:
failed to find profiling csv files under: .../profiling/final
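A small pre-check can catch this case before the analysis scripts are invoked. The sketch below assumes only what this page states, namely that the final profiling outputs are CSV files under `profiling/final/` in the job root. The function names are hypothetical, and the `*.csv` pattern is an assumption about file naming.

```python
from pathlib import Path


def final_profiling_csvs(job_root):
    """List CSV files under <job_root>/profiling/final, sorted by name."""
    final_dir = Path(job_root) / "profiling" / "final"
    if not final_dir.is_dir():
        return []
    # Assumption: final profiling outputs use a .csv suffix.
    return sorted(final_dir.glob("*.csv"))


def ready_for_analysis(job_root):
    """Return True once at least one final profiling CSV exists."""
    return len(final_profiling_csvs(job_root)) > 0
```

A `False` result after the stop script has returned suggests that shutdown is still in progress on some nodes, or that the run was launched without profiling.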
Step 7: Locate the Job Output Directory
The post-processing scripts expect the Copper job root directory, which is the directory that contains:
logs/
tables/
profiling/
Typical example:
/path/to/copper-logs-dir/<jobid>
Do not point the scripts at:
logs/ directly
profiling/final/ directly
profiling/cluster/ directly
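The job-root requirement can be verified mechanically before either script is run. This is a minimal sketch, assuming the three subdirectories named above are the only markers of a job root; `missing_job_root_dirs` is a hypothetical name.

```python
from pathlib import Path

# Subdirectories that this page says a Copper job root contains.
REQUIRED_SUBDIRS = ("logs", "tables", "profiling")


def missing_job_root_dirs(path):
    """Return the required subdirectories that are absent from path."""
    root = Path(path)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]
```

An empty list means the path looks like a job root. If all three names are reported missing, the path probably points one or two levels too deep, for example at `logs/` or `profiling/final/`.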
Step 8: Run Aggregate Profiling
The aggregate profiling script combines per-rank profiling outputs into cluster-level summaries.
Basic usage:
cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts
python3 aggregate_profiling.py /path/to/copper-logs-dir/<jobid>
With a custom output prefix:
python3 aggregate_profiling.py /path/to/copper-logs-dir/<jobid> cluster_usage_test
With explicit usage-root analysis:
python3 aggregate_profiling.py \
/path/to/copper-logs-dir/<jobid> \
cluster_usage_test \
--usage-root /path/to/package-root
Outputs are written under:
profiling/cluster/<prefix>-profiling_aggregate.csv
profiling/cluster/<prefix>-profiling_operations.csv
profiling/cluster/<prefix>-profiling_summary.md
If the script reports:
failed to find profiling csv files under: ...
the most common causes are:
the run was not launched with profiling enabled
the wrong job root directory was passed
Copper had not been stopped yet, so the final outputs had not been flushed
the run failed before profiling files were written
Step 9: Inspect Cache Table Usage
The ioctl helper scripts can also capture current table occupancy for Copper’s three main cache tables and summarize them into one combined report.
Workflow:
cd /path/to/copper/scripts/get_copper_stats_ioctl
# Step 1: add this to env.sh, or export it in the current shell
export VIEW_DIR=/mnt/bb/${USER}/copper_mount
# Step 2: while Copper is still running, collect the raw cache-size outputs
bash ./get_cache_usage_summary.sh /lustre/orion/proj-shared/ums046/some_output_dir
# Step 3: for offline re-summarization of an existing directory
python3 ./summarize_cache_usage.py /path/to/dir --csv /path/to/dir/cache_usage_summary.csv
Outputs include:
data_cache_size.output
tree_cache_size.output
md_cache_size.output
cache_usage_summary.txt
cache_usage_summary.csv
Interpretation:
the reports show current used bytes and entry counts
the combined summary shows the total currently occupied bytes across all three tables
the tables are dynamically growing unordered_map containers rather than fixed-capacity pools
memory usage is bounded by normal process and node limits rather than by a Copper-specific global table budget
max_cacheable_byte_size is a per-file admission threshold for data caching, not a total cache budget
Copper does not yet report a fixed total budget or remaining available bytes for those tables
Sample output:
| Table | Files Found | Used Bytes | Used Human | Entries |
| -------- | ----------: | ---------: | ---------: | ------: |
| data | 1 | 70296558 | 67.04 MiB | 1025 |
| tree | 1 | 27772 | 27.12 KiB | 140 |
| metadata | 1 | 340848 | 332.86 KiB | 2367 |
| combined | - | 70665178 | 67.39 MiB | 3532 |
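The combined row is the sum of the three table rows, and the human-readable column is consistent with binary units (1 KiB = 1024 bytes). The sketch below reproduces those figures from the sample values. `human_bytes` is a hypothetical helper and is not claimed to be the formatting code used by summarize_cache_usage.py.

```python
def human_bytes(n):
    """Format a byte count with binary units and two decimal places."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


# Used-bytes values copied from the sample table above.
used = {"data": 70296558, "tree": 27772, "metadata": 340848}
combined = sum(used.values())
print(combined, human_bytes(combined))  # 70665178 67.39 MiB
```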
Step 10: Run Path Usage Analysis
The path-usage script compares observed paths against the selected filesystem roots and produces used-path, missing-probe, and candidate-not-observed lists.
Basic usage:
cd /lus/flare/projects/datascience/kaushik/copper-tests/copper/scripts
python3 path_usage_analysis.py /path/to/copper-logs-dir/<jobid>
With an explicit package root:
python3 path_usage_analysis.py \
/path/to/copper-logs-dir/<jobid> \
--usage-root /path/to/package-root
With a custom output directory:
python3 path_usage_analysis.py \
/path/to/copper-logs-dir/<jobid> \
--usage-root /path/to/package-root \
--output-dir /path/to/copper-logs-dir/<jobid>/paths_dir
Outputs include:
all_possible_existing_paths.txt
used_paths_existing.txt
same_run_candidate_not_observed_existing_paths.txt
missing_probe_paths.txt
roots_and_counts.txt
paths_summary.md
Sample Outputs from Aurora Run 8405873
The examples below come from:
/lus/flare/projects/datascience/kaushik/copper-tests/aurora_runs/scale_test/2_nodes/copper-logs-dir/8405873
Startup log sample
From logs/x4613c3s7b0n0-82226-output.log:
2026-03-27 23-33-57.994 [Info] (cu_hello_main) user-facing log_level: 5
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_metrics: enabled
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_top_n: 20
2026-03-27 23-33-57.994 [Info] (cu_hello_main) profile_paths_full: enabled
2026-03-27 23-33-57.997 [Info] (start_thallium_engine) starting thallium engine
2026-03-27 23-33-58.078 [Info] (start_thallium_engine) engine started
2026-03-27 23-33-58.082 [Info] (start_thallium_engine) server running at address: ofi+cxi://0x09470400
2026-03-27 23-33-58.082 [Info] (start_thallium_engine) provider registration completed after 85096 us
This startup view confirms:
the chosen log level
whether profiling was enabled
whether the server came up cleanly
the local provider registration time
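The same information can be pulled out programmatically. The sketch below parses lines in the format shown in the sample (timestamp, bracketed level, parenthesized function name, message) and measures the wall-clock gap between two of them. The regular expression and the name `parse_line` are assumptions derived from this sample only, not a documented Copper log grammar.

```python
import re
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH-MM-SS.mmm [Level] (function) message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}-\d{2}-\d{2}\.\d{3}) "
    r"\[(?P<level>\w+)\] \((?P<func>\w+)\) (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, function, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H-%M-%S.%f")
    return ts, match.group("level"), match.group("func"), match.group("msg")


first = parse_line(
    "2026-03-27 23-33-57.997 [Info] (start_thallium_engine) starting thallium engine"
)
last = parse_line(
    "2026-03-27 23-33-58.082 [Info] (start_thallium_engine) "
    "server running at address: ofi+cxi://0x09470400"
)
elapsed_ms = (last[0] - first[0]).total_seconds() * 1000
print(round(elapsed_ms))  # 85
```

The 85 ms gap between the two timestamps agrees with the 85096 us that the log itself reports for provider registration.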
Cluster aggregate summary sample
From profiling/cluster/cluster_usage_test-profiling_summary.md:
Ranks with aggregate files: 4
total_cache_hits: 5984151
total_cache_misses: 3010
total_counted_fuse_operations: 6377844
total_measured_latency_seconds: 39.513989
total_negative_results: 168147
metadata_enoent_ttl_serves: 45424
metadata_enoent_ttl_stores: 2372
Selected operation totals from the same summary:
getattr: 5952916 calls, 31.390461 s total latency
read: 195912 calls, 7.008104 s total latency
readdir: 6480 calls, 0.535788 s total latency
This summary is useful for:
confirming that cluster aggregation succeeded
identifying the hottest operations by call count
identifying the operations that dominated measured latency
checking how heavily Copper served data from cache
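Two derived figures make the last two points concrete: the cache hit ratio and the mean latency per call. The sketch below computes both from the sample totals. It assumes that total_cache_hits and total_cache_misses count comparable lookup events, so that hits / (hits + misses) is a meaningful ratio; that interpretation is not confirmed by the summary itself.

```python
# Totals copied from the cluster aggregate summary sample above.
hits, misses = 5984151, 3010
hit_ratio = hits / (hits + misses)

# (calls, total latency in seconds) from the selected operation totals.
ops = {
    "getattr": (5952916, 31.390461),
    "read": (195912, 7.008104),
    "readdir": (6480, 0.535788),
}


def mean_latency_us(calls, total_s):
    """Mean latency per call in microseconds."""
    return total_s / calls * 1e6


print(f"cache hit ratio: {hit_ratio:.4%}")
for name, (calls, total_s) in ops.items():
    print(f"{name}: {mean_latency_us(calls, total_s):.2f} us per call")
```

In this sample, getattr dominates total latency because of its call count, while its per-call mean (roughly 5 us) is the lowest of the three operations.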
Path usage summary sample
From paths_dir/paths_summary.md:
All possible existing paths under the selected roots: 22383
Actually used in this run: 1729
Probed but absent: 555
Observed files: 1404 of 20625 (6.81%)
Observed directories: 325 of 1758 (18.49%)
This path summary supports the following interpretation:
the selected environment tree was much larger than the set touched by the run
file coverage for this run was 6.81 percent, not full-tree usage
missing probe paths are normal loader or Python search misses and should not be confused with active deletion
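The coverage percentages follow directly from the observed and total counts. The short sketch below recomputes them from the sample figures as a consistency check; `coverage_percent` is a hypothetical helper, not part of path_usage_analysis.py.

```python
def coverage_percent(observed, total):
    """Percentage of total that was observed; 0.0 when total is zero."""
    return 100.0 * observed / total if total else 0.0


# Counts copied from the path usage summary sample above.
files_observed, files_total = 1404, 20625
dirs_observed, dirs_total = 325, 1758

print(f"files: {coverage_percent(files_observed, files_total):.2f}%")       # files: 6.81%
print(f"directories: {coverage_percent(dirs_observed, dirs_total):.2f}%")   # directories: 18.49%

# The file and directory counts add up to the headline totals.
print(files_observed + dirs_observed, files_total + dirs_total)  # 1729 22383
```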
Path class sample
Also from profiling/cluster/cluster_usage_test-profiling_summary.md:
environment_prefix: 5075764 events, example /lus
shared_library: 126484 events, example .../nvidia/cu13/lib/libcublasLt.so.13
negative_probe_path: 19452 events, example .../aurora_runs/nvidia
These classes are useful when planning cleanup experiments:
keep the active environment core paths first
review repeated negative probe paths for stale search-path entries
use the observed set as an allowlist candidate for a cloned environment, rather than as proof that unseen files are permanently unnecessary
Practical Reading Order
If the goal is fast post-run interpretation, the simplest reading order is:
check the Copper node logs in logs/ for startup and readiness timing
check profiling/cluster/*.md for aggregate workload shape
check profiling/cluster/*.csv for totals and per-operation counts
check paths_summary.md when pruning or environment minimization is the next question