Profiling Reference

Profiling Modes

Copper supports three profiling modes:

Aggregate-only profiling: Records cluster-wide totals and per-operation summaries.
Top-path profiling: Records the hottest paths per category, bounded by profile_top_n.
Full-path profiling: Preserves more detailed path-level artifacts for deeper investigation.
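
The bounded retention behind top-path profiling can be illustrated with a minimal sketch. It assumes a simple post-hoc count over (category, path) events. The function name and event shape are hypothetical, and Copper's internal bookkeeping may differ.

```python
from collections import Counter, defaultdict


def top_paths_per_category(events, top_n):
    """Return the top_n busiest paths per category.

    events: iterable of (category, path) pairs, one per observed request.
    This is an illustrative model, not Copper's implementation.
    """
    counts = defaultdict(Counter)
    for category, path in events:
        counts[category][path] += 1
    # most_common(top_n) keeps only the hottest entries, so the output size
    # is bounded by top_n regardless of how many distinct paths were seen.
    return {
        category: counter.most_common(top_n)
        for category, counter in counts.items()
    }
```

With `top_n` small, the result stays compact even when a category touches many distinct paths, which is the trade-off that separates top-path from full-path profiling.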

Runtime Options

Option	Meaning
`-profile_metrics`	enable profiling collection
`-profile_top_n <N>`	keep the hottest `N` paths per category
`-profile_paths_full`	enable fuller path-level forensic output
`-profile_snapshot_interval_s <seconds>`	write periodic snapshots while Copper is still running
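
As a rough sketch, a launch wrapper could assemble these options as follows. The helper name is hypothetical, and the sketch assumes that `-profile_metrics` must accompany the other options, which this reference does not state explicitly. The Copper binary or launcher that receives the flags is deliberately left out.

```python
def profiling_flags(top_n=None, paths_full=False, snapshot_interval_s=None):
    """Build a list of profiling options from the table above.

    Assumption: -profile_metrics is always passed when profiling is wanted.
    """
    flags = ["-profile_metrics"]
    if top_n is not None:
        flags += ["-profile_top_n", str(top_n)]
    if paths_full:
        flags.append("-profile_paths_full")
    if snapshot_interval_s is not None:
        flags += ["-profile_snapshot_interval_s", str(snapshot_interval_s)]
    return flags


# Example: top-path profiling with periodic snapshots every 30 seconds.
example = profiling_flags(top_n=50, snapshot_interval_s=30)
```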

Output Layout

With profiling enabled, the job output directory typically contains:

profiling/final/ - final per-rank profiling summaries and CSV files
profiling/cluster/ - aggregated cluster-level summaries produced by scripts/aggregate_profiling.py
tables/final/ - raw cache or table outputs associated with the run

Common Files

File	Meaning
`profiling_summary.md`	human-readable per-rank profiling summary
`profiling_operations.csv`	per-operation counts and timing
`profiling_aggregate.csv`	aggregate counters and cache metrics
`profiling_top_paths.csv`	hottest path list when top-path profiling is enabled
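
The column layout of these CSV files is not documented here, so a first inspection is safest when it makes no assumptions about column names. The following sketch reads any of the CSV files and returns its header and rows. The helper name is hypothetical.

```python
import csv
import io


def load_csv_rows(text_stream):
    """Return (fieldnames, rows) for a CSV stream without assuming columns."""
    reader = csv.DictReader(text_stream)
    rows = list(reader)
    return list(reader.fieldnames or []), rows


# Usage with an in-memory stand-in; the columns shown are invented for the demo.
demo = io.StringIO("operation,count\ngetattr,10\nread,3\n")
fields, rows = load_csv_rows(demo)
```

For a real file, pass an open handle instead, for example `open(path, newline="")`.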

Key Terms

Cache hit: Copper served the request from a local or forwarded cache path without repeating the underlying storage operation.
Metadata ENOENT TTL store: Copper inserted an exact-path negative metadata result into the temporary ENOENT TTL cache.
Metadata ENOENT TTL serve: Copper reused that temporary missing-path result instead of repeating the backend metadata lookup.
Metadata ENOENT TTL expire: An ENOENT TTL entry aged out and was removed after its configured lifetime.
FUSE operation: A filesystem request issued through the Copper mount, such as getattr, read, readdir, or open.
Metadata: File information rather than file content, including existence, type, size, and permissions.
Data: File content bytes read from a file, such as the bytes of a shared library or Python module.
Negative result: A request that ends in a missing-path or similar unsuccessful result rather than a successful metadata or data response.
ENOENT: The standard Unix “No such file or directory” result. In profiling output, repeated ENOENT paths are often informative rather than erroneous.
Top-path profiling: A bounded hotspot view that retains the busiest paths instead of dumping the full path population.
Full-path profiling: A deeper forensic mode that preserves more complete per-path evidence and can become large at scale.
Pre-destroy snapshot: A profiling snapshot written before the final Copper teardown sequence.
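
The three Metadata ENOENT TTL events defined above (store, serve, expire) can be modelled with a small negative-result cache. This is a minimal sketch under stated assumptions: the class name is hypothetical, expiry is checked lazily at lookup time, and Copper's actual data structure and expiry policy may differ.

```python
class EnoentTtlCache:
    """Toy exact-path negative metadata cache with a fixed TTL."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._expires_at = {}  # path -> time after which the entry is stale
        self.stores = 0
        self.serves = 0
        self.expires = 0

    def lookup(self, path, now, backend_exists):
        """Return True if the path exists, consulting the cache first."""
        deadline = self._expires_at.get(path)
        if deadline is not None:
            if now < deadline:
                self.serves += 1  # reuse the cached missing-path result
                return False
            del self._expires_at[path]  # entry aged out
            self.expires += 1
        exists = backend_exists(path)  # real metadata lookup
        if not exists:
            self._expires_at[path] = now + self.ttl_s
            self.stores += 1
        return exists
```

In this model, a high serve count relative to the store count means that repeated probes for the same missing path were answered without touching the backend.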

Reading the Metrics

The maintained profiling evaluations support the following interpretations:

high getattr counts usually indicate metadata-heavy startup or import behavior
high cumulative read latency usually indicates that content reads, not metadata probes, dominate elapsed service time
high Metadata ENOENT TTL serve counts are generally a positive sign that repeated negative metadata probes are being collapsed successfully
repeated missing shared-library probes, python*.zip checks, and pyvenv.cfg lookups are often expected runtime behavior rather than correctness bugs
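
When triaging a list of ENOENT paths, it can help to separate the expected probes named above from everything else. The sketch below does this by file name. The glob patterns (in particular `*.so` and `*.so.*` for shared libraries) are illustrative assumptions, not a list maintained by Copper.

```python
import posixpath
from fnmatch import fnmatchcase

# Assumed name patterns for commonly expected missing-path probes.
EXPECTED_PATTERNS = ("*.so", "*.so.*", "python*.zip", "pyvenv.cfg")


def is_expected_probe(path):
    """Return True if the path's file name matches an expected-probe pattern."""
    name = posixpath.basename(path)
    return any(fnmatchcase(name, pattern) for pattern in EXPECTED_PATTERNS)


def split_enoent_paths(paths):
    """Partition paths into (expected, unexpected) lists, preserving order."""
    expected = [p for p in paths if is_expected_probe(p)]
    unexpected = [p for p in paths if not is_expected_probe(p)]
    return expected, unexpected
```

Paths that land in the unexpected list are not necessarily bugs, but they are the better starting point for a correctness investigation.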

Validated Output Families

The version4 profiling validation confirmed that a profiling-enabled run can produce:

per-rank Markdown summaries
per-rank operation CSVs
aggregate CSVs
bounded top-path CSVs
raw cache-event outputs
matching pre-destroy variants of those artifacts

That output structure is what makes profiling useful both for quick workload inspection and for deeper environment-path debugging.

Operational Guidance

Use aggregate profiling first for scalable runs.
Add top-path profiling when path hot spots matter.
Reserve full-path profiling for smaller or targeted forensic runs.
Do not rely on normal log verbosity to infer whether profiling ran; rely on the profiling output files themselves.
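
Checking the output files rather than the logs can be automated with a short script. The sketch below treats a run as profiled if `profiling/final/` under the job output directory contains at least one non-empty file. The helper name and the non-empty criterion are assumptions, and the exact per-rank file naming is not relied upon.

```python
import tempfile
from pathlib import Path


def profiling_ran(job_output_dir):
    """Return True if profiling/final/ holds at least one non-empty file."""
    final_dir = Path(job_output_dir) / "profiling" / "final"
    if not final_dir.is_dir():
        return False
    return any(
        p.is_file() and p.stat().st_size > 0 for p in final_dir.rglob("*")
    )


# Self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    before = profiling_ran(tmp)  # no profiling/final/ yet
    final = Path(tmp) / "profiling" / "final"
    final.mkdir(parents=True)
    (final / "profiling_summary.md").write_text("summary\n")
    after = profiling_ran(tmp)
```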

Cache Usage Metrics

Copper also supports ioctl-based table-size dumps for the three main cache tables:

data cache
tree cache
metadata cache

The raw table-size outputs are produced by the existing scripts under scripts/get_copper_stats_ioctl/. A post-analysis helper can then summarize them into one combined cache-usage report.

Recommended workflow:

cd scripts/get_copper_stats_ioctl

# Step 1: add this to env.sh, or export it in the current shell
export VIEW_DIR=/mnt/bb/${USER}/copper_mount

# Step 2: while Copper is still running, collect the raw cache-size outputs
bash ./get_cache_usage_summary.sh /lustre/orion/proj-shared/ums046/some_output_dir

# Step 3: for offline re-summarization of an existing directory
python3 ./summarize_cache_usage.py /path/to/dir --csv /path/to/dir/cache_usage_summary.csv

The summary reports:

used bytes for each table
entry counts for each table
combined used bytes across the three tables

The raw files are:

data_cache_size.output
tree_cache_size.output
md_cache_size.output

Memory model:

the main Copper tables are dynamically growing containers, not fixed-capacity pools
cache.h uses std::unordered_map<Key, Value> with no global size limit
tl_cache.h does the same for the Thallium-side caches
the tables grow as entries are inserted
memory usage is bounded by normal process and node limits, including DRAM, allocator behavior, and any job or cgroup memory limits
max_cacheable_byte_size is only a per-file admission threshold for data caching, not a total cache budget
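
The difference between a per-file admission threshold and a total cache budget can be shown with a toy model. This is a Python stand-in for illustration only, not the C++ code in cache.h. The class name is hypothetical, and whether the real threshold comparison is inclusive is an assumption.

```python
class UnboundedDataCache:
    """Toy data cache: per-file admission check, no global size limit."""

    def __init__(self, max_cacheable_byte_size):
        self.max_cacheable_byte_size = max_cacheable_byte_size
        self._entries = {}  # grows as entries are inserted

    def insert(self, path, content):
        """Cache content unless this single file exceeds the threshold."""
        if len(content) > self.max_cacheable_byte_size:
            return False  # rejected: the file alone is too large
        self._entries[path] = content
        return True

    def used_bytes(self):
        return sum(len(value) for value in self._entries.values())


# Three 8-byte files are each admitted under a 10-byte threshold,
# so total usage (24 bytes) exceeds the threshold itself.
cache = UnboundedDataCache(max_cacheable_byte_size=10)
for name in ("a", "b", "c"):
    cache.insert(name, b"12345678")
```

The total grows with the number of admitted files, which is why the threshold cannot be read as a memory cap.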

The summary does not currently report a true "remaining cache available" value, because Copper does not yet maintain a fixed global cache budget for those tables.

Sample output:

| Table    | Files Found | Used Bytes | Used Human | Entries |
| -------- | ----------: | ---------: | ---------: | ------: |
| data     | 1           | 70296558   | 67.04 MiB  | 1025    |
| tree     | 1           | 27772      | 27.12 KiB  | 140     |
| metadata | 1           | 340848     | 332.86 KiB | 2367    |
| combined | -           | 70665178   | 67.39 MiB  | 3532    |
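
The combined row and the human-readable sizes in the sample can be reproduced with a few lines of arithmetic. The sketch below assumes binary units (1 KiB = 1024 bytes) with two decimal places, which matches the sample values. It is not taken from summarize_cache_usage.py, and the function names are hypothetical.

```python
def human_size(num_bytes):
    """Format a byte count using binary units with two decimals."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


def combine(tables):
    """Sum (used_bytes, entries) pairs across tables."""
    total_bytes = sum(used for used, _ in tables.values())
    total_entries = sum(entries for _, entries in tables.values())
    return total_bytes, total_entries


# Values copied from the sample output above.
sample = {
    "data": (70296558, 1025),
    "tree": (27772, 140),
    "metadata": (340848, 2367),
}
combined_bytes, combined_entries = combine(sample)
```

Running this on the sample values gives 70665178 bytes and 3532 entries, matching the combined row.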