Metadata ENOENT TTL Evaluation
Scope
This page integrates the maintained findings from the -E evaluation work
for exact-path metadata ENOENT caching. The core experiment measured how
the Copper launch wrapper -E <value> option affected:
- python3 -c "import torch" completion time
- repeated metadata rechecks
- backend ENOENT traffic
- total negative metadata churn
This dataset should be read as an older debugging-phase reference, not as the final stable production baseline. The runs were collected while the code still had heavier debug logging and before the later stable startup path was fully in place. The main value of the dataset is as a rough estimate of TTL behavior on the older code path.
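The completion-time measurement can be approximated with a small timing harness. The sketch below is an assumption about how such a timing could be taken, not the harness used for this dataset. The helper name time_command is hypothetical, and the Copper launch wrapper invocation is left out because its full command line is not given on this page.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (exit_code, elapsed_seconds).

    Output is discarded so that terminal I/O does not skew the timing.
    """
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed
```

Calling time_command([sys.executable, "-c", "import torch"]) once per -E setting would give one elapsed value per setting; an exit code of 0 corresponds to a successful import.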
Summary of the 128-Node Evaluation
The 128-node experiment compared E100, E1000, E2000, E5000,
and E10000.
This 128-node summary is best treated as a rough tuning reference for the
older code line. It remains useful for understanding the relative effect of
small versus moderate -E values, but it should not be read as the
authoritative performance ranking for the current stable code.
| -E value | Import result | Elapsed | Stale completed entries | TTL stores | TTL serves | Backend ENOENT |
|---|---|---|---|---|---|---|
| E100 | success | | | | | |
| E1000 | success | | | | | |
| E2000 | success | | | | | |
| E5000 | success | | | | | |
| E10000 | success | | | | | |
Key Findings
- E100 is too small at 128 nodes. It produced the slowest import time and the highest stale-recheck count.
- Most of the gain arrives by E1000 to E2000. The largest step improvement happened between E100 and E1000.
- Larger values continue to help, but with diminishing returns. E5000 and E10000 still reduce stale rechecks and backend ENOENT traffic, but only modestly compared with the E1000 to E2000 transition.
- E10000 gave the best raw import time in this dataset. The difference relative to E2000 was narrow, so the best choice depends on whether absolute speed or operational conservatism matters more.
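The diminishing-returns shape follows from how a TTL interacts with repeated lookups of the same missing path. The toy model below is not the measured data and is not Copper's implementation. The class name, the counter names, the fixed recheck interval, and the treatment of the -E value as a TTL in abstract time units are all assumptions made for illustration.

```python
class NegativeCache:
    """Toy exact-path ENOENT cache with a TTL (illustrative only)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expires = {}        # path -> time at which the cached ENOENT goes stale
        self.stores = 0          # negative results written into the cache
        self.serves = 0          # lookups answered from the cache
        self.stale_rechecks = 0  # lookups that found an expired entry
        self.backend_enoent = 0  # lookups forwarded to the backend

    def lookup_missing(self, path, now):
        """Look up a path that is missing on the backend."""
        expiry = self.expires.get(path)
        if expiry is not None and now < expiry:
            self.serves += 1
            return "served"
        if expiry is not None:
            self.stale_rechecks += 1
        self.backend_enoent += 1  # the backend answers ENOENT again
        self.expires[path] = now + self.ttl
        self.stores += 1
        return "backend"


def sweep(ttls, interval=50, window=20000, path="/opt/missing/module.py"):
    """Recheck one missing path every `interval` units for `window` units."""
    results = {}
    for ttl in ttls:
        cache = NegativeCache(ttl)
        for now in range(0, window, interval):
            cache.lookup_missing(path, now)
        results[ttl] = cache.backend_enoent
    return results


for ttl, backend in sweep([100, 1000, 2000, 5000, 10000]).items():
    print(f"ttl={ttl:>5}  backend ENOENT lookups={backend}")
```

With these toy parameters the backend lookups fall from 200 to 20 between TTL 100 and TTL 1000, then to 10, 4, and 2. The absolute savings shrink at each step, which matches the qualitative shape described above. The counts are properties of the toy parameters, not of the 128-node runs.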
Recommended Defaults by Scale
The archived evaluation and follow-on test planning support the following practical starting points:
| Scale range | Recommended starting point | Notes |
|---|---|---|
| | | already large enough to collapse common repeated misses |
| | | best overall balance in the 128-node dataset |
| | compare | carry both until the larger-scale workload is characterized |
Operational Guidance
- keep E100 out of larger-scale production comparisons
- use E2000 as the default balanced starting point unless the workload shows a clear benefit from larger settings
- keep observing stale completed entry, TTL serves, and backend ENOENT counts together
- remember that the TTL reduces repeated work; it does not eliminate the workload’s tendency to ask for missing paths
- treat the 128-node numbers on this page as directional guidance from an older debug-era evaluation rather than as the final word on current stable behavior
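Reading the three counters together can be as simple as deriving a couple of ratios from one run's totals. The helper below is a hypothetical sketch: the parameter names are invented for illustration and do not correspond to Copper's actual log or stats output. It also assumes that every negative lookup is either a TTL serve or a backend ENOENT, and that stale completed entries are a subset of the backend lookups.

```python
def summarize(stale_completed, ttl_serves, backend_enoent):
    """Derive ratios from one run's counters (names are illustrative)."""
    negative_lookups = ttl_serves + backend_enoent
    return {
        # Share of negative lookups absorbed by the TTL cache.
        "serve_ratio": ttl_serves / negative_lookups if negative_lookups else 0.0,
        # Share of backend ENOENT lookups that were rechecks of expired entries.
        "stale_share": stale_completed / backend_enoent if backend_enoent else 0.0,
    }


# Made-up counts, used only to show the shape of the output.
print(summarize(stale_completed=30, ttl_serves=900, backend_enoent=100))
```

A high serve ratio with a low stale share suggests the TTL is absorbing repeated misses; a high stale share suggests entries are expiring before the workload stops asking for them.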
What the TTL Fix Does Not Solve
The metadata ENOENT TTL is useful, but narrow. It does not by itself solve:
- a workload that keeps asking for many distinct missing paths
- heavy directory enumeration churn
- read-path amplification for file content
- transport or RPC failures unrelated to ENOENT
- mount instability after upstream Copper failures
The right interpretation is that the TTL is a targeted reduction in repeated negative metadata work, not a general cure for all startup pathologies.
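The difference between repeated and distinct misses can be shown in a few lines. This sketch assumes a simple exact-path cache keyed by the path string, with a TTL in abstract time units. The helper count_backend_lookups and the example paths are hypothetical and not part of Copper.

```python
def count_backend_lookups(paths, ttl, step=1):
    """Return how many lookups reach the backend when every path is missing.

    Lookups are issued `step` time units apart. A cached ENOENT is reused
    only for that exact path and only until its TTL expires.
    """
    expires = {}
    backend = 0
    for i, path in enumerate(paths):
        now = i * step
        if path in expires and now < expires[path]:
            continue  # served from the negative cache
        backend += 1  # the backend returns ENOENT
        expires[path] = now + ttl
    return backend


repeated = ["/site-packages/missing.so"] * 1000
distinct = [f"/site-packages/missing_{i}.so" for i in range(1000)]

print(count_backend_lookups(repeated, ttl=2000))  # 1
print(count_backend_lookups(distinct, ttl=2000))  # 1000
```

The repeated workload collapses to a single backend lookup, while the distinct workload reaches the backend every time regardless of the TTL value.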