Metadata ENOENT TTL Evaluation
==============================

Scope
-----

This page integrates the maintained findings from the ``-E`` evaluation work
for exact-path metadata ``ENOENT`` caching. The core experiment measured how
the Copper launch wrapper ``-E <value>`` option affected:

- ``python3 -c "import torch"`` completion time
- repeated metadata rechecks
- backend ``ENOENT`` traffic
- total negative metadata churn

This dataset should be read as an older debugging-phase reference, not as the
final stable production baseline. The runs were collected while the code still
had heavier debug logging and before the later stable startup path was fully in
place. The main value of the dataset is as a rough estimate of TTL behavior on
the older code path.

Summary of the 128-Node Evaluation
----------------------------------

The 128-node experiment compared ``E100``, ``E1000``, ``E2000``, ``E5000``,
and ``E10000``.

This 128-node summary is best treated as a rough tuning reference for the
older code line. It remains useful for understanding the relative effect of
small versus moderate ``-E`` values, but it should not be read as the
authoritative performance ranking for the current stable code.

.. list-table::
   :header-rows: 1

   * - ``-E`` value
     - Import result
     - Elapsed
     - Stale completed entries
     - TTL stores
     - TTL serves
     - Backend ``ENOENT`` lookups
   * - ``100``
     - success
     - ``57.94 s``
     - ``15,023``
     - ``110,608``
     - ``632,028``
     - ``3,449``
   * - ``1000``
     - success
     - ``48.79 s``
     - ``1,511``
     - ``86,575``
     - ``642,517``
     - ``818``
   * - ``2000``
     - success
     - ``47.80 s``
     - ``1,017``
     - ``85,421``
     - ``643,206``
     - ``624``
   * - ``5000``
     - success
     - ``48.03 s``
     - ``1,020``
     - ``85,117``
     - ``643,482``
     - ``606``
   * - ``10000``
     - success
     - ``47.02 s``
     - ``872``
     - ``84,852``
     - ``643,556``
     - ``571``

Key Findings
------------

``E100`` is too small at 128 nodes
   It produced the slowest import time and the highest stale-recheck count.

Most of the gain arrives by ``E1000`` to ``E2000``
   The largest step improvement happened between ``E100`` and ``E1000``.

Larger values continue to help, but with diminishing returns
   ``E5000`` and ``E10000`` still reduce stale rechecks and backend
   ``ENOENT`` traffic, but only modestly compared with the ``E1000`` to
   ``E2000`` transition.

``E10000`` gave the best raw import time in this dataset
   The difference relative to ``E2000`` was narrow, so the best choice depends
   on whether absolute speed or operational conservatism matters more.

Recommended Defaults by Scale
-----------------------------

The archived evaluation and follow-on test planning support the following
practical starting points:

.. list-table::
   :header-rows: 1

   * - Scale range
     - Recommended starting point
     - Notes
   * - ``2-16`` nodes
     - ``E1000``
     - already large enough to collapse common repeated misses
   * - ``128-512`` nodes
     - ``E2000``
     - best overall balance in the 128-node dataset
   * - ``1K-10K`` nodes
     - compare ``E2000`` and ``E5000``
     - carry both until the larger-scale workload is characterized

Operational Guidance
--------------------

- keep ``E100`` out of larger-scale production comparisons
- use ``E2000`` as the default balanced starting point unless the workload
  shows a clear benefit from larger settings
- keep observing ``stale completed entry``, TTL serves, and backend
  ``ENOENT`` counts together
- remember that the TTL reduces repeated work; it does not eliminate the
  workload's tendency to ask for missing paths
- treat the 128-node numbers on this page as directional guidance from an
  older debug-era evaluation rather than as the final word on current stable
  behavior

What the TTL Fix Does Not Solve
-------------------------------

The metadata ``ENOENT`` TTL is useful, but narrow. It does not by itself solve:

- a workload that keeps asking for many distinct missing paths
- heavy directory enumeration churn
- read-path amplification for file content
- transport or RPC failures unrelated to ``ENOENT``
- mount instability after upstream Copper failures

The right interpretation is that the TTL is a targeted reduction in repeated
negative metadata work, not a general cure for all startup pathologies.