False ENOENT Report

Scope

This note summarizes the false-ENOENT investigation that informed later metadata TTL handling in Copper.

Primary Finding

The observed mount failures were not caused by Slurm CPU binding. The behavior that mattered was how metadata ENOENT results were handled, over and over, during Python and Torch startup through the Copper mount. Some of those ENOENT results were legitimate optional-loader probes, but Copper was also redoing work for negative lookups on the same exact path.

Changes Introduced by That Investigation

The investigation produced four important runtime changes:

  • path-status coordination cleanup for completed entries

  • root-only metadata ENOENT TTL

  • broader exact-path metadata ENOENT TTL reuse

  • configurable metadata ENOENT TTL through -md_enoent_ttl_ms and launch_copper.sh -E <value>
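As a rough illustration of the exact-path ENOENT TTL idea, the sketch below keeps a per-path expiry time and answers "still known missing" only while the entry is fresh. It is a minimal Python sketch, not Copper's implementation: the class name EnoentTtlCache, its methods, and the injectable clock are all hypothetical, and the assumption that the TTL is expressed in milliseconds is taken only from the -md_enoent_ttl_ms flag name.

```python
import time
from typing import Callable, Dict


class EnoentTtlCache:
    """Hypothetical exact-path negative cache; not Copper's actual code."""

    def __init__(self, ttl_ms: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_ms / 1000.0
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._expiry: Dict[str, float] = {}  # exact path -> expiry time (seconds)

    def record_missing(self, path: str) -> None:
        # Remember that this exact path just returned ENOENT.
        # A TTL of zero (or less) disables negative caching entirely.
        if self.ttl_s > 0:
            self._expiry[path] = self.clock() + self.ttl_s

    def is_known_missing(self, path: str) -> bool:
        # True only while the entry for this exact path is still fresh.
        expiry = self._expiry.get(path)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expiry[path]  # expired: the next lookup must recheck
            return False
        return True
```

Matching on the exact path string means a cached miss for one path says nothing about its parent, its children, or its siblings, which keeps the cache from hiding files that are created elsewhere in the tree.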

Operational Takeaway

The metadata ENOENT TTL suppresses repeated rechecks of the same exact missing path within the TTL window. It does not eliminate the legitimate missing-path metadata events generated by Python, Conda, or dynamic loader behavior, because each distinct missing path still has to be checked at least once.
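The effect can be made concrete by replaying a small lookup trace and counting how many lookups would still need a real check. This is an illustrative sketch only: count_backend_checks, the trace format, and the example paths are hypothetical, and every lookup in the trace is assumed to resolve to ENOENT.

```python
from typing import Dict, Iterable, Tuple


def count_backend_checks(trace: Iterable[Tuple[int, str]], ttl_ms: int) -> int:
    """Count lookups that are not suppressed by an exact-path ENOENT TTL.

    trace is a time-ordered sequence of (time_ms, path) lookups, all of
    which are assumed to be missing paths.
    """
    expiry: Dict[str, int] = {}  # exact path -> time_ms at which the miss expires
    checks = 0
    for t_ms, path in trace:
        if path in expiry and t_ms < expiry[path]:
            continue  # same exact path, still within the TTL: suppressed
        checks += 1  # first sighting or expired entry: a real check happens
        expiry[path] = t_ms + ttl_ms
    return checks


# Hypothetical startup trace: one path probed repeatedly, plus two other misses.
trace = [
    (0, "/env/lib/libfoo.so"),
    (10, "/env/lib/libfoo.so"),
    (20, "/env/lib/libfoo.so"),
    (30, "/env/lib64/libfoo.so"),
    (40, "/env/lib/python3/site-packages/opt_plugin.py"),
    (700, "/env/lib/libfoo.so"),
]

print(count_backend_checks(trace, ttl_ms=0))    # no TTL: every lookup is checked
print(count_backend_checks(trace, ttl_ms=500))  # repeats inside the window collapse
```

With this trace the count drops from 6 to 4: the two repeats of the first path inside the 500 ms window are suppressed, while the two other distinct paths and the lookup after expiry are each still checked.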

Current Relevance

Copper retains the configurable metadata ENOENT TTL. It is complementary to the startup, readiness, and address-book scaling work because it addresses a different class of startup overhead.