std.Build.Cache: fix several bugs
Aside from adding comments to document the logic in `Cache.Manifest.hit` better, this commit fixes two serious bugs. The first, spotted by Andrew, is that when upgrading from a shared to an exclusive lock on the manifest file, we do not seek it back to the start. This is a simple fix.

The second is more subtle, and has to do with the computation of file digests. Broadly speaking, the goal of the main loop in `hit` is to iterate the files listed in the manifest file, and check if they've changed, based on stat and a file hash. While doing this, the `bin_digest` field of `std.Build.Cache.File`, which is initially `undefined`, is populated for all files, either straight from the manifest (if the stat matches) or recomputed from the file on-disk. This file digest is then used to update `man.hash.hasher`, which is building the final hash used as, for instance, the output directory name when the compiler emits into the cache directory.

When `hit` returns a cache miss, it is expected that `man.hash.hasher` includes the digests of all "initial files"; that is, those which have already been added with e.g. `addFilePath`, but not those which will later be added with `addFilePost` (even though the manifest file has told us about some such files). Previously, `hit` was using the `unhit` function to do this in a few cases. However, this is incorrect, because `unhit` assumes that all files already have their `bin_digest` field populated; that function is only valid to call *after* `hit` returns. Instead, we need to actually compute the hashes which haven't yet been populated.

Even if this logic had been working, there was still a bug here, because we called `unhit` when upgrading from a shared to an exclusive lock, writing the (potentially `undefined`) file digests, but the loop itself writes the file digests *again*! All in all, the hashing logic here was actually incredibly broken.

I've taken the opportunity to restructure this section of the code into what I think is a more readable format. A new function, `hitWithCurrentLock`, uses the open manifest file to try and find a cache hit. It returns a tagged union which, in the miss case, tells the caller (`hit`) how many files already have their hash populated. This avoids redundant work recomputing the same hash multiple times in situations where the lock needs upgrading. This also eliminates the outer loop from `hit`, which was a little confusing because it iterated no more than twice!

The bugs fixed here could manifest in several different ways depending on how contended file locks were satisfied. Most notably, on a cache miss, the Zig compiler might have written the compilation output to the incorrect directory (because it incorrectly constructed a hash using `undefined` or repeated file digests), resulting in all future hits on this manifest causing `error.FileNotFound`. This is #23110.

I have been able to reproduce #23110 on `master`, and have not been able to after this commit, so I am relatively sure this commit resolves that issue.

Resolves: #23110
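To make the digest bug concrete, here is a minimal sketch. It is not the `std.Build.Cache` implementation: Blake3 merely stands in for the cache's `hasher_init`, and `combinedDigest` is a hypothetical helper invented for this example. The hash that ends up naming the output directory is built from `bin_digest` followed by each input file's digest exactly once, in order, so feeding in an `undefined` digest, or the same digest twice (as the broken `unhit` interaction could), produces a directory name that later cache hits will never compute:

    const std = @import("std");
    const Blake3 = std.crypto.hash.Blake3;
    const Digest = [Blake3.digest_length]u8;

    // Sketch only: combine an initial digest with each input file's digest, in order.
    fn combinedDigest(bin_digest: Digest, file_digests: []const Digest) Digest {
        var hasher = Blake3.init(.{});
        hasher.update(&bin_digest);
        for (file_digests) |*d| hasher.update(d);
        var out: Digest = undefined;
        hasher.final(&out);
        return out;
    }

    test "duplicated file digests change the combined digest" {
        const base: Digest = [_]u8{0xaa} ** Blake3.digest_length;
        const file: Digest = [_]u8{0x01} ** Blake3.digest_length;
        const once = combinedDigest(base, &.{file});
        const twice = combinedDigest(base, &.{ file, file });
        // Hashing the same file digest twice yields a different result, i.e. a
        // different cache output directory than a correct computation would use.
        try std.testing.expect(!std.mem.eql(u8, &once, &twice));
    }

The point is only that the combined digest is sensitive to order and multiplicity, which is why `hit` must feed each file digest into `self.hash.hasher` exactly once.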
parent 53f298cffa
commit 160f2dabed
3 changed files with 234 additions and 166 deletions
lib/std/Build/Cache.zig

@@ -337,6 +337,7 @@ pub const Manifest = struct {
         manifest_create: fs.File.OpenError,
         manifest_read: fs.File.ReadError,
         manifest_lock: fs.File.LockError,
+        manifest_seek: fs.File.SeekError,
         file_open: FileOp,
         file_stat: FileOp,
         file_read: FileOp,
@@ -488,7 +489,6 @@ pub const Manifest = struct {
     /// option, one may call `toOwnedLock` to obtain a smaller object which can represent
     /// the lock. `deinit` is safe to call whether or not `toOwnedLock` has been called.
     pub fn hit(self: *Manifest) HitError!bool {
-        const gpa = self.cache.gpa;
         assert(self.manifest_file == null);
 
         self.diagnostic = .none;
@@ -501,12 +501,12 @@ pub const Manifest = struct {
 
         self.hex_digest = binToHex(bin_digest);
 
-        self.hash.hasher = hasher_init;
-        self.hash.hasher.update(&bin_digest);
 
         @memcpy(manifest_file_path[0..self.hex_digest.len], &self.hex_digest);
         manifest_file_path[hex_digest_len..][0..ext.len].* = ext.*;
 
+        // We'll try to open the cache with an exclusive lock, but if that would block
+        // and `want_shared_lock` is set, a shared lock might be sufficient, so we'll
+        // open with a shared lock instead.
         while (true) {
             if (self.cache.manifest_dir.createFile(&manifest_file_path, .{
                 .read = true,
@@ -575,7 +575,104 @@ pub const Manifest = struct {
         self.want_refresh_timestamp = true;
 
         const input_file_count = self.files.entries.len;
-        while (true) : (self.unhit(bin_digest, input_file_count)) {
+
+        // We're going to construct a second hash. Its input will begin with the digest we've
+        // already computed (`bin_digest`), and then it'll have the digests of each input file,
+        // including "post" files (see `addFilePost`). If this is a hit, we learn the set of "post"
+        // files from the manifest on disk. If this is a miss, we'll learn those from future calls
+        // to `addFilePost` etc. As such, the state of `self.hash.hasher` after this function
+        // depends on whether this is a hit or a miss.
+        //
+        // If we return `true` indicating a cache hit, then `self.hash.hasher` must already include
+        // the digests of the "post" files, so the caller can call `final`. Otherwise, on a cache
+        // miss, `self.hash.hasher` will include the digests of all non-"post" files -- that is,
+        // the ones we've already been told about. The rest will be discovered through calls to
+        // `addFilePost` etc, which will update the hasher. After all files are added, the user can
+        // use `final`, and will at some point `writeManifest` the file list to disk.
+
+        self.hash.hasher = hasher_init;
+        self.hash.hasher.update(&bin_digest);
+
+        hit: {
+            const file_digests_populated: usize = digests: {
+                switch (try self.hitWithCurrentLock()) {
+                    .hit => break :hit,
+                    .miss => |m| if (!try self.upgradeToExclusiveLock()) {
+                        break :digests m.file_digests_populated;
+                    },
+                }
+                // We've just had a miss with the shared lock, and upgraded to an exclusive lock. Someone
+                // else might have modified the digest, so we need to check again before deciding to miss.
+                // Before trying again, we must reset `self.hash.hasher` and `self.files`.
+                // This is basically just the first half of `unhit`.
+                self.hash.hasher = hasher_init;
+                self.hash.hasher.update(&bin_digest);
+                while (self.files.count() != input_file_count) {
+                    var file = self.files.pop().?;
+                    file.key.deinit(self.cache.gpa);
+                }
+                // Also, seek the file back to the start.
+                self.manifest_file.?.seekTo(0) catch |err| {
+                    self.diagnostic = .{ .manifest_seek = err };
+                    return error.CacheCheckFailed;
+                };
+
+                switch (try self.hitWithCurrentLock()) {
+                    .hit => break :hit,
+                    .miss => |m| break :digests m.file_digests_populated,
+                }
+            };
+
+            // This is a guaranteed cache miss. We're almost ready to return `false`, but there's a
+            // little bookkeeping to do first. The first `file_digests_populated` entries in `files`
+            // have their `bin_digest` populated; there may be some left in `input_file_count` which
+            // we'll need to populate ourselves. Other than that, this is basically `unhit`.
+            self.manifest_dirty = true;
+            self.hash.hasher = hasher_init;
+            self.hash.hasher.update(&bin_digest);
+            while (self.files.count() != input_file_count) {
+                var file = self.files.pop().?;
+                file.key.deinit(self.cache.gpa);
+            }
+            for (self.files.keys(), 0..) |*file, idx| {
+                if (idx < file_digests_populated) {
+                    // `bin_digest` is already populated by `hitWithCurrentLock`, so we can use it directly.
+                    self.hash.hasher.update(&file.bin_digest);
+                } else {
+                    self.populateFileHash(file) catch |err| {
+                        self.diagnostic = .{ .file_hash = .{
+                            .file_index = idx,
+                            .err = err,
+                        } };
+                        return error.CacheCheckFailed;
+                    };
+                }
+            }
+            return false;
+        }
+
+        if (self.want_shared_lock) {
+            self.downgradeToSharedLock() catch |err| {
+                self.diagnostic = .{ .manifest_lock = err };
+                return error.CacheCheckFailed;
+            };
+        }
+
+        return true;
+    }
+
+    /// Assumes that `self.hash.hasher` has been updated only with the original digest, that
+    /// `self.files` contains only the original input files, and that `self.manifest_file.?` is
+    /// seeked to the start of the file.
+    fn hitWithCurrentLock(self: *Manifest) HitError!union(enum) {
+        hit,
+        miss: struct {
+            file_digests_populated: usize,
+        },
+    } {
+        const gpa = self.cache.gpa;
+        const input_file_count = self.files.entries.len;
+
         const file_contents = self.manifest_file.?.reader().readAllAlloc(gpa, manifest_file_size_max) catch |err| switch (err) {
             error.OutOfMemory => return error.OutOfMemory,
             error.StreamTooLong => return error.OutOfMemory,
@@ -589,20 +686,12 @@ pub const Manifest = struct {
         var any_file_changed = false;
         var line_iter = mem.tokenizeScalar(u8, file_contents, '\n');
         var idx: usize = 0;
-        if (if (line_iter.next()) |line| !std.mem.eql(u8, line, manifest_header) else true) {
-            if (try self.upgradeToExclusiveLock()) continue;
-            self.manifest_dirty = true;
-            while (idx < input_file_count) : (idx += 1) {
-                const ch_file = &self.files.keys()[idx];
-                self.populateFileHash(ch_file) catch |err| {
-                    self.diagnostic = .{ .file_hash = .{
-                        .file_index = idx,
-                        .err = err,
-                    } };
-                    return error.CacheCheckFailed;
-                };
-            }
-            return false;
+        const header_valid = valid: {
+            const line = line_iter.next() orelse break :valid false;
+            break :valid std.mem.eql(u8, line, manifest_header);
+        };
+        if (!header_valid) {
+            return .{ .miss = .{ .file_digests_populated = 0 } };
         }
         while (line_iter.next()) |line| {
             defer idx += 1;
@@ -674,8 +763,8 @@ pub const Manifest = struct {
             const dir = self.cache.prefixes()[pp.prefix].handle;
             const this_file = dir.openFile(pp.sub_path, .{ .mode = .read_only }) catch |err| switch (err) {
                 error.FileNotFound => {
-                    if (try self.upgradeToExclusiveLock()) continue;
-                    return false;
+                    // Every digest before this one has been populated successfully.
+                    return .{ .miss = .{ .file_digests_populated = idx } };
                 },
                 else => |e| {
                     self.diagnostic = .{ .file_open = .{
@@ -699,8 +788,6 @@ pub const Manifest = struct {
             const inode_match = actual_stat.inode == cache_hash_file.stat.inode;
 
             if (!size_match or !mtime_match or !inode_match) {
-                self.manifest_dirty = true;
-
                 cache_hash_file.stat = .{
                     .size = actual_stat.size,
                     .mtime = actual_stat.mtime,
@@ -734,40 +821,21 @@ pub const Manifest = struct {
             }
         }
 
-        if (any_file_changed) {
-            if (try self.upgradeToExclusiveLock()) continue;
-            // cache miss
-            // keep the manifest file open
-            self.unhit(bin_digest, input_file_count);
-            return false;
+        // If the manifest was somehow missing one of our input files, or if any file hash has changed,
+        // then this is a cache miss. However, we have successfully populated some or all of the file
+        // digests.
+        if (any_file_changed or idx < input_file_count) {
+            return .{ .miss = .{ .file_digests_populated = idx } };
         }
 
-        if (idx < input_file_count) {
-            if (try self.upgradeToExclusiveLock()) continue;
-            self.manifest_dirty = true;
-            while (idx < input_file_count) : (idx += 1) {
-                self.populateFileHash(&self.files.keys()[idx]) catch |err| {
-                    self.diagnostic = .{ .file_hash = .{
-                        .file_index = idx,
-                        .err = err,
-                    } };
-                    return error.CacheCheckFailed;
-                };
-            }
-            return false;
-        }
-
-        if (self.want_shared_lock) {
-            self.downgradeToSharedLock() catch |err| {
-                self.diagnostic = .{ .manifest_lock = err };
-                return error.CacheCheckFailed;
-            };
-        }
-
-        return true;
-    }
+        return .hit;
     }
 
+    /// Reset `self.hash.hasher` to the state it should be in after `hit` returns `false`.
+    /// The hasher contains the original input digest, and all original input file digests (i.e.
+    /// not including post files).
+    /// Assumes that `bin_digest` is populated for all files up to `input_file_count`. As such,
+    /// this is not necessarily safe to call within `hit`.
     pub fn unhit(self: *Manifest, bin_digest: BinDigest, input_file_count: usize) void {
         // Reset the hash.
         self.hash.hasher = hasher_init;
lib/std/Build/Step.zig

@@ -759,7 +759,7 @@ fn failWithCacheError(s: *Step, man: *const Build.Cache.Manifest, err: Build.Cac
     switch (err) {
         error.CacheCheckFailed => switch (man.diagnostic) {
             .none => unreachable,
-            .manifest_create, .manifest_read, .manifest_lock => |e| return s.fail("failed to check cache: {s} {s}", .{
+            .manifest_create, .manifest_read, .manifest_lock, .manifest_seek => |e| return s.fail("failed to check cache: {s} {s}", .{
                 @tagName(man.diagnostic), @errorName(e),
             }),
            .file_open, .file_stat, .file_read, .file_hash => |op| {
src/Compilation.zig

@@ -2129,7 +2129,7 @@ pub fn update(comp: *Compilation, main_progress_node: std.Progress.Node) !void {
     const is_hit = man.hit() catch |err| switch (err) {
         error.CacheCheckFailed => switch (man.diagnostic) {
             .none => unreachable,
-            .manifest_create, .manifest_read, .manifest_lock => |e| return comp.setMiscFailure(
+            .manifest_create, .manifest_read, .manifest_lock, .manifest_seek => |e| return comp.setMiscFailure(
                 .check_whole_cache,
                 "failed to check cache: {s} {s}",
                 .{ @tagName(man.diagnostic), @errorName(e) },