I am writing a console program that has to process a bunch of configuration files every time it is run. Since in some scenarios this is measured to be a significant portion of the entire execution time (more than half for big configs and small tasks), I decided to cache the preprocessed text configs in a binary format that is much faster to load.
To guard against a stale cache after someone updates the configs, I am comparing the `std::filesystem::last_write_time` of the binary cache and the original file. This logic can be summarized in the following minimal program:
#include <algorithm>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>

int main(int argc, const char** argv) {
    if (argc < 2) {
        std::printf("Usage: %s <NAME>\n", argc ? argv[0] : "");
        return 1;
    }
    const auto path = std::filesystem::path{ argv[1] };
    auto file = std::ifstream{ path };
    if (!file.good()) {
        // path.string() instead of path.c_str() so the %s argument is
        // char-based on all platforms (on Windows c_str() is wchar_t*)
        std::printf("File does not exist: '%s'\n", path.string().c_str());
        return 1;
    }
    // Don't worry about collisions with paths containing literal `#`,
    // the real program handles it with proper escapes
    const auto cache_path = [](auto s) {
        std::replace(s.begin(), s.end(), '/', '#');
        return std::filesystem::temp_directory_path() / s;
    }(path.string());
    auto cache_file = std::ifstream{ cache_path };
    if (cache_file.good()
        && std::filesystem::last_write_time(cache_path)
               > std::filesystem::last_write_time(path)) {
        const auto cache_content
            = std::string{ std::istream_iterator<char>{ cache_file },
                           std::istream_iterator<char>{} };
        std::printf("Cache hit: %s\n", cache_content.c_str());
    } else {
        std::printf("Cache miss\n");
    }
    // The real program does not reprocess on a cache hit;
    // here we do it to show the bug
    auto chars = std::unordered_map<char, size_t>{};
    std::for_each(
        std::istream_iterator<char>{ file },
        std::istream_iterator<char>{},
        [&chars](char c) { chars[c]++; }
    );
    const auto result = chars.size();
    std::printf("Processed: %zu\n", result);
    std::ofstream{ cache_path } << result;
}
The above test program works in the base case of writing directly to the file, but fails if another file is moved over it instead:
$ echo aaa > test.txt
$ ./alphabet test.txt
Cache miss
Processed: 1
$ ./alphabet test.txt
Cache hit: 1
Processed: 1
$ echo abc > test.txt
$ echo abcdef > test2.txt
$ ./alphabet test.txt
Cache miss
Processed: 3
$ ./alphabet test.txt
Cache hit: 3
Processed: 3
$ mv test2.txt test.txt
$ ./alphabet test.txt
Cache hit: 3
Processed: 6
How can I properly protect against this? I can think of a couple of options, none of which I particularly like:
- Calculating a file hash and comparing it to one stored in the cache. 100% reliable (barring hash collisions, which I don't worry about: https://stackoverflow.com/a/4014407/2894535), but I would rather avoid it, since it requires reading the entirety of the original file anyway, minimizing the cache's effectiveness.
- Checking the last write time of the parent directory, which (at least on Linux) updates any time a file is added, removed, or renamed. This works for moves, but carries a heavy cost of false positives whenever anything else in the directory is touched. Additionally, it still leaves a bug if entire directories get moved; protecting against that would require checking directories all the way to the root, making it pointless:
  $ echo abc > a/test.txt
  $ echo abcdef > b/test.txt
  $ ./alphabet a/test.txt
  $ rm -rf a/
  $ mv b/ a/
  $ ./alphabet a/test.txt
- Comparing the file's inode to the one stored in the cache. While quick and robust, there is no portable way to get it, and it may not be supported by all filesystems.
- Watching for file changes. Not portable without `#ifdef`'ing per OS or using external libraries, plus it requires some daemon or other server running between program invocations, and I am not sure it won't fail in some scenarios.
- Documenting that any "funny business" from the user requires running the program with `--invalidate-cache`. Half-joking.
- 1 I would go for (1) + check mtime and file size. Use fast hasher like xxHash. – Osyotr Commented Mar 22 at 13:36
- @Osyotr So file size makes sense as a quick check (if different, hash will be different), but why bother with time at all? If file hash is the same, then contents are the same, most likely someone edited file then rolled it back. – Dominik Kaszewski Commented Mar 22 at 13:41
- 1 I know you're looking for portable solutions, but another Linux-specific solution is to compare the file's ctime rather than mtime. Renaming a file updates its ctime. – Barmar Commented Mar 22 at 13:55
- if it's really the contents you're worried about, then a hash is the only guarantee content is identical; the rest are distractions. Why not incorporate the hash generation whenever the file(s) are edited ... – ticktalk Commented Mar 23 at 16:51
- I would use a hash (at least CRC32) on the first 1% of each configuration file (maybe at least 1024 bytes?) as a signature, when the file's timestamp shows a possible cache hit. This hash is obviously stored as part of binary data, as header. File should be read as binary, not as text or high-level structure (ex: JSON, XML, ...). In addition to file size and timestamp, it should be enough to ensure a proper cache hit, and it should be portable too, but you still have the problem of "where to store cache files". On Windows, I would have used an ADS on each config file to ensure consistency. – Wisblade Commented Mar 24 at 10:56
1 Answer
I decided to go with a combination of 1. and 5., plus the file size check suggested by @Osyotr. The user can select between "quick" and "strict" mode, plus a command to invalidate the cache.
- Store cache files with the filename `original.filesize + '-' + encode(original.fullname)`, where `encode` escapes any non-alphanumeric characters to ensure a valid filename. Storing this info in the filename allows quick mode to determine cache validity without reading either file.
- In either mode, check if a cache file with the corresponding name exists, implicitly checking that the file size matches the expected one.
- If it does, check that the cache file's modification date is newer than the original file's.
- In strict mode only, calculate the hash of the original and compare it to the hash stored in the first couple of bytes of the cache file.
- If all checks pass, use the cache file's contents.
- Delete all cache files matching the filename but not the size.
With the example program, here are some test cases and how they are handled in the two modes:
- `echo a > test.txt` - new file, no cache, both modes work
- `echo b > test.txt` - modifies the last write time, both modes work
- `echo abc > test2.txt && mv test2.txt test.txt` - last write time unmodified but the size changes, both modes work
- `echo xyz > test2.txt && mv test2.txt test.txt` - last write time and size unmodified, quick mode fails but strict mode works