
fs-cache: Add Cache Struct #95

Open · wants to merge 10 commits into base: main

Conversation

Pushkarm029 (Collaborator):

  • Cache is not implemented yet; it will use MemoryLimitedStorage under the hood.
  • MemoryLimitedStorage is a simplified version of FolderStorage that keeps a small number of (K, V) pairs in memory.
  • Currently it uses a Least Recently Used (LRU) eviction policy, backed by a LinkedHashMap, to decide which (K, V) pairs stay in memory (see the sketch below).
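
Purely for orientation, here is a minimal, self-contained sketch of that LRU idea using the linked-hash-map crate; the actual MemoryLimitedStorage in this PR adds disk persistence and will differ in details.

use linked_hash_map::LinkedHashMap;

// Toy LRU map: capacity is a number of entries; the front of the map is the
// least recently used entry.
struct TinyLru<K: std::hash::Hash + Eq, V> {
    map: LinkedHashMap<K, V>,
    capacity: usize,
}

impl<K: std::hash::Hash + Eq, V> TinyLru<K, V> {
    fn new(capacity: usize) -> Self {
        Self { map: LinkedHashMap::new(), capacity }
    }

    fn get(&mut self, key: &K) -> Option<&mut V> {
        // get_refresh moves the entry to the back, marking it most recently used.
        self.map.get_refresh(key)
    }

    fn put(&mut self, key: K, value: V) {
        self.map.insert(key, value);
        while self.map.len() > self.capacity {
            // Evict the least recently used entry.
            self.map.pop_front();
        }
    }
}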

Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 603c1db

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.1±0.55µs 247.3±3.21µs -0.72%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.32µs 15.5±0.12µs -0.64%
blake3_resource_id_creation/compute_from_bytes:small 1363.7±5.28ns 1366.1±3.36ns +0.18%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 196.7±0.46µs 196.9±0.54µs +0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1704.6±6.46µs 1710.0±15.22µs +0.32%
crc32_resource_id_creation/compute_from_bytes:large 86.7±0.52µs 87.0±1.98µs +0.35%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.01µs 5.4±0.02µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.27ns 92.5±1.74ns +0.22%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.6±1.54µs 64.6±0.69µs 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 912.7±4.68µs 916.7±5.12µs +0.44%
resource_index/index_build//tmp/ark-fs-index-benchmarksPQj4Mh 110.1±1.24ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksg00L2f 113.0±2.09ms N/A N/A
resource_index/index_get_resource_by_id 98.6±1.25ns 98.1±3.16ns -0.51%
resource_index/index_get_resource_by_path 55.1±3.28ns 55.9±3.43ns +1.45%
resource_index/index_update_all 1092.6±26.67ms 1121.2±40.66ms +2.62%
resource_index/index_update_one 671.6±17.92ms 659.4±16.29ms -1.82%

Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 20fe5a6

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.5±0.43µs 250.9±0.79µs +0.97%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.14µs 15.6±0.06µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1366.3±6.88ns 1351.3±8.54ns -1.10%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.1±0.41µs 196.7±0.68µs -0.20%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1701.8±4.62µs 1706.9±19.45µs +0.30%
crc32_resource_id_creation/compute_from_bytes:large 86.7±0.19µs 86.7±0.35µs 0.00%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.01µs 5.4±0.02µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.52ns 92.3±0.29ns -0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.4±0.26µs 64.8±0.26µs +0.62%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 912.0±4.12µs 913.3±1.83µs +0.14%
resource_index/index_build//tmp/ark-fs-index-benchmarksnxxYTn 107.7±1.31ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksvuyxiZ 106.4±1.91ms N/A N/A
resource_index/index_get_resource_by_id 100.7±1.10ns 101.1±1.20ns +0.40%
resource_index/index_get_resource_by_path 53.6±0.85ns 59.6±3.74ns +11.19%
resource_index/index_update_all 1121.3±43.90ms 1147.9±45.96ms +2.37%
resource_index/index_update_one 690.6±30.43ms 696.9±24.44ms +0.91%

Pushkarm029 requested a review from kirillt on November 19, 2024, 18:49
Comment on lines 50 to 53
/// Load most recent cached items into memory based on timestamps
pub fn load_recent(&mut self) -> Result<()> {
    self.storage.load_fs()
}
kirillt (Member) commented on Nov 20, 2024:

Actually, we don't need to expose this function.

Only the set/get API is needed.

The rest should happen under the hood:

  • any set should write both to memory and disk
  • a one-way sync from disk to memory happens when users get values
  • if we hit our own limit for bytes stored in the in-memory mapping, we erase the oldest entries from it
  • but entries are always stored on disk, so there is no need to sync from memory to disk explicitly

Primary usage scenario: keys are of type ResourceId

  1. App indexes a folder.
  2. App may populate the cache before using it, but it's not required.
  3. App will query caches by key:
    • if the entry is in memory already, that's great, we just return the value
    • otherwise, we check disk for entry with the requested key
    • if it is on disk, we add it to in-memory storage and return the value
    • otherwise, we return None
  4. The index can notify the app about recently discovered resources. The corresponding values may already be in the cache, but this is not required; the app can initialize values for new resources.

Secondary usage scenario: keys are of arbitrary type

This can be any deterministic computation.
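
A sketch of the described set/get flow using only std types (names are illustrative; the real MemoryLimitedStorage in this PR layers LRU eviction and a byte budget on top of this):

use std::{collections::HashMap, fs, path::PathBuf};

struct Sketch {
    root: PathBuf,
    memory: HashMap<String, Vec<u8>>,
}

impl Sketch {
    /// Write-through: disk is the source of truth, memory is only a cache.
    fn set(&mut self, key: &str, value: Vec<u8>) -> std::io::Result<()> {
        fs::write(self.root.join(key), &value)?;
        self.memory.insert(key.to_string(), value);
        Ok(())
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        // 1. Entry already in memory: just return it.
        if let Some(value) = self.memory.get(key) {
            return Some(value.clone());
        }
        // 2. Otherwise check disk; on a hit, promote the entry to memory.
        match fs::read(self.root.join(key)) {
            Ok(bytes) => {
                self.memory.insert(key.to_string(), bytes.clone());
                Some(bytes)
            }
            // 3. Not on disk either: return None.
            Err(_) => None,
        }
    }
}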

Comment on lines 55 to 58
/// Get number of items currently in memory
// pub fn memory_items(&self) -> usize {
// self.storage.memory_items()
// }
Member:

It would be a good idea to log such information at debug level. Also, we should count bytes, not the number of entries; some cache items can be huge, e.g. bitmaps.
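
For instance, assuming the label, current_memory_bytes, and memory_cache fields that appear in later snippets of this PR, eviction and load points could log something like:

log::debug!(
    "cache/{}: {} bytes across {} entries in memory",
    self.label,
    self.current_memory_bytes,
    self.memory_cache.len()
);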

Benchmark for 840a337

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.7±0.90µs 249.9±3.96µs +0.08%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.07µs 15.6±0.39µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1355.3±1.66ns 1345.4±1.59ns -0.73%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.6±2.23µs 196.5±1.78µs -0.56%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1689.4±10.06µs 1689.3±19.40µs -0.01%
crc32_resource_id_creation/compute_from_bytes:large 92.0±1.71µs 91.8±0.43µs -0.22%
crc32_resource_id_creation/compute_from_bytes:medium 5.7±0.02µs 5.7±0.03µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 96.4±1.08ns 96.3±0.36ns -0.10%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 65.2±0.36µs 65.2±0.24µs 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 911.5±6.57µs 909.8±12.85µs -0.19%
resource_index/index_build//tmp/ark-fs-index-benchmarkshSX1P0 112.8±1.64ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksrfjRBT 111.6±0.59ms N/A N/A
resource_index/index_get_resource_by_id 128.4±1.86ns 126.5±2.54ns -1.48%
resource_index/index_get_resource_by_path 53.1±0.87ns 53.4±1.43ns +0.56%
resource_index/index_update_all 1121.0±32.23ms 1125.9±45.68ms +0.44%
resource_index/index_update_one 692.9±29.07ms 680.3±28.80ms -1.82%

Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 50fe163

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.3±0.92µs 249.2±1.65µs +0.36%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.28µs 15.6±0.05µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1369.6±3.04ns 1377.0±1.78ns +0.54%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.0±0.93µs 197.1±2.58µs +0.05%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1698.9±9.96µs 1702.7±18.55µs +0.22%
crc32_resource_id_creation/compute_from_bytes:large 86.7±0.75µs 86.8±0.27µs +0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.04µs 5.4±0.01µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.48ns 92.4±0.82ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.3±0.21µs 64.5±0.74µs +0.31%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 913.1±8.90µs 910.9±7.09µs -0.24%
resource_index/index_build//tmp/ark-fs-index-benchmarksVXChWr 110.9±0.81ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksb93cF3 112.2±1.35ms N/A N/A
resource_index/index_get_resource_by_id 99.1±0.98ns 99.3±3.04ns +0.20%
resource_index/index_get_resource_by_path 54.6±1.22ns 55.6±1.78ns +1.83%
resource_index/index_update_all 1088.2±32.03ms 1123.7±41.04ms +3.26%
resource_index/index_update_one 678.0±17.77ms 667.3±16.55ms -1.58%

Pushkarm029 (Collaborator, Author):

Thank you for the review.

if path.exists() && path.is_dir() {
    return Some((path, None));
}
let Ok(path) = PathBuf::from_str(storage);
Member:

Does it panic if from_str returns Err?

Pushkarm029 (Collaborator, Author):

  • Even though PathBuf::from_str() has a return type of Result<Self, Self::Err>, it never fails, because a PathBuf is just a cross-platform wrapper around a path string.

  • The actual validation of whether the path exists or is valid on the filesystem happens later, when you call methods like exists() or is_dir().

#[stable(feature = "path_from_str", since = "1.32.0")]
impl FromStr for PathBuf {
    type Err = core::convert::Infallible;

    #[inline]
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(PathBuf::from(s))
    }
}
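
A tiny illustration of the point above: since the error type is Infallible, unwrapping can never panic, and PathBuf::from(s) is the equivalent error-free spelling.

use std::{path::PathBuf, str::FromStr};

let path = PathBuf::from_str("some/storage/path").unwrap(); // never fails
assert_eq!(path, PathBuf::from("some/storage/path"));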


pub fn get(&mut self, key: &K) -> Option<V> {
    // Check memory cache first - will update LRU order automatically
    if let Some(value) = self.memory_cache.get_refresh(key) {
Member:

That's pretty cool, I didn't know about the get_refresh function.

But I guess it works with O(N) complexity, right? Consider using a dedicated Rust crate for LRU to get O(1) complexity. This isn't a pressing issue; we can create an issue for it.
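
For comparison, a small example with the lru crate, whose get() refreshes recency and runs in O(1) on average, so no explicit get_refresh() call is needed there:

use std::num::NonZeroUsize;
use lru::LruCache;

fn main() {
    let mut cache: LruCache<&str, u32> = LruCache::new(NonZeroUsize::new(2).unwrap());
    cache.put("a", 1);
    cache.put("b", 2);
    assert_eq!(cache.get(&"a"), Some(&1)); // "a" becomes most recently used
    cache.put("c", 3);                     // evicts "b", the least recently used entry
    assert!(cache.get(&"b").is_none());
}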

// Try to load from disk
let file_path = self.path.join(format!("{}.json", key));
if file_path.exists() {
    // Doubt: update the file's modified time (on disk) on read to preserve LRU order across app restarts?
Member:

Let's track this feature and work on it later. Better to keep the implementation simple for the moment and avoid redundant state. By the way, we could also simply write the cached keys into a file and apply atomic versioning to it, so all peers would have the same view of the LRU order.


let new_timestamp = SystemTime::now();
file.set_modified(new_timestamp)?;
file.sync_all()?;
Member:

Do we really need it?

Pushkarm029 (Collaborator, Author):

No, not needed anymore, since we are no longer doing any kind of synchronization.

Pushkarm029 (Collaborator, Author):

I think we may need it: in fn load_fs, we compare timestamps to load recent files into memory. This could help us efficiently decide which files to load or skip, since it provides precise timing. However, it's a very minor improvement, so we can skip it.

// Write a single value to disk
fn write_value_to_disk(&mut self, key: &K, value: &V) -> Result<()> {
    let file_path = self.path.join(format!("{}.json", key));
    let mut file = File::create(&file_path)?;
Member:

Let's add a debug_assert that the file doesn't exist.

Member:

Also, we should use lightweight atomic writing to avoid dirty writes. Keep in mind the scenario where several ARK apps on the same device use the same folder and write to the cache in parallel.

I believe atomic versions would be excessive here, but I'm not 100% sure yet.
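
One lightweight option for such atomic writes is to write to a temporary file in the same directory and then rename() it over the target, since rename is atomic on the same filesystem. This is only a sketch; names are illustrative:

use std::{fs, io::Write, path::Path};

fn write_atomically(target: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = target.with_extension("tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // flush data before publishing the file
    }
    // Readers never observe a half-written file: they see either the old or the new content.
    fs::rename(&tmp, target)
}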

Comment on lines 198 to 200
let file_path = self.path.join(format!("{}.json", key));
let file = File::open(&file_path)?;
let value: V = serde_json::from_reader(file).map_err(|err| {
kirillt (Member) commented on Nov 26, 2024:

Values are not necessarily JSON; they can be arbitrary binaries (e.g. JPG images).

The app developer will decide the structure of the values, so we should not make any assumptions about it.
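
One way to stay format-agnostic (a sketch under assumed trait bounds, not this PR's API) is to move values as raw bytes and leave serialization to the app:

use std::{fs, path::Path};

fn write_value<V: AsRef<[u8]>>(path: &Path, value: &V) -> std::io::Result<()> {
    fs::write(path, value.as_ref())
}

fn read_value<V: From<Vec<u8>>>(path: &Path) -> std::io::Result<V> {
    Ok(V::from(fs::read(path)?))
}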

Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 42bb74c

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 253.2±1.52µs 249.8±1.76µs -1.34%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.08µs 15.5±0.05µs -0.64%
blake3_resource_id_creation/compute_from_bytes:small 1364.8±2.77ns 1361.8±12.07ns -0.22%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 196.8±0.38µs 197.4±0.97µs +0.30%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1708.0±5.92µs 1710.5±15.38µs +0.15%
crc32_resource_id_creation/compute_from_bytes:large 86.8±0.29µs 86.5±0.16µs -0.35%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.06µs 5.4±0.03µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.38ns 92.3±0.38ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.7±1.69µs 64.5±0.24µs -0.31%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 929.5±3.84µs 909.0±7.26µs -2.21%
resource_index/index_build//tmp/ark-fs-index-benchmarksA5AkUo 102.6±0.86ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksiZe8TK 106.8±2.63ms N/A N/A
resource_index/index_get_resource_by_id 114.2±5.84ns 101.5±3.15ns -11.12%
resource_index/index_get_resource_by_path 56.4±0.55ns 56.6±1.79ns +0.35%
resource_index/index_update_all 1088.6±30.71ms 1082.8±46.16ms -0.53%
resource_index/index_update_one 660.5±12.96ms 639.0±12.28ms -3.26%

Signed-off-by: Pushkar Mishra <[email protected]>
github-actions bot commented Dec 1, 2024

Benchmark for c7341e8

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.7±0.71µs 248.1±1.25µs -0.24%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.04µs 15.6±0.19µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1365.8±2.31ns 1365.3±2.84ns -0.04%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.5±0.62µs 197.4±0.51µs -0.05%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1700.2±4.80µs 1707.8±15.05µs +0.45%
crc32_resource_id_creation/compute_from_bytes:large 86.6±0.31µs 86.6±0.34µs 0.00%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.02µs 5.4±0.02µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.35ns 92.3±0.36ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 65.1±0.66µs 65.5±2.18µs +0.61%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 989.3±3.53µs 915.9±2.34µs -7.42%
resource_index/index_build//tmp/ark-fs-index-benchmarks8EW2uf 104.7±2.28ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksz7qZYD 110.6±2.78ms N/A N/A
resource_index/index_get_resource_by_id 98.5±0.50ns 98.8±1.41ns +0.30%
resource_index/index_get_resource_by_path 52.9±0.78ns 53.4±1.23ns +0.95%
resource_index/index_update_all 1096.8±38.35ms 1118.9±49.79ms +2.01%
resource_index/index_update_one 668.8±25.60ms 667.1±20.70ms -0.25%

Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
github-actions bot commented Dec 1, 2024

Benchmark for d698fcf

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 251.2±6.65µs 251.8±5.32µs +0.24%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.04µs 15.6±0.18µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1365.2±38.12ns 1356.3±10.87ns -0.65%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 198.4±5.47µs 196.9±0.99µs -0.76%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1713.0±16.08µs 1724.9±45.91µs +0.69%
crc32_resource_id_creation/compute_from_bytes:large 87.1±1.95µs 87.0±1.61µs -0.11%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.14µs 5.4±0.05µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.36ns 92.4±0.36ns +0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.7±0.26µs 64.5±0.31µs -0.31%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 916.4±15.36µs 914.6±8.60µs -0.20%
resource_index/index_build//tmp/ark-fs-index-benchmarkscqgOio 111.8±1.02ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksosCgHK 112.9±2.86ms N/A N/A
resource_index/index_get_resource_by_id 108.4±3.05ns 117.9±3.55ns +8.76%
resource_index/index_get_resource_by_path 54.3±2.07ns 53.4±0.84ns -1.66%
resource_index/index_update_all 1111.4±33.86ms 1131.9±44.19ms +1.84%
resource_index/index_update_one 670.8±23.59ms 677.0±18.97ms +0.92%

Comment on lines +64 to +68
// TODO: NEED FIX
memory_cache: LruCache::new(
    NonZeroUsize::new(max_memory_bytes)
        .expect("Capacity can't be zero"),
),
Pushkarm029 (Collaborator, Author) commented on Dec 1, 2024:

LruCache requires the capacity (a number of items) to be specified during initialization. However, our Cache is designed to be limited by max_memory_bytes. So my question is: what would be the best way to initialize the LruCache?

Note: in all other functions, we are already comparing based on the number of bytes, not the number of items.

I think we could add another parameter (max_items) of type Option<usize>, with a default of 100.

Collaborator:

I think the number of items should be left up to the developer calling the function. Instead of taking max_memory_bytes as an argument, we could take max_memory_items. This would require redesigning the implementation to focus on the number of items rather than memory size, but it would give developers the flexibility to decide based on the average size of the items they store.

If prioritizing memory size over the number of items is a hard requirement, then I can think of two options:

  • We could implement our own version of LruCache 
  • Or, LruCache has a resize() method, and we could use this to resize the cache based on other metadata we track

Also, I looked into uluru, and it uses the number of items to initialize the cache as well. Just mentioning this in case you were considering it.

Member:

Guys, what about this? https://docs.rs/lru-mem/latest/lru_mem/

But it has only 3 stars on GitHub..

Collaborator:

It's actually an interesting option and would have been a perfect fit 😃

I wouldn't recommend it though, because if we find any issues in the crate later, we'd have to fork it and fix the problem ourselves, and we're not familiar with the code. Plus, since it's not actively maintained or widely used, there wouldn't be anyone around to help us either.
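
For reference, a hand-rolled byte budget on top of the item-based lru crate (roughly what this PR's MemoryLimitedStorage does further down) could look like this; the type and field names are illustrative:

use lru::LruCache;

struct ByteLimitedLru {
    entries: LruCache<String, Vec<u8>>,
    current_bytes: usize,
    max_bytes: usize,
}

impl ByteLimitedLru {
    fn new(max_bytes: usize) -> Self {
        Self { entries: LruCache::unbounded(), current_bytes: 0, max_bytes }
    }

    fn put(&mut self, key: String, value: Vec<u8>) {
        let size = value.len();
        // Evict least recently used entries until the new value fits.
        while self.current_bytes + size > self.max_bytes {
            match self.entries.pop_lru() {
                Some((_, old)) => self.current_bytes -= old.len(),
                None => break, // the value alone exceeds the whole budget
            }
        }
        if size <= self.max_bytes {
            self.current_bytes += size;
            if let Some(old) = self.entries.put(key, value) {
                // The key already existed: stop counting the replaced value.
                self.current_bytes -= old.len();
            }
        }
    }
}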

github-actions bot commented Dec 1, 2024

Benchmark for a683a1e

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.1±2.89µs 251.9±1.44µs +1.53%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.03µs 15.5±0.05µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1365.5±25.41ns 1359.4±10.95ns -0.45%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.2±0.51µs 199.2±2.51µs +1.01%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1717.8±7.57µs 1715.3±26.78µs -0.15%
crc32_resource_id_creation/compute_from_bytes:large 86.8±0.40µs 86.8±0.63µs 0.00%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.08µs 5.4±0.02µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.31ns 92.8±2.31ns +0.54%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.3±0.33µs 64.4±0.27µs +0.16%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 914.7±18.73µs 920.0±16.67µs +0.58%
resource_index/index_build//tmp/ark-fs-index-benchmarksOmbnwj 103.7±1.25ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksrimMQB 103.9±2.68ms N/A N/A
resource_index/index_get_resource_by_id 106.9±3.64ns 99.4±3.23ns -7.02%
resource_index/index_get_resource_by_path 52.7±0.54ns 53.6±0.72ns +1.71%
resource_index/index_update_all 1080.8±30.30ms 1086.0±37.89ms +0.48%
resource_index/index_update_one 650.5±15.88ms 659.1±24.05ms +1.32%

github-actions bot commented Dec 1, 2024

Benchmark for 92357c6

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.6±2.33µs 251.2±1.34µs +1.05%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.06µs 15.6±0.05µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1363.7±5.39ns 1366.4±6.19ns +0.20%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.0±1.64µs 197.4±1.60µs +0.20%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1707.8±7.37µs 1710.1±11.30µs +0.13%
crc32_resource_id_creation/compute_from_bytes:large 86.8±0.23µs 86.6±0.35µs -0.23%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.01µs 5.4±0.02µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.37ns 92.3±0.18ns -0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 66.1±0.16µs 64.6±1.60µs -2.27%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 913.4±2.26µs 914.3±4.10µs +0.10%
resource_index/index_build//tmp/ark-fs-index-benchmarksFMPQVr 106.1±0.98ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksQNMEsB 108.6±1.48ms N/A N/A
resource_index/index_get_resource_by_id 106.8±1.16ns 102.4±7.45ns -4.12%
resource_index/index_get_resource_by_path 53.8±1.29ns 53.1±1.25ns -1.30%
resource_index/index_update_all 1081.6±25.74ms 1085.1±36.51ms +0.32%
resource_index/index_update_one 651.6±14.25ms 660.5±13.79ms +1.37%

Comment on lines +340 to +353
// Remove oldest entries until we have space for new value
while self.current_memory_bytes + size > self.max_memory_bytes {
    let (_, old_entry) = self
        .memory_cache
        .pop_lru()
        .expect("Cache should have entries to evict");
    debug_assert!(
        self.current_memory_bytes >= old_entry.size,
        "Memory tracking inconsistency detected"
    );
    self.current_memory_bytes = self
        .current_memory_bytes
        .saturating_sub(old_entry.size);
}
Collaborator:

But yeah, I think we should remove this code at all costs.

It's currently undermining the purpose of using the external LRU cache crate. If there's absolutely no other way around this, then we may need to implement our own LRU cache solution.

This operation should be O(1).

Comment on lines +14 to +17
struct CacheEntry<V> {
    value: V,
    size: usize,
}
tareknaser (Collaborator) commented on Dec 6, 2024:

Why do we need to store the size of the value? Can’t we just read it from fs when needed? I don’t see it being read often.

If it’s for convenience to avoid I/O calls…

Member:

We need to track memory consumption in bytes to be precise about when to offload values, and to support large values too. We probably can't avoid keeping the value sizes in memory; otherwise, when we hit the limit, we wouldn't know how many values we need to offload.

However, that's where we could split the crate into 2 flavours:

  1. dynamically-sized values e.g. byte vectors and text strings
  2. statically-sized values e.g. integers

For the 2nd flavour we could utilize some standard Rust trait.

Is there a way to have these 2 flavours combined nicely in a single crate?
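
A sketch of one possible answer, using a made-up trait for "how many bytes does this value occupy"; dynamically-sized types report their payload length, fixed-size types fall back to std::mem::size_of:

trait ValueSize {
    fn value_size(&self) -> usize;
}

// Flavour 1: dynamically-sized values.
impl ValueSize for Vec<u8> {
    fn value_size(&self) -> usize {
        self.len()
    }
}

impl ValueSize for String {
    fn value_size(&self) -> usize {
        self.len()
    }
}

// Flavour 2: statically-sized values.
impl ValueSize for u64 {
    fn value_size(&self) -> usize {
        std::mem::size_of::<u64>()
    }
}

With such a bound on V, the storage could charge value_size() bytes against its budget regardless of flavour; a blanket impl for fixed-size types would conflict with the Vec/String impls, so per-type impls (or a derive macro) would be the usual workaround.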

Collaborator:

We need to track memory consumption in bytes to be precise about when to offload values...

That's fine, but we're actually reading the file size from the disk again in the get_file_size method, even though we already have it stored in the metadata. That seems a bit wasteful. Check out the next comment for more on this.

  1. dynamically-sized values e.g. byte vectors and text strings

That's a good point. I completely missed the dynamic types aspect when I looked at this. Now it makes a lot more sense why we need to track the data size instead of just the number of items.

Is there a way to have these 2 flavours combined nicely in a single crate?

If we're dealing with types that have a fixed size, like usize, it doesn't really matter whether we count how many items there are or the total size they take up. But this completely breaks with the first flavour you mentioned (dynamically-sized values).

The only solution I can think of right now is to treat all types of data as if they were dynamically sized: keep track of how much memory they use instead of how many items there are. But that brings up the question of how to do this in a clean way.

Main thread: #95 (comment)

Comment on lines +326 to +327
log::debug!("cache/{}: caching in memory for key {}", self.label, key);
let size = self.get_file_size(key)?;
Collaborator:

... then we would essentially be defeating the purpose here, as we're reading the file size from disk again

Comment on lines +319 to +321
fn get_file_size(&self, key: &K) -> Result<usize> {
    Ok(fs::metadata(self.path.join(key.to_string()))?.len() as usize)
}
Collaborator:

This doc comment is misleading. We don’t actually check the memory cache for size information first, but I think we should.

Comment on lines 55 to +57

/// Writes a serializable value to a file and returns the timestamp of the write
pub fn write_json_file<T: Serialize>(
Collaborator:

This function doesn't really return the timestamp of the write.

Comment on lines +72 to +76
pub fn extract_key_from_file_path<K>(
    label: &str,
    path: &Path,
    include_extension: bool,
) -> Result<K>
Collaborator:

Thanks for moving this function to the utils module. It definitely feels like the right place for it.

Could you also add a doc comment for it?

@@ -0,0 +1 @@
pub mod cache;
Collaborator:

All of the following comments are just nitpicks, so feel free to ignore them.

I think it’s better to only export the Cache struct. That way, when we import the crate, we can use Cache directly without needing to do fs_cache::cache::Cache. It would be fs_cache::Cache instead.

Suggested change
pub mod cache;
mod cache;
pub use cache::Cache;

Comment on lines +404 to +418
#[fixture]
fn temp_dir() -> TempDir {
    TempDir::new("tmp").expect("Failed to create temporary directory")
}

#[fixture]
fn cache(temp_dir: TempDir) -> Cache<String, TestValue> {
    Cache::new(
        "test".to_string(),
        temp_dir.path(),
        1024 * 1024, // 1MB
        false,
    )
    .expect("Failed to create cache")
}
Collaborator:

I feel like using fixtures here might be a bit overkill, but maybe you have plans to expand on this in the future. Personally, I think we could just create the temp_dir and cache instances directly in each test function, or use a helper function, which would make the code more readable.
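
For what it's worth, the helper-function alternative might look like this sketch (constructor arguments copied from the fixture above; the TempDir has to be returned too so the directory outlives the cache):

fn new_test_cache() -> (TempDir, Cache<String, TestValue>) {
    let temp_dir = TempDir::new("tmp").expect("Failed to create temporary directory");
    let cache = Cache::new(
        "test".to_string(),
        temp_dir.path(),
        1024 * 1024, // 1MB
        false,
    )
    .expect("Failed to create cache");
    (temp_dir, cache)
}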

Comment on lines +355 to +362
// Add new value and update size
self.memory_cache.put(
    key.clone(),
    CacheEntry {
        value: value.clone(),
        size,
    },
);
Collaborator:

A small suggestion: we could use push here instead of put. According to the docs, push does the following: ‘If an entry with key k already exists in the cache or another cache entry is removed (due to the LRU’s capacity), then it returns the old entry’s key-value pair.’

It would be a nice touch to log the evicted key, if there is one.
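
Roughly what that suggestion amounts to, reusing the names from the snippet above (note that push also returns the old pair when the same key is overwritten, not only on capacity eviction):

if let Some((evicted_key, _evicted_entry)) = self.memory_cache.push(
    key.clone(),
    CacheEntry {
        value: value.clone(),
        size,
    },
) {
    log::debug!("cache/{}: evicted key {} from memory", self.label, evicted_key);
}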

| `data-json` | JSON serialization and deserialization |
| Package | Description |
| --------------- | ---------------------------------------- |
| `ark-cli` | The CLI tool to interact with ark crates |
Member:

Suggested change
| `ark-cli` | The CLI tool to interact with ark crates |
| `ark-cli` | The CLI tool to interact with ARK crates |

| `data-resource` | Resource hashing and ID construction |
| `fs-cache` | Memory and disk caching with LRU eviction |
| `fs-index` | Resource Index construction and updating |
| `fs-storage` | Filesystem storage for resources |
Member:

Suggested change
| `fs-storage` | Filesystem storage for resources |
| `fs-storage` | Key-value storage persisted on filesystem |

Comment on lines 124 to 131
// Sort by size before loading
file_metadata.sort_by(|a, b| b.1.cmp(&a.1));

// Clear existing cache
self.memory_cache.clear();
self.current_memory_bytes = 0;

// Load files that fit in memory
kirillt (Member) commented on Dec 11, 2024:

Interesting idea with pre-loading the cache 👍

We probably also need a switch so the app developer can turn it on or off.
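
Hypothetically, such a switch could be a constructor flag deciding whether load_fs() pre-populates memory; the boolean already passed to Cache::new in the test fixture earlier in this thread may play this role, but the parameter name here is an assumption:

pub fn new(label: String, path: &Path, max_memory_bytes: usize, preload: bool) -> Result<Self> {
    let mut storage = MemoryLimitedStorage::new(label, path, max_memory_bytes)?;
    if preload {
        // Warm the in-memory cache from disk, up to the byte budget.
        storage.load_fs()?;
    }
    Ok(Self { storage })
}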

}
}

// Sort by size before loading
file_metadata.sort_by(|a, b| b.1.cmp(&a.1));
// Sort by modified time (most recent first)
Member:

Actually, I'm not sure that pre-loading the most recently modified values would really be beneficial.

We could implement a more sophisticated approach that gathers query statistics and records them somewhere on disk for pre-loading in the future. But I would do that in a separate PR, not right now.

3 participants