fs-cache: Add Cache Struct #95
base: main
Conversation
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 603c1db
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 20fe5a6
fs-cache/src/cache.rs
Outdated
/// Load most recent cached items into memory based on timestamps
pub fn load_recent(&mut self) -> Result<()> {
    self.storage.load_fs()
}
Actually, we don't need to expose this function. Only the `set`/`get` API is needed.
The rest should happen under the hood:
- any `set` should write both to memory and disk
- one-way sync from disk to memory is needed when users `get` values
- if we hit our own limit for bytes stored in the in-memory mapping, we erase the oldest entries from it
- but entries are always stored on disk, so there is no need to sync from memory to disk explicitly

Primary usage scenario: keys are of type `ResourceId`
- App indexes a folder.
- App may populate the cache before using it, but it's not required.
- App will query caches by key:
  - if the entry is in memory already, that's great, we just return the value
  - otherwise, we check disk for an entry with the requested key
    - if it is on disk, we add it to in-memory storage and return the value
    - otherwise, we return `None`
- Index can notify the app about recently discovered resources. Corresponding values can be in the cache already, but this is not required. App can initialize values for new resources.
Secondary usage scenario: keys are of arbitrary type
Can be any deterministic computation.
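To make the intended surface concrete, here is a minimal, self-contained sketch of the `set`/`get` flow described above. The struct and field names are hypothetical, and a second in-memory map stands in for the on-disk folder; it only illustrates the write-through `set` and the one-way disk-to-memory sync on `get` (byte-limit eviction is omitted).

```rust
use std::collections::HashMap;

// Hypothetical sketch of the proposed API surface; `disk` models the
// on-disk folder, which real code would back with the filesystem.
struct Cache {
    memory: HashMap<String, Vec<u8>>,
    disk: HashMap<String, Vec<u8>>,
}

impl Cache {
    fn new() -> Self {
        Cache {
            memory: HashMap::new(),
            disk: HashMap::new(),
        }
    }

    // `set` writes both to memory and disk (write-through).
    fn set(&mut self, key: &str, value: Vec<u8>) {
        self.disk.insert(key.to_string(), value.clone());
        self.memory.insert(key.to_string(), value);
    }

    // `get` checks memory first, then syncs one-way from disk into
    // memory; a miss in both places returns None.
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.memory.get(key) {
            return Some(v.clone());
        }
        if let Some(v) = self.disk.get(key) {
            let v = v.clone();
            self.memory.insert(key.to_string(), v.clone());
            return Some(v);
        }
        None
    }
}
```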
fs-cache/src/cache.rs
Outdated
/// Get number of items currently in memory
// pub fn memory_items(&self) -> usize {
//     self.storage.memory_items()
// }
It would be a good idea to log such information at the `debug` level. Also, we should count bytes, not the number of entries. Some cache items can be huge, e.g. bitmaps.
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 840a337
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 50fe163
Thank you for the review.
if path.exists() && path.is_dir() {
    return Some((path, None));
}
let Ok(path) = PathBuf::from_str(storage);
Does it panic if `from_str` returns `Err`?
- Even though `PathBuf::from_str()` has a return type of `Result<Self, Self::Err>`, it never fails, because a `PathBuf` is just a cross-platform wrapper around a string path.
- The actual validation of whether the path exists or is valid in the filesystem happens later, when you call methods like `exists()` or `is_dir()`.

#[stable(feature = "path_from_str", since = "1.32.0")]
impl FromStr for PathBuf {
    type Err = core::convert::Infallible;

    #[inline]
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(PathBuf::from(s))
    }
}
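A quick way to convince ourselves: since the error type is `Infallible`, unwrapping the result can never panic, whatever the input string. A minimal runnable check (the helper name is made up for illustration):

```rust
use std::path::PathBuf;
use std::str::FromStr;

// `PathBuf::from_str` has `Err = core::convert::Infallible`, so this
// unwrap cannot panic, regardless of the input string.
fn parse_path(s: &str) -> PathBuf {
    PathBuf::from_str(s).unwrap()
}
```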
pub fn get(&mut self, key: &K) -> Option<V> {
    // Check memory cache first - will update LRU order automatically
    if let Some(value) = self.memory_cache.get_refresh(key) {
That's pretty cool, I didn't know about the `get_refresh` function.
But I guess it works with O(N) complexity, right? Consider using a dedicated Rust crate for LRU to get O(1) complexity. This isn't a pressing issue; we can create an issue for it.
// Try to load from disk
let file_path = self.path.join(format!("{}.json", key));
if file_path.exists() {
    // Doubt: Update file's modified time (on disk) on read to preserve LRU across app restarts?
Let's track this feature and work on it later. Better to keep the implementation simple for the moment and avoid redundant state. Btw, we could also simply write cached keys into a file and apply atomic versioning to it, so all peers would have the same view of the LRU order.
let new_timestamp = SystemTime::now();
file.set_modified(new_timestamp)?;
file.sync_all()?;

Do we really need it?
No, not needed anymore, since we are no longer doing any kind of synchronization.
I think we may need it: in `fn load_fs`, we compare timestamps to load recent files into memory. This could help us efficiently decide which files to load or skip, since it provides precise timing. However, it's a very minor improvement, so we can skip it.
// Write a single value to disk
fn write_value_to_disk(&mut self, key: &K, value: &V) -> Result<()> {
    let file_path = self.path.join(format!("{}.json", key));
    let mut file = File::create(&file_path)?;

Let's add a `debug_assert` that the file doesn't exist.
Also, we should use lightweight atomic writing to avoid dirty writes. Keep in mind the scenario where several ARK apps on the same device use the same folder and write to the cache in parallel.
I believe that atomic versioning would be excessive here, but I'm not 100% sure yet.
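For reference, a minimal sketch of what "lightweight atomic writing" could look like with only `std`: write to a sibling temporary file, flush it, then rename it over the target. The function name and the `.tmp` suffix scheme are assumptions, not part of this PR; a real implementation would also need a collision-safe temp name when several writers share the folder.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Sketch: readers never observe a partially written file, because the
// content becomes visible only via the rename, which replaces the target
// in one step on the same filesystem.
fn atomic_write(target: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = target.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush to disk before the rename makes it visible
    fs::rename(&tmp, target)
}
```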
let file_path = self.path.join(format!("{}.json", key));
let file = File::open(&file_path)?;
let value: V = serde_json::from_reader(file).map_err(|err| {

Values are not necessarily JSON; they can be arbitrary binaries (e.g. JPG images).
The app developer will decide the structure of values; we should not make any assumptions about it.
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 42bb74c
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for c7341e8
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for d698fcf
// TODO: NEED FIX
memory_cache: LruCache::new(
    NonZeroUsize::new(max_memory_bytes)
        .expect("Capacity can't be zero"),
),

`LruCache` requires the capacity (number of items) to be specified during initialization. However, our Cache is designed to be limited by `max_memory_bytes`. So, my question is: what would be the best way to initialize the `LruCache`?
Note: In all other functions, we are already comparing based on the number of bytes, not the number of items.
I think we can create another parameter (`max_items`) which will be `Option<usize>` with a default of 100.
I think the number of items should be left up to the developer calling the function. Instead of taking `max_memory_bytes` as an argument, we could take `max_memory_items`. This would require redesigning the implementation to focus on the number of items rather than memory size, but it would give developers the flexibility to decide based on the average size of the items they store.
If prioritizing memory size over the number of items is a hard requirement, then I can think of two options:
- We could implement our own version of `LruCache`
- Or, `LruCache` has a `resize()` method, and we could use this to resize the cache based on other metadata we track

Also, I looked into `uluru`, and it uses the number of items to initialize the cache as well. Just mentioning this in case you were considering it.
Guys, what about this? https://docs.rs/lru-mem/latest/lru_mem/
But it has only 3 stars on GitHub...
It's actually an interesting option and would have been a perfect fit 😃
I wouldn't recommend it though, because if we find any issues later in the crate, we'd have to fork it and fix the problem ourselves, and we're not familiar with the code. Plus, since it's not actively maintained/used, there wouldn't be anyone around to help us either.
Benchmark for a683a1e
Benchmark for 92357c6
// Remove oldest entries until we have space for new value
while self.current_memory_bytes + size > self.max_memory_bytes {
    let (_, old_entry) = self
        .memory_cache
        .pop_lru()
        .expect("Cache should have entries to evict");
    debug_assert!(
        self.current_memory_bytes >= old_entry.size,
        "Memory tracking inconsistency detected"
    );
    self.current_memory_bytes = self
        .current_memory_bytes
        .saturating_sub(old_entry.size);
}

But yeah, I think we should remove this code at all costs.
It's currently undermining the purpose of using the external LRU cache crate. If there's absolutely no other way around this, then we may need to implement our own LRU cache solution.
This operation should be O(1).
struct CacheEntry<V> {
    value: V,
    size: usize,
}

Why do we need to store the size of the value? Can't we just read it from fs when needed? I don't see it being read often.
If it's for convenience to avoid I/O calls…
We need to track memory consumption in bytes to be precise about when to offload values, and to support large values, too. We probably can't avoid saving value sizes into memory; otherwise, when we hit the limit, we cannot know how many values we need to offload.
However, that's where we could split the crate into 2 flavours:
- dynamically-sized values, e.g. byte vectors and text strings
- statically-sized values, e.g. integers

For the 2nd flavour we could utilize some standard Rust trait.
Is there a way to have these 2 flavours combined nicely in a single crate?
> We need to track memory consumption in bytes to be precise about when to offload values...

That's fine, but we're actually reading the file size from the disk again in the `get_file_size` method, even though we already have it stored in the metadata. That seems a bit wasteful. Check out the next comment for more on this.

> dynamically-sized values e.g. byte vectors and text strings

That's a good point. I completely missed the dynamic types aspect when I looked at this. Now it makes a lot more sense why we need to track the data size instead of just the number of items.

> Is there a way to have these 2 flavours combined nicely in a single crate?

If we're dealing with types that have a fixed size, like `usize`, it doesn't really matter whether we count how many items there are or just the total size they take up. But this completely breaks with the second flavour you mentioned.
The only solution I can think of right now is to treat all types of data as if they were the second type – basically, keep track of how much memory they use instead of how many there are. But that brings up the question of how to do this in a clean way.
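One possible shape for combining the two flavours, sketched with a hypothetical trait (not part of the fs-cache API): statically-sized types report a constant footprint, while dynamically-sized ones compute theirs, so the cache can account in bytes uniformly.

```rust
// Hypothetical trait for byte accounting; name and impls are illustrative.
trait ByteSize {
    fn byte_size(&self) -> usize;
}

// Statically-sized flavour: the footprint is a compile-time constant.
impl ByteSize for u64 {
    fn byte_size(&self) -> usize {
        std::mem::size_of::<u64>()
    }
}

// Dynamically-sized flavour: the footprint depends on the payload length.
impl ByteSize for Vec<u8> {
    fn byte_size(&self) -> usize {
        self.len()
    }
}

impl ByteSize for String {
    fn byte_size(&self) -> usize {
        self.len()
    }
}
```

With such a bound on the value type, the eviction logic can sum `byte_size()` over entries regardless of which flavour is stored.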
Main thread: #95 (comment)
log::debug!("cache/{}: caching in memory for key {}", self.label, key);
let size = self.get_file_size(key)?;

... then we would essentially be defeating the purpose here, as we're reading the file size from disk again
fn get_file_size(&self, key: &K) -> Result<usize> {
    Ok(fs::metadata(self.path.join(key.to_string()))?.len() as usize)
}

This doc comment is misleading. We don’t actually check the memory cache for size information first, but I think we should.
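A hedged sketch of the suggested order of checks (names are hypothetical, and a plain map stands in for the cache's per-entry metadata): consult the tracked size first, and only fall back to `fs::metadata` when it is unknown.

```rust
use std::collections::HashMap;
use std::io;
use std::path::Path;

// Illustrative helper: prefer the size already tracked in memory over a
// filesystem metadata call. `sizes` models whatever metadata the cache keeps.
fn entry_size(
    sizes: &HashMap<String, usize>,
    dir: &Path,
    key: &str,
) -> io::Result<usize> {
    if let Some(&n) = sizes.get(key) {
        return Ok(n); // size already known, no I/O needed
    }
    Ok(std::fs::metadata(dir.join(key))?.len() as usize)
}
```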
/// Writes a serializable value to a file and returns the timestamp of the write
pub fn write_json_file<T: Serialize>(

This function doesn't really return the timestamp of the write.
pub fn extract_key_from_file_path<K>(
    label: &str,
    path: &Path,
    include_extension: bool,
) -> Result<K>
Thanks for moving this function to the utils module. It definitely feels like the right place for it.
Could you also add a doc comment for it?
@@ -0,0 +1 @@
pub mod cache;
All of the following comments are just nitpicks, so feel free to ignore them.
I think it's better to only export the `Cache` struct. That way, when we import the crate, we can use `Cache` directly without needing to do `fs_cache::cache::Cache`. It would be `fs_cache::Cache` instead.

Suggested change:
- pub mod cache;
+ mod cache;
+ pub use cache::Cache;
#[fixture]
fn temp_dir() -> TempDir {
    TempDir::new("tmp").expect("Failed to create temporary directory")
}

#[fixture]
fn cache(temp_dir: TempDir) -> Cache<String, TestValue> {
    Cache::new(
        "test".to_string(),
        temp_dir.path(),
        1024 * 1024, // 1MB
        false,
    )
    .expect("Failed to create cache")
}
I feel like using fixtures here might be a bit of an overkill, but maybe you have plans to expand on this in the future. Personally, I think we could just create the `temp_dir` and `cache` instances directly in each test function, or use a helper function, which would make the code more readable.
// Add new value and update size
self.memory_cache.put(
    key.clone(),
    CacheEntry {
        value: value.clone(),
        size,
    },
);
A small suggestion: we could use `push` here instead of `put`. According to the docs, `push` does the following: "If an entry with key `k` already exists in the cache or another cache entry is removed (due to the LRU's capacity), then it returns the old entry's key-value pair."
It would be a nice touch to log the evicted key, if there is one.
| Package         | Description                              |
| --------------- | ---------------------------------------- |
| `ark-cli`       | The CLI tool to interact with ark crates |
| `data-json`     | JSON serialization and deserialization   |
Suggested change:
- | `ark-cli` | The CLI tool to interact with ark crates |
+ | `ark-cli` | The CLI tool to interact with ARK crates |
| `data-resource` | Resource hashing and ID construction      |
| `fs-cache`      | Memory and disk caching with LRU eviction |
| `fs-index`      | Resource Index construction and updating  |
| `fs-storage`    | Filesystem storage for resources          |
Suggested change:
- | `fs-storage` | Filesystem storage for resources |
+ | `fs-storage` | Key-value storage persisted on filesystem |
fs-cache/src/cache.rs
Outdated
// Sort by size before loading
file_metadata.sort_by(|a, b| b.1.cmp(&a.1));

// Clear existing cache
self.memory_cache.clear();
self.current_memory_bytes = 0;

// Load files that fit in memory
Interesting idea with pre-loading the cache 👍
We probably also need a switch so the app developer can turn it on/off.
    }
}

// Sort by size before loading
file_metadata.sort_by(|a, b| b.1.cmp(&a.1));
// Sort by modified time (most recent first)
Actually, I'm not sure that pre-loading the most recently modified values would really be beneficial.
We could implement a more sophisticated approach, gathering query statistics and recording them somewhere on disk for pre-loading in the future. But I would do that in a separate PR, not right now.
`MemoryLimitedStorage` under the hood.