fs-cache: Add Cache Struct #95
base: main
Conversation
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 603c1db
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 20fe5a6
fs-cache/src/cache.rs
Outdated
/// Load most recent cached items into memory based on timestamps
pub fn load_recent(&mut self) -> Result<()> {
    self.storage.load_fs()
}
Actually, we don't need to expose this function. Only the `set`/`get` API is needed.
The rest should happen under the hood:
- any `set` should write both to memory and disk
- one-way sync from disk to memory is needed when users `get` values
- if we hit our own limit for bytes stored in the in-memory mapping, we erase the oldest entries from it
- but entries are always stored on disk, so there is no need to sync from memory to disk explicitly

Primary usage scenario: keys are of type `ResourceId`
- App indexes a folder.
- App may populate the cache before using it, but it's not required.
- App will query caches by key:
  - if the entry is in memory already, that's great, we just return the value
  - otherwise, we check disk for an entry with the requested key
    - if it is on disk, we add it to in-memory storage and return the value
    - otherwise, we return `None`
- Index can notify the app about recently discovered resources. Corresponding values can be in the cache already, but this is not required. App can initialize values for new resources.
Secondary usage scenario: keys are of arbitrary type
Can be any deterministic computation.
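To make the intended surface concrete, here is a minimal, self-contained sketch of the `set`/`get` flow described above. The struct and field names are hypothetical, and a second in-memory map stands in for the on-disk folder; it only illustrates the write-through `set` and the one-way disk-to-memory sync on `get` (byte-limit eviction is omitted).

```rust
use std::collections::HashMap;

// Hypothetical sketch of the proposed API surface; `disk` models the
// on-disk folder, which real code would back with the filesystem.
struct Cache {
    memory: HashMap<String, Vec<u8>>,
    disk: HashMap<String, Vec<u8>>,
}

impl Cache {
    fn new() -> Self {
        Cache {
            memory: HashMap::new(),
            disk: HashMap::new(),
        }
    }

    // `set` writes both to memory and disk (write-through).
    fn set(&mut self, key: &str, value: Vec<u8>) {
        self.disk.insert(key.to_string(), value.clone());
        self.memory.insert(key.to_string(), value);
    }

    // `get` checks memory first, then syncs one-way from disk into
    // memory; a miss in both places returns None.
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.memory.get(key) {
            return Some(v.clone());
        }
        if let Some(v) = self.disk.get(key) {
            let v = v.clone();
            self.memory.insert(key.to_string(), v.clone());
            return Some(v);
        }
        None
    }
}
```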
fs-cache/src/cache.rs
Outdated
/// Get number of items currently in memory
// pub fn memory_items(&self) -> usize {
//     self.storage.memory_items()
// }
It would be a good idea to log such information at the `debug` level. Also, we should count bytes, not the number of entries. Some cache items can be huge, e.g. bitmaps.
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 840a337
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 50fe163
Thank you for the review.
if path.exists() && path.is_dir() {
    return Some((path, None));
}
let Ok(path) = PathBuf::from_str(storage);
Does it panic if `from_str` returns `Err`?
- Even though `PathBuf::from_str()` has a return type of `Result<Self, Self::Err>`, it never fails, because a `PathBuf` is just a cross-platform wrapper around a string path.
- The actual validation of whether the path exists or is valid in the filesystem happens later, when you call methods like `exists()` or `is_dir()`.

#[stable(feature = "path_from_str", since = "1.32.0")]
impl FromStr for PathBuf {
    type Err = core::convert::Infallible;

    #[inline]
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Ok(PathBuf::from(s))
    }
}
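A quick way to convince ourselves: since the error type is `Infallible`, unwrapping the result can never panic, whatever the input string. A minimal runnable check (the helper name is made up for illustration):

```rust
use std::path::PathBuf;
use std::str::FromStr;

// `PathBuf::from_str` has `Err = core::convert::Infallible`, so this
// unwrap cannot panic, regardless of the input string.
fn parse_path(s: &str) -> PathBuf {
    PathBuf::from_str(s).unwrap()
}
```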
pub fn get(&mut self, key: &K) -> Option<V> {
    // Check memory cache first - will update LRU order automatically
    if let Some(value) = self.memory_cache.get_refresh(key) {
That's pretty cool, I didn't know about the `get_refresh` function.
But I guess it works with O(N) complexity, right? Consider using a dedicated Rust crate for LRU to get O(1) complexity. This isn't a pressing issue; we can create an issue for it.
// Try to load from disk
let file_path = self.path.join(format!("{}.json", key));
if file_path.exists() {
    // Doubt: Update file's modified time (on disk) on read to preserve LRU across app restarts?
Let's track this feature and work on it later. Better to keep the implementation simple for the moment and avoid redundant state. Btw, we could also simply write cached keys into a file and apply atomic versioning to it, so all peers would have the same view of the LRU order.
let new_timestamp = SystemTime::now();
file.set_modified(new_timestamp)?;
file.sync_all()?;

Do we really need it?
No, not needed anymore, since we are no longer doing any kind of synchronization.
I think we may need it: in `fn load_fs`, we compare timestamps to load recent files into memory. This could help us efficiently decide which files to load or skip, since it provides precise timing. However, it's a very minor improvement, so we can skip it.
// Write a single value to disk
fn write_value_to_disk(&mut self, key: &K, value: &V) -> Result<()> {
    let file_path = self.path.join(format!("{}.json", key));
    let mut file = File::create(&file_path)?;

Let's add a `debug_assert` that the file doesn't exist.
Also, we should use lightweight atomic writing to avoid dirty writes. Keep in mind the scenario where several ARK apps on the same device use the same folder and write to the cache in parallel.
I believe that atomic versioning would be excessive here, but I'm not 100% sure yet.
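For reference, a minimal sketch of what "lightweight atomic writing" could look like with only `std`: write to a sibling temporary file, flush it, then rename it over the target. The function name and the `.tmp` suffix scheme are assumptions, not part of this PR; a real implementation would also need a collision-safe temp name when several writers share the folder.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Sketch: readers never observe a partially written file, because the
// content becomes visible only via the rename, which replaces the target
// in one step on the same filesystem.
fn atomic_write(target: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = target.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush to disk before the rename makes it visible
    fs::rename(&tmp, target)
}
```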
let file_path = self.path.join(format!("{}.json", key));
let file = File::open(&file_path)?;
let value: V = serde_json::from_reader(file).map_err(|err| {

Values are not necessarily JSON; they can be arbitrary binaries (e.g. JPG images).
The app developer will decide the structure of values; we should not make any assumptions about it.
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for 42bb74c
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for c7341e8
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Signed-off-by: Pushkar Mishra <[email protected]>
Benchmark for d698fcf
// TODO: NEED FIX
memory_cache: LruCache::new(
    NonZeroUsize::new(max_memory_bytes)
        .expect("Capacity can't be zero"),
),

`LruCache` requires the capacity (number of items) to be specified during initialization. However, our Cache is designed to be limited by `max_memory_bytes`. So, my question is: what would be the best way to initialize the `LruCache`?
Note: In all other functions, we are already comparing based on the number of bytes, not the number of items.
I think we can create another parameter (`max_items`) which will be `Option<usize>` with a default of 100.
I think the number of items should be left up to the developer calling the function. Instead of taking `max_memory_bytes` as an argument, we could take `max_memory_items`. This would require redesigning the implementation to focus on the number of items rather than memory size, but it would give developers the flexibility to decide based on the average size of the items they store.
If prioritizing memory size over the number of items is a hard requirement, then I can think of two options:
- We could implement our own version of `LruCache`
- Or, `LruCache` has a `resize()` method, and we could use this to resize the cache based on other metadata we track

Also, I looked into `uluru`, and it uses the number of items to initialize the cache as well. Just mentioning this in case you were considering it.
Guys, what about this? https://docs.rs/lru-mem/latest/lru_mem/
But it has only 3 stars on GitHub...
It's actually an interesting option and would have been a perfect fit 😃
I wouldn't recommend it though, because if we find any issues later in the crate, we'd have to fork it and fix the problem ourselves, and we're not familiar with the code. Plus, since it's not actively maintained/used, there wouldn't be anyone around to help us either.
Benchmark for a683a1e
Benchmark for 92357c6
// Remove oldest entries until we have space for new value
while self.current_memory_bytes + size > self.max_memory_bytes {
    let (_, old_entry) = self
        .memory_cache
        .pop_lru()
        .expect("Cache should have entries to evict");
    debug_assert!(
        self.current_memory_bytes >= old_entry.size,
        "Memory tracking inconsistency detected"
    );
    self.current_memory_bytes = self
        .current_memory_bytes
        .saturating_sub(old_entry.size);
}

But yeah, I think we should remove this code at all costs.
It's currently undermining the purpose of using the external LRU cache crate. If there's absolutely no other way around this, then we may need to implement our own LRU cache solution.
This operation should be O(1).
struct CacheEntry<V> {
    value: V,
    size: usize,
}

Why do we need to store the size of the value? Can't we just read it from fs when needed? I don't see it being read often.
If it's for convenience to avoid I/O calls…
We need to track memory consumption in bytes to be precise about when to offload values, and to support large values, too. We probably can't avoid saving value sizes into memory; otherwise, when we hit the limit, we cannot know how many values we need to offload.
However, that's where we could split the crate into 2 flavours:
- dynamically-sized values, e.g. byte vectors and text strings
- statically-sized values, e.g. integers

For the 2nd flavour we could utilize some standard Rust trait.
Is there a way to have these 2 flavours combined nicely in a single crate?
> We need to track memory consumption in bytes to be precise about when to offload values...

That's fine, but we're actually reading the file size from the disk again in the `get_file_size` method, even though we already have it stored in the metadata. That seems a bit wasteful. Check out the next comment for more on this.

> dynamically-sized values e.g. byte vectors and text strings

That's a good point. I completely missed the dynamic types aspect when I looked at this. Now it makes a lot more sense why we need to track the data size instead of just the number of items.

> Is there a way to have these 2 flavours combined nicely in a single crate?

If we're dealing with types that have a fixed size, like `usize`, it doesn't really matter whether we count how many items there are or just the total size they take up. But this completely breaks with the second flavour you mentioned.
The only solution I can think of right now is to treat all types of data as if they were the second type – basically, keep track of how much memory they use instead of how many there are. But that brings up the question of how to do this in a clean way.
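One possible shape for combining the two flavours, sketched with a hypothetical trait (not part of the fs-cache API): statically-sized types report a constant footprint, while dynamically-sized ones compute theirs, so the cache can account in bytes uniformly.

```rust
// Hypothetical trait for byte accounting; name and impls are illustrative.
trait ByteSize {
    fn byte_size(&self) -> usize;
}

// Statically-sized flavour: the footprint is a compile-time constant.
impl ByteSize for u64 {
    fn byte_size(&self) -> usize {
        std::mem::size_of::<u64>()
    }
}

// Dynamically-sized flavour: the footprint depends on the payload length.
impl ByteSize for Vec<u8> {
    fn byte_size(&self) -> usize {
        self.len()
    }
}

impl ByteSize for String {
    fn byte_size(&self) -> usize {
        self.len()
    }
}
```

With such a bound on the value type, the eviction logic can sum `byte_size()` over entries regardless of which flavour is stored.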
Main thread: #95 (comment)
log::debug!("cache/{}: caching in memory for key {}", self.label, key);
let size = self.get_file_size(key)?;

... then we would essentially be defeating the purpose here, as we're reading the file size from disk again
fn get_file_size(&self, key: &K) -> Result<usize> {
    Ok(fs::metadata(self.path.join(key.to_string()))?.len() as usize)
}

This doc comment is misleading. We don’t actually check the memory cache for size information first, but I think we should.
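A hedged sketch of the suggested order of checks (names are hypothetical, and a plain map stands in for the cache's per-entry metadata): consult the tracked size first, and only fall back to `fs::metadata` when it is unknown.

```rust
use std::collections::HashMap;
use std::io;
use std::path::Path;

// Illustrative helper: prefer the size already tracked in memory over a
// filesystem metadata call. `sizes` models whatever metadata the cache keeps.
fn entry_size(
    sizes: &HashMap<String, usize>,
    dir: &Path,
    key: &str,
) -> io::Result<usize> {
    if let Some(&n) = sizes.get(key) {
        return Ok(n); // size already known, no I/O needed
    }
    Ok(std::fs::metadata(dir.join(key))?.len() as usize)
}
```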
/// Writes a serializable value to a file and returns the timestamp of the write
pub fn write_json_file<T: Serialize>(

This function doesn't really return the timestamp of the write.
pub fn extract_key_from_file_path<K>(
    label: &str,
    path: &Path,
    include_extension: bool,
) -> Result<K>
Thanks for moving this function to the utils module. It definitely feels like the right place for it.
Could you also add a doc comment for it?
@@ -0,0 +1 @@
pub mod cache;
All of the following comments are just nitpicks, so feel free to ignore them.
I think it's better to only export the `Cache` struct. That way, when we import the crate, we can use `Cache` directly without needing to do `fs_cache::cache::Cache`. It would be `fs_cache::Cache` instead.

Suggested change:
- pub mod cache;
+ mod cache;
+ pub use cache::Cache;
#[fixture]
fn temp_dir() -> TempDir {
    TempDir::new("tmp").expect("Failed to create temporary directory")
}

#[fixture]
fn cache(temp_dir: TempDir) -> Cache<String, TestValue> {
    Cache::new(
        "test".to_string(),
        temp_dir.path(),
        1024 * 1024, // 1MB
        false,
    )
    .expect("Failed to create cache")
}
I feel like using fixtures here might be a bit of an overkill, but maybe you have plans to expand on this in the future. Personally, I think we could just create the `temp_dir` and `cache` instances directly in each test function, or use a helper function, which would make the code more readable.
// Add new value and update size
self.memory_cache.put(
    key.clone(),
    CacheEntry {
        value: value.clone(),
        size,
    },
);
A small suggestion: we could use `push` here instead of `put`. According to the docs, `push` does the following: "If an entry with key `k` already exists in the cache or another cache entry is removed (due to the LRU's capacity), then it returns the old entry's key-value pair."
It would be a nice touch to log the evicted key, if there is one.
| Package         | Description                              |
| --------------- | ---------------------------------------- |
| `ark-cli`       | The CLI tool to interact with ark crates |
| `data-json`     | JSON serialization and deserialization   |
Suggested change:
- | `ark-cli` | The CLI tool to interact with ark crates |
+ | `ark-cli` | The CLI tool to interact with ARK crates |
| `data-resource` | Resource hashing and ID construction      |
| `fs-cache`      | Memory and disk caching with LRU eviction |
| `fs-index`      | Resource Index construction and updating  |
| `fs-storage`    | Filesystem storage for resources          |
Suggested change:
- | `fs-storage` | Filesystem storage for resources |
+ | `fs-storage` | Key-value storage persisted on filesystem |
fs-cache/src/cache.rs
Outdated
// Sort by size before loading
file_metadata.sort_by(|a, b| b.1.cmp(&a.1));

// Clear existing cache
self.memory_cache.clear();
self.current_memory_bytes = 0;

// Load files that fit in memory
Interesting idea with pre-loading the cache 👍
We probably also need a switch so the app developer can turn it on/off.
    }
}

// Sort by size before loading
file_metadata.sort_by(|a, b| b.1.cmp(&a.1));
// Sort by modified time (most recent first)
Actually, I'm not sure that pre-loading the most recently modified values would really be beneficial.
We could implement a more sophisticated approach, gathering query statistics and recording them somewhere on disk for pre-loading in the future. But I would do that in a separate PR, not right now.
`MemoryLimitedStorage` under the hood.