Separating the log from the state #141
Comments
I think that's a good idea for the read-only streaming use case :) The log stream abstraction can expose both pull-based and push-based interfaces. The existing DeltaTable implementation can then be refactored into a pull-based consumer, while read-only streaming consumers can use the push interface to react to new log entries in near real time.
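As a rough illustration of the comment above, here is a minimal in-memory sketch of a log stream that offers both a pull and a push interface. `LogStream`, `poll`, `subscribe`, and `append` are hypothetical names invented for this example and are not part of any existing API; plain dicts stand in for real action types.

```python
from typing import Callable, Dict, List, Tuple

Commit = List[Dict]  # a batch of actions; dicts stand in for real action types


class LogStream:
    """Hypothetical log stream exposing both pull and push interfaces."""

    def __init__(self) -> None:
        self._commits: List[Commit] = []
        self._subscribers: List[Callable[[int, Commit], None]] = []

    def append(self, commit: Commit) -> int:
        """Stand-in for a writer committing; notifies push subscribers."""
        self._commits.append(commit)
        version = len(self._commits) - 1
        for callback in self._subscribers:
            callback(version, commit)
        return version

    def poll(self, after_version: int) -> List[Tuple[int, Commit]]:
        """Pull interface: the caller asks for commits newer than after_version."""
        start = after_version + 1
        return list(enumerate(self._commits[start:], start=start))

    def subscribe(self, callback: Callable[[int, Commit], None]) -> None:
        """Push interface: the callback runs for every future commit."""
        self._subscribers.append(callback)
```

In this sketch, a table-state consumer would call `poll` whenever it wants to catch up, while a streaming consumer would register a callback once with `subscribe`.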
A pull-based interface would probably be good enough as a start. With a push-based interface, someone has to poll the underlying file system anyway (assuming something like I'll play with this idea a bit to see if it leads anywhere.
Just to make sure we are talking about pull/push at the same abstraction level. In my first comment, I was referring to pull and push interfaces at the application level. For example, a pull interface, or API if you prefer, means the application code needs to explicitly call When it comes to actually implementing the push interface, we also have the choice of leveraging push or pull semantics at the storage level. For example, for a local file system backend on Linux, we could leverage inotify to achieve a real-time push implementation. For backends like S3, we would need to actually pull new values from
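The distinction above, push at the application level implemented by pull at the storage level, could be sketched roughly as follows. `PollingNotifier`, `poll_once`, and `commit_path` are hypothetical names for this example only. The sketch assumes commits are stored as newline-delimited JSON files named by a zero-padded 20-digit version, and that a caller invokes `poll_once` on some schedule of its choosing.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def commit_path(log_dir: str, version: int) -> str:
    """Path of the commit file for a version (zero-padded to 20 digits)."""
    return os.path.join(log_dir, f"{version:020d}.json")


class PollingNotifier:
    """Hypothetical adapter: pulls from storage, pushes to the application."""

    def __init__(self, log_dir: str,
                 callback: Callable[[int, List[Dict]], None]) -> None:
        self.log_dir = log_dir
        self.callback = callback
        self.next_version = 0  # first version not yet delivered

    def poll_once(self) -> int:
        """Deliver every commit that has appeared since the last poll.

        Returns the number of commits delivered in this call.
        """
        delivered = 0
        while os.path.exists(commit_path(self.log_dir, self.next_version)):
            with open(commit_path(self.log_dir, self.next_version)) as f:
                # One JSON-encoded action per non-empty line.
                actions = [json.loads(line) for line in f if line.strip()]
            self.callback(self.next_version, actions)
            self.next_version += 1
            delivered += 1
        return delivered
```

On a backend with native change notifications, the same callback contract could be driven by those events instead of by repeated calls to `poll_once`; the application-level interface would not need to change.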
@dispanser Azure Databricks has an "autoloader" feature that resembles inotify behavior. Did you already try that?
@houqp there's autoloader on AWS as well :)
related #661
I think the LogStore work that's been done recently I am going to close this out
In the current architecture, the concepts of the delta table and the delta log are available in one single abstraction, `DeltaTable`. To update to the newest state, one can call `update()` and the table state follows accordingly.
However, I can see several use cases, mostly around stream processing, that are not actually interested in the current table state, but instead in the stream of changes. For such a use case, subscribing to a stream of actions would probably provide the better abstraction.
Ultimately, `DeltaTable` is just a representation of the aggregated state of the log up to a specific commit, so it would just be another subscriber to the delta log changes.
I wonder if it would make sense to have a first-class citizen, `DeltaLog`, exposing commits as a stream of batches of actions.
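To make the proposal a bit more concrete, here is a minimal sketch of the separation, assuming a heavily simplified action model with only "add" and "remove" actions. The `Action`, `DeltaLog`, and `TableState` types below are hypothetical and written for illustration; they do not mirror the existing implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set, Tuple


@dataclass(frozen=True)
class Action:
    """Hypothetical, simplified action; real actions carry more metadata."""
    kind: str  # "add" or "remove"
    path: str  # data file the action refers to


@dataclass
class DeltaLog:
    """Hypothetical log: an ordered list of commits, each a batch of actions."""
    commits: List[List[Action]] = field(default_factory=list)

    def commits_from(self, version: int) -> Iterator[Tuple[int, List[Action]]]:
        """Yield (version, actions) for every commit at or after `version`."""
        for v in range(version, len(self.commits)):
            yield v, self.commits[v]


@dataclass
class TableState:
    """Aggregated state of the log up to `version` (-1 means nothing applied)."""
    version: int = -1
    files: Set[str] = field(default_factory=set)

    def apply(self, version: int, actions: List[Action]) -> None:
        """Fold one commit into the state."""
        for action in actions:
            if action.kind == "add":
                self.files.add(action.path)
            elif action.kind == "remove":
                self.files.discard(action.path)
        self.version = version

    def update(self, log: DeltaLog) -> None:
        """Catch up with the log, like any other consumer of its commits."""
        for version, actions in log.commits_from(self.version + 1):
            self.apply(version, actions)
```

Here `TableState` holds no special position: it consumes the same batches of actions that a stream-processing consumer would, and merely chooses to aggregate them.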