-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Node.js Streams vs pull-streams vs ReactiveJS #362
Comments
The only viable other alternative to node-streams I found so far was https://pull-stream.github.io. They are quite nice and there is an example of secio implemented here: libp2p/js-libp2p-secio#8 The main barrier I hit with them was integration with node-streams, so if we were to move the whole code base to use pull-streams this might solve those issues. |
First, thank you @dignifiedquire for chasing this streams issue and for researching possible solutions by doing experiments with actual implementations, really 5 🌟 👌. To give a bit more emphasis to the issues we've realised (through the hard way), the Node.js Streams, also known as I believe, after having gone through and also talked with several people, that we shouldn't treat these 3 problems as one, and falling into the trap of patching the 3 pain points with a large bandaid, which will bring a lot more than what we need, increasing the complexity and also hitting performance and the js blob size. Just to put in context, a normal communication between two IPFS peers is something like: Candidate solutionspumppump is a small module that offers correct error propagation and stream destruction. It can be seen as pump is part of a collection of utility modules for Node.js Streams, called mississippi. rxjsrxjs is a very large project supported by lots of groups and companies, its documentation is in a very advanced state and exists for more languages than Node.js. It offers a lot of nice APIs to operate streams of data, or 'arrays over time', however, (from what I can understand), doesn't offer any backpressure mechanism, essential for networking protocols or operations over large quantities of data. In some sense, it is similar to Node.js Streams 1 (Streams Classic), but with fancier APIs pull-streamspull-streams are super interesting, I really like how the ecosystem is growing and they result from similar frustrations that Node.js developers had in the past. I was initially confused of how to ensure that pull-streams offer backpressure mechanism, specially compatible with the underlying transports (push streams), but @yoshuawuyts appeared in the IPFS workshop for the rescue and gave me a really good explanation. Essentially pull-streams leverage the underlying push-streams highWatermark to do the network backpressure for them, as seen in the following image: Problem 1 - propagation of errorsThis is a requirement for js-ipfs, as each part of the code works isolated from the remaining services, but they all mount on libp2p to open the connections. As seen in Figure 1, an error in a network socket will cause several parts of the code to have to handle that error and react to it. It seems to me that any of the proposed solutions would solve this problem really well. Then it will be up to the developers to understand how to handle the network errors if they open a connection themselves. Problem 2 - ensure data gets flushedIt doesn't seem to me that any of the proposed solutions offers a fix for this problem, because in the end, we are always using Node.js stream to write to disk. What we can (and should do) is solve this issue at the ipfs-repo level, making sure that data is flushed before we let the end event be emitted, and while we are at it, add a LRU cache to make reads faster, mitigating also the problem of the false negatives. Problem 3 - streams ecosystemThis feels a bit like lying to ourselves, it is true that not all stream modules implement Node.js streams as we could expect, but that doesn't mean that whatever other modules we get for pull-streams or rxjs will do everything just right. This has to be solved by having very comprehensive tests, unit, integration and chaos tests, to make sure we don't miss anything. Important to note that, modules like Other concerns
Other thoughtsWe have to be cautious when proposing/moving to other ecosystems. Every single one of them has its qualms and moving back and forth will always add noise in the migration. There are plenty of libraries and frameworks that optimize the developers experience, however, since they might require a huge paradigm shift on the code structure, they lack to get adoption. In the end, it is the people interest that will make a good lib/framework, a good framework. One good example is the This takes me to my next point which is, in order to maximize adoption and contribution, language API are typically the best bet. Summing upIn the end, independently of the solution we decide to go for, we should understand and make sure we don't add more complexity and problems for the long term. Bringing in a framework to solve control flow will certainly add a considerable amount of overhead (the large bandaid). We probably can get away with pull-streams look pretty exciting with their low footprint and convenience, but they only solve issues partially (error propagation) and we would need to expose regular streams anyhow, which they do pretty conveniently, same doesn't happen for rxjs. Let's make sure to chat Friday and make a decision and move on, please do comment in this thread before hand, so that we make the meeting very efficient. |
Thank you very much @diasdavid for this comprehensive post, below are a couple of notes just for clarification.
This is very nicely solved in
Yes, but I don't think we should sacrifice our productivity just for adoption as a library. We don't want to clash with everything the community does, but this shouldn't stop us from trying to write as efficient as possible the most excellent code that we can.
I'm afraid that we can't get away with this, as the data flushing issues are more serious than just the fs part. The most pressing issue, the one which lead me to investigating other stream solutions was the one I encountered when trying to implement secio. It encounters the issue that wrapping node-streams and trying to properly manage things like
I don't know how your schedule looks like but I would appreciate if we could actually have this chat today, given that it is currently blocking 90% of the work I want to get a move on. |
Ok, let's do it today then :) 4pm UTC sounds good? http://www.worldtimebuddy.com/?qm=1&lid=2267057,100,308,12&h=100&date=2016-8-4&sln=16-17 |
Pinging @haadcode @nginnever @xicombd @RichardLitt @victorbjelkholm @shamb0t and anyone else following this, you are welcome to join the chat, don't feel obligated to though :) |
https://github.com/dominictarr/pull-fs uses
Yeah, chances are slim that will happen soon. But that's fine because in Node
Probably the only native abstraction capable of doing streaming stuff well I'd be super wary of exposing multiple interfaces as JS, the language™ is a An example OO API: const ipfs = require('ipfs')
ipfs('/my-cool/repo-name', (err, node) => {
if (err) throw err
const cat = node.files.cat(function (err, data) {
if (err) return console.log('error: ' + err)
console.log('chunk: ' + String(data))
})
cat.start()
cat.pause()
cat.stop()
node.stop()
}) For native abstractions you could totes create a plugin system to expose // pull-stream
const ipfsPull = require('ipfs-pull-stream')
const pull = require('pull-stream')
const ipfs = require('ipfs')
ipfs('/my-cool/repo-name', (err, node) => {
if (err) throw err
node.use(ipfsPull())
const cat$ = node.files.cat('<hash>')
pull(cat$, pull.log())
}) // node streams
const ipfsNodeStreams = require('ipfs-node-streams')
const ipfs = require('ipfs')
const pump = require('pump')
ipfs('/my-cool/repo-name', (err, node) => {
if (err) throw err
node.use(ipfsNodeStreams())
const cat$ = node.files.cat('<hash>')
pump(cat$, process.stdout)
}) Does this make sense? |
If you guys don't make a solid decision during this meeting i'm going to make the decision for you. That is a threat I intend to follow through on. |
Really awesome work @dignifiedquire in concretely identifying the pain points in dealing with node streams. I can't make the hangout, but here are my thoughts: For me, pull streams' # 1 superpower is their simplicity (in the Rich Hickey sense of the word): anyone can write a full pull stream implementation from scratch in a few minutes, from first principles. This is not true of node streams in the slightest. Simplicity brings transparency with it, meaning debugging and reasoning about implementation gets easier. I think this is the biggest argument I can make for pull streams. In this sense, I bet js-ipfs would benefit tremendously. That said, pull streams' lack mainstream awareness. Despite being simple, grokking pull streams is certainly not easy. It took me a while to wrap my head around the big ideas, and I still don't wield them as intuitively as node streams. Pushing IPFS contributors to grok pull streams could hurt the project's approachability. Summary: deviating from node streams in 2016 seems like a high-risk move re raising the barrier to contribution, despite potential gains. As @diasdavid said, piece-wise solutions to each problem (like |
Thank you everyone for all the input, it was really valuable for all the discussion. Me and @dignifiedquire talked for a while and given our options, we decided to give pull-streams a spin, however this is not a full buy in, because we still do not know if it will solve the entire problem for libp2p (specially secio) and ipfs-repo. The planTest pull-streams in libp2p
We hope to get this done no later by next Wednesday IPFS Repo
Note: We could avoid migrating these to pull-streams and just do the LRU cache + strong verification at the Repo, but since we are moving libp2p to pull-streams, it makes sense to use the same type of streams everywhere, plus we get all the other conviniences. Acknowledgements
|
Can you give insights to what lead you to this decision? What were the pros and cons for pull-streams vs. RxJS? |
@haadcode The most important reasons are described in my first post, here: #362 (comment) In a tl;dr:
|
As an alternative to RX, with backpressure possibilities, there is also the Reactive Streams specification (http://www.reactive-streams.org/). It is still not something that feels very natural compared to streams in nodejs (compare https://github.com/reactive-streams/reactive-streams-jvm/blob/v1.0.0/README.md). |
@ekroon I saw them mentioned in the discussions around backpressure but it seems there is no implementation in JavaScript currently. Implementing something like this would add a lot of work for us without being sure that it's worth it (https://github.com/reactive-streams/reactive-streams-js is empty :() |
reactive streams also looks like its its own network protocol. |
@dignifiedquire @whyrusleeping in the simplest form reactive streams is only a set of interfaces that make it possible to subscribe to a stream and be notified of errors and the end of the stream (which is also in RX). In addition you can effectively switch between pull and push as there is an extra call on the subscription (normally only a cancel or unsubscribe), namely request(n); to request a maximum of new items. If the receiving end is limiting the flow by doing request(1) after a message, it would be effectively pull and when the sending end is limiting via request(bignumber); than it is effectively (unbounded) push. Looking at node.js streams and pull-streams specifically, it looks to me that you can almost do the same with it. I am not completely sure if you can specify a lot of data and that way be a bit more pushy (I get a bit stuck when reading javascript ;)); and that reactive streams is (still) more a JVM party and not very wide spread, but that the interface itself would be mappable onto a pull stream implementation. (RxJs is pretty popular I thought nowadays for UIs, so there is hope for a reactive streams implementation) As for a network protocol: it is not, but it is meant to be used within an application and crossing network boundaries. E.g. if I have an photo resize service split up in a resizer + stream chunker (creating JPEG or something) and a sender over the network (over some protocol providing flow control, TCP / SPDY ...), than I would be able to rate limit the sender by requesting only 1 new photo from the stream chunker which in turn could rate limit the TCP connection a little bit if needed. |
for converting between node streams and pull streams: stream-to-pull-stream has had a lot of battle testing and has been very reliable. |
Closing as we have merged and released our first version of js-ipfs using pull-streams internally :) |
Reasons for consideration
In the recent months of developing js-ipfs we had some issues with various parts of the different versions of node-streams. Some of the major ones were
error
handling, especially as wepipe
and wrap streams many times from module to moduleIn light of these issues we are exploring different alternatives to replace the stream based code inside js-ipfs
Currently I have identified two possible alternatives
1. Pull Streams
Example code
Resources
2. RxJS
Example code
Resources
The text was updated successfully, but these errors were encountered: