Node.js streams and highWaterMark
Hi,
I have been working with streams for a long time. When I started programming a few years ago, I began with the basics of web development, such as HTML and CSS. PHP came afterward, stayed for a long time, and is still one of my favorite languages. I like it because PHP is easy and flexible, OOP is not mandatory, and you can easily build a solid architecture with it even if you are not an expert. All of this to say that I worked with streams in PHP, and now I am doing the same with Node.js.
Streams for what?
I have worked on https://simply-debrid.com for quite a long time now. The idea behind it is simply to take a remote stream and send it back to the client, acting like a proxy. The interesting part is supporting pause/resume.
To do this I worked with streams and sockets.
The steps are as follows:
- Connect to the remote stream
- Read chunks
- Write chunks to the client
I did this with the following PHP functions:
This worked well, but in one specific version, in particular when I wanted to bind to a specific network interface, I also used socket-based functions to set up my connection.
I used:
- socket_create
- socket_set_option
- socket_bind
- socket_connect
- socket_write (https://secure.php.net/manual/fr/function.socket-write.php)
- socket_read
- socket_close
You can do the same with Node.js and it's even easier.
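As a rough Node.js counterpart to the socket_bind step, the core net module accepts a localAddress option when connecting. This is only a minimal sketch under assumptions: the connectFrom helper name is made up for illustration, and the addresses you pass depend on your own interfaces.

```javascript
const net = require('net')

// Hypothetical helper: open a TCP connection to host:port while forcing the
// outgoing socket to bind to a specific local interface (localAddress).
function connectFrom (localAddress, host, port) {
  return new Promise((resolve, reject) => {
    const socket = net.connect({ host, port, localAddress })
    socket.once('connect', () => resolve(socket))
    socket.once('error', reject)
  })
}
```

The options of http.get and http.request accept the same localAddress property, so the binding can also be done one level up, without touching raw sockets.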
So remember the overall idea: we want to proxy a resource over a network. See the following diagram:
The client requests a resource through a predefined URL scheme, and the resource is then fetched according to the specified inputs.
Node.js streams
A few years ago, when I started with Node.js, I wanted to do exactly what I had been doing for some time in PHP. So I looked into Node.js and dug into it.
Node.js streams are now built around 3 main entities (plus Duplex streams, which are both readable and writable):
- Writable Streams
- Readable Streams
- Transform Streams
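To make the three entities listed above a bit more concrete, here is a small sketch that chains one of each using only the core stream module. The data and variable names are made up for illustration.

```javascript
const { Readable, Writable, Transform, pipeline } = require('stream')

// Readable: produces data (here, three small chunks from an array).
const source = Readable.from(['a', 'b', 'c'])

// Transform: sits in the middle and modifies each chunk passing through.
const upper = new Transform({
  transform (chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase())
  }
})

// Writable: consumes data (here, it just collects chunks in memory).
const received = []
const sink = new Writable({
  write (chunk, encoding, callback) {
    received.push(chunk.toString())
    callback()
  }
})

// pipeline connects the three streams and reports the first error, if any.
const done = new Promise((resolve, reject) => {
  pipeline(source, upper, sink, (err) => (err ? reject(err) : resolve(received)))
})
```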
For my needs I used:
- request
- pipe (native), which I later replaced with pump
- http Node.js core module
So it's easy with Node.js. It works like this:
- Stream A
const request = require('request')
const remote = request('http://example.com/ressource.sql') // the remote stream
At this point you are almost done. Next, create your listening server in order to accept client requests. Obviously you will have to parse your data, URL requests and everything else. I did it with the http Node.js core module, but you can also do it with express or any other web framework able to set up a web server for you.
const http = require('http')
http.createServer((req, res) => {
  /* do whatever you want and parse everything...
     ...then pipe data to the client */
}).listen(8080)
This is exactly how it worked for me:
- I created the web server
- I defined a remote resource
- I streamed response to the client
Now the server is ready to handle requests and the remote resource is set. Note that requesting a resource with request returns a Readable Stream.
From here, pump will do the rest as usual.
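const pump = require('pump')
// Pipe the client request into the remote stream, then the remote stream into the response
pump(req, remote, res, (err) => {
  if (err) {
    console.log(err)
  } else {
    console.log('Download ended, stream closed')
  }
})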
This works like a charm. At first, when I was not using pump, I had a lot of issues since I needed to deal with every emitted event. I also wanted to track when a stream was finished, closed, paused, resumed and so on, which can be very tricky; pump really helps with this.
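For comparison, the same server-plus-relay idea can be sketched with core modules only, since stream.pipeline handles errors and cleanup much like pump does. This is a hedged sketch, not the code running on my service: the createProxy helper name and the single fixed upstream URL are assumptions made for illustration.

```javascript
const http = require('http')
const { pipeline } = require('stream')

// Hypothetical helper: build a proxy server that relays one fixed upstream
// URL to every client. A real service would parse req.url first.
function createProxy (upstreamUrl) {
  return http.createServer((req, res) => {
    // http.get hands us the upstream response, which is a Readable stream.
    const upstreamReq = http.get(upstreamUrl, (upstream) => {
      res.writeHead(upstream.statusCode, upstream.headers)
      // pipeline forwards the chunks and destroys both streams on failure.
      pipeline(upstream, res, (err) => {
        if (err) console.error('Stream failed:', err.message)
      })
    })
    // The upstream could not be reached or broke while connecting.
    upstreamReq.on('error', () => {
      if (!res.headersSent) res.writeHead(502)
      res.end()
    })
  })
}
```

Calling createProxy('http://example.com/ressource.sql').listen(8080) would then serve that remote resource on port 8080.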
This was almost good enough for me, but I had another issue, which is what this article is all about. Back when I worked with PHP, I used fread; see the function signature:
string fread ( resource $handle , int $length )
So fread takes our resource and a length, where length is the size of each chunk you read and keep in your internal buffer (memory). I wanted to control this value in Node.js the same way as in PHP, since I realized I was buffering too much data. The problem is that when you have a lot of traffic you also have many filled buffers, meaning you temporarily fetch more data than you actually send out to the client. Your bandwidth is then not symmetric: the download from the remote and the upload to the client are not synchronized (extra bandwidth consumption). The size of the buffer is managed by highWaterMark, an internal value of Node.js streams.
The documentation says:
highWaterMark Buffer level when stream.write() starts returning false. Defaults to 16384 (16kb), or 16 for objectMode streams.
This means the buffer is 16 KB by default: the stream fills it up to 16 KB and then stops gathering data until the buffer has been drained. I wanted to lower this value in order to reduce the extra bandwidth usage and get a stream that is better synchronized with the client. That was actually the case with PHP, since fread was reading chunks of 512 bytes. However, it's not that easy when you are using request (a wrapper around the built-in http core module, with added value), since the response directly uses the built-in stream.Readable class of Node.js. The documentation shows how to extend this class and customize it. That worked for me, but I still don't know how to make my version the default or even override the core class, my bad. Anyway, I asked how to do this a few months ago on GitHub (how to set highWaterMark #2333). No one has answered so far, apart from someone else who is facing the same issue.
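To see the quoted highWaterMark rule in action, here is a small sketch with a deliberately slow Writable and a tiny limit. The 1024-byte limit and the variable names are arbitrary choices for illustration, not values from my service.

```javascript
const { Writable } = require('stream')

// A deliberately slow Writable with a tiny 1 KB buffer limit. The write
// callback is only called later, so chunks pile up in the internal buffer.
const slow = new Writable({
  highWaterMark: 1024,
  write (chunk, encoding, callback) {
    setImmediate(callback)
  }
})

// 512 bytes buffered: still below the limit, so write() returns true.
const first = slow.write(Buffer.alloc(512))
// 1024 bytes buffered: the limit is reached, so write() returns false and
// the producer should wait for 'drain' before writing again.
const second = slow.write(Buffer.alloc(512))

// 'drain' fires once the internal buffer has been emptied.
const drained = new Promise((resolve) => slow.once('drain', resolve))
```

This is the mechanism pipe and pump rely on: when write() returns false they pause the source, and they resume it on 'drain'.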
I have not implemented it yet, but a solution should be to not use request or any other request module. I did think about the got module, which sounds good and promising but has no option to control how the stream.Readable response is returned. Instead, use http.get directly with a customized (inherited) version of stream.Readable; see "Implementing a Readable Stream" in the Node.js documentation.
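As a starting point for that idea, here is a minimal sketch of an inherited stream.Readable with a lowered highWaterMark. It is only an illustration under assumptions: the SlicedReadable name is hypothetical, an in-memory Buffer stands in for the remote resource, and wiring it to http.get is left out.

```javascript
const { Readable } = require('stream')

// Hypothetical subclass: serves an in-memory Buffer in small slices. The
// `size` passed to _read() is a hint derived from highWaterMark.
class SlicedReadable extends Readable {
  constructor (data, options) {
    super(options)
    this.data = data
    this.offset = 0
  }

  _read (size) {
    if (this.offset >= this.data.length) {
      this.push(null) // no more data: signal the end of the stream
      return
    }
    const end = Math.min(this.offset + size, this.data.length)
    this.push(this.data.subarray(this.offset, end))
    this.offset = end
  }
}

// 2048 bytes of data, read with a 512-byte highWaterMark like PHP's fread.
const sliced = new SlicedReadable(Buffer.alloc(2048, 'x'), { highWaterMark: 512 })

// Record the size of every chunk delivered through 'data' events.
const sizes = []
const finished = new Promise((resolve, reject) => {
  sliced.on('data', (chunk) => sizes.push(chunk.length))
  sliced.on('end', () => resolve(sizes))
  sliced.on('error', reject)
})
```

Consumed through 'data' events like this, the chunks stay at or below the 512-byte mark. Other ways of consuming the stream, such as read() in paused mode, may hand back several buffered slices at once.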
If you guys and girls have any idea how I can deal with this, feel free to contact me. I'd appreciate it, thanks.