This post is based on what I understood from Jawahar's talk at the January 2020 GeekNight held at ThoughtWorks, Chennai, where I was part of the audience.
Errors, if any, are my own. Corrections appreciated.
—
Consider two files, medium.txt and large.txt. medium.txt is a reasonably large file and large.txt is a huge file with respect to the computer hardware.
On a 4GB RAM computer, my medium.txt and large.txt span around 200MB and 2GB respectively.
Let's write a simple Node.js application in a file nostream.js to compress a given file.
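const fs = require('fs');
const zlib = require('zlib');

// The input file name is the first command-line argument.
const filePath = process.argv[2];
const data = fs.readFileSync(filePath); // reads the whole file into memory

// zlib.gzip() passes (error, result) to its callback.
zlib.gzip(data, (err, compressedData) => {
  if (err) throw err;
  fs.writeFileSync('out.gz', compressedData);
  console.log("Compression success!");
});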
The fs module is used for reading and writing files, along with zlib's gzip() for the compression.
To keep things quick, the input file name is assumed to be given as the first argument when the application is invoked. When the application is run using node, process.argv starts with the path of the node executable and the path of the script, so the file name becomes its third element. Hence process.argv[2].
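As a small illustration of the argv[2] reasoning above, here is a sketch that picks the file name out of an argv-style array. The helper name inputFileFrom and the usage message are my own, hypothetical additions, and the example array uses shortened paths (the real process.argv holds absolute paths in its first two slots).

```javascript
// Hypothetical helper: pick the input file out of an argv-style array.
function inputFileFrom(argv) {
  // argv[0] is the node executable, argv[1] is the script, argv[2] is our file.
  const [nodePath, scriptPath, filePath] = argv;
  if (filePath === undefined) {
    throw new Error('Usage: node nostream.js <file>');
  }
  return filePath;
}

// Roughly what process.argv holds for `node nostream.js medium.txt`
// (paths shortened here for illustration).
const exampleArgv = ['node', 'nostream.js', 'medium.txt'];
console.log(inputFileFrom(exampleArgv)); // medium.txt
```

In the actual script this would be called as inputFileFrom(process.argv), which fails early with a readable message when no file name is given.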
When we try to compress the 200MB medium.txt file, everything works fine.
$ node nostream.js medium.txt
Compression success!
$ du -h out.gz medium.txt
4.0K out.gz
219M medium.txt
(I used a trivial text file with repeating content as input. Hence the small size of the compressed file.)
However, trying the same with the 2GB large.txt was a different story. An error that looked something like this showed up:
$ node nostream.js large.txt
fs.js:317
throw new ERR_FS_FILE_TOO_LARGE(size);
^
RangeError [ERR_FS_FILE_TOO_LARGE]: File size (2162601906) is greater than possible Buffer: 2147483647 bytes
at tryCreateBuffer (fs.js:317:13)
at Object.readFileSync (fs.js:353:14)
at Object.<anonymous> (/home/famubu/nostream.js:6:17)
at Module._compile (internal/modules/cjs/loader.js:959:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:995:10)
at Module.load (internal/modules/cjs/loader.js:815:32)
at Function.Module._load (internal/modules/cjs/loader.js:727:14)
at Function.Module.runMain (internal/modules/cjs/loader.js:1047:10)
at internal/main/run_main_module.js:17:11 {
code: 'ERR_FS_FILE_TOO_LARGE'
}
The error shows up because fs.readFileSync() tries to load the entire file into a single in-memory Buffer before compression can even begin. As the message says, the 2162601906-byte file is bigger than the largest Buffer that this version of Node.js could create (2147483647 bytes, just under 2GiB).
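To make that limit concrete, here is a minimal sketch that compares a file size against it. The limit is copied from the error message above and may well differ on other Node.js versions; fitsInOneBuffer and fileFitsInOneBuffer are hypothetical helper names of my own.

```javascript
const fs = require('fs');

// Limit quoted in the error message above (assumption: other Node.js
// versions may enforce a different value).
const LIMIT_FROM_ERROR = 2147483647;

// Hypothetical helper: would this many bytes fit under the limit?
function fitsInOneBuffer(sizeInBytes, limit = LIMIT_FROM_ERROR) {
  return sizeInBytes <= limit;
}

// Hypothetical helper: same check, for a file on disk.
function fileFitsInOneBuffer(filePath, limit = LIMIT_FROM_ERROR) {
  return fitsInOneBuffer(fs.statSync(filePath).size, limit);
}

const largeTxtSize = 2162601906; // file size reported in the error message
const overshoot = largeTxtSize - LIMIT_FROM_ERROR;

console.log(fitsInOneBuffer(largeTxtSize)); // false
console.log(overshoot); // 15118259 bytes over the limit
```

A check like this only tells us that reading the file in one go will fail. It does nothing to make the compression itself possible.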
It need not be so.
This is where streams come in handy.
A stream in Node.js is an interface for working with data that keeps on coming. This means that all of the data needn't be there before we can start processing it.
That's the very nature of streaming data in general: it may even be an infinite stream (for example, temperature readings of CPU cores of a supercomputer). So we can't necessarily wait for all the data to be available.
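A rough sketch of that idea: a source that never ends, from which we still manage to process the first few values. The generator fakeReadings and its made-up numbers are purely hypothetical stand-ins for something like temperature readings.

```javascript
const { Readable } = require('stream');

// Hypothetical never-ending source: yields made-up readings forever.
function* fakeReadings() {
  let reading = 40;
  while (true) {
    yield reading;
    reading += 1;
  }
}

// Consume only the first `count` readings, then stop.
async function takeFirst(count) {
  const seen = [];
  for await (const reading of Readable.from(fakeReadings())) {
    seen.push(reading);
    if (seen.length === count) break; // leaving the loop destroys the stream
  }
  return seen;
}

const firstThree = takeFirst(3);
firstThree.then((readings) => console.log(readings)); // [ 40, 41, 42 ]
```

Waiting for "all the data" is not an option here, since the generator never finishes. Processing still starts as soon as the first reading is available.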
There are four types of streams in Node.js.
- Readable streams: data can be read.
- Writable streams: data can be written.
- Duplex streams: data can both be read and written.
- Transform streams: duplex streams which can apply some kind of transformation to the read data before it is written.

The main objective of streams is to "limit the buffering of data to acceptable levels such that sources and destinations of differing speeds will not overwhelm the available memory". ⁵
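The stream types listed above can be sketched with the classes from Node's stream module. This is a toy example of my own, not something from the talk: a Readable made from two strings, a Transform that upper-cases text, and a Writable that collects what it receives.

```javascript
const { Readable, Writable, Transform, Duplex, pipeline } = require('stream');

// Readable: a source we can read from.
const source = Readable.from(['node ', 'streams']);

// Transform: a Duplex whose output is computed from its input.
const upperCase = new Transform({
  transform(chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  },
});

// Writable: a destination we can write to.
const received = [];
const sink = new Writable({
  write(chunk, encoding, callback) {
    received.push(chunk.toString());
    callback();
  },
});

// Connect the three and resolve with the collected text once done.
const done = new Promise((resolve, reject) => {
  pipeline(source, upperCase, sink, (err) =>
    err ? reject(err) : resolve(received.join(''))
  );
});

done.then((text) => console.log(text)); // NODE STREAMS
console.log(upperCase instanceof Duplex); // true
```

The last line shows that a Transform stream is itself a kind of Duplex stream.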
An internal buffer is associated with every Readable and Writable stream. The size of this internal buffer depends on the highWaterMark option of the stream's constructor.
For Readable streams, the data is read into the internal buffer till the limit specified by highWaterMark, after which the reading process is paused. Reading can resume after enough data in the buffer has been consumed to make way for more data.
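One way to see highWaterMark at work on the reading side is to pass it to fs.createReadStream() and look at the chunk sizes. This is only a sketch: the scratch file name hwm-sample.txt and the tiny 16-byte limit are my own choices for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical scratch file holding 40 bytes.
const samplePath = path.join(os.tmpdir(), 'hwm-sample.txt');
fs.writeFileSync(samplePath, 'x'.repeat(40));

// Read it back with a deliberately tiny highWaterMark and record chunk sizes.
const chunkSizes = [];
const chunksRead = new Promise((resolve, reject) => {
  fs.createReadStream(samplePath, { highWaterMark: 16 })
    .on('data', (chunk) => chunkSizes.push(chunk.length))
    .on('end', () => resolve(chunkSizes))
    .on('error', reject);
});

chunksRead.then((sizes) => console.log('chunk sizes:', sizes));
```

No chunk should come out larger than the 16 bytes asked for, so the 40-byte file arrives in several pieces instead of all at once.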
Likewise for Writable streams, the data to be written out is first stored inside the internal buffer from where it is consumed later.
Since Duplex and Transform streams can perform both reading and writing, each of them has two separate internal buffers: one for reading and one for writing.
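On the writing side, the buffer limit shows up in the return value of write(). Below is a minimal sketch with a deliberately slow, made-up destination (slowSink is hypothetical) and a highWaterMark of just 4 bytes.

```javascript
const { Writable } = require('stream');

// Hypothetical slow destination: it holds on to the callbacks instead of
// calling them, so written data stays in the internal buffer.
const pendingCallbacks = [];
const slowSink = new Writable({
  highWaterMark: 4, // bytes
  write(chunk, encoding, callback) {
    pendingCallbacks.push(callback);
  },
});

const firstOk = slowSink.write('ab'); // 2 bytes buffered: below the limit
const secondOk = slowSink.write('cd'); // 4 bytes buffered: limit reached

console.log(firstOk, secondOk); // true false
console.log(slowSink.writableLength, slowSink.writableHighWaterMark); // 4 4
```

A false return value is the stream's way of asking the producer to hold off until the 'drain' event. pipe() does this waiting for us, which is why the piped version of the app keeps its memory use in check.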
Let us try compressing the large.txt file again, this time using streams.
const fs = require('fs');
const zlib = require('zlib');
const filePath = process.argv[2];
const inputStream = fs.createReadStream(filePath);
const outputStream = fs.createWriteStream("out_stream.gz");
const compressTransformStream = zlib.createGzip();
inputStream.pipe(compressTransformStream)
.pipe(outputStream)
This time, it works.
Let us modify this a bit and add a message using an event handler for the close event of the writable stream. This event is emitted when the file being written to is closed.
inputStream.pipe(compressTransformStream)
  .pipe(outputStream)
  .on('close', () => {
    console.log("Compression with streams success!");
  });
Running the new app,
$ node stream.js large.txt
Compression with streams success!
$ du -h out_stream.gz large.txt
2.1G large.txt
908K out_stream.gz
Compression works with streams because the data buffer need not be as big as the file itself: data that has been 'produced' is consumed before the rest of the file is read, so a much smaller buffer suffices. The app can start writing to the output file before the entire input file has been read and compressed. This makes the app far easier on the computer's memory.
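One thing the pipe() chain does not do is report errors from the earlier streams in the chain. As a possible variant (my own addition, not from the talk), stream.pipeline() connects the same three streams and hands any error to a single callback. The function name compressFile and the scratch file used for the demo are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const zlib = require('zlib');
const { pipeline } = require('stream');

// Hypothetical helper: gzip inputPath into outputPath using streams.
function compressFile(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    pipeline(
      fs.createReadStream(inputPath),
      zlib.createGzip(),
      fs.createWriteStream(outputPath),
      (err) => (err ? reject(err) : resolve(outputPath))
    );
  });
}

// Demo on a small scratch file (a stand-in for large.txt).
const inputPath = path.join(os.tmpdir(), 'pipeline-sample.txt');
const outputPath = inputPath + '.gz';
const sampleText = 'hello streams\n'.repeat(1000);
fs.writeFileSync(inputPath, sampleText);

const compressed = compressFile(inputPath, outputPath);
compressed
  .then((out) => console.log('Wrote', out))
  .catch((err) => console.error('Compression failed:', err.message));
```

For the app in this post, the call would be something like compressFile(process.argv[2], 'out_stream.gz').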
(Almost all of this article was originally written in 2020 before having access to the talk's video (not long before corona forced us all inside 😅). Got around to dusting it off only now.)