Node echo server degrades 10x when stream pipes are used over buffering


On Node v8.1.4 and v6.11.1:

I started out with the following echo server implementation, which I'll refer to as pipe.js or pipe:

const http = require('http');

const handler = (req, res) => req.pipe(res);
http.createServer(handler).listen(3001);
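For context, req.pipe(res) behaves roughly like the hand-rolled handler below (a simplified sketch; the real pipe implementation also handles errors and unpiping):

const http = require('http');

// Roughly what req.pipe(res) does, including backpressure handling.
const handler = (req, res) => {
  req.on('data', (chunk) => {
    if (!res.write(chunk)) req.pause(); // destination buffer full: stop reading
  });
  res.on('drain', () => req.resume());  // destination drained: resume reading
  req.on('end', () => res.end());
};
http.createServer(handler).listen(3001);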

I benchmarked it with wrk and the following Lua script (shortened for brevity), which sends a small body payload:

wrk.method = "POST"
wrk.body   = string.rep("a", 10)

At 2k requests per second and 44ms of average latency, the performance is not great.

So I wrote an implementation that accumulates intermediate buffers until the request has finished and then writes them out, which I'll refer to as buffer.js or buffer:

const http = require('http');

const handler = (req, res) => {
  let buffs = [];
  req.on('data', (chunk) => {
    buffs.push(chunk);
  });
  req.on('end', () => {
    res.write(Buffer.concat(buffs));
    res.end();
  });
};
http.createServer(handler).listen(3001);

Performance changed drastically: buffer.js services 20k requests per second at 4ms of average latency.

Visually, the graph below depicts the average number of requests serviced over 5 runs, at various latency percentiles (p50 is the median).

So, buffer is an order of magnitude better in all categories. My question is: why?

What follows are my investigation notes, which are at least educational.

Response behavior

Both implementations have been crafted so that they give the exact same response as returned by curl -D - --raw. If given a body of 10 d's, both return the exact same response (with modified time, of course):

HTTP/1.1 200 OK
Date: Thu, 20 Jul 2017 18:33:47 GMT
Connection: keep-alive
Transfer-Encoding: chunked

dddddddddd
0

Both output 128 bytes (remember this).
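As an aside, the trailing 0 comes from HTTP chunked transfer encoding, which frames each chunk with its size in hex and terminates the stream with a zero-length chunk. A minimal sketch of the framing (the sizes line up with the writev traces shown later: 109 + 2 + 10 + 2 + 5 = 128 bytes):

// Chunked transfer-encoding framing: <size in hex>\r\n<data>\r\n ... 0\r\n\r\n
const body = 'dddddddddd';
const framed = body.length.toString(16) + '\r\n' + body + '\r\n' + '0\r\n\r\n';
console.log(JSON.stringify(framed)); // "a\r\ndddddddddd\r\n0\r\n\r\n"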

The mere fact of buffering

Semantically, the only difference between the two implementations is that pipe.js writes data while the request hasn't ended. This might make one suspect that there are multiple data events in buffer.js. This is not true.

req.on('data', (chunk) => {
  console.log(`chunk length: ${chunk.length}`);
  buffs.push(chunk);
});
req.on('end', () => {
  console.log(`buffs length: ${buffs.length}`);
  res.write(Buffer.concat(buffs));
  res.end();
});

Empirically:

  • chunk length will always be 10
  • buffs length will always be 1

Since there will only ever be one chunk, what happens if we remove buffering and implement a poor man's pipe:

const http = require('http');

const handler = (req, res) => {
  req.on('data', (chunk) => res.write(chunk));
  req.on('end', () => res.end());
};
http.createServer(handler).listen(3001);

Turns out, this has performance as abysmal as pipe.js. I find this interesting because the same number of res.write and res.end calls are made with the same parameters. My best guess so far is that the performance differences are due to sending response data after the request data has ended.

Profiling

I profiled both applications using the simple profiling guide (node --prof, then node --prof-process on the generated isolate log).

I've included only the relevant lines:

pipe.js

 [Summary]:

   ticks  total  nonlib   name
    2043   11.3%   14.1%  JavaScript
   11656   64.7%   80.7%  C++
      77    0.4%    0.5%  GC
    3568   19.8%          Shared libraries
     740    4.1%          Unaccounted

 [C++]:

   ticks  total  nonlib   name
    6374   35.4%   44.1%  syscall
    2589   14.4%   17.9%  writev

buffer.js

 [Summary]:

   ticks  total  nonlib   name
    2512    9.0%   16.0%  JavaScript
   11989   42.7%   76.2%  C++
     419    1.5%    2.7%  GC
   12319   43.9%          Shared libraries
    1228    4.4%          Unaccounted

 [C++]:

   ticks  total  nonlib   name
    8293   29.6%   52.7%  writev
     253    0.9%    1.6%  syscall

We see that in both implementations C++ dominates the time; however, the functions that dominate are swapped. Syscalls account for nearly half the time for pipe, yet only 1% for buffer (forgive my rounding). Next step: which syscalls are the culprit?

Strace, here we come

Invoking strace as strace -c node pipe.js gives us a summary of the syscalls. Here are the top syscalls:

pipe.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.91    0.014974           2      9492           epoll_wait
 25.57    0.008720           0    405693           clock_gettime
 20.09    0.006851           0     61748           writev
  6.11    0.002082           0     61803       106 write

buffer.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 42.56    0.007379           0    121374           writev
 32.73    0.005674           0    617056           clock_gettime
 12.26    0.002125           0    121579           epoll_ctl
 11.72    0.002032           0    121492           read
  0.62    0.000108           0      1217           epoll_wait

The top syscall for pipe (epoll_wait), at 44% of the time, takes only 0.6% of the time for buffer (a 140x difference). While there is a large time discrepancy, the number of times epoll_wait is invoked is less lopsided, with pipe calling epoll_wait ~8x more often. We can derive a couple bits of useful information from this: pipe calls epoll_wait constantly and, on average, these calls are heavier than the epoll_wait calls for buffer.

For buffer, the top syscall is writev, which is expected considering most of the time should be spent writing data to a socket.

Logically, the next step is to take a look at these epoll_wait statements with regular strace, which showed that for buffer, epoll_wait always contained 100 events (representing the hundred connections used by wrk), while pipe had fewer than 100 most of the time. Like so:

pipe.js

epoll_wait(5, [...16 events snipped...], 1024, 0) = 16

buffer.js

epoll_wait(5, [...100 events snipped...], 1024, 0) = 100

Graphically:

This explains why there are more epoll_wait calls in pipe, as epoll_wait doesn't service all the connections in one event loop. The epoll_wait for zero events makes it look like the event loop is idle! None of this explains why epoll_wait takes more time for pipe, as the man page states that epoll_wait should return immediately:

specifying a timeout equal to zero causes epoll_wait() to return immediately, even if no events are available.

While the man page says the function returns immediately, can we confirm this? strace -T to the rescue:

Besides supporting that buffer has fewer calls, we can also see that nearly all of its calls took less than 100ns. Pipe has a much more interesting distribution, showing that while most calls take under 100ns, a non-negligible amount take longer and land in microsecond land.

Strace did find another oddity, and that's with writev, whose return value is the number of bytes written.

pipe.js

writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109},
  {"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}], 4) = 123

buffer.js

writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109},
  {"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}, {"0\r\n\r\n", 5}], 5) = 128

Remember when I said both outputs were 128 bytes? Well, writev returned 123 bytes for pipe and 128 for buffer. The 5-byte difference for pipe (123 + 5 = 128) is reconciled in a subsequent write call for each writev.

write(44, "0\r\n\r\n", 5) 

And if I'm not mistaken, write syscalls are blocking.

Conclusion

If I have to make an educated guess, I'd say that piping while the request hasn't finished causes write calls. These blocking calls significantly reduce the throughput, partially through more frequent epoll_wait statements. Why write is called instead of a single writev, as seen in buffer, is beyond me. Can someone explain why everything I saw is happening?

The kicker? In the official Node.js guide you can see how the guide starts with the buffer implementation and then moves to pipe! If the pipe implementation is in the official guide there shouldn't be such a performance hit, right?

Aside: the real world performance implications of this question should be minimal, as the question is quite contrived with regards to the functionality and the body size, though that doesn't make it any less of a useful question. Hypothetically, an answer could be "Node.js uses write to allow for better performance under x situations (where x is a more real world use case)".


Disclosure: this question was copied and slightly modified from my blog post in the hopes that this is a better avenue for getting it answered.


July 31st 2017 edit

My initial hypothesis, that writing the echoed body after the request stream has finished increases performance, has been disproved by @robertklep with his readable.js (or readable) implementation:

const http   = require('http');
const BUFSIZ = 2048;

const handler = (req, res) => {
  req.on('readable', _ => {
    let chunk;
    while (null !== (chunk = req.read(BUFSIZ))) {
      res.write(chunk);
    }
  });
  req.on('end', () => {
    res.end();
  });
};
http.createServer(handler).listen(3001);

Readable performed at the same level as buffer while writing data before the end event. If anything this makes me more confused, because the only difference between readable and my initial poor man's pipe implementation is the difference between the data and readable events, and yet that caused a 10x performance increase. But we know that the data event isn't inherently slow, because we used it in our buffer code.

For the curious, strace on readable reported that writev outputs all 128 bytes, like buffer.

This is perplexing!

That's a funny question you have!

In fact, buffered vs piped is not the question here. You have a small chunk; it is processed in one event. To show the issue at hand, you can write your handler like this:

let chunk;
req.on('data', (dt) => {
    chunk = dt;
});
req.on('end', () => {
    res.write(chunk);
    res.end();
});

or

let chunk;
req.on('data', (dt) => {
    chunk = dt;
    res.write(chunk);
    res.end();
});
req.on('end', () => {
});

or

let chunk;
req.on('data', (dt) => {
    chunk = dt;
    res.write(chunk);
});
req.on('end', () => {
    res.end();
});

If the write and the end are in the same handler, the latency is 10 times less.

If you check the code of the write function, around this line there is:

msg.connection.cork();
process.nextTick(connectionCorkNT, msg.connection);

It corks the connection and uncorks it on the next event. This means that the data is cached, then forced out on the next event, before other events are processed.
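Cork/uncork is the standard writable-stream batching mechanism, so the behavior can be sketched on a plain socket (a hypothetical illustration, not Node's HTTP internals; socket.cork() and socket.uncork() come from stream.Writable, which net.Socket inherits):

const net = require('net');

net.createServer((socket) => {
  socket.cork();                // start caching writes instead of flushing each one
  socket.write('part 1\r\n');   // cached, not yet on the wire
  socket.write('part 2\r\n');   // cached too
  process.nextTick(() => socket.uncork()); // flush both pieces together on the next tick
}).listen(3002);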

To sum up, if you have the write and the end in different handlers, you have:

  1. cork the connection (+ create a tick to uncork)
  2. create a buffer with the data
  3. uncork the connection on the next event (send the data)
  4. call the end process (which sends another packet with the final chunk and closes)

If they are in the same handler, the end function is called before the uncork event is processed, so the final chunk will be in the cache:

  1. cork the connection
  2. create the buffer with the data
  3. add the "end" chunk to the buffer
  4. uncork the connection to send everything

Also, the end function runs cork/uncork synchronously, which will be a little bit faster.
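For the same reason, a handler can lean on end's synchronous cork/uncork by handing it the data directly (a minimal sketch that relies on the single-chunk body established in the question; a multi-chunk body would call end twice and error):

const http = require('http');

// res.end(chunk) writes the chunk and the final 0\r\n\r\n in one cork/uncork
// cycle, so the whole response can leave in a single writev.
const handler = (req, res) => {
  req.on('data', (chunk) => res.end(chunk)); // assumes the body arrives as one chunk
};
http.createServer(handler).listen(3001);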

Now why does this matter? Because on the TCP side, if you send a packet with data and wish to send more, the process will wait for an acknowledgement from the client before sending more:

Write + end in different handlers:

About 40ms before the ACK:

  • 0.044961s: POST / => the request
  • 0.045322s: HTTP/1.1 => 1st chunk: headers + "aaaaaaaaaa"
  • 0.088522s: ACK of the packet
  • 0.088567s: Continuation => 2nd chunk (the ending part, 0\r\n\r\n)

There is ~40ms before the ACK after the 1st chunk is sent.

Write + end in the same handler:

No ACK needed:

The data is complete in a single packet, so no ACK is needed.

Why 40ms for the ACK? It is a built-in feature of the OS to improve performance overall. It is described in section 4.2.3.2 of IETF RFC 1122, "When to Send an ACK Segment". Red Hat (Fedora/CentOS/RHEL) uses 40ms: it is a parameter and can be modified. On Debian (Ubuntu included), it seems to be hardcoded to 40ms, so it's not modifiable (except if you create a connection with the TCP_NODELAY option).
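In Node, that option is exposed on the sending side as socket.setNoDelay(), which sets TCP_NODELAY. Note it disables Nagle's algorithm on the sender, while the 40ms delayed-ACK timer above lives on the receiver, so this is a neighboring, not identical, mechanism. A sketch of wiring it into the echo server:

const http = require('http');

const handler = (req, res) => req.pipe(res);
const server = http.createServer(handler);
// Disable Nagle's algorithm per connection, so small writes are pushed
// out immediately instead of being coalesced while waiting for ACKs.
server.on('connection', (socket) => socket.setNoDelay(true));
server.listen(3001);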

I hope this is enough detail to understand the process a little bit more. This answer is already big, so I'll stop here, I guess.

Readable

I checked your note about readable. A wild guess: if readable detects an empty input, it closes the stream on the same tick.

Edit: I read the code for readable. As I suspected:

https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L371

https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L1036

If the read finishes the event, an end event is emitted to be processed next.

So the event processing is:

  1. readable event: reads the data
  2. readable detects it has finished => creates the end event
  3. you write the data, which creates an event to uncork
  4. the end event is processed (uncork done)
  5. the uncork is processed (but does nothing)

If you reduce the buffer:

req.on('readable', () => {
    let chunk2;
    while (null !== (chunk2 = req.read(5))) {
        res.write(chunk2);
    }
});

This forces two writes. The process will be:

  1. readable event: reads the data; you get 5 a's
  2. you write the data, which creates an uncork event
  3. you read the data; readable detects it has finished => creates the end event
  4. you write the data, and it is added to the buffered data
  5. the uncork is processed (because it was launched before end); it sends the data
  6. the end event is processed (uncork done) => wait for the ACK to send the final chunk
  7. the process will be slow (it is; I checked)
