javascript - Node echo server degrades 10x when stream pipes are used over buffering
On Node v8.1.4 and v6.11.1:

I started out with the following echo server implementation, which I will refer to as pipe.js or pipe:
    const http = require('http');

    const handler = (req, res) => req.pipe(res);

    http.createServer(handler).listen(3001);
I benchmarked it with wrk, using the following Lua script (shortened for brevity) to send a small body payload:
    wrk.method = "POST"
    wrk.body   = string.rep("a", 10)
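For reproduction, a typical invocation would be something like wrk -c100 -d30s -s post.lua http://localhost:3001 (my assumption, as the exact command isn't shown; post.lua is a hypothetical name for the script above, and -c100 matches the hundred connections mentioned later).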
At 2k requests per second and 44ms of average latency, performance is not great.
So I wrote an implementation that stores the incoming chunks in intermediate buffers until the request has finished and then writes the buffers out. I will refer to this as buffer.js or buffer:
    const http = require('http');

    const handler = (req, res) => {
      let buffs = [];
      req.on('data', (chunk) => {
        buffs.push(chunk);
      });
      req.on('end', () => {
        res.write(Buffer.concat(buffs));
        res.end();
      });
    };

    http.createServer(handler).listen(3001);
The performance drastically changed, with buffer.js servicing 20k requests per second at 4ms of average latency.
Visually, the graph below (not reproduced here) depicts the average number of requests serviced over 5 runs and various latency percentiles (p50 is the median).
So, buffer is an order of magnitude better in all categories. My question is: why?

What follows are my investigation notes, hopefully at least educational.
Response behavior
Both implementations have been crafted to give the same exact response as returned by curl -D - --raw. If given a body of 10 d's, both return the exact same response (with modified time, of course):
    HTTP/1.1 200 OK
    Date: Thu, 20 Jul 2017 18:33:47 GMT
    Connection: keep-alive
    Transfer-Encoding: chunked

    a
    dddddddddd
    0
Both output 128 bytes (remember this).
The mere fact of buffering
Semantically, the only difference between the two implementations is that pipe.js writes data while the request hasn't ended. This might make one suspect that there could be multiple data events in buffer.js. This is not true:
    req.on('data', (chunk) => {
      console.log(`chunk length: ${chunk.length}`);
      buffs.push(chunk);
    });
    req.on('end', () => {
      console.log(`buffs length: ${buffs.length}`);
      res.write(Buffer.concat(buffs));
      res.end();
    });
Empirically:

- Chunk length will always be 10
- Buffers length will always be 1
Since there will only ever be one chunk, what happens if we remove the buffering and implement a poor man's pipe:
    const http = require('http');

    const handler = (req, res) => {
      req.on('data', (chunk) => res.write(chunk));
      req.on('end', () => res.end());
    };

    http.createServer(handler).listen(3001);
Turns out, this has abysmal performance just like pipe.js. I find this interesting because the same number of res.write and res.end calls are made with the same parameters. My best guess so far is that the performance differences are due to sending response data after the request data has ended.
Profiling
I profiled both applications using the simple profiling guide (--prof).
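Concretely, that means something like running node --prof pipe.js under load and then post-processing the generated log with node --prof-process isolate-0x*-v8.log (my assumed commands, following that guide).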
I've included only the relevant lines:
pipe.js
    [Summary]:
       ticks  total  nonlib   name
        2043   11.3%   14.1%  JavaScript
       11656   64.7%   80.7%  C++
          77    0.4%    0.5%  GC
        3568   19.8%          Shared libraries
         740    4.1%          Unaccounted

    [C++]:
       ticks  total  nonlib   name
        6374   35.4%   44.1%  syscall
        2589   14.4%   17.9%  writev
buffer.js
    [Summary]:
       ticks  total  nonlib   name
        2512    9.0%   16.0%  JavaScript
       11989   42.7%   76.2%  C++
         419    1.5%    2.7%  GC
       12319   43.9%          Shared libraries
        1228    4.4%          Unaccounted

    [C++]:
       ticks  total  nonlib   name
        8293   29.6%   52.7%  writev
         253    0.9%    1.6%  syscall
We see that in both implementations C++ dominates the time; however, the functions that dominate are swapped. Syscalls account for nearly half of the time for pipe, yet only 1% for buffer (forgive my rounding). Next step: which syscalls are the culprit?
Strace, here we come
Invoking strace like strace -c node pipe.js will give us a summary of the syscalls. Here are the top syscalls:
pipe.js
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     43.91    0.014974           2      9492           epoll_wait
     25.57    0.008720           0    405693           clock_gettime
     20.09    0.006851           0     61748           writev
      6.11    0.002082           0     61803       106 write
buffer.js
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     42.56    0.007379           0    121374           writev
     32.73    0.005674           0    617056           clock_gettime
     12.26    0.002125           0    121579           epoll_ctl
     11.72    0.002032           0    121492           read
      0.62    0.000108           0      1217           epoll_wait
The top syscall for pipe (epoll_wait) accounts for 44% of the time, compared to only 0.6% of the time for buffer (a 140x increase). While there is a large time discrepancy, the number of times epoll_wait is invoked is less lopsided, with pipe calling epoll_wait ~8x more often. We can derive a couple bits of useful information from that statement: pipe calls epoll_wait constantly and, on average, these calls are heavier than the epoll_wait calls in buffer.

For buffer, the top syscall is writev, which is expected considering most of the time should be spent writing data to a socket.
Logically, the next step is to take a look at these epoll_wait statements with regular strace, which showed that epoll_wait for buffer always contained 100 events (representing the hundred connections used with wrk), while pipe had less than 100 most of the time. Like so:
pipe.js
    epoll_wait(5, [.16 snip.], 1024, 0) = 16
buffer.js
    epoll_wait(5, [.100 snip.], 1024, 0) = 100
Graphically: [chart of epoll_wait event counts omitted]
This explains why there are more epoll_wait calls in pipe: epoll_wait doesn't service all the connections in one event loop iteration. The epoll_wait calls with zero events make the event loop look idle! None of this explains why epoll_wait takes more time for pipe, as the man page states that epoll_wait should return immediately:

    Specifying a timeout of 0 causes epoll_wait() to return immediately, even if no events are available.
While the man page says the function returns immediately, can we confirm this? strace -T to the rescue:

[Timing distribution charts omitted.]

Besides supporting the claim that buffer has fewer calls, we can also see that nearly all the calls took less than 100ns. Pipe has a much more interesting distribution, showing that while most calls take under 100ns, a non-negligible amount take longer and land in microsecond land.
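For the record, the per-call timings above can be gathered with something like strace -T -e trace=epoll_wait node pipe.js (my assumed invocation, since the exact command isn't given; -e trace= filters the output to a single syscall).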
Strace did find another oddity, though, and that's with writev. Its return value is the number of bytes written.
pipe.js
    writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109}, {"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}], 4) = 123
buffer.js
    writev(11, [{"HTTP/1.1 200 OK\r\nDate: Thu, 20 J"..., 109}, {"\r\n", 2}, {"dddddddddd", 10}, {"\r\n", 2}, {"0\r\n\r\n", 5}], 5) = 128
Remember when I said both output 128 bytes? Well, writev returned 123 bytes for pipe and 128 for buffer. The five byte difference for pipe is reconciled in a subsequent write call for each writev:
    write(44, "0\r\n\r\n", 5)
And if I'm not mistaken, write syscalls are blocking.
Conclusion
If I have to make an educated guess, I would say that piping while the request has not finished causes extra write calls. These blocking calls reduce the throughput, partially through more frequent epoll_wait statements. Why write is called instead of a single writev (as seen in buffer) is beyond me. Can someone explain why everything I saw is happening?
The kicker? In the official Node.js guide you can see how the guide starts with the buffer implementation and then moves to pipe! If the pipe implementation is in the official guide there shouldn't be such a performance hit, right?
Aside: the real world performance implications of this question should be minimal, as the question is quite contrived in regards to the functionality and the body size, though that doesn't mean it is any less of a useful question. Hypothetically, an answer could look like "Node.js uses write to allow for better performance under x situations (where x is a more real world use case)".
Disclosure: this question was copied and slightly modified from my blog post in the hopes that this is a better avenue for getting it answered.
July 31st 2017 edit
My initial hypothesis, that writing the echoed body after the request stream has finished increases performance, has been disproved by @robertklep with his readable.js (or readable) implementation:
    const http = require('http');
    const BUFSIZ = 2048;

    const handler = (req, res) => {
      req.on('readable', _ => {
        let chunk;
        while (null !== (chunk = req.read(BUFSIZ))) {
          res.write(chunk);
        }
      });
      req.on('end', () => {
        res.end();
      });
    };

    http.createServer(handler).listen(3001);
Readable performed at the same level as buffer while writing data before the end event. If anything this makes me more confused, because the only difference between readable and my initial poor man's pipe implementation is the difference between the data and readable events, and yet that caused a 10x performance increase. But we know that the data event isn't inherently slow, because we used it in our buffer code.
For the curious, strace on readable reported writev outputting all 128 bytes, just like buffer.

This is perplexing!
That's a funny question you have!
In fact, buffered vs piped is not the question here. You have a small chunk; it is processed in one event. To show the issue at hand, you can write your handler like this:
    let chunk;
    req.on('data', (dt) => {
      chunk = dt;
    });
    req.on('end', () => {
      res.write(chunk);
      res.end();
    });
or
    let chunk;
    req.on('data', (dt) => {
      chunk = dt;
      res.write(chunk);
      res.end();
    });
    req.on('end', () => {
    });
or
    let chunk;
    req.on('data', (dt) => {
      chunk = dt;
      res.write(chunk);
    });
    req.on('end', () => {
      res.end();
    });
If write and end are on the same handler, the latency is 10 times less.
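As a side note, since we know the body arrives as a single chunk here, the same single-packet behavior can be sketched with the res.end(data) shorthand, which performs the final write and the end together (my illustration, not from the original answer):

    const http = require('http');

    // Sketch: res.end(chunk) writes the final chunk and ends the response in
    // one step, so the header, body and terminating chunk share one writev.
    // This assumes the single-chunk body established earlier in the question.
    const handler = (req, res) => {
      req.on('data', (chunk) => res.end(chunk));
    };

    http.createServer(handler).listen(3001);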
If you check the write function code, you will find, around this line:

    msg.connection.cork();
    process.nextTick(connectionCorkNT, msg.connection);
It corks and then uncorks the connection on the next event. This means that it uses a cache for the data, and forces the data to be sent on the next event, before other events are processed.
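To make the mechanism concrete, here is an illustrative sketch (mine, not from the question) that performs the corking by hand on the underlying socket, mirroring what the quoted line does internally; a handler does not normally need to do this:

    const http = require('http');

    // Illustrative only: batch the response writes by corking the socket,
    // the way res.write() does internally via msg.connection.cork().
    // cork() calls nest, so the data is flushed once every cork has been
    // matched by an uncork.
    const handler = (req, res) => {
      req.on('data', (chunk) => {
        res.connection.cork();     // start batching socket output
        res.write(chunk);          // lands in the corked buffer
        res.end();                 // the final chunk joins the same buffer
        process.nextTick(() => res.connection.uncork()); // flush as one writev
      });
    };

    http.createServer(handler).listen(3001);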
To sum up, if you have write and end on different handlers, you have:

- cork the connection (+ create a tick to uncork)
- create a buffer with the data
- uncork the connection on an event (send the data)
- call the end process (that sends another packet with the final chunk and closes)
If they are on the same handler, the end function is called before the uncork event is processed, so the final chunk will be in the cache:

- cork the connection
- create the buffer with the data
- add the "end" chunk to the buffer
- uncork the connection to send everything
Also, the end function runs cork/uncork synchronously, which will be a little bit faster.
Now why does this matter? Because on the TCP side, if you send a packet with data and wish to send more, the process will wait for an acknowledgement from the client before sending more:
write + end on different handlers:

- 0.044961s: POST / => the request
- 0.045322s: HTTP/1.1 => 1st chunk: header + "aaaaaaaaa"
- 0.088522s: acknowledge of the packet
- 0.088567s: continuation => 2nd chunk (ending part, 0\r\n\r\n)
There is ~40 ms before the ACK after the 1st buffer is sent.
write + end on the same handler:

The data is complete in a single packet; no ACK is needed.
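For what it's worth, these exchanges can be observed with a packet capture; an assumed invocation (not from the original answer) would be sudo tcpdump -i lo -ttt 'tcp port 3001', where -ttt prints inter-packet deltas.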
Why the 40ms before the ACK? It is a built-in feature of the OS to improve performance overall. It is described in section 4.2.3.2 of IETF RFC 1122, "When to Send an ACK Segment". Red Hat (Fedora/CentOS/RHEL) uses 40ms: it is a parameter and it can be modified. On Debian (Ubuntu included), it seems to be hardcoded to 40ms, so it's not modifiable (except if you create the connection with the TCP_NODELAY option).
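For completeness, Node exposes TCP_NODELAY through socket.setNoDelay(); a sketch of applying it on the server side (note that this disables Nagle's algorithm on the sender, which interacts with the receiver's delayed ACK rather than changing the 40ms timer itself):

    const http = require('http');

    // Sketch: disable Nagle's algorithm per connection so that small writes
    // are sent immediately instead of waiting on an outstanding ACK.
    const server = http.createServer((req, res) => req.pipe(res));
    server.on('connection', (socket) => socket.setNoDelay(true));
    server.listen(3001);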
I hope this is enough detail to understand the process a little bit more. This answer is already big, so I will stop here, I guess.
Readable

I checked your note about readable. A wild guess: if readable detects an empty input, it closes the stream on the same tick.
Edit: I read the code for readable. As I suspected:
https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L371
https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L1036
If the read finishes during the event, the end event is emitted, to be processed next.
So the event processing is:

- readable event: reads the data; readable detects that it has finished => creates the end event
- you write the data, which creates an uncork event
- the end event is processed (uncork done)
- the uncork event is processed (but nothing is left to do)
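To watch this ordering yourself, a small instrumentation sketch (my own, not from the original answer) can log each step; the reduced-buffer variant below works the same way:

    const http = require('http');

    // Log the relative order of readable, write and end to observe the
    // event sequence described above.
    const handler = (req, res) => {
      req.on('readable', () => {
        let chunk;
        while (null !== (chunk = req.read())) {
          console.log('readable: chunk of %d bytes', chunk.length);
          res.write(chunk);
          console.log('write: queued (uncork scheduled for the next tick)');
        }
      });
      req.on('end', () => {
        console.log('end: emitted');
        res.end();
      });
    };

    http.createServer(handler).listen(3001);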
If you reduce the buffer:
    req.on('readable', () => {
      let chunk2;
      while (null !== (chunk2 = req.read(5))) {
        res.write(chunk2);
      }
    });
This forces two writes. The process will be:
- readable event: reads the data (5 a's)
- you write the data, which creates an uncork event
- you read the data; readable detects that it has finished => creates the end event
- you write the data, and it is added to the buffered data
- the uncork event is processed (because it was launched before end); the data is sent
- the end event is processed (uncork done) => waits for the ACK to send the final chunk
- the process is slow (it is; I checked)