Wednesday, February 1, 2012

Nasty StreamFS Bug

Several months ago I ran into a nasty bug in StreamFS. When the incoming data rate was too high, StreamFS would crash with a flood of “Too many open files” exceptions. I took an initial approach to the problem and then dug deeper, only to find it was fundamentally a timing bug -- although I’m still unsure why the timing caused the failure.

Initial approach:
  • Call close() on all database connections, in both the MySQL driver and the MongoDB driver.
  • Make sure all HttpExchange objects are properly cleaned up after a response is sent.
  • Add a finally clause to the sendResponse() method in the Resource code that ensures the input/output streams are closed and exchange.close() is called (a sketch follows this list).
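For that last item, the pattern looks roughly like the following. This is a simplified sketch of a sendResponse() method inside the Resource class, not the actual StreamFS source; the signature and the way the body is written out are assumptions.

import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Sketch only: the real sendResponse() does more, but the finally
// clause is the point -- the streams and the exchange are released
// no matter what happens during the write.
public void sendResponse(HttpExchange exchange, int status, String body) throws IOException {
    byte[] payload = body.getBytes(StandardCharsets.UTF_8);
    InputStream in = exchange.getRequestBody();
    OutputStream out = exchange.getResponseBody();
    try {
        exchange.sendResponseHeaders(status, payload.length);
        out.write(payload);
    } finally {
        // Always close, even on failure, so the underlying socket's
        // file descriptor is returned to the OS.
        try { in.close(); } catch (IOException ignored) {}
        try { out.close(); } catch (IOException ignored) {}
        exchange.close();
    }
}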
The code was spread all over the place, so it took a while to make sure I had “plugged every hole”, and there were many! So it wasn’t a completely wasted effort: it made the code base more stable, no question. However, the problem still occurred when I unleashed the deluge of data onto the system.

For the next couple of weeks I tried everything: tracers, logging, etc. Nothing cleared the bug. So I finally decided to just dig through the datapath: data is submitted, forwarded to the subscription engine, and finally placed in MongoDB. I added arbitrary delays along the way, and it turned out that was exactly what triggered the problem. The process described above was entirely serialized (sketched just below the list):
  • Forward to subscribers
  • Save in database
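In handler terms, the serialized path looked something like the sketch below. This is illustrative only; the class and the forwardToSubscribers/saveToDatabase names are stand-ins, not the actual StreamFS code.

import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;

// Illustrative serialized datapath: both steps run before the reply is
// sent, so any delay between them keeps the exchange (and its file
// descriptor) open and holds up the next incoming document.
public class SerializedHandler {
    void handleIncoming(HttpExchange exchange, String jsonDoc) throws IOException {
        forwardToSubscribers(jsonDoc);   // step 1: subscription engine
        saveToDatabase(jsonDoc);         // step 2: MongoDB insert
        sendResponse(exchange, 200, "ok");
    }

    void forwardToSubscribers(String doc) { /* push to subscribers */ }
    void saveToDatabase(String doc)       { /* insert into MongoDB */ }
    void sendResponse(HttpExchange ex, int status, String body) throws IOException { /* close streams + exchange, as above */ }
}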
With a delay of more than 100 ms between these two steps and an incoming post rate of about 700 Kbps (roughly 300 bytes per entry, so about 300 JSON documents per second), the process could not send the reply and handle the next incoming document fast enough. I created a separate task using a thread pool to parallelize the subscription forwarding and the database insertion, and the problem disappeared. There must be something in the underlying HttpExchange code in the com.sun.net.httpserver library that keeps close from ever being called when the incoming data rate is too high. In any case, it fixed the bug and StreamFS is better for it.
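The fix, in sketch form, was along these lines. Again hedged: the pool size and the task shape are assumptions, not the exact StreamFS implementation; the point is that the reply no longer waits on the two slow steps.

import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Parallelized datapath: the handler acknowledges the POST immediately
// and hands subscription forwarding and the database insert to a thread
// pool, so the exchange is closed right away and the next document can
// be accepted.
public class ParallelHandler {
    private final ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is a guess

    void handleIncoming(HttpExchange exchange, final String jsonDoc) throws IOException {
        pool.submit(new Runnable() {
            public void run() { forwardToSubscribers(jsonDoc); }
        });
        pool.submit(new Runnable() {
            public void run() { saveToDatabase(jsonDoc); }
        });
        sendResponse(exchange, 200, "ok"); // reply no longer waits on the two steps
    }

    void forwardToSubscribers(String doc) { /* push to subscribers */ }
    void saveToDatabase(String doc)       { /* insert into MongoDB */ }
    void sendResponse(HttpExchange ex, int status, String body) throws IOException { /* close streams + exchange */ }
}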