Why programming is sometimes hard

07-18-2014, 12:42 PM
I fixed some networking issues in our engine (specifically Drox Operative) a while back and the path I had to take to find the solution was interesting, so I thought I would write up a little of what happened.

The problem was that once a game progressed long enough, people were having a hard time joining the server hosting that particular sector. Easy enough, I thought, but no matter what I did I couldn't reproduce the problem. I even got some of our gamers to send me their save games that they were having trouble with, but still no luck.

Even though I couldn't reproduce the problem, I looked at the save games to see if there was anything interesting going on. What I found was that one of the initial networking messages was so large that the networking system needed to split it into more than 10 fragments (the system would send 10+ smaller messages). At the time, the system would throw out the entire message if it received an out-of-order fragment or if any fragment got lost. I figured that if the packet loss was very high, a message of that size would almost never get through. So I set my handy packet loss tool to drop 50% of the packets and, sure enough, I could almost never get that message through.
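To put a rough number on that, here is a small Python sketch. It assumes each fragment is lost independently and ignores reordering, so it is only an approximation of what the engine actually saw; the function name is made up for illustration.

```python
def delivery_probability(fragments: int, loss_rate: float) -> float:
    """Chance that every fragment of one message arrives.

    Assumes each fragment is dropped independently with probability
    loss_rate, and that losing any single fragment discards the message.
    """
    return (1.0 - loss_rate) ** fragments


if __name__ == "__main__":
    for loss in (0.05, 0.20, 0.50):
        chance = delivery_probability(10, loss)
        print(f"loss {loss:.0%}: 10-fragment message arrives {chance:.4%} of the time")
```

At 50% loss a 10-fragment message needs ten successes in a row, which is 1 in 1024 attempts under these assumptions.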

So I went about making the system smarter about fragments. I made it store all of the fragments, so if it missed a fragment or got one out of order, everything was fine and it would just wait for the other side to resend the message. As long as the next message was the same as the first, it would ignore repeated fragments and just use the ones it still needed. This way, sooner or later, even with bad packet loss, it would get all of the fragments.
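The idea can be sketched roughly like this in Python. This is not the engine's code: the class name, the message_id field, and the index/total layout are all assumptions made for the example.

```python
class FragmentAssembler:
    """Collects fragments of one message across repeated sends (hypothetical design)."""

    def __init__(self):
        self.message_id = None   # which message we are currently collecting
        self.total = 0           # how many fragments that message has
        self.fragments = {}      # fragment index -> payload bytes

    def receive(self, message_id, index, total, payload):
        """Store one fragment; return the whole message once complete, else None."""
        if message_id != self.message_id or total != self.total:
            # A different message has started, so drop the old partial data.
            self.message_id = message_id
            self.total = total
            self.fragments = {}
        if not 0 <= index < total:
            return None  # malformed fragment, ignore it
        # setdefault keeps the first copy, so resent duplicates are ignored.
        self.fragments.setdefault(index, payload)
        if len(self.fragments) < total:
            return None  # still waiting; the sender will resend the message
        return b"".join(self.fragments[i] for i in range(total))
```

Because fragments are keyed by index, arrival order doesn't matter, and each resend only has to fill in the holes left by earlier attempts.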

So now everything should have worked great. I tested it and the new stuff did exactly what it should, but joining still didn't work. Even weirder, I turned the packet loss tool off and it still didn't work, even though it used to work with the tool off. At least now I had a reproducible situation.

I debugged the problem and saw that each fragment arrived fine until a little after 8192 bytes. It wasn't exactly 8192, but it was near enough to that power of 2 that I was suspicious. I turned the packet loss stuff back on and started getting data past 8192 bytes, but I noticed that the same number of fragments got through each time. So the networking was only delivering a certain number of bytes before eating everything else. I did a little googling and found out that Windows defaults to an 8192-byte buffer for incoming UDP packets.
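The buffer in question is the socket's receive buffer, exposed through the standard SO_RCVBUF socket option. A minimal Python sketch of checking and raising it follows; the helper name and the 256 KB request are arbitrary choices for illustration, and the engine itself would do the equivalent through its own socket layer.

```python
import socket


def enlarge_receive_buffer(sock: socket.socket, requested_bytes: int) -> int:
    """Ask the OS for a bigger receive buffer and return the size it reports.

    The OS may round, double, or cap the request, so always read the
    value back instead of assuming the request was honored exactly.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    default_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    new_size = enlarge_receive_buffer(sock, 256 * 1024)
    print(f"default receive buffer: {default_size} bytes, now: {new_size} bytes")
    sock.close()
```

The default printed here depends on the operating system and its configuration, so it won't necessarily be 8192.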

Ok, so I'd found the problem. I found the correct commands and told the networking to use a much larger buffer so it could hold at least one large message. I tested again and it still didn't work! :( I started debugging again. Now I saw fragments coming through way past 8192, so that was fixed, but at around 32K one of the numbers went negative. Again, that sounded like a power of 2, in this case the limit of a signed 16-bit value. Sure enough, I found that fragments were using a signed 16-bit value for an offset number. Again, easy enough: I changed it to a 32-bit value, so that should never be a problem again, assuming we never generate messages that are over 2GB in size. :)
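The wraparound is easy to demonstrate. Python integers don't overflow on their own, so this sketch uses the struct module to mimic fixed-width storage; the helper names are invented for the example and the real offset field lives in the engine's own code.

```python
import struct


def as_int16(value: int) -> int:
    """Reinterpret the low 16 bits of value as a signed 16-bit integer."""
    return struct.unpack("<h", struct.pack("<H", value & 0xFFFF))[0]


def as_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


if __name__ == "__main__":
    for offset in (32767, 32768, 40000):
        print(f"offset {offset}: int16 sees {as_int16(offset)}, int32 sees {as_int32(offset)}")
```

A signed 16-bit field tops out at 32767, so an offset of 32768 reads back as -32768, while the 32-bit version stays correct up to 2147483647, which is where the 2GB limit comes from.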

I tested again and things finally worked as they should! So in the end, my initial packet loss changes had nothing to do with the real problem we were running into. The fixes to the actual problem were like 2 lines of code, compared to probably 100s dealing with the packet loss. However, it's still a nice change because it handles packet loss much better on large messages. I'm still not quite sure why I couldn't reproduce the problem in the first place, though.