receive hangs; was TCP send() performance and eventual hangs

zigg
Posts: 28
Joined: Wed Jul 01, 2009 2:42 pm

Re: TCP send() performance and eventual hangs

Post by zigg » Sun Jul 05, 2009 4:31 pm

Yeah, I intended to post it, but got dragged away from the computer yesterday. ;) But here you go:

http://zigg.com/tmp/tftps-ds-test.zip

I'm using the Makefile on Linux, so if your build environment is different you may have to link in dswifi9d yourself. It is a little ugly and I just noticed there's potential for a pretty enormous memory leak in the main loop, but I don't think it should be troubling us here.

tftp.c contains pretty much all the networking stuff, and if I haven't changed it too much, tftp.c.orig can be swapped in to try the select()-based version.

The .nds is of the version with my own timeout loops written. Which, by the way, has proved to be no more reliable than the select() version after all, and slightly slower to boot. (The latter is to be expected.)

main.c allocates and fills a buffer of 1 MiB and will send it to any connecting TFTP client. In my tests I used the client in the tftp-hpa package on Ubuntu/Debian (primarily because it let me use a port number other than 69 and I developed the server on Linux first). Tested it with this command:

Code: Select all

while time tftp 172.22.216.131 -c get test;do date;done
which will repeatedly request the file and exit when the tftp client fails to end the transfer properly.

EDIT: I just got a Guru Meditation Error running it again, at least a handful of packets away from the end of file.

Code: Select all

Guru Meditation Error!
data abort!

pc: 00154a04 addr: 00154a04

r0: 0b004000 r8: 00001800
r1: 0b008daa r9: 0000000c
r2: 0000544e r10: 0b0084f4
r3: 0000045c r11: 00000e10
r4: 0241a2c4 r12: ffff6eaa
r5: 00001800 sp: 0b003dc0
r6: 0000916c lr: ffff0290
r7: 00000708 pc: 00154a0c

0b003dc0: 00001800 0b008daa
0b003dc8: 0000544e 0000045c
0b003dd0: ffff6eaa 02001a9e
0b003dd8: 72070400 04408018
0b003de0: 000abd90 35820040
0b003de8: 00154a08 0be91a00
0b003df0: 15008b00 604713e9
0b003df8: 91432200 4900b6f6
0b003e00: 0003aaaa 00080000
0b003e08: 20000045 00400000

melw
Posts: 13
Joined: Thu Mar 20, 2008 11:43 am

Re: TCP send() performance and eventual hangs

Post by melw » Mon Jul 06, 2009 1:36 pm

Thanks for the source! JSYK, compiled perfectly under Windows as well...

Here's further information what's going on:
melw wrote:There are probably coming from Wifi_Update() in wifi_arm9.c, after reading a packet from the RX buffer. The RX buffer is checked in a loop, and obviously there's at least two packets to be processed when your application hangs. Pretty much only thing that happens after this is processing the received packet(s).
After the previous send here we receive two packets in a row. While (almost?) all the received packets have protocol set to B0024C8, here the first packet has protocol set to B00241C, and the second one again B0024C8. When the second packet gets to sgIP_UDP_ReceivePacket(), the following line hangs in an infinite loop:

Code: Select all

while(tmb->next) tmb=tmb->next;
Sgstair or someone with more knowledge on inner workings of DS Wifi library could probably share some insight into what this actually means.

elhobbs
Posts: 358
Joined: Thu Jul 02, 2009 1:19 pm

Re: TCP send() performance and eventual hangs

Post by elhobbs » Mon Jul 06, 2009 3:15 pm

Code: Select all

while(tmb->next) tmb=tmb->next;
dswifi is maintaining a linked list of packets that have been received. this code is trying to find the end of the list - which is where new packets are added. it would appear that this list is becoming corrupt when packets are being received very close together. dswifi disables interrupts by saving REG_IME and restoring it when these lists are being modified.

I suspect there may be an issue with the libnds fifo system. in particular fifoInternalRecvInterrupt. this function does not backup and restore REG_IME. instead it enables and disables it by writing 1 and 0 repectively - regardless of the current state. I think that if a packet is received while fifoInternalRecvInterrupt is still processing the first packet that it could lead to corruption of the list.

zigg
Posts: 28
Joined: Wed Jul 01, 2009 2:42 pm

Re: TCP send() performance and eventual hangs

Post by zigg » Mon Jul 06, 2009 6:16 pm

elhobbs wrote:I suspect there may be an issue with the libnds fifo system. in particular fifoInternalRecvInterrupt. this function does not backup and restore REG_IME. instead it enables and disables it by writing 1 and 0 repectively - regardless of the current state. I think that if a packet is received while fifoInternalRecvInterrupt is still processing the first packet that it could lead to corruption of the list.
Hmm. I'm really green on this stuff, though I did some lunchtime reading, so feel free to tell me I'm way off-base.

fifoIRI is an interrupt handler. So if REG_IME has been set to 0, it won't be called in the first place? And the interrupt dispatcher is doing its own saving/restoring of IME around the call to fifoIRI, it seems.

fifoIRI does seem to kick off by masking its own IRQ then re-enabling IME (though it'll turn it off for a few critical sections). It exits with IME cleared as near as I can tell; the interrupt dispatcher then restores IME before returning control.

dswifi seems to be calling enableCriticalSection and friend at the right places (not having read all of it, honestly), so fifoIRI shouldn't get called when the aforementioned list is being manipulated.

We shouldn't be in fifoIRI in the first place if dswifi has locked out interrupts, which it will do when manipulating these lists, so fifoIRI setting IME then getting an IRQ_WIFI interrupt should not create a situation where this problem occurs.

I think.

All that said, for brute force's sake I could try to clear the IRQ_WIFI line in REG_IE inside of fifoIRI before setting IME, then restoring it afterward. At the very least it should prove/disprove this theory.

WinterMute
Site Admin
Posts: 1845
Joined: Tue Aug 09, 2005 3:21 am
Location: UK
Contact:

Re: TCP send() performance and eventual hangs

Post by WinterMute » Tue Jul 07, 2009 1:39 am

elhobbs wrote: I suspect there may be an issue with the libnds fifo system. in particular fifoInternalRecvInterrupt. this function does not backup and restore REG_IME. instead it enables and disables it by writing 1 and 0 repectively - regardless of the current state. I think that if a packet is received while fifoInternalRecvInterrupt is still processing the first packet that it could lead to corruption of the list.
Not entirely sure how you come to this conclusion - REG_IME will *always* be zero at the start of any user irq handler, it needs to be explicitly set to 1 to allow nested interrupts. The dispatcher was designed not to nest interrupts by default, mainly since the opposite case would make it seriously difficult to prevent nesting.

interrupts are blocked during list manipulation unless I've missed something which is, of course, perfectly possible.
melw wrote: After the previous send here we receive two packets in a row. While (almost?) all the received packets have protocol set to B0024C8, here the first packet has protocol set to B00241C, and the second one again B0024C8. When the second packet gets to sgIP_UDP_ReceivePacket(), the following line hangs in an infinite loop:

Code: Select all

while(tmb->next) tmb=tmb->next;
Where do you get the idea these are protocol settings? They look awfully like stack addresses to me.

Can you reliably reproduce the hang by any chance?

zigg wrote: pc: 00154a04 addr: 00154a04
This looks like some sort of stack corruption - it's definitely not somewhere the pc should ever be. How reproducible is that?
Help keep devkitPro toolchains free, Donate today

Personal Blog

zigg
Posts: 28
Joined: Wed Jul 01, 2009 2:42 pm

Re: TCP send() performance and eventual hangs

Post by zigg » Tue Jul 07, 2009 2:28 am

Masking IRQ_WIFI in fifoIRI didn't help. I've gone back to the stock libnds*.a.
WinterMute wrote:Can you reliably reproduce the hang by any chance?
Absolutely. I can start my basic TFTP server (see above) and run a TFTP client in a loop, and it will hang after sending some number (and part of the last) of 1 MiB files to my PC.

However, I had not pinpointed it to any particular line of code. It merely behaves the same way with two R: packets and then the hang.
WinterMute wrote:
zigg wrote: pc: 00154a04 addr: 00154a04
This looks like some sort of stack corruption - it's definitely not somewhere the pc should ever be. How reproducible is that?
I've only seen it once. The hang I've seen dozens of times, maybe about ten with the code I posted above.

zigg
Posts: 28
Joined: Wed Jul 01, 2009 2:42 pm

Re: TCP send() performance and eventual hangs

Post by zigg » Tue Jul 07, 2009 3:02 am

zigg wrote:I've only seen it once. The hang I've seen dozens of times, maybe about ten with the code I posted above.
Spoke too soon. Just got another. I hope you don't mind but I just snapped a picture this time:

Image

This is running the same binary that is in the zip file I posted.

elhobbs
Posts: 358
Joined: Thu Jul 02, 2009 1:19 pm

Re: TCP send() performance and eventual hangs

Post by elhobbs » Tue Jul 07, 2009 3:30 am

WinterMute wrote:
elhobbs wrote: I suspect there may be an issue with the libnds fifo system. in particular fifoInternalRecvInterrupt. this function does not backup and restore REG_IME. instead it enables and disables it by writing 1 and 0 repectively - regardless of the current state. I think that if a packet is received while fifoInternalRecvInterrupt is still processing the first packet that it could lead to corruption of the list.
Not entirely sure how you come to this conclusion - REG_IME will *always* be zero at the start of any user irq handler, it needs to be explicitly set to 1 to allow nested interrupts. The dispatcher was designed not to nest interrupts by default, mainly since the opposite case would make it seriously difficult to prevent nesting.

interrupts are blocked during list manipulation unless I've missed something which is, of course, perfectly possible.
since fifoInternalRecvInterrupt enables interrupts wouldn't it allow Wifi_Update to re-enter through Timer_50ms on the timer 3 interrupt?

melw
Posts: 13
Joined: Thu Mar 20, 2008 11:43 am

Re: TCP send() performance and eventual hangs

Post by melw » Tue Jul 07, 2009 1:37 pm

WinterMute wrote:
melw wrote: After the previous send here we receive two packets in a row. While (almost?) all the received packets have protocol set to B0024C8, here the first packet has protocol set to B00241C, and the second one again B0024C8. When the second packet gets to sgIP_UDP_ReceivePacket(), the following line hangs in an infinite loop:

Code: Select all

while(tmb->next) tmb=tmb->next;
Where do you get the idea these are protocol settings? They look awfully like stack addresses to me.
In sgIP_Hub_ReceiveHardwarePacket() there's a following debug message printing these values:

Code: Select all

SGIP_DEBUG_MESSAGE(("\nhub: rx packet %04X %X\n",protocol,packet->totallength));
Didn't notice the protocol is unsigned short* until now - a pointer to protocol settings then?
WinterMute wrote:Can you reliably reproduce the hang by any chance?
Yes. Like Zigg said, the hang occurs sooner or later - sometimes it may take more than 10Mb of transmitted traffic, sometimes as soon as during the first megabyte. The hang is always similar - two packets in a row, as described above.
WinterMute wrote:
zigg wrote: pc: 00154a04 addr: 00154a04
This looks like some sort of stack corruption - it's definitely not somewhere the pc should ever be. How reproducible is that?
I got a guru meditation once during the debugging session yesterday with similar looking addresses and registers. Meanwhile the sgIP_UDP_ReceivePacket() hang occurred about 10 times.

elhobbs
Posts: 358
Joined: Thu Jul 02, 2009 1:19 pm

Re: TCP send() performance and eventual hangs

Post by elhobbs » Tue Jul 07, 2009 2:13 pm

Didn't notice the protocol is unsigned short* until now - a pointer to protocol settings then?
the cast for the protocol is for the buffer pointer not the value at index 6. the protocol is encoded as 2 bytes at offset 12 and 13 in the packet - so casting the buffer as an unsigned short allows reading the value directly.

you have to be careful doing this on the ds as it generally does not like unaligned reads on shorts or ints - but in this case it should always be fine since the offset is fixed and will always be aligned.

Post Reply

Who is online

Users browsing this forum: No registered users and 9 guests