Cogent 637 board

Thu Nov 17 15:31:55 UTC 2005

Jiri Gaisler wrote:
> 
> I had a new look at this issue, and enabled the
> ARM patches also for the SPARC port. I could
> run without ipalign in the network driver (after
> also fixing ip_checksum_hdr).
> 
> However, performance dropped from about 30 Mbit/s
> to ~ 20 Mbit/s for my specific driver. Running
> hardware profiling (possible on the LEON3 cpu),
> I could see where the CPU spends its time:
> 
> With ARM patches
> ================
> 
> function                                              samples     ratio(%)
> _CPU_Thread_Idle_body                                     703      40.77
> memcpy                                                    443      25.69
> in_cksum                                                  119       6.90
> tcp_input                                                  36       2.08
> _Workspace_Handler_initialization                          29       1.68
> syscall                                                    29       1.68
> c_rtems_main                                               25       1.45
> _Thread_Dispatch                                           24       1.39
> memset                                                     20       1.16
> ip_input                                                   19       1.10
> _ISR_Handler                                               16       0.92
> 
> 
> With ipalign in the driver
> ==========================
> 
> function                                              samples     ratio(%)
> _CPU_Thread_Idle_body                                     656      44.68
> ipalign                                                   198      13.48
> memcpy                                                    158      10.76
> in_cksum                                                  121       8.24
> tcp_input                                                  34       2.31
> _Workspace_Handler_initialization                          29       1.97
> soreceive                                                  22       1.49
> memset                                                     20       1.36
> c_rtems_main                                               19       1.29
> ip_input                                                   18       1.22
> syscall                                                    18       1.22
> _ISR_Handler                                               14       0.95
> 
> 
> 
> The CPU is loaded to the same degree in both cases (~ 60%),
> but the time spent in memcpy doubled when using the ARM patches.
> The reason is that doing ipalign in the driver
> aligns the packet on a word address, so memcpy can use
> 32-bit accesses to move the packets in the stack. Without
> ipalign, memcpy resorts to byte-wise copying (!). The
> memcpy implementation in newlib is thus not very efficient,
> at least not for the SPARC port. Implementing a
> modified memcpy that uses 16-bit transfers for
> 16-bit aligned data resulted in 30 Mbit/s performance again.
> 
> The question is what to do next. The ARM patches could be
> modified so that they can be enabled by all targets that
> need them, but that would require a better memcpy implementation
> in newlib. The alternative is to stick with ipalign in the driver
> and live with the CPU overhead. Comments, anyone?
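
A minimal sketch of the kind of copy routine described above, for
illustration only. It is not the newlib memcpy and not the modified
version that was measured; the name copy_aligned16 is hypothetical.
It assumes a target where 16-bit accesses need 2-byte alignment and
32-bit accesses need 4-byte alignment, and it picks the widest
transfer that both pointers allow.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (name invented for this sketch).
 * Copies n bytes using the widest access both pointers are aligned for:
 * 32-bit words, else 16-bit halfwords, else single bytes.
 * Note: like most hand-written memcpy variants, this reads and writes
 * the buffers through integer pointers, so it relies on the compiler
 * tolerating that (the usual aliasing caveat for such routines).
 */
void *copy_aligned16(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uintptr_t both = (uintptr_t)d | (uintptr_t)s;

    if ((both & 3u) == 0) {
        /* Both pointers are word aligned: move 4 bytes at a time. */
        uint32_t *dw = (uint32_t *)d;
        const uint32_t *sw = (const uint32_t *)s;
        for (; n >= 4; n -= 4)
            *dw++ = *sw++;
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    } else if ((both & 1u) == 0) {
        /* Only halfword aligned (e.g. data after a 14-byte Ethernet
         * header): move 2 bytes at a time instead of falling back
         * to byte copies. */
        uint16_t *dh = (uint16_t *)d;
        const uint16_t *sh = (const uint16_t *)s;
        for (; n >= 2; n -= 2)
            *dh++ = *sh++;
        d = (unsigned char *)dh;
        s = (const unsigned char *)sh;
    }

    /* Remaining tail bytes, or the whole buffer if nothing is aligned. */
    while (n-- > 0)
        *d++ = *s++;

    return dst;
}
```

A tuned version would probably also switch to word transfers once both
pointers reach a common 4-byte boundary; that is left out here to keep
the sketch short.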

For CPUs with RTEMS ports, newlib has optimized assembly language
implementations of memcpy for only the h8300, sh, and i386.  RTEMS
itself has an optimized version for the m68k which knows some
CPU-model-specific characteristics beyond the multilibs.

I can't find an appropriately licensed, SPARC-optimized memcpy
implementation to compare against.  NetBSD doesn't even have one, and
the ARM version they have has an unacceptable advertising clause.
We want the optimized code merged into newlib, so it has to have
the correct licensing terms.

Optimizing memcpy and the cksum code are the only CPU specific 
optimizations that can be made to the network stack.

> Jiri.
> 
> 
> 
> Jay Monkman wrote:
> 
>> Jiri Gaisler wrote:
>>
>>> This issue is also a problem for the SPARC port, which does
>>> not allow misaligned access. In my opinion, the network stack
>>> has a bug here, as it uses a pointer to a structure without checking
>>> the pointer alignment. The whole problem would be solved by
>>> splitting the access to the IP address in the IP header into
>>> two 16-bit reads, rather than a 32-bit read. It would require
>>> modifying the IP stack, but all these issues would be solved
>>> once and for all. Maybe the access could be done through a
>>> #define which would be empty on targets supporting unaligned
>>> access.
>>>
>>> Another solution (which is used in eCos) is to implement an
>>> unaligned access trap handler. The trap handler emulates the
>>> access using two 16-bit reads. The overhead should not be
>>> that large, as it is only the IP addresses in the IP header
>>> which are misaligned.
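
A rough sketch of the two-16-bit-read idea quoted above, for
illustration only. The names load32_halfwords, IP_LOAD32 and
HAVE_UNALIGNED_ACCESS are all hypothetical; none of them exist in
the RTEMS network stack. It assumes the field is at least 2-byte
aligned, as the IP addresses are when the IP header follows a
14-byte Ethernet header in a word-aligned buffer.

```c
#include <stdint.h>

/*
 * Hypothetical helper: read a 32-bit field that is only guaranteed to
 * be 16-bit aligned by doing two 16-bit loads. The halves are put back
 * together through a union, so the result has the same bit pattern a
 * native 32-bit load would have produced, on either endianness.
 */
static inline uint32_t load32_halfwords(const void *p)
{
    const uint16_t *h = (const uint16_t *)p;
    union {
        uint16_t half[2];
        uint32_t word;
    } u;

    u.half[0] = h[0];   /* lower-addressed halfword */
    u.half[1] = h[1];   /* higher-addressed halfword */
    return u.word;
}

/*
 * Hypothetical configuration switch: targets that tolerate misaligned
 * loads keep the plain 32-bit access, all others use the helper.
 */
#ifdef HAVE_UNALIGNED_ACCESS
#define IP_LOAD32(p) (*(const uint32_t *)(p))
#else
#define IP_LOAD32(p) load32_halfwords(p)
#endif
```

The stack code would then read an address with something like
IP_LOAD32(&ip->ip_src) instead of dereferencing the field directly.
On strict-alignment targets this costs one extra load per address.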
>>
>>
>>
>> This is also a problem on MIPS.
>>
>> The network stack is from an older version of FreeBSD, from a time when
>> it must have only supported x86 and m68k, which can deal with misaligned
>> data. The stack actually goes out of its way to make the problem worse.
>> In at least one place it uses two 16 bit fields to store a 32 bit value,
>> where the first 16 bit field is not 4-byte aligned. Something like this:
>>
>>     | Byte 0 | Byte 1 | Byte 2 | Byte 3 |
>>     -------------------------------------
>>     |              long 1               |
>>     -------------------------------------
>>     |    ?   |   ?    |     short 1     |
>>     -------------------------------------
>>     |     short 2     |    ?   |   ?    |
>>     -------------------------------------
>>
>> So, the 32 bit value is stored in short 1:short 2. So no matter what you
>> do, something is misaligned.
>>
>> If you search the networking code for __arm__, you'll find the places we
>> had to change to fix the misaligned accesses for ARM.
>>
>> I think the real solution would be to update the network stack to a recent
>> version of NetBSD. (FreeBSD would probably be fine, but I think NetBSD
>> would be safer since it runs on more platforms.) Unfortunately, it's hard
>> to find time to do it.
>>
>> .
>>

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
    Support Available             (256) 722-9985