Thank you for your suggestions.<br><br><div><span class="gmail_quote">On 9/14/07, <b class="gmail_sendername">Joel Sherrill</b> <<a href="mailto:joel.sherrill@oarcorp.com">joel.sherrill@oarcorp.com</a>> wrote:</span>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>I don't have an SH to try anything on so am only going to offer some<br>general ideas:

<br><br>+ Is 4.x using double precision and 3.4.6 using single precision?</blockquote><div><br>No.  AFAIK, single precision must be explicitly toggled in both compilers: -m4-single.<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

+ Cache settings change somehow? Maybe gcc 4.x is optimizing<br>   some critical setting out of the BSP initialization. </blockquote><div><br>I'm currently investigating this, but it is probably not the cause, because most of the cache initialization is inside "asm volatile" statements.

<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">+ Is there a change in the array indexing code?  There are options<br>    to control multiply and division for the SH so I am curious.

</blockquote><div><br>I may be wrong, but those changes mostly apply to FPU-less SH4 models.<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

+ Does it get better or worse when -Os is used?  Or -O2 with no<br>   particular options?</blockquote><div><br>Worse in both cases. I can post numbers if you are interested.  <br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

+ Is the BSP compiled with the old compiler or new?  I am curious<br>    if it is possible to compile the benchmarks with the new compiler<br>    and leave the rest of the system alone.  This would eliminate<br>    something weird happening to the RTEMS code in the new compiler.

</blockquote><div><br>I tried different variants: RTEMS 4.6 compiled with 3.4.6 and the application compiled and linked with 4.3.0; RTEMS 4.7 compiled with 4.3.0 and the application compiled and linked with 3.4.6. Results vary, but the application compiled with 3.4.6 always shows better performance. Currently I can't explain why an application compiled with 3.4.6 runs slowly under RTEMS 4.7 (we really need some profiling utilities for RTEMS).<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Nickolay Kolchin wrote:<br>> Hi,<br>><br>> We have a performance problem on SH4 with gcc4.x.<br>><br>> SciMark2 Numeric Benchmark, see <a href="http://math.nist.gov/scimark">http://math.nist.gov/scimark</a>

<br>> ================================================================<br>>            GCC: 3.4.6   4.2.1   4.3.0 (20070907)<br>>      Composite:  6.05    5.01    4.82<br>>            FFT:  4.90    4.15    4.21

<br>>            SOR: 10.10    8.36     7.64<br>>     MonteCarlo:  3.68    3.06    3.04<br>> Sparse matmult:  5.45    4.45    4.03<br>>             LU:  6.10    5.03    5.18<br>> ================================================================

<br>><br>> BYTEmark* Native Mode Benchmark ver. 2 (10/95)<br>> ================================================================<br>>              GCC:      3.4.6      4.2.1  4.3.0 (20070907)<br>>     NUMERIC SORT:     

35.459       32.2      29.327<br>>      STRING SORT:     0.5943    0.57604      0.8603<br>>         BITFIELD: 1.0585e+07  9.269e+06  9.4138e+06<br>>     FP EMULATION:     4.4944     4.6012       5.364<br>>          FOURIER:     

272.28     241.34      259.12<br>>       ASSIGNMENT:    0.35997    0.38373     0.39683<br>>             IDEA:     124.11     95.057      100.07<br>>          HUFFMAN:     45.593     52.083      56.391<br>>       NEURAL NET:    

0.36153    0.30922     0.31348<br>> LU DECOMPOSITION:     11.331     9.4938       8.255<br>> ================================================================<br>><br>> The "real world application" has a 20%-200% performance regression with

<br>> GCC 4.x.<br>><br>> This effectively prevents us from moving to RTEMS 4.7 from 4.6.<br>><br>> I've reported this issue to gcc bugzilla:<br>> <a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33431">http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33431</a><br>><br>> But SH4 backend maintainer Kazumoto Kojima was unable to reproduce it

<br>> under linux-sh:<br>> ================================================================<br>>                         gcc-3.4.6    gcc-4.2.1    gcc-4.3.0(20070910)<br>> Composite Score:            16.76        

16.86        16.99<br>> FFT              Mflops:    12.92        13.36        13.36<br>> SOR              Mflops:     27.88        26.76        28.01<br>> MonteCarlo:      Mflops:     9.96         9.73         9.67

<br>> Sparse matmult   Mflops:    14.95        16.06        14.84<br>> LU               Mflops:     18.08        18.39        19.05<br>> ================================================================<br>><br>

> Maybe somebody is also using RTEMS on SH4 and can confirm my or<br>> Kojima's results?<br>><br></blockquote></div><br>---<br>Nickolay<br>