Time spent in ticks...
Joel Sherrill
joel at rtems.org
Thu Oct 13 22:56:21 UTC 2016
On Thu, Oct 13, 2016 at 1:37 PM, Joel Sherrill <joel at rtems.org> wrote:
>
>
> On Thu, Oct 13, 2016 at 11:21 AM, Jakob Viketoft <
> jakob.viketoft at aacmicrotec.com> wrote:
>
>>
>> *From:* Joel Sherrill [joel at rtems.org]
>> *Sent:* Thursday, October 13, 2016 17:38
>> *To:* Jakob Viketoft
>> *Cc:* devel at rtems.org
>> *Subject:* Re: Time spent in ticks...
>>
>> >I don't have an or1k handy so ran on a sparc/erc32 simulator/
>> >It is is a SPARC v7 at 15 Mhz.
>>
>> >These times are in microseconds and based on the tmtests.
>> >Specifically tm08and tm27.
>>
>> >(1) rtems_clock_tick: only case - 52
>> >(2) rtems interrupt: entry overhead returns to interrupted task - 12
>> >(3) rtems interrupt: exit overhead returns to interrupted task - 4
>> >(4) rtems interrupt: entry overhead returns to nested interrupt - 11
>> >(5) rtems interrupt: exit overhead returns to nested interrupt - 3
>>
>>
> The above was from the master with SMP enabled. I repeated it with
> SMP disabled and it had no impact.
>
> Since the timing change is post 4.11, I decided to try 4.11 with SMP
> disabled:
>
> rtems_clock_tick: only case - 42
> rtems interrupt: entry overhead returns to interrupted task - 11
> rtems interrupt: exit overhead returns to interrupted task - 4
> rtems interrupt: entry overhead returns to nested interrupt - 11
> rtems interrupt: exit overhead returns to nested interrupt - 3
>
> So 42 + 12 + 4 = 58 microseconds, 58 * 15 = 870 cycles
>
> So the overhead has gone up some but as Pavel says it is quite likely
> some mathematical operation on 64 bit types is slow on your CPU.
>
> HINT: If you can write a benchmark for 64-bit operations,
> it would be a good comparison between CPUs and might
> highlight where the software implementation needs improvement.
>
I decided that another good point of reference was the powerpc/psim BSP. It
reports the benchmarks in instructions:
(1) rtems_clock_tick: only case - 229
(2) rtems interrupt: entry overhead returns to interrupted task - 102
(3) rtems interrupt: exit overhead returns to interrupted task - 95
(4) rtems interrupt: entry overhead returns to nested interrupt - 105
(5) rtems interrupt: exit overhead returns to nested interrupt - 85
229 + 102 + 96 = 427 instructions.
That seems roughly inline with the erc32 which is 1 cycle for all
instructions
except loads which are 3 and stores which are 2. And the sparc has
register windows so entering and exiting an ISR can potentially save
and restore a lot of registers.
So I am still leaning to Pavel's explanation that some primitive operation
is really inefficient.
>
>
>> >The clock tick test has 100 tasks but it looks like they are blocked on
>> a semaphore
>> >without timeout.
>>
>> >Your times look WAY too high. Maybe the interrupt is stuck on and
>> >not being cleared.
>>
>> >On the erc32, a nominal "nothing to do clock tick" would be 1+2+3 from
>> >above or 52+12+4 = 68 microseconds. 68 * 15 = 1020 machine cycles.
>> >So at a higher clock rate, it should be even less time.
>>
>> >My gut feeling is that I think something is wrong with the ISR handler
>> >and it is stuck. But the performance is definitely way too high.
>>
>> >--joel
>>
>> (Sorry if the format got somewhat I garbled, anything but top-posting
>> have to be done manually...)
>>
>> I re-tested my case using an -O3 optimization (we have been using -O0
>> during development for debugging purposes) and I got a good performance
>> boost, but I'm still nowhere near your numbers. I can vouch for that the
>> interrupt (exception really) isn't stuck, but that the code unfortunately
>> takes a long time to compute. I have a subsecond counter (1/16 of a second)
>> which I'm sampling at various places in the code, storing its numbers to a
>> buffer in memory so as to interfere with the program as little as possible.
>>
>> With -O3, a tick handling still takes ~320 us to perform, but the weight
>> has now shifted. tc_windup takes ~214 us and the rest is obviously
>> _Watchdog_Tick(). When fragmenting the tc_windup function to find the worst
>> speed bumps the biggest contribution (~122 us) seem to be coming from scale
>> factor recalculation. Since it's 64 bits, it's turned into a software
>> function which can be quite time-consuming apparently.
>>
>> Even though _Watchdog_Tick() "only" takes ~100 us now, it still sound
>> much higher than your total tick with a slower system (we're running at 50
>> MHz).
>>
>> Is there anything we can do to improve these numbers? Is Clock_isr
>> intended to be run uninterrupted as it is now? Can't see that much of the
>> BSP patch code has anything to do with the speed of what I'm looking at
>> right now...
>>
>> /Jakob
>>
>>
>>
>> *Jakob Viketoft *Senior Engineer in RTL and embedded software
>>
>> ÅAC Microtec AB
>> Dag Hammarskjölds väg 48
>> SE-751 83 Uppsala, Sweden
>>
>> T: +46 702 80 95 97
>> http://www.aacmicrotec.com
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.rtems.org/pipermail/devel/attachments/20161013/e75514b3/attachment-0002.html>
More information about the devel
mailing list