Time spent in ticks...

Pavel Pisa ppisa4lists at pikron.com
Thu Oct 13 17:09:05 UTC 2016


Hello Jakob,

On Thursday 13 of October 2016 18:21:05 Jakob Viketoft wrote:
> I re-tested my case using an -O3 optimization (we have been using -O0
> during development for debugging purposes) and I got a good performance
> boost, but I'm still nowhere near your numbers. I can vouch for that the
> interrupt (exception really) isn't stuck, but that the code unfortunately
> takes a long time to compute. I have a subsecond counter (1/16 of a second)
> which I'm sampling at various places in the code, storing its numbers to a
> buffer in memory so as to interfere with the program as little as possible.
>
> With -O3, a tick handling still takes ~320 us to perform, but the weight
> has now shifted. tc_windup takes ~214 us and the rest is obviously
> _Watchdog_Tick(). When fragmenting the tc_windup function to find the worst
> speed bumps the biggest contribution (~122 us) seem to be coming from scale
> factor recalculation. Since it's 64 bits, it's turned into a software
> function which can be quite time-consuming apparently.
>
> Even though _Watchdog_Tick() "only" takes ~100 us now, it still sound much
> higher than your total tick with a slower system (we're running at 50 MHz).
>
> Is there anything we can do to improve these numbers? Is Clock_isr intended
> to be run uninterrupted as it is now? Can't see that much of the BSP patch
> code has anything to do with the speed of what I'm looking at right now...

the time measurement and the timer queues use 64-bit types for time
representation. When a higher time resolution than the tick is
requested, that is a reasonable (even optimal) choice, but it can be
a problem for 16-bit CPUs and for some 32-bit ones as well.
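
To illustrate why the 64-bit representation hurts on such CPUs: a
single 64x64 multiply emitted by the compiler expands into several
32-bit partial products, and on a core without a hardware multiplier
each of those is itself a library call or a shift-and-add loop. A
rough sketch of the decomposition (not RTEMS code, just to show the
work hidden in one C multiply):

  #include <stdint.h>

  /* Schoolbook 64x64 -> 64-bit multiply built from 32-bit halves,
   * roughly what libgcc's __muldi3 has to do when the CPU has no
   * wide hardware multiplier. */
  uint64_t mul64_from_32(uint64_t a, uint64_t b)
  {
      uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
      uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

      /* Three 32x32 multiplies remain after dropping the partial
       * product which overflows the 64-bit result; a full 128-bit
       * product would need four.  On a soft-multiply core each of
       * these is again a helper call or a loop. */
      uint64_t lo   = (uint64_t)a_lo * b_lo;
      uint64_t mid1 = (uint64_t)a_lo * b_hi;
      uint64_t mid2 = (uint64_t)a_hi * b_lo;

      return lo + ((mid1 + mid2) << 32);
  }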

How have you configured the or1k CPU? Do you have a hardware
multiplier and barrel shifter available, or only shift-by-one and a
multiplier implemented in SW? Do the CFLAGS match the instructions
that are actually available?

I am not sure whether there is a 64-bit division in the time
computation as well. That would be a killer for your CPU. High
resolution time sources and even tickless timer support can be
implemented with full scaling and adjustment using only shifts,
additions and multiplications in the hot paths.
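
For illustration only (this is a sketch of the technique, not the
current RTEMS code): the conversion from hardware counter ticks to
nanoseconds can be kept to one multiply and one shift in the hot
path, with the single division needed to prepare the scale factor
done only when the frequency or adjustment changes:

  #include <stdint.h>

  /* Hypothetical division-free time conversion.  The scale factor
   * is a 32.32 fixed-point count of nanoseconds per hardware tick. */
  static uint64_t ns_per_tick_scale;   /* 32.32 fixed point */

  void set_counter_frequency(uint32_t counter_hz)
  {
      /* one 64-bit division, outside the hot path */
      ns_per_tick_scale = ((uint64_t)1000000000u << 32) / counter_hz;
  }

  /* hot path: called from the tick ISR or clock read-out */
  uint64_t ticks_to_ns(uint32_t tick_delta)
  {
      /* one 64x32 multiply and one shift, no division; valid while
       * tick_delta spans less than about 4 seconds so the product
       * still fits in 64 bits */
      return (ns_per_tick_scale * tick_delta) >> 32;
  }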

I tried to understand the actual RTEMS time-keeping code some time
ago, when nanosleep was introduced, and I tried to analyze it,
propose some changes and compare it to Linux. See the thread
following these messages:

  https://lists.rtems.org/pipermail/devel/2016-August/015720.html

  https://lists.rtems.org/pipermail/devel/2016-August/015721.html

Some of the discussed changes to nanosleep have already been
implemented.

Generally, try to measure how many times multiplication and division
are called in the ISR. I think I am capable of designing an
implementation restricted to mul, add and shr which minimizes the
number of transformations, but if it is found that the RTEMS
implementation needs to be optimized/changed, then it can be a task
counted in man-months.
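
One way to get those numbers, assuming the 64-bit operations end up
in the usual libgcc helpers (__muldi3 for 64-bit multiply, __udivdi3
for 64-bit unsigned division -- worth checking in the map file for
your or1k configuration), is to wrap the helpers with the GNU linker
and count the calls:

  /* Link with: -Wl,--wrap=__muldi3 -Wl,--wrap=__udivdi3
   * Every 64-bit multiply/divide then passes through these wrappers,
   * so the counters can be sampled before and after the tick ISR. */
  volatile unsigned long muldi3_calls;
  volatile unsigned long udivdi3_calls;

  long long __real___muldi3(long long a, long long b);
  unsigned long long __real___udivdi3(unsigned long long n,
                                      unsigned long long d);

  long long __wrap___muldi3(long long a, long long b)
  {
      muldi3_calls++;
      return __real___muldi3(a, b);
  }

  unsigned long long __wrap___udivdi3(unsigned long long n,
                                      unsigned long long d)
  {
      udivdi3_calls++;
      return __real___udivdi3(n, d);
  }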

Generally, if the tick interrupt lasts more than 10 (maybe 20) usec
then there is a problem. One source of that can be an ineffective SW
implementation; another is that the selected OS, and possibly the
features required by the application, are above the capabilities of
the selected CPU.
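
To put your numbers in perspective (assuming a common 100 Hz / 10 ms
tick -- adjust for your actual configuration): 320 usec per tick
means tick handling alone consumes about 3.2 % of the 50 MHz CPU,
roughly 16000 cycles per tick, whereas the 10 to 20 usec figure above
corresponds to about 0.1 to 0.2 %.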

Best wishes,


            Pavel

