Help on how to configure for user-defined memory protection support (GSoC 2020)

Hesham Almatary heshamelmatary at gmail.com
Thu May 21 00:13:05 UTC 2020


Yes, I completely agree with Gedare, and my reply doesn't imply
otherwise. Gedare stated a few requirements:

"2. The basic protection isolates the text, rodata, and rwdata from
each other. There is no notion of task-specific protection domains,
and tasks should not incur any additional overhead due to this
protection."

Those areas are the ones I meant by "Global." The design and
implementation should aim to keep them resident in the TLB so they
don't get kicked out. They aren't assigned an ASID because they are
global: they never need to be flushed, and their mappings/attributes
won't change.
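
To make this concrete, here is a minimal sketch (assuming the ARMv7-A
short-descriptor format with 1 MiB section entries; the helper and
constants below are illustrative only, not existing RTEMS code) of how
the nG bit distinguishes such "Global" entries from task-local ones:

#include <stdint.h>

#define SECTION_DESC   (0x2u)        /* bits[1:0] = 0b10: section entry   */
#define SECTION_AP_RW  (0x3u << 10)  /* AP[1:0] = 0b11: read/write access */
#define SECTION_NG     (0x1u << 17)  /* nG = 1: entry is tagged with ASID */

/* Hypothetical helper: build a 1:1 section descriptor for 'paddr'. */
static inline uint32_t section_entry(uint32_t paddr, int task_local)
{
  uint32_t desc = (paddr & 0xFFF00000u) | SECTION_DESC | SECTION_AP_RW;

  if (task_local) {
    desc |= SECTION_NG;  /* non-global: matched against the current ASID */
  }
  /* nG left clear: "Global" entry, survives ASID changes */
  return desc;
}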

"3. The advanced protection strongly isolates all tasks' stacks.
Sharing is done explicitly via POSIX/RTEMS APIs, and the heap and
executive (kernel/RTEMS) memory are globally shared. A task shall only
incur additional overhead in context switches and the first access to
a protected region (other task's stack it shares) after a context
switch."

The additional overhead here is flushing the protected region (which
might be a shared protected stack, for example). Only that region's
TLB entry will differ between tasks on context switches, and if ASIDs
are used, the hardware will make sure it gets the correct entry (by
doing a HW page-table walk).
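
As a rough illustration of "only the ASID changes" on a context
switch, here is a sketch for ARMv7-A (assuming the task's ASID is kept
in CONTEXTIDR[7:0]; the function name and usage are illustrative, not
an existing RTEMS interface):

#include <stdint.h>

/* Switch the current address-space ID; global (nG = 0) TLB entries are
 * unaffected, while non-global entries are matched against the new ASID. */
static inline void set_current_asid(uint32_t asid)
{
  uint32_t contextidr = asid & 0xFFu;   /* ASID lives in CONTEXTIDR[7:0] */

  __asm__ volatile (
    "mcr p15, 0, %0, c13, c0, 1\n"      /* write CONTEXTIDR               */
    "isb\n"                             /* synchronize before new accesses */
    :
    : "r" (contextidr)
    : "memory"
  );
}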

On Wed, 20 May 2020 at 11:05, Utkarsh Rai <utkarsh.rai60 at gmail.com> wrote:
>
>
>
>
> On Wed, May 20, 2020 at 7:40 AM Hesham Almatary <heshamelmatary at gmail.com> wrote:
>>
>> On Tue, 19 May 2020 at 14:00, Utkarsh Rai <utkarsh.rai60 at gmail.com> wrote:
>> >
>> >
>> >
>> > On Mon, May 18, 2020 at 8:38 PM Gedare Bloom <gedare at rtems.org> wrote:
>> >>
>> >> On Mon, May 18, 2020 at 4:31 AM Utkarsh Rai <utkarsh.rai60 at gmail.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Sat, May 16, 2020 at 9:16 PM Joel Sherrill <joel at rtems.org> wrote:
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Sat, May 16, 2020 at 10:14 AM Gedare Bloom <gedare at rtems.org> wrote:
>> >> >>>
>> >> >>> Utkarsh,
>> >> >>>
>> >> >>> What do you mean by "This would although mean that we would have page tables of  1MB."
>> >> >>>
>> >> >>> Check that you use plain text when inlining a reply; at the least, it looks like you broke the reply format.
>> >> >>>
>> >> >>> Gedare
>> >> >>>
>> >> >>> On Fri, May 15, 2020, 6:04 PM Utkarsh Rai <utkarsh.rai60 at gmail.com> wrote:
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>> On Thu, May 14, 2020 at 10:23 AM Sebastian Huber <sebastian.huber at embedded-brains.de> wrote:
>> >> >>>>>
>> >> >>>>> Hello Utkarsh Rai,
>> >> >>>>>
>> >> >>>>> On 13/05/2020 14:30, Utkarsh Rai wrote:
>> >> >>>>> > Hello,
>> >> >>>>> > My GSoC project, providing thread stack protection support, has to be
>> >> >>>>> > a user-configurable feature.
>> >> >>>>> > My question is: what would be the best way to implement this? My idea
>> >> >>>>> > was to model it on the existing system configuration
>> >> >>>>> > <https://docs.rtems.org/branches/master/c-user/config/intro.html>, but
>> >> >>>>> > Dr. Gedare pointed out that configuration is undergoing heavy changes
>> >> >>>>> > and may look completely different in future releases. Kindly advise me
>> >> >>>>> > as to what would be the best way to proceed.
>> >> >>>>> Before we start with an implementation, it would be good to define what
>> >> >>>>> thread stack protection support is supposed to do.
>> >> >>>>
>> >> >>>>
>> >> >>>> The thread stack protection mechanism will protect against stack overflow errors and will completely isolate thread stacks from each other. Sharing of thread stacks will be possible only when the user makes explicit calls to do so. More details about this can be found in this thread.
>> >> >>>>>
>> >> >>>>> Then there should
>> >> >>>>> be a concept for systems with a Memory Protection Unit (MPU) and a
>> >> >>>>> concept for systems with a Memory Management Unit (MMU). MMUs may
>> >> >>>>> provide normal 4KiB pages, large pages (for example 1MiB), or something
>> >> >>>>> more flexible. We should identify BSPs which should have support for
>> >> >>>>> this. For each BSP there should be a concept. Then we should think about how a
>> >> >>>>> user can configure this feature.
>> >> >>>>>
>> >> >>>>> For memory protection we will have a 1:1 VA-PA address translation, which means a 4KiB page size will be set for both the MPU and MMU; a 1:1 mapping will ensure we have to do fewer page table walks. This would although mean that we would have page tables of  1MB. I will first provide support for ARMv7-based BSPs (RPi, BBB, etc. have MMU support), then once I have a working example I will move on to providing support for RISC-V, which has MPU support.
>> >> >>
>> >> >>
>> >> >> I think Sebastian is asking exactly what I did. What are the processor (specific CPU) requirements to support thread stack protection?
>> >> >
>> >> >
>> >> > For thread stack protection the processor should have the option of paging along with appropriate 'access bits' settings. Both RISC-V and ARMv7-A (the ones I will be focusing on in my project) have the option of defining 4KiB pages with appropriate access bits.
>> >> >
>> >> >>
>> >> >>
>> >> >> For example, to be effective, I imagine a 1MB granularity might be sufficient to protect code versus data/bss. But it is likely insufficient to protect thread stacks.
>> >> >>
>> >> >> Similarly, a processor with a limited number of "protection areas" would be unsuitable as a basis for implementing thread stack protection. Here I am thinking of the PowerPC with a handful of TLB registers. You would have to turn on paging.
>> >> >
>> >> >
>> >> > I agree; most processors have between 8 and 16 protection regions, and in some cases as few as 4. For stack protection, paging with 4KiB pages is the best option: it is applicable to processors with an MPU or MMU and is optimal in the sense that we would have an appropriate number and size of pages for thread stacks.
>> >> >
>> >>
>> >> We should have a clear understanding of the design requirements
>> >> before we can make such a statement about "optimal" and "best".
>> >>
>> >> The proposal has some good ideas in it, but I think the project has
>> >> some implied expectations or assumptions, on both your side and from
>> >> mentors/stakeholders. Here are some ideas that should start to hint at
>> >> requirements. Maybe you can propose some design requirements. I'm not
>> >> too good at writing requirements myself, but here goes:
>> >> 1. Memory protection is optional. The default is no memory protection.
>> >> 2. The basic protection isolates the text, rodata, and rwdata from
>> >> each other. There is no notion of task-specific protection domains,
>> >> and tasks should not incur any additional overhead due to this
>> >> protection.
>> >> 3. The advanced protection strongly isolates all tasks' stacks.
>> >> Sharing is done explicitly via POSIX/RTEMS APIs, and the heap and
>> >> executive (kernel/RTEMS) memory are globally shared. A task shall only
>> >> incur additional overhead in context switches and the first access to
>> >> a protected region (other task's stack it shares) after a context
>> >> switch.
>> >>
>> >> I'm sure there are more you can draw out from your proposal and we can
>> >> discuss. #2 provides a useful option for systems with MPU or similar
>> >> hardware that is insufficient to support #3.
>> >>
>> >> Mainly I wanted to get to driving at #3. One implication of it is that
>> >> for a task that doesn't access any other task's stack there should
>> >> only be TLB faults when there is a context switch. Another implication
>> >> is that all entries for a task's protection domain should fit in the
>> >> TLB. If you use 4 KiB pages with an 8-entry TLB, you can only have 32 KiB
>> >> active in the TLB at a time. This may exceed the size of some
>> >> protection domains (e.g., very large code bases might have more than
>> >> 32K in the text segment) and so you could not guarantee #3. This is
>> >> the kind of analysis you need to think about before you can make
>> >> design decisions. Large pages lead to internal fragmentation, which
>> >> would cause its own problems. If you can mix and match sizes, it may be
>> >> sensible to use large pages for statically shared regions (.text,
>> >> .data, .bss) and only the smaller pages for stacks.
>> >
>> >
>> > This seems to be the best way to proceed: we can have sections or super-sections that contain the statically shared regions, and then have
>> > pages for stacks.
>> >
>> >>
>> >> Some of the complexity may be punted to the user configuration also.
>> >> Possibly, the upcoming rewrite of configuration may make this even
>> >> more flexible, but I haven't looked deeply enough at the details yet.
>> >> I'm thinking of something like a configuration option (macro,
>> >> specification) to set the number of shared task stack protection
>> >> domains, or some API methods to custom-tailor the stack allocations
>> >> for sharing tasks.
>> >>
>> >> Consider, for example, that task stacks could be
>> >> just 1 KiB; then you could pack 4 in the same 4KiB page in a shared
>> >> protection domain.
>> >
>> >
>> > Determining the page and task stack size for each processor has to be a function of the number of TLB entries. Instead of leaving it to the user to determine the size, I think this should already be defined for each CPU, and the user should be made to understand the limitations of using stack protection. Suppose we have 8 TLB entries and 4KiB pages for thread stacks, with 1 entry for the statically shared region; then we have 28KiB of address space in the TLB for stacks (some of the entries may be for shared memory). But if we take the page size to be 1KiB, we have only 7KiB of address space, which may not be enough when the stack shares memory, and we would have to perform page table walks. A user may or may not take this decision, and it may be better to have the user understand a limitation rather than making them do the calculation and then understand the limitation.
>> >
>> >>
>> >>
>> >> I'm not convinced that you should be thinking about the implementation
>> >> as providing a "page table" mechanism, because most of the regions
>> >> (text, rodata, .data, .bss) are globally shared. Yes, it can be
>> >> implemented by a traditional page table mechanism, but I think that is
>> >> overkill. A simple, stupid implementation could just put all the
>> >> task-specific protected regions in a linked list and walk it,
>> >> installing them in the TLB, during the context_restore. Or maybe
>> >> pushing to a list/stack on context_save, and popping on restore. The
>> >> only real complication is TLB shootdown of the entries that change
>> >> between contexts, and whether you have to install again the globally
>> >> shared ones. In a real-time system, it is far better to pay a
>> >> fixed, known cost at a context switch than to take costs (even
>> >> if they are smaller!) at a random point in task execution.
>> >
>> >
>> > Another method that can prevent us from flushing the TLB on each context switch is assigning each entry in the TLB an ASID. The price for this would be fewer entries for each task, so we would have more TLB misses and would have to do more page table walks (this reduces determinism and may not be the best option for a real-time system). If we proceed by simply flushing and re-populating the TLB on each context switch, then we have to consider the case where the address space of a task fills all of the TLB entries and it then has to access the stack of another task. One way to resolve this would be to assume that if stack sharing has been requested by the user, chances are that access to the other task's address space will be non-trivial, so there may be some merit in populating the TLB with the address space of the shared stack. This would mean that every time stack sharing is used, a TLB flush would have to be performed. Another way would be to set the task and page sizes such that a few of the TLB entries remain free for stack sharing. (This idea overlaps with what I described earlier in the thread.)
>> >
>> On recent MMU-based processors, we don't have to flush the whole TLB.
>> We will statically map .data, .text, .rwdata, and all global RTEMS
>> sections and mark them as "Global" in the PTE. This tells the
>> processor that such PTEs don't need to be flushed and will stay
>> unmodified regardless of the ASID. The ASID is useful for shared, yet
>> local, pages and those are also marked as "Non Global" and "Shared,"
>> if required. On each context switch, you only need to change the ASID
>> and the HW will take care of the flushing if needed. You will only
>> have to flush during mmap, and do TLB shootdowns if on multicore.
>>
>> I'd also like to change our perspective from SW TLBs to HW PTW (Page
>> Table Walker). ARM and RISC-V have HW PTW. Hence, there's no such
>> thing as re-populating TLBs in SW; that's done by the HW on TLB
>> misses.
>>
>
> My idea to make physical address access fast was to have most of the entries already present in the TLB at the context switch by simply walking through the protected regions. Granted, this makes the context switch slower, but as Dr. Gedare points out earlier in the thread, in a real-time system it is always better to pay a fixed additional cost at a known stage rather than make the system non-deterministic. Assigning ASIDs to the regions is a very efficient way to avoid overhead during the context switch, but does that not mean (from my understanding) that we would have fewer entries for a given thread in the TLB, since 'residual' translations from the previous thread would already be present?
>
The previous thread will have the same mappings/attributes for all TLB
entries corresponding to the "Global" RTEMS sections such as .text,
.data, .rwdata, etc. Those will get reused as is. It is only the
protected local memory region (such as stack) that may need to be
kicked out.
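
For illustration, here is a minimal sketch of kicking out only such a
task-local stack mapping on ARMv7-A (assuming the unified-TLB TLBIMVA
operation, which takes VA[31:12] plus the ASID in bits [7:0]; the
names 'stack_base' and 'stack_asid' are placeholders):

#include <stdint.h>

static inline void invalidate_stack_mapping(uint32_t stack_base,
                                            uint32_t stack_asid)
{
  uint32_t mva = (stack_base & 0xFFFFF000u) | (stack_asid & 0xFFu);

  __asm__ volatile (
    "mcr p15, 0, %0, c8, c7, 1\n"  /* TLBIMVA: invalidate by MVA + ASID */
    "dsb\n"
    "isb\n"
    :
    : "r" (mva)
    : "memory"
  );
}

Because the MVA passed here is a stack address, the global entries for
.text, .data, etc. are untouched, which is exactly why they can stay
resident across context switches.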

>>
>>
>> >>
>> >> There is a lot yet for you to unravel in this topic. Don't be afraid
>> >> to keep asking questions and digging. We should try to lay out the
>> >> design and aim to be future-proof, and establish some simple steps
>> >> toward implementation that allow you to make incremental progress over
>> >> the summer. I think the past efforts at this project were not merged
>> >> completely because they did not provide sufficiently good APIs and
>> >> default use cases--instead they tried to fashion a completely custom
>> >> solution for managing memory protection regions, although they did
>> >> provide useful advancements to BSP support for memory protection. My
>> >> hope is that by focusing on a single type of region--task stacks--your
>> >> project can achieve a more successful integration in the upper layers
>> >> of RTEMS and not just advance our lower-level BSP support.
>> >>
>> >> Gedare
>> >>
>> >> >>
>> >> >> This is the general guidance that needs to be provided so anyone can evaluate how much protection they really can have on their target.
>> >> >>
>> >> >> --joel
>> >> >>>>
>>
>> --
>> Hesham



-- 
Hesham

