    x86: avoid per-cpu system call trampoline · fd468043
    Linus Torvalds authored
    
    The per-cpu system call trampoline was a clever trick, and allows us to
    have percpu data even before swapgs is done by just doing %rip-relative
    addressing.  And that was important, because syscall doesn't have a
    kernel stack, so we needed that percpu data very very early, just to get
    a temporary register to switch the page tables around.
    
    However, it turns out to be unnecessary.  Because we actually have a
    temporary register that we can use: %r11 is destroyed by the 'syscall'
    instruction anyway.
    
    Ok, technically it contains the user mode flags register, but we *have*
    that information anyway: it's still in %rflags, we've just masked off a
    few unimportant bits.  We'll destroy the rest too when we do the "and"
    of the CR3 value, but who cares? It's a system call.
    
    Btw, there are a few bits in eflags that might matter to user space: DF
    and AC.  Right now this clears them, but that is fixable by just
    changing the MSR_SYSCALL_MASK value to not include them, and clearing
    them by hand the way we do for all other kernel entry points anyway.
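
To make the MSR_SYSCALL_MASK point concrete, here is a small user-space model in C of what 'syscall' does to the flags: %r11 gets the user flags, and %rflags loses every bit that is set in the mask. The flag bit positions are architectural; the two mask values passed in below are purely illustrative and are not claimed to be what the kernel programs, and model_syscall() and the struct are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural RFLAGS bit values. */
#define X86_EFLAGS_TF 0x00000100ULL /* trap flag */
#define X86_EFLAGS_IF 0x00000200ULL /* interrupt enable */
#define X86_EFLAGS_DF 0x00000400ULL /* direction flag */
#define X86_EFLAGS_AC 0x00040000ULL /* alignment check */

/* Hypothetical container for the two registers we care about on entry. */
struct syscall_entry_state {
    uint64_t r11;    /* user flags as saved by 'syscall' */
    uint64_t rflags; /* flags the kernel entry code actually runs with */
};

/* Model of the flag handling of 'syscall': save to r11, then mask. */
static struct syscall_entry_state model_syscall(uint64_t user_rflags,
                                                uint64_t fmask)
{
    struct syscall_entry_state s;

    assert((fmask >> 32) == 0);      /* only the low 32 mask bits are defined */
    s.r11 = user_rflags;             /* r11 <- rflags */
    s.rflags = user_rflags & ~fmask; /* rflags <- rflags & ~mask */
    return s;
}
```

With DF and AC in the mask they are gone from %rflags by the time the entry code runs and only survive in %r11. With them left out of the mask they stay in %rflags, so %r11 can be clobbered without losing them, and the entry code clears them by hand.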
    
    So the only _real_ flags we'd destroy are IF and the arithmetic flags
    that get trampled on by the arithmetic instructions that are part of the
    %cr3 reload logic.
    
    However, if we really end up caring, we can save off even those: we'd
    take advantage of the fact that %rcx - which contains the returning IP
    of the system call - also has 8 bits free.
    
    Why 8? Even with 5-level paging, we only have 57 bits of virtual address
    space, and the high address space is for the kernel (and vsyscall, but
    we'd just disable native vsyscall).  So the %rip value saved in %rcx can
    have only 56 valid bits, which means that we have 8 bits free.
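
The bit counting above can be written down as a tiny C helper. This is just a sketch of the arithmetic, assuming user space is the lower half of the virtual address space; free_high_bits() is a made-up name for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of high bits that are always zero in a user-space address,
 * given the virtual address width. The top virtual address bit selects
 * the kernel half, so user addresses need one bit less than va_bits.
 */
static unsigned free_high_bits(unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    unsigned user_bits = va_bits - 1;
    return 64 - user_bits;
}

/* Shifting a user address left by that many bits and back loses nothing. */
static uint64_t shift_roundtrip(uint64_t user_addr, unsigned va_bits)
{
    unsigned n = free_high_bits(va_bits);
    return (user_addr << n) >> n;
}
```

For 5-level paging (57 bits) this gives 8 free bits, matching the text; with 4-level paging (48 bits) there would be even more room.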
    
    So *if* we care about IF and the arithmetic flags being saved over a
    system call, we'd do:
    
            shlq $8,%rcx
            movb %r11b,%cl
            shrl $8,%r11d
            andl $8,%r11d
            orb %r11b,%cl
    
    to save those bits off before we then use %r11 as a temporary register
    (we'd obviously need to then undo that as we save the user space state
    on the stack).
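
For anybody who wants to play with the packing, here is a C model of the five instructions above, plus one possible way of undoing it. The unpack side is not spelled out in this message, so unpack_flags_from_rcx() is an assumption, as are both function names. Note that, read literally, the 'andl $8' keeps only what was flags bit 11 (OF) after the shift, so in this model IF (bit 9) is not carried along.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the five instructions; returns the new %rcx value. */
static uint64_t pack_flags_into_rcx(uint64_t rcx, uint64_t r11)
{
    assert((rcx >> 56) == 0);              /* user %rip must fit in 56 bits */

    rcx <<= 8;                             /* shlq $8,%rcx   */
    rcx = (rcx & ~0xffULL) | (r11 & 0xff); /* movb %r11b,%cl */
    uint32_t t = (uint32_t)r11 >> 8;       /* shrl $8,%r11d  */
    t &= 8;                                /* andl $8,%r11d  */
    rcx |= t;                              /* orb %r11b,%cl  */
    return rcx;
}

/*
 * Assumed inverse: recover the return address and the saved flag bits.
 * Bit 3 of the low byte carried OF (flags bit 11); bit 3 of the flags
 * register itself is reserved and reads as zero, so it is cleared again.
 */
static void unpack_flags_from_rcx(uint64_t packed, uint64_t *rip,
                                  uint64_t *flags)
{
    uint64_t low = packed & 0xff;

    *rip = packed >> 8;
    *flags = (low & ~8ULL) | ((low & 8) << 8);
}
```

Packing a return address of 0x00007f1234567890 with user flags 0xac7 and unpacking again gives back the same address and flags 0x8c7: the low flag byte and OF survive, IF does not.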
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>