- Mar 12, 2023
Yazen Ghannam authored
A recent change introduced a flag to queue up errors found during boot-time polling. These errors will be processed during late init once the MCE subsystem is fully set up.

A number of sysfs updates call mce_restart(), which goes through a subset of the CPU init flow. This includes polling MCA banks and logging any errors found. Since the same function is used for boot-time polling, errors will be queued. However, the system is now past late init, so the errors will remain queued until another error is found and the workqueue is triggered.

Call mce_schedule_work() at the end of mce_restart() so that queued errors are processed.

Fixes: 3bff147b ("x86/mce: Defer processing of early errors")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230301221420.2203184-1-yazen.ghannam@amd.com
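A minimal sketch of the fix's shape (names mirror arch/x86/kernel/cpu/mce/core.c, but the bodies here are stand-ins):

  static void __mcheck_cpu_init_generic(void)
  {
          /* ... polls MCA banks; past boot this still queues found errors ... */
  }

  static void mce_schedule_work(void)
  {
          /* ... kicks the MCE workqueue which drains the genpool ... */
  }

  static void mce_restart(void)
  {
          /* Re-runs a subset of CPU init and may queue newly found errors. */
          __mcheck_cpu_init_generic();

          /* Past late init nothing else drains the queue, so kick it here. */
          mce_schedule_work();
  }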
-
- Jan 10, 2023
Tony Luck authored
Systems that support various memory encryption schemes (MKTME, TDX, SEV) use high order physical address bits to indicate which key should be used for a specific memory location. When a memory error is reported, some systems may report those key bits in the IA32_MCi_ADDR machine check MSR.

The Intel SDM has a footnote for the contents of the address register that says: "Useful bits in this field depend on the address methodology in use when the register state is saved."

The AMD Processor Programming Reference has a more explicit description of the MCA_ADDR register: "For physical addresses, the most significant bit is given by Core::X86::Cpuid::LongModeInfo[PhysAddrSize]."

Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical address bits within the machine check bank address register. Use this mask for recoverable machine check handling and in the EDAC driver to ignore any key bits that may be present.

[ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ]

Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reported-by: Fan Du <fan.du@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20230109152936.397862-1-tony.luck@intel.com
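One plausible form of such a mask, derived from the CPU's reported physical address width (a sketch; the kernel's actual definition may differ):

  #include <linux/bits.h>

  /*
   * Mask off non-address (e.g. key ID) bits above the CPU's physical
   * address width before treating MCi_ADDR as a physical address.
   * boot_cpu_data.x86_phys_bits is the kernel's cached CPUID value.
   */
  #define MCI_ADDR_PHYSADDR GENMASK_ULL(boot_cpu_data.x86_phys_bits - 1, 0)

  static inline u64 mci_addr_to_phys(u64 mci_addr)
  {
          return mci_addr & MCI_ADDR_PHYSADDR;
  }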
-
- Dec 28, 2022
Smita Koralahalli authored
Newer AMD CPUs support more physical address bits. That is, the MCA_ADDR registers on Scalable MCA systems contain the ErrorAddr in bits [56:0] instead of [55:0]. Hence, the existing LSB field must be moved out of bits [61:56] of MCA_ADDR to accommodate the larger ErrorAddr size.

MCA_CONFIG[McaLsbInStatusSupported] indicates this change. If set, the LSB field will be found in MCA_STATUS rather than MCA_ADDR.

Each logical CPU has its own set of MCA banks in hardware; they are not shared with other logical CPUs. Additionally, on SMCA systems, each feature bit may differ per bank within the same logical CPU. Therefore, check MCA_CONFIG[McaLsbInStatusSupported] for each MCA bank and for each CPU.

Also, not all MCA banks support the maximum number of ErrorAddr bits in MCA_ADDR. Some banks might support fewer bits, with the remaining bits marked as reserved.

[ Yazen: Rebased and fixed up formatting. bp: Massage comments. ]

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221206173607.1185907-5-yazen.ghannam@amd.com
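A sketch of the resulting per-bank extraction (field positions follow the text above; the exact MCA_STATUS LSB location is an assumption):

  /* Mask ErrorAddr using the LSB field from whichever register holds it. */
  static u64 smca_error_addr_sketch(u64 status, u64 addr, bool lsb_in_status)
  {
          u8 lsb;

          if (lsb_in_status) {
                  lsb = (status >> 24) & 0x3f;        /* assumed MCA_STATUS field */
                  return addr & GENMASK_ULL(56, lsb); /* ErrorAddr in [56:0] */
          }

          lsb = (addr >> 56) & 0x3f;                  /* legacy MCA_ADDR[61:56] */
          return addr & GENMASK_ULL(55, lsb);         /* ErrorAddr in [55:0] */
  }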
-
Smita Koralahalli authored
Move MCA_ADDR[ErrorAddr] extraction into a separate helper function. This will be further refactored to support extended ErrorAddr bits in MCA_ADDR in newer AMD CPUs.

[ bp: Massage. ]

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/all/20220225193342.215780-3-Smita.KoralahalliChannabasappa@amd.com/
-
- May 16, 2022
Jane Chu authored
The set_memory_uc() approach doesn't work well in all cases. As Dan pointed out, when "the VMM unmapped the bad page from guest physical space and passed the machine check to the guest", "the guest gets a virtual #MC on an access to that page. When the guest tries to do set_memory_uc() and instructs cpa_flush() to do clean caches, that results in taking another fault / exception, perhaps because the VMM unmapped the page from the guest."

Since the driver has special knowledge to handle NP or UC, mark the poisoned page with NP and let the driver handle it when it comes down to repair. Please refer to the discussion here for more details:

https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

Now, since the poisoned page is marked as not-present, in order to avoid writing to a not-present page and triggering a kernel Oops, also fix pmem_do_write().

Fixes: 284ce401 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
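In outline, the change replaces the UC remap with a not-present mapping; a simplified sketch (the real set_mce_nospec() additionally uses a decoy address to keep the poison out of speculative reach):

  /*
   * Unmap the poisoned page from the kernel direct map instead of
   * remapping it uncacheable; set_memory_np() is the existing x86 helper.
   */
  static int mark_poison_not_present(unsigned long pfn)
  {
          unsigned long addr = (unsigned long)__va(pfn << PAGE_SHIFT);

          return set_memory_np(addr, 1);  /* clear _PAGE_PRESENT on one page */
  }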
-
- Apr 05, 2022
Smita Koralahalli authored
Convert struct mce_bank member "init" from bool to a bitfield to get rid of unnecessary padding.

$ pahole -C mce_bank arch/x86/kernel/cpu/mce/core.o

before:
  /* size: 16, cachelines: 1, members: 2 */
  /* padding: 7 */
  /* last cacheline: 16 bytes */

after:
  /* size: 16, cachelines: 1, members: 3 */
  /* last cacheline: 16 bytes */

No functional changes.

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220225193342.215780-2-Smita.KoralahalliChannabasappa@amd.com
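The conversion looks roughly like this (the "ctl" member stands in for the struct's other field):

  #include <linux/types.h>

  struct mce_bank_before {
          u64  ctl;       /* subevents to enable */
          bool init;      /* initialise bank? -- forces 7 bytes of tail padding */
  };

  struct mce_bank_after {
          u64   ctl;
          __u64 init         : 1,     /* same information in one bit ... */
                __reserved_1 : 63;    /* ... with room for more flags (member #3) */
  };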
-
- Mar 22, 2022
luofei authored
When the hwpoison page meets the filter conditions, it should not be regarded as successful memory_failure() processing by the mce handler, but should return a distinct value. Otherwise, the mce handler regards the error page as having been identified and isolated, which may lead to calling set_mce_nospec() to change page attributes, etc.

Here, memory_failure() returns -EOPNOTSUPP to indicate that the error event was filtered; the mce handler should not take any action in this situation, and the hwpoison injector should treat it as correct.

Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
Signed-off-by: luofei <luofei@unicloud.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
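On the caller side, the distinct return value lets the handler skip any recovery action; a sketch of the pattern (the surrounding function is illustrative):

  static void handle_poison_sketch(unsigned long pfn, int mf_flags)
  {
          int ret = memory_failure(pfn, mf_flags);

          if (ret == -EOPNOTSUPP)
                  return;         /* filtered: take no action, leave PTEs alone */

          if (!ret) {
                  /* Page really was isolated: page attribute changes such as
                   * set_mce_nospec() are legitimate here. */
          }
  }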
-
- Feb 23, 2022
Borislav Petkov authored
This is pretty much unused and not really useful. What is more, all relevant MCA hardware has recoverable machine checks support so there's no real need to tweak MCA tolerance levels in order to *maybe* extend machine lifetime.

So rip it out.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/YcDq8PxvKtTENl/e@zn.tnic
-
- Feb 19, 2022
Jue Wang authored
A rare kernel panic scenario can happen when the following conditions are met due to an erratum on fast string copy instructions:

1) An uncorrected error.
2) That error must be in the first cache line of a page.
3) Kernel must execute page_copy from the page immediately before that page.

The fast string copy instructions ("REP; MOVS*") could consume an uncorrectable memory error in the cache line _right after_ the desired region to copy and raise an MCE.

Bit 0 of MSR_IA32_MISC_ENABLE can be cleared to disable fast string copy and will avoid such spurious machine checks. However, that is less preferable due to the permanent performance impact. Considering memory poison is rare, it's desirable to keep fast string copy enabled until an MCE is seen.

Intel has confirmed the following:

1. The CPU erratum of fast string copy only applies to Skylake, Cascade Lake and Cooper Lake generations.

Directly returning from the MCE handler:

2. Will result in complete execution of the "REP; MOVS*" with no data loss or corruption.
3. Will not result in another MCE firing on the next poisoned cache line due to "REP; MOVS*".
4. Will resume execution from a correct point in code.
5. Will result in the same instruction that triggered the MCE firing a second MCE immediately for any other software recoverable data fetch errors.
6. Is not safe without disabling the fast string copy, as the next fast string copy of the same buffer on the same CPU would result in a PANIC MCE.

This should mitigate the erratum completely, with the only caveat that the fast string copy is disabled on the affected hyperthread, thus degrading performance. This is still better than the OS crashing on MCEs raised on an irrelevant process due to "REP; MOVS*" accesses in a kernel context, e.g., copy_page.

Tested: Injected errors on the 1st cache line of 8 anonymous pages of process 'proc1' and observed MCE consumption from 'proc2' with no panic (directly returned). Without the fix, the host panicked within a few minutes on a random 'proc2' process due to kernel access from copy_page.

[ bp: Fix comment style + touch ups, zap an unlikely(), improve the quirk function's readability. ]

Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20220218013209.2436006-1-juew@google.com
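The mitigation's shape: on a qualifying MCE, turn off fast string copy on the affected thread and return directly from the handler. A condensed sketch (the recoverability test is a hypothetical helper standing in for the real MCA status checks):

  static bool quirk_fast_string_sketch(void)
  {
          u64 misc_enable;

          rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
          if (!(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING))
                  return false;                   /* already disabled */

          if (!erratum_recoverable_fetch())       /* hypothetical status check */
                  return false;

          /* Point 6 above: disable fast string copy before returning. */
          wrmsrl(MSR_IA32_MISC_ENABLE,
                 misc_enable & ~MSR_IA32_MISC_ENABLE_FAST_STRING);
          return true;    /* caller returns from #MC immediately */
  }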
-
- Feb 13, 2022
Borislav Petkov authored
The arch helpers do not have explicit KASAN instrumentation. Use them in noinstr code.

Inline a couple more functions with single call sites, while at it: mce_severity_amd_smca() has a single call-site which is noinstr so force the inlining and fix:

  vmlinux.o: warning: objtool: mce_severity_amd.constprop.0()+0xca: call to \
    mce_severity_amd_smca() leaves .noinstr.text section

Always inline mca_msr_reg():

     text    data      bss      dec       hex filename
 16065240 128031326 36405368 180501934 ac23dae vmlinux.before
 16065240 128031294 36405368 180501902 ac23d8e vmlinux.after

and mce_no_way_out() as the latter one is used only once, to fix:

  vmlinux.o: warning: objtool: mce_read_aux()+0x53: call to mca_msr_reg() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_machine_check()+0xc9: call to mce_no_way_out() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20220204083015.17317-4-bp@alien8.de
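The pattern being applied, in miniature (bodies are placeholders):

  /*
   * A single-call-site helper used from noinstr code gets __always_inline
   * so that no out-of-line, instrumentable call remains.
   */
  static __always_inline u32 mca_msr_sketch(int bank)
  {
          return 0x400 + bank * 4;        /* placeholder computation */
  }

  static noinstr void mc_path_sketch(void)
  {
          u32 msr = mca_msr_sketch(0);    /* inlined: stays in .noinstr.text */
          (void)msr;
  }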
-
- Feb 01, 2022
Tony Luck authored
Currently, the PPIN (Protected Processor Inventory Number) MSR is read by every CPU that processes a machine check, CMCI, or just polls machine check banks from a periodic timer. This is not a "fast" MSR, so this adds to the overhead of processing errors.

Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the PPIN during initialization. Use this copy in mce_setup() instead of reading the MSR.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220131230111.2004669-4-tony.luck@intel.com
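The caching pattern, sketched (struct layout and feature detection are simplified):

  struct cpuinfo_sketch {
          u64 ppin;       /* new field: filled in once at CPU init */
  };

  static void init_ppin_sketch(struct cpuinfo_sketch *c)
  {
          if (cpu_feature_enabled(X86_FEATURE_INTEL_PPIN))    /* illustrative */
                  rdmsrl(MSR_PPIN, c->ppin);  /* one slow MSR read, at boot */
  }

  static u64 mce_setup_ppin_sketch(struct cpuinfo_sketch *c)
  {
          return c->ppin;                 /* hot path: plain memory load */
  }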
-
- Dec 13, 2021
Borislav Petkov authored
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x4ae: call to __const_udelay() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-13-bp@alien8.de
-
Borislav Petkov authored
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x482: call to mce_timed_out() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-12-bp@alien8.de
-
Borislav Petkov authored
add_taint() is yet another external facility which the #MC handler calls. Move that tainting call into the instrumentation-allowed part of the handler.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x617: call to add_taint() leaves .noinstr.text section

While at it, allow instrumentation around the mce_log() call.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x690: call to mce_log() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-11-bp@alien8.de
-
Borislav Petkov authored
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de
-
Borislav Petkov authored
It is called by the #MC handler which is noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de
-
Borislav Petkov authored
And allow instrumentation inside it because it does calls to other facilities which will not be tagged noinstr.

Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de
-
Borislav Petkov authored
Fixes

  vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de
-
Borislav Petkov authored
Instead, sandwich around the call which is done in noinstr context and mark the caller - mce_gather_info() - as noinstr. Also, document what the whole instrumentation strategy with #MC is going to be in the future and where it is all supposed to be going.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-5-bp@alien8.de
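The "sandwich" pattern in miniature:

  static void mce_setup_sketch(void)      /* ordinary, instrumentable code */
  {
  }

  static noinstr void mce_gather_info_sketch(void)
  {
          instrumentation_begin();
          mce_setup_sketch();             /* objtool no longer flags this call */
          instrumentation_end();
  }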
-
Borislav Petkov authored
MCA has its own special MSR accessors. Use them.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-4-bp@alien8.de
-
Borislav Petkov authored
Use num_online_cpus() directly.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-3-bp@alien8.de
-
Borislav Petkov authored
The bitmap is a single unsigned long so no need for the function call.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211208111343.8130-2-bp@alien8.de
-
- Nov 17, 2021
Zhaolong Zhang authored
Get rid of cpu_missing because 7bb39313 ("x86/mce: Make mce_timed_out() identify holdout CPUs") provides a more detailed message about which CPUs are missing.

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211109112345.2673403-1-zhangzl2013@126.com
-
- Sep 23, 2021
Borislav Petkov authored
Use a flag setting to call the only quirk function for that.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-5-bp@alien8.de
-
Borislav Petkov authored
Avoid having indirect calls and use a normal function which returns the proper MSR address based on the ->smca setting.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-4-bp@alien8.de
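The replacement plausibly has this shape (the MSR macro names are the kernel's; the exact body may differ):

  enum mca_msr { MCA_CTL, MCA_STATUS, MCA_ADDR, MCA_MISC };

  /* One ordinary function instead of an indirect ->msr_ops call. */
  static u32 mca_msr_reg_sketch(int bank, enum mca_msr reg, bool smca)
  {
          if (smca) {                     /* Scalable MCA register space */
                  switch (reg) {
                  case MCA_CTL:    return MSR_AMD64_SMCA_MCx_CTL(bank);
                  case MCA_STATUS: return MSR_AMD64_SMCA_MCx_STATUS(bank);
                  case MCA_ADDR:   return MSR_AMD64_SMCA_MCx_ADDR(bank);
                  case MCA_MISC:   return MSR_AMD64_SMCA_MCx_MISC(bank);
                  }
          }

          switch (reg) {                  /* legacy MCA register space */
          case MCA_CTL:    return MSR_IA32_MCx_CTL(bank);
          case MCA_STATUS: return MSR_IA32_MCx_STATUS(bank);
          case MCA_ADDR:   return MSR_IA32_MCx_ADDR(bank);
          case MCA_MISC:   return MSR_IA32_MCx_MISC(bank);
          }

          return 0;
  }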
-
Borislav Petkov authored
Get rid of the indirect function pointer and use flag settings instead to steer execution. Now that it is not an indirect call any longer, drop the instrumentation annotation for objtool too.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-3-bp@alien8.de
-
Borislav Petkov authored
Turn it into a normal function which calls an AMD- or Intel-specific variant depending on the CPU it runs on.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-2-bp@alien8.de
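The vendor dispatch plausibly looks like this (the parameter list is illustrative):

  static int mce_severity_sketch(struct mce *m, struct pt_regs *regs,
                                 char **msg, bool is_excp)
  {
          if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
              boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
                  return mce_severity_amd(m, regs, msg, is_excp);

          return mce_severity_intel(m, regs, msg, is_excp);
  }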
-
- Sep 20, 2021
Tony Luck authored
Sending a SIGBUS for a copy from user is not the correct semantic. System calls should return -EFAULT (or a short count for write(2)).

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com
-
- Sep 14, 2021
Tony Luck authored
There are two cases for machine check recovery:

1) The machine check was triggered by ring3 (application) code. This is the simpler case. The machine check handler simply queues work to be executed on return to user. That code unmaps the page from all users and arranges to send a SIGBUS to the task that triggered the poison.

2) The machine check was triggered in kernel code that is covered by an exception table entry. In this case the machine check handler still queues a work entry to unmap the page, etc. but this will not be called right away because the #MC handler returns to the fixup code address in the exception table entry.

Problems occur if the kernel triggers another machine check before the return to user processes the first queued work item.

Specifically, the work is queued using the ->mce_kill_me callback structure in the task struct for the current thread. Attempting to queue a second work item using this same callback results in a loop in the linked list of work functions to call. So when the kernel does return to user, it enters an infinite loop processing the same entry forever.

There are some legitimate scenarios where the kernel may take a second machine check before returning to the user:

1) Some code (e.g. futex) first tries a get_user() with page faults disabled. If this fails, the code retries with page faults enabled expecting that this will resolve the page fault.

2) Copy from user code retries a copy in byte-at-a-time mode to check whether any additional bytes can be copied.

On the other side of the fence are some bad drivers that do not check the return value from individual get_user() calls and may access multiple user addresses without noticing that some/all calls have failed.

Fix by adding a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. The first machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work callback is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page).

The expected worst case is four machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled ... repeat in copy tail bytes). Just in case there is some code that loops forever, enforce a limit of 10.

[ bp: Massage commit message, drop noinstr, fix typo, extend panic messages. ]

Fixes: 5567d11c ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
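The counter logic, condensed (messages and details simplified from the real queue_task_work()):

  static void queue_task_work_sketch(struct mce *m, char *msg)
  {
          int count = ++current->mce_count;

          if (count == 1) {       /* first #MC: save details, queue the work */
                  current->mce_addr = m->addr;
                  task_work_add(current, &current->mce_kill_me, TWA_RESUME);
                  return;
          }

          if (count > 10)         /* some code loops forever: give up */
                  mce_panic("Too many consecutive machine checks", m, msg);

          /* A later #MC must hit the same page: the callback offlines one page. */
          if ((current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
                  mce_panic("Machine checks to different user pages", m, msg);
  }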
-
- Sep 13, 2021
Thomas Gleixner authored
The exception table entries contain the instruction address, the fixup address and the handler address. All addresses are relative. Storing the handler address has a few downsides:

1) Most handlers need to be exported

2) Handlers can be defined everywhere and there is no overview about the handler types

3) MCE needs to check the handler type to decide whether an in-kernel #MC can be recovered. The functionality of the handler itself is not in any way special, but for these checks there need to be separate functions which in the worst case have to be exported.

   Some of these 'recoverable' exception fixups are pretty obscure and just reuse some other handler to spare code. That obfuscates e.g. the #MC safe copy functions. Cleaning that up would require more handlers and exports.

Rework the exception fixup mechanics by storing a fixup type number instead of the handler address and invoke the proper handler for each fixup type. Also teach the extable sort to leave the type field alone.

This makes most handlers static except for special cases like the MCE MSR fixup and the BPF fixup. This allows adding more types for cleaning up the obscure places without adding more handler code and exports.

There is a marginal code size reduction for a production config and it removes _eight_ exported symbols.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lkml.kernel.org/r/20210908132525.211958725@linutronix.de
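A sketch of the reworked mechanics (type names follow the series; the dispatch body is condensed):

  enum ex_type {
          EX_TYPE_DEFAULT = 1,
          EX_TYPE_UACCESS,
          EX_TYPE_FAULT_MCE_SAFE,         /* in-kernel #MC is recoverable here */
  };

  struct extable_entry {
          int insn;       /* relative address of the faulting instruction */
          int fixup;      /* relative address to resume at */
          int type;       /* enum ex_type: replaces the handler address */
  };

  static bool fixup_exception_sketch(const struct extable_entry *e)
  {
          switch (e->type) {              /* handlers can now stay static */
          case EX_TYPE_DEFAULT:
          case EX_TYPE_UACCESS:
          case EX_TYPE_FAULT_MCE_SAFE:
                  /* ... jump to e->fixup, set return values appropriately ... */
                  return true;
          }
          return false;
  }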
-
Thomas Gleixner authored
Prepare code for further simplification. No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.096452100@linutronix.de
-
- Aug 24, 2021
Borislav Petkov authored
When a fatal machine check results in a system reset, Linux does not clear the error(s) from machine check bank(s) - hardware preserves the machine check banks across a warm reset.

During initialization of the kernel after the reboot, Linux reads, logs, and clears all machine check banks.

But there is a problem. In:

  5de97c9f ("x86/mce: Factor out and deprecate the /dev/mcelog driver")

the call to mce_register_decode_chain() moved later in the boot sequence. This means that /dev/mcelog doesn't see those early error logs.

This was partially fixed by:

  cd9c57ca ("x86/MCE: Dump MCE to dmesg if no consumers")

which made sure that the logs were not lost completely by printing to the console. But parsing console logs is error prone. Users of /dev/mcelog should expect to find any early errors logged to standard places.

Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early machine check initialization to indicate that any errors found should just be queued to genpool. When mcheck_late_init() is called it will call mce_schedule_work() to actually log and flush any errors queued in the genpool.

[ Based on an original patch, commit message by and completely productized by Tony Luck. ]

Fixes: 5de97c9f ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
Reported-by: Sumanth Kamatala <skamatala@juniper.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
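The flag's effect inside the poll loop, sketched (the flag value is illustrative):

  #define MCP_QUEUE_LOG_SKETCH BIT(3)     /* new: queue to genpool, log later */

  static void machine_check_poll_sketch(unsigned int flags, struct mce *m)
  {
          /* ... read a bank's status into *m ... */

          if (flags & MCP_QUEUE_LOG_SKETCH)
                  mce_gen_pool_add(m);    /* defer: decode chain not ready yet */
          else
                  mce_log(m);             /* normal path: log immediately */
  }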
-
- Jun 29, 2021
Naoya Horiguchi authored
Now an action required MCE on an already hwpoisoned address surely sends a SIGBUS to the current process, but the SIGBUS doesn't convey the error virtual address. That's not optimal for hwpoison-aware applications.

To fix the issue, make memory_failure() call kill_accessing_process(), which does a pagetable walk to find the error virtual address. It could find multiple virtual addresses for the same error page, and it seems hard to tell which virtual address is the correct one. But that's rare, and sending an incorrect virtual address could be better than no address. So let's report the first found virtual address for now.

[naoya.horiguchi@nec.com: fix walk_page_range() return]

Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp
Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
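A heavily simplified sketch of the idea (the walk ops, priv struct and found-address plumbing are stand-ins for the real mm/memory-failure.c code):

  static int kill_accessing_process_sketch(struct task_struct *p, unsigned long pfn)
  {
          struct hwpoison_priv priv = { .pfn = pfn };     /* hypothetical */
          int ret;

          mmap_read_lock(p->mm);
          /* Page-table walk; the walker stops at the first VA mapping pfn. */
          ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops, &priv);
          mmap_read_unlock(p->mm);

          if (ret == 1 && priv.vaddr)     /* found: report that virtual address */
                  force_sig_mceerr(BUS_MCEERR_AR, (void __user *)priv.vaddr,
                                   PAGE_SHIFT);
          return ret;
  }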
-
- Mar 18, 2021
Ingo Molnar authored
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
-
- Feb 08, 2021
Borislav Petkov authored
Move the APIC_LVTTHMR read, which needs to happen on the BSP, to intel_init_thermal(). One less boot dependency.

No functional changes.

Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de
-
- Jan 12, 2021
Peter Zijlstra authored
There's some explicit tracing left in exc_machine_check_kernel(); remove it, as it's already implied by irqentry_nmi_enter().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210106144017.719310466@infradead.org
-
- Jan 08, 2021
Paul E. McKenney authored
The "Timeout: Not all CPUs entered broadcast exception handler" message will appear from time to time given enough systems, but this message does not identify which CPUs failed to enter the broadcast exception handler. This information would be valuable if available, for example, in order to correlate with other hardware-oriented error messages. Add a cpumask of CPUs which maintains which CPUs have entered this handler, and print out which ones failed to enter in the event of a timeout. [ bp: Massage. ] Reported-by:
Jonathan Lemon <bsd@fb.com> Signed-off-by:
Paul E. McKenney <paulmck@kernel.org> Signed-off-by:
Borislav Petkov <bp@suse.de> Tested-by:
Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20210106174102.GA23874@paulmck-ThinkPad-P72
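The tracking is a single cpumask, sketched:

  static cpumask_t mce_missing_cpus = CPU_MASK_ALL;

  static void mce_enter_broadcast_sketch(void)
  {
          /* Each CPU clears its own bit on entry to the handler. */
          cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
  }

  static void mce_timeout_report_sketch(void)
  {
          /* Restrict to online CPUs, then print the stragglers as a list. */
          if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
                  pr_emerg("CPUs not responding to MCE broadcast: %*pbl\n",
                           cpumask_pr_args(&mce_missing_cpus));
  }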
-
- Dec 01, 2020
Gabriele Paoloni authored
Currently, if an MCE happens in user-mode or while the kernel is copying data from user space, 'kill_it' is used to check if execution of the interrupted task can be recovered or not; the flag name however is not very meaningful, hence rename it to match its goal.

[ bp: Massage commit message, rename the queue_task_work() arg too. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201127161819.3106432-6-gabriele.paoloni@intel.com
-
Gabriele Paoloni authored
Currently, __mc_scan_banks() in do_machine_check() does the following callchain:

  __mc_scan_banks() -> mce_log() -> irq_work_queue(&mce_irq_work)

Hence, the call to irq_work_queue() below after __mc_scan_banks() seems redundant. Just remove it.

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-5-gabriele.paoloni@intel.com
-
Gabriele Paoloni authored
Right now for LMCE, if no_way_out is set, mce_panic() is called regardless of mca_cfg.tolerant. This is not correct as, if mca_cfg.tolerant = 3, the code should never panic. Add that check.

[ bp: use local ptr 'cfg'. ]

Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201127161819.3106432-4-gabriele.paoloni@intel.com
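The added guard is one condition; a sketch (cfg stands for the local mca_cfg pointer mentioned in the note, and the panic message is illustrative):

  static void lmce_path_sketch(struct mca_config *cfg, int no_way_out,
                               struct mce *m, char *msg)
  {
          /* tolerant == 3: never panic, even when there is no way out. */
          if (no_way_out && cfg->tolerant < 3)
                  mce_panic("Fatal local machine check", m, msg);
  }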
-