villamoving.blogg.se - Sequential testing at intel

The "next page prefetcher" appears to be primarily intended to enable the *page table entry* to be pre-loaded.

When using most or all cores, bandwidth is limited by the DRAM subsystem and code modifications to increase concurrency actually *decrease* performance (by making it harder for the memory controller to efficiently schedule the accesses to the available DRAM banks).

When using one core, bandwidth is limited by the available concurrency, so code modifications to increase concurrency are beneficial.
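A rough way to see why a single core is concurrency-limited is Little's Law (bandwidth = bytes in flight / latency). The sketch below is only an estimate under stated assumptions: the law itself is not named in the text above, the count of 10 outstanding cache-line misses per core is a hypothetical illustrative value, and the 78 ns latency and 128 GB/s peak are the Cascade Lake figures quoted elsewhere in this text.

```python
LINE_BYTES = 64                 # cache line size in bytes
LATENCY_NS = 78.0               # memory latency figure quoted in the text
PEAK_BW_BYTES_PER_S = 128e9     # per-socket peak quoted in the text (GB = 1e9 bytes)


def sustained_bandwidth(lines_in_flight, latency_ns=LATENCY_NS):
    """Little's Law: bytes per second sustained with this many lines in flight."""
    return lines_in_flight * LINE_BYTES / (latency_ns * 1e-9)


def lines_needed(bandwidth_bytes_per_s, latency_ns=LATENCY_NS):
    """Cache lines that must be in flight to sustain the given bandwidth."""
    return bandwidth_bytes_per_s * latency_ns * 1e-9 / LINE_BYTES


# Hypothetical: one core keeping 10 cache-line misses outstanding.
one_core_gbs = sustained_bandwidth(10) / 1e9
socket_lines = lines_needed(PEAK_BW_BYTES_PER_S)

print(f"one core, 10 lines in flight: {one_core_gbs:.1f} GB/s")   # 8.2 GB/s
print(f"lines in flight for 128 GB/s: {socket_lines:.0f}")        # 156
```

Under these assumptions one core falls far short of the socket peak, which is consistent with extra concurrency helping in the single-core case and no longer being the limiting factor once most cores are active.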

Software prefetches into the next 4KiB region should enable the L2 HW prefetchers to ramp up faster, but it would require careful experimentation to figure out how these interact.

Fetching the page table entry for the next 4KiB page early prevents an additional stall in the memory access sequence for the initial access to that page (which can't start until the page's virtual-to-physical translation is available). The next-page prefetcher might also load a cache line from the page into the L1D cache - I have not tried to test for this explicitly. As you might expect, the next-page prefetcher does not seem to do much for contiguous accesses on 2MiB pages, but it is inconvenient to test because there is no documented way to disable this new prefetcher. (It does not appear to be difficult to "outsmart", but the impact on bandwidth/latency/cache-misses is very small, so I only studied it to understand the effect on page table walks.)

In (almost) all recent Intel processors, software prefetches will generate page table walks to enable earlier access to the next 4KiB region.

There is no real documentation on exactly what the next-page prefetcher does, but from my testing it is clear that it causes the page table entry for the next 4KiB page to be fetched early.

The L2 prefetchers in recent Intel processors stop at 4KiB boundaries no matter what page size is in use (see the bottom of page 2-46 in the Intel Optimization Reference Manual, document 248966-042b, September 2019). This could change in future processors, of course, so it is a good idea to keep in mind. If I recall correctly, in the IBM POWER architecture a memory request can include information about the virtual memory page size in use, allowing the hardware prefetchers to operate over larger regions, with the 64KiB page size added specifically to support more effective hardware prefetching.

Starting in Ivy Bridge, Intel processors have a "next-page prefetcher" in the core. Because this is in the core and not in the L2 cache, it is able to operate on virtual addresses.

In practice, the L2 prefetcher is not aggressive enough to request a full 4KiB page at once - it waits until it sees two references to the same 4KiB page, then issues prefetches for two more cache lines following the same stride. If it sees a third request matching the sequence, it becomes more confident that the "stream" is real, and fetches two more lines, etc. The details are not specifically documented, though there are some notes in the Intel Optimization Reference Manual (document 248966, especially section 2.5.5) that provide examples.
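The ramp-up behavior described here can be sketched as a toy model. This is not the actual hardware algorithm, which is undocumented; it only encodes what the text states (no prefetch on the first reference to a 4KiB page, two more lines per matching request, never crossing the 4KiB boundary). The function name `simulate_streamer`, the per-page state layout, and the rule for restarting after a stride change are all my own assumptions.

```python
LINE_BYTES = 64      # cache line size
PAGE_BYTES = 4096    # region the modeled prefetcher never crosses
LINES_PER_PAGE = PAGE_BYTES // LINE_BYTES


def simulate_streamer(addresses, lines_per_step=2):
    """Return the byte addresses this toy model would prefetch, in order."""
    state = {}        # 4KiB page number -> [last_line, stride, next_line]
    prefetched = []
    for addr in addresses:
        line = addr // LINE_BYTES
        page = addr // PAGE_BYTES
        if page not in state:
            # First reference to a page: remember it, prefetch nothing.
            state[page] = [line, None, None]
            continue
        last_line, stride, next_line = state[page]
        observed = line - last_line
        if observed == 0:
            continue  # same line again, nothing new to learn
        if stride != observed or (next_line - line) * stride <= 0:
            # New or changed stride, or demand caught up: restart just ahead.
            stride, next_line = observed, line + observed
        for _ in range(lines_per_step):
            if next_line // LINES_PER_PAGE != page:
                break  # never prefetch past the 4KiB boundary
            prefetched.append(next_line * LINE_BYTES)
            next_line += stride
        state[page] = [line, stride, next_line]
    return prefetched


# Three sequential lines at the start of a page: nothing, then 2 + 2 lines.
print(simulate_streamer([0, 64, 128]))        # [128, 192, 256, 320]
# Two lines near the end of a page: only one line is left before the boundary.
print(simulate_streamer([61 * 64, 62 * 64]))  # [4032]
```

The second call shows the cost of the 4KiB limit in this model: the stream has to be re-detected from scratch on the next page.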

The L2 hardware prefetchers operate on physical addresses with no knowledge of the virtual memory page size in use. Therefore they only operate within 4KiB regions (the default and smallest available page size), since they don't know what 4KiB physical address will be accessed next. On a Cascade Lake Xeon, the memory bandwidth is up to 128 GB/s per socket, so it takes 32 ns to transfer 4KiB from memory. If the prefetcher requested the full 4KiB when it saw the first reference to the page, it would still take about (78 ns + 32 ns =) 110 ns to load the page.
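The arithmetic behind those two numbers can be checked with a few lines. This is just a worked calculation, assuming GB means 1e9 bytes and taking the 78 ns term as the memory latency implied by the sum above; the constant and function names are mine.

```python
PAGE_BYTES = 4 * 1024           # one 4KiB page
PEAK_BW_BYTES_PER_S = 128e9     # 128 GB/s per socket (GB = 1e9 bytes)
LATENCY_NS = 78.0               # latency term used in the text


def transfer_time_ns(num_bytes, bandwidth_bytes_per_s):
    """Time to stream num_bytes at the given bandwidth, in nanoseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1e9


transfer_ns = transfer_time_ns(PAGE_BYTES, PEAK_BW_BYTES_PER_S)
total_ns = LATENCY_NS + transfer_ns

print(f"transfer 4KiB at peak bandwidth: {transfer_ns:.0f} ns")  # 32 ns
print(f"latency + transfer:              {total_ns:.0f} ns")     # 110 ns
```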