blog.ilvokhin.com

Jemalloc HPA Reference

Intro

Jemalloc HPA is a hugepage-aware implementation of the page allocator. HPA leverages hugepages to reduce the cost of TLB misses and thereby improve application performance.

Glossary

Pageslab

A pageslab is a hugepage-aligned and hugepage-sized memory range. You can think of it as a set of pages packed together into a hugepage. A pageslab is not necessarily backed by a hugepage.

Active Page

An active page is a memory page that potentially stores application data.

Dirty Page

A dirty page is a memory page that might have had application data on it in the past, but has no application data now. It can be reused (become active again) or returned to the OS at any time.

Purge

Purge is the process of returning dirty pages to the OS.

Hugification

Hugification is a request to the OS to back a pageslab with a hugepage.

Constants

PAGE

PAGE is the size of a jemalloc page. By default it is 4096 bytes on x86_64.

HUGEPAGE

HUGEPAGE is the size of the OS hugepage. By default it is 2097152 bytes (2 MiB) on x86_64.

HUGEPAGES_PAGES

Number of pages in a single hugepage: HUGEPAGE / PAGE. By default it is 512 on x86_64.
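
The relationship between the three constants can be checked with a few lines of Python. This is only a sketch using the x86_64 defaults quoted above; the Python variables mirror the constant names and are not a jemalloc API.

```python
# Default sizes on x86_64, as quoted in this section.
PAGE = 4096                 # bytes in a jemalloc page
HUGEPAGE = 2 * 1024 * 1024  # bytes in an OS hugepage (2 MiB)

# Number of jemalloc pages that fit into one hugepage.
HUGEPAGES_PAGES = HUGEPAGE // PAGE

print(HUGEPAGES_PAGES)  # 512
```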

Documentation

HPA is under active development and its options are not yet described in the official documentation. Below is a brief description of the currently available options.

hpa

HPA enabled/disabled.

Master switch to enable HPA. This option must be enabled for the other options to take effect.

Boolean. Default value: false.
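
Jemalloc options are usually passed as a comma-separated list of name:value pairs, for example through the MALLOC_CONF environment variable. The helper below is a hypothetical convenience function (malloc_conf is not part of jemalloc) that formats such a string. It assumes the jemalloc build in use recognizes the HPA options and was built without a symbol prefix, which would rename the variable.

```python
def malloc_conf(options):
    """Format a dict of option names and values as a MALLOC_CONF string."""
    parts = []
    for name, value in options.items():
        # jemalloc spells booleans as "true" and "false".
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{name}:{value}")
    return ",".join(parts)


conf = malloc_conf({"hpa": True, "hpa_dirty_mult": 0.25})
print(conf)  # hpa:true,hpa_dirty_mult:0.25
```

The resulting string would typically be exported as MALLOC_CONF in the environment of the process that links jemalloc.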

hpa_slab_max_alloc

Maximum allocation size in bytes allowed to be served from the HPA page allocator.

Allocations of greater size will be served from a more general (for now) classic page allocator (PAC), which can handle allocation requests of any size. Due to an implementation quirk, slab allocations will always be served out of HPA, even when the hpa_slab_max_alloc option is set to a small value like PAGE. This quirk can be leveraged to serve only small allocations out of HPA (in jemalloc's definition, a small allocation is one of less than 16 KiB).

Unsigned integer. Default value: 65536 bytes. Minimum value: PAGE bytes. Maximum value: HUGEPAGE bytes.
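
The routing rule can be modeled in a few lines. This is a sketch of the behavior described above, assuming HPA is enabled; page_allocator_for and its is_slab flag are hypothetical names, not jemalloc functions.

```python
PAGE = 4096


def page_allocator_for(size, is_slab, hpa_slab_max_alloc=65536):
    """Return "HPA" or "PAC" for an allocation request (illustrative model)."""
    # Slab allocations always go to HPA, whatever the limit is.
    if is_slab:
        return "HPA"
    # Other allocations go to HPA only up to the configured limit.
    return "HPA" if size <= hpa_slab_max_alloc else "PAC"


print(page_allocator_for(65536, is_slab=False))  # HPA
print(page_allocator_for(65537, is_slab=False))  # PAC
# With the limit lowered to PAGE, only slabs still reach HPA.
print(page_allocator_for(8192, is_slab=True, hpa_slab_max_alloc=PAGE))   # HPA
print(page_allocator_for(8192, is_slab=False, hpa_slab_max_alloc=PAGE))  # PAC
```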

hpa_hugification_threshold

Minimum number of active bytes in a pageslab necessary for the pageslab to be placed into the hugification queue.

A pageslab is always produced in a non-huge state. Over time, when the number of active bytes becomes greater than or equal to hpa_hugification_threshold, jemalloc puts the pageslab into the hugification queue.

Unsigned integer. Default value: 0.95 * HUGEPAGE bytes. Minimum value: PAGE bytes. Maximum value: HUGEPAGE bytes.

hpa_hugification_threshold_ratio

Minimum fraction of active bytes in a pageslab necessary for the pageslab to be placed into the hugification queue.

This option has the same semantics as hpa_hugification_threshold, but is expressed as a ratio of HUGEPAGE rather than in bytes.

Fixed-point fractional. Default value: 0.95. Minimum value: 0. Maximum value: 1.0.
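
The two threshold options describe the same limit in different units. The sketch below converts a ratio into bytes and applies the eligibility check; the function names are hypothetical, and rounding down to whole bytes is an assumption of this example.

```python
HUGEPAGE = 2 * 1024 * 1024


def threshold_bytes(ratio):
    """Convert a ratio in [0, 1] into a byte threshold (rounded down)."""
    return int(HUGEPAGE * ratio)


def eligible_for_hugification(active_bytes, threshold):
    """A pageslab qualifies once its active bytes reach the threshold."""
    return active_bytes >= threshold


default_threshold = threshold_bytes(0.95)
print(default_threshold)  # 1992294
print(eligible_for_hugification(HUGEPAGE // 2, default_threshold))  # False
print(eligible_for_hugification(HUGEPAGE, default_threshold))       # True
```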

hpa_hugify_delay_ms

Time in milliseconds a pageslab must spend in the hugification queue before jemalloc asks the OS to back the pageslab with a hugepage.

The hugification queue is ordered by the time each pageslab was placed into it: the head of the queue is the pageslab that was enqueued earliest and the tail is the one enqueued latest. When a pageslab stops meeting the hugification criterion (its number of active bytes drops below hpa_hugification_threshold), it is not removed from the hugification queue. Only a purge can remove a pageslab from the hugification queue.

Unsigned integer. Default value: 10000 milliseconds.
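
A toy model of the queue may make the timing rule easier to follow. HugifyQueue is a hypothetical class written for this example and does not mirror jemalloc's internal data structures; treating "waited exactly the delay" as ready is also an assumption.

```python
from collections import deque


class HugifyQueue:
    """Toy model of the hugification queue (not jemalloc's data structure)."""

    def __init__(self, hugify_delay_ms=10000):
        self.delay_ms = hugify_delay_ms
        self.queue = deque()  # (enqueue_time_ms, slab), oldest at the head

    def enqueue(self, slab, now_ms):
        self.queue.append((now_ms, slab))

    def pop_ready(self, now_ms):
        """Return pageslabs that have waited at least delay_ms."""
        ready = []
        while self.queue and now_ms - self.queue[0][0] >= self.delay_ms:
            ready.append(self.queue.popleft()[1])
        return ready

    def remove_on_purge(self, slab):
        """Purging is the only event that removes a waiting pageslab."""
        self.queue = deque(e for e in self.queue if e[1] != slab)


q = HugifyQueue()
q.enqueue("slab-a", now_ms=0)
q.enqueue("slab-b", now_ms=4000)
print(q.pop_ready(now_ms=9999))   # []
print(q.pop_ready(now_ms=10000))  # ['slab-a']
```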

hpa_hugify_sync

Switch to use synchronous hugification requests.

When enabled, jemalloc uses madvise(..., MADV_COLLAPSE) alongside madvise(..., MADV_HUGEPAGE) to ask the OS to back a pageslab with a hugepage. On failure, the stats.arenas.<i>.hpa_shard.nhugify_failures counter is incremented.

The usual asynchronous hugification introduces a delay of unknown length between the moment the request to hugify a pageslab is made and the moment the OS actually backs the pageslab with a hugepage. This option eliminates that delay. It requires Linux 6.1 or newer.

Boolean. Default value: false.

hpa_min_purge_interval_ms

Minimum time between two consecutive purge phases in milliseconds.

Every hpa_min_purge_interval_ms milliseconds, jemalloc checks whether the purging criteria are met and, if they are, purges as many pageslabs as needed until the criteria are no longer met. The minimal unit of purging is a pageslab: all dirty pages of the chosen pageslab are returned to the OS, even if fewer pages would be enough to reach the purging target. If several dirty pages are consecutive, a single syscall is issued to purge them together in one go.

Unsigned integer. Default value: 5000 milliseconds.
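
The "one syscall per run of consecutive dirty pages" rule can be illustrated by grouping a pageslab's dirty pages into runs. This is a minimal sketch: dirty_runs is a hypothetical helper and the list of booleans stands in for whatever bookkeeping jemalloc actually uses.

```python
def dirty_runs(dirty):
    """Return (start, length) for each run of consecutive dirty pages."""
    runs = []
    start = None
    for i, is_dirty in enumerate(dirty):
        if is_dirty and start is None:
            start = i  # a new run begins here
        elif not is_dirty and start is not None:
            runs.append((start, i - start))  # the run ended at page i - 1
            start = None
    if start is not None:
        runs.append((start, len(dirty) - start))
    return runs


# Six dirty pages grouped into three runs, so three purge calls.
pages = [True, True, False, True, False, False, True, True, True]
print(dirty_runs(pages))  # [(0, 2), (3, 1), (6, 3)]
```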

hpa_peak_demand_window_ms

Length of peak demand sliding window in milliseconds.

Time component of the purging criteria. Jemalloc tracks the maximum number of active pages in use within a sliding window of hpa_peak_demand_window_ms milliseconds. Jemalloc purges dirty pages above that peak usage.

It is easier to explain with an example. Suppose hpa_peak_demand_window_ms is 10000, ncurrent is the number of active pages currently in use, and npeak is the peak (maximum) number of active pages within the last 10 seconds. Then jemalloc is allowed to keep npeak - ncurrent dirty pages and will purge the rest, if there are any.

Option hpa_peak_demand_window_ms works in combination with hpa_dirty_mult.

Unsigned integer. Default value: 0 milliseconds (disabled by default).
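
The sliding-window peak can be sketched as follows. PeakDemand is a hypothetical class that simply stores timestamped samples; jemalloc's real bookkeeping may differ. The most recent sample is always kept, so a window of 0 degenerates to npeak being equal to ncurrent.

```python
from collections import deque


class PeakDemand:
    """Toy sliding-window maximum of active page counts."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.samples = deque()  # (time_ms, nactive), oldest first

    def update(self, now_ms, nactive):
        self.samples.append((now_ms, nactive))
        # Drop samples that fell out of the window, keeping the newest one.
        while (len(self.samples) > 1
               and self.samples[0][0] <= now_ms - self.window_ms):
            self.samples.popleft()

    def npeak(self):
        return max(n for _, n in self.samples)


peak = PeakDemand(window_ms=10000)
peak.update(0, 100)
peak.update(5000, 300)
peak.update(12000, 200)  # ncurrent is 200, the sample at t=0 has expired
print(peak.npeak())        # 300
print(peak.npeak() - 200)  # 100 dirty pages may be kept
```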

hpa_dirty_mult

Maximum ratio of dirty to active pages that jemalloc is allowed to keep.

Ratio-based component of the purging criteria.

Jemalloc tries to estimate the maximum amount of active memory the application is likely to need in the near future. It does so by projecting future active memory demand (based on the peak active memory usage observed within a sliding window) and adding slack on top of it (an overhead that is reasonable to accept in exchange for higher hugepage coverage). When peak demand tracking is off, the projection of future active memory is the current active memory usage.

The estimate is essentially npeak * (1 + hpa_dirty_mult). When hpa_peak_demand_window_ms is set to 0, npeak equals ncurrent and the estimate becomes ncurrent * (1 + hpa_dirty_mult), which leaves room for ncurrent * hpa_dirty_mult dirty pages. When hpa_dirty_mult is 0, the estimate becomes just npeak.

Option hpa_dirty_mult works in combination with hpa_peak_demand_window_ms.

Fixed-point fractional or -1. Default value: 0.25 (not a great default). Setting it to -1 disables purging completely.
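
Putting the two components together, the number of dirty pages jemalloc may keep can be sketched as the estimate minus the pages currently in use. This reads the estimate as a cap on active plus dirty pages, which is an interpretation of the description above rather than a transcription of jemalloc's code; both function names are hypothetical.

```python
def max_dirty_pages(ncurrent, npeak, hpa_dirty_mult):
    """Dirty pages allowed to stay, under the model described above."""
    if hpa_dirty_mult == -1:
        return None  # purging disabled: no limit
    estimate = npeak * (1 + hpa_dirty_mult)
    return max(0, int(estimate) - ncurrent)


def pages_to_purge(ndirty, ncurrent, npeak, hpa_dirty_mult):
    """Dirty pages above the allowed limit (0 when purging is disabled)."""
    limit = max_dirty_pages(ncurrent, npeak, hpa_dirty_mult)
    if limit is None:
        return 0
    return max(0, ndirty - limit)


# Peak tracking on, no slack: keep npeak - ncurrent dirty pages.
print(max_dirty_pages(ncurrent=200, npeak=300, hpa_dirty_mult=0))     # 100
# Peak tracking off (npeak == ncurrent): keep ncurrent * hpa_dirty_mult.
print(max_dirty_pages(ncurrent=400, npeak=400, hpa_dirty_mult=0.25))  # 100
print(pages_to_purge(150, ncurrent=400, npeak=400, hpa_dirty_mult=0.25))  # 50
```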

hpa_sec_nshards

Number of small extent cache (SEC) shards.

SEC is a cache layer above the HPA page allocator. Requests are distributed across small extent cache shards [0, nshards). If a request cannot be served out of SEC, it is forwarded to the HPA page allocator.

I cannot say I have seen cases where the SEC helped much. Probably, more work is required to make SEC useful.

Unsigned integer. Default value: 4 shards. Setting it to 0 disables the small extent cache (SEC).

hpa_sec_max_alloc

Maximum size in bytes of an allocation that can be served out of SEC.

Jemalloc will refuse to cache any objects if their size is greater than hpa_sec_max_alloc and forward such objects to the HPA page allocator.

Unsigned integer. Default value: 32768 bytes. Minimum value: PAGE bytes. Maximum value: 32768 bytes.

hpa_sec_max_bytes

Maximum number of bytes a small extent cache shard is allowed to cache.

When the number of bytes cached in a shard exceeds hpa_sec_max_bytes, jemalloc flushes bins until the number of cached bytes falls below hpa_sec_bytes_after_flush.

Unsigned integer. Default value: 262144 bytes. Minimum value: PAGE.

hpa_sec_bytes_after_flush

Maximum number of bytes SEC is allowed to have after flush caused by exceeding hpa_sec_max_bytes.

This option should be less than hpa_sec_max_bytes for SEC to be useful.

Unsigned integer. Default value: 131072 bytes. Minimum value: PAGE.
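
The interplay between hpa_sec_max_bytes and hpa_sec_bytes_after_flush can be sketched as a flush loop over bins. flush_shard is a hypothetical function; representing bins as a dict from extent size to count, and flushing the largest bins first, are assumptions made only for this example.

```python
def flush_shard(bins, max_bytes=262144, bytes_after_flush=131072):
    """Flush whole bins until cached bytes drop to bytes_after_flush.

    bins maps an extent size in bytes to the number of cached extents.
    Returns the number of bytes flushed.
    """
    cached = sum(size * count for size, count in bins.items())
    if cached <= max_bytes:
        return 0  # under the limit, nothing to do
    flushed = 0
    for size in sorted(bins, reverse=True):  # flush order is an assumption
        if cached - flushed <= bytes_after_flush:
            break
        flushed += size * bins[size]
        bins[size] = 0
    return flushed


shard = {4096: 20, 8192: 10, 32768: 4}  # 294912 bytes cached in total
print(flush_shard(shard))  # 212992
print(shard)               # {4096: 20, 8192: 0, 32768: 0}
```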

hpa_sec_batch_fill_extra

Number of extra objects to fill on SEC miss.

When an allocation request cannot be satisfied out of SEC because no suitable objects are cached, jemalloc brings hpa_sec_batch_fill_extra additional objects into SEC from the HPA page allocator.

Unsigned integer. Default value: 0. Maximum value: HUGEPAGES_PAGES.
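
The batch fill behavior on a miss can be modeled like this. It is a minimal sketch: sec_alloc, the list used as a cache, and the hpa_alloc callback are all hypothetical stand-ins for the real SEC and HPA interfaces.

```python
def sec_alloc(cache, hpa_alloc, hpa_sec_batch_fill_extra=0):
    """Serve one object from the cache list, refilling from HPA on a miss."""
    if cache:
        return cache.pop()  # hit: no call into HPA
    # Miss: fetch the requested object plus the configured extras.
    batch = [hpa_alloc() for _ in range(1 + hpa_sec_batch_fill_extra)]
    obj = batch.pop(0)
    cache.extend(batch)  # extras stay cached for later requests
    return obj


hpa_calls = []


def fake_hpa_alloc():
    """Stand-in for the HPA page allocator that counts its calls."""
    hpa_calls.append(1)
    return len(hpa_calls)


cache = []
sec_alloc(cache, fake_hpa_alloc, hpa_sec_batch_fill_extra=3)  # miss
print(len(hpa_calls), len(cache))  # 4 3
sec_alloc(cache, fake_hpa_alloc, hpa_sec_batch_fill_extra=3)  # hit
print(len(hpa_calls), len(cache))  # 4 2
```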

experimental_hpa_max_purge_nhp

Maximum number of pageslabs to purge on each purging phase.

Experimental option that will likely be removed soon. It limits the number of pageslabs purged in each purging phase.

Signed integer. Default value: -1 (disabled by default).

Acknowledgements

Thanks to Kevin Svetlitski, whose note introduced me to the HPA world.