Modern operating systems give each process its own virtual address space, which can be larger than physical memory. They do this by mapping virtual addresses onto actual physical memory pages. The following diagram shows how large virtual address spaces are mapped onto small physical memory.
The virtual-to-physical relationship is maintained in mapping tables (page tables). Linux maintains a separate map for each process. The map contains one entry for each page, and the page size is set by the operating system. The CPU speeds up the translation of virtual addresses to physical addresses by caching a subset of the mapping entries in an internal cache, the translation lookaside buffer (TLB). Modern large-memory systems, and the large processes that run on them, have grown the memory mapping tables faster than CPU cache sizes have grown. The processor's translation cache is usually very small, with perhaps only 100 entries in many processors. This mismatch causes cache misses that can have significant performance implications.
Linux uses physical memory to hold the process memory mapping tables and loads entries from those tables into the processor cache on context switches and on translation misses. This physical memory is unavailable for other uses, which reduces the amount of memory available to the rest of the system. The amount of memory used for page tables is the sum of all the virtual process sizes divided by the page size, multiplied by the size of a page table entry. So a system with a 20GB Oracle SGA uses 50MB of physical memory for page tables just for the SGA. Each connection that attaches to the SGA has its own page tables, so the figure is multiplied by the number of connections:
- 20GB process size / 4KB page size * 10-byte entry size = 50MB (5 million entries)
- 20GB process size / 4KB page size * 10-byte entry size * 200 connections = 10GB (1 billion page table entries)
Linux provides the ability to back memory with 2MB pages instead of 4KB pages, which can return large amounts of memory to the general pool. Calculations show that converting just the Oracle SGA in the previous example would save almost 10GB.
- 20GB process size / 2MB page size * 10-byte entry size * 200 connections = 20MB (2 million entries)
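The arithmetic above can be reproduced with a short Python sketch. The function name `page_table_bytes` is made up for illustration, and the 10-byte entry size is simply the round figure used in this article, not a measured value. Because the sketch uses exact binary units, it prints slightly lower numbers than the rounded 10GB and 20MB quoted above.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def page_table_bytes(mapped_bytes, page_size, processes=1, entry_size=10):
    """Estimate the memory consumed by page table entries.

    entry_size=10 is the article's round figure (an assumption).
    Only the entries themselves are counted, nothing else.
    """
    entries_per_process = mapped_bytes // page_size
    return entries_per_process * processes * entry_size

sga = 20 * GB

one_process = page_table_bytes(sga, 4 * KB)
small_pages = page_table_bytes(sga, 4 * KB, processes=200)
huge_pages = page_table_bytes(sga, 2 * MB, processes=200)

print(f"4KB pages, 1 process:     {one_process / MB:.2f} MB")
print(f"4KB pages, 200 processes: {small_pages / GB:.2f} GB")
print(f"2MB pages, 200 processes: {huge_pages / MB:.2f} MB")
```

Changing `processes` or the SGA size shows how quickly the 4KB figure grows while the 2MB figure stays small.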
Huge (2MB) pages differ from normal 4KB pages in that they are locked into memory and do not swap. This means that extreme care must be taken when choosing how much memory to set aside as huge pages. The huge page pool should be large enough to hold the Oracle SGA while leaving enough other memory to run the non-SGA client code, the operating system and other processes. Teams must understand their memory needs so that they can set the correct number of huge pages.
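As a rough aid for that sizing exercise, here is a minimal Python sketch that works out how many huge pages a given SGA needs and how much memory would remain for everything else. It reads `Hugepagesize` and `MemTotal` from text in the format of Linux's `/proc/meminfo`. The sample text, the 32GB `MemTotal` value, and the helper names are illustrative assumptions; on a real server you would read the actual file and leave extra headroom.

```python
import re

# Illustrative sample in /proc/meminfo format (values are assumptions).
SAMPLE_MEMINFO = """\
MemTotal:       33554432 kB
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB
"""

def parse_meminfo(text):
    """Return {field: integer} for each 'Name:   value' line."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields

def huge_pages_needed(sga_bytes, huge_page_kb):
    """Round up so that the whole SGA fits in the huge page pool."""
    page_bytes = huge_page_kb * 1024
    return -(-sga_bytes // page_bytes)  # integer ceiling division

info = parse_meminfo(SAMPLE_MEMINFO)
sga_bytes = 20 * 1024**3  # the 20GB SGA from the earlier example

needed = huge_pages_needed(sga_bytes, info["Hugepagesize"])
reserved_kb = needed * info["Hugepagesize"]
left_kb = info["MemTotal"] - reserved_kb

print(f"huge pages needed: {needed}")
print(f"memory left for everything else: {left_kb / 1024**2:.1f} GB")
```

The resulting page count is the kind of value that would go into the `vm.nr_hugepages` kernel setting. The memory that remains is what the operating system, client processes and everything else have to live in, since the huge page pool cannot be swapped out.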
I was talking with a colleague about their system and we did some back-of-the-napkin calculations. They had an Oracle RAC with two 72GB servers, each with 500 database connections. This ran a little slower than expected when both machines were up, but their real problem was that they were unable to fail over to a single node, which would then have to carry all 1000 connections, even though they had plenty of CPU capacity. The following calculations show why.
With 4KB pages:
Page Table Entries (SGA) = 35GB / 4KB = 8.75M entries
Page Table Entries (1000 connections) = 1000 * 8.75M = 8.75G entries
Memory (1000 connections) = 10 bytes * 8.75G entries = 87.5GB
With 2MB pages:
Page Table Entries (SGA) = 35GB / 2MB = 17.5K entries
Page Table Entries (1000 connections) = 1000 * 17.5K = 17.5M entries
Memory (1000 connections) = 10 bytes * 17.5M entries = 175MB
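To make the failover problem concrete, the following Python sketch compares the worst-case page table memory for 1000 connections against the 72GB of RAM in one server. It assumes, as the napkin math does, that every connection maps the whole SGA and that each entry costs 10 bytes. The constant and function names are invented for this example. Exact binary arithmetic gives roughly 85GB rather than the rounded 87.5GB, but the conclusion is the same.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

RAM = 72 * GB         # physical memory in one server
SGA = 35 * GB
CONNECTIONS = 1000    # both nodes' connections landing on one server
ENTRY_SIZE = 10       # bytes per entry, the article's round figure

def worst_case_page_tables(page_size):
    """Bytes of page tables if every connection maps the whole SGA."""
    return (SGA // page_size) * CONNECTIONS * ENTRY_SIZE

for label, page_size in (("4KB", 4 * KB), ("2MB", 2 * MB)):
    tables = worst_case_page_tables(page_size)
    share = tables / RAM
    print(f"{label} pages: {tables / GB:.2f} GB of page tables "
          f"({share:.1%} of RAM)")
```

With 4KB pages the page tables alone can exceed the memory in the server before the SGA itself is counted. With 2MB pages they are a fraction of a percent of RAM.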
The team converted 35GB of memory to huge pages so that the entire SGA fit into them. This returned 40GB of free memory to each machine and made it possible to fall back to a single machine during planned or unplanned outages of one server.
A smaller system at another site recovered 7GB of physical memory on each of its 32GB servers by converting its 20GB Oracle SGA to huge pages. This ended all of the system's paging.
Intelligent conversion from small memory pages to large memory pages on modern large-RAM systems can have significant performance implications:
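- Larger page sizes mean fewer page table entries are needed to cover the same memory, so more of the working set is covered by the entries held in the CPU's translation cache (the TLB). This results in fewer cache misses and less CPU stalling.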
- The memory savings from the roughly 500-fold reduction in page table entries (2MB / 4KB = 512) are multiplied by the number of processes sharing the same shared memory.
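The first point can be illustrated by computing how much memory the CPU's translation cache can cover at each page size. This Python sketch takes the rough figure of 100 cached entries mentioned earlier and assumes the same number of entries is available for both page sizes, which real processors may not provide. The function name `tlb_reach` is just a label chosen for this example.

```python
KB, MB = 1024, 1024**2

def tlb_reach(entries, page_size):
    """Bytes of memory that can be translated without a cache miss."""
    return entries * page_size

ENTRIES = 100  # the article's rough figure, not a measured value

for label, page_size in (("4KB", 4 * KB), ("2MB", 2 * MB)):
    reach = tlb_reach(ENTRIES, page_size)
    print(f"{label} pages: {ENTRIES} entries cover {reach / MB:.2f} MB")

print(f"entries saved per 2MB page: {(2 * MB) // (4 * KB)}x fewer")
```

Under these assumptions the same 100 entries cover a few hundred kilobytes with 4KB pages and a couple of hundred megabytes with 2MB pages, which is why a process sweeping through a multi-gigabyte SGA misses far less often.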