I'm pleased to announce that our paper on TCMalloc and hugepages has been made available.
This paper describes the changes made to TCMalloc in order for it to become hugepage aware. It required a complete restructure of the "backend" - the bit that manages memory requested from the OS. The obvious change is to manage this memory in chunks of 2MiB (x86 hugepage size). Once you have that you need a cache that handles allocations of less than 2MiB, another that handles allocations of greater than 2MiB, etc.
One of the key observations in the paper is that increasing hugepage coverage improves application performance, not necessarily allocator performance. i.e. you can spend extra time in the allocator preserving hugepage coverage, this improves the performance of the code that uses the memory. The net result is that you can slightly increase time spent in the allocator, and reduce time spent in the application. This makes the time spent in the allocator increase - making it look worse!
To work around the fact that the allocator looks worse, you also need to look for metrics that represent the productivity of the application. If the productivity increases, then the changes to the allocator are an efficiency win - even if the allocator ends up taking a greater percentage of the entire runtime.