Is your memory management multi-core ready?
Posted by: software on: 11 Sep, 2009
Recently I got a workload that could not scale beyond a few cores. This particular application uses one thread per user, so theoretically, on an 8-core machine, 8 concurrent users should fully utilize the machine, giving an 8x speedup compared to a sequential run.
Then I used my best friend VTune™ to see what the top hot functions in the code were. These were mmap, munmap and memset. This was a real surprise for me, because it was known that the application did not use memory-mapped files. I located the piece of code that called that memset function. It was an initialization of recently allocated memory blocks. Then I understood where these mmaps/munmaps were coming from. It turned out that the glibc memory allocation function (malloc) uses the memory mapping mechanisms of Linux to allocate bigger chunks of memory (at least on my Linux distribution with the stock kernel). See Page 11 of this document for details.
The Linux mmap mechanism allocates new memory by assigning page file space to the virtual address space of the app. Virtual-to-physical address mapping entries are created later, lazily, upon a page fault, when the virtual memory pages are touched by the process (i.e. with memset) for the first time. This is why there was a huge number of context switches (shown by the vmstat utility); page faults cause switches from user space to kernel space. Locks in the Linux page fault handlers have been discussed here. The Intel Thread Profiler had no chance to show the possible locks in page fault handlers because these handlers are executed in the Linux kernel (Intel Thread Profiler operates in user space).
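The lazy mapping described above can be observed from user space. The following is a minimal sketch, assuming Linux and an anonymous private mapping standing in for what malloc requests for a large block; the function name first_touch_faults is hypothetical. It counts the minor page faults the process takes while memset touches a freshly mapped region for the first time.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: returns the number of minor page faults taken while
   a freshly mmap'ed anonymous region of `size` bytes is touched for the
   first time, or -1 if mapping or accounting fails. */
long first_touch_faults(size_t size)
{
    struct rusage before, after;
    long faults = -1;

    /* Reserve virtual address space only; no physical pages are mapped yet. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) == 0) {
        memset(p, 0, size);              /* first touch: the kernel maps pages now */
        if (getrusage(RUSAGE_SELF, &after) == 0)
            faults = after.ru_minflt - before.ru_minflt;
    }

    munmap(p, size);
    return faults;
}
```

Calling first_touch_faults(200 * 1024) typically reports about one fault per touched page, though the exact count depends on the page size and kernel settings.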
This is of course no guarantee that any arbitrary multithreaded application will profit from the Intel TBB allocator at this scale. I have seen concurrent workloads that were sped up by only a few percent, and I still find this encouraging enough to give the Intel TBB allocator a try.
I knew that Intel Threading Building Blocks (Intel TBB) has a scalable memory allocator, and that a drop-in version recently became available in version 2.2, which also supports large object allocation efficiently. One just needs to define some environment variables to use it for a legacy application:
export LD_LIBRARY_PATH=absolute_path_to_tbb_libs:$LD_LIBRARY_PATH
export LD_PRELOAD=libtbbmalloc_proxy.so.2:libtbbmalloc.so.2:$LD_PRELOAD
run_your_app
You can verify that the allocator library is loaded using “pmap app_process_id | grep tbb”.
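The same check can also be done from inside the process. This is a minimal sketch, assuming Linux, where /proc/self/maps lists the files mapped into the current process (the same information pmap reports); the helper name mapping_loaded is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any line of /proc/self/maps contains
   `needle`, 0 if none does, and -1 if the file cannot be opened. */
int mapping_loaded(const char *needle)
{
    char line[4096];
    int found = 0;

    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    /* Each line describes one mapping and ends with the backing file, if any. */
    while (fgets(line, sizeof line, f) != NULL) {
        if (strstr(line, needle) != NULL) {
            found = 1;
            break;
        }
    }

    fclose(f);
    return found;
}
```

With the preload variables set, mapping_loaded("tbbmalloc") would be expected to return 1 early in main; without them it should return 0.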
It looked like the problem had been reproduced. It was the memory management that did not scale.
Now come the results with Intel TBB malloc lib:
1 thread: 6 seconds runtime, at most 1 core is busy
2 threads: 6 seconds runtime, at most 2 cores are busy
4 threads: 10 seconds runtime, at most 4 cores are busy
8 threads: 14 seconds runtime, at most 8 cores are busy
Feedback and questions are very welcome.
To verify my hunch that the memory management via standard malloc is not scaling as one could hope for, I wrote a simple program that did just that in every thread:
void * worker(void *) {
    int iterations = 300000;
    /* size is a global holding the block size in bytes, e.g. 200*1024 */
    while (iterations--) {
        void * p = malloc(size);
        memset(p, 0, size);
        free(p);
    }
    return NULL;
}
My test spawned x worker threads and then waited until they had all finished.
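The spawning code is not shown in this post, so the following is only a sketch of how such a test could look with POSIX threads. The names bench_args, bench_worker and run_benchmark are hypothetical, and the block size and iteration count are passed in as parameters instead of using a global.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parameter bundle shared read-only by all workers. */
struct bench_args {
    size_t size;        /* block size in bytes */
    int iterations;     /* allocate/clear/free cycles per thread */
};

/* Same loop as the worker in the post. Note that an optimizing compiler
   may elide this sequence, so build without optimization when measuring. */
static void *bench_worker(void *arg)
{
    const struct bench_args *a = arg;
    for (int i = 0; i < a->iterations; i++) {
        void *p = malloc(a->size);
        if (p == NULL)
            return (void *)1;           /* report allocation failure */
        memset(p, 0, a->size);
        free(p);
    }
    return NULL;
}

/* Spawns nthreads workers and waits for all of them.
   Returns 0 on success, -1 if any thread failed to start or to allocate. */
int run_benchmark(int nthreads, size_t size, int iterations)
{
    struct bench_args args = { size, iterations };
    int started = 0, failed = 0;

    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    if (tids == NULL)
        return -1;

    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, bench_worker, &args) != 0) {
            failed = 1;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        void *ret = NULL;
        if (pthread_join(tids[i], &ret) != 0 || ret != NULL)
            failed = 1;
    }

    free(tids);
    return failed ? -1 : 0;
}
```

A main function would call, for example, run_benchmark(8, 200 * 1024, 300000), built with -pthread, while the runtime is measured externally and core usage is watched with top.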
Roman
Then I spent some time looking at the source code and guessing what the reason could be. Too much sequential code in the processing? Or exclusive locks? I talked to the developers of this application. It was known to me that the relevant portion of the code could be locked with a read-write lock. The queries were read-only. It means it should scale. I used Intel Thread Profiler, which has helped me a lot previously. It showed a perfect picture: with 8 users there were 8 threads busy, halting very rarely on a few locks. So, it should scale!
For this small test routine the Intel TBB malloc speedup was an amazing 23x. Looking at the vmstat output, one could observe that the number of context switches was back to normal values.
My original workload ran about 7x faster, utilizing up to 8 cores with this new allocator.
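The 23x figure is the ratio of the two 8-thread runtimes of the test program reported in this post. A small sketch of that arithmetic for every thread count follows; the table and function names are hypothetical, and the numbers are the measured runtimes from the two result lists.

```c
/* Runtimes in seconds as reported in this post for 200KB blocks. */
struct run {
    int threads;
    double malloc_seconds;   /* standard glibc malloc */
    double tbb_seconds;      /* Intel TBB malloc library */
};

static const struct run runs[4] = {
    { 1,  21.0,  6.0 },
    { 2,  53.0,  6.0 },
    { 4, 105.0, 10.0 },
    { 8, 325.0, 14.0 },
};

/* Speedup of the TBB run over the standard malloc run for entry i (0..3). */
double speedup_at(int i)
{
    return runs[i].malloc_seconds / runs[i].tbb_seconds;
}
```

For the 8-thread entry this gives 325 / 14, a little over 23.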
Here are the results for 200KB blocks (size=200*1024):
1 thread: 21 seconds runtime, at most 1 core is busy (via top utility)
2 threads: 53 seconds runtime, at most 1.4 cores are busy
4 threads: 105 seconds runtime, at most 1.8 cores are busy
8 threads: 325 seconds runtime, at most 2 cores are busy
It did not happen. At most two cores were utilized, and the query throughput speedup was even smaller.
VTune is a trademark of Intel Corporation in the U.S. and other countries.
software.intel.com