A hybrid domain decomposition type of preconditioner for Jacobi-Davidson on modern hardware

Dean's Office - Faculty of Informatics

Start date: 18 April 2016

End date: 19 April 2016

Speaker:	Menno Genseberger
	Deltares and Centrum Wiskunde & Informatica, The Netherlands
Date:	Monday, April 18, 2016
Place:	USI Lugano Campus, room A13, red building (Via G. Buffi 13)
Time:	17:30

Abstract:

The Jacobi-Davidson method is an iterative method for computing solutions of large eigenvalue problems. To compute a solution of the standard eigenvalue problem, the method in each iteration extracts an approximate solution from a search subspace, corrects the approximate eigenvector by computing a correction vector from a correction equation, and uses the correction vector to expand the search subspace. Most of the computational work in Jacobi-Davidson goes into the correction equation. In previous work, a strategy for computing (approximate) solutions of the correction equation was proposed. The strategy is based on a domain decomposition technique and aims to reduce wall-clock time and local memory requirements.
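The extract / correct / expand cycle described above can be sketched for a small dense symmetric matrix. This is a minimal illustration only, not the speaker's implementation: the function name `jacobi_davidson_sketch`, the choice of the largest Ritz value, and the exact dense solve of the correction equation through a bordered system are all assumptions made here for readability. The talk concerns large problems where the correction equation is solved approximately with a domain decomposition preconditioner.

```python
import numpy as np

def jacobi_davidson_sketch(A, v0, tol=1e-8, max_iter=50):
    """Illustrative Jacobi-Davidson loop for a symmetric matrix A (hypothetical helper)."""
    n = A.shape[0]
    V = (v0 / np.linalg.norm(v0)).reshape(n, 1)  # orthonormal basis of the search subspace
    for _ in range(max_iter):
        # Extraction: Rayleigh-Ritz on the search subspace, keep the largest Ritz pair.
        H = V.T @ A @ V
        ritz_values, ritz_vectors = np.linalg.eigh(H)
        theta = ritz_values[-1]
        u = V @ ritz_vectors[:, -1]
        r = A @ u - theta * u                    # residual of the approximate eigenpair
        if np.linalg.norm(r) < tol:
            break
        # Correction equation: (I - u u^T)(A - theta I)(I - u u^T) t = -r, t orthogonal to u.
        # Assumption: solved exactly here through the equivalent bordered system
        #   [A - theta I, u; u^T, 0] [t; eps] = [-r; 0].
        K = np.block([[A - theta * np.eye(n), u[:, None]],
                      [u[None, :], np.zeros((1, 1))]])
        t = np.linalg.solve(K, np.concatenate([-r, [0.0]]))[:n]
        # Expansion: orthogonalize t against V (twice, for stability) and append it.
        t -= V @ (V.T @ t)
        t -= V @ (V.T @ t)
        norm_t = np.linalg.norm(t)
        if norm_t < 1e-14:
            break
        V = np.hstack([V, (t / norm_t)[:, None]])
    return theta, u
```

With exact solves of the correction equation, the iteration can settle on an eigenvalue other than the largest one, so the result is best read as "an eigenpair of A". In practice the bordered solve would be replaced by a few steps of a preconditioned iterative solver.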

The use of a preconditioner based on this domain decomposition technique leads effectively to almost uncoupled subproblems on the subdomains: an ideal situation for implementation on parallel computers with distributed memory. However, hardware tends to have more and more computational cores that share memory. At the moment, modern hardware consists of nodes with tens of cores, and the nodes are connected by fast network interconnects. So, on the one hand, memory is shared by the cores within one node and, on the other hand, memory is distributed over the nodes. Distributed memory implementations of the original strategy with MPI can still be used to distribute computational work over the cores within the shared memory of a single node. However, because of the increasing number of cores within a single node, memory access rather than computational capacity is increasingly becoming the major bottleneck. Multiple cores share the same memory, and simultaneous memory access, which is likely to occur with the original strategy, will increase the overall wall-clock time. Therefore, in the present talk we will show how we adapted the original strategy for modern hardware by incorporating an efficient implementation of one of the building blocks. For this we used a banded direct solver from PARDISO that is optimized with OpenMP for multiple cores within shared memory. With numerical experiments we will illustrate the balance between the distributed and the shared part of this new hybrid approach. For this we consider different computational and memory entities (nodes, sockets, cores, threads, L1/L2/L3 caches).
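To make the "balance between the distributed and the shared part" concrete, the following sketch lists the ways a fixed set of cores can be split into MPI ranks and OpenMP threads per rank. The function name `hybrid_layouts` and the figures used (4 nodes with 16 cores each) are hypothetical and chosen only for illustration; they are not taken from the experiments in the talk.

```python
def hybrid_layouts(cores_per_node, nodes):
    """List (total MPI ranks, OpenMP threads per rank) pairs that use every core once.

    Hypothetical helper: only splits that divide the cores of a node evenly are kept.
    """
    layouts = []
    for threads in range(1, cores_per_node + 1):
        if cores_per_node % threads == 0:
            ranks_per_node = cores_per_node // threads
            layouts.append((nodes * ranks_per_node, threads))
    return layouts

# Illustrative figures only: 4 nodes with 16 cores each.
for ranks, threads in hybrid_layouts(cores_per_node=16, nodes=4):
    print(f"{ranks:3d} MPI ranks x {threads:2d} OpenMP threads per rank")
```

The first entry (one thread per rank) corresponds to a purely distributed layout and the last (one rank per node) to the most shared one. Which split performs best depends on memory access patterns and has to be measured.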

Biography:

Master's in Mathematics (main subject Numerical Mathematics) and Physics, University of Amsterdam, 1990-1997
Ph.D. in Numerical Mathematics, Utrecht University and CWI (National Research Institute for Mathematics and Computer Science in the Netherlands), 1997-2001
WL | Delft Hydraulics, 2002-2007
Deltares, 2008-present

Host:

Prof. Rolf Krause