LiU-ITN-TEK-A--12/020--SE

GPU accelerated Nonlinear Soft Tissue Deformation

Sathish Kottravel

2012-03-29

Department of Science and Technology
Linköping University
SE-601 74 Norrköping, Sweden

Institutionen för teknik och naturvetenskap
Linköpings universitet
601 74 Norrköping

LiU-ITN-TEK-A--12/020--SE

GPU accelerated Nonlinear Soft Tissue Deformation

Thesis work carried out in Media Technology at the Institute of Technology, Linköping University

Sathish Kottravel

Supervisor: Umut Kocak
Examiner: Karljohan Lundin Palmerius

Norrköping, 2012-03-29
Copyright
The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for your own use and to
use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional on the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its WWW home page: http://www.ep.liu.se/
© Sathish Kottravel
Abstract
There are two types of structures in the human body: solid organs and hollow, membrane-like organs. The brain, the liver and other soft tissues such as tendons, muscles and cartilage are examples of solid organs; the colon and blood vessels are examples of hollow organs. The two types differ greatly in structure and mechanical behavior. Deformation of these structures is an important phenomenon in medical simulation.
The primary focus of this project is the deformation of soft tissues, which usually undergo large deformations. The deformation of an organ can be considered as the mechanical response of that organ during medical simulation. This can be modeled using continuum mechanics and FEM. Irrespective of the methods and models chosen, a simulation system must provide real-time response to obtain sufficient realism and accurate information; a medical training system with haptic feedback is one such example. In the past two decades many models have been developed, but very few consider the non-linear nature of both the material and the geometry of solid organs. TLED is one of them. This finite element formulation proposed by Miller in 2007, known as the total Lagrangian explicit dynamics (TLED) algorithm, will be discussed from an implementation point of view, deploying GPU acceleration (owing to the algorithm's largely parallel nature) for both pre-processing and the actual computation.
Acknowledgments
I would like to sincerely thank the examiner of this project for providing the opportunity, and the supervisor for the support, guidance and many thoughtful discussions during the development of the project. I owe my sincere gratitude to my friends in Sweden and India. Finally, my parents and grandparents, who are the reason behind every step that I take on the path of knowledge. Last but not least, my sister Priya, who filled my shoes during the time when I was most needed in my family.
Norrköping, April 2012
Sathish Kottravel
Contents

Notation

1 Introduction
1.1 Challenges in medical simulation
1.2 Non-Linear models
1.3 Contribution
1.4 Structure of this report

2 Background
2.1 Finite Element Equilibrium
2.2 Element Stress, Element Strain and Element Nodal Forces
2.2.1 Forces represented as stress and strain
2.2.2 Deformation Gradient
2.2.3 Second Piola-Kirchhoff stress
2.2.4 Strain-displacement matrix
2.2.5 Shape Functions h
2.3 Time integration

3 Theory
3.1 Updated Lagrangian vs Total Lagrangian
3.2 TLED Formulation
3.3 TLED in CUDA
3.4 CUDA Blocks, Grids and Warps
3.5 Two kernel approach
3.6 Scatter-writes and Atomic-writes
3.6.1 Scatter-writes
3.6.2 Atomic-writes
3.7 Single kernel approach

4 Implementation
4.1 TLED in haptics environment
4.2 Element Technology
4.3 Pre-Computation
4.3.1 Transfer of mesh data to GPU memory
4.3.2 Compute Volume and Neighbour count
4.3.3 Lame A, B, C
4.3.4 Shape function Derivative
4.3.5 Allocate Input and Output data memory
4.3.6 Total active GPU memory in use
4.4 Two Kernel approach
4.4.1 Kernel for Element Loop
4.4.2 Kernel for Node Loop
4.4.3 Swapping displacement array
4.5 Atomic writes in Two kernel approach

5 Results
5.1 GPU Memory measure
5.1.1 Total Element Count vs GPU Global Memory
5.1.2 Mesh statistics
5.2 Performance measure
5.2.1 Scattered-writes
5.2.2 Atomic-writes
5.2.3 Pageable vs Non-Pageable
5.2.4 Total Time Comparison
5.2.5 Block Size Comparison

6 Summary and Conclusion
6.1 Summary
6.2 Conclusion

7 Future Work
7.1 TLED Enhancements
7.2 GPU enhancements

A Appendix A
A.1 Development Environment
A.1.1 Hardware Configurations
A.1.2 Softwares and APIs

B Appendix B
B.1 Atomic Float using atomicCAS

Bibliography
Notation

Some common quantities:

h: Shape functions of a hexahedral or tetrahedral element
h_{a,i}: Partial derivative of h along dimension i, where a can be, for example, 1, 2, 3, ... up to the number of nodes
\ddot{U}: Acceleration
\dot{U}: Velocity
U: Displacement
Z_{ii}: Vector
Z^T: Transpose of a vector or matrix Z
{}^t_0 Z^{(a)}_{ij}: Rank-2 tensor or 3x3 matrix Z at the current configuration t with respect to the reference configuration 0, where a can be 1, 2, ... up to the number of nodes per element
\delta_{ij}: Kronecker delta
I: Identity matrix
1 Introduction
Mesh generation is one of the main problems in biomechanical modelling. An ideal computer-based medical simulator uses patient-specific data obtained through imaging techniques such as MRI and CT for mesh generation. There are many robust algorithms for generating tetrahedral meshes that are fast and accurate. This is not the case for hexahedral meshes, which are mostly generated using template-based techniques.
Traditional template-based mesh generation algorithms can only generate healthy organs. Severely pathological or diseased organs cannot be modelled accurately using template-based mesh generation techniques. Such diseased organs change in position, shape and size from patient to patient, which can hugely influence the mechanical behavior; for example, they are mostly stiffer and often more deformed in shape than healthy organs [O Comas][G.R. Joldes et al]. Such variations in organ structure can effectively be represented using a tetrahedral mesh.
As a consequence, tetrahedral models became popular, since tetrahedral meshes are easy to generate with little effort. It is to be noted, however, that representing a model with tetrahedral elements costs approximately five times more element data than representing it with hexahedral elements. To resolve this trade-off, models combining hexahedral and tetrahedral elements, with the former element type predominant, are preferred and considered most convenient [G.R. Joldes et al.]. In this project, however, purely hexahedral and purely tetrahedral models are considered for simplicity; mixed hexahedral/tetrahedral element models have been ignored. The hexahedron and tetrahedron are represented in Figure 1.1.
Figure 1.1: Node ordering of volume elements
1.1 Challenges in medical simulation
Medical simulation can be broadly divided into four areas [O Comas]:
1. Photo-realistic rendering
2. Collision detection
3. Contact modelling
4. Anatomical structure modelling and haptic feedback
The interest of this project lies in anatomical structure modelling, which will be utilized in a haptic software framework called CEngine, based on the H3D API and H3DPhysics API. The deformation of anatomical structures, especially soft tissues, is known to exhibit non-linear, anisotropic and viscoelastic behavior [Fung et al]. Traditionally, mathematical models with linear behavior were often employed due to their superior computational efficiency. These models fail to produce realistic results because the deformation is assumed to be infinitesimal [K Miller et al].
1.2 Non-Linear models
In the 1970s, the development of non-linear finite element procedures shed light on the inadequacy of linear models. Updated Lagrangian explicit dynamics was first used by Zhuang. Several other authors adopted models assuming linear geometry (tetrahedral elements) without considering volumetric locking. An efficient finite element algorithm proposed by K. Miller et al, known as total Lagrangian explicit dynamics (TLED), addressed both geometric and material non-linearity. The authors also showed that, for geometry with 8-noded hexahedral elements, TLED requires 35% fewer floating point operations per element per time step in comparison with the updated Lagrangian algorithm. Another key challenge, apart from choosing an appropriate model, is computational efficiency within the context of real-time response. It is rational to expect that an increased percentage of hexahedral elements leads to better geometric representation, more accurate results and considerably easier mesh generation procedures, but with higher computational cost as the bottleneck. This problem has been addressed with the aid of NVIDIA's CUDA, a general purpose parallel computing architecture [G.R. Joldes][O Comas]. A basic implementation of the TLED algorithm using a GPGPU framework was presented by Z. Taylor et al. Later, with some additional features (to eliminate limitations in the prior method), G.R. Joldes et al proposed a CUDA based implementation. This forms the basis of this project.
1.3 Contribution
This project's aim is to study the TLED algorithm, implement it using the CUDA architecture in our new framework CEngine, and discuss how optimal performance of the parallel implementation can be achieved as future work. The detailed description of the TLED algorithm for soft tissue deformation was given by K Miller et al. Later, G.R Joldes et al proposed a TLED implementation on the GPU using CUDA. A recent implementation of TLED can be found in the SOFA framework by O Comas et al. With these as references and motivation, soft tissue deformation has been implemented in our new framework.
To sum up, the project concerns a by now popular numerical solution for dealing with non-linearity during soft tissue deformation, in order to obtain real-time interaction, using the FEM based TLED formulation. Change in topology is not considered in this project. Tetrahedral and hexahedral elements are the primary choice of element technology, as they are easy to implement individually; mixed element technology also causes GPU threads to diverge. Effective use of mixed element technology, computation of the hourglass force vector [G.R. Joldes et al] and simulation of anisotropy and viscoelasticity are considered future enhancements to the project.
1.4 Structure of this report
This report begins with brief background information and an introduction to the TLED formulation. The following chapters discuss only the essential steps and equations which have been implemented in this project. The implementation chapter describes the algorithm with reference to the equations in the theory section and discusses several significant factors involved in the TLED implementation. The results are presented at the end, comparing the performance of the different implementations. The appendix section contains information about the development environment and code snippets. After reading this report, the reader should have:
1. A general introduction to the TLED formulation.
2. An overview of the GPU implementation of TLED for hexahedral and tetrahedral elements.
3. An understanding of the Two kernel approach implemented with the scatter-writes and atomic-writes methods.
4. A performance comparison of those two methods.
5. A good grip on the computationally intense part of the algorithm.
6. The proposed modification to the Two kernel approach and other future work.
2 Background
2.1 Finite Element Equilibrium
The principle of virtual work can be defined as the fundamental equation that defines the equilibrium of the body in FEM. In the total Lagrangian formulation this is represented as

\int_{{}^0V} {}^t_0 S_{ij} \, \delta \, {}^t_0 E_{ij} \, d\,{}^0V = \int_{{}^0V} {}^t_0 f^B_i \, \delta u_i \, d\,{}^0V + \int_{{}^0S_f} {}^t_0 f^S_i \, \delta u^S_i \, d\,{}^0S \quad (2.1)

where the left hand side of equation 2.1 represents the internal virtual work and the right hand side the external virtual work due to body and surface forces. All integrals are performed over the un-deformed configuration. Discretising the body and interpolating using the relevant shape functions leads to the standard finite element equation of equilibrium:

M\ddot{U} + D\dot{U} + K(U)U = R \quad (2.2)

where M is a (constant) mass matrix, D is a (constant) damping matrix, K(U) is the stiffness matrix, which is a function of the global nodal displacements U, and R is the external load.
2.2 Element Stress, Element Strain and Element Nodal Forces

2.2.1 Forces represented as stress and strain
Finite deformation requires an incremental solution, i.e., evaluating the equilibrium positions of the complete body at discrete times \Delta t, 2\Delta t, ..., t. In order to solve for the unknown displacements U at each time step, a numerical time integration scheme is required. Miller et al in their paper suggested using an explicit method (the central difference method), as this allows the stiffness term in equation 2.2 to be computed as

K(U)U = F(U) = \sum_e \tilde{F}^{(e)} \quad (2.3)

where \tilde{F}^{(e)} are the global nodal force contributions due to stress in element e. For a given element at time t, \tilde{F}^{(e)} can be calculated from the second Piola-Kirchhoff stress and the strain-displacement matrix as

{}^t\tilde{F}^{(e)} = \int_{{}^0V} ({}^t_0 B_L)^T \, {}^t_0\hat{S} \, d\,{}^0V \quad (2.4)

where {}^t_0\hat{S} = [{}^t_0 S_{11} \; {}^t_0 S_{22} \; {}^t_0 S_{33} \; {}^t_0 S_{12} \; {}^t_0 S_{23} \; {}^t_0 S_{13}]^T is the vector form of the second Piola-Kirchhoff stress, which has tensor symmetry, ({}^t_0 B_L)^T is the transposed strain-displacement matrix and {}^0V is the initial volume of the element. This implies that {}^tK(U) need not be assembled; instead it can be computed at element level, where the nodal force contributions are known. This is a decisive advantage for the GPU implementation, since element and nodal forces may be computed in parallel. The stress tensor and strain-displacement matrix are calculated in the following subsections 2.2.3 and 2.2.4. Before computing stress and strain, however, it is necessary to compute the deformation gradient, as in the following section.
2.2.2 Deformation Gradient
The computation of the deformation gradient starts with the displacement derivative matrix {}^t_0\partial u_x with respect to the global co-ordinates:

{}^t_0\partial u_x = \begin{bmatrix} {}^t_0 u_{1,1} & {}^t_0 u_{1,2} & {}^t_0 u_{1,3} \\ {}^t_0 u_{2,1} & {}^t_0 u_{2,2} & {}^t_0 u_{2,3} \\ {}^t_0 u_{3,1} & {}^t_0 u_{3,2} & {}^t_0 u_{3,3} \end{bmatrix} \quad (2.5)

and

{}^t_0 u_{i,j} = \frac{\partial\,{}^t u_i}{\partial\,{}^0 x_j} = \sum_{a=1}^{N} \frac{\partial h_a}{\partial x_j} \, {}^t u_{ai} \quad (2.6)

where h_a is the shape function of node a in an element with N nodes, and u_a is the displacement. The deformation gradient can be represented as

{}^t_0 X = {}^t_0\partial u_x + I \quad (2.7)

where I is the identity matrix.
2.2.3 Second Piola-Kirchhoff stress
The deformation is measured by the deformation gradient tensor X_{ij}:

{}^t_0 X_{ij} = \frac{\partial\,{}^t x_i}{\partial\,{}^0 x_j} \quad (2.8)

where {}^0x_j is the original configuration and {}^tx_i is the deformed configuration at time t. The deformation gradient describes the stretch and rotation that the material fibers undergo from time 0 to t. From the deformation gradient we may obtain the right Cauchy-Green deformation tensor {}^t_0 C:

{}^t_0 C = {}^t_0 X^T \, {}^t_0 X \quad (2.9)

and subsequently the Green-Lagrange strain tensor {}^t_0 E:

{}^t_0 E = \frac{1}{2}\left({}^t_0 C - I\right) \quad (2.10)

where I is the rank-2 identity tensor. This strain measure is central to the total Lagrangian formulation. Equation 2.4, however, requires a stress measure. It is known that the Cauchy or true stress {}^t\tau is force per unit area in the deformed geometry. Therefore a stress measure is defined that is work-conjugate with the Green-Lagrange strain. This stress is known as the second Piola-Kirchhoff stress {}^t_0 S, defined as

{}^t_0 S = \frac{{}^0\rho}{{}^t\rho} \, {}^0_t X \, {}^t\tau \, {}^0_t X^T \quad (2.11)

where {}^0_t X = ({}^t_0 X)^{-1} is the inverse deformation gradient. The mass density ratio in equation 2.11 is given by

\frac{{}^0\rho}{{}^t\rho} = {}^tJ = \det({}^t_0 X) \quad (2.12)

Using the simplest hyperelastic model, the neo-Hookean model, the stress components can be obtained as

{}^t_0 S_{ij} = \mu\left(\delta_{ij} - {}^t_0 C^{-1}_{ij}\right) + \lambda \, {}^tJ\left({}^tJ - 1\right) {}^t_0 C^{-1}_{ij} \quad (2.13)

where \mu and \lambda are the Lamé material parameters (\mu the shear modulus and \lambda Lamé's first parameter).
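As a concrete illustration, a minimal CUDA device-function sketch of equation 2.13 follows; the function name and array layout are illustrative, not taken from the project code:

    // Sketch of eq. (2.13): neo-Hookean 2nd Piola-Kirchhoff stress from the
    // inverse right Cauchy-Green tensor Cinv and J = det(X).
    __device__ void neoHookeanStress(const float Cinv[3][3], // inverse of C
                                     float J,                // gradient Jacobian
                                     float mu, float lambda, // Lame parameters
                                     float S[3][3])          // output stress
    {
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) {
                float delta = (i == j) ? 1.0f : 0.0f;        // Kronecker delta
                S[i][j] = mu * (delta - Cinv[i][j])
                        + lambda * J * (J - 1.0f) * Cinv[i][j];
            }
    }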
2.2.4 Strain-displacement matrix
The strain-displacement matrix ({}^t_0 B_L) is computed as follows:

{}^t_0 B_L^{(a)} = {}^0 B_{L0}^{(a)} \, {}^t_0 X^T \quad (2.14)

where {}^t_0 X^T is the transposed deformation gradient, a ranges from 1 to N, N being the number of nodes per element, and {}^0 B_{L0}^{(a)} is represented in terms of the spatial derivatives of the element shape functions h:

{}^0 B_{L0}^{(a)} = \begin{bmatrix} h_{a,1} & 0 & 0 \\ 0 & h_{a,2} & 0 \\ 0 & 0 & h_{a,3} \\ h_{a,2} & h_{a,1} & 0 \\ 0 & h_{a,3} & h_{a,2} \\ h_{a,3} & 0 & h_{a,1} \end{bmatrix} \quad (2.15)

where the sub-scripted commas denote partial differentiation with respect to the x, y, z co-ordinates, i.e.,

{}^0 h_{a,i} = \frac{\partial h_a}{\partial\,{}^0 x_i} \quad (2.16)
2.2.5 Shape Functions h
Derivatives of the shape functions are required in both the stress and strain calculations explained in the previous sections. The shape function derivatives of tetrahedral and hexahedral elements are presented in the following sections.
Shape function derivatives of Tetrahedral Elements
The shape function derivatives h1, h2, h3, h4, corresponding to the 4 nodes of a tetrahedral element, are computed as follows.
Let (x_n, y_n, z_n), n = 1 to 4, be the element node co-ordinates of the tet element. Then

k = \det(N) = \det\begin{bmatrix} 1 & 1 & 1 & 1 \\ x_1 & x_2 & x_3 & x_4 \\ y_1 & y_2 & y_3 & y_4 \\ z_1 & z_2 & z_3 & z_4 \end{bmatrix} \quad (2.17)

h_{1x} = \frac{1}{k}\,\det\begin{bmatrix} 1 & 1 & 1 \\ y_2 & y_3 & y_4 \\ z_2 & z_3 & z_4 \end{bmatrix} \quad (2.18)

h_{1y} = -\frac{1}{k}\,\det\begin{bmatrix} 1 & 1 & 1 \\ x_2 & x_3 & x_4 \\ z_2 & z_3 & z_4 \end{bmatrix} \quad (2.19)

h_{1z} = \frac{1}{k}\,\det\begin{bmatrix} 1 & 1 & 1 \\ x_2 & x_3 & x_4 \\ y_2 & y_3 & y_4 \end{bmatrix} \quad (2.20)

In the above equations, h_{1x}, h_{1y}, h_{1z} consist of the co-factors of x_1, y_1, z_1 respectively (belonging to the matrix in 2.17). h_{2x}, h_{2y}, h_{2z} are obtained from the co-factors of x_2, y_2, z_2, and similarly h_{3x}, h_{3y}, h_{3z} and h_{4x}, h_{4y}, h_{4z} are obtained.
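A small host-side sketch of equations 2.17 to 2.20 for node 1 is given below; names are illustrative, and h2 to h4 follow from the remaining co-factors in the same way. The determinant k is evaluated via column differences of the 4x4 matrix, which gives the same value:

    // Shape-function derivatives of node 1 of a tetrahedron (eqs 2.17-2.20).
    static float det3(float a1, float a2, float a3,
                      float b1, float b2, float b3,
                      float c1, float c2, float c3)
    {
        return a1*(b2*c3 - b3*c2) - a2*(b1*c3 - b3*c1) + a3*(b1*c2 - b2*c1);
    }

    void tetShapeDerivNode1(const float x[4], const float y[4], const float z[4],
                            float h1[3])
    {
        // k = det(N) of eq (2.17), via column differences of the 4x4 determinant
        float k = det3(x[1]-x[0], x[2]-x[0], x[3]-x[0],
                       y[1]-y[0], y[2]-y[0], y[3]-y[0],
                       z[1]-z[0], z[2]-z[0], z[3]-z[0]);

        // Co-factors of x1, y1, z1 (eqs 2.18-2.20)
        h1[0] =  det3(1.f, 1.f, 1.f,  y[1], y[2], y[3],  z[1], z[2], z[3]) / k;
        h1[1] = -det3(1.f, 1.f, 1.f,  x[1], x[2], x[3],  z[1], z[2], z[3]) / k;
        h1[2] =  det3(1.f, 1.f, 1.f,  x[1], x[2], x[3],  y[1], y[2], y[3]) / k;
    }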
Shape function derivatives of Hexahedral Elements
The shape function derivatives of a hexahedral element have 8 components: h1, h2, h3, h4, h5, h6, h7, h8 are computed and stored for each element using the following shape function matrix for the hex element. Later on this will be used to calculate the deformation gradient.
Let (x_n, y_n, z_n), n = 1 to 8, be the element node co-ordinates and (x_c, y_c, z_c) the centroid of the hex element (Figure 2.1). Let

\xi = \frac{x_n - x_c}{a}, \quad \eta = \frac{y_n - y_c}{b}, \quad \zeta = \frac{z_n - z_c}{c} \quad (2.21)
The shape function can be represented as

H_k = \frac{1}{8}(1 + \xi_k\xi)(1 + \eta_k\eta)(1 + \zeta_k\zeta) \quad (2.22)
Figure 2.1: Shape function construction for nodes 1 through 8

where \xi_k, \eta_k, \zeta_k take the values -1.0 or 1.0 depending on the node k. The shape function derivatives are then

\frac{\partial H_k}{\partial x} = \frac{\partial H_k}{\partial \xi}\,\frac{\partial \xi}{\partial x}; \quad \frac{\partial H_k}{\partial y} = \frac{\partial H_k}{\partial \eta}\,\frac{\partial \eta}{\partial y}; \quad \frac{\partial H_k}{\partial z} = \frac{\partial H_k}{\partial \zeta}\,\frac{\partial \zeta}{\partial z} \quad (2.23)

\frac{\partial H_k}{\partial x} = \frac{1}{8}\,\xi_k(1+\eta_k\eta)(1+\zeta_k\zeta)\,\frac{1}{a} = \frac{1}{8abc}\,bc\,\xi_k(1+\eta_k\eta)(1+\zeta_k\zeta) \quad (2.24)

\frac{\partial H_k}{\partial y} = \frac{1}{8}\,\eta_k(1+\xi_k\xi)(1+\zeta_k\zeta)\,\frac{1}{b} = \frac{1}{8abc}\,ac\,\eta_k(1+\xi_k\xi)(1+\zeta_k\zeta) \quad (2.25)

\frac{\partial H_k}{\partial z} = \frac{1}{8}\,\zeta_k(1+\xi_k\xi)(1+\eta_k\eta)\,\frac{1}{c} = \frac{1}{8abc}\,ab\,\zeta_k(1+\xi_k\xi)(1+\eta_k\eta) \quad (2.26)
Expanding for k = 1, 2, ..., 8 and writing in matrix form, where each row h_k collects the three spatial derivatives of H_k,

\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_8 \end{bmatrix} = \begin{bmatrix} \partial H_1/\partial x & \partial H_1/\partial y & \partial H_1/\partial z \\ \partial H_2/\partial x & \partial H_2/\partial y & \partial H_2/\partial z \\ \vdots & \vdots & \vdots \\ \partial H_8/\partial x & \partial H_8/\partial y & \partial H_8/\partial z \end{bmatrix} \quad (2.27)
From the above three equations,

\begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \\ h_5 \\ h_6 \\ h_7 \\ h_8 \end{bmatrix} = \frac{1}{8abc}\begin{bmatrix}
-bc(1-\eta)(1+\zeta) & -ac(1-\xi)(1+\zeta) & ab(1-\xi)(1-\eta) \\
bc(1-\eta)(1+\zeta) & -ac(1+\xi)(1+\zeta) & ab(1+\xi)(1-\eta) \\
bc(1+\eta)(1+\zeta) & ac(1+\xi)(1+\zeta) & ab(1+\xi)(1+\eta) \\
-bc(1+\eta)(1+\zeta) & ac(1-\xi)(1+\zeta) & ab(1-\xi)(1+\eta) \\
-bc(1-\eta)(1-\zeta) & -ac(1-\xi)(1-\zeta) & -ab(1-\xi)(1-\eta) \\
bc(1-\eta)(1-\zeta) & -ac(1+\xi)(1-\zeta) & -ab(1+\xi)(1-\eta) \\
bc(1+\eta)(1-\zeta) & ac(1+\xi)(1-\zeta) & -ab(1+\xi)(1+\eta) \\
-bc(1+\eta)(1-\zeta) & ac(1-\xi)(1-\zeta) & -ab(1-\xi)(1+\eta)
\end{bmatrix} \quad (2.28)






2.3 Time integration
Assume that {}^tU and {}^{t-\Delta t}U are the current and previous displacements, and {}^tF the computed nodal (internal body) forces. Then the central difference method gives the displacement at the next time step t + \Delta t as

{}^{t+\Delta t}U_i = \frac{{}^tR_i - {}^tF_i + \frac{2M_{ii}}{\Delta t^2}\,{}^tU_i + \left(\frac{D_{ii}}{2\Delta t} - \frac{M_{ii}}{\Delta t^2}\right){}^{t-\Delta t}U_i}{\frac{D_{ii}}{2\Delta t} + \frac{M_{ii}}{\Delta t^2}} \quad (2.29)

We get equation (2.29) from the acceleration \ddot{U}, the velocity \dot{U} and the stiffness term K(U)U = {}^tF. These central difference approximations,

\ddot{U} = \frac{{}^{t+\Delta t}U_i - 2\,{}^tU_i + {}^{t-\Delta t}U_i}{\Delta t^2} \quad (2.30)

\dot{U} = \frac{{}^{t+\Delta t}U_i - {}^{t-\Delta t}U_i}{2\Delta t} \quad (2.31)

together with the stiffness term from equation (2.3), are substituted into equation (2.2). Equation (2.29) can be further simplified for easy computation as follows (represented in vector form):

A_i = \frac{1}{\frac{D_{ii}}{2\Delta t} + \frac{M_{ii}}{\Delta t^2}} \quad (2.32)

B_i = \frac{2M_{ii}}{\Delta t^2}\,A_i \quad (2.33)

C_i = \frac{D_{ii}}{2\Delta t}\,A_i - \frac{B_i}{2} \quad (2.34)

Therefore equation (2.29) can be written as

{}^{t+\Delta t}U_i = A_i\left({}^tR_i - {}^tF_i\right) + B_i\,{}^tU_i + C_i\,{}^{t-\Delta t}U_i \quad (2.35)

If the damping terms are neglected, equation (2.29) reduces to

{}^{t+\Delta t}U_i = \frac{\Delta t^2}{M_{ii}}\left({}^tR_i - {}^tF_i\right) + 2\,{}^tU_i - {}^{t-\Delta t}U_i \quad (2.36)

Equation (2.36) was formulated by Miller et al; Taylor notes that the damped form involves only one additional multiplication per displacement over the undamped system. Explicit time integration operators are only conditionally stable, meaning a small time step \Delta t is required. It must be smaller than a critical limit \Delta t_{cr}:

\Delta t_{cr} = \frac{L_e}{c} \quad (2.37)

where L_e is the smallest element length and c is a constant known as the dilatation wave speed of the material, given by

c = \sqrt{\frac{E(1-\upsilon)}{\rho(1+\upsilon)(1-2\upsilon)}} \quad (2.38)

where E is Young's modulus and \upsilon is Poisson's ratio.
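A minimal host-side sketch of how the constants of equations 2.32 to 2.34, 2.37 and 2.38 could be pre-computed is shown below. It assumes, for brevity, a lumped mass M and damping D per node; names are illustrative:

    #include <math.h>

    // Pre-compute the time step and central-difference constants A, B, C.
    void timeIntegrationConstants(float E, float nu, float rho,
                                  float Le,          // smallest element edge
                                  float M, float D,  // lumped mass, damping
                                  float* dt, float* A, float* B, float* C)
    {
        float c = sqrtf(E * (1 - nu) / (rho * (1 + nu) * (1 - 2*nu)));  // eq 2.38
        *dt = 0.9f * (Le / c);          // eq 2.37: stay below the critical limit
        float a = 1.0f / (D / (2 * *dt) + M / (*dt * *dt));             // eq 2.32
        *A = a;
        *B = (2 * M / (*dt * *dt)) * a;                                 // eq 2.33
        *C = (D / (2 * *dt)) * a - *B / 2;                              // eq 2.34
    }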
3 Theory
3.1 Updated Lagrangian vs Total Lagrangian
In the updated Lagrangian formulation all derivatives refer to the current configuration of the system. This is the traditional method, used extensively in most commercial finite element software. Cauchy stress and Almansi (or logarithmic) strain are suitable for this formulation. The advantage of this approach is the simple incremental strain description. The disadvantage is that at each time step the derivatives need to be recomputed, because the reference configuration is changing. The reasons for this approach are historical: memory was expensive and posed more of a problem than computation speed [K. Miller et al].
In the total Lagrangian formulation all variables and derivatives refer to the original configuration of the system. Second Piola-Kirchhoff stress and Green strain are suitable for this kind of configuration (Figure 3.1). The only disadvantage is the complicated description of finite strains resulting from the initial-displacement effect. The major advantage, however, is that all derivatives can be pre-computed; therefore the proposed algorithm performs fewer mathematical operations. A complete description and comparison of formulations such as the updated Lagrangian lies beyond the scope of this project. Hence we will begin directly with TLED.
3.2 TLED Formulation
We begin with a brief description of the TLED algorithm. As mentioned before, TLED is a geometrically and materially non-linear dynamic finite element method; it is a non-linear formulation for large deformation. The equation of equilibrium includes both damping and inertial terms. Briefly, TLED can be described in the following steps, consisting of two phases, pre-computation and a time loop phase [O Comas]:
1. Apply the initial displacements and boundary conditions.
2. For all the elements in the model compute:
(a) displacement derivatives and the deformation gradient;
(b) the right Cauchy-Green deformation tensor and strain-displacement matrix;
(c) the 2nd Piola-Kirchhoff stress;
(d) the element nodal force contributions, added to the global nodal forces.
3. For all the nodes in the model compute the new displacement using the central difference method.

Figure 3.1: TLED referring to the original configuration (0) from the current configuration (t) in order to compute the next configuration t+\Delta t (after Klaus-Jürgen Bathe et al).
3.3 TLED in CUDA
An efficient CUDA implementation of TLED is governed by several factors such as kernel organization, CPU-GPU interaction, memory usage and the chosen element technology. O Comas in his paper explained three approaches to a CUDA implementation of TLED:
1. Scatter as gather
2. Shared memory
3. Atomic-writes

Figure 3.2: Flow chart of the finite element algorithm with TLED for computing soft organ deformation developed at ISML. (Source: http://www.namic.org/Wiki/index.php/Collaboration:UWA-Perth)

Scatter as gather uses texture memory and is basically a two kernel approach: the first kernel operates over the elements in the model and the second kernel operates over the nodes. The second approach uses shared memory and a single kernel. The third method is meant for the latest GPU architecture, called Fermi; this generation of GPU cards added support for atomic-writes on floats. In this project we study the first and third methods extensively.
Another approach was proposed by G.R. Joldes as an enhancement of their previous method, which handles areas with different non-linear materials, different element types (linear hexahedron with hourglass control, linear tetrahedron and non-locking tetrahedron), time step control, steady state solution and contact information. This is slightly beyond the scope of this project, but it is still worth looking at for a better understanding of the system. G.R. Joldes proposed the following algorithm for computing the steady state solution on the GPU, with multiple kernels each serving a purpose:
1. Initialization:
• Let q0 = 0 and q1 = 0 be displacements.
• Compute parameters ρ, c, h and the mass matrix M that provide maximum convergence rate.
• Pre-compute all other needed quantities (shape functions, hourglass shape vectors, initial volumes, etc.)
Figure 3.3: CUDA kernels are executed as groups of threads within a grid. (Source: NVIDIA)
• Transfer all needed data to GPU memory.
2. For every iteration step:
• Compute the nodal forces corresponding to the current displacements, P(q^n):
– Run the kernel that computes the element pressure.
– Run the kernel that computes the nodal pressure.
– For each element type:
* Run the kernel that computes the nodal forces and saves them in the GPU memory. These kernels also check the maximum eigenvalue and change the mass of the element if needed (the mass is saved in the GPU memory). One of the kernels finalizes the reduction of the displacement variation norm.
• Read the displacement variation norm from the GPU.
• Compute the next displacement vector:
– Run the kernel that computes the next displacements. This kernel assembles the force vector and mass matrix, and computes the new displacement using

q^{n+1} = q^n + \beta(q^n - q^{n-1}) + \alpha M^{-1}(f - P(q^n)) \quad (3.1)

where \alpha = 2h^2/(2 + ch), \beta = (2 - ch)/(2 + ch), h is the fixed time increment, n indicates the nth time increment, c is the damping coefficient, M is the mass matrix, q is the displacement vector, P is the vector of internal nodal forces and f is the vector of externally applied forces (volumetric forces, surface forces, nodal forces as well as forces derived from contacts). This equation, obtained after including the central difference derivatives in the damped equation of motion, is explicit as long as the mass matrix is diagonal.
• Run the kernel that enforces the contacts.
• Check the convergence criteria and finish the analysis if met, using

\|q^{n+1} - q^*\|_\infty \leq \frac{\rho}{1-\rho}\,\|q^{n+1} - q^n\|_\infty \leq \delta \quad (3.2)

where \rho is the spectral radius of a matrix representing the reduction in error and \delta is the imposed accuracy.
3. Read the final displacement from the GPU.
4. Clean up the GPU memory.
3.4 CUDA Blocks, Grids and Warps
The GPU, together with the CUDA API, can be used to perform Single Instruction, Multiple Data computation by distributing work across a high number of parallel execution threads. CUDA has two hierarchical levels: blocks, which are groups of threads executed on one of the GPU's multiprocessors, and grids, which are groups of blocks launched concurrently on the device and which all execute the same kernel (Figure 3.3).
The kernel dimensions are chosen so as to optimize the usage of the available computational resources on the GPU. Global memory latency can be hidden by balancing the memory required by the kernels: the memory requirement of a kernel determines the number of threads that can run concurrently on a multiprocessor. Time slicing allows more than one block to be executed on a single multiprocessor, hiding memory latency; the multiprocessor can switch processing between blocks while others are busy accessing memory. This is not possible if only one block is executing. Hence one should preferably launch several smaller blocks rather than one single large block, provided both configurations make use of the same multiprocessor memory resources [O Comas]. Tools such as the occupancy calculator are available from NVIDIA for estimating the optimal execution configuration and fine tuning the kernel configuration experimentally. Understanding warps in CUDA is crucial for such optimization: the threads of a thread block execute concurrently on one multiprocessor in groups of 32 parallel threads called warps. Individual threads in a warp start together at the same program address but are otherwise free to branch and execute independently. Full efficiency is realized when all threads in a warp agree on their execution path; otherwise the different branches in a warp are executed serially.
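The following toy kernel, not part of the project, illustrates the block/grid hierarchy and the usual rounding of the grid size so that every data element gets a thread:

    // One thread per array element, grouped into blocks of blockSize threads.
    __global__ void scale(float* data, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) data[i] *= s;                        // guard the overshoot
    }

    void launchScale(float* d_data, float s, int n)
    {
        int blockSize = 128;                            // a multiple of the warp size (32)
        int gridSize  = (n + blockSize - 1) / blockSize;
        scale<<<gridSize, blockSize>>>(d_data, s, n);
    }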
3.5 Two kernel approach
CUDA allows scattered writes in theory, but it does not support conflict management between threads. The major steps of the TLED algorithm are the following [O Comas]:
1. Compute the element nodal forces, i.e., for each node of each element calculate the force contribution the element has on that node.
2. Obtain the global force for each node by summing up the force contributions of all elements to which the node is attached.
3. Compute the displacement for each node by integration using the central difference method.
This approach is mainly based on global memory and texture fetches. In the CUDA implementation many elements are processed in parallel (one element per thread). A node can be shared by multiple elements: after calculating the element nodal forces, the result is written into the global node array corresponding to each node. This causes conflicts between threads; in other words, for elements sharing the same node, several threads try to access the same global node array location.
In this two kernel approach, the first kernel operates over all the elements in the model and computes the element stress based on the current model configuration. Then the nodal force contribution is calculated and written into global memory. The second kernel operates over the nodal force array calculated by the first kernel and sums the contributions for each node. Finally it computes the displacement using the central difference solver.
3.6 Scatter-writes and Atomic-writes
Both methods in the Two kernel approach operate on the global memory of the GPU. In the first method, each thread registers its nodal force contributions in unique memory locations. In the second method, multiple threads may access the same memory location, which is resolved using the atomic write operations supported by CUDA.
3.6.1 Scatter-writes
Each thread processes an element and computes its nodal force contributions. The computed forces are then stored in unique memory locations corresponding to each node. This idea is described in detail in the implementation chapter. The unique location indices can be pre-computed and passed on to the kernels. The memory access pattern involved is illustrated in Figure 3.4, where each thread generates 4 nodal forces at unique locations. In other words, this avoids race conditions between threads.
3.6.2 Atomic-writes
In this method, we do not use pre-computed unique locations; we allow threads to access the same memory location. In other words, we allow race conditions to occur, but resolve them efficiently with atomic operations. The atomic operations supported by the GPU hardware differ among the various classes of GPUs. Atomic-writes are somewhat inefficient, but easy to implement. The approach is illustrated in Figure 3.5.
3.7 Single kernel approach
The main problem in the previous approach is divergence of warps in the first kernel. Transferring the data to global memory at the end of the first kernel is also considered unnecessary. Hence, in this approach a single kernel operates on the nodes and the elements to which each node belongs. These elements can be stored in shared memory. Since the element information may not fit in shared memory, the nodes are divided into several chunks during pre-computation and these chunks are sent to the GPU serially from the CPU. It is also necessary to know the neighbour information, which can be found during pre-computation. The only disadvantage is that the element nodal force contributions are computed more than once for certain elements. In spite of this additional computation, this approach can still outperform the previous one by avoiding global memory accesses [O Comas].

Figure 3.4: Scatter-writes, where t0, t1, ... are threads, force_n0, force_n1, ... are force vectors and e0, e1, ... are elements. Element Loop is a kernel that operates on elements and generates nodal forces; the following Node Loop is a kernel that operates on the computed forces.

Figure 3.5: Atomic-writes, where t0, t1, ... are threads, force_n0, force_n1, ... are force vectors and e0, e1, ... are elements. Element Loop is a kernel that operates on elements and generates nodal forces; the following Node Loop is a kernel that operates on the computed forces.
4 Implementation
4.1 TLED in haptics environment
This project is built on the CEngine framework. CEngine is built on several libraries in the H3D API, including the H3DPhysics library, and has been serving as an interface for several linear solvers that use a FEM strategy. This pre-existing platform has been used to access the mesh information, with the force feedback from the haptics device acting as an external force on each node. The deformation is represented as displacements of these nodes; the displacements and forces are represented as vectors. The project has been tested using SenseGraphics PHANToM devices interfaced with the VORTEX Workbench, which supports stereo display enabled using dual NVidia graphics cards. The detailed hardware configuration can be found in Appendix A.
4.2 Element Technology
In this project, tetrahedral and hexahedral element models are tested. It is to be noted that the algorithm does not automatically differentiate between these mesh types; the element type must be explicitly specified so that the corresponding implementation is invoked. As mentioned in earlier chapters, mixed element types have been ignored. The element orientation and node ordering are illustrated in Figure 1.1. gmsh and tetgen are the tools that have been employed to generate the meshes. The element type is communicated to the algorithm through an external parameter.
Figure 4.1: CUDA integrated in the CEngine framework; Top left: PHANToM Premium 3.0 used in the VORTEX Workbench; Bottom left: CUDA enabled NVidia graphics card (GTX 290); (Sources: www.sensable.com, www.h3dapi.org, www.geforce.com, www.nvidia.com)
4.3 Pre-Computation
Pre-computation involves several steps, including memory allocation, data transfers and the calculation of components such as element volumes, shape function derivatives and the time step. They are executed only once, at the beginning of the algorithm, so measuring the performance of these pre-computation steps is not very significant. One can often observe a slight delay before a mesh or model is loaded due to this pre-computation. Some parts of the pre-computation are calculated in parallel (for example, the shape function derivatives) and a few other components are calculated serially.
4.3.1 Transfer of mesh data to GPU memory
The preliminary step is to transfer the mesh information and other parameters to GPU memory; the GPU kernels later operate on these data. The nodal co-ordinates and volume element indices are required to be transferred. Note that surface elements are not used by the algorithm, so one can ignore the triangle information and transfer only the tetrahedral or hexahedral element information. Throughout the project, memory allocation and synchronization between CPU and GPU memory are managed by simple memory manager objects created in this project specifically for synchronization and easy manipulation of allocated memory.
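A minimal sketch of this initial transfer might look as follows (illustrative names; error checking omitted):

    #include <cuda_runtime.h>

    // Copy nodal coordinates and tetrahedral element indices to the GPU once.
    void uploadMesh(const float3* nodes, int nNodes,
                    const int4* tets, int nElems,
                    float3** d_nodes, int4** d_tets)
    {
        cudaMalloc(d_nodes, nNodes * sizeof(float3));
        cudaMalloc(d_tets,  nElems * sizeof(int4));
        cudaMemcpy(*d_nodes, nodes, nNodes * sizeof(float3), cudaMemcpyHostToDevice);
        cudaMemcpy(*d_tets,  tets,  nElems * sizeof(int4),  cudaMemcpyHostToDevice);
    }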
4.3.2 Compute Volume and Neighbour count
After transferring the mesh data to the GPU, we can compute the volume of each tetrahedral or hexahedral element. This is the volume of the reference configuration. The volume data is computed once and used often in the nodal force computation; hence this memory can be bound to the texture memory of the GPU. It is also necessary to find the number of elements surrounding each node and obtain the maximum element count. This is used to allocate the body force (internal force) storage for each node due to its neighbour elements. In addition, it is necessary to assign an index for each node in an element; these indices map to the unique location where the force computed in an element will finally be stored. This is illustrated in a simplified 2D version in Figure 4.2. (An alternative approach is discussed in the "Future Work" chapter.) The computed data is transferred to global memory and bound to the texture memory of the GPU.

Figure 4.2: Neighbour index computation for each element
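A host-side sketch of this neighbour bookkeeping for tetrahedra is given below (illustrative names): each (element, local node) pair receives the next free slot of its node, and the final array address becomes node * max_neighbour_count + slot:

    #include <cuda_runtime.h>
    #include <vector>

    void buildForceSlots(const std::vector<int4>& tets, int nNodes,
                         std::vector<int4>& slots, int& maxNeighbours)
    {
        std::vector<int> count(nNodes, 0);
        slots.resize(tets.size());
        for (size_t e = 0; e < tets.size(); ++e) {
            const int n[4] = { tets[e].x, tets[e].y, tets[e].z, tets[e].w };
            int s[4];
            for (int a = 0; a < 4; ++a) s[a] = count[n[a]]++;  // next free slot per node
            slots[e] = make_int4(s[0], s[1], s[2], s[3]);      // local slot indices
        }
        maxNeighbours = 0;
        for (size_t i = 0; i < count.size(); ++i)
            if (count[i] > maxNeighbours) maxNeighbours = count[i];
        // final address of (node n, slot s) in the force array: n * maxNeighbours + s
    }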
4.3.3 Lame A, B, C
The A, B, C co-efficients for each node are computed using equations 2.32, 2.33 and 2.34. In our framework, nodes with uniform mass have been assumed, whereas O Comas computed the mass by summing the contributions from the individual elements, assuming the density to be constant. In this project the density at each node varies because of the surrounding neighbour elements with different volumes; in this case an average density value has been computed and used throughout the project. Calculation of these co-efficients also requires the time step, which should be less than the critical time step and can be computed using equations 2.37 and 2.38. Computing the critical time step needs the smallest element edge length, which can be measured from the supplied mesh; the necessary damping values are also supplied. In the current implementation this computation is done in parallel on the GPU, though as mentioned earlier this is not significant in the pre-computation step. The computed data resides in the global memory of the GPU and is hence bound to the texture memory.
4.3.4 Shape function Derivative
The shape function derivatives are computed as per section 2.2.5. A global memory array of size (number of elements * nodes per element) and of type float3 is allocated. As in the previous step, the shape functions are computed in parallel, and the data is stored in such a manner as to avoid the coalescing problems that generally arise in a CUDA context. The computed data resides in the global memory of the GPU, so it is only necessary to bind it to the texture memory.
4.3.5 Allocate Input and Output data memory
The algorithm is executed every time step and produces nodal displacements as the output. The input data is an external force array of type float3, representing force vectors, with array length equal to the number of nodes. Similarly, the displacement array produced as output is also a vector array. Hence the algorithm takes external forces as input data and produces displacement data at every time step. This data transfer is usually considered a bottleneck for any GPGPU computation: the input and output data need to be communicated between CPU and GPU every time step. Allocating pageable or non-pageable memory on the CPU influences the data transfer rate; both configurations have been studied and the results are presented in this project. Also, in order to calculate the displacement, it is necessary to store the displacements at the previous and current time steps. This is done by allocating additional arrays on the GPU of length and type similar to the output displacement array.
Apart from the supplied external forces, the algorithm also produces internal body forces acting on each node. The number of internal forces acting on a node varies depending on the number of neighbour elements, as mentioned earlier. Allocating a varying number of forces for each node is not a wise choice, as this would affect coalescing. Hence each node is allocated a force array of length max_neighbour_count; in total there will be max_neighbour_count times the number of nodes. This array is traversed with the help of the mapping neighbour indices (computed previously in section 4.3.2) in later steps.
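Since these per-step buffers are transferred every time step, they are good candidates for page-locked (non-pageable) host memory; a minimal sketch with illustrative names follows (see the comparison in section 5.2.3):

    #include <cuda_runtime.h>

    // cudaHostAlloc returns page-locked host memory, which transfers faster
    // than pageable malloc'd memory; release it with cudaFreeHost.
    void allocStepBuffers(int nNodes, float3** h_extForce, float3** h_disp)
    {
        cudaHostAlloc((void**)h_extForce, nNodes * sizeof(float3), cudaHostAllocDefault);
        cudaHostAlloc((void**)h_disp,     nNodes * sizeof(float3), cudaHostAllocDefault);
    }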
4.3.6 Total active GPU memory in use
After the pre-computation step, certain data are not required any more. It is necessary to identify such data and de-allocate them; for example, the node co-ordinate data is no longer needed. Table 4.1 presents the computation of the total memory in use for executing the main algorithm. Only essential data and their sizes are listed; other temporary memory allocations are not listed, as they are not very significant. Among all the listed data, Next displacement and External Forces are the only data being transferred between CPU and GPU.

Data: Allocation size
Volume elements: sizeof(int4) * number of elements (twice, if hexahedron)
Neighbour elements: sizeof(int4) * number of elements (twice, if hexahedron)
Current displacement: sizeof(float3) * number of nodes
Previous displacement: sizeof(float3) * number of nodes
Next displacement: sizeof(float3) * number of nodes
External Forces: sizeof(float3) * number of nodes
Internal Forces: sizeof(float3) * number of nodes * max_neighbour_count
Lame ABC co-efficients: sizeof(float4) * number of nodes
Shape Functions: sizeof(float4) * number of elements * 4 (twice, if hexahedron)

Table 4.1: Calculation of total GPU memory in use, where sizeof(int) = sizeof(float) = 4 bytes, sizeof(int4) = sizeof(float4) = 16 bytes and sizeof(float3) = 12 bytes; int4, float3 and float4 are data types provided in the CUDA API which are linear sequences of the float or int datatype.
Figure 4.3: Element Loop and Node Loop are kernels producing the next displacement vectors, where U denotes the displacement of each node, R is the external force acting on each node and \Delta t is the time step
4.4 Two Kernel approach
Two kernels produce the final displacements (Figure 4.3). The first kernel traverses the pre-computed data of each element in the mesh and computes the nodal (body or internal) forces. The second kernel uses the computed nodal forces and the supplied external forces and invokes the central difference method to compute the next displacement values. The element loop kernel is the intense part of the algorithm. These kernels are executed by individual threads on the GPU. The block size is supplied by the user and is a multiple of 32 (the warp size); the limit on the block size can be determined by a simple device query using the CUDA API. The number of blocks is determined in the following sections.
4.4.1 Kernel for Element Loop
The first step is to fetch the pre-computed values and store them in local variables. This kernel computes the nodal forces and writes them to the internal force array. The steps involved in this kernel are listed below; a condensed code sketch follows after Table 4.2.
Repeat the following steps for every element in the mesh (one element per thread):
1. Fetch the volume of the element and store it in a local/register variable.
2. Similarly, fetch the neighbour index data computed in the previous section, the shape functions and the current displacements, and store them in local/register variables.
3. Compute the deformation gradient as per equation 2.7 and store the result in a 3x3 matrix.
4. From the obtained deformation gradient, compute the Cauchy-Green strain tensor as per equation 2.9. This results in another 3x3 matrix, stored in a separate local/register variable.
5. Also compute the inverse Cauchy-Green strain tensor and the determinant of the deformation gradient, called the gradient Jacobian.
6. Using the neo-Hookean model, compute the stress tensor, the second Piola-Kirchhoff stress, as per equation 2.13. This step uses the inverse Cauchy-Green strain tensor and the gradient Jacobian computed in the previous step. Though this results in a 3x3 matrix, due to tensor symmetry the matrix is stored in vector format as explained in section 2.2.1.
7. The next step is to assemble the strain-displacement matrix as in section 2.2.4.
8. Knowing the volume, the strain-displacement matrix and the stress components, we can compute the nodal force acting on each node as in equation 2.4.
9. As a precaution, check the computed gradient Jacobian, as it could otherwise lead to division by zero.
10. Finally, the neighbour index array for the corresponding element is used to write the nodal forces into the internal force array at unique locations (Figure 4.2). This scatters the forces computed in each element to the corresponding nodes of the element (Figure 3.4); later, in the node loop kernel, these forces are gathered. If we were to use atomic writes (Figure 3.5) instead, this mapping would be unnecessary, but performance would suffer, as threads are made to wait until each thread finishes writing its nodal force, so the writes are serialized.
The block size used to launch the kernel is provided by the user through the blockSize parameter mentioned previously; that is, the number of threads in each block is determined by the user. This number also affects the solving time, as does the amount of local/register variables used in the kernel. Register variables are temporary variables allocated for each thread by a kernel in the register memory; hence Table 4.2 has been constructed. Given the block size, it is necessary to compute the grid size and thereby determine the kernel dimensions. The grid size is determined as

G_e = \frac{N_e}{B_e} \quad (4.1)

where N_e is the total number of (volume) elements in the mesh, B_e is the block size provided by the user and G_e is the computed grid size. The launched kernel is then of dimension <<<G_e, B_e>>>. Often N_e is not exactly divisible by the provided B_e; in that case, one is added to the computed G_e. The maximum limit of the computed G_e is cross-checked by a simple device query.
Data: Allocation size
Element Index: sizeof(int4) (twice, if hexahedron)
Neighbour Index: sizeof(int4) (twice, if hexahedron)
Volume: sizeof(float)
Current displacement: sizeof(float3)*4 (twice, if hexahedron)
Shape Function: sizeof(float)*4*3 (twice, if hexahedron)
Deformation Gradient: sizeof(float)*3*3
Cauchy-Green tensor: sizeof(float)*3*3
Inverse Cauchy-Green: sizeof(float)*3*3
Stress tensor: sizeof(float)*3*3 (but only 3*2 will be used)
Strain displacement: sizeof(float)*6*3
Temporary Nodal Force: sizeof(float3)*4*3 (twice, if hexahedron)

Table 4.2: Register variables allocated per thread in the Element Loop kernel, where sizeof(int) = sizeof(float) = 4 bytes, sizeof(int4) = sizeof(float4) = 16 bytes and sizeof(float3) = 12 bytes; int4, float3 and float4 are data types provided in the CUDA API which are linear sequences of the float or int datatype.
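To make the steps above concrete, the following condensed sketch shows what such an element loop kernel could look like for tetrahedral elements with scatter-writes. All names are illustrative rather than the project's actual code, plain global-memory reads stand in for the texture fetches described earlier, and the nodal force of equation 2.4 is computed in the equivalent constant-strain form f_a = {}^0V (XS) h_a, where XS is the first Piola-Kirchhoff stress:

    #include <cuda_runtime.h>

    // Sketch of the element loop for tetrahedra (one element per thread).
    // slot[] holds the pre-computed global force-slot index of each of the
    // four element nodes, i.e. node * max_neighbour_count + local slot.
    __global__ void elementLoop(const int4*   elem,    // 4 node indices per element
                                const int4*   slot,    // unique force slots (Figure 4.2)
                                const float*  vol0,    // initial element volumes
                                const float3* hDeriv,  // shape derivatives, 4 per element
                                const float3* disp,    // current nodal displacements
                                float3*       fInt,    // scattered internal force array
                                float mu, float lambda, int nElem)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nElem) return;

        int4 n = elem[e];
        int idx[4] = { n.x, n.y, n.z, n.w };
        float3 h[4], u[4];
        for (int a = 0; a < 4; ++a) { h[a] = hDeriv[4*e + a]; u[a] = disp[idx[a]]; }

        // Step 3: deformation gradient X = du/dx + I (equations 2.5-2.7)
        float X[3][3] = { {1,0,0}, {0,1,0}, {0,0,1} };
        for (int a = 0; a < 4; ++a) {
            float ha[3] = { h[a].x, h[a].y, h[a].z };
            float ua[3] = { u[a].x, u[a].y, u[a].z };
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    X[i][j] += ua[i] * ha[j];
        }

        // Step 4: right Cauchy-Green tensor C = X^T X (equation 2.9)
        float C[3][3];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                C[i][j] = X[0][i]*X[0][j] + X[1][i]*X[1][j] + X[2][i]*X[2][j];

        // Step 5: gradient Jacobian J = det(X), inverse of the symmetric C
        float J = X[0][0]*(X[1][1]*X[2][2] - X[1][2]*X[2][1])
                - X[0][1]*(X[1][0]*X[2][2] - X[1][2]*X[2][0])
                + X[0][2]*(X[1][0]*X[2][1] - X[1][1]*X[2][0]);
        if (fabsf(J) < 1e-8f) return;        // step 9: guard against division by zero
        float dC = J * J;                    // det(C) = det(X)^2
        float Ci[3][3];
        Ci[0][0] =  (C[1][1]*C[2][2] - C[1][2]*C[2][1]) / dC;
        Ci[0][1] = -(C[0][1]*C[2][2] - C[0][2]*C[2][1]) / dC;
        Ci[0][2] =  (C[0][1]*C[1][2] - C[0][2]*C[1][1]) / dC;
        Ci[1][1] =  (C[0][0]*C[2][2] - C[0][2]*C[2][0]) / dC;
        Ci[1][2] = -(C[0][0]*C[1][2] - C[0][2]*C[1][0]) / dC;
        Ci[2][2] =  (C[0][0]*C[1][1] - C[0][1]*C[1][0]) / dC;
        Ci[1][0] = Ci[0][1]; Ci[2][0] = Ci[0][2]; Ci[2][1] = Ci[1][2];

        // Step 6: neo-Hookean 2nd Piola-Kirchhoff stress (equation 2.13)
        float S[3][3];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                S[i][j] = mu * ((i == j ? 1.f : 0.f) - Ci[i][j])
                        + lambda * J * (J - 1.f) * Ci[i][j];

        // Steps 7-8: P = X S, nodal forces f_a = V0 * P * h_a
        float P[3][3];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                P[i][j] = X[i][0]*S[0][j] + X[i][1]*S[1][j] + X[i][2]*S[2][j];

        int4 s = slot[e];
        int out[4] = { s.x, s.y, s.z, s.w };
        float V0 = vol0[e];
        for (int a = 0; a < 4; ++a) {        // step 10: scatter to unique slots
            fInt[out[a]] = make_float3(
                V0*(P[0][0]*h[a].x + P[0][1]*h[a].y + P[0][2]*h[a].z),
                V0*(P[1][0]*h[a].x + P[1][1]*h[a].y + P[1][2]*h[a].z),
                V0*(P[2][0]*h[a].x + P[2][1]*h[a].y + P[2][2]*h[a].z));
        }
    }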
4.4.2 Kernel for Node Loop
This is a simple kernel that gathers the computed nodal forces and computes the total force acting on each node. The external forces on each node and the pre-computed A, B, C co-efficients are also passed to this kernel. It computes the next displacement vectors from the displacement vectors of the current and previous configurations. The size of the local variables allocated in this kernel is not significant enough to be measured. A sketch of the kernel is given after the steps below.
Repeat the following steps for every node in the mesh (one node per thread):
1. Fetch the nodal forces and sum them up. Previously, a nodal force array of length number of nodes * max_neighbour_count (Table 4.1) was allocated. If we take max_neighbour_count = 4, then array locations 0 through 3, 4 through 7, 8 through 11, ... represent the forces acting on the 1st, 2nd, 3rd, ... nodes respectively. The forces acting on each node are summed up into the total nodal force.
2. Store the computed total force in a local/register variable.
3. Fetch A, B, C and the external forces, and store them in local/register variables.
4. Fetch the displacement vector of the current configuration and the displacement vector of the previous configuration.
5. Compute the displacement vector of the next configuration as per equation 2.35.
As the block size is given by the user, it is again necessary to compute the grid size for launching this kernel and thereby determine the kernel dimensions. The grid size is determined as

G_n = \frac{N_n}{B_n} \quad (4.2)

where N_n is the total number of nodes in the mesh, B_n is the block size provided by the user and G_n is the computed grid size. The launched kernel is then of dimension <<<G_n, B_n>>>. Often N_n is not exactly divisible by the provided B_n; in that case, one is added to the computed G_n. The maximum limit of the computed G_n is cross-checked by a simple device query. It is to be noted that B_e = B_n.
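A minimal sketch of such a node loop kernel follows (illustrative names). It gathers the scattered element contributions and applies the central difference update of equation 2.35:

    // One node per thread: sum the reserved force slots and integrate.
    __global__ void nodeLoop(const float3* fInt,   // scattered nodal forces
                             const float3* fExt,   // external forces R
                             const float4* abc,    // per-node A, B, C coefficients
                             const float3* uCurr,
                             const float3* uPrev,
                             float3*       uNext,
                             int maxNeighbours, int nNodes)
    {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= nNodes) return;

        // Step 1: sum the force slots reserved for this node
        float3 F = make_float3(0.f, 0.f, 0.f);
        for (int k = 0; k < maxNeighbours; ++k) {
            float3 f = fInt[n * maxNeighbours + k];
            F.x += f.x; F.y += f.y; F.z += f.z;
        }

        float4 c = abc[n];          // c.x = A, c.y = B, c.z = C
        float3 R = fExt[n], u0 = uCurr[n], um = uPrev[n];

        // Step 5: central difference update, u(t+dt) = A(R-F) + B u(t) + C u(t-dt)
        uNext[n] = make_float3(c.x*(R.x - F.x) + c.y*u0.x + c.z*um.x,
                               c.x*(R.y - F.y) + c.y*u0.y + c.z*um.y,
                               c.x*(R.z - F.z) + c.y*u0.z + c.z*um.z);
    }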
4.4.3 Swapping displacement array
After computing the next displacement vector, the array is copied to the CPU memory; the framework uses these values to displace the nodes. After the copy operation, the memory pointers are swapped so that the current displacement becomes the previous displacement and the computed next displacement becomes the current displacement (and the previous displacement buffer becomes the next displacement buffer; this is achieved with the help of an additional temporary pointer variable). Only the memory pointers are swapped, which avoids unnecessary memory copy operations.
4.5 Atomic writes in Two kernel approach
The GPU used in this project supports CUDA compute capability 1.3. GPU cards in this range have limited support for atomic operations: for example, there is no support for atomic writes on floats, but there is support for atomic writes on integers. Hence, with the aid of the snippet provided in Appendix B (atomicFloatAdd, which uses atomicCAS), we have been able to perform atomic writes for floats. Using such atomic operations reduces performance, but it saves global memory, because we no longer need to consider max_neighbour_count when allocating the element nodal forces. Multiple threads write to the same nodal force array locations, creating race conditions, but atomic operations are meant to resolve exactly such races. This has been implemented in step 10 described in section 4.4.1 and the results have been furnished.
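The Appendix B snippet is of this well-known form: a compare-and-swap loop that emulates a float atomicAdd on compute capability 1.x hardware. A sketch with illustrative formatting:

    // Emulate atomicAdd for floats via the integer atomicCAS primitive.
    __device__ float atomicFloatAdd(float* address, float value)
    {
        int* addressAsInt = (int*)address;
        int old = *addressAsInt;
        int assumed;
        do {
            assumed = old;
            // reinterpret, add, and attempt the swap; retry if another thread won
            old = atomicCAS(addressAsInt, assumed,
                            __float_as_int(value + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);
    }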
5 Results
5.1 GPU Memory measure

5.1.1 Total Element Count vs GPU Global Memory
Figure 5.1 presents a graph comparing the GPU global memory consumption of the different models listed in Table 5.1. Some parts of the global memory arrays have been bound to texture memory, but for simplicity the texture memory performance has been ignored in this measure.
5.1.2 Mesh statistics
Table 5.1 presents the statistics of the models used in this project. As discussed in previous chapters, the number of elements is a key parameter for both the memory and performance measures.
Model    Type   Nodes   Elements   Neighbours   Memory (bytes)
Cube_H   Hex    1331    2000       8            376960
Cube_T   Tet    1331    6000       32           1100288
Hand_1   Tet    3304    16447      40           3178924
Hand_2   Tet    8535    45620      38           8270280

Table 5.1: Mesh statistics
5.2 Performance measure
In the following sections, times have been measured using the CUDA Visual Profiler. The algorithm has been executed for several iterations (time steps) and the average time has been calculated.
5.2.1 Scattered-writes
Figures 5.2, 5.3 and 5.4 present the performance measurements of the scatter-write method. Element Loop and Node Loop are the kernels that operate on the mesh. It is clearly evident that the Element Loop is the intense part of the algorithm; hence the number of elements in a mesh determines the overall performance of the implementation. The host-to-device and device-to-host memory transfers were also measured, using pageable host memory. The distribution of the time taken by all these processes is presented in Figure 5.4. Compared with the Element Loop and Node Loop, the memory transfers take very little time. This suggests that memory transfer is not a significant area to optimize in this case; instead we can focus on optimizing the Element Loop and Node Loop.
5.2.2 Atomic-writes
Figures 5.5, 5.6 and 5.7 present the performance measurements of the atomic-write method, which is a slight modification of the scatter-write method. In the previous section we saw that the Element Loop is the intense part of the algorithm. In it, the nodal forces are computed and stored with global memory writes using neighbour element indices, and this storing of data in global memory based on neighbour indices is the most expensive part of the Element Loop. Hence the atomic-write method, which does not use neighbour element indices, has been studied as an alternative.
From figure 5.5 we can see that the atomic-write method is more expensive than the previous method, but on the other hand it has a large impact on the Node Loop kernel. The main intention of the atomic-write method is to reduce expensive non-uniform memory accesses. However, the GPU hardware on which this project has been tested does not support proper atomic operations, as discussed in the theory and implementation sections; this could be a cause of the poor performance of the atomic-write method. Either way, for the current hardware configuration, scatter-write appears to be the more suitable method. Pageable host memory has been used.
5.2.3 Pageable vs Non-Pageable
The host memory used to transfer data to device memory can be either pageable or non-pageable. Non-pageable (page-locked) memory is consistently faster than pageable memory, as is evident from figure 5.8. A sketch of the two allocation styles is given below.
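As a minimal sketch of the difference (the sizes and names, such as d_disp and numNodes, are illustrative):

size_t bytes = numNodes * 3 * sizeof(float);

// Pageable: an ordinary heap allocation.
float *h_pageable = (float *)malloc(bytes);

// Non-pageable (page-locked): allows the GPU to DMA directly
// from host memory, which is why the transfer is faster.
float *h_pinned;
cudaMallocHost((void **)&h_pinned, bytes);

// The copy call itself is identical; only the host allocation differs.
cudaMemcpy(d_disp, h_pinned, bytes, cudaMemcpyHostToDevice);

cudaFreeHost(h_pinned);
free(h_pageable);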
Element Loop:
Block Size | Grid Size
64         | 94
128        | 47
256        | 24
320        | 19

Node Loop:
Block Size | Grid Size
64         | 21
128        | 11
256        | 6
320        | 5

Table 5.2: The Element Loop (top) and Node Loop (bottom) kernels have been executed with the block sizes shown above
5.2.4 Total Time Comparison
The total time taken by both the scatter-write and atomic-write methods is presented in figure 5.9. It can be observed that the two methods are similar in performance. If GPU hardware supporting proper atomic operations is used, the atomic-write method could perform better. It may still perform poorly on models with several million elements, however, because figures 5.2 and 5.5 indicate that performance decreases as the size of the model increases. Hence a different approach, or a modification of the Two Kernel approach, could be developed.
5.2.5 Block Size Comparison
Equation (4.2) explains the calculation of the grid size from the supplied block size. The block size is the number of threads in a block, and the grid size is the number of blocks in a grid. Figures 5.10 and 5.11 show the performance of the Element Loop and Node Loop for varying block sizes. To understand the variation in performance, it is necessary to know certain kernel parameters such as block size, grid size and register usage per thread; these data are given in tables 5.2 and 5.3.
The results in figures 5.10 and 5.11 were generated with the model Cube_T (Table 5.1), which has 1331 nodes and 6000 tetrahedral elements. The register counts differ between scatter-write and atomic-write, as shown in Table 5.3.
The current hardware has 30 Streaming Multiprocessors (SMs). Each multiprocessor can execute a maximum of 8 blocks and/or a maximum of 1024 threads. The register file capacity per SM is 16384 registers.
For the Element Loop kernel in the atomic-write method, a block of size 320 requires 40 × 320 = 12800 registers, which means only one block fits in one SM (adding one more block would exceed the register capacity). Hence each block occupies a whole SM; in other words, the occupancy of each SM is 320 active threads, 10 active warps, or 1 active block. Various block sizes have been tested, and a block size of 320 appears to be optimal in this case. In this way the optimum block size for both methods can be determined. The arithmetic is sketched below.
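A minimal sketch of this register-limited occupancy arithmetic, assuming the register counts from Table 5.3 and the hardware limits above:

// Occupancy arithmetic for the Atomic-write Element Loop kernel.
int registersPerSM     = 16384;  // register file per SM
int registersPerThread = 40;     // from Table 5.3
int blockSize          = 320;

int regsPerBlock  = registersPerThread * blockSize;  // 12800 registers
int blocksPerSM   = registersPerSM / regsPerBlock;   // 1 (a second block
                                                     //  would need 25600)
int activeThreads = blocksPerSM * blockSize;         // 320 threads
int activeWarps   = activeThreads / 32;              // 10 warps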
Method        | Registers per thread (Element Loop) | Registers per thread (Node Loop)
Scatter-write | 50                                  | 13
Atomic-write  | 40                                  | 11

Table 5.3: Register usage per thread in the Element Loop and Node Loop for the scatter-write and atomic-write methods
Figure 5.1: Active global memory used in GPU during each time step
Figure 5.2: Scatter-write method for Element Loop and Node Loop
Figure 5.3: Pageable memory transfer in Scatter-write
Figure 5.4: Total time distribution in Scatter-write
Figure 5.5: Atomic-write method for Element Loop and Node Loop
Figure 5.6: Pageable memory transfer in Atomic-write
Figure 5.7: Total time distribution in Atomic-write
Figure 5.8: Pageable and Non-Pageable Comparison
Figure 5.9: Total time for Scatter-write and Atomic-write
Figure 5.10: Varying Block Size for Scatter-write
Figure 5.11: Varying Block Size for Atomic-write
6 Summary and Conclusion
6.1 Summary
The TLED implementation on the GPU using CUDA has been done for hexahedral and tetrahedral meshes, and the performance measurements have been presented in the results. The implementation is based on the Two Kernel approach. The first step of initialization is pre-computation, in which quantities such as volumes, shape functions and Lamé coefficients are computed. In the Two Kernel approach, the intense part of the algorithm is the Element Loop module, in which the nodal forces for each element are computed and the resulting forces are stored in global memory on the GPU. Registering these nodal forces can be done using the neighbour indices generated in the pre-computation step, which requires additional memory space.
An alternative is to register the nodal forces without using the neighbour indices. This can be achieved with the atomic-write method, which uses the atomic operations supported by the GPU hardware; the GPU used in this project, however, does not support atomic addition of floating-point values. Pageable and non-pageable memory transactions have also been measured.
All these implementations were done in the CEngine framework based on the H3D API and tested in a semi-immersive environment comprising the VORTEX Workbench and PHANToM Premium haptic devices, which are well suited for medical simulations.
6.2 Conclusion
The results presented in the previous chapter suggest that, in the Two Kernel approach, the scatter-write method outperforms the atomic-write method. With appropriate GPU hardware (for instance NVIDIA Fermi), however, the atomic-write method could perform better than scatter-write. Scatter-write can also be further modified, as proposed in the chapter on future work. Also, at this juncture it is not essential to focus on the memory transactions between CPU and GPU; that becomes necessary only if the memory transaction time grows larger than the time for solving the Element Loop and Node Loop.
So, to optimize the implementation of the Two Kernel approach, it is essential to focus on nodal force registration, which is the reason for comparing scatter-writes and atomic-writes. From this project we can infer that the performance of the GPU-accelerated TLED algorithm is influenced by many factors specific to the CUDA architecture, and that the parameters supplied by the user are key to achieving the required soft tissue deformation.
7 Future Work
7.1 TLED Enhancements
Computation of hourglass-control parameters has already been implemented, but it has not yet been studied. Integrating it with the hexahedral implementation would be very effective, since hexahedral element meshes are a memory-efficient representation.
Currently the neo-Hookean material model has been adopted. This model has proven to be a good approximation of rubber-like materials, but it is not suitable for viscous materials, which exhibit a mixture of linear and non-linear characteristics. This could be another interesting area to explore; for reference, a standard form of the model is given below.
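A common compressible neo-Hookean strain energy (a standard textbook form; the exact constants used in the implementation are not restated here) is

\[
W = \frac{\mu}{2}\,(I_1 - 3) - \mu \ln J + \frac{\lambda}{2}\,(\ln J)^2
\]

where \(I_1 = \operatorname{tr}(C)\) is the first invariant of the right Cauchy-Green deformation tensor, \(J = \det F\) is the volume ratio, and \(\mu\) and \(\lambda\) are the Lamé coefficients from the pre-computation step. A viscoelastic material model would augment such an energy with time-dependent terms.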
7.2 GPU enhancements
It is possible to find alternatives to the scatter-write method other than atomic-writes. One such method is to register the nodal forces together with unique node ids (a mapping step), followed by sorting the forces by node id; a sketch of this idea is given at the end of this section. This approach is expected to scale to very large models (on the order of several million nodes). Such large models may not occur very often in real-time interactive medical applications, but the approach can be an alternative to traditional scatter-gather methods where they fail abruptly due to very high thread divergence while accessing memory. Using shared memory, minimizing the use of local variables in kernels, the Single Kernel approach, and taking advantage of multiple GPUs are other possible GPU enhancements.
Appendix A
A.1 Development Environment
A.1.1 Hardware Configurations
The VORTEX Workbench consists of a rear-projection stereo display driven by dual graphics cards (NVIDIA GTX 290) with CUDA capability, connected to PHANToM Premium haptic devices.
The PHANToM Premium 3.0 device provides a range of motion approximating full arm movement pivoting at the shoulder. It provides 3 degrees of freedom positional sensing and 3 degrees of freedom force feedback, and it connects to the PC via the parallel port (EPP) interface. Table A.1 presents the CUDA configuration.
A.1.2 Software and APIs
1. H3D API
2. Visual Studio 2008 IDE, nvcc
3. CUDA Toolkit with SDK
4. CUDA occupancy calculator
5. CUDA Visual Profiler
6. TetGen and TetView
7. Gmsh
8. Surface mesh models obtained from the INRIA repository
CUDA Driver Version                           | 4.0
CUDA Capability Major revision number         | 1
CUDA Capability Minor revision number         | 3
Total amount of global memory                 | 915800064 bytes
Number of multiprocessors                     | 30
Number of cores                               | 240
Total amount of constant memory               | 65536 bytes
Total amount of shared memory per block       | 16384 bytes
Total number of registers available per block | 16384
Warp size                                     | 32
Maximum number of threads per block           | 512
Maximum sizes of each dimension of a block    | 512 x 512 x 64
Maximum sizes of each dimension of a grid     | 65535 x 65535 x 1
Maximum memory pitch                          | 2147483647 bytes
Texture alignment                             | 256 bytes
Clock rate                                    | 1.24 GHz
Concurrent copy and execution                 | Yes
Run time limit on kernels                     | Yes
Integrated                                    | No
Support host page-locked memory mapping       | Yes
Compute mode (multiple host threads can use this device simultaneously) | Default

Table A.1: Configuration of the NVIDIA GTX 290 graphics card
Appendix B
B.1 Atomic Float using atomicCAS
// Atomic float addition built from the integer atomicCAS, for GPUs of
// compute capability 1.3 that lack a native atomicAdd for float.
__device__ inline void atomicFloatAdd(float *address, float val)
{
    // Reinterpret the current float value as an integer bit pattern.
    int tmp0 = __float_as_int(*address);
    int i_val = __float_as_int(val + __int_as_float(tmp0));
    int tmp1;

    // Retry until no other thread has modified *address between the
    // read and the compare-and-swap.
    while ((tmp1 = atomicCAS((int *)address, tmp0, i_val)) != tmp0)
    {
        tmp0 = tmp1;
        i_val = __float_as_int(val + __int_as_float(tmp1));
    }
}
Bibliography

1. Joseph C. Kolecki, An Introduction to Tensors for Students of Physics and Engineering, NASA/TM-2002-211716.
2. Grand Roman Joldes, Adam Wittek and Karol Miller, Real-time nonlinear finite element computations on GPU - Application to neurosurgical simulation, Comput. Methods Appl. Mech. Engrg. 199 (2010) 3305-3314.
3. NVIDIA CUDA Programming Guide, Version 3.0, NVIDIA Corporation, 2008.
4. David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach.
5. Zeike A. Taylor, Mario Cheng and Sebastien Ourselin, High-Speed Nonlinear Finite Element Analysis for Surgical Simulation Using Graphics Processing Units, IEEE Trans. on Medical Imaging, Vol. 27, No. 5, May 2008.
6. Karol Miller and Ron Kikinis, Real Time Computer Simulation of Human Soft Organ Deformation for Computer Assisted Surgery, ISML, 2009.
7. Karol Miller, Grand Joldes, Dane Lance and Adam Wittek, Total Lagrangian explicit dynamics finite element algorithm for computing soft tissue deformation, Communications in Numerical Methods in Engineering, August 2006.
8. Olivier Comas, Zeike A. Taylor, Jeremie Allard, Sebastien Ourselin, Stephane Cotin and Josh Passenger, Efficient nonlinear FEM for soft tissue modelling and its GPU implementation within the open source framework SOFA, 2010.
9. Olivier Comas, Real-time soft tissue modelling on GPU for medical simulation, 2010.
10. Umut Kocak, Issues in Deformation Simulation, NVIS, Norrköping Visualization and Interaction Studio, ITN, Linköping University, Sweden.
11. Karsten Ostergaard Noe and Thomas Sangild Sorensen, Solid Mesh Registration for Radiotherapy Treatment Planning, Springer-Verlag Berlin, 2010.
12. Rob Farber, CUDA, Supercomputing for the Masses, March 18, 2009.
13. Grand Roman Joldes, Adam Wittek and Karol Miller, An efficient hourglass control implementation for the uniform strain hexahedron using the Total Lagrangian formulation, July 2007.