...

Semidefinite and Second Order Cone Programming Seminar Lecture Notes Fall 2001

by user

on
1

views

Report

Comments

Transcript

Semidefinite and Second Order Cone Programming Seminar Lecture Notes Fall 2001
Semidefinite and Second Order Cone
Programming Seminar
Lecture Notes
Fall 2001
Instructor: Farid Alizadeh
January 13, 2002
Preface
This set of lecture notes are based on a P.h.D level seminar I offered while on
a sabbatical leave at the IEOR department at Columbia University. Among
responsibilities of the students were to be in charge of one or two lectures, take
detailed notes and transcribe them into LATEX format. I then posted these notes
on the course web page1 .
I made some editing of these notes to remove errors and elaborate on some
points. In addition, each student was required to find a topic, prepare notes for
it, and present it in a half-hour lecture. The appendix is the notes for these
presentations and are included essentially unedited.
Farid Alizadeh
New York, NY, USA
December 2001
1
http://www.ieor.columbia.edu/~alizadeh/CLASSES/01fallSDP/
4
Chapter 1
Introduction and Overview
Scribe: David Phillips
9/10/01
1.1
Overview
In general, this lecture gave some motivating examples for SDP and SOCP, as
well as the general formulations for both.
These notes have the following sections:
• The Goemans-Williamson approach to MAX CUT on an Undirected Graph.
• An approach towards regressing a Laffer Curve.
• Moment Cones
• The Fermat-Weber problem
• General formulations for LP, SDP, and SOCP.
1.2
1.2.1
Example 1: MAX CUT on an Undirected
Graph
The problem
Let G = (V, E) be an undirected graph, where V = {1, . . . , n} and |E| = m.
Enumerate the vertices 1, . . . , n. Further, let wij ≥ 0 be a weight on edge,
(i, j), ∀(i, j) ∈ E. The MAX CUT problem is to partition V into S ⊂ V and
S̄ = V − S such that the sum of the weights on the edges from the subset S to
S̄ is maximized. i.e., determine:
W(Copt ) = max
S⊂V
X
{(i,j)∈E|i∈S,j∈S̄}
wij
6
CHAPTER 1. INTRODUCTION AND OVERVIEW
Such a partition is also known as a cut, and quite a bit of work has been
done on both the minimization (MIN CUT) and maximization version. Indeed,
several polynomial time algorithms have been created to solve the MIN CUT
problem. However, MAX CUT has been shown to be NP-Complete. Given
the worst-case complexity of the problem, approximation approaches have been
used, as, unless P = NP, no polynomial time algorithm exists.
1.2.2
Two Approaches
One approach, the Greedy Approach, works as follows: choose a node arbitrarily,
and add it to S. Then for each remaining node n, if the sum of the weights
from n to the nodes in the current S are less than the sum of the weights from
n to the nodes in the current S̄, then add n to S; otherwise, add it to S̄. Let
W(Cgreedy ) denote the value of this cut. It can be shown that:
X
W(Cgreedy ) ≥
wij /2 ≥ W(Copt )/2)
(i,j)∈E
A second approach, the Naive Randomized Approach, works by randomly assigning nodes to S and S̄. Denote the value of cuts generated in this fashion
with W(Crand ). It can then be shown that:
E[W(Crand )] =
X
wij /2
(i,j)∈E
Until the early 90’s, it was not known whether this factor of 2 could be improved. It was then discovered that it could be improved to a constant, and that
to improve better than this constant was itself NP-Hard. The exact constant
remains unknown.
1.2.3
The Goemans–Williamson approach
Goemans and Williamson approached the problem as follows. ∀i ∈ V, let
−1 if i ∈ S
xi =
+1 otherwise
but then,
1 − xj xi =
2 if i ∈ S, j ∈ S̄ or vice versa
0 otherwise
So then the objective becomes:
X wij
(1 − xi xj )
max
2
i<j
1.2. EXAMPLE 1: MAX CUT ON AN UNDIRECTED GRAPH 7
Since this is still equivalent to the former problem, it is still NP-Hard. Hence,
consider the following relaxation. Associate to each vertex i ∈ V a vector
vi ∈ Rn , and consider:
P
wij
T
max
i>j 2 (1 − vi vj )
subject to: ||vi || = 1, ∀i ∈ V
In an ideal world, there would exist a unit vector u ∈ Rn , such that vi = u or
vi = −u, ∀i ∈ V in the solution to the above program (then if vi = u, i ∈ S, and
i ∈ S̄ otherwise, ∀i ∈ V). Naturally, this doesn’t occur in the real world. Hence,
one heuristic is to pass a random hyperplane through the origin of n-space, and
assign i to S if vi is on one side of the random hyperplane, and S̄ otherwise.
Two questions remain: 1) How do we solve this? 2) How good is the solution?
How do we solve this?
This is, in fact, the topic of this course. Consider the V ∈ Rn×n , with form
V = [v1 , v2 , . . . , vn , ]
Let Y = V T V, indexing Y with yij , i = 1, . . . , n, j = 1, . . . , n. Thus, yij = vTi vj .
The following formulation is then obtained:
P
max
i>j (1 − yij )wij /2
subject to: yii = 1, ∀i ∈ N
Y0
where the notation A 0 indicates that the matrix A is positive semidefinite.
Three equivalent (there are many others as well) definitions of positive semidefinite for a symmetric matrix A are:
1. If ∀x ∈ Rn , xT Ax ≥ 0.
2. If there exists a matrix B, not necessarily square, such that A = BT B.
3. If all the eigenvalues of A are non-negative.
Note that the condition of Y being positive semidefinite ensures that it is
of the form Y = V T V, by definition 2. The techniques for solving this kind of
optimization problem will be the subject of the course. For now let us agree
than there is an effective way of solving such problems. Once solved, a method
of generating a solution is as before: First, take the optimal Y and write it
(for instance by using Cholesky factorization) as Y = V T V. Then Generate a
random n-dimensional hyperplane through the origin, and assign the vectors,
vi , columns of V, as before. Denote the value of the cut generated this way by
W(CrandSDP ).
How good is this?
Proposition 1 E[W(CrandSDP )] ≥ αW(Copt ), where α =
Note that 0.87856 < α < .87857.
2
π
min0<θ≤π
θ
1−cos θ .
8
CHAPTER 1. INTRODUCTION AND OVERVIEW
proof: Observe that the probability that vectors vi and vj are on opposite sides
of the hyperplane is exactly the proportion of the angle between vi and vj to π,
i.e., arccos(vi · vj )/π. Then by the linearity of expectations
E[W(CrandSDP )] =
X arccos(vT vj )
i
i<j
π
Since The expected value of a given cut is at most as large as the optimal
cut, and the expected value of the optimal cut is less than the value of the
semidefinite relaxation, we have
X arccos(vT vj )
i
i<j
π
≤ W(COPT ) ≤ W(CSDP ) =
X
wij
i<j
1 − vTi vj
2
Now, we ask the question: Does there exist 0 ≤ α ≤ 1 such that αW(CSDP ) ≤
W(CRandom ) If such an α exists then by comparing term by term the sums:
X
i<j
αwij
X
arccos(vTi vj
1 − vTi vj
≤
wij
2
π
i<j
it follows that we seek an α that is at most as large as
2 arccos(vTi vj )
.
π(1 − vTi vj )
renaming y = vTi vj we see that such an α at most equals
min
−1≤y≤1
2 arccos(y)
π(1 − y)
Using software such as MAPLE or Mathematica, it is easily seen that the minimum is approximately 0.87. Thus we have proved the proposition.
1.3
Example 2: The Laffer Curve
A Laffer curve is a theoretical function that some economists have used to
explain the need for lowering tax rates so as to increase revenue from taxation
(this is often called the supply-side economics). The argument goes as follows:
If the tax rate is zero then obviously the government will have no revenues from
taxes. If the tax rate is 100% then people have no incentive to work: all they
make will have to go to government. Therefore they won’t work at all and thus
there is no income and no taxes to be collected; the government revenue is again
zero. Starting from tax rate of zero, as we slowly increase taxes, the government
revenue from taxes first increases, but–assuming continuity of the curve–it has
to reach a maximum and then start going down. One can also give economics
reasons that the curve should be concave:
1.3. EXAMPLE 2: THE LAFFER CURVE
Gov. Revenue
9
Laffer curve
Regressed curve
0
1
Tax rate
Now to actually get an approximation to the Laffer curve, one may get some
data, a set of points indicating various tax rates and government revenue at
that rate. Any such data is very possibly quite noisy. The typical approach
is to use least squares polynomial approximation to find a possible curve (note
that we should also force the curve to go through the (0,0) and (1,0) points.)
However there is no reason to believe that the resulting least square curve is
concave at all. Therefore, we need to include the concavity of the curve among
our constraints:
Given data points (xj , yj ), j = 1, . . . , n, where xj ∈ [0, 1] represents a tax
rate, and yj represents the associated revenue, our problem is to find pi , i =
0, 1, . . . , k such that
p(x) = p0 + p1 x + p2 x2 + . . . + pk xk
is concave on the interval [0, 1], and
p0 = 0
k
X
i=1
pi = 0
10
CHAPTER 1. INTRODUCTION AND OVERVIEW
which minimize
n
X
yj − p(xj )2 .
j=1
One possible way to ensure the concavity of p(x) is to use polynomials of the
form p(x) = p0 (−1) + p1 (−x2 ) + p2 (−x4 ) + . . . + pk (−x2k ), and restrict pi ≥ 0,
i = 0, 1, . . . , k. Since −x2j is concave, the resulting polynomial will be concave
if we restrict ourselves to pi ≥ 0. This approach results in a constraint set that
is polyhedral. However, there are concave functions on the interval [0, 1] that
are not of this form. Another way of ensuring that p(x) is concave is to state
that the negative of its second derivative, −p 00 (x) ∈ Pk−2 , where
def
Pk+1 =
p ∈ Rk+1 p0 + p1 x + p2 x2 + . . . + pk xk ≥ 0, ∀x ∈ [0, 1]
This is the cone of non-negative polynomials of degree k, and our formulation
becomes,
Pn
2
min
j=1 |yj − p(xj )|
subject to: p0 = 0
Pk
i=1 pi = 0
−p00 (x) ∈ Pk−1
Note that the vector p00
p00 = D2 p where

0
0

0
D2 = 
0

0
0
is obtained from p by a linear transformation, that is
0 1×2
0
0
0
0
0
0
0
0
0
0
...
2×3
...
...
...
...

0
0

...
0


0
0

0 (k − 1) × k


0
0
0
0
The approach explained above can be applied to any form of regression or
approximation using a linear combination of appropriate basis functions, where
we have additional constraints such as the functions should be
1. non-negative or non positive: e.g. p ∈ Pn ,
2. non-increasing or non-decreasing: e.g. p0 ∈ Pn ,
3. convex or concave: e.g p00 ∈ Pn .
Many other variations of these requirements can be constructed.
It turns out that optimization over Pn can be formulated as a semidefinite
program. We elaborate on this after introducing the moment cone.
1.4. EXAMPLE 4: THE MOMENT CONE
1.4
11
Example 4: The moment cone
Suppose the moments of a distribution are given, ci , i = 0, 1, ..., 2n, where
Z
ci = xi dF(x)
There are some question that can be asked concerning these moments. Are
these really the moments of a distribution? It turns out this can be answered
by determining if matrix C(defined below) is positive semidefinite, and c0 = 1.
i.e.,


c0
c1
...
cn
 c1
c2
. . . cn+1 
 0
C=
 ...

cn cn+1 . . . c2n
c0 = 1
But this is equivalent to saying that the moments are in the moment cone,
M2n+1 , where
M2n+1 = {αc : α ≥ 0, ci is the ith moment for a probability distribution}
It can be proven that M2n+1 is the dual cone of P2n+1 , the cone of nonnegative polynomials over R. The regression problem mentioned in the previously required that that the polynomials to be non-negative over the interval
[0, 1]. It can be shown that that cone is dual to moments whose distributions are
defined over [0, 1]. The latter can also be characterized by semidefinite matrices.
Thus, optimization over Mn is related to optimization over Pn . In particular
our regression problem is a semidefinite program of special kind.
Several proofs and definitions have been omitted here; we will review this
material more carefully in future lectures.
1.5
Example 5: The Fermat-Weber Problem
Suppose one is given n locations in k-space, i.e., vectors vi ∈ Rk , i = 1, . . . , n,
with associated weights, wi , i = 1, . . . , n. How can a locations x ∈ Rk be chosen
to minimize the weighted Euclidean norm (represented by || · || of the distances
from each of the locations? i.e.,
min
x∈Rk
n
X
i=1
wi kvi − xk
12
CHAPTER 1. INTRODUCTION AND OVERVIEW
P
The complication with this problem is that the function i wi kvi − xk is nonsmooth. In particular, the possible solution x = vj , for some j ∈ {1, . . . , n}, is a
point at which the gradient of the objective does not exist.
So consider the following equivalent program:
Pn
min
i=1 wi zi
subject to: zi ≥ kvi − xk, i = 1, . . . , n
Equivalently, one could write




y=



zi
(vi )1 − (x)1
·
·
·
(vi )k − (x)k


 
 
 
=
 
 
 
y0
y1
·
·
·
yk








Letting



ȳ = 


y1
·
·
·
yk






the formulation constraint is equivalent to
y0 ≥ ||ȳ||
or,










y∈Q= 










y0
y1
·
·
·
yk











q

 : y0 ≥ 2 y2 + . . . + y2
1
k









Now, Q is a convex cone, often referred to as the Lorentz cone or the second
order cone. We write the constraint of y ∈ Q as y Q 0.
So the problem becomes:
Pn
min
i
 i=1 wi z


zi
0
 (vi )1   (x)1 

 

 ·   · 
−

subject to: 
 ·   ·  Q 0, i = 1, . . . , n

 

 ·   · 
(vi )k
(x)k
This kind of problem is an example of optimization problems over the second
order cone; the class of such problems is called Second Order Cone Programming
or SOCP for short.
1.6
1.6.1
Program Formulations
Standard Form Linear Program
This is the ordinary linear program. Let c ∈ Rn and b ∈ Rm , A ∈ Rn×m with
rows, ai ∈ Rn , i = 1, . . . , m, x ∈ Rn decision vector, and 0 is a vector of zeroes,
appropriately dimensioned.
min
cT x
subject to: ai x = bi ,
x≥0
1.6.2
i = 1, . . . m
Semidefinite Program
Here instead of vectors ai we use symmetric matrices Ai ∈ Rn×n , i = 1, . . . , m,
C ∈ Rn×n and X ∈ Rn×n instead of the cost and decision vectors c and x. The
matrix X is symmetric (i.e., X = XT – this allows one to make the assumption
that C and P
Ai , i = 1, . . . , m are also symmetric). The inner product is defined
as A • B = i,j Aij Bij .
min
C•X
subject to: Ai • X = bi
X0
1.6.3
for i = 1, . . . , m
Second-Order Cone Program
Letting Aj ∈ Rm×nj , cj ∈ Rn , b ∈ Rm , j = 1, . . . , k and xj ∈ Rnj , j = 1, . . . , k
are decision variables. the general second order cone program is:
min
cT1 x1 + · · · + cTk xk
subject to: A1 x1 + . . . + Ak xk = b
xj Q 0, j = 1, . . . , k
where the relation a Q 0 is defined as a0 ≥
q
a21 + . . . + a2n .
14
CHAPTER 1. INTRODUCTION AND OVERVIEW
Chapter 2
Cones and Cone-LPs
Scribe: Xuan Li
9/17/2001
2.1
Overview
We survey the basic notions of cones and cone-LP and give several examples
mostly related to semidefinite programming.
2.2
Program Formulations
The linear and semidefinite programming problems are formulated as follows:
2.2.1
Standard Form Linear Programming
Let c ∈ Rn and b ∈ Rm ,A ∈ Rn×m with rows ai ∈ Rn , i = 1, ...m.
min: cT x
s.t.
ai x = b i ,
x≥0
2.2.2
i = 1, ..m
(2.1)
Semidefinite Programming
Here instead of vectors ai we use symmetric matrices Ai ∈ Sn×n (the set of
n × n symmetric matrices), i = 1, ...m, C ∈ Sn×n and X ∈ Sn×n instead of c
and x. The matrix X is positive semidefinite. The inner product is defined as
X
A•B =
Aij Bij
i,j
=
=
Trace(ABT )
Tr(AB) = Tr(BA).
16
CHAPTER 2. CONES AND CONE-LPS
The second equation is from definition of product, and the last one come
from the observation that even though matrix product is not commutative,
i.e. AB 6= BA in general, the diagonal entries of AB and BA are equal and thus
their traces are equal as well. The standard form of semidefinite programming
is :
min C • X
s.t. Ai • X = bi ,
X<0
2.3
i = 1, ..m
Some Notations and Definitions
• cone: A set K is called a cone if αx ∈ K for each x ∈ K and for each α ≥ 0.
• convex Cone: A convex cone K is a cone with the additional property
that x + y ∈ K for each x, y ∈ K.
• pointed cone A pointed cone K is a cone with the property that K ∩
(−K) = {0}.
• open Set A set S is open if for every point s ∈ S, B(a, ) = {x : ||x − s|| <
} ⊂ S for some positive number s .
• closed set A set S is a closed set if its compliment Sc is open.
• interior of set The interior of a set S is defined as
[
Int(S) :=
T
T ⊆S,T open
• closure of set The closure of a set S is defined as
\
cl(S) :=
T
T ⊇S,T closed
• boundary of set The boundary of a set S is defined as
Bd(S) := Cl(S) ∩ Int(S)c
Remark 1 There are some basic facts which can be easily seen from the definitions above:
2.4. THE STANDARD CONE LINEAR PROGRAMMING (K-LP)
17
1. An open set in Rn is not open in Rm for n < m ;
2. similarly, the boundary or the interior of a set isn’t the same in Rn as in
Rm ;
3. As a result one talks about an open set with respect to the topology
induced by the vector space spanned by a set S;
4. similarly we speak of relative interior and relative boundary of a set which
are understood to be with respected to topology of the space spanned by
the the set;
5. a closed set in Rn is also closed in Rm .
Consider the half closed interval [a, b) = {x : a ≤ a < b} in R1 . The interior of
[a, b) in R1 is the open interval (a, b) and the boundary of [a, b) is {a}∪{b} . But
(a, b) isn’t open in R2 since for any x ∈ (a, b), we can’t find some > 0 such
that B(x, ) ⊂ (a, b). The interior of [a, b) in R2 is empty and the boundary of
[a, b) in R2 is [a, b]. However the relative interior of [a, b) in Rn is again (a, b)
and the relative boundary {a, b}.
Definition 1 (Proper Cone) A proper cone K ⊆ Rn is a closed, pointed,
convex and full- dimensional cone (i.e dim(K) = n). A full-dimensional cone
is a cone which contains n linearly independent vectors.
Theorem 1 Every proper cone K induces a partial order which is defined as
follows:
K
∀x, y ∈ Rn , x ≥ y ⇔ x − y ∈ K
K
X > y ⇔ x − y ∈ Int(K)
K
K
K
Proof: First note that x ≥ x since x − x = 0 ∈ K . Secondly, ifx ≥ y, y ≥ x ,
then x − y ∈ K, y − x ∈ K . Since K is a proper cone, thus a pointed cone, we
K
K
get x = y . Finally, if x ≥ y, y ≥ z then x − z = (x − y) + (y − z) ∈ K , i.e.,
K
x ≥ z.
2.4
The Standard cone linear programming (KLP)
min cT x
s.t. aTi x = bi ,
i = 1, ..m
K
x≥0
where c ∈ Rn and b ∈ Rm ,A ∈ Rn×m with rows ai ∈ Rn , i = 1, ...m. Observe
that every convex optimization problem: minx∈C f(x) where C is a convex set
18
CHAPTER 2. CONES AND CONE-LPS
and f(x) is convex over C, can be turned into a cone-LP. First turn the problem
to one with linear objective and then turn it into Cone LP:
min z
s.t. f(x) − z ≤ 0
x ∈ C.
Since the set C0 = {(z, x) | x ∈ C and f(x) − z ≤ 0} is convex our problem is
now equivalent to the cone LP where
min z
s.t. x0 = 1
K
x≥0
where K = {(x0 , z, x) | (z, x) ∈ x0 C and x0 ≥ 0}
The convex set
embeded in plane
and turned into a cone
Definition 2 (Dual Cone) The dual cone K∗ of a proper cone is the set
{z : zT x ≥ 0, ∀x ∈ K}.
It is easy to prove that if K is proper so is K∗ .
2.4. THE STANDARD CONE LINEAR PROGRAMMING (K-LP)
19
Example 1 (Half line) Let R+ = {x : x ≥ 0}. The dual cone R∗+ is exactly
R+ .
Example 2 (non-negative orthant) Let
Rn
+ = {x | xk ≥ 0 for k = 1, . . . , n},
the dual cone equals Rn
+ , that is the non-negative orthant is self dual.
We recall that
Lemma 1 A matrix X is positive semidefinite if it satisfies any one of the
following equivalent conditions:
1.
(1) aT Xa ≥ 0, ∀a ∈ Rn
2.
(2) ∃A ∈ Rn×n such that AAT = X
3.
(3)
All eigenvalues of X are non-negative.
Example 3 (The semidefinite cone) Let Pn×n = {X ∈ Rn×n : X is positive semidefinite}
Now we are interested in P∗n×n . On one side,
∀Z ∈ P∗n×n , Z • X ≥ 0 for allX 0,
i.e.,
Z • X = Tr(ZX) = Tr(ZAAT ) = Tr(AT ZA) ≥ 0 for all A ∈ Rn×n .
Since X is symmetric, from the knowledge of linear algebra, X can be written
as X = QΛQT where QQT = I , that is Q is an orthogonal matrix, and Λ
is diagonal with the diagonal entries containing the eigenvalues of X. Write
Q = [q1 , ...qn ] and Λ = diag(λ1 , ...λn ). λi , i = 1..n , then qi is the eigenvector
corresponding to λi , i.e, qTi Xqi = λi
Let us choose Ai = pi ∈ Rn where pi is the eigen vector of Z corresponding to
γi and pTi pi = 1. Then,
0 ≤ Tr(ATi ZAi ) = pTi Zpi = γi
. So all the eigenvalues of Z are non-negative, i.e., Z ∈ Pn×n , P∗n×n ⊆ Pn×n.
On the other hand, ∀Y ∈ Pn×n , ∃B ∈ Rn×n such that Y = BBT . ∀X ∈
Pn×n , X = AAT , we have
Y • X = Tr(YX) = Tr(BBT AAT ) = Tr(AT BBT A) = Tr[(BT A)T (BT A)] ≥ 0
i.e., Y ∈ P∗n×n , Pn×n ⊆ P∗n×n . In conclusion, P∗n×n = Pn×n
20
CHAPTER 2. CONES AND CONE-LPS
Example 4 (The second order cone) Let Q = {(x0 , x̄) | x0 ≥ ||x̄||}. Q is a
proper cone. What is Q∗ ?
On one side, if z = (z0 , z̄) ∈ Q, then for every (x0 , x̄) ∈ Q
(z0 , z̄T )
x0
x̄
=
z0 x0 + z̄T x̄
≥ ||z̄|| · ||x̄|| + z̄T x̄
≥ −z̄T x̄ + z̄T x̄ = 0
i.e., Q ⊆ Q∗ . The inequalities come from the Cauchy-Schwartz inequality:
−zT x ≤ |xT z| ≤ ||z|| · ||x||
On the other side, we note that e = (1, 0) ∈ Q. For each element z = (z0 , z̄) ∈ Q∗
we must have zT e = z0 ≥ 0. We also note that each vector of the form
x = (kz̄k, −z̄) ∈ Q, for all z̄ ∈ Rn . Thus, in particular for z = (z0 , z̄) ∈ Q∗ ,
zT x = z0 ||z̄|| − ||z̄||2 ≥ 0
Since ||z̄|| is always non-negative, we get z0 ≥ ||z̄||, i.e., Q∗ ⊆ Q.
Therefore, Q = Q∗
Definition 3 An extreme ray of proper cone K is a half line αx = {αx | α ≥ 0}
for x ∈ K such that for each a ∈ αx, if a = b + c, then b, c ∈ αx.
Example 5 (Extreme rays of the second
order cone) Let Q the second
order cone. The vectors x = kx̄k, x̄ define the extreme rays of Q. This is
fairly easy to prove.
Example 6 (Extreme rays of the semidefinite cone) Let Pn×n be the semidefinite cone. Positive semi-definite matrices qqT of rank 1 form the extreme rays
of Pn×n . Here is the
P proof. Any positive semidefinite matrix X can be written
in the form of X = i λi pi pTi (See previous lecture to see how to get this from
spectral decomposition of X). This shows that all extreme rays must be among
matrices of the form qqT . Now we must show that each qqT is an extreme ray.
Let qqT = X+Y, where X, Y < 0. Suppose {q1 = q, q2 , . . . , qn } is an orthogonal
set of vectors in Rn . Then multiplying from left by qTi and from right by qi we
see that qTi Xqi + qTi Yqi = 0 for i = 2, . . . , n; but since the summands are both
non-negative and add up to zero, they are both zero. Thus qTi Xqi = qTi Yqi = 0
for i = 2, . . . n. Thus both X and Y are rank one matrices (their null space has
dimension n − 1) and we might as well write qqT = xxT + yyT . But the right
hand side is a rank 2 matrix unless x and y are proportional, which proves they
are proportional to q. Thus, qqT are extreme rays for each vector q ∈ Rn .
2.4. THE STANDARD CONE LINEAR PROGRAMMING (K-LP)
21
2.4.1
An Example of a cone which is not self dual
In the examples above, we note that they were all self-dual cones. But there
are cones that are not self-dual.
Let F be the set of functions F : R → R with the following properties:
1. F is right continuous,
2. non-decreasing (i.e. if x > y then F(x) ≥ F(y),) and
3. has bounded variation, that is F(x) → α > −∞ as x → −∞, and F(x) →
β < ∞ as x → ∞.
First observe that functions in F are almost like probability distribution functions, except that their range is the interval [α, β] rather than [0, 1]. Second the
set F itself is a convex cone and in fact pointed cone in the space of continuous
functions.
Now we define a particular kind of Moment cone. First, let us define
 
1
x
 2
x 

ux = 
 · .
 
 · 
xn
The moment cone is defined as:
Z
Mn+1 = c = ux dF(x) : F(x) ∈ F
that is Mn+1 consits of vectors c where for each j = 0, . . . , n, cj is the jth
moment of a distribution times a non-negative constant.
Lemma 2 Mn+1 is a proper cone.
Proof: Let’s examine the properties we need to prove:
• ∀c ∈ Mn+1 and α ≥ 0 αc
R ∈ Mn+1 . To see this observe that there
exists F ∈ F such that c = ux dF(x). Now if F is right-continuous, nondecreasing and with bounded variation, then all these properties
also hold
R
for αF for each α ≥ 0 and thus αF ∈ F. Therefore, αc = ux d(αF(x)) ∈
Mn+1 . Thus Mn+1 is a cone.
• If c and dR are in Mn+1 then c + d ∈ Mn+1 . ∀c =
Mn+1 , d = ux dF2 (x) ∈ Mn+1
Z
c + d = ux d[F1 (x) + F2 (x)] ∈ Mn+1
Thus Mn+1 is a convex cone.
R
ux dF1 (x) ∈
22
CHAPTER 2. CONES AND CONE-LPS
R
• If c and −c are in Mn+1
R then c = 0. Ifc = ux dF1 (x) ∈ Mn+1 and
c ∈ −Mn+1 , then −c = ux dF2 (x) ∈ Mn+1 .
Z
c + (−c) = 0 = ux d[F1 (x) + F2 (x)]
R
Especially, d[F1 (x)+F2 (x)] = 0. Since F1 (x)+F2 (x) ∈ F is non-decreasing
with F1 (x) + F2 (x) → 0 as x → −∞, we get F1 (x) + F2 (x) = 0 almost
everywhere,i.e., Fi (x) = 0, i = 1, 2 almost everywhere. It means c = 0,
i.e., Mn+1 ∩ −Mn+1 = 0. Thus Mn+1 is a pointed cone.
• Mn+1 is full-dimensional. Let
0, if x < a
Fa (x) =
1, if x ≥ a
Obviously, Fa (x) ∈ F and ua =
n + 1 distinct a1 , ...an+1 ,
R
ux dFa (x) ∈ Mn+1 for all a ∈ R. Choose
det[ua1 , · · · , uan+1 ] =
Y
(ai − aj ) 6= 0
i>j
Thus Mn+1 is full-dimension cone. (The determinant above is the wellknown Vander Monde determinant.)
In addition we need to show that Mn+1 is closed. This will be taken up in
future lectures.
Example 7 (Extreme rays of Mn+1 ) The extreme rays of Mn+1 are all αux
for x ∈ R. If c ∈ Mn+1 , c can be written as α1 ux1 + α2 ux2 + · · · +
αn+1 uxn+1 , αi ≥ 0 for i = 1, ..n + 1.
There is a one-to-one correspondence between c ∈ Mn+1 and
H = α1 ux1 uTx1 + α2 ux2 uTx2 + · · · + αn+1 uxn+1 uTxn+1 .
Such a matrix is called Hankel matrix. In general Hankel matrices are thos
matrices, H such that Hij = hi+j , that is entries are constant along all opposite
diagonals. A vector c ∈ R2n+1 is in the moment cone if and only if the Hankel
matrix Hij = ci+j is positive semidefinite. Again these assertions will be proved
in future lectures.
Now we examine M∗n+1 . Let’s first consider the cone defined as follows:
Pn+1 = {p = (p0 , . . . , pn ) | p0 + p1 x + p2 x2 + ... + pn xn = p(x) ≥ 0 for all x}
Lemma 3 Every non-negative polynomial is the sum of square polynomials.
2.4. THE STANDARD CONE LINEAR PROGRAMMING (K-LP)
23
Proof: First it is well known that p(x) can be written as
p(x) = c
k
n
Y
Y
(x − αj − iβj )(x − αj + iβj )
(x − γj )
j=1
j=k+1
√
where i = −1 and c ≥ 0. We first claim that n must be even. Otherwise,
p(x) → −∞ as x → −∞ p(x) and cannot be non-negative. The number of
real roots is even subsequently, say 2l.
since p(x) ≥ 0, all the real roots must have even multiplicity, because otherwise
in the neighborhood of the root with odd multiplicity there is some t such that
p(t) < 0. Thus, we can write
k
n
Y
Y
p(x) = c
(x − αj − iβj )(x − αj + iβj )
(x − γj )2
j=1
j=k+1
On the other hand for each pair of conjugate complex roots we have
(x − α − iβ)(x − α + iβ) = (x − α)2 + β2
Therefore the product expression for p(x) is product of square polynomials or
sums of square polynomials, which yields a sum of square polynomials.
This means that the set of extreme rays of the non-negative polynomials is
among polynomials that are square q2 (x). Thus, the coefficients of extreme
rays are of the form q ∗ q = q∗2 , where a ∗ b is the convolution of vectors a
and b, that is for a, b ∈ Rn+1 , a ∗ b ∈ R2n+1 and is defined as:
a ∗ b = (a0 b0 , a0 b1 + a1 b0 , . . . , a0 bk + a1 bk−1 + · · · + ak b0 , . . . , an bn )T
and q∗2 = q ∗ q.
Now not all square polynomials are extreme rays. In particular, if a square
polynomial has non-real roots then it can be written as sum of two square
polynomials as shown above. Thus, extreme rays are among those square polynomials with only real roots. We now argue that these polynomials are indeed
extreme rays.
Q
Suppose p(x) = (x − γj )2k is a polynomial with distinct roots γj which
is not an extreme ray. Then p(x) = q(x) + r(x) and since both q and r are
non-negative, we must have q(x) ≤ p(x). This means that degree of q(x) is
at most as large as degree of p. Furthermore, from the picture it is clear that
each γj is also a root of q(x). But if for some γj the multiplicity in p is 2k
and the multiplicity in q is 2m where m < k then in some neigborhood of γj
q(x) > p(x) because (x − γj )2m > (x − γj )2k in some neighborhood of γj when
m < k; therefore, k ≤ m for each root. Since degree of p is larger than or equal
to degree of q it follows that k = m for each root. Thus q(x) = αp(x) for some
constant α. We have proved:
2
Corollary 1 p is an extreme ray of Pn+1 if p = q∗ and q(x) has only real
roots.
Pn+1
P
2
We now show that P∗n+1 ⊇ Mn+1 . Note that ∀c = i=1 βi uxi ∈ Mn+1 , i p∗i ∈
Pn+1
X
i
since
2
n+1
X
p∗i )T (
j=1
X 2
βj uxj =
βj (p∗i )T (uxj ) ≥ 0.
i,j
2
2
βi ≥ 0, (p∗i )T (uxj ) = pi (x)
Later in the course we will prove that that P∗n+1 = Mn+1 .
Chapter 3
Duality theorey
Scribe: Ge Zhang
9/24/2001
3.1
Overview
First we give the Separating Hyperplane Theorem and Weak Duality Theorem.
Secondly, the Generalized Farkas Lemma is presented. Finally, we provide the
Duality Theorem and its proof.
3.2
Farkas Lemma
The following theorem is probably the fundamental theorem of convex analysis.
There are many variations of it. Here we present the conic version.
Theorem 2 (The Separating Hyperplane Theorem) Let K be a proper
cone (closed in particular), let a be a point and a 6∈ K, then ∃ b such that
b> x ≥ 0, ∀x ∈ K and b> a < 0.
The theorem may be proved using the Weierstrass theorems in analysis. The
discussion of this theorem is presented in most books on non-linear programming.
A consequence of the Separating hyperplane theorem is the following
Lemma 4 If K is proper, then (K∗ )∗ = K.
Proof: If x ∈ K, then ∀z ∈ K∗ x> z ≥ 0. This implies K ⊆ (K∗ )∗ . On the
other hand, suppose that K ⊂ K∗ but K 6= K∗ , then ∃z ∈ (K∗ )∗ , but z 6∈ K.
Therefore, by the Separating Hyperplane Theorem, ∃b such that b> z < 0 and
b> x ≥ 0 for all x ∈ K. This implies b ∈ K∗ . So ∀a ∈ (K∗ )∗ , b> a ≥ 0, but
b> z < 0. We get a contradiction. Hence, (K∗ )∗ = K.
26
CHAPTER 3. DUALITY THEOREY
Lemma 5 z ∈ K∗ iff z> x ≥ 0, ∀x, extreme rays of K.
Proof: (⇒) It is trivial.
P
(⇐) u ∈ K ⇒ u =
αi xi where xi ’s are the extreme rays of K and αi ≥ 0.
P
So clearly z> u =
αi z> xi ≥ 0.
Some Properties of the Dual of Cone
If K1 and K2 are proper cones, then the following hold
1. K1 ⊕ K2 = {(x, y) : x ∈ K1 , y ∈ K2 } is a proper cone.
2. K1 + K2 = {x + y : x ∈ K1 , y ∈ K2 } is a proper cone.
3. K1 ∩ K2 is a proper cone.
4. (K1 + K2 )∗ = K∗1 ∩ K∗2 , conversely, (K1 ∩ K2 )∗ = K∗1 + K∗2 .
The Weak Duality Theorem
Let A ∈ <m×n , x, c & s ∈ <n , b & y ∈ <m , K is a proper cone, m ≤ n and A
is a full rank matrix, then the standard form of K-LP problem is
Primal
min
c> x
s.t.
Ax = b
x K 0
Dual
max
b> y
s.t.
A> y + s = c
s K∗ 0
Theorem 3 (Weak Duality Theorem) If x is feasible for Primal and (y, s)
is feasible for Dual, then c> x ≥ b> y.
Proof: c> x − b> y = c> x − (Ax)> y = x> (c − A> y) = x> s ≥ 0 The last
inequality follows from the fact that x ∈ K and s ∈ K∗ .
Lemma 6 (Generalized Farkas Lemma) Let K ⊆ <n be a proper cone and
let A ∈ <m×n such that A(K) is closed, then for every b ∈ <m
Either: ∃x ∈ Rn , x K 0, and Ax = b;
Or: ∃y ∈ <m , A> y K∗ 0 but b> y < 0.
Remark 2 Note that A(K)—the linear transformation of K by A, that is
A(K) = {Ax : x ∈ K}—is also a convex and pointed cone if K is convex
and pointed. However, it may not be a closed cone even if K is. If K is a
closed polyhedral cone (for example the non-negative orthant), then A(K) will
always be closed. This is the reason why the stipulation that A(K) be closed
is not necessary in the usual Farkas Lemma. An example of a closed cone is
3.2. FARKAS LEMMA
27
{(x0 , x1 , x2 ) | x1 , x2 ≥ 0, and x1 x2 ≥ x0 }. This is essentially one branch of
the hyperbola embedded in the hyperplane x0 = 1 and then extended into a
cone by extending rays from the origin. Now if we apply the linear transformation that projects this cone to say the x2 coordinate, we see that the projected
cone is {x2 | x2 > 0} which is not closed since it does not contain the origin.
This occurs because the cone above has an asymptote, that is a hyperplane that
is tangent to it at infinity. Since infinity is not a real number, this causes exclusion of some points of the boundary of the projected cone. The existence
of asymptotes is essentially the only reason why A(K) might not be closed in
general. Polyhedral cones do not have asymptotes, thus for them the issue of
closedness under linear transformations never arises.
Proof: Either, b ∈ A(K), which is equivalent to ∃x ∈ <n , x K 0, Ax = b
or, b 6∈ A(K). In the former case if A> y ∈ K∗ then 0 ≤ x> A> y = b> y. In
the latter case, by the Separating Hyperplane Theorem, we must have y where
y> b < 0 and y> z ≥ 0 for all z ∈ A(K). This implies y> Ax ≥ 0 for all x ∈ K.
So y> A K∗ 0. and y> b < 0.
As can be seen, compared with Farkas Lemma in LP, Generalized Farkas Lemma
requires one more condition that is A(K) is closed. This condition is indeed
necessary. The example below shows that without this condition Generalized
Farkas Lemma is no longer true.
Counterexample to Generalized Farkas Lemma
Consider a matrix
0
1
1
u
This matrix is not positive semi-definite for any value of u because a diagonal
entry of a positive semidefinite matrix is always non-negative; furthermore, if a
diagonal Xii = 0, then the entire row i and column i are zero. (If Xij 6= 0 then
set a = (ei ± ej )> X(ei ± ej ) = ±2Xij , and choosing − sgn(Xij ) in place ± will
make a negative.)
But

E11 • X = 0,

0 1
21
0 ⇐⇒
( E12 +E
) • X = 1,
2
1 u

X 0.
where Eij is the elementary matrix with entry Eij = 1, other entries being 0; X
is a 2 × 2 matrix. Stretch X to 1 × 4 vector, we get the equivalent system:



x11





1 0
0
0 
0

 x12  =
, 




0 1/2 1/2 0
x21
1



x
22


(∗)
x11



 x12 







 x21  0.



x22
28
CHAPTER 3. DUALITY THEOREY
0 1
is never positive semi-definite, system (*) has no solution. If we
1 u
want to apply the Generalized Farkas Lemma, we get
0
21
∃(y1 , y2 ) such that y1 E11 + y2 ( E12 +E
)
0,
(y
,
y
)
< 0,
1
2
2
1
y1 y2
in particular y2 < 0. This implies
0, with y2 < 0. Clearly, this
y2 0
is impossible since by the fact y22 = 0 we must have y2 = 0. Therefore, we get
a counterexample. The reason this happened is that for this example A(K) is
not closed. In this case,
1
0
0
0
A=
, and K = P2×2 .
0 1/2 1/2 0
Since
Therefore,




1
A(K) =
0




 


x
 
0
0
0 
y
x
y
x y
 |
<
0
=
(x,
y)
|
<
0
1/2 1/2 0 y
y z
y z



z
It is easy to see that A(K) = {(0, 0)} ∪ {(x, y) : x > 0, y ∈ <}; it is essentially the
right half plane x > 0 with (0, 0) added, but none of the other boundary points
(0, y) for y 6= 0 are in it. Clearly, this is a convex cone; but it is not a closed
set.
3.3
Strong Duality Theorem
Again consider the primal-dual pair:
Primal
min
c> x
s.t.
Ax = b
x K 0
Dual
max b> y
s.t.
A> y + s = c
s K∗ 0
Theorem 4 (Strong duality Theorem) Let both Primal and Dual be feasible with finite solution. Furthermore, let z1 be theoptimal
solutionfor Primal
A
b
and z2 the optimal solution for Dual. Let b 0 =
and A 0 =
, and
z2
c>
assume that A 0 (K) is closed. Then z1 = z2 .
Proof: By Weak Duality Theorem, we know that z1 ≥ z2 .
Suppose that z1 > z2 . It implies that Ax = b, c> x = z2 , x K 0 has no
solution. By changing notation, we get that A 0 x = b 0 , x K 0 has no solution.
Applying the Generalized Farkas Lemma, ∃(y, y0 ) such that (1) A> y+y0 c K∗
0 and (2) b> y + z2 y0 < 0. Now we can prove the theorem in three cases.
3.3. STRONG DUALITY THEOREM
29
i. y0 = 0: (1)&(2) become A> y K∗ 0, b> y < 0. Applying the Generalized
Farkas Lemma again, we get that Ax = b, x K 0 is infeasible. This
contradicts that fact that Primal is feasible. (Observation: A 0 (K) is closed
implies that A(K) is closed. Therefore, it is no problem to apply the
Generalized Farkas Lemma.)
y
ii. y0 < 0: Divide (1) by −y0 , get A> ( −y
)−c K∗ 0. Primal is feasible and
0
optimal solution is finite. Therefore, ∃x+ such that Ax+ = b, x+ K 0
y
y
)−(x+ )c ≥ 0 ⇒ b> ( −y
)−z1 ≥ 0.
and c> x+ = z1 . Hence, (x+ )> A> ( −y
0
0
This contradicts the fact that z1 > z2 .
y
3. y0 > 0: Divide both (1) and (2) by y0 , we get that −A> ( −y
) + c K∗ 0
0
y
>
and −b ( −y0 ) + z2 < 0. This contradicts the optimality of z2 .
Again it is important to assume that A0 (K) is a closed cone, otherwise the
strong duality theorem need not hold. To see a counter example consider the
primal-dual pair:
min 
x1
s.t.
0
x1
0
x1
x2
0

0
0 <0
x1 + 1
max −z
2
z1
 1−z2
s.t.
2
z3
1−z2
2
0
z4

z3
z4  < 0
z2
We will see in a minute that they are indeed duals of each other. However, it is
easily seen that the optimal value of the minimization problem is x1 = 0, indeed
this is the only value x1 can have (why?). On the other hand for the dual, the
only possible value for z2 = 1 which makes its optimal value equal −1; we have
a positive gap. If you write this example in standard form and examine A and
A0 we will see that A(P3×3 ) is not closed (try it).
Table of Duality Rules
The relation of constraints and variables is given by the table below.
MIN
Max
K
K∗
∗
Variables
K
Constraints
K∗
Unrestricted
=
K
K∗
Constraints
K∗
Variables
K∗
=
Unrestricted
30
CHAPTER 3. DUALITY THEOREY
Example 8 (Semidefinite and non-negativity constraints) Here
is an example to show how to apply this table.
min c> x = c1 x1 + · · · + cn xn
s.t. x1 A1 + · · · + xn An A0
x1 ≥ 0, x2 ≤ 0
⇔Y
Using the table we get
max A0 • Y
s.t. A1 • Y = C1
· · · A1 • Y ≥ C1
· · · A2 • Y ≤ C2
· · · An • Y = Cn
Y0
⇔ x1
⇔ x2
⇔ xn
for i = 3, . . . , n
Exercise: Try to use the table and prove the counter example to
strong duality given above is indeed a pair of dual problems.
Chapter 4
Complementary Slackness
Scribe: Haengju Lee
10/1/2001
4.1
Overview
We examine the dual of the Fermat-Weber Problem. Next we will
study optimality condition in the form of generalized complementary
slackness theorem. Finally we start the study of the eigenvalue
optimization problem as a semidefinite program.
4.2
Dual of the Fermat Weber Problem
Recall that the Fermat-Weber problem seeks a point in m dimensional space whose Euclidean distance from a set of given n points is
minimum (see lecture 1). Given points v1 , v2 , . . . , vn ∈ Rm , weights
w1 , w2 , . . . , wn , this problem can be formulated as follows.
min
n
X
wi kvi − xk
i=1
The problem can be written equivalently as a cone-LP over Q,
the second order cone:
min w1 z1 + . . . + wn zn
s.t. zi ≥ kvi − xk, i = 1, . . . , n.
32
CHAPTER 4. COMPLEMENTARY SLACKNESS
But,
zi ≥ kvi − xk
zi
⇐⇒
Q 0
x − vi
zi
0
⇐⇒
Q
vi
x
0
^ Q
⇐⇒ zi e + x
vi
0
^=
where e = (1, 0, . . . , 0)T and x
. Now the cone-LP formulax
tion is:
Primal
min
w1 z1 + . . . +wnzn
0
yi0
^ Q
s.t.
zi e + x
, i = 1, . . . , n
⇐⇒
vi
yi
yi0
If we define dual variable
corresponding to the the second
yi
order cone inequality in the Primal then the dual can be formulated
as:
Dual P
n
T
max
i=1 vi yi
s.t.
y0i = wi , i = 1, . . . , n
⇐⇒ zi
y
+
.
.
.
+
y
=
0
⇐⇒
xi
n
1 y0 i
Q 0
⇐⇒ since they arise from Q .
yi
After simplification (for instance eliminating yi0 ) we get:
Dual P
n
max
vTi yi
Pi=1
n
s.t.
i=1 yi = 0
kyi k ≤ wi .
The dual of Fermat Weber problem has an interesting interpretation
in dynamics. Let us assume that wi are weights of objects hanging
from threads that go through a set of holes in a table. We are to
4.3. DUALITY IN DIFFERENT SPACES
33
take the other ends of the threads and tie them up at a position
of equilibrium, and spend minimal amount of energy. Then the yi
are interpreted as forces, and they must add up to zero so that we
have equilibrium. The condition kyi k ≤ wi simply states that the
magnitude of the force exerted at the knot by the ith object cannot
be larger than its weight. Assuming that the optimal location
P ∗ is
x∗ we can write the value
of
the
objective
function
as
−
i (x −
P
vi )T yi because (x∗ )T i yi = 0. Then the objective is simply the
location with minimum potential energy. (Question: Can you give
an interpretation of Primal and explain why the primal and dual
problem are equal at the optimum?)
w2
w4
w1
w5
w3
4.3
Duality in different spaces
In many situations a m-dimensional cone in can be expressed as
the intersection of another n-dimensional cone and a linear space:
K1 = K ∩ L where n > m. Then, remembering that a linear
space is also a cone and its dual as a cone is simply its orthogonal
complement L⊥ (why?), we get K∗1 = K∗ + L⊥ . Here K∗1 is the dual
of K1 in the space <n . But if we can get the dual in the space L
34
CHAPTER 4. COMPLEMENTARY SLACKNESS
then the dual cone will be m-dimensional and different from K∗1 ; let
us call the dual of K1 in the space L K+
1 . If it is at all possible to
find a good characterization of K+
we
should
use that instead of K∗1 .
1
Let us look at an example and see what would the problems be if
we don’t.
In linear programming our cone is the non-negative orthant <n+
and cone-LP is simply the ordinary LP:
Primal
min
cT x
s.t.
aTi x = bi i = 1, . . . , m
xi ≥ 0 i = 1, . . . , n
Dual
T
max
b
Py
s.t.
i = 1, . . . , m
i yi ai + si = ci
si ≥ 0 i = 1, . . . , n
Now suppose that we express the non-negative orthant as the intersection of positive semidefinite cone and the linear space L which
consists of only diagonal matrices, that is X ∈ L iff xij = 0 for all
i 6= j. We define diagonal matrices C = Diag(c) and Ai = Diag(ai ),
that is a matrix whose diagonal entries j, j are cj (or (ai )j ), and
non-diagonal entries i, j are all zeros. Now the primal linear programming problem can be written as a semidefinite programming
problem.
Primal : min{C•X | Ai •X = bi , for i = 1, . . . , m, Xij = 0 for i 6= j, X < 0}
Note that the condition Xij = 0 is the same as (Eij + Eji ) • X = 0
where Eij is the matrix with all entries 0 except the i, j entry which
is one. Now taking the dual of this SDP we arrive at a problem that
is not equivalent to the dual of the LP:
X
X
Dual : max{bT y |
yi Ai +
sij (Eij + Eji ) C}
Even if the original LP problem has unique primal and dual solutions
it is unlikely in general that the dual of the SDP formulation
have
P
unique solutions. The constraints in the dual imply that yi ai ≤ c
but there are in general infinitely many sij that can be added to a
set a given optimal y. The lesson is that it is not a good idea to
formulate an LP as an SDP (which was obvious at the outset). But
for the same reason it is not generally a good idea to express the
dual of a cone-LP over K1 ∩ L as K∗1 + L⊥ .
As another example consider the second order cone Q. Now we
know that x ∈ Q iff Arw x < 0. Thus again SOCP can be expressed
4.4. GENERALIZATION OF COMPLEMENTARY SLACKNESS
CONDITIONS
35
as an SDP: write Q = Pn×n ∩ L where L is the linear space saying
matrix X is arrow shaped, i.e. Xij = 0 if i 6= j and i 6= 0 and j 6= 0,
and Xii = Xjj for all i, j. But again formulating SOCP as and SDP
is not a good idea. If we form the dual as an SDP we will have extra
and unnecessary variables that play no essential role and can make
the solution numerically unstable, even if the original SOCP does
not have numerical problems. In future lectures we will see even
more compelling reasons why the SOCP poblem should be treated
in its own right rather than as a special case of SDP.
4.4
Generalization of Complementary Slackness
Conditions
Consider the pair of cone-LP problems
Primal
min
cT x
s.t.
Ax = b
x K 0
Dual
max
bT y
s.t.
AT y + s = c
s K∗ 0.
We studied before that at the optimum the following three relations hold:
x K 0
s K∗ 0 and
xT s = 0.
In the case of LP, SDP and SOCP these conditions actually imply
stronger relations which we now examine.
Example 9 (non-negative orthant) When K = K∗ = <n+ , at
the optimum, xi ≥ 0 for i = 1, . . . , n, si ≥ 0 for i = 1, . . . , n, and
xT s = 0 imply xi si = 0 for i = 1, . . . , n because sum of a set of
non negative numbers xi si is zero implies that each of them must
be zero. This is the familiar complementary slackness theorem of
linear programming.
Example 10 (the semidefinite cone) When K = K∗ = Pn×n
the optimal, X < 0, S < 0, and X • S = tr(XS) = 0. Since
the√matrix S
S can be expressed as S = QT ΩQ =
√ is symmetric
T
1/2 1/2
T
Q ΩQQ ΩQ = S S , where Q is an orthogonal matrix,
36
CHAPTER 4. COMPLEMENTARY SLACKNESS
and Ω a diagonal matrix containing eigenvalues of S on its diagonal. This shows that each positive semidefinite matrix has a unique
positive semidefinite square root which is denoted by S1/2 . Now,
0 = tr(XS) = tr XS1/2 S1/2 = tr S1/2 XS1/2
This implies that S1/2 XS1/2 = 0 because S1/2 XS1/2 is also a positive
semidefinite matrix, with non-negative eigenvalues and trace zero.
Since trace is sum of eigenvalues, this is possible only when all eigenvalues are zero, which, in the case of symmetric matrices, implies
that the matrix S1/2 XS1/2 is zero. Thus 0 = (S1/2 X1/2 )(X1/2 S1/2 ) =
AT A. We now that AAT = 0 iff A = 0, thus X1/2 S1/2 = 0 which
implies XS = 0. We have shown:
Theorem 5 (Complementary slackness theorem for SDP) If
X is optimal for the primal SDP, and (y, S) optimal for the dual
SDP, and duality gap X • S = 0, then XS = 0.
Example 11 (The second order cone) When K = K∗ = Q, we
have x Q 0, s Q 0 and xT s = 0, where x, s ∈ <n+1 , and x and s
are indexed from 0. This means that x0 ≥ kx̄k, and s0 ≥ ks̄k, and
xT s = 0. or equivalently,
P 2
xi s0
2
2
2
x0 ≥ x1 + · · · + xn ⇒ x0 s0 ≥
(4.1)
x0
P 2
si x0
2
2
2
s0 ≥ s1 + · · · + sn ⇒ x0 s0 ≥
(4.2)
s0
−x0 s0 = x1 s1 + · · · + xn sn
(4.3)
Now, adding (4.1), (4.2) and (4.3) we get
X x2 s0 s2 x0
i
i
0≥
+
+ 2xi si
x0
s0
X x2 s2 + s2 x2 + 2xi si x0 s0 i 0
i 0
=
x0 s0
X (xi s0 + si x0 )2
⇒0≥
x0 s0
Again, sum of a set of non-negative numbers is less that or equal to
zero. Therefore all of them must be zero. We thus have xi s0 +x0 si =
0, i = 1, . . . , m and xT s = 0
4.5. A GENERAL COMPLEMENTARY SLACKNESS THEOREM
37
We have shown
Theorem 6 (Complementary slackness for SOCP) If x Q 0,
s Q 0, and xT s = 0, then x0 si + xi s0 = 0 for i = 1, . . . , n. This
conditions (along with xT s = 0) can be written more succinctly as
Arw (x) Arw (s)e = 0
We have implicitly assumed that x0 6= 0 and s0 6= 0. if x0 = 0 ≥ kx̄k
then this implies that x = 0 and the theorem above is trivially true.
The same holds for when s0 = 0.
4.5
A general complementary slackness theorem
For a proper cone K ⊆ <n , define C(K)
x
T
C(K) =
| x K 0, s K∗ 0, x s = 0 ⊆ <2n
s
Now, on the surface, the set C(K) seems to be a (2n−1)-dimensional
set: Its members have 2n coordinates and since xT s = 0 we are left
with 2n − 1 “degrees of freedom”. The condition x ∈ K by itself
does not impose restriction on the dimension of the set, nor does
the condition s ∈ K∗ . Nevertheless it turns out C(K) is actually an
n-dimensional set! Here is why:
Theorem 7 There is a one-to-one and onto continuous mapping
from C(K) to <n .
Before we proceed to the proof we recall the following basic
Fact 1 Let S ⊆ Rn be a closed convex set and a ∈ <n . Then there
is a unique point x = ΠS (a) in S which is closest to a, i.e. there is
a unique point x ∈ S such that x = argminy∈S ka − yk.
The unique point above is called projection of a on to S. The proof
of this fact can be found in many texts and is based on Weierstrass’s
theorem. Now we give proof of Theorem 7.
Proof: Let a ∈ <n be any arbitrary point and define s = x − a. we
will first show that s ∈ K∗ , and then show that the correspondence
between a and (x, s) is a one-to-one, onto and continuous. First we
38
CHAPTER 4. COMPLEMENTARY SLACKNESS
show that s ∈ K∗ . For every u ∈ K, define convex combination
uα = αu + (1 − α)x where 0 ≤ α ≤ 1. Again we define ζ(α) =
ka − uα k2 . Then ζ(α) is a differentiable function on the interval
[0, 1] and min0≤α≤1 ζα is attained at α = 0.
Claim:
dζ ≥0
dα α=0
proof of Claim: Otherwise ∃α in some neighborhood of 0, such
that ka − uα k < ka − u0 k contradicting the fact x = u0 is the
closest point to a in K.
From this claim,
dζ = −2(a − x)T (u − x) ≥ 0
dα α=0
⇐⇒ 2(x − a)T (u − x) ≥ 0
⇐⇒ 2sT (u − x) ≥ 0.
(4.4)
This latter inequality is true for any u ∈ K. If we choose u = 2x
then we get sT x ≥ 0. If we choose u = x/2 then sT x ≤ 0. We
conclude that xT s = 0. If we plug this into (4.4) we get sT u ≥ 0
which means s ∈ K∗ . Thus, for each a we get a pair (x, s) ∈ C(K).
Clearly each a results in a unique (x, s) as x the projection is unique
and thus so is s = x − a. Also, both projection operation and
s = x − a are continuous.
Conversely, if (x, s) ∈ C(K), then we can set a = x − s. All we have
to show now is that projection of a onto K is x. Assume otherwise.
Then there is a point u ∈ K such that ka − uk < ka − xk that is
⇐⇒ (a − x)T (a − x) > (a − u)T (a − u)
⇐⇒ xT x − 2(x − s)T x > uT u − 2(x − s)T u noting that xT s = 0,
⇐⇒ 0 > uT u + xT x − 2xT u + 2uT s
⇐⇒ 0 > ku − xk2 + 2sT u
which implies that sT u < 0, contradicting the fact s ∈ K∗ . (This
proof is due to Osman Güler.)
Example 12 (Dual of half line) Let us see what C(K) looks like
in the case of half-line, that is when K = K∗ = <+ .
x
C(K) =
| x ≥ 0, s ≥ 0 ⊆ R2
s
4.6. EIGENVALUE OPTIMIZATION
39
In other words, C(<+ ) is the union of non-negative part of the x
and s axes: it is the real line < bent at the origin by a 90◦ angel.
Now the implication of this theorem is that since C(K) is n-dimensional,
then there must exist a set of n equations, that are independent in
some sense and define the manifold C(K). These n equations are
precisely the complementary slackness conditions. In case of nonnegative orthant, semidefinite and second order cones we were able
to get these equations explicitly. When the cone K is given by a set
of inequalities of the form gi (x) ≤ 0 for i = 1, . . . , n, and gi (x) are
homogeneous and convex functions, then the classical Karush-KuhnTucker conditions gives us a method of obtaining these equations.
4.6
Eigenvalue Optimization
In this section we relate the eigenvalues λ1 (A) ≥ λ2 (A) ≥ · · · ≥
λn (A) for some A ∈ Sn×n .
Let us find an SDP formulation of the largest eigenvalue, λ1 (A).
This problem can be formulated by primal and dual SDPs as follows.
Primal
min
z
s.t.
zI A
Dual
max
A•Y
s.t.
I • Y = tr(Y) = 1
Y0
The primal formulation simply says find the smallest z such that z is
larger than all eigenvalues of A. But z is larger than all eigenvalues
of A iff zI − A is positive semidefinite. The dual characterization is
obtained by simply taking dual. Now define the feasible set of the
dual to be S, that is
Definition 4
S = {Y ∈ Sn×n | tr(Y) = 1, Y 0}
(4.5)
E = {qqT | kqk = 1}
(4.6)
We can characterize the extreme points of S as follows:
Theorem 8 S is a convex set and the set of extreme points of S is
E.
40
CHAPTER 4. COMPLEMENTARY SLACKNESS
Proof: Convexity of S is obvious, since it is the intersection of
the semideinite cone and an affine
P set. Y 0 implies that Y =
ω1 q1 qT1 + · · · + ωk qk qTk where
ωi = 1, ωi ≥ 0, and kqi k = 1.
This shows that the extreme points of S are among elements of E.
Now we prove that all elements of E are extreme points. Otherwise
for some qqT there are p and r with kpk = krk = 1 and qqT =
√
T
√
√
√
αppT +(1−α)rrT =
αp
1 − αr
αp
1 − αr . If α 6= 0
or 1 we will have a contradiction to the fact that rank(qqT )=1. so
qqT are extreme points.
Since the optimum of a linear function over a convex set is attained
at an extreme point, it follows that the Y ∗ that maximized A • Y
in the dual characterization above is of the form Y ∗ = qqT , with
kqk = 1. That is
λ1 (A) = max qT Aq
kqk=1
This is a well-know result in linear algebra that we have proved using
duality of SDP. In future lectures we will use this characterization to
express optimization of eigenvalues over an affine class of matrices.
Chapter 5
The Lovász ϑ Function
Scribe: Anton Riabov
10/08/2001
5.1
Overview
We continue studying the maximum eigenvalue SDP, and generalize
it to an affine set of matrices. Next, we define Lovász Theta function and prove Lovász’s “sandwich theorem”. Finally, we introduce
further extensions to the eigenvalue SDP for finding minimum sum
of k largest eigenvalues over an affine set of matrices.
5.2
Finding Largest Eigenvalue (Continued)
In the previous lecture we have introduced sets
S := {X : tr X = 1, X < 0},
E := {qqT : kqk2 = 1}.
We have shown that E ⊂ S and members of E are extreme points
of S. For a fixed symmetric matrix A we have defined the largest
eigenvalue SDP as:
λ1 (A) = min z
s.t. zI < A
42
CHAPTER 5. THE LOVÁSZ ϑ FUNCTION
The dual to this problem is:
λ1 (A) = max A · Y
s.t. tr Y = 1
Y<0
Note that S is exactly the feasible set of the dual. As we know,
optimal value is attained at an extreme point of these set, i.e. Y ∗ ∈ E
if and only if Y ∗ is an optimal solution of the dual. Thus there exists
q such that Y ∗ = qqT . So, we can write:
λ1 (A) = maxY · A = max qqT · A = max qT Aq.
kqk2 =1
kqk2 =1
Thus,
λ1 (A) = max qT Aq,
kqk2 =1
(5.1)
which is a well-known result from linear algebra, that we have proven
using semi-definite programming duality.
Note 1 Similarly we can show that the smallest eigenvalue λn (A)
can be expressed as:
λn (A) = min qT Aq.
kqk2 =1
5.3
Lovász Theta Function
In this section we describe extensions of the largest eigenvalue problem. Interested students are referred to the classic book “Geometric
Algorithms and Combinatorial Optimization” by M. Grötschel, L.
Lovász and A. Schrijver, Chapter 9.
5.3.1
Optimizing Largest Eigenvalue Over an Affine Set
Suppose now that we are solving the same largest eigenvalue problem, but the matrix A is not fixed, and we are looking for a way
to minimize the largest eigenvalue over a set of matrices. For example, consider the linear combination of symmetric matrices, A :=
A0 + x1 A1 + · · · + xm Am . The problem in this case translates into
the following unconstrained optimization problem:
min λ1 (A0 + x1 A1 + · · · + xm Am ).
x∈Rm
5.3. LOVÁSZ THETA FUNCTION
43
It can be shown that this function is not linear. Furthermore, it
is not smooth, and this makes optimization a very difficult task.
However, we will prove that this function is convex, and therefore
the task of finding it’s minimum should be tractable.
Proposition 2 Function f(x) := λ1 (A0 + x1 A1 + · · · + xm Am ) is
convex.
Proof: We will prove that λ1 (A + B) ≤ λ1 (A) + λ1 (B), and the
result will follow. Using (5.1), we can state that there exists q such
that:
λ1 (A + B) = qT (A + B)q = qT Aq + qT Bq
The value of q is feasible, but not optimal, if we apply (5.1) to
A and B separately. Therefore,
qT Aq + qT Bq ≤ λ1 (A) + λ1 (B),
and the result follows.
We can rewrite the problem as an equivalent semidefinite programming problem:
min z
s.t. zI − x1 A1 − · · · − xm Am < A0
The dual for this problem is:
max A0 · Y
s.t. tr Y = 1
Ai Y = 0
Y<0
People have studied this problem from the point of view of unconstrained optimization. It turns out to be a very difficult problem,
when formulated this way. The graph of the function is not smooth.
Newton method can not be applied directly, since taking derivative
is not always possible. Semidefinite programs, in turn, can be solved
efficiently with any given precision. We will describe algorithms for
solving SDPs later in the course.
44
5.3.2
CHAPTER 5. THE LOVÁSZ ϑ FUNCTION
Maximum Clique, Stable Set and Graph Coloring
Problems
Assume G = (V, E) is an undirected graph, no loops attached to
vertices, i.e. edges (i,i) are not allowed in G.
Definition 5 Maximum clique problem: find a maximum fully connected subgraph of G (a clique). I.e. find a subgraph G0 = (V0 , E0 ), V0 ⊆
V, E0 ⊆ E such that ∀i, j ∈ V0 , i 6= j ⇒ ∃(i, j) ∈ E0 and |V0 | is maximal.
The maximum clique problem is NP-complete. In fact, it was
one of the original problems shown to be NP-complete.
1
5
2
4
3
Figure 5.1: An example of an undirected graph.
Example 13 (Maximum Clique) Consider the graph on Figure 5.1.
In this graph the cliques of size more than 2 are {2,3,4,5} and
{1,2,5}. The former clique is maximum.
Definition 6 Maximum independent (or stable) set problem: find a
maximum set of vertices none of which are connected with an edge.
The maximum independent set problem is in some sense a dual
of the maximum clique problem. The maximum independent set of
a graph is the maximum clique of its complement.
Definition 7 Complement graph Ḡ = (V, Ē) of a graph G = (V, E)
is a graph, that has the same set of vertices, and satisfying the following property: every pair of vertices, that is connected in the original
graph is not connected in its complement, and vice versa.
Example 14 (Stable Set) Using the graph in Figure 5.1, the optimal solutions are {1,4} and {1,3}.
5.3. LOVÁSZ THETA FUNCTION
45
1
5
2
4
3
Figure 5.2: Complement of a graph in Figure 5.1.
Example 15 (Complement Graph) Figure 5.2 shows the complement graph for the graph shown on Figure 5.1.
Note 2 Graph G is a complement of Ḡ.
Note 3 Every solution to the maximum stable set problem on a
graph G is a solution for the maximum clique problem on the graph
Ḡ, and vice versa.
Definition 8 Minimum graph coloring problem: assign colors to
vertices such that every pair of adjacent vertices has different colors,
using the smallest possible total number of colors. The number of
colors used in the optimal solution to the problem is referred to as
the chromatic number of the graph.
Finding the chromatic number of a graph is also NP-complete. In
fact, it can be proven that there does not exist an ε-approximation
algorithm for this problem. In other words, for any given ε ≥ 1 it
is impossible to give a polynomial algorithm that for every graph
G finds a number that is guaranteed to be at most εχ(G), unless
P = NP.
We will use the following notation to denote optimal values of
these problems:
• ω(G) - size of maximum clique in graph G;
• α(G) - size of largest stable set in graph G;
• χ(G) - chromatic number of graph G.
46
CHAPTER 5. THE LOVÁSZ ϑ FUNCTION
Proposition 3 ω(G) ≤ χ(G).
Proof: The vertices of each clique must have different colors.
Example 16 (Strict Inequality) However, the equality does not
hold all the time. The example of a cycle of 5 vertices shows that
in some cases the inequality above is strict.
1
5
2
4
3
Figure 5.3: Odd cycle example.
It is easy to see that the graph shown in Figure 5.3 can not be
painted with less than 3 colors, therefore χ(G) = 3. However, the
largest clique in this graph has only 2 vertices, and ω(G) = 2 < 3 =
χ(G). This is actually true for any odd cycle.
Proposition 4 If a symmetric matrix A ∈ Rn×n and its submatrix
B ∈ Rk×k are such that
A=
k
k
B
CT
C
,
D
then
λ1 (B) ≤ λ1 (A).
Proof: From (5.1) it follows that there exists q ∈ Rk , satisfying
kqk2 = 1, such that:
λ1 (B) = qT Bq.
Recall that the value of vector q, at which maximum in (5.1) is
achieved, is the eigenvector corresponding to the eigenvalue λ1 (B).
Further,
B C
q
T
T
≤ λ1 (A),
λ1 (B) = q Bq = q 0
T
0
C D
where the inequality follows from the fact that vector qT 0 is not
necessarily optimal for the entire matrix A in terms of (5.1).
5.3. LOVÁSZ THETA FUNCTION
47
Corollary 2 If A ∈ Rn×n is a symmetric matrix and its submatrix
B ∈ Rk×k is composed of the rows and columns corresponding the
same index subsets, i.e. for each column i of A that is included in B,
the row i of A is included in B, and vice versa, then λ1 (B) ≤ λ1 (A).
Proof: This corollary states that the submatrix B does not have
to be in the corner of A, it can be distributed over A, as long as
corresponding rows and columns are participating. We will make
use of permutation matrices to prove this.
The matrix A can be transformed to have the structure required
in Proposition 4 via multiplication to permutation matrices. A permutation matrix is a square matrix composed of 0 and 1 elements,
and having only one non-zero in each row and in each column. If
the permutation matrix P rearranges columns, when multiplied on
the right side, then PT on the left side results in the rearrangement of corresponding columns. Therefore there exists a permutation matrix P such that PT AP has the required structure. It is
known from linear algebra that permutation matrices are orthogonal, and PT = P−1 . Therefore permutation does not change eigenvalues, and the matrix PT AP has the same eigenvalues as A. Thus,
λ1 (B) ≤ λ(PT AP) = λ1 (A), where the inequality follows from Proposition 4.
Let us now apply this result to adjacency matrix A of an arbitrary
graph G. An adjacency matrix is defined as A = (aij ), where aij = 1
if there exists an edge (i, j), and aij = 0 otherwise.
We will illustrate the procedure on the example graph in Figure 5.1. For this graph, the adjacency matrix A is




0 1 0 0 1
1 1 0 0 1
1 0 1 1 1
1 1 1 1 1







A = 0 1 0 1 1 and A + I = 
0 1 1 1 1 .
0 1 1 0 1
0 1 1 1 1
1 1 1 1 0
1 1 1 1 1
It is easy to see that cliques in the graph correspond to submatrices of ones in the matrix A + I, composed of the corresponding
rows and columns. Now we can apply Corollary 2. Thus, for each
such clique submatrix Jk = 11T of size k × k,
λ1 (Jk ) ≤ λ1 (A + I).
(5.2)
48
CHAPTER 5. THE LOVÁSZ ϑ FUNCTION
The eigenvalues of Jk are easy to compute. It is a rank 1 matrix,
therefore it has only one non-zero eigenvalue, and it is easy to see
that the vector 1 is the eigenvector corresponding to this non-zero
value, and the value itself is equal to k. Thus,
λ1 (Jk ) = k.
Substituting in (5.2),
k ≤ λ1 (A + I).
Since this equation holds for any clique size k, it also holds for the
maximum clique size:
ω(G) ≤ λ1 (A + I).
5.3.3
(5.3)
Lovász Theta Function
Let us see if the inequality (5.3) can be made tighter, so we can
obtain a better estimate of ω(G). Note that Corollary 2 can still
be applied in the same way, and (5.3) holds, even if the zeros in
A + I are replaced with arbitrary values, as long as the matrix A + I
remains symmetric. The reason for this is that there are no zeros
inside any clique submatrix Jk , so that part (corresponding to B in
Corollary (2) remains unchanged. Formally, let
1,
if (i, j) ∈ E or i = j;
[A(x)]ij =
xij = xji , otherwise.
Now we can write a stronger version of inequality (5.3):
ω(G) ≤ min λ1 (A(x)) .
x
(5.4)
Definition 9 The right hand side of (5.4) is referred to as Lovácz
θ-function of a graph. It is denoted θ(G) = minx λ1 (A (x)).
Note that A(x) defined above can be written as A(x) = A0 +
x1 A1 + x2 A2 + · · · + xm Am . The example below illustrates how this
can be done.
Example 17

1 x
x 1
y 1
(Rewriting A(x) as sum)

y
1  = x(E12 + E21 ) + y(E13 + E31 ) + (A + I).
1
5.3. LOVÁSZ THETA FUNCTION
49
We know that ω(G) ≤ χ(G). Now we can formulate the following
theorem, that makes a stronger statement. This theorem is also
known as Lovász’s “sandwich theorem”.
Theorem 9 ω(G) ≤ θ(G) ≤ χ(G).
Before we prove this theorem, let us make several observations
regarding graph sums.
Definition 10 Given graphs G1 = (V1 , E1 ), G2 = (V2 , E2 ) the sum
of graphs G1 ⊕ G2 is a graph formed by putting together graphs G1
and G2 . In other words, G1 ⊕ G2 = (V1 ∪ V2 , E1 ∪ E2 ).
The following propositions are straightforward to prove, so the
proofs are not given.
Proposition 5 ω(G1 ⊕ G2 ) = max{ω(G1 ), ω(G2 )}.
Proposition 6 χ(G1 ⊕ G2 ) = max{χ(G1 ), χ(G2 )}.
Although given the two propositions above, the following proposition may be anticipated, the proof is not as obvious.
Proposition 7 θ(G1 ⊕ G2 ) = max{θ(G1 ), θ(G2 )}.
Proof: Write an SDP for finding the value of θ(G) :
θ(G) = min z
s.t.
zI − X < 0
xij = 1
if i = j or (i, j) ∈ E
Here is the dual of this problem:
X
X
max
yij +
yii = (A + I) • Y
ij∈E
s.t.
tr Y = 1
yij = 0
Y<0
(i, j) 6∈ E
50
CHAPTER 5. THE LOVÁSZ ϑ FUNCTION
Now let us write the dual for graph composition G1 ⊕ G2 . The
dual variable will have the following structure (since there are no
edges between the graphs in the sum):
Y1 0
.
0 Y2
Thus, the dual program is
max (A + I) • Y1 + (A + I) • Y2
s.t.
tr Y1 + tr Y2 = 1
yij = 0
(i, j) 6∈ E
Y1 , Y 2 < 0
(5.5)
The formulation above resembles the original dual formulation.
It could be separated into two problems (for graphs G1 and G2 ) and
the proposition will be proven immediately, if not for the constraint
(5.5), that makes Y1 dependent on Y2 .
Note that we can replace Y with αY, in the dual problem, for
some α > 0, and the objective still remains the same, except that
everything is now scaled down by α. Now let us choose an α such
that 0 ≤ α ≤ 1, and replace (5.5) with the one of following two
constraints, and solve the two dual problems:
tr Y1 = α
tr Y2 = 1 − α
The solutions to these independent problems will be, as we know,
αθ(G1 ) and (1 − α)θ(G2 ) correspondingly. The solution to the dual
problem for G1 ⊕ G2 is then:
max αθ(G1 ) + (1 − α)θ(G2 ) = max{θ(G1 ), θ(G2 )},
0≤α≤1
where the equality follows from the linearity of the optimal solution
in α.
Now we are ready to prove Theorem 9.
Proof: Denote k = χ(G). The nodes of the graph G can be partitioned into k classes according to their color. We can make all
classes contain the same number of nodes (r) by introducing additional singleton nodes, that are not connected to any nodes in the
5.3. LOVÁSZ THETA FUNCTION
51
graph, and therefore can have any color. This procedure does not
change θ(G) according to Proposition 7, since θ of a single vertex
is 1. Denote this new graph G1 . Then,
θ(G) = θ(G1 ),
|V(G1 )| = kr.
Consider matrix A(α) ∈ Rkr×kr ,

(αJr + (1 − α)Ir )
Jr

Jr
(αJr + (1 − α)Ir )
A(α) := 

···
···
Jr
Jr

···
Jr

···
Jr


···
···
· · · (αJr + (1 − α)Ir )
Note that (αJr + (1 − α)Ir ) is a symmetric r × r matrix having the
following structure:


1
α
α ··· α
α
1
α ··· α 



α
α
1
·
·
·
α
(αJr + (1 − α)Ir ) = 


· · · · · · · · · · · · · · · 
α
α
α ··· 1
Jr , as before, is a matrix of all ones of size r × r.
It is easy to see that matrix A(α) is a special case of (A(x))
matrix for graph G1 . Vertices within a color class are not connected,
and if we rename them so that classes follow one another, we will
have zero matrices of size r × r on the diagonal in adjacency matrix
for G1 . Now we can replace all zeros with variables, and we choose
to replace some of them with α, which preserves matrix symmetry.
Since it is a special case, and by definition of θ(G1 ),
θ(G1 ) ≤ λ1 (A(α)).
Note that we can rewrite A(α) as:
A(α) = Jkr − (1 − α)(Jr ⊕ Jr ⊕ · · · ⊕ Jr ) + (1 − α)I.
(5.6)
Now we would like to find the value of λ1 (A(α)) as a function of
α in closed form, and choose α such that λ1 (A(α)) ≤ k = χ(G1 ) =
χ(G). Note that α does not have to be positive.
Eigenvalue of matrix sum is equal to sum of eigenvalues, if the
matrices commute. It is known from linear algebra that for matrices
52
CHAPTER 5. THE LOVÁSZ ϑ FUNCTION
A and B, AB = BA if and only if A and B share a system of common
eigenvectors. This is almost obvious for symmetric matrices, we can
make use of eigenvalue decomposition: A = QT DQ, B = QT CQ,
then A + B = QT (D + C)Q if the matrices commute.
We claim that Jkr , (Jr ⊕Jr ⊕· · ·⊕Jr ) and I commute. I is easy, since
it commutes with every matrix. It is also straightforward to verify
that the vector of all ones, and everything orthogonal to it, can be
taken as eigenvectors for Jkr . Vector of all ones is also an eigenvector
of (Jr ⊕ Jr ⊕ · · · ⊕ Jr ); thus take any other set of eigenvectors for it,
its members are orthogonal to 1, and so are also eigenvectors of Jkr ,
and this proves the claim.
The table below summarizes the eigenvalues of the 3 terms in the
sum (5.6):
Jkr
kr
0
...
0
0
0
...
0
0
0
...
0
(1 − α)(Jr ⊕ Jr ⊕ · · · ⊕ Jr ) (1 − α)I
−(1 − α)r
1−α
0
1−α
...
...
0
1−α
−(1 − α)r
1−α
0
1−α
...
...
0
1−α
−(1 − α)r
1−α
0
1−α
...
...
0
1−α
Thus in order to complete the proof, we need to show that there
exists an α that satisfies:
kr − (1 − α)r + (1 − α) ≤ k
1 − α − (1 − α)r ≤ k
1−α≤k
Solving the first inequality for α we obtain α ≤ 1 − k. Solving
the last one we get α ≥ 1 − k. Setting α = 1 − k will satisfy all
three inequalities.
As we have just shown, this theorem proves a bound on a very
hard integer program (IP) using results obtained with semidefinite
programming (SDP) analysis. Later we will show that any IP can
5.4. COMPUTING SUMS OF EIGENVALUES
53
be relaxed to an SDP in a similar way. SDP relaxations are typically tighter than relaxations to linear programs, and we will give a
general framework for creating such relaxations.
5.4
Computing Sums of Eigenvalues
Consider further generalization of the optimization problem we have
been working with. As before, define
A(x) = A0 + x1 A1 + x2 A2 + · · · + xm Am .
But now, we want to find the sum of a given number of largest
eigenvalues, instead of just finding one largest eigenvalue:
f(x1 , x2 , · · · , xn ) = (λ1 +λ2 +· · ·+λk )(A0 +x1 A1 +x2 A2 +· · ·+xm Am ).
And, as in previous sections, we would like to find the unconstrained
minimum:
min f(x1 , x2 , · · · , xn ).
x
We can write an SDP for this problem, but the straightforward
approach (writing constraints for all possible sums of eigenvalues)
results in an exponential number of constraints. A more sophisticated approach should be used. We will first illustrate the ideas we
are going to employ on a small linear program.
Example 18 (Linear Programming)
Consider the following linear program (LP):
max w1 x1 + · · · + wn xn
s.t. x1 + · · · + xn = k
0 ≤ xi ≤ 1
(5.7)
Suppose k is an integer. This program finds the k largest values
of w, and sets corresponding values of x to 1.
All extreme points of feasible set of LP are integer. This can be
proven, for example, using complementary slackness. Let us write
the dual of this problem:
X
min kz +
yi
s.t. z + yi ≥ wi
yi ≥ 0
54
CHAPTER 5. THE LOVÁSZ ϑ FUNCTION
Now, using complementary slackness, we can show that a fractional solution is either non-optimal, or is not an extreme point.
Another way to show that all the extreme points of LP are integral is to notice that the matrix of this problem is totally unimodular
(all determinants of its sub-matrices are either 0, 1, or −1).
Either way, it can be shown that the extreme points of this polytope are 0-1 vectors having exactly k ones in them. Introducing just
one constraint on the sum (equation (5.7) in LP) greatly restricts
the degrees of freedom for x.
Chapter 6
Eigenvalue Optimization
Scribe: Eyjolfur Asgeirsson
10/15/2001
6.1
Overview
We will start by formulating the problem of minimizing the k largest
eigenvalues over an affine set of matrices as an SDP program. Then
we will consider some generalizations of that problem, show the
SOCP version of it and finally we will briefly introduce a reformulation of the Lovász Theta function.
6.2
Minimizing the Sum of Eigenvalues
In the last lecture we introduced a small linear program that we
used to illustrate our ideas:
max w1 x1 + . . . + wn xn
(LP) s.t. x1 + . . . + xn = k
0 ≤ xi ≤ 1
56
CHAPTER 6. EIGENVALUE OPTIMIZATION
If k is an integer then this program will find the k largest values of
wi and set the corresponding values of x to 1. The dual of (LP) is:
X
min kz +
yi
s.t.
z + yi ≥ wi
yi ≥ 0
All extreme point of LP are 0-1 vectors with exactly k ones and
n − k zeros.
We wish to solve the optimization problem:
min (λ1 + . . . + λn )(X)
s.t. Ai • X = bi
where x ∈ Sn×m and λi is the ith largest eigenvalue.
Theorem 10 (Filmore & Williams, 1971) Let
Sk = {x ∈ Sn×m : 0 X I, tr X = k}
Ek = QQT : Q ∈ <k , QT Q = Ik
k
X
=
qi qTi : q ∈ <n , kqk = 1, qi ⊥ qj , i 6= j
i=1
Then Conv(Ek ) = Sk and Ek is the set of extreme points of Sk
Proof: It’s easy to see that Ek ⊆ Sk is true and therefore Conv(Ek ) ⊆
Sk . To show that Conv(Ek ) = Sk we only need to show that
Sk ⊆ Conv(Ek ). Let X ∈ Sk . We can write X as
X = QDQT
where
and

D = diag(d1 , . . . , dn ) = 
d1
0
...
0
dn


0 ≤ di ≤ 1
X
di = k
This means that the vector d is feasible for our (LP) by the extreme
point argument we can write it as a convex combination of extreme
6.2. MINIMIZING THE SUM OF EIGENVALUES
57
points of (LP) that is as




d1
0
u1
X
X
...
...
=

=
α i Uj
αi 
un
0
dn
P
where αi ≥ 0 and
αi = 1 and exactly k of the ui equal 1, the
remaining ui are equal to 0. Thus we can write
P
X=
αj QUj QT
Since only k of the ui are P
equal to 1 and the remaining are equal to
0 we can write QUj QT = kr=1 qir qTir which means that
Ext(Sk ) ⊆ Ek
⇒ Sk ⊆ conv(Ek )
Since we have also shown that extreme points of Sk are in Ek we
only need to prove that no member of Ek can be written as
Pa convexT
T
combination of other members of EkP
. Suppose QQ =
αi Qi Qi
where QQT , Qi QTi ∈ Ek , αi > 0 and
αi = 1. The kernel (i.e. the
null space) of QT is a linear space, Ker(QT ) = {a : QT a = 0} and
the dimension of this is: dim(Ker(QT )) = n − k. Then
X
0 = aT QQT a =
αi aT Qi QTi a
Since aT Qi QTi a ≥ 0 we get
aT Qi QTi ai = 0 ⇒ QT a = 0
so the kernel of Q is the same as the kernel of each Qi and a ∈
Ker(QT ).
Q 0 = [Q, R],
Qi0 = [Qi , R]
where Q ∈ <k×n and R ∈ <(n−k)×n
where columns of R are orthonormal and orthogonal to columns of
Q. Then
Q
0 0T
I = Q Q = [Q, R]
R
= QQT + RRT
= Qi QTi + RRT ⇒ QQT = Qi QTi
58
CHAPTER 6. EIGENVALUE OPTIMIZATION
This means that QQT =
shows that Ext(Sk ) = Ek
P
αi Qi QTi where some αi = 1, which
Corollary 3 (Ky Fan) If A is a fixed symmetric matrix then
(λ1 + · · · + λk )(A) = max tr QT AQ
s.t. QT Q = Ik
Proof: Let the maximum value of the optimization problem be λ∗ .
We have [q1 . . . qk ] = Q where qi is the eigenvector corresponding
to the eigenvalue λi . Then
k
X
tr Q AQ = tr(
λi qi qTi ) = (λ1 + . . . + λk )(A)
T
i=1
This shows that
(λi + . . . + λk )(A) ≤ λ∗
Now assume that for some P we have
tr PT AP > (λi + . . . + λk )(A)
We can write tr PT AP as tr(PT QΛQT P) with PT Q = RT where
RT R = Ik . Then
tr PT AP = tr(PT QΛQT P)
= tr
RT ΛR
X
=
λi sTi si
T
where si are the rows of matrix R. Then
P T 0 ≤ si si ≤ 1 (since R is
not a complete n × n matrix,) and
si si = k which means that
we have a contradiction since if we replace xi in (LP) with sTi si and
wi with λi , we get a maximum which is the sum of the k largest
weights. So here we have (λi + . . . + λk )(A) ≥ λ∗ and the proof is
complete.
Corollary 4 We have that
(λ1 + . . . + λk )(A) = max Y • A
s.t. 0 Y I
tr Y = k
6.2. MINIMIZING THE SUM OF EIGENVALUES
59
Proof: By Corollary 1 (λ1 + · · · + λk )(A) is maximum of Y • A
with Y ∈ §k . By Filmore and Williams theorem extreme rays of Sk
are Ek and thus the maximum is attained at some Y = QQT where
QT Q = Ik . Y • A = QQT • A = tr(QQT A) = tr(QT AQ), and the
statement is provedY • A = QQT • A = tr(QQT A) = tr(QT AQ),
and the stamens is proved
Now we can formulate the optimization problem
min (λ1 + · · · + λk )(X)
s.t. Ai • X = bi for i = 1, . . . m
as an SDP. First we find the dual of the SDP characterization of
(λ1 + · · · + λk )(A):
min kt + tr V
s.t. tI + V A
V 0
We now replace A with X and blend in the constraints on X:
min kt + tr V
s.t. tI + V − X 0
Ai • X = bi for i = 1, . . . , m
V 0
and the dual of this is:
max bT y
s.t. I • Y = k
YI
X
yi Ai − Y = 0
Y0
6.2.1
An Application of Sum of Eigenvalues in Graph Theory
Assume we have an undirected graph G = (V, E) with no loops
attached to vertices as defined lecture 5.
60
CHAPTER 6. EIGENVALUE OPTIMIZATION
Definition 11 (Clique Covering Problem) A clique covering of
vertices of a simple graph G = (V, E) is a collection of cliques such
that each vertex is in one of the cliques. The minimum clique covering problem is that of finding the smallest number of cliques such
that each vertex is in one clique.
Remark 3 The minimum clique covering number for a graph G is
equal to the chromatic number of the complement graph G = (V, E).
This is obvious from the fact that any clique in G forms a set of
independent nodes in G and vice versa.
Definition 12 A graph G = (V, E) is k-partite if V can be divided
into the union of k (possibly empty) independent sets.
Definition 13 (Narasimhan and Manber) The generalized Lovász
ϑ function is defined as
def
θk = min (λ1 + . . . + λk )(X)
s.t. Xij = 1 if ij 6∈ E or i = j
We will use the following notation:
• αk (G) = the largest induced k-partite subgraph of graph G
• χ(G) = minimum clique covering of G.
Then we can show that:
αk (G) ≤ θk (G) ≤ kχ(G)
6.3
SOCP Versions
We have a1 , a2 , . . . , an where ai ∈ <ni and ka[1] k ≥ ka[2] k ≥ . . . ≥
ka[n] k and our problem is:
max kx[1] k + · · · + kx[k] k
s.t. A1 x1 + · · · + An xn = b
6.3. SOCP VERSIONS
61
Of course for fixed vectors ai the sum a[1] + · · · + a[k] can be written
as
max x1 ka1 k + · · · + xn kan k
X
s.t.
xi = k
0 ≤ xi ≤ 1
since kai k are just some fixed real numbers at this point. Similarly,
the dual of this is:
X
min kt +
ui
s.t. t + ui ≥ kai k
ui ≥ 0
ni
We want to find x1 , xP
such that A1 x1 +
2 , . . . , xn where xi ∈ <
k
. . . + An xn = b and i=1 kx[i] k is as small as possible. As before
we replace ai with xi in the dual formulation and blend in the
constraints on xi to get:
X
min kt +
ui
s.t. t + ui ≥ kxi k
A1 x1 + . . . + An xn = b
ui ≥ 0
The inequality constraints are in fact SOCP constraints and we can
write this problem as:
X
min kt +
ui
s.t.
t + ui
Q 0 for i = 1, . . . , n
xi
A1 x1 + . . . + An xn = b
ui ≥ 0
62
6.4
CHAPTER 6. EIGENVALUE OPTIMIZATION
Sum of absolute values of eigenvalues
The absolute value of the kth largest eigenvalue of A is |λk (A)|.
Then we have that |λ1 (A)| ≥ |λ2 (A)| ≥ . . . ≥ |λm (A)| and our new
problem is:
min (|λ1 | + . . . + |λk |)(x)
s.t. Ai • x = bi
To see how we can solve this we look at our original (LP) with the
objective that we take the sum of the absolute values. By doubling
the number of variables we can write that problem as a regular LP:
max
s.t.
X
X
wi xi +
(−wi )ui
X
X
xi +
ui = k
0 ≥ ui , xi ≥ 1
and the dual of this is:
min kt +
X
yi +
X
ui
s.t. t + yi ≥ wi
t + zi ≥ −wi
yi , zi ≥ 0
By using the same idea we can write our problem as a regular SDP:
min kt + tr Y + tr U
s.t. tI + Y X
tI + X X
Y, Z 0
Ai • X = bi
6.5. GENERALIZATION TO WEIGHTED SUMS
6.5
63
Generalization to weighted sums
Assume that we have m1 ≥ m2 ≥ . . . ≥ mk > mk+1 = . . . = mn =
0. We want to solve problem:
min (m1 λ1 + . . . + mk λk )(x)
s.t. Ai • x = bi
The condition of nonincreasing mi is necessary to make this problem
convex. We can write m1 λ1 + . . . + mk λk ) as:
m1 λ1 + . . . + mk λk ) = (m1 − m2 )λ1 +
(m2 − m3 )(λ1 + λ2 ) +
(m3 − m4 )(λ1 + λ2 + λ3 ) +
..
.
(mk−1 − mk )(λ1 + . . . + λk−1 ) +
mk (λ1 + . . . + λk )
Since for each mi we have mi ≥ mi+1 so this is a nonnegative
combination of convex functions. We assign a new variable ti to
each line and get a new problem:
min(m1 λ1 + . . . + mk λk )(X) = min
k
X
(mi − mi−1 )(iti + tr Yi )
i=1
s.t. ti I + Yi X
Yi 0
Ai • X = bi
6.6
Reformulation of the Lovász Theta function
In last lecture we talked about the Lovász Theta function and the
maximum clique problem. Assume G = (V, E) is an undirected
graph and w is a vector of weights, one weight for each vertex in V.
Our new problem is then to try to find the heaviest clique instead
of the maximum clique.
64
CHAPTER 6. EIGENVALUE OPTIMIZATION
Definition 14 Heaviest clique problem: find a fully connected subgraph of G (a clique) such that the sum of the weights of the vertices in this subgraph is maximized. I.e. find a subgraph G0 =
(V0 , E
P0 ), V0 ⊆ V, E0 ⊆ E such that ∀i, j ∈ V0 , i 6= j ⇒ ∃(i, j) ∈ E0
and i∈V0 wi is maximized.
When all weights are equal to 1 we have our usual case of finding
the largest clique. We will talk about this in more detail next time.
Chapter 7
Semidefinite Programming
Relaxations of Integer
Programs
Scribe: Yusaku Yamamoto
10/22/2001
7.1
Overview
In this lecture, we define the Lovász theta function in a more general
setting. Then we formulate a semidefinite programming relaxation
of more general integer programs and show that it is a stronger
relaxation than the LP relaxation.
7.2
7.2.1
The Maximum Weighted Clique Problem and
the Maximum Weighted Independent Set Problem
Integer programming formulation
In this section, we reconsider the two graph theoretic problems which
we have introduced in lecture 5, namely, the maximum clique problem and the maximum independent set problem. But here, we consider them in a more general setting, where each node i in the graph
G = (V, E) is assigned a positive integer called weight wi , and look
66
CHAPTER 7. SDP RELAXATIONS
for a clique K or an independent set I for which the sum of the
weights assigned to the nodes belonging to it takes the maximum
value.
We denote the vector of weights by w, and assign each node i
a variable xi which is one if the i ∈ K and zero otherwise. Then
the maximum weighted clique problem can be formulated as a 0-1
integer programming problem as follows:
Definition 15 Maximum weighted clique problem:
ω(G, w) = max wT x
s.t. xi + xj ≤ 1,
xi ∈ {0, 1}
∀i, j ∈
/E
Similarly, the maximum weighted independent set problem can
be expressed as follows:
Definition 16 Maximum weighted independent set problem:
α(G, w) = max wT x
s.t. xi + xj ≤ 1,
xi ∈ {0, 1}
7.2.2
∀i, j ∈ E
LP relaxation
We can consider the LP relaxation of these problems by replacing the
0-1 conditions with inequalities. For example, we have the following
LP for the maximum clique problem:
Definition 17 LP relaxation of the maximum weighted clique problem (1):
zLP (G, w) = max wT x
s.t. xi + xj ≤ 1,
0 ≤ xi ≤ 1
∀i, j ∈
/E
However, the usefulness of this relaxation is rather limited. In
fact, it can be shown that it produces the exact solution of the
original problem if and only if G is a bipartite graph. So, even for
a simple triangular graph, this does not produce a good result.
7.2. THE MAXIMUM WEIGHTED CLIQUE PROBLEM AND
THE MAXIMUM WEIGHTED INDEPENDENT SET PROBLEM
67
To get a better relaxation scheme, we first note that for arbitrary
clique K and arbitrary independent set I, |K ∩ I| ≤ 1 (because if two
distinct points belong to K ∩ I, they must be both connected and
disconnected by definition of K and I, which is a contradiction). Let
a subset S of V be denoted by a 0-1 vector 1S ∈ <n such that
1,
if i ∈ S;
(1S )i =
0,
otherwise.
The vector 1S is called the characteristic vector of S. Using this
notation, for every clique K and every independent set I, we have
1TK 1I ≤ 1. We further define the set of vectors corresponding to
cliques and independent sets in G as
CLIQUE(G) = {x ∈ <n | xi + xj ≤ 1, ∀i, j ∈
/ E, xi ∈ {0, 1}},
n
INDP(G) = {x ∈ < | xi + xj ≤ 1, ∀i, j ∈ E, xi ∈ {0, 1}}.
Then, for ∀x ∈ INDP(G), 1TK x ≤ 1, and for ∀x ∈ CLIQUE(G),
≤ 1.
Using this fact, we can consider another LP relaxation of the
maximum weighted clique problem as follows:
1TI x
Definition 18 LP relaxation for the maximum weighted clique problem (2):
ω∗ (G, w) = max wT x
s.t. 1TI x ≤ 1,
xi ≥ 0.
for each independent set I in G,
We denote the set of vectors satisfying these constraints by K-INDP(G).
Remark 4 I can be restricted to maximal independent sets (an
independent set I is called maximal if there is no independent set
that has I as a proper subset).
Remark 5 ω(G, w) ≤ ω∗ (G, w) holds because 1TI x ≤ 1 is only a
necessary condition for x to be a clique, and this problem is still a
relaxation.
By taking the dual of this problem, we have another LP:
68
CHAPTER 7. SDP RELAXATIONS
Definition 19 LP relaxation for the maximum weighted clique problem (3):
X
yI
χ∗ (G, w) = min
I=independent set
s.t.
X
yI ≥ wi ,
i = 1, ..., n.
I3i
Remark 6 If wi = 1, this can be interpreted as a relaxation of the
coloring problem. In fact, minimum coloring of a graph is equivalent to finding a set of independent sets {I1 , I2 , . . . , Ik } that satisfies
∪ki=1 Ii = V and Ii ∩ Ij = φ for i 6= j and has minimum number
of elements. By defining yI to be one if I belongs to this set and
zero otherwise, we have an integer programming formulation of this
problem. The constraints in the LP above is obtained by deleting
the 0-1 condition and constraints expressing Ii ∩ Ij = φ for i 6= j.
Remark 7 If wi > 1, this can be considered as a relaxation of the
generalized coloring problem, in which each vertex is painted with
wi different colors. We denote this generalized chromatic number
by χ(G, w).
From what we have stated above, we now have
ω(G, w) ≤ ω∗ (G, w) = χ∗ (G, w) ≤ χ(G, w).
By considering the complement graph of G, we have
α(G, w) ≤ α∗ (G, w) = χ̄∗ (G, w) ≤ χ̄(G, w).
We work on this second set of inequalities in the next subsection.
7.2.3
SDP relaxation
The LP relaxations (2) and (3) we have derived in the former subsection have shortcomings that the number of constraints or variables
grows rapidly as the graph G becomes larger, because the number of
independent sets in a graph generally grows exponentially with the
number of nodes. This is true even if we restrict the independent
sets to maximal ones.
In this subsection, we develop an SDP relaxation of the original
integer program that overcomes these difficulties. Moreover, it can
7.2. THE MAXIMUM WEIGHTED CLIQUE PROBLEM AND
THE MAXIMUM WEIGHTED INDEPENDENT SET PROBLEM
69
be shown that the SDP relaxation gives better approximation than
the LP relaxations.
To this end, we first define an orthonormal representation of a
graph.
Definition 20 Orthonormal representation of a graph: For a graph
G = (V, E), the set of vectors {u1 , ..., un }, ui ∈ <n is called an
orthonormal representation of G if kui k2 = 1 and uTi uj = 0 for
ij ∈
/ E. We denote the set of all orthonormal representations of G
by ORTH(G).
Note that a graph has infinitely many orthonormal representation.
For instance columns of the identity matrix—and in fact of any
orthogonal matrix—represents every graph.
Then we have the following proposition:
Proposition 8 Let c ∈P<n be a vector with kck2 = 1. Then, for
each independent set I, i∈I (cT ui )2 ≤ 1.
Proof: From the definition of the orthonormal representation, {ui |i ∈
I} form an orthonormal set of |I| vectors in <n . From the theory of
linear algebra, we can then choose other n − |I| vectors {ui |i ∈ G − I}
so that {uI } is an orthonormal basis of <n . Then,
X
T
2
(c ui ) ≤
i∈I
n
X
(cT ui )2 = kck2 = 1.
i=1
Let
TH(G) = {x ∈ < :
n
n
X
(cT ui )2 xi ≤ 1, for ui ∈ ORTH(G),
kck2 = 1}.
i=1
Then TH(G) is clearly a convex set. Also, INDP(G) ⊆ TH(G),
because
x ∈ INDP(G) ⇒
n
X
(cT ui )2 xi =
i=1
X
i∈I
Moreover, we can prove the following lemma:
Lemma 7 TH(G) ⊆ K-INDP(G).
(cT ui )2 ≤ 1.
70
CHAPTER 7. SDP RELAXATIONS
Proof: Suppose K is a clique. Without loss of generality, we can
assume that 1, 2, ..., k are vertices of K. Then we choose
c = v1 = v2 = . . . = vk = u,
where u is an arbitrary vector with
kuk2 = 1,
and let u, vk+1 , vk+2 , . . . , vn be an orthonormal set.
v1 , . . . , vn ∈ ORTH(G). Therefore
1≥
n
X
T
2
(c ui ) xi =
i=1
k
X
T
2
(c ui ) xi +
i=1
n
X
Obviously,
(cT ui )2 xi = 1TK x.
i=k+1
Let ϑ(G, w) = max wT x s.t. x ∈ TH(G). Then, from INDP(G) ⊆
TH(G) ⊆ K-INDP(G), we have α(G, w) ≤ ϑ(G, w) ≤ α∗ (G, w).
Now we will formulate the computation of ϑ(G, w) as an SDP. For
that purpose, we first state four alternative definitions of ϑ(G, w).
Definition 21
ϑ1 (G, w) =
ϑ2 (G, w) =
wi
,
T
{ui }∈ORTH, kck2 =1 i∈V (c ui )2
min
max
min λ1 (X + W),
X∈M⊥
ϑ3 (G, w) = max{W • Y; Tr Y = 1, Y ∈ M, Y < 0},
where
M = {Y = Y T , Y ∈ <n , Yij = 0 if ij ∈ E},
M⊥ = {X = XT , Y ∈ <n , Xij = 0 if ij ∈
/ E},
√ √ T
√ √
W = w w = ( wi wj )ij ,
X
ϑ4 (G, w) =
max
(dT vi )2 wi
{vi }∈ORTH(G), kdk2 =1
Then we have the following theorem:
Theorem 11
ϑ1 = ϑ2 = ϑ3 = ϑ4 = ϑ.
Proof: We will prove the theorem by showing a series of inequalities:
ϑ ≤ ϑ1 ≤ ϑ2 ≤ ϑ3 ≤ ϑ4 ≤ ϑ.
7.3. SDP RELAXATION OF A GENERAL 0-1 PROGRAM
71
The key in the proof is to prove the third inequality ϑ2 ≤ ϑ3 . However, we already know that it is true from the duality theorem of
semidefinite programming. The rest can be shown by algebraic manipulation which we omit here (See the book of Grötschel, Lovász
and Schrijver for details).
From what we have stated above, we now have
ω(G, w) ≤ ϑ(G, w) ≤ ω∗ (G, w) = χ∗ (G, w) ≤ χ(G, w),
α(G, w) ≤ ϑ(Ḡ, w) ≤ α∗ (G, w) = χ̄∗ (G, w) ≤ χ̄(G, w),
which shows that the SDP relaxation ϑ(G, w) gives a better approximation than the LP relaxation. Moreover, we show in the
future lectures that ϑ(G, w) is polynomial time computable.
7.3
SDP relaxation of a general 0-1 program
In this section, we consider SDP relaxation of the following general
integer program:
¯
Definition 22 problem IP:
max w̄T x̄
s.t. Āx̄ ≥ b
x̄ ∈ {0, 1}n .
The LP relaxation of this problem is as follows:
¯
Definition 23 problem LP:
max w̄T x̄
s.t. Āx̄ ≥ b
0 ≤ x̄ ≤ 1.
It is more convenient for our purpose to consider the following
equivalent problems in which the right hand sides of the inequality
constraints are zero:
72
CHAPTER 7. SDP RELAXATIONS
Definition 24 problem IP:
ZIP = max wT x
s.t. Ax ≥ 0
x0 = 1
x ∈ {0, 1}n+1 ,
and
Definition 25 problem LP:
ZLP = max wT x
s.t. Ax ≥ 0
x0 = 1
0 ≤ xi ≤ x0 ,
where
A = (−b, Ā),
x0
x=
,
x̄
0
and w =
.
w̄
Remark 8 The conditions 0 ≤ xi ≤ x0 can be rewritten as xi ≥ 0
and x0 − xi ≥ 0, to make the right hand sides zero. Also note that
ZIP ≤ ZLP .
To develop an SDP formulation, we first note the fact:
aTi x ≥ 0
aTj x ≥ 0
⇒ (aTi x)(aTj x) ≥ 0 ⇔ (ai aTj ) • (xxT ) ≥ 0.
T
Remark 9 Matrix X = (xx ) =
1 x̄T
x̄ x̄x̄T
has the following prop-
erties:
X00 = 1,
Xii = xi 2 ,
Xi0 = X0i = xi .
Then we define two sets of inequalities I1 and I2 such that
I1 = a set of inequalities containing 0 ≤ xi ≤ x0 : A1 x ≥ 0
I2 = another set of inequalities containing 0 ≤ xi ≤ x0 : A2 x ≥ 0
I1 ∪ I2 = all of the inequalities.
7.3. SDP RELAXATION OF A GENERAL 0-1 PROGRAM
73
We are now ready to state the SDP formulation.
Definition 26 problem SDP+RK1
ZRK1 = max wT diag(X)
s.t. (ai aTj ) • X ≥ 0
∀i ∈ I1 , ∀j ∈ I2
X00 = 1, X0i = Xii , X < 0, rank(X) = 1.
We refer the problem obtained by deleting the rank 1 condition
from SDP+RK1 as SDP, and the value of its objective function as
ZSDP . Then we have the following relationships among ZIP , ZRK1 ,
ZSDP and ZLP .
Theorem 12
ZIP = ZRK1
Proof: ZIP ≤ ZRK1 can readily be shown by noting that x ∈ {0, 1}n ,
x0 = 1, X = xxT is a feasible solution for SDP+RK1.
To show that ZIP ≥ ZRK1 , we first note that from rank-1 condition, X can be written as
1
1 x̄T
T
X=
(1 x̄ ) =
x̄
x̄ x̄x̄T
for some x̄. Then, from xi = X0i = Xii = (xi )2 , we conclude xi =
0 or 1. Next, we note that the constraints (ai aTj ) • X ≥ 0 includes
the inequalities of the form xi (aTj x) ≥ 0 and (x0 − xi )(aTj x) ≥ 0,
because I1 contains 0 ≤ xi ≤ x0 . By adding these two inequalities,
we have aTj x ≥ 0 for ∀j ∈ I2 . Then, aTi x ≥ 0 for ∀i ∈ I1 can also be
concluded from (aTi x)(aTj x) ≥ 0.
Theorem 13
ZIP ≤ ZSDP ≤ ZLP
Proof: ZIP ≤ ZSDP follows immediately from the preceding theorem
and the fact that SDP is a relaxation of SDP+RK1.
To show ZSDP ≤ ZLP , let X be a feasible solution of SDP. Then, by
noting that any submatrix of a positive semidefinite matrix obtained
by extracting the same subset of rows and columns is again a positive
semidefinite matrix, and that X0i = Xii , we have
1 x̄T
1 xi
X=
< 0 ⇒ Det
≥ 0 ⇒ xi ≥ x2i ⇒ 0 ≤ xi ≤ 1.
xi xii
x̄ X̄
74
CHAPTER 7. SDP RELAXATIONS
Ax ≥ 0 can be shown in the same way as in the preceding theorem.
Hence SDP is a stronger relaxation than LP in this formulation.
The SDP relaxation of 0-1 integer programs can be, depending
on the particular problem, extremely powerful. It can be shown
that if the above SDP relaxation is applied to the LP relaxation of
the maximum independent set problem, then the resulting SDP is
stronger that the TH(G) formulation given above, and contains, in
addition to clique and Orthonormal inequalities, a number of other
classes inequalities, see the paper of Lovász and Schrijver for more
details.
Chapter 8
Interior Point Methods: An
Introduction
Scribe: Emre Erdogan
10/29/01
8.1
Overview
We first briefly show that the problem of feasibility for SDP and
SOCP is quite different from that of linear programming. Then
we introduce the notion of barrier functions and give some suitable
barrier functions for LP, SDP and SOCP. Finally, we show how
Newton’s method can be used to solve the optimization problems.
8.1.1
Difficulty of the feasibility problem in SOCP and
SDP
Consider the following example:
Example 19 (Exponentially large SDPs) Consider the optimization problem.
min xn
s.t. x2i ≤ xi+1 , i = 1, . . . m
x0 = 2
xi+1 xi
2
Since xi ≤ xi+1 ⇐⇒
0. This optimization problem
xi 1
n
can be cast as an SDP. The optimal value of this problem is 22 while
76
CHAPTER 8. INTERIOR POINT METHODS
the size of input is O(n). Notice that this SDP is in fact an SOCP.
2
2
The constraint xi+1 ≥ x2i is equivalent to xi+12 +1 − xi+12 −1 ≥ x2i .
This inequality in turn is equivalent to
 xi+1 +1 
2
 xi+1 −1  <Q 0
2
xi
This example shows that for an SOCP or an SDP one can have an
exponentially large output. It is therefore necessary to make further
assumptions in order to even begin to discuss polynomial-time algorithms for SOCP and SDP. Here is how this problem is circumvented
in general. Consider the optimization problem min cT x, subject to
x ∈ C, where C is convex. If there exist a ball B(0, r) of radius r
centered at 0 contained inside the feasible set C, and a ball B(0, R)
of radius R centered at the origin containing C, then the solution
of the optimization problem is at least as large as r and at most as
large as R. Taking r and R from rational numbers, the total number
of bits required to write down the solution of the convex program is
bounded by | log(r)| and | log(R)|.
Another issue to keep in mind is that even if the the input data for
an SDP or SOCP, i.e. A, b and c are rational numbers, the solution
x and the dual solution (y, z) are not necessarily rational; this is
in contrast with linear programming. Thus, we usually need to
supply a rational number and then be satisfied with a pair primal
^ , z^) such that the duality
and dual rational feasible solutions (^
x, y
T
^ z^ ≤ . In this situation we may speak of polynomial time
gap x
complexity if the number of operations performed by the algorithm
is a polynomial function of | log(r)|, | log(R)|, | log()|, m and n.
8.2
Barrier functions for LP, SDP, SOCP
We now define the notion of barrier functions for convex sets in
general and for the LP, SDP and SOCP, in particular. Then we
explain how we can use barrier functions to find the optimal solutions of these problems. In this lecture we use the barrier functions
to convert the constrained optimization problem to essentially, an
unconstrained optimization problem. When we are solving these
8.2. BARRIER FUNCTIONS FOR LP, SDP, SOCP
77
problems we assume that the constraint set is a convex set and has
nonempty interior. Now let us define the barrier function.
Definition 27 (Barrier Function) Let C ⊆ Rn be a convex set
with nonempty interior. Then the function b : Int(C) → R is called
a barrier function if it has the following properties.
1. b is convex.
2. For each sequence of points xn ∈ Int(C) such that limn→∞ xn
exists and belongs to Bd(C), the limn→∞ b(xn ) = ∞.
Note that since the domain of b(x) is Int(C) and by the properties
of b(x), the minimum value of b is attained in the Int(C).
Now let us explain how we use the barrier functions. Consider the
following constrained minimization problem.
(P) min cT x
s.t. x ∈ C
Let b(x) be a barrier function for problem (P). For a given µ > 0,
µ ∈ R, by multiplying the barrier function with µ and adding it to
the objective function, we can convert the constrained minimization
problem to unconstrained minimization problem Pµ :
(Pµ )
min fµ (x) = cT x + µb(x)
Let for a given µ > 0, x∗µ be the point at which the min of fµ (x) is
attained. Since b(x) → ∞ when x approaches the boundary of C
and Pµ is a minimization problem, x∗µ ∈ Int(C).
Suppose we solve this problem for a decreasing sequence of µn .
One can show that as µn → 0, x∗µ → x∗ , the optimum of (P). In other
words, sequence of optimal points of the problem (Pµ ) converges to
the optimal point of the original problem (P). In fact, as µ varies
towards zero, the set of “µ-optimal” points x∗µ traverse a smooth
path in the interior of C called the central path associated with the
barrier function b; it can be shown that this path ends at x∗ .
One may think that at the outset by choosing µ very small, we
can solve the problem (Pµ ) once and get a sufficiently close approximation to x∗ . However, it turns out that the problem becomes
numerically very ill-conditioned when µ is small. Also, for such a
small µ it is hard to find a suitable initial solution.
78
CHAPTER 8. INTERIOR POINT METHODS
Instead a better approach is to first choose µ fairly large; the
larger µ, the easier it is to find an initial solution x0 . Next, one uses
one or a few iterations of Newton’s method at the current value of
µ to get a new point x1 . Then µ is reduced by a constant factor:
µ ← σµ for some constant σ, and the last result of Newton’s method,
x1 , is used as a start point for the new optimization problem (it is
new because we have a new µ.) This process is repeated until xk is
sufficiently close to x∗ . With judicious choices of the barrier function
b(·), σ, and the initial point x0 one can show that the procedure
just outlined can result in a well-behaved algorithm. The details for
SDP and SOCP will be given in future lectures. For now, we focus
on finding barriers for these and the LP problems.
8.2.1
Barrier function for LP
Consider the following standard from linear program;
(LP) min cT x
s.t. Ax = b
xi ≥ 0 for i = 1, . . . , n
where c ∈ Rn and b ∈ Rm , A ∈ Rn×m , and x ∈ Rn is the unknown
vector. In this problem, as all the problems to follow, we won’t
worry about the linear equality constraints Ax = b, and focus on
the inequality constraints xi ≥ 0. The boundary of the nonnegative
orthant consists of those vectors where at least one of xi is zero. By
definition any barrier function for the LP problem must approach
to ∞ as one of the components of x = (x1 , x2 , . . . , xn )T goes to 0.
Two examples of barrier functions for the nonnegative orthant are:
n
X
1
xi
i=1
and
−
n
X
i=1
log xi = − log
n
Y
xi .
i=1
Notice that both of these function are convex (why?) and approach
infinity as any of xi approaches zero. It turns out (as we will see
later) that the logarithmic barrier function has some desirable properties that allow us to prove polynomial time complexity. So let us
explore what we get by applying this barrier to the LP problem.
P
LPµ min cT x − µ log(xi )
s.t. Ax = b
8.2. BARRIER FUNCTIONS FOR LP, SDP, SOCP
79
The optimal solution to this problem can be found by using Lagrang’s theorem. The Lagrangian function for LPµ is given by
X
def
Lµ (x, y) = cT x − µ
log(xi ) + yT (b − Ax)
where y ∈ Rm are called Lagrange multipliers. By Lagrange’s theorem, xµ is optimal for LPµ if, and only if, the derivatives of Lµ with
respect to both x and y are zero. That is we need to solve:
1 1
1
T
∇x L = c − µ
, ,...,
− yT A = 0
x 1 x2
xn
∇y L = b − Ax = 0
So we have converted the LP into a nonlinear system of equations.
Here we have m+n variables and m+n equations. Since the number
of variables are equal to the number of equations and by the assumption that the constraint qualification is met, in principal, we can
T
1
1
1
T
solve this system of equations. If we let z = µ x1 , x2 , . . . , xn .
Then by rearranging the above equations we get:
Ax = b
AT y + z = c
zi − µx−1
i = 0 for i = 1, 2...n
The third set of equations can be written in a number of mathematically equivalent forms. For example as xi − zµi = 0 or xi zi = µ for
i = 1, 2...n. Note also that the first two sets of equations are primal
and dual feasibility conditions for the LP problem its dual, respectively. In addition the third set of equations, since are equivalent to
xi .zi = µ, can be interpreted as a relaxation of the complementary
slackness conditions; thus as µ → 0, the complementary slackness
conditions will be satisfied by the (xµ , yµ , zµ ).
8.2.2
Barrier function for SDP
First let us define some notation. If X ∈ Rn×n , define
2
vecX = (x11 , . . . , xn1 , x12 , . . . , xn2 , . . . , x1n , . . . , xnn )T ∈ Rn
Now Consider:
(SDP) min C • X
s.t. Ai • X = bi
X<0
for i = 1, . . . , m
80
CHAPTER 8. INTERIOR POINT METHODS
where C, X, Ai ∈ Sn , i = 1, . . . , m, Sn the set of n × n P
symmetric
matrices, and the inner product is defined as A • B = i,j Aij Bij .
Recall that X < 0 if, and only if, λi (X) ≥ 0 for all i = 1, ..m, where
λi (X) are Q
the eigenvalues of the matrix X. Because λi (x)
Q are nonnegative, ni=1 λi (X) is also nonnegative. Therefore, log ni=1 λi (X)
is defined if X 0. We can use the following function as a barrier
for SDP:
n
X 1
X
Y
, and
− log λi (X) = − log
λi (X) = − log det X
λi (X)
i
i=1
Before forming the Lagrangian function and the equations which we
shall use to find the critical points of it, let us slightly change the
2
form of these equations. Let A ∈ Rm×n whose ith row is vecT Ai
and b ∈ Rm whose ith entry is bi . Since Ai •X = bi = vecT Ai vecX,
we can write the set of equations Ai • X = bi for i = 1, . . . , m as
Avec(X) = b. In other words;
Ai • X = bi
for i = 1, . . . , m ⇐⇒ Avec(X) = b.
By the same procedure as defined for LP and using our new notation
for the constraint set Ai • X = bi for i=1...m, the SDPm u can be
written as:
(SDP)µ min C • X − µ log det X
s.t. Avec(X) = b
Then the Lagrangian function of the above problem is:
Lµ (X, y) = C • X − µ log det X +
m
X
i=1
m
yT b − Avec(X)
where µ ∈ R and y ∈ R . Now the problem is to determine
∇x (µ log det X). Let’s give an example for the computation of ∇x (µ log det X)
when X ∈ R3×3 . The generalization will be clear after this example.
Let


x11 x12 x13
X =  x21 x22 x23 
x31 x32 x33
Pn
1+j
a1j det(A1j )
Since for any n × n matrix A, det A =
j=1 (−1)
where A1j is the (n − 1) × (n − 1) matrix obtained by deleting the
8.2. BARRIER FUNCTIONS FOR LP, SDP, SOCP
81
first row and the jth column of A. Then we can write log(det X) as,
x11 x12 x13 x22 x23 − x12 x21 x23
log(det X) = log x21 x22 x23 = log x11 x31 x33
x32 x33 x31 x32 x33 If we take derivative of log(det X)
∂ log det X
=
∂x11
with respect to x11 we get
x22 x23 x32 x33 det X
Then we can generalize this result for X ∈ Rn×n as follows
det(Aij )
∂ log det X
= (−1)i+j
= (X−T )ij
∂xij
det X
where Aij is the (n − 1) × (n − 1) matrix obtained by deleting the
ith row and the jth column of A. Now we can easily compute the
derivatives of Lagrangian function.
Lµ (X, y) = C • X − µ log det X +
T
T
∇X L = vec (C) − µvec X
∇yi L = b − A.X = 0
−1
m
X
i=1
T
yT b − Avec(X)
−y A=0
Let Z = µX−1 . Then we can write the above equations as;
AvecX = b
X
C−Z−
yi Ai = 0
i
−1
Z − µX
=0
The third equation can be written equivalently as X − µZ−1 = 0 or
as XZ = µI where I is the n×n identity matrix. Using XZ = µI will
cause some headache later on. The problem is that if we consider
the function F(X, Z) = XZ, this is a function that maps a pair of
+ x13 x21 x22
x31 x32
82
CHAPTER 8. INTERIOR POINT METHODS
symmetric matrices, X and Z to a matrix XZ that is not in general
symmetric. Notice that this is not a problem for functions, Z−µX−1
or X − µZ−1 . We now show that for X and Z positive semidefinite,
XZ = µI, if and only if XZ + ZX = 2µI.
Lemma 8 If X and Z ∈ Sn×n and either X 0 or Z 0 then
XZ = µI if, and only if
XZ + ZX
= µI.
2
Proof: If XZ = µI then since X and Z are symmetric matrices
XZ + ZX = 2µI. So XZ+ZX
= µI.
2
For the converse let A = XZ − ZX. Then −AT = A that is A
is a skew symmetric matrix.
Therefore the eigenvalues of A are
√
λj (A) = iαj where i = −1 and αj are real numbers. So we have,
XZ + ZX = 2µI
XZ − ZX = A
If we add these two equations we get 2XZ = 2µI+A. The eigenvalue
of the right hand side are of the form 2µ + iαj . Now suppose X < 0.
Then X has a square root X1/2 . Multiplying XZ = µI from left by
X−1/2 and from right by X1/2 we see that XZ is similar to a symmetric
matrix X1/2 ZX1/2 therefore has only real eigenvalues. This means
that all αj = 0, that is A = XZ − ZX = 0, and thus XZ = ZX = µI.
Thus, it is better to write, instead of XZ = µI, XZ + ZX = 2µI.
8.2.3
Barrier function for SOCP
After solving the LP and SDP by using barrier function method, now
let us define a suitable barrier function for SOCP and calculate the
derivatives of its Lagrangian function. For simplicity let us consider
the following single block SOCP:
(SOCP) min cT x
s.t. Ax = b
x<Q 0
n
where A ∈ Rm×n , c ∈ Rn , b ∈ Rm , and
p x ∈ R are the variables.
The relation a<Q 0 is defined as a0 ≥ a21 + . . . + a2n .
8.2. BARRIER FUNCTIONS FOR LP, SDP, SOCP
83
Let’s write the constraints x0 ≥ ||x̄||
as x20 − x21 − x22 − . . . − x2n ≥ 0. Let x = (x0 , x1 , x2 , . . . , xn )T , and


1 0
0 ... 0
0 −1 0 . . . 0 


0 0 −1 . . . 0 
R=
. .

..
 .. ..
. 0
0 0
0 . . . −1
then x20 − x21 − x22 − . . . − x2n = xT Rx. A suitable barrier function for
SOCP is:
log(x20 − ||x̄||2 )
By a similar procedure as LP and SDP, the SOCP problem is replaced by
SOCPµ min cT x − µ log(x20 − kx̄k2 )
s.t. Ax = b
for µ > 0. Therefore the Lagrangian Function of this problem is:
Lµ (x, y) = cT x − µ log(x20 − ||x̄||2 ) + yT (b − Ax)
2µ
(x0 , −x1 , −x2 , . . . , −xn ) − yT A = 0
∇x L = cT − 2
x0 − kx̄k2
∇y L = b − Ax = 0
Introducing the slack variable z, this system is equivalent to
Ax = b
AT y + z = c
2µ
z− 2
Rx = 0
x0 − kx̄k2
where z =
2µ
2 Rx.
x2
0 −kx̄k
This in turn can be written equivalently as:
xT z = 2µ
x0 zi + xi z0 = 0 for i = 1, . . . , n
The first can be obtained by multiplying from left the equaltion
2µ
T
z − x2 −kx̄k
and noting that xT Rx = x20 kx̄k2 . The
2 Rx = 0 by x
0
84
CHAPTER 8. INTERIOR POINT METHODS
second set of equations arise from observing that
− xzii =
x0
z0
−
2µ
2.
x2
0 −kx̄k
Thus, just like LP and SDP, applying the logarithmic barrier
function to the SOCP problems results in getting primal and dual
feasibility and a relaxed form of complementarity conditions.
8.2.4
Newton’s Method
So far we have been trying to find equations by using barrier functions. Now we wish to solve these systems. For all three problems,
LP, SDP and SOCP, application of the logarithmic barrier function
resulted in a system of equations which contained primal and dual
feasibility (a set of linear equations) and a relaxed form of complementarity conditions (a set of nonlinear equations.) To handle the
nonlinear equations, the main tool is using the Newton’s method.
The general approach is as follow: We start with an estimate of the
solution, (x, y, z). Next we seek a direction (∆x, ∆y, ∆z) such that
moving in that direction with an appropriate step length will bring
us closer to the solution of the system. The Newton method replaces, (x, y, z) with (x + ∆x, y + ∆y, z + ∆z), and plugs it into the
system of equations. Then, noting that (∆x, ∆y, ∆z) are unknowns,
it removes any nonlinear terms in ∆’s and solves the remaining system of linear equations. Let us see how this works out in LP.
Ax = B
AT y + z = c
xi zi = µ
Replace each variable
x ← x + ∆x
y ← y + ∆y
z ← y + ∆y.
Then put these new values of the variables into the equations
A∆x = b − Ax
AT ∆y + ∆z = c − AT y − z
xi ∆zi + ∆xi zi = µ − xi zi for i=1...n
8.2. BARRIER FUNCTIONS FOR LP, SDP, SOCP
85
We can write the above equations in closed form as;

   
∆x
rp
0 AT I
A 0 0 ∆y = rd 
∆z
rc
E 0 F
where E = Diag(z1 , z2 , . . . , zn ) and F = Diag(x1 , x2 , . . . , xn ).
If we decide instead to use zi − xµi = 0 we get
µ
= 0, or equivalently
xi + ∆xi
µ
zi + ∆zi −
=0
i
xi (1 + ∆x
)
xi
zi + ∆zi −
Since
µ
−
xi (1 +
µ
=−
∆xi
xi
)
xi
∆xi
1−
+
xi
∆xi
xi
2
+ ···
!
we can throw out the nonlinear terms and write:
µ
∆xi
µ
µ
zi + ∆zi −
1−
= 0, or equivalently ∆zi − 2 ∆xi = −zi + .
xi
xi
xi
xi
In this case, F = I, E = µ Diag(x−2
i ). Using a similar procedure we
get
∆xi −
for linearizing xi +
µ
zi
µ
µ
∆zi = −xi + .
2
zi
zi
= 0 and thus, E = I, and F = µ Diag(z−2
i ).
86
CHAPTER 8. INTERIOR POINT METHODS
Chapter 9
Newton’s Method in
Interior Point Methods
Scribe: David Phillips
11/12/2001
9.1
Overview
We complete last lecture’s discussion of the Newton method for LP,
SDP, and SOCP. Recall that these problems shared the following
structure when Newton’s method was applied to primal and dual
feasibility and a relaxed form of complementary slackness condition
that arose from applying first order conditions to the logarithmic
barrier function:
A4x
= rp primal feasibility
AT 4y + 4z = rd dual feasibility
E4x + F4z = rc complementary slackness conditions
We complete the derivation for the E and F matrices in LP, SDP, and
SOCP, in order motivate the unifying framework of Jordan algebras.
Finally, we introduce Jordan algebras.
88
9.2
CHAPTER 9. NEWTON’S METHOD
Newton Method for LP, SDP, and SOCP
9.2.1
Linear Programming
We include the results for LP here for review. The specifics of the
derivation is in the previous lecture.
Ax
= b primal feasibility
AT y + z = c dual feasibility
and any one of mathematically equivalent forms of relaxed complementary slackness conditions:
(1) xi − µz−1
= 0 i = 1, . . . , n
i
−1
(2) zi − µxi = 0 i = 1, . . . , n
(3) xi zi
= µ i = 1, . . . , n
In LP the matrices E and F have the forms:
(1) E = I, F = µ Diag(z−2 )
(2) E = µ Diag(x−2 ), F = I
(3) E = Diag(z), F =Diag(x)
where, for vector v ∈ Rn ,
v−2
 −2 
1/v1
1/v−2

2 
=
.
 ..  .
1/v−2
n
9.2.2
Semidefinite Programming
Letting

vecT (A1 )
..
,
A=
.

vecT (Am )
the specific relaxed form for SDP becomes:
Avec(4X)
= rp primal feasibility
AT 4y + vec(4Z) = rd dual feasibility
XZ
= µI complementary slackness conditions
9.2. NEWTON METHOD FOR LP, SDP, AND SOCP
89
The relaxed complementary slackness conditions has many mathematically equivalent conditions, three of which are:
(1) X − µZ−1 = 0
(2) Z − µX−1 = 0
(3) XZ + ZX = 2µI (recall Lemma 1 of lecture 8)
The situation here is quite analogous to LP. Consider replacing Z
with Z + 4Z in (1):
−1
Z + 4Z − µ X + 4X
= 0.
Since X is in the interior of the feasible region, it is positive definite,
implying that X−1 exists, so this equation can be rewritten as:
−1
−1
Z + 4Z − µ X I + X 4X
= 0
But X(I + X−1 4X) is not necessarily symmetric - so instead we will
use the following:
−1
Z + 4Z − µ X1/2 I + X−1/2 4XX−1/2 X1/2
= 0
Z + 4Z − µX−1/2 (I + X−1/2 4XX−1/2 )−1 X−1/2 = 0
Now, recall that for z ∈ R, if |z| < 1, then the following identity
holds:
1
= 1 − z + z2 − z3 + . . .
1+z
For matrices, the analogous identity is, for square matrix Z, if the
absolute value of all eigenvalues are all less than 1, then
(1 + Z)−1 = I − Z + Z2 − Z3 + · · ·
Using this identity, we obtain,
Z + 4Z − µX−1/2 (I − X−1/2 4XX−1/2 + (X−1/2 4XX−1/2 )2 − · · · )X−1/2 = 0
Applying Newton’s method means dropping all non linear terms in ∆s:
Z + 4Z − µX−1/2 (I − X−1/2 4XX−1/2 )X−1/2 = 0
Z + 4Z − µX−1 + µX−1 4XX−1 = 0
90
CHAPTER 9. NEWTON’S METHOD
It is clear that F = I; it will,
however, be
easier to represent E as
−1
−1
E(X), where E(X) : 4X−→µ X 4XX
. Since E is a linear transformation, it is possible to represent it as some matrix dependent
on X, and write E(X)vec(4X).
Before continuing, it will simplify the notation to introduce Kronecker products.
Kronecker Products
Let A and B be arbitrary matrices of dimensions m × n and p × q
respectively. Then, let ⊗ be a binary operator such that
A⊗B=C
where C is a mp × nq block matrix of form:


a11 B a12 B . . . a1n B

C =  ...
am1 B
...
amn B
where the ij-th block entry of the product is the p×q matrix B scaled
by the ij-th entry of the A matrix (i.e., aij ), with i = 1, . . . , m and
j = 1, . . . , n. There are a number of properties that make it easy to
algebraically manipulate expressions involving Kronecker products.
The fundamental property is stated in the following
Lemma 9 Let A = (aij ), B = (bij ), C = (cij ), and D = (dij ) be
matrices with consistent dimensions so that the products AB and
CD are well-defined. Then
(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
(9.1)
Proof: To prove (9.1) we recall the property of matrix multiplication which states that if X = (Xij ) and Y = (Yij ) are two matrices
that are partitioned into blocks, so that Xij and Yij are also matrices, then Z = XY P
can be written as a block partitioned matrix
Z = Zij where Zij = k Xik Ykj . The only requirement here is that
the partitionings of X and Y should be in such a way that all products Xik Ykj for all values of i, j and k are well-defined. Now Let us
look at the ij block of (AB) ⊗ (CD). By definition this block equals
9.2. NEWTON METHOD FOR LP, SDP, AND SOCP
(AB)ij (CD) =
equals
P
k
91
aik bkj (CD). Now the ij block of (A ⊗ C)(B ⊗ D)
X
X
X
(A⊗C)(B⊗D) ij =
(A⊗C)ik (B⊗D)kj =
aik C bkj D =
aik bkj CD,
k
k
k
which is equal to the ij block of (AB) ⊗ (CD).
The Kronecker product has a number of properties all of which
can be derived easily from (9.1). For example
(A ⊗ B)−1 = A−1 ⊗ B−1 .
(9.2)
−1
This is proved by observing that A ⊗ B A ⊗ B−1 = AA−1 ⊗
BB−1 = I ⊗ I. Now it is obvious from definition that I ⊗ I = I.
Eigenvalues and eigenvectors of Kronecker products can also be
determined easily. Let Ax = λx and By = ωy. Then Kronecker
multiplying the two sides of these equations we get (Ax) ⊗ (By) =
(λx) ⊗ (ωy). Using (9.1) we get (A ⊗ B)(x ⊗ y) = (λω)(x ⊗ y).
Since for vectors x and y, x ⊗ y is also a vector, we have proved
that
Lemma 10 If λ is an eigenvalue of A with corresponding eigenvector x, and ω is an eigenvalue of B with corresponding eigenvector
y, then λω is an eigenvalue of A⊗B with corresponding eigenvector
x ⊗ y.
Other properties of Kronecker products are
(A ⊗ B)T = AT ⊗ BT
vec(ABC) = CT ⊗ A vec(B)
Exercise: Prove the above properties.
Back to SDP
Returning to the SDP relaxation, we obtain:
(1) E = I, F = −µE(Z) = −µ(Z−1 ⊗ Z−1 )
(2) E = −µE(X) = −µ(X−1 ⊗ X−1 ), F = I
and for (3):
(X + 4X)(Z + 4Z) + (Z + 4Z)(X + 4X) = 2µI
92
CHAPTER 9. NEWTON’S METHOD
As usual, dropping nonlinear terms in ∆s we get
X4Z + 4ZX + 4XZ + Z4X = 2µI − XZ − ZX
which in Kronecker product notation can be written as
(I ⊗ X + X ⊗ I)vec(4Z) + (I ⊗ Z + Z ⊗ I)vec(4X) = rc
Thus, E = I ⊗ Z + Z ⊗ I, and F = I ⊗ X + X ⊗ I.
9.2.3
Second-order cone programming
For simplicity of notation, we will consider the single-block problem.
The relaxed form for this is:
A(4x)
= rp primal feasibility
AT 4y + 4z = rd dual feasibility
and the relaxed complementarity relations (as developed in our last
lecture) were:

(1)  (x_0, x_1, . . . , x_n)^T = ρ (z_0, −z_1, . . . , −z_n)^T = ρ R z,

or, in a mathematically equivalent form,

(2)  (z_0, z_1, . . . , z_n)^T = γ (x_0, −x_1, . . . , −x_n)^T = γ R x,

where ρ = 2µ / (z_0^2 − ||z̄||^2) in (1), γ = 2µ / (x_0^2 − ||x̄||^2) in (2), and

    R = Diag(1, −1, . . . , −1).
Since x = ρRz, multiplying by z^T we get

    z^T x = ρ z^T R z = ρ ( z_0^2 − Σ_{i=1}^{n} z_i^2 ) = 2µ.
By the comments at the end of section 2.3 of the last lecture, these
conditions also imply that:
x0 zi + z0 xi = 0, i = 1, . . . , n
Thus, we get

(3)  x^T z = 2µ,   x_0 z_i + x_i z_0 = 0,   i = 1, . . . , n,

or more succinctly,

(3)  Arw(x) z = Arw(x) Arw(z) e = 2µe,
where e = (1, 0, . . . , 0)^T, and, as we have already defined,

    Arw(x) = [ x_0  x_1  x_2  . . .  x_n ]
             [ x_1  x_0   0   . . .   0  ]
             [ x_2   0   x_0  . . .   0  ]
             [  .                     .  ]
             [ x_n   0    0   . . .  x_0 ]
But then, for (3), replacing x and z with x + ∆x and z + ∆z,

(i)   (x_0 + ∆x_0)(z_i + ∆z_i) + (z_0 + ∆z_0)(x_i + ∆x_i) = 0,   i = 1, . . . , n,
      (x + ∆x)^T (z + ∆z) = 2µ,

or, dropping the nonlinear terms,

(ii)  x_0 ∆z_i + z_i ∆x_0 + x_i ∆z_0 + z_0 ∆x_i = −(x_0 z_i + x_i z_0),   i = 1, . . . , n,
      x_0 ∆z_0 + z_0 ∆x_0 + · · · + x_n ∆z_n + z_n ∆x_n = 2µ − x^T z.

But (i) and (ii) are equivalent to

    Arw(x)∆z + Arw(z)∆x = r_c.
So, for (3), E = Arw(z) and F = Arw(x).

Exercise: Show that for case (1), E = I and F = −µ Arw(z)^{-2},
and for case (2), E = −µ Arw(x)^{-2} and F = I.
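A small NumPy sketch of the Arw operator and of the linearization above (illustrative only; the helper name arw is my own):

    import numpy as np

    def arw(x):
        """Arrow matrix Arw(x) for x = (x0, x_bar) in R^{n+1}."""
        x0, xb = x[0], x[1:]
        A = x0 * np.eye(x.size)
        A[0, 1:] = xb
        A[1:, 0] = xb
        return A

    rng = np.random.default_rng(2)
    n = 5
    x, z = rng.standard_normal(n + 1), rng.standard_normal(n + 1)
    dx, dz = rng.standard_normal(n + 1), rng.standard_normal(n + 1)
    e = np.zeros(n + 1); e[0] = 1.0
    mu = 0.1

    # Arw(x) z has first component x^T z and i-th component x0*z_i + z0*x_i,
    # exactly the quantities appearing in relaxation (3).
    assert np.isclose((arw(x) @ z)[0], x @ z)

    rc = 2 * mu * e - arw(x) @ z              # right-hand side r_c of the Newton equation
    linearized = arw(z) @ dx + arw(x) @ dz    # E dx + F dz with E = Arw(z), F = Arw(x)
    quadratic = arw(dx) @ dz                  # the neglected second-order term
    full = arw(x + dx) @ (z + dz) - arw(x) @ z
    assert np.allclose(full, linearized + quadratic)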
9.2.4
Summary
The following two tables summarize the different versions of the relaxed
complementarity conditions and the corresponding E and F matrices.

          LP                          SDP                     SOCP
(1)  x_i − µ z_i^{-1} = 0        X − µZ^{-1} = 0        x − (2µ/(z_0^2 − ||z̄||^2)) Rz = 0
(2)  z_i − µ x_i^{-1} = 0        Z − µX^{-1} = 0        z − (2µ/(x_0^2 − ||x̄||^2)) Rx = 0
(3)  x_i z_i = µ                 XZ + ZX = 2µI          Arw(x) Arw(z) e = 2µe

Then the matrix forms for E and F are:

                      E                          F
LP     (1)    I                           −µ Diag(z)^{-2}
       (2)    −µ Diag(x)^{-2}             I
       (3)    Diag(z)                     Diag(x)
SDP    (1)    I                           −µ (Z^{-1} ⊗ Z^{-1})
       (2)    −µ (X^{-1} ⊗ X^{-1})        I
       (3)    I ⊗ Z + Z ⊗ I               I ⊗ X + X ⊗ I
SOCP   (1)    I                           −µ Arw(z)^{-2}
       (2)    −µ Arw(x)^{-2}              I
       (3)    Arw(z)                      Arw(x)

9.3
Euclidean Jordan Algebras
To unify the presentation of interior point algorithms for LP, SDP,
and SOCP, it is convenient to introduce an algebraic structure that
provides us with tools for analyzing these three cases (and several
more). This algebraic structure is called a Euclidean Jordan algebra.
We first introduce Jordan algebras. (Here "Jordan" refers to Pascual
Jordan, the 20th century German physicist who, along with Max Born and
Werner Heisenberg, was responsible for the matrix mechanics formulation
of quantum mechanics. It does not refer to another famous Jordan,
namely Camille Jordan, the 19th century French mathematician of Jordan
block and Jordan curve theorem fame.)
First we have to make a comment on terminology. The word algebra is used in two different contexts. The first context is the name
a branch of mathematics that deals with algebraic manipulations at
the elementary level, or the study of groups, rings, fields, vector
spaces etc. at higher level. The second meaning of the word algebra
is a mathematical term that is a particular algebraic structure. Here
are the definitions
9.3.1
Definitions
For the purposes of the definitions, let V be some vector space defined over the field of real numbers, that is V = Rn for some integer
n. Although most of what we will say is also valid for an arbitrary
field F, (with scalar multiplication and normal addition), this level
of generality is not needed in our course and therefore will not be
treated.
Definition 28 (Algebra) An algebra, (V, ∗), is a vector space, V,
with an operation, ∗, such that for any x, y, z ∈ V, α, β ∈ R,
x ∗ y ∈ V (closure property),
x ∗ (αy + βz) = α(x ∗ y) + β(x ∗ z)
(αy + βz) ∗ x = α(y ∗ x) + β(z ∗ x)
(distributive law),
Notice that the distributive law implies that the binary operation
x ∗ y is a bilinear function of x and y. In other words, there are
n × n matrices Q_1, Q_2, . . . , Q_n such that

    x ∗ y = z = ( x^T Q_1 y,  x^T Q_2 y,  . . . ,  x^T Q_n y )^T    (bilinearity)
Thus, it is immediate that determining the matrices Q1 , . . . , Qn
determines the multiplication, which in turn, determines the algebra. Further, given x ∈ V, for all y ∈ V, x ∗ y = L(x)y, where L(x)
is some matrix linearly dependent on x. Thus, L() also determines
the algebra.
Definition 29 (Identity Element) For a given algebra, (V, ∗), if
there exists an element e ∈ V such that for all x ∈ V
e∗x=x∗e=x ,
then e is the identity element for (V, ∗).
Exercise: Prove that if (V, ∗) has an identity element, e, then
e is unique.
Definition 30 (Associative Algebra) An algebra, (V, ∗), is an
associative algebra if for all x, y, z ∈ V,
(x ∗ y) ∗ z = x ∗ (y ∗ z)
Example 20 (Matrices under matrix multiplication) Let Mn
be the set of all square matrices of dimension n. Then ordinary
matrix multiplication, (denoted as ·) forms an associative (but not
commutative) algebra. Note that, for X ∈ Mn , L(X) = I ⊗ X.
Example 21 (A commutative, but not associative, algebra)
Consider (Mn, ◦), where for A, B ∈ Mn, A ◦ B = (AB + BA)/2. This is an
algebra which is commutative but not associative. For X ∈ Mn,
L(X) = (I ⊗ X + X ⊗ I)/2.
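A brief NumPy check of the L operators of Examples 20 and 21 (an added illustration, not part of the original notes; the form (I ⊗ X + X ⊗ I)/2 is exercised here with a symmetric X, the case of interest later in the course):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 4
    vec = lambda M: M.flatten(order='F')
    I = np.eye(n)

    X, Y = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    jordan = lambda A, B: (A @ B + B @ A) / 2   # the product of Example 21

    # Example 20: L(X) = I x X, since vec(XY) = (I x X) vec(Y).
    assert np.allclose(vec(X @ Y), np.kron(I, X) @ vec(Y))

    # Example 21, for symmetric X: L(X) = (I x X + X x I)/2.
    Xs = (X + X.T) / 2
    L_Xs = (np.kron(I, Xs) + np.kron(Xs, I)) / 2
    assert np.allclose(vec(jordan(Xs, Y)), L_Xs @ vec(Y))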
Consider now, an associative algebra, (V, ∗), with a matrix L()
determining ∗. We have that:
(x ∗ y) ∗ z = L(x ∗ y)z = x ∗ (y ∗ z) = L(x)(L(y)z)
Since this is true for all z it follows that for all x, y ∈ V L(x ∗ y) =
L(x)L(y). Thus, (V, ∗) is isomorphic to some subalgebra of matrices
under matrix multiplication. Let us define these terms.
Definition 31 Let (V, ∗) and (U, ⋆) be two algebras. Let f : V → U
be a linear map such that f(V) is a subspace of U and f(u ∗ v) =
f(u) ⋆ f(v). Then f is a homomorphism from (V, ∗) to (U, ⋆). If f is
one-to-one and onto, then f is called an isomorphism. In that case,
the algebras (V, ∗) and (U, ⋆) are essentially the same thing.
Definition 32 (Subalgebra) Given an algebra, (V, ∗), and a subspace U of V, (U, ∗) is a subalgebra, if and only if U is closed under
∗. More generally, we say that U is a subalgebra of V if it is isomorphic to a subalgebra of V.
Thus we have shown that every associative algebra is essentially
homomorphic to some algebra of square matrices. Let us see
some examples of subalgebras and isomorphisms within the algebra
of matrices.
Example 22 (Some subalgebras of (Mn , ·))
1. All n × n diagonal matrices form a subalgebra of (Mn , ·).
2. All n×n upper triangular matrices form subalgebras of (Mn , ·),
so do all block upper triangular matrices. All n × n lower triangular matrices also form a subalgebra of Mn as do block
lower triangular matrices. Also notice that under the function
f(X) = XT , the algebra of upper triangular matrices is isomorphic to the algebra of lower triangular matrices.
3. All matrices of the form

       [ A  0 ]
       [ 0  0 ]
where A ∈ Mk for k < n form a subalgebra of Mn . Also notice
that this subalgebra is isomorphic to (Mk , ·), so it is accurate
to say that Mk is a subalgebra of Mn for all k ≤ n.
4. Someone in class mentioned the set of orthogonal matrices, that
is matrices Q such that QQT = I. This is not a subalgebra.
Although it is indeed closed under matrix multiplication, the
set of orthogonal matrices is not a subspace of Mn : for two
orthogonal matrices, Q1 and Q2 , in general, Q1 + Q2 is not
orthogonal, nor is αQi for real numbers α 6= 1.
Definition 33 The subalgebra generated by x, V(x), is the set–
intersection–wise smallest subalgebra of V that contains x. More
generally if S ⊂ V is any subset, then V(S) is the smallest subalgebra of V that contains S.
V(x) can be thought of being created as follows: start with V(x)
containing only x. Then add αx for all α ∈ R. Then multiply ∗-wise
everything we already have with everything and add the products
to V(x). Then form all possible linear combinations and add the
results to V(x). Continue this process of multiplying and forming
linear combinations until no new element is created. You are left
with a smallest subalgebra of V that contains x.
Definition 34 (Power Associative Algebra) For an algebra, (V, ∗),
if, ∀x ∈ V, the subalgebra V(x) is associative, then (V, ∗) is called a
power associative algebra.
To show that an algebra is power associative it suffices to show
that in the product x ∗ x ∗ · · · ∗ x the order in which multiplications
are carried out does not matter. In that case it is easily seen that
    V(x) := { v ∈ V : v = Σ_{i=0}^{k} α_i x^i,  α_i ∈ R,  k an integer or k = ∞ },

that is, elements of V(x) are polynomials or infinite power series in
the element x.
In the case of a power-associative algebra V, for each element
x ∈ V we may also define

    V[x] := { v ∈ V : v = Σ_{i=0}^{k} α_i x^i,  α_i ∈ R,  k an integer },

that is, V[x] is the subalgebra of V(x) that is made up of polynomials
in x.
Remark 10 Later we shall see that V[x] and V(x) are actually
equal.
Remark 11 (Mn , ◦) is power associative.
Proof: For p a nonnegative integer and X ∈ Mn , let X◦p = X ◦
X◦(p−1) , with X◦0 = I. We first show that X◦p = Xp (ordinary matrix
multiplication, which is associative), which we can do by induction.
For the base case, note that:
X◦0 = X0 = I.
Now, assume that X^{◦k} = X^k. Then

    X^{◦(k+1)} = X ◦ X^{◦k} = X ◦ X^k = (X X^k + X^k X)/2 = X^{k+1}.
Now, for arbitrary integers q, r, s, and real numbers αi , βj , γk (i =
0, . . . , q, j = 0, . . . , r, k = 0, . . . , s), let
    U = Σ_{i=0}^{q} α_i X^i,    V = Σ_{i=0}^{r} β_i X^i,    W = Σ_{i=0}^{s} γ_i X^i.

Since U and V are polynomials in X, they commute under ordinary matrix
multiplication, so U ◦ V = U · V, and thus

    U ◦ (V ◦ W) = (U ◦ V) ◦ W.
Example 23 (The algebra associated with SOCP) Let V = Rn+1
and ◦ : V × V −→ V such that, for x, y ∈ V,

    (x_0, x_1, . . . , x_n)^T ◦ (y_0, y_1, . . . , y_n)^T
        = ( x^T B y,  x_1 y_0 + y_1 x_0,  x_2 y_0 + y_2 x_0,  . . . ,  x_n y_0 + y_n x_0 )^T,
where B is a nonsingular symmetric matrix. Note that:
1. The right-hand side is bilinear and thus (Rn+1 , ◦) is indeed an
algebra satisfying the distributive law.
2. The multiplication ◦ is in general non-associative but it is commutative, since B is symmetric.
3. That ◦ is power associative will be shown later in the next
lecture.
4. When B = I then the complementary slackness theorem for
SOCP can be expressed succinctly as x ◦ z = 0.
5. For B = I, the identity element is e = (1, 0, . . . , 0)T . More
generally if B is such that Be = e, then e is the identity
element.
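To make Example 23 concrete, here is a minimal NumPy sketch of the product (an added illustration; jordan_socp is a helper name of my choosing, and B defaults to the identity). It checks bilinearity, commutativity, the identity element, and illustrates the lack of associativity:

    import numpy as np

    def jordan_socp(x, y, B=None):
        """Jordan product of Example 23 on R^{n+1}; B defaults to the identity."""
        if B is None:
            B = np.eye(x.size)
        first = x @ (B @ y)
        rest = x[1:] * y[0] + y[1:] * x[0]
        return np.concatenate(([first], rest))

    rng = np.random.default_rng(4)
    n = 5
    x, y, z = (rng.standard_normal(n + 1) for _ in range(3))
    e = np.zeros(n + 1); e[0] = 1.0

    # Commutativity (item 2) and the identity element for B = I (item 5).
    assert np.allclose(jordan_socp(x, y), jordan_socp(y, x))
    assert np.allclose(jordan_socp(e, x), x)

    # Bilinearity (item 1), checked on a random combination.
    a, b = 0.7, -1.3
    assert np.allclose(jordan_socp(x, a * y + b * z),
                       a * jordan_socp(x, y) + b * jordan_socp(x, z))

    # Non-associativity in general (item 2): compare (x o y) o z with x o (y o z).
    print("associative?", np.allclose(jordan_socp(jordan_socp(x, y), z),
                                      jordan_socp(x, jordan_socp(y, z))))  # typically False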
Definition 35 (Jordan Algebra (not necessarily Euclidean))
Let (J, ◦) be an algebra. (J, ◦) is a Jordan algebra if and only if,
for all x, y ∈ J,

    x ◦ y = y ◦ x,                                               (9.3)
    x^2 ◦ (x ◦ y) = x ◦ (x^2 ◦ y),                               (9.4)

where x^2 = x ◦ x.

Thus, a Jordan algebra is a commutative algebra which has a
property similar to, but weaker than, associativity.
Remark 12 Let x ◦ y = L(x)y. Then property (9.4) implies
that L(x)L(x^2)y = L(x^2)L(x)y for all y. In other words, (9.4) is
equivalent to

    L(x)L(x^2) = L(x^2)L(x),                                     (9.5)

that is, L(x) and L(x^2) commute.

Remark 13 (Mn, ◦) is a Jordan algebra.
Proof: Note that (9.3) is trivial by the commutativity of matrix addition.
To see (9.4), note that, for X, Y ∈ Mn,

    X ◦ (X^2 ◦ Y) = ( X ( (X^2 Y + Y X^2)/2 ) + ( (X^2 Y + Y X^2)/2 ) X ) / 2
                  = ( X^3 Y + X Y X^2 + X^2 Y X + Y X^3 ) / 4
                  = ( X^2 ( (XY + YX)/2 ) + ( (XY + YX)/2 ) X^2 ) / 2
                  = X^2 ◦ (X ◦ Y).
Remark 14 If (V, ∗) is an associative algebra, then it induces a
Jordan algebra (V, ◦), where x ◦ y = (x ∗ y + y ∗ x)/2. The proof given to
show that (Mn, ◦) is a Jordan algebra is also valid for (V, ◦).
Lemma 11 Jordan algebras are power associative.
When a Jordan algebra is induced by an associative algebra, then it
is easy to show that it is power associative. Indeed the proof given
above that (Mn , ◦) is power associative is verbatim applicable to
such algebras. It turns out that there are Jordan algebras that are
not induced by any associative algebra. Such algebras are called
exceptional Jordan algebras. To prove that any Jordan algebra is
power associative in general is a bit complicated and is omitted.
Chapter 10
Jordan algebras
Scribe: Ge Zhang
11/19/2001
10.1
Overview
In this section we continue the study of Jordan algebras. We first
study the notions of minimum and characteristic polynomials, and
from there define eigenvalues. Next we introduce a central concept
called the quadratic representation.
10.2
Two examples of Jordan algebra and the
identity element
10.2.1
Two prototypical examples
In the last lecture we showed that the following examples are indeed
Jordan algebras with identity. We use these two algebras as our
prototypical examples throughout.
Jordan algebra of square matrices
The set Mn of all n × n matrices under the binary operation "◦"
defined by X ◦ Y = (XY + YX)/2 forms the Jordan algebra (Mn, ◦). Clearly
I, the identity matrix, is the identity element with respect to this
Jordan algebra.
The set Sn of all n × n symmetric matrices under the “◦” operation forms a subalgebra of (Mn , ◦). Even though Sn is not closed
under the ordinary (associative multiplication), it is closed under
the “◦” operation. Since the identity matrix is symmetric it is the
identity of Sn as well.
The algebra of quadratic forms
Define ∀x = (x0 ; x̄)T , y = (y0 ; ȳ)T ∈ Rn+1
x ◦ y = (x0 y0 + x̄T Bȳ; x0 ȳ + y0 x̄)
where B is a symmetric n × n matrix. Then (Rn+1, ◦) is also a
Jordan algebra. Clearly, under this multiplication e := (1; 0) is the
identity element. Below, we will always assume that B = I, the
identity matrix, unless otherwise stated.
10.3
The minimum polynomial
In this section we start developing the concepts of eigenvalues, rank,
inverse elements, and so on. In the algebra of matrices, these concepts are developed in a somewhat different order than the one we
are about to present. In fact, in Jordan algebras we start from
minimum polynomials, then define the notion of the characteristic
polynomials, and from there define eigenvalues, inverse, trace, determinant and other concepts. This, somewhat reverse development
happens to be more convenient for general power-associative algebras.
From Lecture 9 recall that in a power associative algebra (A, ⋆) the
subalgebra generated by a single element x ∈ A, denoted by A(x), is
associative. Also, recall that A(x) is essentially the set of all power
series in a single element, i.e. the set of Σ_i f_i x^i. A subalgebra of
this associative algebra is the set of polynomials
A[x] = { Σ_{i=0}^{k} f_i x^i | k ∈ Z }, where Z = {0, 1, 2, . . .}.
10.3.1
Degree and rank
Definition 36 Let A be a power-associative algebra with identity
element e such that dim(A) = n. For x ∈ A, let d be the smallest
integer such that {e, x, · · · , x^d} is linearly dependent. Then d is
called the degree of x and is denoted by deg(x).
Definition 37 Let r = maxx∈A deg(x), then r is called rank of A
and denoted by rank(A).
Definition 38 The vector x is regular if deg(x) = rank(A) = r.
It can be shown that if the underlying vector space is over the set
of real numbers R, then the set of regular elements is dense in A.
This means that for every element x ∈ A and any positive number
ε there is a regular element in the ball of radius ε centered at x.
Furthermore, almost all elements in A are regular. These statements
will become clear shortly.
It is now easy to see that the set of power series A(x) and the
set of polynomials A[x] are identical. Since x^d can be written as
a linear combination of the x^j for j = 0, 1, . . . , d − 1, each power
series Σ_{j=0}^{∞} α_j x^j can be written as a polynomial Σ_{j=0}^{d−1} β_j x^j,
where each β_j is an infinite sum over an appropriate subsequence of the α_i.
10.3.2
Definition of minimum polynomials
Definition 39 Let x ∈ A have degree d. Then x^d, x^{d−1}, · · · , x, e
are linearly dependent. Therefore there are functions a_1(x), a_2(x), · · · , a_d(x)
such that

    x^d − a_1(x) x^{d−1} + · · · + (−1)^d a_d(x) e = 0.

The polynomial m_x(t) := t^d − a_1(x) t^{d−1} + · · · + (−1)^d a_d(x) is called
the minimum polynomial of x.
Now suppose there are two minimum polynomials m1 (t) and
m2 (t) for x. Then they both are monic (i.e. coefficient of the highest degree term is 1) and of the same degree, say d. But this means
for m3 (t) = m1 (t) − m2 (t) we must have m3 (x) = 0, which is a
contradiction since the degree of m3 (t) is less than d. Thus,
Lemma 12 The minimum polynomial of an element x is unique.
Observe that deg(e) = 1 since e and e1 are already linearly
dependent and therefore the minimum polynomial of e is m(t) =
t − 1.
For each regular element x, there are r functions aj (x) that determine the coefficients of the minimum polynomial. These coefficients
depend only on x and otherwise are fixed. Thus if we know the functional form of these coefficients, we have in principle an algorithm
that determines the minimum polynomial.
We can show almost immediately that the coefficients of the minimum
polynomial are in fact rational functions of x. To see this, note
that, since {e, x, . . . , x^{d−1}} is a set of linearly independent vectors,
we can extend it by a set of n − d other vectors to a basis of the
vector space A. Let this basis be {e, x, . . . , x^{d−1}, e_d, . . . , e_n}. Then
the equation

    x^d − a_1(x) x^{d−1} + · · · + (−1)^d a_d(x) e = 0

can be thought of as a system of n linear equations in the d unknowns
a_j(x). By Cramer's rule, each a_j(x) is a ratio of two determinants:

    a_j(x) = Det( e, x, . . . , x^{j−1}, x^d, x^{j+1}, . . . , x^{d−1}, e_d, . . . , e_n )
           / Det( e, x, . . . , x^{j−1}, x^j, x^{j+1}, . . . , x^{d−1}, e_d, . . . , e_n ).
In fact it can be shown that aj (x) are all polynomials in x, that
is in the ratio of determinants above, the denominator, divides the
numerator. This is a consequence of a lemma in ring theory due to
Gauss, which is beyond the scope of our course to state and prove.
Accepting that the aj are polynomials, we can actually prove
that they are homogeneous polynomials. (Recall that f(x) is said
to be homogeneous of degree p, if f(αx) = αp f(x)). To see this
suppose m1 (t) is the minimum polynomial of x and m1 (x) = xd −
a1 (x)xd−1 + · · · + (−1)d ad (x)e = 0. Clearly, ∀α ∈ R, α 6= 0, if
deg(x) = d then deg(αx) = d as well. In fact,
m1 (αx) = (αx)d − a1 (αx)(αx)d−1 + · · · + (−1)d ad (αx)e = 0.
def
Let us call m2 (t) = m1 (αt). Since, m2 is monic and of degree d,
and m2 (x) = m1 (αx) = 0, we conclude that m2 (t) is identical to
the minimum polynomial of x, that is m1 (t) ≡ m2 (t). Therefore,
αd aj (x) = αd−j aj (αx) which means that aj (αx) = αj aj (x). Thus
we have proven
Lemma 13 The coefficients aj (x) of the minimum polynomials of
x are homogeneous polynomials of degree j in x.
It follows in particular that the coefficient a1 (x) is a linear function
of x.
Using the notion of minimum polynomials we can now extend
the usual notions of linear algebra, such as trace, determinant, and
eigenvalues, at least for regular elements.
Definition 40 Let x be a regular element of the Jordan algebra J of
rank r and m(t) its minimum polynomial. (Thus m(t) is of degree
r.) Then
• the roots of m(t), denoted λ_i(x), are the eigenvalues of x,

• the trace of x is tr(x) := a_1(x) = Σ_i λ_i,

• the determinant of x is det(x) := a_r(x) = Π_i λ_i.
Example 24 (Minimum polynomials in (Mn, ◦)) For (Mn, ◦),
the concept of minimum polynomial coincides with the linear algebraic
notion we are already familiar with. Also, a matrix X is regular
if its characteristic polynomial (defined as F(t) = Det(tI − X))
coincides with its minimum polynomial. (From linear algebra we know
that the minimum polynomial of a matrix divides its characteristic
polynomial.) For regular matrices X, the notions of trace,
determinant, and eigenvalues coincide with the familiar ones.

For symmetric matrices, that is the subalgebra (Sn, ◦), we know
that each X has orthogonal eigenvectors; that is, there is an
orthogonal matrix Q with columns q_i such that X = QΛQ^T =
Σ_i λ_i q_i q_i^T. We also know that for a symmetric matrix X the
characteristic and minimum polynomials coincide if, and only if, X has
n distinct eigenvalues.
Example 25 (Minimum polynomials for (Rn+1, ◦)) It can be easily
verified that for each x = (x_0; x̄),

    x^2 − 2x_0 x + (x_0^2 − x̄^T B x̄) e = 0.

Thus, (Rn+1, ◦) is a Jordan algebra of rank 2. The roots of this
quadratic polynomial are x_0 ± √(x̄^T B x̄), i.e. x_0 ± ||x̄|| when B = I.
If these two roots are equal then we must have ||x̄|| = 0, which implies
x̄ = 0; that is, x is a multiple of the identity element e. Since
deg(e) = 1 in every Jordan algebra, it follows that the only nonregular
elements in Rn+1 are the multiples of the identity element. We also
see that

    tr(x) = 2x_0   and   det(x) = x_0^2 − x̄^T B x̄.
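A short numerical check of this rank-2 relation with B = I (an added illustration, not part of the original notes; jordan_socp is the helper introduced earlier):

    import numpy as np

    def jordan_socp(x, y):
        # Jordan product on R^{n+1} with B = I
        return np.concatenate(([x @ y], x[0] * y[1:] + y[0] * x[1:]))

    rng = np.random.default_rng(5)
    n = 6
    x = rng.standard_normal(n + 1)
    e = np.zeros(n + 1); e[0] = 1.0

    x2 = jordan_socp(x, x)
    trace = 2 * x[0]
    det = x[0] ** 2 - x[1:] @ x[1:]

    # The degree-2 relation x^2 - tr(x) x + det(x) e = 0.
    assert np.allclose(x2 - trace * x + det * e, 0)

    # Its roots, x0 +/- ||x_bar||, satisfy t^2 - tr(x) t + det(x) = 0.
    lam1 = x[0] + np.linalg.norm(x[1:])
    lam2 = x[0] - np.linalg.norm(x[1:])
    assert np.allclose([lam1**2 - trace*lam1 + det,
                        lam2**2 - trace*lam2 + det], 0)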
10.4
Characteristic polynomial and the inverse
10.4.1
Characteristic polynomial
To extend the notions of eigenvalues, trace and determinant from
regular elements to all elements of the Jordan algebra J we need
to develop the notion of the characteristic polynomial (sometimes
called the generic minimum polynomial). An easy way to do this
extension is to first define for regular elements the characteristic
polynomial to be the same as the minimum polynomial. Then, since
we have already seen that the coefficients of the this polynomial are
polynomials in x, it follows that they are well-defined for all x. Thus,
Definition 41 For each x ∈ J, where (J, ◦) is a Jordan algebra of
rank r, let the r polynomials aj (x) for j = 1, . . . , r be defined as
above. Then,
i. The characteristic polynomial of x is

       f_x(t) := t^r − a_1(x) t^{r−1} + · · · + (−1)^r a_r(x),

ii. The r roots of f_x are the eigenvalues of x, and the algebraic
    multiplicity of an eigenvalue λ_i is its multiplicity as a root of
    the characteristic polynomial,

iii. tr(x) := a_1(x) = Σ_{i=1}^{r} λ_i  and  det(x) := a_r(x) = Π_{i=1}^{r} λ_i.
It can be shown that the minimum polynomial of x divides the
characteristic polynomial, and the roots of the minimum polynomial
are the same as the roots of the characteristic polynomial, except
that the multiplicity of each root may be larger in the characteristic
polynomial. It also follows from the definition that an element x is
regular if, and only if its minimum and characteristic polynomials
are identical.
As an example, the characteristic polynomial of the identity element e is the polynomial (t−1)r , and thus tr(e) = r and det(e) = 1,
as in the algebra of matrices.
Also, one can show the extension of the Cayley-Hamilton theorem
of linear algebra:
Theorem 14 For each element x ∈ J, where (J, ◦) is a Jordan
algebra fx (x) = 0.
10.4.2
The linear operator L0
By the Cayley–Hamilton theorem f_x(x) = 0, that is,

    x^r − a_1(x) x^{r−1} + · · · + (−1)^r a_r(x) e = 0.

Applying the L operator to both sides, and noting that the L^k(x)
commute for all k, we get

    L^r(x) − a_1(x) L^{r−1}(x) + · · · + (−1)^r a_r(x) L(e) = L(0).

Since L(e) = I, the identity matrix, and L(0) = 0, we get that
f_x(L(x)) = 0; that is, L(x) is a zero of the characteristic polynomial
f_x(t) of x. From matrix algebra we know that for any square matrix
A, any polynomial p(t) for which p(A) = 0 must divide the characteristic
polynomial (in the ordinary matrix sense) of A. Thus,

    F_{L(x)}(t) := Det( tI − L(x) ) = f_x(t) q(t)
for some polynomial q(t). Since, for a regular element x, {e, x, · · · , x^{r−1}}
is a basis of the subalgebra J[x], we have, for all u ∈ J[x], u = Σ_{i=0}^{r−1} α_i x^i.
Therefore,

    x ◦ u = L(x)u = Σ_{i=0}^{r−1} α_i x^{i+1} = Σ_{i=0}^{r−1} β_i x^i,

where the β_i are obtained by noting that x^r = Σ_{i=0}^{r−1} (−1)^{r−i−1} a_{r−i}(x) x^i,
so that β_i = α_{i−1} + (−1)^{r−i−1} α_{r−1} a_{r−i}(x) (with α_{−1} := 0). As a result, even though L(x) is
an n × n matrix, it maps J[x] back to itself; therefore, its restriction
to the r-dimensional subspace J[x] may be considered as an r × r
matrix, denoted by L_0(x). The matrix L_0(x) can actually be written
explicitly with respect to the basis {e, x, . . . , x^{r−1}}: it is the
matrix that maps the α_i to the β_i:

    L_0(x) = [ 0  0  · · ·  0   (−1)^{r−1} a_r(x)     ]
             [ 1  0  · · ·  0   (−1)^{r−2} a_{r−1}(x) ]
             [ 0  1  · · ·  0   (−1)^{r−3} a_{r−2}(x) ]
             [ .  .   . .   .          .              ]
             [ 0  0  · · ·  1        a_1(x)           ]

Then x ◦ u = L_0(x)u for all u ∈ J[x].
Lemma 14 If x is regular in the rank r Jordan algebra (J, ◦), then
for every polynomial p(t) we have p(x) = 0 if, and only if, p(L_0(x)) = 0.

Thus, remembering that the a_j(x) are defined for all elements and not
just the regular elements, we have
Theorem 15 The characteristic polynomial of x is identical to the
characteristic polynomial of the matrix L0 (x).
10.4.3
Definition of the inverse
At first glance it seems natural to define the inverse of an element
in a Jordan algebra (J, ◦) with an identity element e, to be the
element (if it exists) y such that y ◦ x = e. The problem is that,
unlike associative algebras, this y may not be unique. For instance,
in the Jordan algebra (M2, ◦), define

    X = [ 1   0 ],      T = { [ 1   t ] : t ∈ R }.
        [ 0  −1 ]            [ t  −1 ]

Then every element Y of T has the property that Y ◦ X = (YX + XY)/2 = I.
Remember that in associative algebras for each x, there is at most
one element y such that xy = yx = e. This is easy to see. Let y1
and y2 be two vectors such that
xy1 = y1 x = e
xy2 = y2 x = e
Multiply both sides of the first equation by y2 to get y2 (xy1 ) = y2 .
By associativity, (y2 x)y1 = y2 , and since y2 x = e, we get y1 = y2 .
To define an appropriate notion of inverse, we insist that the
inverse of x also be in the (associative) subalgebra J[x]; this
requirement ensures uniqueness of the inverse element. In this
subalgebra, it turns out that x^{-1} can actually be expressed as a
polynomial in x: since x^r − a_1(x)x^{r−1} + · · · + (−1)^r a_r(x) e = 0,
we can give

Definition 42

    x^{-1} := ( x^{r−1} − a_1(x) x^{r−2} + · · · + (−1)^{r−1} a_{r−1}(x) e ) / ( (−1)^{r−1} a_r(x) ).
A simple calculation shows that x−1 ◦ x = e.
Example 26 (Inverse in (Mn, ◦) and (Rn+1, ◦))

• For the Jordan algebra (Mn, ◦), the inverse X^{-1} coincides with
  the usual matrix theoretic inverse of the matrix X.

• For (Rn+1, ◦),

      x^{-1} := (x − 2x_0 e) / ( −(x_0^2 − ||x̄||^2) )
              = (1 / (x_0^2 − ||x̄||^2)) (x_0, −x_1, . . . , −x_n)^T
              = (1 / det(x)) Rx.

  For an arbitrary symmetric matrix B only the last equality holds.
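A quick check of the formula x^{-1} = Rx / det(x) with B = I (illustrative NumPy sketch, not part of the original notes; jordan_socp is my helper from before):

    import numpy as np

    def jordan_socp(x, y):
        return np.concatenate(([x @ y], x[0] * y[1:] + y[0] * x[1:]))

    rng = np.random.default_rng(6)
    n = 5
    x = rng.standard_normal(n + 1)
    e = np.zeros(n + 1); e[0] = 1.0

    det = x[0] ** 2 - x[1:] @ x[1:]
    assert abs(det) > 1e-12            # x must be invertible, i.e. det(x) != 0

    R = np.diag([1.0] + [-1.0] * n)    # the reflection matrix R = Diag(1, -1, ..., -1)
    x_inv = (R @ x) / det              # x^{-1} = Rx / det(x)

    assert np.allclose(jordan_socp(x, x_inv), e)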
10.5
The quadratic representation
A fundamental concept in Jordan algebras is the notion of quadratic
representation, which is a linear transformation associated to each
element x of a Jordan algebra J. First let us give the
Definition 43 The quadratic representation of x ∈ J is

    Q_x := 2L^2(x) − L(x^2),   and thus   Q_x y = 2x ◦ (x ◦ y) − x^2 ◦ y.
At first glance this seems a bit arbitrary. However, this matrix is
the generalization of the operation in square matrices that sends a
matrix Y to XYX:

    Q_X Y = 2 X ◦ (X ◦ Y) − X^2 ◦ Y
          = X ( (XY + YX)/2 ) + ( (XY + YX)/2 ) X − (X^2 Y + Y X^2)/2
          = ( X^2 Y + XYX + XYX + Y X^2 − X^2 Y − Y X^2 ) / 2
          = XYX.

Therefore, vec(XYX) = (X^T ⊗ X) vec(Y), that is, Q_X = X^T ⊗ X.
Example 27 (Q_x in (Rn+1, ◦)) For (Rn+1, ◦),

    Q_x = 2 Arw(x)^2 − Arw(x^2) = [  ||x||^2          2 x_0 x̄^T        ]
                                  [  2 x_0 x̄    det(x) I + 2 x̄ x̄^T  ]
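The following NumPy sketch (added for illustration, not part of the original notes) computes Q_x in (Rn+1, ◦) both from the definition 2 Arw(x)^2 − Arw(x^2) and from the block formula above, and checks Q_x e = x^2:

    import numpy as np

    def arw(x):
        A = x[0] * np.eye(x.size)
        A[0, 1:] = x[1:]
        A[1:, 0] = x[1:]
        return A

    def jordan_socp(x, y):
        return np.concatenate(([x @ y], x[0] * y[1:] + y[0] * x[1:]))

    rng = np.random.default_rng(7)
    n = 5
    x = rng.standard_normal(n + 1)
    x2 = jordan_socp(x, x)
    det = x[0] ** 2 - x[1:] @ x[1:]

    Qx = 2 * arw(x) @ arw(x) - arw(x2)       # definition of Q_x

    block = np.empty((n + 1, n + 1))         # closed-form block matrix
    block[0, 0] = x @ x
    block[0, 1:] = 2 * x[0] * x[1:]
    block[1:, 0] = 2 * x[0] * x[1:]
    block[1:, 1:] = det * np.eye(n) + 2 * np.outer(x[1:], x[1:])
    assert np.allclose(Qx, block)

    e = np.zeros(n + 1); e[0] = 1.0
    assert np.allclose(Qx @ e, x2)           # Q_x e = x^2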
Chapter 11
The Quadratic
Representation and
Euclidean Jordan Algebras
Scribe: Xuan Li
11/26/2001
11.1
Overview
We first explore more properties of the quadratic representation
Q_x introduced in the last lecture. Then we discuss Euclidean Jordan
algebras and the spectral decomposition in Euclidean Jordan algebras,
in order to prepare the tools for verifying the polynomial-time
algorithms for SDP and SOCP. Finally, we introduce the cone of squares
of a Euclidean Jordan algebra.
11.2
Properties of Qx
In Jordan algebras we do not have access to one of the most fundamental algebraic properties, namely the associative law. Therefore
many of the tricks and short-cuts we have learned to conduct algebraic manipulations are no longer available to us. There are however
alternative techniques. One of the most important ones is the use of
derivatives to prove algebraic identities. This technique is the basis
of a more sophisticated technique called polarization. Let us quickly
review some of the properties of derivatives and differentials.
Definition 44 (Differential) Let f : Rn → Rm be a continuous
and sufficiently smooth function, and x be its vector of variables.
The differential of f with respect to x is the m × n matrix Dx f
defined as

    (D_x f)_{ij} := ∂f_i(x) / ∂x_j.

The derivative of f along direction u is

    D_x^u f := (D_x f) u.
At first sight it seems that we are using a concept from mathematical
analysis to prove algebraic identities. However, partial derivatives
∂fi /∂xj can be defined completely algebraically, as long as the function f is a polynomial or a power series in x. As will be seen shortly,
this is the only sense in which we will need the differential. In other
words, we really need not use any of the limit definitions that are
presented in calculus texts. Rather we can start from defining the
partial derivatives with respect to a single variable:

    ∂(αx^r)/∂x = r α x^{r−1},
    ∂(f(x) + g(x))/∂x = ∂f(x)/∂x + ∂g(x)/∂x,
    ∂(f(x)g(x))/∂x = (∂f(x)/∂x) g(x) + f(x) (∂g(x)/∂x),

and derive all the necessary identities from these—at least for the
and derive all the necessary identities from these—at least for the
functions that involve polynomials or power series. We need not
assume any topological structure on the vector space underlying our
Jordan algebras. For all we care, they may be over discrete fields
such as the set of rational numbers, or even fields like {0, 1, . . . , 6}
with multiplication and addition done modulo 7.
We now recall some identities for differentials that will be needed
for our proofs.
Lemma 15 If T is a linear transformation, then Dx T (f) = T (Dx f).
In a power associative algebra (A, ∗) with identity element e,
where x ∗ y = L(x)y, we have

    D_x x = I = L(e),   and thus   D_x^y x = L(e)y = e ∗ y = y.

Since L is a linear transformation,

    D_x^y L(x) = L(D_x^y x) = L(y),

and

    D_x^y (x^2) = D_x^y (L(x)x) = L(x)y + L(y)x = 2(x ∗ y),

and therefore,

    D_x^y L(x^2) = 2L(x ∗ y).
Lemma 16 Let f : R^m → R^n and g : R^k → R^m be continuous
and sufficiently smooth functions. Then

    D_x f(g(x)) = D_{g(x)} f(g(x)) · D_x g(x).
Now we are ready to prove some key properties of the Qx operator.
11.2.1
Properties of Qx
1. Q_x and L(x) commute.
Proof: L(x) and L^2(x) commute, and L(x) and L(x^2) commute.
Consequently, Q_x = 2L^2(x) − L(x^2) and L(x) commute.
2. Qαx = α2 Qx where α is a real number.
Proof:
Qαx = 2L2 (αx) − L(α2 x2 )
= 2(L(αx))(L(αx)) − L(α2 x2 )
= 2α2 L2 (x) − α2 L(x2 ) = α2 Qx
The third equality comes from the linearity of L
3. Qx e = x2 where e is the identity element.
Proof:
Qx e = 2L2 (x)e − L(x2 )e
= 2L(x)L(x)e − L(x2 )e
= 2x ◦ (x ◦ e) − x2 ◦ e = x2
4. Qx x−1 = x
Proof:
Qx x−1 =
=
=
=
2L2 (x)x−1 − L(x2 )x−1
2x ◦ (x ◦ x−1 ) − x2 ◦ x−1
2x ◦ e − x ◦ (x ◦ x−1 )
2x − x = x
5. Q_x L(x^{-1}) = L(x).
This is the first place where we need the more sophisticated
technique of using differentials. This is an auxiliary lemma, used to
prove some of the other properties listed below.
Proof: We have the identity L(y)L(y^2) − L(y^2)L(y) = 0 in general.
Take the derivative of both sides along direction x:
Dxy {L(y)L(y2 ) − L(y2 )L(y)} = 0
⇒ Dxy L(y)L(y2 )+L(y)Dxy L(y2 )−Dxy L(y2 )L(y)−L(y2 )Dxy L(y) = 0
⇒ L(x)L(y2 ) + L(y)[2L(x ◦ y)] − 2L(x ◦ y)L(y) − L(y2 )L(x) = 0
Multiplying from right by an arbitrary vector z
x ◦ (y2 ◦ z) + 2y ◦ [(x ◦ y) ◦ z] − 2(x ◦ y) ◦ (y ◦ z) − y2 ◦ (x ◦ z) = 0
⇒ (y2 ◦z)◦x+2y◦[z◦(y◦x)]−2(y◦z)◦(y◦x)−y2 ◦(z◦x) = 0
⇒ L(y2 ◦z)x+2[L(y)L(z)L(y)]x−2L(y◦z)L(y)x−L(y2 )L(z)x = 0
Since this is true for any vector x:
⇒ L(y2 ◦z)+2[L(y)L(z)L(y)]−2L(y◦z)L(y)−L(y2 )L(z) = 0 (1)
This is an important identity which we will use a few times
below. In order to show that Qx L(x−1 ) = L(x) replace for z,
y−1 to get
⇒ L(y2 ◦y−1 )+2[L(y)L(y−1 )L(y)]−2L(y◦y−1 )L(y)−L(y2 )L(y−1 ) = 0
Note that L(y−1 ) and L(y) commute
L(y) + 2L(y)L(y−1 )L(y) − 2L(y) − L(y2 )L(y−1 ) = 0
⇒ [2L2 (y) − L(y2 )]L(y−1 ) = L(y)
i.e., Qy L(y−1 ) = L(y)
6. (Qy )−1 = Qy−1 if y−1 exists.
Proof: Again in identity (1) above let z = y−2 to get
L(y2 ◦y−2 )+2L(y)L(y−2 )L(y)−2L(y◦y−2 )L(y)−L(y2 )L(y−2 ) = 0
Note that y2 ◦ y−2 = e , L(y) and L(y−2 ) commute.
I = −2L2 (y)L(y−2 ) + 2L(y−1 )L(y) + L(y2 )L(y−2 )
= −[2L2 (y) − L(y2 )]L(y−2 ) + 2L(y−1 )L(y)
= −Qy L(y−2 ) + 2L(y−1 )L(y)
Since Qy L(y−1 ) = L(y)
I = −Qy L(y−2 ) + 2Qy L2 (y−1 )
= Qy [2L2 (y−1 ) − L(y−2 )] = Qy Qy−1
i.e., (Qy )−1 = Qy−1
Remark 15 Property 6 states that if y^{-1} exists, then (Q_y)^{-1} exists
and it is exactly Q_{y^{-1}}. As a matter of fact, the converse direction
is also true: if Q_y^{-1} exists, then y^{-1} exists.
7. Dx x−1 = −(Qx )−1 = −Qx−1
Proof: Since x ◦ x−1 = e, Dx (x ◦ x−1 ) = Dx e.
L(x)Dx x−1 + L(x−1 )I = 0
(2)
Since x^2 ◦ x^{-1} = x, we have D_x(x^2 ◦ x^{-1}) = D_x x, i.e.,

    L(x^2) D_x x^{-1} + L(x^{-1}) D_x x^2 = I,

that is,

    L(x^2) D_x x^{-1} + 2L(x^{-1}) L(x) = I.          (3)
Now multiply (2) by 2L(x) and subtract (3):

    [2L^2(x) − L(x^2)] D_x x^{-1} + 2L(x)L(x^{-1}) − 2L(x^{-1})L(x) = −I,

i.e., Q_x (D_x x^{-1}) = −I  ⇒  D_x x^{-1} = −(Q_x)^{-1} = −Q_{x^{-1}}.
8. (Qx y)−1 = Qx−1 y−1
Proof: Using Property 5), Qx L(x−1 ) = L(x) = L(x−1 )Qx .
Multiply both sides by y from right to get
x−1 ◦ (Qx y) = x ◦ y
Take the derivatives of both sides along direction u
Dux {x−1 ◦ (Qx y)} = Dux (x ◦ y)
⇒ Dux {L(x−1 )(Qx y)} = Dux (x ◦ y)
Dux L(x−1 )Qx y + L(x−1 )Dux Qx y = u ◦ y (4)
Note that

    D_x^y Q_x = D_x^y {2L^2(x) − L(x^2)}
              = 2L(x)L(y) + 2L(y)L(x) − L( L(y)x + L(x)y )
              = 2[ L(x)L(y) + L(y)L(x) − L(x ◦ y) ] = 2 Q_{x,y},

where Q_{x,y} := L(x)L(y) + L(y)L(x) − L(x ◦ y).
Plug this result into (4)
−(Qx y) ◦ (Qx−1 u) + 2L(x−1 )Qx,u y = u ◦ y
Let u = y−1
−(Qx y) ◦ (Qx−1 y−1 ) + 2L(x−1 )Qx,y−1 y = e
Since
Qx,y−1 y = L(x)L(y−1 )y + L(y−1 )L(x)y − L(x ◦ y−1 )y
= x ◦ (y−1 ◦ y) + y−1 ◦ (x ◦ y) − (x ◦ y−1 )y = y−1 ◦ (x ◦ y)
by using (x ◦ y−1 ) ◦ y = y ◦ (y−1 ◦ x) = x.
−(Qx y) ◦ (Qx−1 y−1 ) + 2e = e
i.e., e = (Q_x y) ◦ (Q_{x^{-1}} y^{-1}).

Remark 16 This does not yet prove that Q_x y and Q_{x^{-1}} y^{-1}
are inverses of each other (recall that in a Jordan algebra (E, ◦)
there may be many y's such that x ◦ y = e; only the one with
y ∈ E[x] is the true inverse). We also need to show that Q_x y
and Q_{x^{-1}} y^{-1} belong to the same E[u] for some suitable u.
9. Q_{Q_x y} = Q_x Q_y Q_x.
Proof: To both sides of the identity (Q_y x)^{-1} = Q_{y^{-1}} x^{-1} apply
the operator D_x to get

    D_x (Q_y x)^{-1} = D_x ( Q_{y^{-1}} x^{-1} )
    −Q_{(Q_y x)^{-1}} Q_y = Q_{y^{-1}} D_x x^{-1}
    −Q_{Q_y x}^{-1} Q_y = −Q_{y^{-1}} Q_{x^{-1}}
    Q_{Q_y x}^{-1} = Q_{y^{-1}} Q_{x^{-1}} Q_{y^{-1}}
    Q_{Q_y x} = Q_y Q_x Q_y,

which, after exchanging the roles of x and y, is the stated identity.
11.3
Euclidean Jordan Algebras
11.3.1
Definitions and examples
Definition 45 (Euclidean Jordan Algebra) A Euclidean Jordan
algebra (E, ◦) is a Jordan algebra such that tr(x^2) > 0 for all
x ≠ 0, x ∈ E.
Example 28 (Jordan algebra of matrices) The Jordan algebra
(Mn×n, ◦) is not Euclidean, since for an arbitrary n × n matrix
tr(X^2) may be negative. For example, let

    X = [ −1  −2 ],   then   X^2 = [ −3   0 ],
        [  2   1 ]                 [  0  −3 ]

so tr(X^2) = −6 < 0. However, the Jordan subalgebra (Sn×n, ◦) is
Euclidean, since

    tr(X^2) = Σ_{i,j=1}^{n} X_{ij}^2 > 0   if X ≠ 0.
Example 29 (Jordan algebra of SOCP) (Rn+1, ◦) is a Euclidean
Jordan algebra, since

    tr(x^2) = 2 (x^2)_0 = 2 x^T x > 0   if x ≠ 0.
11.3.2
Spectral decomposition in Euclidean Jordan algebras
From the previous lectures, we know that if A ∈ Sn, then A can be
expressed as

    A = QΛQ^T = Σ_{i=1}^{n} λ_i q_i q_i^T,

where Λ = Diag(λ_1, λ_2, . . . , λ_n) and Q = (q_1, . . . , q_n); λ_i is the
i-th eigenvalue of A and q_i is the corresponding eigenvector, with
QQ^T = I. Since A is symmetric, all λ_i, i = 1, . . . , n, are real
numbers. Suppose there are k distinct λ's, say λ_1, . . . , λ_k; we can
arrange the eigenvalues so that

    λ_1 > λ_2 > · · · > λ_k,

with the multiplicity of λ_i equal to n_i. Define

    P_i = Σ_{j=1}^{n_i} q_{ij} q_{ij}^T,

so that A can also be rewritten as

    A = Σ_{i=1}^{k} λ_i P_i.
If n_i, the multiplicity of λ_i, is larger than one, then the eigenvectors
q_{ij} are not unique. However, and this is key, the matrix
P_i = Σ_j q_{ij} q_{ij}^T is unique. This observation from linear algebra
leads to the extension of the spectral decomposition to Euclidean
Jordan algebras. First, we need the following

Definition 46 Suppose (E, ◦) is a Euclidean Jordan algebra. If
c_1, . . . , c_k ∈ E satisfy

1. c_i^2 = c_i, i = 1, . . . , k,
2. c_i ◦ c_j = 0 for all i ≠ j,
3. Σ_{i=1}^{k} c_i = e,

then c_1, . . . , c_k form a complete system of orthogonal idempotents.
Example 30 (A simple example from S3) In (S3, ◦), let A =
2(q_1 q_1^T + q_2 q_2^T) + 3 q_3 q_3^T, where

    (q_1, q_2, q_3)(q_1, q_2, q_3)^T = I.

Let P_1 = q_1 q_1^T + q_2 q_2^T and P_2 = q_3 q_3^T. Then {P_1, P_2} is a
complete system of orthogonal idempotents. In fact,

    P_1^2 = (q_1 q_1^T + q_2 q_2^T)(q_1 q_1^T + q_2 q_2^T) = q_1 q_1^T + q_2 q_2^T = P_1,
    P_2^2 = (q_3 q_3^T)(q_3 q_3^T) = q_3 q_3^T = P_2,
    P_1 P_2 = 0 and thus P_1 ◦ P_2 = 0,   and   P_1 + P_2 = I.
Definition 47 An idempotent is primitive if it can not be written
as sum of other idempotents.
In the example above, P2 is primitive since it is rank 1. However, P1
is not primitive because it is sum of q1 qT1 and q2 qT2 , each of which
is an idempotent. In (Sn , ◦), only rank one matrices qqT where
qT q = 1 are primitive idempotents.
Definition 48 A Jordan frame is a complete system of orthogonal
idempotents, all of which are primitive.
Theorem 16 (spectral decomposition) Let x ∈ E where (E, ◦)
is a Euclidean Jordan algebra of rank r, then
i. there exist a unique system of orthogonal idempotents {c1 , . . . , ck }
and unique real numbers λ1 , . . . , λk such that x = λ1 c1 + . . . +
λk ck ; furthermore, each ci is a polynomial in x, that is ci ∈
E[x] and each λi is an eigenvalue of x;
ii. there exists a Jordan frame {c1 , . . . , cr } and real numbers λ1 , . . . , λr
such that x = λ1 c1 + . . . + λr cr , the λi are the eigenvalues of x
and r = rank(E).
Furthermore, the λi are the eigenvalues of x.
Thus this theorem is a generalization of the spectral decomposition
in symmetric matrices; this should be evident from the remarks
we made at the beginning of this section. As a consequence, in
Euclidean Jordan algebras the eigenvalues are always real. Also,
an element is regular (see the definition lecture 10) if the λi are all
distinct. In that case, the two versions of the spectral decomposition
stated in the theorem above coincide.
Remark 17 Using spectral decomposition we can extend many concepts from real numbers to Euclidean Jordan algebras. For instance,
if f : R → R is a continuous function, we can define a function
f : E → E as f(x) := Σ_{i=1}^{r} f(λ_i) c_i, for x ∈ E. As a special
example, we can define

    x^{-1} = (1/λ_1) c_1 + · · · + (1/λ_r) c_r,   and more generally,
    x^t = λ_1^t c_1 + · · · + λ_r^t c_r   for all t ∈ R,

where λ_i^t should be well-defined for all λ_i, i = 1, . . . , r. Notice that
x−1 above and the one defined earlier for Jordan algebras are the
same. Similarly for t an integer, xt is identical to the definition
given in Lecture 9.
We may also extend various norms that are defined on symmetric
matrices to Euclidean Jordan algebras. For instance, in (Sn , ◦), the
Frobenius norm ||·||_F is defined as

    ||A||_F = √( Σ_{i,j=1}^{n} A_{ij}^2 ) = √( tr(A^2) ) = √( Σ_{i=1}^{n} λ_i^2 ).
Analogously, in a Euclidean Jordan algebra we define the Frobenius
norm as

    ||x||_{F/E} = √( Σ_i λ_i^2 ),   for all x ∈ E,

where the λ_i are the eigenvalues of x.
Another important norm on symmetric matrices is the so-called
2-norm: ||A||_2 = max_i |λ_i|. This can also be extended to elements of
a Jordan algebra:

    ||x||_2 := max_i |λ_i|.
Example 31 (Spectral decomposition in (Rn+1, ◦)) In (Rn+1, ◦),
we have already seen that the eigenvalues of x = (x_0; x̄) are
x_0 ± ||x̄||, so

    ||x||_F = √( 2(x_0^2 + ||x̄||^2) ) = √2 ||x||.

Also we have seen that (Rn+1, ◦) is a rank-two Jordan algebra.
Thus for each x we need to find two vectors c_1 and c_2 forming a
Jordan frame, such that x can be written as x = λ_1 c_1 + λ_2 c_2. It
turns out that the Jordan frame in this algebra is given by

    c_1 = ( 1/2 ;  x̄/(2||x̄||) ),      c_2 = ( 1/2 ;  −x̄/(2||x̄||) ) = R c_1.

We claim that {c_1, c_2} is a Jordan frame. In fact, c_1 + c_2 = e is
obvious from the definition. To show that they are idempotents, note
that

    c_1^2 = (1/4) ( 1 + x̄^T x̄ / ||x̄||^2 ;  2 x̄ / ||x̄|| ) = ( 1/2 ;  x̄/(2||x̄||) ) = c_1,
    c_2^2 = (1/4) ( 1 + x̄^T x̄ / ||x̄||^2 ;  −2 x̄ / ||x̄|| ) = ( 1/2 ;  −x̄/(2||x̄||) ) = c_2,

and c_1 ◦ c_2 = 0.

Thus, the spectral decomposition is much simpler in the case of the
(Rn+1, ◦) algebra: we have simple closed formulas for both the
eigenvalues and the Jordan frames.
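A numerical illustration of this example (added; not part of the original notes), assuming x̄ ≠ 0 so that the frame is well defined:

    import numpy as np

    def jordan_socp(x, y):
        return np.concatenate(([x @ y], x[0] * y[1:] + y[0] * x[1:]))

    rng = np.random.default_rng(8)
    n = 5
    x = rng.standard_normal(n + 1)
    e = np.zeros(n + 1); e[0] = 1.0

    xb_norm = np.linalg.norm(x[1:])
    lam1, lam2 = x[0] + xb_norm, x[0] - xb_norm       # the two eigenvalues
    u = x[1:] / xb_norm                               # unit vector along x_bar
    c1 = np.concatenate(([0.5],  0.5 * u))            # the Jordan frame
    c2 = np.concatenate(([0.5], -0.5 * u))

    # Frame properties: idempotent, orthogonal, and summing to e.
    assert np.allclose(jordan_socp(c1, c1), c1)
    assert np.allclose(jordan_socp(c2, c2), c2)
    assert np.allclose(jordan_socp(c1, c2), 0)
    assert np.allclose(c1 + c2, e)

    # Spectral decomposition and the Frobenius norm formula.
    assert np.allclose(lam1 * c1 + lam2 * c2, x)
    assert np.isclose(np.sqrt(lam1**2 + lam2**2), np.sqrt(2) * np.linalg.norm(x))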
11.3.3
Cone of squares of a Euclidean Jordan algebra
Definition 49 The cone of squares of a Euclidean Jordan algebra
(E, ◦) is

    K_E = { x^2 : x ∈ E }.
In the next lecture we will show that KE is a convex and self-dual
cone with some additional properties.
However, the following lemma is immediate:
Lemma 17 In a Euclidean Jordan algebra (E, ◦), x ∈ KE if, and
only if, all eigenvalues of x are nonnegative.
Proof: If x = Σ_i λ_i c_i is the spectral decomposition of x, then
x^2 = Σ_i λ_i^2 c_i is the spectral decomposition of x^2, so every square
has nonnegative eigenvalues. Conversely, if all λ_i ≥ 0, then
x = y^2 for y = Σ_i √λ_i c_i, so x ∈ K_E.
Example 32 (Cone of squares of (Sn, ◦)) In (Sn, ◦), K_{Sn} is the
same as the cone of positive semidefinite matrices P_{n×n}. This is
easily seen once we notice that every positive semidefinite matrix
has nonnegative eigenvalues; that is, X ⪰ 0 if, and only if,
X = Σ_i λ_i q_i q_i^T = ( Σ_i √λ_i q_i q_i^T )^2.

Example 33 (Cone of squares of (Rn+1, ◦)) Since the eigenvalues
of x are x_0 ± ||x̄||, x is in the cone of squares if, and only if,
x_0 ± ||x̄|| ≥ 0, which means that x_0 ≥ ||x̄||. Thus the cone of
squares K_{Rn+1} is the second order cone.
Chapter 12
Cone of Squares and
Symmetric Cones
Scribe: Anton Riabov
12/03/2001
12.1
Overview
In this lecture we continue to study the cone of squares, define inner
product with respect to Euclidean Jordan algebra, prove convexity,
self-duality, homogeneity, and symmetry of cone of squares. We also
define direct sums and simple algebras. We state without proof a
theorem that there are only 5 classes of simple Euclidean Jordan
algebras, and describe these classes. Finally we briefly outline application of Jordan algebra theory to solving optimization problems
over symmetric cones using interior point methods.
12.2
Cone of Squares
Suppose (E, ◦) is a Euclidean Jordan algebra. In the previous lecture
we gave the following definition:

Definition 50 K_E := {x^2 : x ∈ E} is the cone of squares of the
Euclidean Jordan algebra (E, ◦).

K_E is a cone, since for all α ≥ 0 we have α x^2 = (√α x)^2 ∈ K_E. In
this section we will show that cones of squares of Euclidean Jordan algebras
are convex, self-dual and homogeneous.
12.2.1
Inner Product
Definition 51 For any x, y ∈ E the inner product ⟨x, y⟩ with respect
to the Euclidean Jordan algebra (E, ◦) is defined as ⟨x, y⟩ := tr(x ◦ y).

Note that this inner product is bilinear and ⟨x, y⟩ = ⟨y, x⟩, so it
conforms to the definition of an inner product.
Fact 2 The inner product is associative: ⟨x ◦ y, z⟩ = ⟨x, y ◦ z⟩.

The proof of this statement is not straightforward. In fact, we do
not have the required machinery to prove it, and will accept it as a
fact. The following definition may be needed in future discussions.

Definition 52 τ(x, y) := Tr(L(x ◦ y)).

Note that the associativity here holds as well: τ(x ◦ y, z) = τ(x, y ◦ z).
Lemma 18 L(x) is a symmetric matrix with respect to ⟨·, ·⟩.

Proof: We need to show that ⟨L(x)y, z⟩ = ⟨y, L(x)z⟩, i.e., ⟨x ◦ y, z⟩ =
⟨y, x ◦ z⟩, which follows from the commutativity of ◦ and the
associativity of ⟨·, ·⟩.

Example 34 (SOCP algebra (Rn+1, ◦))

    ⟨x, y⟩ = tr(x ◦ y) = 2 x^T y,

so ⟨·, ·⟩ corresponds (up to the factor 2) to the usual inner product,
and L(x) is a symmetric matrix in the usual sense.
Example 35 (Symmetric matrices (Sn, ◦))

    ⟨X, Y⟩ = Tr(X ◦ Y) = Tr(XY) = X • Y.
12.2.2
Convexity and Self-Duality of KE
To prove that KE is convex we first note
Proposition 9 If K is a cone, then K∗ is a convex cone.
Proof: By definition, K* = {y : ⟨x, y⟩ ≥ 0, ∀x ∈ K}. For any
y_1, y_2 ∈ K* we have ⟨y_1, x⟩ ≥ 0 and ⟨y_2, x⟩ ≥ 0 for all x ∈ K.
Adding these two inequalities we obtain ⟨y_1 + y_2, x⟩ ≥ 0 for all
x ∈ K; closure under nonnegative scaling is immediate.
Corollary 5 Every self-dual cone is convex.
Now we only need to prove self-duality of KE , and convexity will
follow.
Lemma 19 K*_E = { y : L(y) ⪰ 0 }.

Proof:

    y ∈ K*_E ⟺ ⟨y, x^2⟩ ≥ 0   ∀x
             ⟺ ⟨y, x ◦ x⟩ ≥ 0  ∀x
             ⟺ ⟨y ◦ x, x⟩ ≥ 0  ∀x
             ⟺ ⟨L(y)x, x⟩ ≥ 0  ∀x,

which means that L(y) ⪰ 0.
Lemma 20 If c is an idempotent (i.e. c^2 = c), then the eigenvalues
of c are 0 or 1, and L(c) ⪰ 0.

Proof: Since c is an idempotent, we can write c^2 − c = 0, so the
minimum polynomial of c divides t^2 − t, whose roots are {0, 1}. Hence,
writing c = Σ_i λ_i c_i with respect to a Jordan frame {c_1, ..., c_r},
each λ_i is either 0 or 1.
In the previous lecture we derived the following equation:

    L(y^2 ◦ z) + 2L(y)L(z)L(y) − 2L(y ◦ z)L(y) − L(y^2)L(z) = 0.

Now we substitute y ← c, z ← c:

    L(c^3) + 2L^3(c) − 2L(c^2)L(c) − L(c^2)L(c) = 0
    ⇒ L(c)[ 2L^2(c) − 3L(c) + I ] = 0,

i.e. L(c) satisfies t(2t^2 − 3t + 1) = 0, which is equivalent to
t(2t − 1)(t − 1) = 0, and the eigenvalues of L(c) are in the set
{0, 1/2, 1}, i.e. all its eigenvalues are nonnegative; therefore L(c) ⪰ 0.
Corollary 6 If x = Σ_i λ_i c_i and λ_i ≥ 0 for all i, then L(x) ⪰ 0.

Proof: L(x) = Σ_i λ_i L(c_i), and we know that L(c_i) ⪰ 0.
Fact 3 Let x = λ_1 c_1 + ... + λ_r c_r. Then the eigenvalues of L(x)
are (λ_i + λ_j)/2, and the eigenvalues of Q_x are λ_i λ_j.

Proof: We accept this statement as a fact, since in the general case
we need to know more about Jordan algebras to be able to prove it.
However, we note that for the case of the matrix algebra we have seen
the proof in previous lectures. For the SOCP case the proof is also
easy. We refer the reader to the recent survey paper by F. Alizadeh
and D. Goldfarb for more details.
Theorem 17 K_E = K*_E.

Proof: First, we will show that K_E ⊆ K*_E. Choose an element
x ∈ K_E; then x = Σ_i λ_i^2 c_i. Then L(x) = Σ_i λ_i^2 L(c_i), with
L(c_i) ⪰ 0 and λ_i^2 ≥ 0 for all i. Therefore L(x) ⪰ 0, and thus
x ∈ K*_E.

Now we will show that K*_E ⊆ K_E. Choose y ∈ K*_E. Then
y = Σ_j λ_j c_j. Hence,

    ⟨y, c_i⟩ = Σ_j λ_j ⟨c_j, c_i⟩ = λ_i ⟨c_i, c_i⟩ = λ_i tr(c_i^2).

The second equality above follows from the fact that ⟨c_j, c_i⟩ = 0 if
i ≠ j. Now we can obtain an expression for λ_i:

    λ_i = ⟨y, c_i⟩ / tr(c_i^2).

We know that tr(c_i^2) > 0, so we only need to show that ⟨y, c_i⟩ ≥ 0.
This will imply that all eigenvalues λ_i of y are nonnegative and hence
y ∈ K_E. Indeed,

    ⟨y, c_i⟩ = ⟨y, c_i ◦ c_i⟩ = ⟨L(y)c_i, c_i⟩ ≥ 0;

the last inequality follows from L(y) ⪰ 0, and this completes the proof.
12.2.3
Homogeneous Cones
Definition 53 Cone K is homogeneous, if it is proper, and for all
x, y ∈ Int K there exists a linear transformation T , such that T (x) =
y and T (K) = K.
Example 36 (P_{n×n} = K_{(Sn,◦)})
Choose any X ≻ 0, Y ≻ 0. Using the eigenvalue decomposition, write
X = QΛQ^T, Y = PΩP^T. The transformation T is then defined as the
following sequence of linear transformations:

    X  --(Q^{-1} • Q^{-T})-->  Λ  --(Λ^{-1/2} • Λ^{-1/2})-->  I
       --(Ω^{1/2} • Ω^{1/2})-->  Ω  --(P • P^T)-->  Y,

where • marks the slot of the argument, multiplied from the left and
from the right.

Each of these steps maps P_{n×n} onto itself, because if A is
nonsingular then AXA^T ⪰ 0 if, and only if, X ⪰ 0. Thus, P_{n×n} is
homogeneous.
Example 37 (Second-Order Cones) The following set of 3 operations is sufficient to obtain the required transformation for any x
and y in the interior of the second-order cone, and therefore secondorder cones are homogeneous.
[Figure: points x and y on the same "circle" inside the second-order cone; axes x_0, x_1, . . . , x_n.]
If the points x and y happen to
be on the same “circle”, as in the
picture on the left, rotation transformation can be applied. It is
easy to see that this transform has
all the required properties.
[Figure: points x and y on the same ray from the origin inside the second-order cone.]
If the points x and y are on the same ray, we can apply a dilation,
multiplying by (y_0/x_0) I. All the required properties are satisfied.
[Figure: a hyperbolic rotation moving x to y along a hyperbola whose asymptotes are the cone boundary.]
An operation called hyperbolic
rotation can be constructed similarly to the usual rotation by replacing sin and cos by sinh and
cosh. This operation can be constructed to “rotate” points along a
hyperbola, which has cone boundaries as asymptotes.
Any point in the interior of the second order cone can be transformed
into another point in the interior by a combination of a dilation,
a rotation about the x_0 axis, and hyperbolic rotations, as follows.
Let x = λ_1 c_1 + λ_2 c_2 and y = ω_1 d_1 + ω_2 d_2, where c_1, c_2 is a
Jordan frame and likewise d_1, d_2 is a Jordan frame. To transform x
to y, we first rotate c_1 to d_1; this automatically maps c_2 to d_2,
because c_1 ⊥ c_2 and d_1 ⊥ d_2. So now we have x' = λ_1 d_1 + λ_2 d_2.
Next, in the plane spanned by d_1, d_2 the vector y has coordinates
ω_1, ω_2 with respect to the basis d_1, d_2, and x' has coordinates
λ_1, λ_2. Applying the dilation √( ω_1 ω_2 / (λ_1 λ_2) ) I maps x' to the
point x'' = λ''_1 d_1 + λ''_2 d_2, where λ''_1 = √( λ_1 ω_1 ω_2 / λ_2 ) and
λ''_2 = √( λ_2 ω_1 ω_2 / λ_1 ). Now both y and x'' are on the same branch
of the hyperbola a_1 a_2 = ω_1 ω_2; thus a hyperbolic rotation will map
x'' to y.
We are going to claim that cone of squares KE is homogeneous.
But before we are able to prove this, we need the following theorem.
Theorem 18 If x is invertible, then Qx (Int KE ) = Int KE .
Proof: First note that the set of invertible elements is a disconnected set. For example, in the case of second-order cones there
are 3 regions of invertible elements, separated by the borders of the
cone, as it is illustrated in the figure.
[Figure: the three connected regions of invertible elements for the second-order cone, labelled 1, 2, 3.]
In the algebra of symmetric matrices all quarters of the eigenvalue space form a region of invertible elements. Intuitively this is
explained by the fact that if for two symmetric matrices eigenvalues
have different signs, then there exists a linear combination of these
matrices having an eigenvalue 0.
One of these connected regions is Int KE . If y ∈ Int KE , then
Qx y is also invertible, since
(Qx y)−1 = Qx−1 y−1
Therefore Qx (Int KE ) can not cross any boundary lines, and is
either (a) contained in Int KE entirely, i.e. Qx (Int KE ) ⊆ Int KE ,
or (b) does not have any common points with it, and Qx (Int KE ) ∩
Int KE = ∅.
We know that Qx e = x2 ∈ KE . Thus (a) is true, and
Qx (Int KE ) ⊆ Int KE
for all invertible x.
If y ∈ KE , then y−1 ∈ Int KE . Hence, y = Qy y−1 ∈ Qy Int KE ,
and the inverse inclusion holds:
Int KE ⊆ Qx (Int KE ).
Corollary 7 K_E is a homogeneous cone.

Proof: Suppose we are given y^2, x^2 ∈ Int K_E. The following linear
transformation can be used to prove that K_E is homogeneous:

    x^2  --(Q_{x^{-1}})-->  e  --(Q_y)-->  y^2.

Each of the steps transforms Int K_E into itself, by Theorem 18.
12.2.4
Symmetric Cones, Direct Sums and Simple Algebras
Definition 54 A cone is symmetric, if it is proper, self-dual, and
homogeneous.
Clearly, the cone of squares KE of any Euclidean Jordan algebra
(E, ◦) is symmetric. It turns out that the converse is also true.
Fact 4 If K is a symmetric cone, then it is the cone of squares of
some Euclidean Jordan algebra.
In fact, there are not many significantly different classes of symmetric cones. But before we can define these classes, we need to
introduce direct sums.
Definition 55 Let (E_1, ∗) and (E_2, ⋄) be Euclidean Jordan algebras.
Then the direct sum of these algebras is (E_1, ∗) ⊕ (E_2, ⋄) := (E_1 × E_2, ◦),
where for all x_1, x_2 ∈ E_1 and y_1, y_2 ∈ E_2,

    (x_1; y_1) ◦ (x_2; y_2) := (x_1 ∗ x_2; y_1 ⋄ y_2).
Proposition 10 Let (E_1 × E_2, ◦) be a direct sum of Euclidean Jordan
algebras and let x ∈ E_1 and y ∈ E_2. The following properties hold:

1. L((x; y)) = L(x) ⊕ L(y), the block-diagonal matrix with blocks L(x) and L(y);

2. Q_{(x;y)} = Q_x ⊕ Q_y, the block-diagonal matrix with blocks Q_x and Q_y;

3. p_{E_1 ⊕ E_2}((x; y)) = p_{E_1}(x) p_{E_2}(y), where p(·) is the
   corresponding characteristic polynomial;

4. tr_{E_1 ⊕ E_2}((x; y)) = tr_{E_1}(x) + tr_{E_2}(y);

5. det((x; y)) = det(x) det(y);

6. ||(x; y)||^2_{F/E_1 ⊕ E_2} = ||x||^2_{F/E_1} + ||y||^2_{F/E_2};

7. ||(x; y)||_{2/E_1 ⊕ E_2} = max( ||x||_{2/E_1}, ||y||_{2/E_2} );

8. K_{E_1 ⊕ E_2} = K_{E_1} × K_{E_2};

9. rk(E_1 ⊕ E_2) = rk(E_1) + rk(E_2).
Example 38 (Direct sums in SOCPs)

    min   c_1^T x_1 + ... + c_r^T x_r
    s.t.  A_1 x_1 + ... + A_r x_r = b
          x_i ⪰_Q 0,   1 ≤ i ≤ r.

The cone constraints in this SOCP restrict x to a direct sum of
quadratic cones:

    x ∈ Q_1 × Q_2 × ... × Q_r.
Example 39 (Direct sums in LPs) The usual boring algebra of
real numbers (R, ·) is a Euclidean Jordan algebra, where "·" stands
for multiplication of numbers. Then the algebra underlying linear
programs is a direct sum of such algebras:

    (R^n, ∗) = (R, ·) ⊕ (R, ·) ⊕ ... ⊕ (R, ·).

The multiplication operator "∗" is defined componentwise:

    (x_1, x_2, . . . , x_n)^T ∗ (y_1, y_2, . . . , y_n)^T := (x_1 y_1, x_2 y_2, . . . , x_n y_n)^T.

Note that

    L(x)y = Diag(x_1, x_2, . . . , x_n) y = x ∗ y.
Since direct sums of Euclidean Jordan algebras are Euclidean Jordan algebras, the theory that we have developed covers any combinations of these algebras. LP variables can be combined with SOCP
variables, and with SDP variables, and so on. It would be interesting to find out, what the “minimal” algebras with respect to the
direct sum are. In a sense we want to find the “basis” of all possible
Euclidean Jordan algebras. The following definition and a theorem
(given without proof) answer these questions.
Definition 56 An Euclidean Jordan algebra is simple, if it is not
isomorphic to a direct sum of other Euclidean Jordan algebras.
Theorem 19 There exist only 5 different classes of simple Euclidean
Jordan algebras.
In the remaining part of this subsection we will briefly describe
these classes.
1°. SOCP Algebra (Rn+1, ◦).
This is the familiar algebra associated
with SOCP where B = I. (If B is any symmetric positive definite
matrix, the corresponding Jordan algebra will be Euclidean, and in
fact isomorphic to the case where B = I.)
2°. Symmetric Matrices (Sn×n, ◦).
Again, this is the familiar algebra
of symmetric matrices, that we have discussed in previous lectures.
3°. Complex Hermitian Matrices (Hn×n, ◦). A matrix X of complex
numbers is Hermitian (X ∈ Hn×n) if X = X*. The operation (·)* denotes
the conjugate transpose, defined as follows: if (X)_{lk} = a_{lk} + i b_{lk},
then (X*)_{lk} = a_{kl} − i b_{kl}.
For a matrix of complex numbers of size n × n, one can provide
a matrix of real numbers of size 2n × 2n, for which the algebra
operations carry through in exactly the same way. To achieve this,
each element is replaced by a 2 × 2 matrix:

    a + ib  →  [  a  b ]
               [ −b  a ].

Consider an example:

    [   a      c + di ]      [  a   0   c   d ]
    [ c − di     b    ]  →   [  0   a  −d   c ]                  (12.1)
                             [  c  −d   b   0 ]
                             [  d   c   0   b ]
Therefore, it is easy to see that (Hn×n, ◦) is a subalgebra of
S2n×2n. Even though Hn×n is a subalgebra of S2n×2n, its rank is
only n. Let u be a unit-length complex vector. Then, transforming
it to a real matrix by (12.1), we map u to a 2n × 2 matrix, and uu*
to a rank-2 real matrix. This rank-2 real matrix is not a primitive
idempotent within S2n×2n, but it is primitive in Hn×n.
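A small NumPy sketch of the embedding (12.1) (added illustration; realify is my own helper name). It checks that the map preserves matrix products and sums, and that it sends a Hermitian matrix to a real symmetric matrix whose eigenvalues are those of the original, each doubled:

    import numpy as np

    def realify(X):
        """Replace each complex entry a+ib by the 2x2 block [[a, b], [-b, a]]."""
        n = X.shape[0]
        Y = np.zeros((2 * n, 2 * n))
        for k in range(n):
            for l in range(n):
                a, b = X[k, l].real, X[k, l].imag
                Y[2*k:2*k+2, 2*l:2*l+2] = [[a, b], [-b, a]]
        return Y

    rng = np.random.default_rng(9)
    n = 3
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    X = (A + A.conj().T) / 2            # a random Hermitian matrix

    assert np.allclose(realify(A @ B), realify(A) @ realify(B))   # preserves products
    assert np.allclose(realify(A + B), realify(A) + realify(B))   # preserves sums
    assert np.allclose(realify(X), realify(X).T)                  # Hermitian -> symmetric

    assert np.allclose(np.sort(np.linalg.eigvalsh(realify(X))),
                       np.sort(np.repeat(np.linalg.eigvalsh(X), 2)))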
4°. Hermitian Quaternion Matrices.
Quaternions are an extension
of complex numbers. Each quaternion number is a sum a + bi +
cj + dk, where a, b, c, d ∈ R and i, j, k are such that:
    i^2 = j^2 = k^2 = −1,
    ij = k = −ji,   jk = i = −kj,   ki = j = −ik.
Analogous to how a complex number can be expressed as a pair
of real numbers, a quaternion can be expressed as a pair of complex
numbers: a + bi + cj + dk = (a + bi) + (c + di)j.
Conjugate transpose X∗ of a quaternion matrix X is defined as
following. If (X)pq = a+bi+cj+dk, then (X∗ )qp = a−bi−cj−dk.
Hermitian quaternion matrices satisfy X = X∗ .
In this algebra multiplication is defined as

    X ◦ Y := (XY + YX)/2.
5°. Hermitian Matrices of Octonions of Size 3 × 3.
Octonions are an
extension of quaternions in the same way, as the quaternions are
an extension of complex numbers. Introduce a number l, such that
l2 = −1. Then an octonion can be written as p1 + lp2 , where p1
and p2 are quaternions.
By definition,
def
(p1 + lp2 )(q1 + lq2 ) = (p1 q1 − q̄2 p2 ) + (q2 p1 + p2 q̄1 )l.
The main difference between octonions and quaternions is that
multiplication in octonions is not associative. Thus, if we build
matrices out of octonions, the matrix multiplication will not be associative either. However, by an amazing coincidence, over the set
of 3 × 3 octonion matrices that are Hermitian, the multiplication
def
X ◦ Y = (XY + YX)/2 is a Euclidean Jordan algebra. It is however a
Jordan algebra which is not induced by an associative algebra, and
in fact it can be shown that it is isomorphic to no Jordan algebra
that is a subalgebra of a Jordan algebra induced by an associative
algebra. Therefore, this algebra is often called the exceptional Jordan algebra or the Albert algebra, named after Adrian Albert who
discovered it. It can be shown that this algebra has rank 3. The
underlying vector space of this algebra is a 27-dimensional vector
space (there are 3 real numbers on the diagonal, and three octonions on the off-diagonal; since octonions are 8-dimensional the set
of such matrices yields a 27-dimensional algebra),
12.3
Symmetric Cone LP
We will give a brief sketch of how this theory is applied to describe
interior point methods. Suppose we are given a program:

    min   ⟨c, x⟩
    s.t.  ⟨a_i, x⟩ = b_i
          x ⪰_{K_E} 0.

Its dual is:

    max   b^T y
    s.t.  Σ_i y_i a_i + z = c
          z ⪰_{K_E} 0.
Complementary slackness conditions:

    x ⪰_{K_E} 0,   z ⪰_{K_E} 0,   ⟨x, z⟩ = 0   ⇒   x ◦ z = 0.
As we discussed earlier, (− ln det x) is an appropriate barrier for
the primal:

    min   ⟨c, x⟩ − µ ln det x
    s.t.  Ax = b.

The Lagrangian is L(x, y) = ⟨c, x⟩ − µ ln det x + y^T(b − Ax). It can be
shown that the gradient ∇_x ln det(x) = x^{-T}. Thus, the optimality
conditions imply

    ∇_x L = c^T − µ x^{-T} − y^T A = 0,
    ∇_y L = b − Ax = 0.

Define z := c − A^T y; the first condition then says z = µ x^{-1}. So we
have to solve

    Ax = b,
    A^T y + z = c.
The following equalities are equivalent:
z − µx−1 = 0
x − µz−1 = 0
x ◦ z = µe
If we replace x ← x + ∆x, y ← y + ∆y, z ← z + ∆z as before,

   
A 0 0
∆x
rp
 0 AT I  ∆y = rd 
∆z
rc
E 0 F
And now we only need to define what are E and F:
z − µx−1 = 0
x − µz−1 = 0
x◦z=µ
→ E = −µQx−1 , F = I
→ E = I, F = −µQz−1
→ E = L(z), F = L(x)e
These relations unify LP, SOCP, and SDP formulations of interior point methods. In fact, We can express by Jordan algebraic
notation, any optimization problems with any combination of nonnegativity, second order, or semidefinite constraints. Analysis of
interior point algorithms is also streamlined in Jordan algebraic formulation.
138
CHAPTER 12. SYMMETRIC CONES
Appendix A
Modeling with Semidefinite
Programming
Eyjolfur Asgeirsson
12/10/2001
A.1
Overview
In this paper I will look at a few problems that can be cast as
a semidefinite or a second-order cone problem. These problems
are quadratically constrained quadratic programming, logarithmic
Chebyschev approximation, pattern separation by ellipsoids, geometrical problems involving quadratic forms and combinatorial and
non-convex optimization.
A.2
A.2.1
Applications of SDP/SOCP
Quadratically Constrained Quadratic Programming
Consider the general convex quadradically constrained quadratic
program (QCQP):
min xT P0 x + 2qT0 x + r0
s.t. xT Pi x + 2qTi x + ri ≤ 0, i = 1, . . . , p
(A.1)
where P0 , P1 , . . . , Pp ∈ Rn×n are symmetric and positive semidefinite
matrices, i.e. Pi 0, Pi = PiT . Assume for simplicity that the
APPENDIX A. MODELING WITH SEMIDEFINITE
PROGRAMMING
140
matrices Pi are positive definite, i.e. Pi 0, although the more
general problem of semidefinite matrices can also be reduced to an
SOCP. We can write (A.1) as
1/2
−1/2
q0 k2 + r0 − qT0 P0−1 q0
1/2
−1/2
qi k2 + ri − qTi Pi−1 qi ≤ 0
min kP0 x + P0
s.t.
kPi x + Pi
which can be solved as a SOCP with p + 1 constraints of dimension
n+1
min t
1/2
−1/2
s.t. kP0 x + P0
1/2
kPi x
+
q0 k ≤ t,
−1/2
Pi q0 k
≤
(qTi Pi−1 qi
(A.2)
1/2
− ri )
, i = 1, . . . , p
Problems (A.1) and (A.2) will have the same optimal solution but
the optimal values will differ. The optimal value of (A.1) is equal
to (p∗ )2 + r0 − qT0 P0−1 q0 , where p∗ is the optimal value of (A.2).
As a special case we can look at the convex quadratic programming problem (QP)
min xT P0 x + 2qT0 x + r0
s.t. aTi x ≤ bi , i = 1, . . . , p
where P0 0 and solve this as a SOCP by adding a variable t and
using a single constraint of dimension n + 1 and p constraints of
dimension one:
min t
1/2
−1/2
s.t. kP0 x + P0 q0 k ≤ t
aTi x ≤ bi , i = 1, . . . , p
A.2.2
Logarithmic Chebyschev Approximation
Suppose we would like to solve Ax = b approximately, where A =
[a1 , . . . , ap ]T ∈ Rp×k and b ∈ RP , by solving the problem
min max | log(aTi x) − log(bi )|
1≤i≤r
(A.3)
A.2. APPLICATIONS OF SDP/SOCP
141
where bi > 0, i = 1, . . . , r and log(aTi x) is defined as −∞ when
aTi x ≤ 0. This can reduced to a SDP or a SOCP by using the
observation that if aTi > 0 then
| log(aTi x) − log(bi )| = log max(
aTi x bi
)
,
bi aTi x
Then problem (A.3) is equivalent to
min t
s.t. 1 ≤ (aTi x/bi )t, i = 1, . . . , r
aTi x/bi ≤ t, i = 1, . . . , r
t≥0
which is a second-order cone program. This can also be written as:
min 
t

t − aTi x/bi
0
0
0
aTi x/bi 1 0, i = 1, . . . , p
s.t. 
0
1
t
A.2.3
Pattern separation by ellipsoids
The simplest classifiers in pattern
recognition
hyperplanes
to
1
use
K
1
L
separate two sets of points x , . . . , x
and y , . . . , y in Rp .
The hyperplane defined by aT x + b = 0 separates these two sets if
aT xi + b ≤ 0 i = 1, . . . , K
aT yj + b ≥ 0 j = 1, . . . , L
This is a set of linear inequalities in a ∈ Rp and b ∈ R which
can be solved using LP. If these two sets cannot be separated by a
hyperplane, we can try to separate them by a quadratic surface. To
do that we need to find a quadratic function f(x) = xT Ax + bT + c
such that
(xi )T Axi + bT xi + c ≤ 0 i = 1, . . . , K
(yj )T Ayj + bT yj + c ≥ 0 j = 1, . . . , L
(A.4)
This is also a set of linear inequalities in the variables A = AT ∈
Rp×p , b ∈ Rp and c ∈ R so this can also be solved using LP. We can
142
APPENDIX A. MODELING WITH SEMIDEFINITE
PROGRAMMING
put further restrictions on the quadratic surface separating the two
sets. As an example we might try to find an ellipsoid that contains
all the points xi and none of the yj . This constraint imposes the
condition A 0 in addition to the constraints in (A.4) which means
that our problem can be solved as a semidefinite feasibility problem.
The next step is to optimize the shape and the size of the ellipsoid by adding an objective function and other constraints. As an
example we can search for the ”most spherical” ellipsoid. The ratio
of the largest to the smallest semi-axis length is the square root of
the condition number of A. In order to make the ellipsoid as spherical as possible we add an additional variable γ and the constraint
I A γI and then solve the SDP problem
min γ
s.t. (xi )T Axi + bT xi + c ≤ 0 i = 1, . . . , K
(yj )T Ayj + bT yj + c ≥ 0 j = 1, . . . , L
I A γI
This semidefinite program is feasible if and only if there is an ellipsoid that contains all the xi and none of the yi . The optimum value
is one, this occurs only if there is a sphere that separates these two
sets of points.
A.2.4
Geometrical problems involving quadratic forms
Many geometrical problems involving quadratic functions can be expressed as semidefinite programs. Suppose we are given k ellipsoids
ξ1 , . . . , ξk described as the sublevel sets of the quadratic functions
fi (x) = xT Ai x + 2bTi x + ci , i = 1, . . . , k
i.e. ξi = {x | fi (x) ≤ 0)}. The goal is to find the smallest sphere
that contains all k of these ellipsoids (or equivalently, contains the
convex hull of their union).
Suppose that the ellipsoids ξ = {x | f(x) ≤ 0} and ξ̃ = x | f̃(x) ≤ 0 ,
with
f(x) = xT Ax + 2bT x + c, f̃(x) = xT Ãx + 2b̃T x + c̃,
A.2. APPLICATIONS OF SDP/SOCP
143
have nonempty interior. Then it can be shown that ξ contains ξ̃ if
and only if there is a τ ≥ 0 such that
à b̃T
A bT
τ
b c
b̃ c̃
If we consider the sphere S represented by f(x) = xT x − 2xTc x +
γ ≤ 0. S contains the ellipsoids ξ1 , . . . , ξk if and only if there are
nonnegative τ1 , . . . , τk such that
I −xTc
Ai bTi
τi
, i = 1, . . . , k.
(A.5)
−xc γ
bi ci
Our
p goal is to minimize the radius of the sphere2 S, which is r =
xTc xc − γ. To do this we express the condition r ≤ t as
I
xTc
0
xc t + γ
and minimize the variable t.
Hence we can find the smallest sphere containing the ellipsoids
ξ1 , . . . , ξk by solving the semidefinite program
min t
I −xTc
Ai bTi
s.t.
τi
, i = 1, . . . , k.
−xc γ
bi ci
τi ≥ 0, i = 1, . . . , k
I
xTc
0
xc t + γ
The variables are xc , τ1 , . . . , τk , γ and t.
This semidefinite program can be rewritten as a second-order
cone program. Define the sphere S as S = {x ∈ Rn | kx − xc k ≤ ρ}
and rewrite (A.5) as
τi Ai − I
τi bTi + xTc
Mi =
0, i = 1, . . . , k
τi bi + xc τi ci + ρ2 − xTc xc
The matrices Mi are positive semidefinite if and only if τi ≥ λmin1(Ai )
where Ai = Qi Λi QTi is the spectral decompostition of Ai and Λi =
diag(λi1 , . . . , λin ), and the Schur complement
τi ci + ρ2 − xTc x − (τi bi + xc )T (τi Ai − I)−1 (τi bi + xc ) ≥ 0 (A.6)
APPENDIX A. MODELING WITH SEMIDEFINITE
PROGRAMMING
144
for i = 1, . . . , k. If we define t = ρ2 , vi = QTi (τi bi + xc ) and
sij =
v2
ij
,
τi λij −1
j = 1, . . . , n, then (A.6) is equivalent to
t ≥ xTc xc − τi ci + vTi (τi Λi − I)−1 vi = xTc xc − τi ci + 1T si
where si = (si1 , . . . , sin ). Since we are minimizing t, we can relax
the definition of sij to sij ≥ v2ij (τi λij − 1), j = 1, . . . , n. Combining
all of the above yields the following formulation involving only linear
and restricted hyperbolic constraints:
min t
s.t. vi ≤ QTi (τi bi + xc ), i = 1, . . . , k
v2ij ≤ sij (τi λij − 1), i = 1, . . . , k and j = 1, . . . , n
xTc xc ≤ t + τi ci − 1T si , i = 1, . . . , k
1
, i = 1, . . . , k
τi ≥
λmin (Ai )
Then we can transform this problem into an SOCP with kn 3dimensional and k (n+2)-dimensional second-order cone inequalities.
A.2.5
Combinatorial and non-convex optimization
Semidefinite programs play a very useful role in non-convex or combinatorial optimization. Consider the quadratic optimization problem
min f0 (x)
s.t. fi (x) ≤ 0, i = 1, . . . , L
(A.7)
where fi (x) = xT Ai x + 2bTi x + ci , i = 1, . . . , L. The matrices Ai
can be indefinite and therefore problem (A.7) is a very hard, nonconvex optimization problem. For example, it includes all optimization problems with polynomial objective function and polynomial
constraints.
It is important to have good and cheaply computatable lower
bound on the optimal value of (A.7), e.g. for branch-and-bound algorithms. We can get such a lower bound by solving the semidefinite
A.2. APPLICATIONS OF SDP/SOCP
145
program
max t
A0
bT0
A1 bT1
AL bTL
s.t.
+ τ1
+ . . . + τL
0
b0 c0 − t
b1 c1
bL cL
τi ≥ 0, i = 1, . . . , L
Suppose x satisfies the constraints in the nonconvex problem (A.7),
i.e.
T x
x
Ai bTi
fi (x) =
≤0
1
1
bi ci
for i = 1, . . . , L and t, τ1 , . . . , τL satisfy the constraints in the
semidefinite program (A.7). Then
x
x
A0
bT0
A1 bT1
AL bTL
+ τ1
+ ... + τ
0≤
1
1
b0 c0 − t
b1 c1
bL cL
= f0 (x) − t + τ1 f1 (x) + . . . + τL fL (x)
≤ f0 (x) − t.
(A.8)
Therefore t ≤ f0 (x) for every feasible x in (A.7).
Most semidefinite relaxations of NP-hard combinatorial problems
seem to be related to the semidefinite problem in (A.7), or the related semidefinite problem
min Tr XA0 + 2bT0 x + c0
s.t. Tr XAi + 2bTi x + ci ≤ 0, i = 1, . . . , L
X x
0
xT 1
(A.9)
where the variables are X = XT ∈ Rk×k and x ∈ Rk . Problem (A.9)
is a dual of Shor’s relaxation of (A.8) so these two problems give
the same bound.
Note that the constraint
X x
0
xT 1
is equivalent to X xxT . The semidefinite program (A.9) can therefore be directly interpreted as a relaxation of the original problem
146
APPENDIX A. MODELING WITH SEMIDEFINITE
PROGRAMMING
(A.7), which can be written as
min Tr XA0 + 2bT0 x + c0
s.t. Tr XAi + 2bT0 x + ci ≤ 0, i = 1, . . . , L
X = xx
(A.10)
T
The only difference between (A.10) and (A.9) is the replacement
of the (nonconvex) constraint X = xxT with the convex relaxation
X xxT . If we add the (nonconvex) constraint that the matrix X
is of rank one to the relaxation (A.9) then it becomes the problem
in (A.10).
As an example, consider the (-1,1)-quadradic program
min xT Ax + 2bT x
s.t. xi ∈ {−1, 1}, i = 1, . . . , k
(A.11)
which is NP-hard. The constraint xi ∈ {−1, 1} can be written as
the quadratic equality constraint x2i = 1, or equivalently, as two
quadradic inequalities x2i ≤ 1 and x2i ≥ 1. Using (A.9) we find that
the semidefinite program in X = XT and x
min Tr XA + 2bT x
s.t. Xii = 1, i = 1, . . . , k
X x
0
xT 1
(A.12)
yields a lower bound for (A.11). A special case of (A.11) is the
Max-Cut problem where b = 0 and the diagonal of A is zero.
A.3
Conclusions
A wide variety of problems, from different fields ranging from engineering, finance, computer science and others, can be represented
and solved efficiently using SDP and/or SOCP. This paper introduces a few of those but many more exist. Examples of problems
that are not mentioned in this paper but still have very interesting
SDP/SOCP representations are maximum eigenvalue and matrix
norm minimization, structural optimization, robust least-squares,
robust linear programming, control and system theory, problems
A.3. CONCLUSIONS
147
with hyperbolic constraints, matrix-fractional problems, antenna array weight design, grasping force optimization and equilibrium of
system with piecewise-linear springs.
148
APPENDIX A. MODELING WITH SEMIDEFINITE
PROGRAMMING
Appendix B
Robust Portfolio selection
problems and their SOCP
representations
B.1
Outline
This presentation will be based on the following papers:
F. Alizadeh and D. Goldfarb, ”Second-Order Cone Programming”,
D Goldfarb and G. Iyengar,”Robust Portfolio Selection Problems”.
Outline of the presentation
-What does robust mean?
-Robust formulations and their SOCP represenations.
-(a)Robust Least Squares
-(b)Robust Linear Programming
-Why does Robust formulation necessarry in Finance?
-Robust factor model for asset returns.
-Robust portfolio selection problems and their SOCP representations.
150
APPENDIX B. PORTFOLIO OPTIMIZATION AND SDP
B.2
What is robust optimization?
In robust optimization, uncertainty on problem data is treated as
deterministic, unknown-but-bounded, i.e. it is only known to belong
to a given uncertainty set.
A robust solution is one that tolerates changes in the problem data,
up to a given bound known a priori.
B.3
Robust Least Squares(RLS) Problem)
Given an over-determined set of equations;
Ax ≈ b, A ∈ Rm×n ,
where [A, b] is subject to unknown but bounded errors
k[4A, 4b]kF ≤ ρ,
where kBkF denotes the Frobenius norm of the matrix B.
Then RLS problem is defines as
max{k(A + 4A)x − (b + 4b)k,
RLS= minx
.
k[4A, 4b]kF ≤ ρ}
How do we reformulate this RLS as SOCP?
For a given x define,
r(A, b, x) =
max{k(A + 4A)x − (b + 4b)k,
k[4A, 4b]kF ≤ ρ}
Therefore,
RLS = minx r(A, b, x)
Now, by triangle inequality for a given x,
k(A + 4A)x − (b + 4b)k
x ≤ kAx − bk + (4A, −4b)
1 B.3. ROBUST LEAST SQUARES(RLS) PROBLEM)
151
Then by properties of Frobenius norm and the bound,
x
≤ k[4A, 4b]kF x (4A, −4b)
1 1 x ≤ ρ
1 but for the choice (4A, −4b) = uvT , where



 ρ Ax−b , if Ax − b 6= 0
kAx−bk
u=
andv =

any
vector ∈ Rm of norm ρ


x
1
x 1 which satisfies k[4A, −4b]kF = kuvT kF ≤ ρ, and
(4A, −4b) x = kρk × x = ρ x 1
1
1 Hence,
r(A, b, x) =
Therefore
max{k(A + 4A)x − (b + 4b)k,
k[4A, 4b]kF ≤ ρ}
x = kAx − bk + ρ 1 x RLS=minx {kAx − bk + ρ 1 }
RLS problem is equivalent to minimization of sum of norms problem.
So the SOCP representation of RLS problem is,
min λ + ρτ
s.t. kAx
bk ≤ λ
−
x 1 ≤τ
152
APPENDIX B. PORTFOLIO OPTIMIZATION AND SDP
B.4
Robust Linear Programming
Consider the following LP;
min cT x
s.t. Ax ≤ b
We can write this LP as;
^
min c^T x
^x ≤ 0
s.t. A^
^ = −1
δT x
^ = [A, b], δT = (0, 1).
^T = (xT , ξ), A
where c^T = (cT , 0), x
Uncertainity sets:
^ as
^ i , it h row of A,
Let εi be the uncertainity set of a
^ i ∈ Rn+1 |a
^ i = āi + Bi u, kuk2 ≤ 1}
ε i = {a
^
where Bi ∈ R(n+1)×(n+1) and Bi 0. For fixed x
^ Ti x
^ ≤ 0, ∀a
^ i ∈ εi ⇔ max{(āi + Bi u)T x
^| kuk2 ≤ 1} ≤ 0.
a
However, if we choose
Bi x^
^ 6= 0
if Bi x
^ k2
kBi x
u=
^=0
any u with kuk2 = 1 ifBi x
Then,
^| kuk2 ≤ 1} = āTi x
^ + kBi x
^ k2
max{(āi + Bi u)T x
Hence, for the ellipsoidal uncertainity set εi , the robust counterpart
of the LP becomes;
^
min c^T x
T
^ x
^ + ti ≤ 0 i=1,. . . ,m
s.t. a
^k2 ≤ ti i=1,. . . ,m
kBi x
T
^ = −1.
δ x
B.5
Why Robustness is important in Finance?
Markowitz mean-variance portfolio theory
-Return is measured as expected value of the random portfolio return
B.6. MARKET MODEL
153
-Risk is quantified as variance of the portfolio return.
Optimal portfolio is obtained by solving a quadratic optimization
problem.
Why practitioners have shied away from this model?
-”Although Markowitz efficiency is a convenient and useful theoretical framework for portfolio optimality, in practice it is an error prone procedure that often results in error − maximized and
investment − irrelivant portfolios.”
Where does this error-prone behavior come from?
• Market parameters are estimated from noisy data and subject
to statistical errors.
• Solutions of the optimization problems are often very sensitive
to perturbations in the parameters of the problem.
B.6
Market model
The single period return r is assumed to a random variable given
by
r = m + VT f + e
where m=vector of mean returns, f ∼ N(0, F)= vector of returns of
the factors that drive the market, V= the matrix of factor loadings
of the n assets, e ∼ N(0, D) residual return.
Assumptions:
• e is independent of f
• F0
• D = diag(d) 0 with di > 0.
Thus market return is r ∼ N(m, V T FV + D)
B.7
Uncertainty sets:
Let Sd , Sv and Sm , the uncertainty sets for D, V and m, given by:
Sm = {m : m = m0 + ξ, |ξi | ≤ γi , i = 1, . . . , n}
154
APPENDIX B. PORTFOLIO OPTIMIZATION AND SDP
Sd = {D : D = diag(d), di ∈ [di , di ]}
Sv = {V : V = V0 + W, kWi kg ≤ ρi , i = 1, . . . , n}
where Wi is the it h column of W and kAkg is the elliptic norm with
respect to matrix G 0. Cumulative return of portfolio φ
rφ = φT r = φT m + φT V T f + φT e
∼ N(φT m, φT V T FVφ + φT Dφ)
B.8
Robust portfolio selection problems and SOCP
representaions
B.8.1
Minimum variance portfolio selection
Since the return rφ ∼ N(φT m, φT (V T FV +D)φ), the robust min-var
portfolio selection problem is given by
min max{V∈Sv } {φT V T FVφ} + max {D ∈ Sd }{φT Dφ)}
s.t. min{m∈Sm } φT m ≥ α,
1T φ = 1.
Since di ≤ di ≤ di ⇒ φT Dφ ≤ φT Dφ, whereD = diag(d) and.
√
Also, since F 0 the function kxkf : x −→ xT Fx defines a norm
on Rm . Thus above problem is equivalent to the following robust
augmented least squares problem.
min max{V∈Sv } kVφk2f + φT Dφ
s.t. min{m∈Sm } φT m ≥ α,
1T φ = 1.
By introducing auxiliary variables ν and δ,
min ν + δ
s.t. max{V∈Sv } kVφk2f ≤ ν,
φT Dφ ≤ δ,
min{m∈Sm } φT m ≥ α,
1T φ = 1.
B.8. ROBUST PORTFOLIO SELECTION PROBLEMS AND
SOCP REPRESENTAIONS
155
If the uncertainty sets are finite, i.e. Sv = {V1 , . . . , Vs } and Sm =
{m1 , . . . , mr }, then
min λ + δ
s.t. kVk φk2f
φT Dφ
min{m∈Sm } φT mk
1T φ
≤λ
≤ δ,
≥α
= 1.
for all k=1,...,s
for all k=1,...,r
If Sv given by,
Sv = {V : V = V0 + W, kWk =
p
Tr(wT W) ≤ ρ}
El Ghaoui and Lebter showed that it can be formulated as SOCP.
However the problem is if V and m is given as;
Sm = {m : m = m0 + ξ, |ξi | ≤ γi , i = 1, . . . , n}
Sv = {V : V = V0 + W, kWi kg ≤ ρi , i = 1, . . . , n}
where Wi is the it h column of W and kAkg is the elliptic norm with
respect to matrix G 0.
If V and m is given as above than the worst case mean return of a
fixed portfolio φ is given by,
min µT = µT0 φ − γT |φ|
{µ∈Sm }
and the worst case variance is given by,
P1 max k(V0 + W)φk2F
s.t. kWi kg ≤ ρi i=1,...,n
Since the constraints kWi kg ≤ ρi i=1,...,n imply the bound,
n
n
n
X
X
X
kWi kg = φi Wi ≤
|φi |kWi kg ≤
ρi |φ|
i=1
g
i=1
then the optimization problem
P2 max kV0 φ + wk2f
s.t. kwkg ≤ r
i=1
156
APPENDIX B. PORTFOLIO OPTIMIZATION AND SDP
where r = ρT |φ| is a relaxation of P1,i.e. optimal value of P2 is at
least as large that of P1. The objective function of P2 is convex so
kw∗ kg = r. For i = 1, ...n, define
|φi | ρ ∗
i
w,
φi 6= 0
∗
φi r
Wi =
ρi
∗
w , otherwise
r
P
Then kWi∗ kg = ρi i.e. W ∗ is feasible for P1 and W ∗ φ = ni=1 φi Wi∗ =
w∗ . Therefore optimal value of P1 and P2 are equal.Thus for fixed
portfolio φ, the worst-case variance is less than ν if, and only if,
max
{y:kykg ≤r}
ky0 + yk2f ≤ ν
where y0 = V0 φ and r = ρ|φ|. The following lemma reformulates
this constraint as an SOC constraint.
Lemma 21 Let r > 0, y0 , y ∈ Rn and F, G 0. Then the constraint
max
{y:kykg ≤r}
ky0 + yk2f ≤ ν
is equal to the either of the following:
(i) there exists τ, σ ∈ R and t ∈ Rm satifying
ν
σ
2r
σ−τ 2wi
(1 − σλi − ti ) τ
1
1
≥ τ + 1T t
≤ λmax1 (H) ,
≤ σ + τ,
≤ (1 − σλi + ti ) i=1,...,m,
≥ 0
. where H = G− 2 FG− 2 , H = QLambdaQT is the spectral decompo1
1
sition of H, Λ = diag(λi ) and w = QT H 2 G 2 y0 .
(ii) there exists τ and s ∈ Rm satisfying,
2r
T
T
(τ − ν + 1 s) ≤ (τ + ν − 1 s)
2ui
(1 − τθi − si ) ≤ (1 − τθi + si ) i=1,...,m
1
τ ≤ λmax
,
(K)
B.8. ROBUST PORTFOLIO SELECTION PROBLEMS AND
SOCP REPRESENTAIONS
157
1
1
where K = F 2 G−1 F 2 , K = PΘPT spectral decomposition of K, Θ =
1
diag(θi ) and u = PT F 2 y0 .
By using the above results we have
minimize
subject to
w
1T φ
µT0 φ − γT |φ|
σ
T
2ρ |φ| σ−τ 2wi
(1 −σλi − ti ) 2DT φ 1−σ τ
τ + 1T tδ
1
1
= QT H 2 G 2 V0 φ
= 1
≥ α
1
,
≤ λmax
H
≤ σ + τ,
≤ (1 − σλi + ti )i = 1, ...m
≤ 1+σ
≥ 0,
1
1
where H = QΛQT is the spectral decomposition of H = G− 2 FG− 2 ,
Λ = diag(λi )
B.8.2
Robust maximum return problem
Robust maximum return problem is given by,
maximize min{m∈Sm } φT m
s.t.
max{V∈Sv } φT V T FVφ + φT Dφ ≤ λ,
1T φ = 1
φ ≥ 0,
or equivalently
maximize α
s.t.
min{m∈Sm } φT m ≥ α
max{V∈Sv } φT V T FVφ ≤ λ − δ,
φT Dφ ≤ δ,
1T φ = 1
φ ≥ 0,
158
APPENDIX B. PORTFOLIO OPTIMIZATION AND SDP
Then, again by using the lemma;
maximize
subject to
w
1T φ
µT0 φ − γT φ
τ + 1T tδ
σ
T 2ρ φ σ−τ 2wi
(1 −σλi − ti ) 2DT φ 1−σ α
=
=
≥
≤
≤
1
1
QT H 2 G 2 V0 φ
1
α
λ
1
,
λmax H
≤ σ + τ,
≤ (1 − σλi + ti ) i=1,...,m,
≤ 1+σ
τ ≥ 0,
φ ≥ 0,
1
1
where H = QΛQT is the spectral decomposition of H = G− 2 FG− 2 ,
Λ = diag(λi )
B.9
Other robust formulations that can be represented as SOCP
These are, Robust maximum Sharpe ratio problem and Robust
Value-At-Risk (VaR) portfolio selection problems.
B.10
Conclusion
In this presentation;
-Definition of robust optimization.
-Robust formulations and their SOCP represenations.
–(a)Robust Least Squares
–(b)Robust Linear Programming
-Why does Robust formulation necessarry in Finance?
-Robust factor model for asset returns.
B.10. CONCLUSION
159
-Robust portfolio selection problems and their SOCP representations.
–(a)Robust Minimum variance portfolio selection
–(b)Robust Maximum return problem
160
APPENDIX B. PORTFOLIO OPTIMIZATION AND SDP
Appendix C
Primal-Dual Interior-Point
Methods
for Second Order Cone
Programming
Yusaku Yamamoto
12/10/2001
C.1
abstract
In this talk, I will explain how to construct an efficient primal-dual
interior- point methods for second order cone programming, following chapters 7 and 8 of [2]. In particular, I will focus on the linear
equations arising from Newton’s method, and point out some of
the special properties of the coefficient matrix. It is shown that
by applying an appropriate transformation to the original problem,
we can make the matrix symmetric positive definite, and therefore
much easier to solve. It is also shown that its Cholesky factor can
be calculated efficiently using a sparse Cholesky factorization techniques and some update formulas.
APPENDIX C. PRIMAL-DUAL INTERIOR-POINT METHODS
162
FOR SECOND ORDER CONE PROGRAMMING
C.2
Introduction
We consider a second order cone programming problem
cT1 x1 + . . . + cTr xr
A1 x1 + . . . + Ar xr = b
xi Q 0, for i = 1, . . . , r
(C.1)
bT y
ATi y + zi = ci , for i = 1, . . . , r
zi Q 0, for i = 1, . . . , r
(C.2)
min
s.t.
with its dual problem
max
s.t.
where xi ∈ Rni , ci ∈ Rni , Ai ∈ Rm×ni , and b ∈ Rm . We also write
x = (xT1 , . . . , xTr )T
c = (cT1 , . . . , cTr )T
A = (AT1 , . . . , ATr )T .
(C.3)
We assume that the m rows of A are linearly independent, without
loss of generality.
C.3
The primal-dual path following methods
To solve (C.1), we replace the second-order cone inequalities
P xi Q 0
by xi Q 0, and add the logarithmic barrier term −µ i ln det(xi )
to the objective function. Then we obtain
min
s.t.
r
X
i=1
r
X
cTi xi − µ
X
ln det(xi )
i
Ai xi = b
i=1
xi Q 0,
for i = 1, . . . , r
(C.4)
C.3. THE PRIMAL-DUAL PATH FOLLOWING METHODS
163
The Karush-Kuhn-Tucker (KKT) optimality conditions for (C.4)
are
r
X
Ai xi = b
(C.5)
i=1
ci − ATi y − 2µx−1
for i = 1, . . . , r
i = 0,
xi Q 0, for i = 1, . . . , r
(C.6)
(C.7)
By defining zi = ci − ATi y, these equations can be rewritten as
r
X
Ai xi = b
i=1
ATi y
+ zi = ci , for i = 1, . . . , r
xi ◦ zi = 2µe, for i = 1, . . . , r
xi , zi Q 0, for i = 1, . . . , r
(C.8)
Here, the ◦ operator in equation (C.8) is multiplication in Jordan
algebra of second order cones, and det(xi ) in eq. (C.4) and x−1
in
i
eq. (C.6) should also be taken in the sense of Jordan algebra.
Next, we apply Newton’s method to equation (C.8). By replacing
xi , y and zi appearing in (C.8) with xi + ∆xi , y + ∆y and zi + ∆zi ,
respectively, and ignoring the second and higher order terms in the
∆’s, we obtain
r
X
Ai ∆xi = b −
i=1
ATi ∆y
r
X
Ai xi
(C.9)
i=1
+ ∆zi = ci − ATi y − zi , for i = 1, . . . , r
(C.10)
zi ◦ ∆xi + xi ◦ ∆zi = 2µe − xi ◦ zi , for i = 1, . . . (C.11)
, r.
This equation can be written in the block matrix form


  
A 0 0
∆x
rp
 0 AT I   ∆y  =  rd 
∆z
rc
E 0 F
(C.12)
where
E ≡ Arw(z), F ≡ Arw(x),
rp ≡ b − Ax, rd ≡ c − AT y − z,
(C.13)
rc ≡ 2µe − x ◦ (C.14)
z,
APPENDIX C. PRIMAL-DUAL INTERIOR-POINT METHODS
164
FOR SECOND ORDER CONE PROGRAMMING
and Arw(x) and Arw(z) are direct sums of Arw(xi ) and Arw(zi ),
respectively. By applying block Gaussian elimination, the solution
to this linear equation can be written formally as
∆y = (AE−1 FAT )−1 (rp + AE−1 (Frd − rc )),
∆z = rd − AT ∆y,
∆x = −E−1 (F∆z − rc ).
(C.15)
(C.16)
(C.17)
These are the equations defining the Newton direction.
Note that although we have derived the Newton direction (C.15),
(C.16) and (C.17) for SOCP, exactly the same formulation holds for
any symmetric cone LP including the LP and SDP, if only we use
proper definitions of the ◦ operator, det(xi ), and x−1
i for each case.
See, for example, [3] for the SDP case.
C.4
Properties of the coefficient matrix
In this talk, I will focus on the numerical computation of equations
(C.15), (C.16) and (C.17) to obtain the Newton direction, although
it is also important and difficult to determine the step size along this
direction appropriately to guarantee polynomial-time convergence.
For the latter issue, consult [2] and the references therein.
The most computationally intensive part in computing these equations is calculation of (AE−1 FAT )−1 u in eq. (C.15), where we put
u = rp +AE−1 (Frd −rc ). It is therefore worthwhile to investigate the
structure of the matrix M = AE−1 FAT , and (a) to look for a transformation of the original problem (C.1) and (C.2) that keeps the
solution invariant, but simplifies M in some sense, and (b) to design
an efficient algorithm that takes advantage of the special structure
of the matrix. In particular, we are interested in an algorithm that
can exploit the sparsity of A. We deal with these issues later in
section 4 and 5, respectively.
As a preparation for a general SOCP case, we first consider Linear Programming, which is a special case of SOCP where all ni ’s
are one. In this case, the matrix D = E−1 F reduces to a diagonal
matrix (because Arw(xi ) and Arw(zi ) reduce to scalars and E and
F become diagonal), and all of its entries are positive. This implies
two favorable properties of M:
C.5. SYMMETRIZING THE M MATRIX
165
1. M is symmetric and positive definite. So M−1 u can be calculated using Cholesky factorization M = LLT , instead of LU
factorization. The former has the advantage that the required
computational work is only half of the latter (in the case of a
dense matrix) and pivoting to ensure numerical stability is not
necessary.
2. M = ADAT has the same nonzero structure as AAT . When we
compute the Cholesky factorization of a sparse SPD matrix B,
we usually permute the rows and columns of B by some permutation matrix P, and compute the factorization of B 0 = PBPT ,
in order to reduce the number of nonzero elements introduced
during factorization. Because the optimal choice of P depends
only on the nonzero structure of B, we need to calculate P only
once for B = AAT , and then can use the same P in all of the
subsequent iterations.
Unfortunately, these two properties do not hold for the general
SOCP case: M is nonsymmetric and much denser than AAT , because E−1 F = Arw−1 (z)Arw(x) is, in general, nonsymmetric and
dense. We discuss how to circumvent this situation and develop an
efficient algorithm in the following two sections.
C.5
Symmetrizing the M matrix
In this section, we consider a transformation of the original problem
(C.1) and (C.2) to an equivalent problem for which the coefficient
matrix M = AE−1 FAT is symmetric positive definite.
First, we define the quadratic representation Qp of p Q 0 as
follows:
Qp = 2Arw2 (p) − Arw(p2 )
kpk2
2p0 pT
=
2p0 p det(p)I + 2p pT
= 2ppT − det(p)R
(C.18)
where R is the reflection matrix; R = I − 2eeT . We give some
properties of Qp as the following theorem:
Theorem 1
APPENDIX C. PRIMAL-DUAL INTERIOR-POINT METHODS
166
FOR SECOND ORDER CONE PROGRAMMING
1. The transformation x → Qp x keeps the second-order cone Q
invariant.
2. Qp e = p2 .
t
3. Qp−1 = Q−1
p , and more generally, Qpt = Qp for any integer t.
4. If x is nonsingular, (Qp x)−1 = Qp−1 x−1
5. QQp x = Qp Qx Qp
See Chapter 4 of [2] for proofs.
We now introduce new variables
e
x ≡ Qp x
z ≡ Qp−1 z
e
c ≡ Qp−1 c
e
A ≡ AQp−1
(C.19)
e
Using these variables, the primal and dual problems can be rewritten
as
min
s.t.
and
x1 + . . . + cTr e
xr
cT1 e
e
e
A1 e
x1 + . . . + Ar e
xr = b
e
e
e
xi Q 0, for i = 1, . . . , r
(C.20)
bT y
ATi y + zi = ci , for i = 1, . . . , r
e
e
e
zi Q 0, for i = 1, . . . , r.
(C.21)
e
Here we used cTi e
xi = cTi xi and Ai e
xi = Ai xi , which follow from part 3
e
e
of Theorem 1, and the fact that e
xi Q 0 and zi Q 0 are equivalent
to xi Q 0 and zi Q 0, respectively, which efollows from part 1 of
Theorem 1. We also used the fact that applying Qp−1 on both sides
of an equation keeps the equation equivalent to the original one.
Applying the logarithmic barrier function to this problem, calculating the KKT optimality condition, and finding the linear equation
defining the Newton direction, we have
max
s.t.
f = b − Ae
A∆x
x
eT
e
A ∆y + ∆z = c − AT y − z
e
e
f e e
f +e
z ◦ ∆x
x ◦ ∆z = 2µe − e
x ◦ z.
e
e
f
(C.22)
C.5. SYMMETRIZING THE M MATRIX
167
Written in the original variables, these equations are
A∆x = b − Ax
AT ∆y + ∆z = c − AT y − z
(Qp−1 z) ◦ (Qp ∆x) + (Qp x) ◦ (Qp−1 ∆z) = 2µe − (Qp x) ◦ (Q(C.23)
p−1 z).
This shows that a different Newton direction is obtained by transforming the original problem to an equivalent one and then applying Newton’s method. (This is natural considering that the Newton
direction is not invariant under change of variables.) The new direction depends on p, and the original one corresponds to the special
case when p = e. By comparing the equation (C.11) and equation
(C.23), we see that the matrices E and F in the transformed problem
are written as
so that
E = Arw(z)Qp ,
e
F = Arw(e
x)Qp−1 ,
(C.24)
E−1 F = Qp−1 Arw−1 (z)Arw(e
x)Qp−1 .
(C.25)
e
We are interested in a special choice of p that makes Arw(e
x) =
Qp x and Arw(z) = Qp−1 z commute, because then Arw(e
x) and
e commute, and we have
Arw−1 (z) will also
e
(Arw−1 (z)Arw(e
x))T = ArwT (e
x)Arw−T (z)
e
e
= Arw(e
x)Arw−1 (z)
e
= Arw−1 (z)Arw(e
x),
(C.26)
e
which shows that Arw−1 (z)Arw(e
x), and therefore M = AE−1 FAT ,
is symmetric. In this case,eit is also easy to show that M is positive
definite.
Some of the p’s that satisfy this condition are
p1 = z1/2 , p2 = x−1/2 ,
p3 = [Qx1/2 (Qx1/2 z)−1/2 ]−1/2 = [Qz−1/2 (Qz1/2 x)1/2 ]−1/2
(C.27)
In fact, one can show using Theorem 1 that z = e for p1 , e
x=e
e
e
e
for p2 , and x = z for p3 . Hence, Arw(x) and Arw(z) commute in
all of the cases. eThe Newton directions obtained ine each case are
known as an analogue of the XZ direction in SDP, an analogue of the
ZX direction in SDP, and the Nesterov and Todd (NT) direction,
respectively.
APPENDIX C. PRIMAL-DUAL INTERIOR-POINT METHODS
168
FOR SECOND ORDER CONE PROGRAMMING
C.6
Efficient Cholesky factorization of M
Next, we study a special structure of E−1 F that can enable efficient
Cholesky factorization of M. In the case of LP, we showed in section
3 that E−1 F is diagonal. In the case of general SOCP, we have the
following theorem.
Theorem 2
Let x ∈ Rn and z ∈ Rn be such that z is nonsingular. For all
nonsingular p we have
E−1 F = Qp−1 Arw−1 (z)Arw(e
x)Qp−1 = D + T = D 0 + T 0 (C.28)
e
where D is the direct sum of multiples of the identity matrix and
T is a nonsymmetric matrix, in general, whose rank is at most 3r;
and D 0 is the direct sum of multiples of R and T 0 is a nonsymmetric
matrix, in general, of rank at most 2r.
To prove the theorem, we first restrict ourselves to a single block
case and note the lemma.
Lemma 1
Let x ∈ Rn and z ∈ Rn be such that z is nonsingular. Then
G(x, z) = Arw−1 (z)Arw(x) = βI −
1
1
RxeT + RzuT ,
z0
z0
(C.29)
where β = x0 /z0 , u = x ◦ z−1 , and e is the identity vector.
This lemma can be proved by inspection, noting that z−1 =
1
Rz and
det(z)
1
z0
−zT
−1
.
(C.30)
Arw (z) =
T
det z
1
det(z) −z z0 I + z0 z z
Now we prove Theorem 2 for the single block case where r = 1.
If we set q = p−1 and θ = β det(q)2 , we have from part 2 and 3 of
Theorem 1 and eq. (C.18)
Q2q = Qq2 = − det(q2 )R + 2q2 (q2 )T
T
e Qq
= − det(q2 )(I − 2eeT ) + 2q2 (q2 )T ,
= (q2 )T .
(C.31)
(C.32)
C.6. EFFICIENT CHOLESKY FACTORIZATION OF M
169
Multiplying G(e
x, z) in Lemma 1 from the left and right by Qp−1 ,
e (C.32), we have
and using (C.31) and
E−1 F = Qq Arw−1 (z)Arw(e
x)Qq
e
1
1
= Qq (βI − Re
xeT + RzuT )Qq
z0
z0 e
1
1
= θI − 2θeeT + β(2q2 −
Qq Re
x)(q2 )T + Qq Rz(C.33)
uT Qq
z0 β
z0
e
1
1
= θR + β(2q2 −
Qq Re
x)(q2 )T + Qq RzuT Qq (C.34)
z0 β
z0
e
It is apparent from (C.33) and (C.34) that E−1 F can be written as
D + T or D 0 + T 0 . The multiple block case can be proved by noting
that E−1 F can be written as the direct sum of E−1
r Fr for each block.
Although T is, in general, nonsymmetric, one can show that it
is symmetric and positive semidefinite when Arw(e
x) and Arw(z)
e
commute. In that case, M = AE−1 FAT can be written as
M = A(D + T )AT
= ADAT + ATAT = LD 0 LT + VV T ,
(C.35)
where LD 0 LT is the Cholesky factorization of ADAT , which can be
computed efficiently by sparse Cholesky factorization techniques [4]
when A is sparse, and V is a matrix with at most 3r columns.
Hence, the Cholesky factorization of M can be computed as a rank3r update of the Cholesky factorization of ADAT . There are several
methods to do this, and though the popular Sherman-MorrisonWoodbury formula [6] has been shown to be numerically unstable in
this case, Goldfarb and Scheinberg have developed a stable method
called the product-form Cholesky factorization [5].
APPENDIX C. PRIMAL-DUAL INTERIOR-POINT METHODS
170
FOR SECOND ORDER CONE PROGRAMMING
Bibliography
[1] F. Alizadeh: Semidefinite Programming Seminar Lecture
Notes, Columbia University, Fall 2001.
[2] F. Alizadeh and D. Goldfarb: Second-Order Cone Programming, RUCTOR Research Report, RRR 51-2001, Rutgers University, Nov. 2001.
[3] F. Alizadeh, J. A. Haeberly, and M. L. Overton: Primal-Dual
Interior-Point Methods for Semidefinite Programming: Convergence Rates, Stability and Numerical Results, SIAM Journal on
Optimization, Vol. 8, No. 3, pp. 746-768 (1998).
[4] A. George and W. H. Liu: Computer Solution of Large Sparse
Positive Definite Systems, Prentice-Hall, 1981.
[5] D. Goldfarb and K. Scheinberg: A Product-form Cholesky
Factorization Method for Handling Dense Columns in Interior
Point Methods for Linear Programming, Manuscript, 2001.
[6] N. Higham: Accuracy and Stability of Numerical Algorithms,
SIAM, 1996.
172
BIBLIOGRAPHY
Appendix D
On the Shannon Capacity
of a Graph
Xuan Li
12/05/2001
D.1
Overview
Let there be a graph G, whose vertices are letters in an alphabet and
in which adjaceny means that the letter can be confused. Then the
maximum number of one-letter messages which can be sent without danger of confusion is clearly α(G), the maximum number of
independent points in the graph G. Denote by α(Gk ) the maximum number of k−letter messages which can be sent without danger of confusion ( two k−letter words are confoundabe if for each
1 ≤ i ≤ k, their ith letters are confuoundable or equal). It is clear
that there are at least α(G)k such words, but one may be able to
do better.
Definition 57 Let
Θ(G) = sup
k
p
k
α(Gk ) = lim
k→∞
p
k
α(Gk )
This number was introduced by Shannon and is called the Shannon
capacity of the graph G.
174
APPENDIX D. SHANNON CAPACITY
The previous consideration shows that Θ(G) ≥ α(G) and in general,
equality does not hold. The determination of the Shannon capacity
is a very difficult problem even for very simple small graphs.
A general upper bound on Θ(G) was also given by Shannon. We
assign nonegative weights w(x) to the vertices x of G such that
X
w(x) ≤ 1
x∈C
for every complete subgraph C in G; such P
assignment is called a
fractional vertex packing. The maximum of x w(x), taken over all
fractional vertex packings, is denoted by α∗ (G).
With this notation Shannon’s theorem states
Θ(G) ≤ α∗ (G)
For the case of the pentagon, this result yield the bounds
√
5 ≤ Θ(C5 ) ≤
5
2
We will introduce a well-computable funtion–Lovasz θ−function which
bounds the capacity from above and equals the capacity in a large
number of cases.
D.2
Lovasz Theta function
Let G be a finite undirected graph without loops. We say that two
vertices of G are adjacent if they are either connected by an edge or
are equal. The set of points of the graph G is denoted by V(G).
Definition 58 If G and H are two graphs, then their strong product
G · H is defined as the graph with V(G · H) = V(G) × V(H), in which
(x, y) is adjacent to (x0, y0) if and only if x is adjacent to x0 in G
and y is adjacent to y0 in H.
If we denote by Gk the strong product of k copies of G, then α(Gk ) is
indeed the maximum number of independent points in Gk . Besides
the inner product of vectors v, w(denoted by vT w, where T denote
transpose. We shall use tensor product.
D.2. LOVASZ THETA FUNCTION
175
Definition 59 If v = (v1 , . . . , vn ) and w = (w1 , . . . , wm ), then we
denote by v ◦ w the vector
(v1 w1 , . . . , v1 wm , v2 w1 , . . . , vn wm )T
of length nm.
A simple computation shows that the two kinds of vector multiplication are connected by
(x ◦ y)T (v ◦ w) = (xT v)(yT w)
(1)
Let G be a graph. For simplicity, we always assume that its vertices
are 1, . . . , n.
Definition 60 An orthonormal representation of G is a system (v1 , . . . , vn )
of units vectors in Eulidean space such that if i and j are nonadjacent
vertices, then vi and vj are orthogonal.
Clearly, each graph has an orthonormal representation, for example,
by pairwise orthogonal vectors.
Lemma 22 Let (u1 , . . . , un ) and (v1 , . . . , vm ) be orthonormal representation of G and H, respectively. The the vectors ui ◦ vj form
an orthonomal representation of G · H.
Proof: The result is immediate from (1).
Definition 61 The value of an orthonormal representation (u1 , . . . , un )
to be
1
min max T 2
c 1≤i≤n (c ui )
where c range over all unit vectors. The vector c yielding the minimum is called the handle of the representation.
Definition 62 Let θ(G) denote the minimum value over all representations of G. It is easy to see that this minimum is attained.
Call a representation optimal if it achieves this minimum value.
Lemma 23
θ(G · H) ≤ θ(G)θ(H)
176
APPENDIX D. SHANNON CAPACITY
Proof: let (u1 , . . . , un ) and (v1 , . . . , vm ) be optimal orthonormal
representations of G and H, with handle c and d, respectively. Then
c ◦ d is a unit verctor by (1), and hence
1
i,j ((c ◦ d)T (ui ◦ vj ))2
1
1
= max T 2 T 2
i,j (c ui ) (d vj )
= θ(G)θ(H)
θ(G · H) ≤ max
Lemma 24
α(G) ≤ θ(G)
Proof: Let (u1 , . . . , un ) be an optimal orthonormal representation
of G with handle c. Let (1, . . . , k), for example, be a maximum
independent set in G. The u1 , . . . , uk are pairwise orthogonal, and
so
k
X
α(G)
1 = c2 ≥
(cT ui )2 ≥
θ(G)
i=1
Theorem 20
Θ(G) ≤ θ(G)
Proof: By Lemmas 1 and 2 , α(Gk ) ≤ θ(Gk ) ≤ (θ(G))k .
Lemma 25
Θ(C5 ) =
√
5
Proof: Consider an umbrella whose handle and five ribs have unit
lenghth. Open the umbrella to the point where the maximum angle
between the ribs is π2 . Let u1 , u2 , u3 , u4 , u5 be the ribs and c be the
handle, as vectors oriented away from their common point. Then
u1 , . . . , u5 is an orthonormal representation of C5 . Moreover it is
easy to compute that cT ui = 5−1/4 ,hence
√
1
Θ(C5 ) ≤ θ(C5 ) ≤ maxi T 2 = 5
(c ui )
The opposite inequality is known, and hence the theorem follows.
D.3. FORMULAS FOR θ(G)
D.3
177
Formulas for θ(G)
In the previous lecture, we have introduce Lavosz theta function as
minimum of the maximum eigenvalue of a set of matrices. We will
show now that the two definition are equivalent.
Theorem 21 Let G be a graph on vertices {1, . . . , n}. Then θ(G)
is the minimum of the largest eigenvalue of any symmetric matrix
(ai,j )ni,j=1 such that
aij = 1 if i = j or if i and j are nonadjacent. (2)
Proof: 1)Let (u1 , . . . , un ) be an optimal orthonormal representation of G with handle c, Define
aij = 1 −
uTi uj
(cT ui )(cT uj )
aii = 1
And A =
(aij )ni,j=1
Then (2) is satisfied, Moreover,
−aij = (c −
ui T
uj
) (c − T
), i 6= j
T
(c ui )
(c uj )
and
θ(G) − aii = (c −
ui 2
1
) + (θ(G) − T 2 )
T
(c ui )
(c ui )
Since (θ(G) − (cT 1ui )2 ) ≥ 0, these equations imply that θ(G)I − A
is positive semidefinite, and hence the largest eigenvalue of A is at
most θ(G).
2)Conversely, let A = (aij ) be any matrix satisfying (2), and let λ
be its largest eigenvalue. Then λI − A is positive semidefinite, and
hence there exist vectors x1 , . . . , xn such that
λδij − aij = xTi xj
Let c be a unit vector perpendicular to x1 , . . . , xn and set
1
ui = √ (c + xi )
λ
Then
u2i =
1
(1 + x2i ) = 1, i = 1, ...n
λ
178
APPENDIX D. SHANNON CAPACITY
and for nonadjacent i and j,
1
(1 + xTi xj ) = 0
λ
So (u1 , . . . , un ) is an orthonormal representation of G. Moreover,
ui T uj =
1
= λ, i = 1, . . . , n
(cT ui )2
and hence θ(G) ≤ λ. This complete the proof of the therom.
Remark 18 Note that it follows that among the optimal representation there is one such that
1
1
= ... = T
= θ(G)
T
2
(c u1 )
(c u2 )2
D.4
Some further properties of θ(G)
Lemma 26 Let (u1 , . . . , un ) be an orthonormal representation of
G and (v1 , . . . , vn ) be an orthonornmal representation of the complementary graph Ḡ. Moreover, let c and d be any vectors. Then
n
X
(uTi c)2 (vTi d)2 ≤ c2 d2
i=1
Proof: By (1), the vectors ui ◦ vi satisfy
(ui ◦ vi )T (uj ◦ vj ) = (uTi uj )(vTi vj ) = δij
Thus they form an orthonormal system, and we have
(c ◦ d)2 ≥
n
X
((c ◦ d)T (ui ◦ vi ))2
i=1
which is just the inequality in lemma 4.
Corollary 8 If (v1 , . . . , vn ) be an orthonormal representation of Ḡ
and d is any unit vector, then
θ(G) ≥
n
X
i=1
(vTi d)2
D.4. SOME FURTHER PROPERTIES OF θ(G)
179
Proof: Let (u1 , . . . , un ) be an optimal orthonormal representation
of G with handle c. Then,
1
θ(G) = max
1≤i≤n (cT ui )2
i.e.,
1
≤ (cT ui )2 , i = 1, . . . , n
θ(G)
1 = c2 d 2
n
X
≥
(uTi c)2 (vTi d)2
i=1
1 X T 2
(v d)
≥
θ(G) i=1 i
n
i.e.
θ(G) ≥
n
X
(vTi d)2
i=1
Corollary 9
θ(G)θ(Ḡ) ≥ n
Proof: Let (u1 , . . . , un ) be an optimal representation of G with
handle c such that
θ(G) =
1
(cT u
1
)2
= ... =
1
(cT u
2
n)
Let (v1 , . . . , vn ) be an optimal representation of Ḡ with handle d
such that
1
1
θ(Ḡ) = T 2 = . . . = T
(d v1 )
(d vn )2
180
APPENDIX D. SHANNON CAPACITY
By the result of Lemma 4,
1 = c2 d 2
n
X
≥
(uTi c)2 (vTi d)2
i=1
n
X
1
1
θ(G) θ(Ḡ)
i=1
n
=
θ(G)θ(Ḡ)
=
i.e.,
θ(G)θ(Ḡ) ≥ n
Appendix E
Extensions
Goemans-Williamson
Analysis:
Max-Bisection and
Max-k-Cut
Ge Zhang 12/12/01
E.1
Outlines
• Goemans and Williamson Approach
• Brief Introduction to Max k- Cut and Max Bisection Problem
• Theorems(Main Results)
• Algorithm for Max k-Cut
• Algorithm for Max Bisection
• Appendix(Proof of Thorems)
E.2
Goemans and Williamson Approach
Example 40 (L) et G = [V, E] be a undirected graph, where V =
{1, · · · , n} and |E| = m. Further, let wij ≥ 0 be a weight on edge(i, j)
182
APPENDIX E. SDP RELAXATIONS
∀(i, j) ∈ E. The MAX CUT problems is to partition V into S ⊂ V
and V\S such that the sum of the weights on the edges from the
subset S to V\S is maximized.
Therefore the problem can be formulated as following:
X
wij .
Wopt = maxS⊂V
(i,j)∈E,i∈S,j∈S̄
Such a pratition can also be viewed as a cut. IfPC denote a cut,
then the value of this cut is denoted by W(C) = (i,j)∈C wij . It is
known that MIN cut problem has been sloved with polynomial time
algorithm. However, the MAX cut problem is NP hard. There are
couples of approaches to solve this problem.
We are interested in Goemans and Williamson Approach.
Let xi = {−1, 1}, then we can convert this problem to a IP programming
IP: max
X wij
i<j
subject to
2
(1 − xi xj )
xi = {1, −1}
This is still a NP-Hard. Therefore we consider the following
relaxiation. Associated to each vetex i ∈ V a vector vi ∈ Rn and
consider
max
X wij
i<j
subject to
2
(1 − vTi vj )
kvi k = 1, ∀i ∈ V
Let V = [v1 , v2 , · · · , vn ] and Y = V T V, then yij = vTi vj . We can
rewrite the problem above as following:
max
X
(1 − yij )wij /2
i<j
subject to
yii = 1, ∀i ∈ V
Y0
E.2. GOEMANS AND WILLIAMSON APPROACH
183
This is so called SDP problme. Y 0 means that matrix Y is postive semi-definite.
Suppose we can solve this SDP Problem and let Y ∗ denote the
opitmal solution of this SDP problem and write Y ∗ = V ∗ T V ∗ . Suppose V ∗ = [v∗1 , v∗2 , · · · , v∗n ].
Now we are ready for the Goemans and Williamson approach:
Generate a random hyper plane passing origin. Then v∗1 , v∗2 , · · · , v∗n
will be divided into two groups by this hyperplane. Therefore, we
get a corresponding cut. Denote the value of cut by W(CSDP ).
θ
Proposition 11 E[W(CrandSDP )] ≥ αW(Copt ), where α = π2 min0<θ<π 1−cos
θ
and 0.87856 < α < 0.8857, Copt is the optimal cut.
Proof: Observe that the probability that vectors v∗i and v∗j are on
opposite sides of the hyperplane is exactly the proportion of the
angle between v∗i and v∗j to π. Therefore
E[W(CrandSDP )] =
X arccos(v∗i T v∗j )
i<j
π
wij
.
Since the expected value of a given cut is at most as large as the
optimal cut, and the expected value of the optimal cut is less than
the value of the semidefinite relaxation we have
X arccos(1 − v∗i T v∗j )
i<j
π
wij ≤ W(Copt ) ≤ W(CSDP ) =
X
i<j
Let α = min−1≤y≤1 2 arccos(y)
, then clearly
π(1−y)
α
X
i<j
wij
X
arccos(v∗i T v∗j )
1 − v∗i T v∗j
≤
wij
2
π
i<j
1 − v∗i T v∗j
wij
2
184
E.3
APPENDIX E. SDP RELAXATIONS
MAX k-CUT and MAX BISECTION problem
Since we’ve already done the one cut problem, there are two natural
generalizations.
E.3.1
generalization: MAX k-CUT
Instead of one cut, the problem now is to divide the graph into k
groups and we want to know what is maximum of this k partition.
Let P = P1 , P2 , · · · , Pl be a partition
Pof V and
P|P| = l.Denote the
weight of this partition by W(P) = 1≤r<s≤l i∈Pr ,j∈Pj wij . Then,
the formulation of this problem is
max
suject to
E.3.2
W(P)
|P| = k
generalization: MAX BISECTION
The one cut problem we stated above may generate two group of
nodes with different size. Now, we are interested in the cut that
divide the graph equally. Suppose |V| = n is an even number. A
partiion P = [S, V\S], S ⊂ V with |S| = n/2, then MAX BISECTION problem can be formulated as following
max
subject to
E.4
E.4.1
W(P)
P = [S, V\S], S ⊂ V
|S| = n/2
Theorems
Max k-cut problem
The simplest heuristic for MAX k-CUT is just to randomly partition
^ dnotes the partition produced and P∗ denotes the
V into k sets. If P
optimum partition then it is easy to see that
^ ≥ (1 − 1 )W(P∗ ),
E[W(P))
k
E.4. THEOREMS
185
since each edge (i, j) has probability (1 − k−1 ) of joining vertices in
different sets of the partition.
Nevertheless, the coefficient (1 − k−1 ) we got from this simple heruistic is far from the best. In this paper, the authors describe a new heuristic(randomized) which produces a better partition Pk .We’ll state the theorem first and present this new heuristic(randomized) in the next section.
Let Pk be the partition produced by this new heruristic(randomized).
Let {αk } be a sequence of positive number, k ≥ 2. Denote the optimal partition in MAX k-CUT by Pk∗ .
Theorem 22 E[W(Pk )] ≥ αk W(Pk∗ ),
where the αk satisfy
• αk > 1 − k−1
• αk − (1 − k−1 ∼ 2k−2 lnk
• α2 ≥ 0.878567, α3 ≥ 0.800217, α4 ≥ 0.850304, α5 ≥ 0.874243, α10 ≥
0.926642, α100 ≥ 0.990625.
E.4.2
MAX BISECTION
A random bisection porduces an expected guarantee of 1/2. In this
paper, the authors presented a heuristic BISECT which produces
much better partition PB . Let PB∗ denote the optimal bisection. Following is the theorem and the heuristic will be shown in the next
section.
Theorem 23 Let bep
a small positive constant. Then E[W(PB )] ≥
∗
βW(PB ) where β = 2( 2(1 − )α2 − 1), which is greater than 0.65
for sufficiently small.
186
E.5
E.5.1
APPENDIX E. SDP RELAXATIONS
Algorithm
MAX k-CUT
Similar to Goemans and Williamson approach, we also can formulate
the k-cut problem as an IP programming. However, because there
are k sets now, we have to let xi ∈ {1, 2, · · · , k}. Unfortunately,
this is not useful. Instead we construct k vectors a1 , a2 , · · · , ak
and let xi be one of them.
P The following is the construction: take
an equilateral simplex k in Rk−1 with vertices
P b1 , b2 , · · · , bk . Let
ck = (b1 +b2 +· · ·+b
Pk )/k be the center of k and let ai = bi −ck ,
for 1 ≤ i ≤ k. Scale k so that kai k = 1, ∀i ∈ V.
Lemma 27 ai aj = −1/(k − 1).
Proof:Since a1 , a2 , · · · , ak are of unit length we have to show that
the angle between ai and aj is arccos(−1/(k − 1)) for i 6= j. By
rotation and shift we can put b1 , b2 , · · · , bk−1 in the plane xk−1 = 0
and form an equilateral simplex of dimentaion k − 2. Let bi =
(bi0 , 0), for 1 ≤ i ≤ k − 1, where bi0 has dimension k − 2, and
move the whole simplex so that b10 + b20 + · · · + bk−1 = 0. Then
ck = (0, 0, · · · , 0, x) and bk = (0, 0, · · · , kx) for some x > 0. But
kbk − ck k = 1 and so x = 1/(k − 1). But then (bk − ck )(b1 − ck ) =
−1(k − 1)x2 = −1/(k − 1). By symmetry, we know it is true for any
two vectors.
Note that −1/(k − 1) is the best angle separation we can obtain
for k vectors as we see from:
Lemma 28 If u1 , u2 , · · · , uk satisfy kui k = 1 for 1 ≤ i ≤ k, and
ui uj ≤ γ for i 6= j, then γ ≥ −1/(k − 1).
Proof: 0 ≤ (u1 + u2 + · · · + uk )2 ≤ k + k(k − 1)γ.
Based on Lemma 1 we can formulate MAX k-CUT as follows:
IPk :
max
subject to
k−1X
wij (1 − xi xj )
k i<j
xi ∈ {a1 , a2 , · · · , ak }
The SDP relaxation will be following:
E.5. ALGORITHM
SDPk :
187
max
subject to
k−1X
wij (1 − vi vj )
k i<j
kvi k = 1, ∀i,
vi vj ≥ −1/(k − 1), ∀i 6= j
Heuristic:
Step 1 slove the problem SDPk to obtain vectors v1 , v2 , · · · , vk .
Step 2 choose k random vectors z1 , z2 , · · · , zk .
Step 3 partition V according to which of z1 , z2 , · · · , zk is closest to
each vj ,i.e., let P = P1 , P2 , · · · , Pk be defined by Pi = {j : vj zi ≥ vj zi 0
for i 6= i 0 }, for 1 ≤ i ≤ k.
Here is how we get random vector: Letzi = (z1i , z2i , · · · , zki ),
1 ≤ j ≤ k where the zij are kn independent samples from a standard normal distribution with mean 0 and variance 1. When k = 2,
it will be exactly the same as Goemans and Williamson approach.
Let Wk denote the weight of the partition produced by the heuristic, let Wk∗ be the weight of the optimal partition and let W̃k be
denote the maximum value of SDPk . Make xj = ai for j ∈ Pi , 1 ≤
i ≤ k we see that
E(Wk ) =
X
wij Pr(xi 6= xj )
(E.1)
i<j
By symmetry Pr(xi 6= xj ) depends only on the angle θ between
vi and vj , and hence on ρ = cos θ = vi vj . Let this separation
probability be denoted by Φk (ρ). It then follows from (1) that
E(Wk )
E(Wk )
≥
=
∗
Wk
W̃k
P
i<j
k−1
k
P
wij Φk (vi vj )
i<j
wij (1 − vi vj )
kΦk (ρ)
where αk = min−1/(k−1)≤ρ≤1 (k−1)(1−ρ)
.
≥ αk
188
E.5.2
APPENDIX E. SDP RELAXATIONS
MAX BISECTION
We can formulate the MAX BISECTION problme as following:
IPB :
1X
wij (1 − xi xj )
2 i<j
X
xi xj ≤ −n/2
max
subject to
i<j
xi ∈ {−1, 1}∀i ∈ V
Take a arbitrary BISECTION
P = S, V\S, |S| = n/2.
Let xi ∈ {−1, 1},for ∀i ∈ V. Let xi = 1 if i ∈ S, xi = −1 if i ∈ V\S.
Clearly
X
∀i,
xi xj = −1.
j:j6=i
It implies
XX
i
xi xj =
j:j6=i
X
It follows that
X
xi xj = −n.
i6=j
xi xj = −n/2.
i<j
P
Note if |S| 6= n/2, then
i<j xi xj > −n/2. Therefore, constrait
P
i<j xi xj ≤ −n/2 will force |S| = n/2 and this formulation is indeed a max bisection problem.
Now we can easily relaxiate to a SDP problem.
SDPB :
max
subject to
1X
wij (1 − vi vj )
2 i<j
X
vi vj ≤ −n/2,
i<j
kvi k = 1, ∀i ∈ V
Let be a small positive constant, = 1/100 is samll enough.
Heuristic
E.5. ALGORITHM
189
Stage 1 solve the problem SDPB to obtain vectors v1 , v2 , · · · , vn
Repeat Stages 2-4 below for t = 1, t, · · · , K = K() = [−1 ln−1 ]
and output the best partition S˜t , V\S˜t found in Stage 4.
Stage 2 choose 2 random vecors z1 , z2 .
Stage 3 let St = {j : vj z1 ≤ vj z2 }.
Stage P
4 suppose (w.l.o.g.) that |St | ≥ n/2. For each i ∈ St , let
ζ(i) = i∈S
/ t wij and let St = {x1 , x2 , · · · , xl } where ζ(x1 ) ≥ ζ(x2 ) ≥
· · · ≥ ζ(xl ). Let S˜t = {x1 , · · · , xn/2 }
Clearly construction sastisfies
nW(St : V\St )
.
W(S˜t : V\S˜t ) ≥
2l
Define two sets of random variables.
Xt = W(St : V\St ), 1 ≤ t ≤ K.
Yt = |St |(n − |St |), 1 ≤ t ≤ K.
Let PB∗ denotes the optimum bisection, and let W ∗ ≥ W(PB∗ )
denote the maximum of SDPB . Then we know
E[Xt ] ≥ α2 W ∗ ,
by Goemans and Willamson approach.
Also note Yt is the number of edges that connect two sets produced by partition, so
P
P
E[Yt ] = i<jP
Φ2 (vi vj ) ≥ α22 i<j (1 − vi vj )
⇒ E[Yt ] ≥ α2 N,
i<j vi vj ≤ −n/2
where N = n2 /4.
Let
Zt =
Xt
Yt
+ ,
∗
W
N
190
APPENDIX E. SDP RELAXATIONS
then
E[Zt ] ≥ 2α2 .
On the other hand
Xt ≤ W ∗ and Yt ≤ N ⇒ Zt ≤ 2.
Define Zτ = max1≤t≤K {Zt }.
Zt ≤ 2 ⇒ 2 − Zt ≥ 0,
by Chebyshev’s inequality, we have
Pr(Z1 ≤ 2(1 − )α2 ) = Pr(2 − Z1 ≥ 2 − 2(1 − )α2 )
E[2 − Z1 ]
≤
2 − 2(1 − )α2
2 − 2α2
≤
2 − 2(1 − )α2
1 − α2
=
1 − (1 − )α2
Since Zt ’s are all independent,
Pr(Zτ ≤ 2(1 − )α2 ≤ (
1 − α2
)K ≤ ,
1 − (1 − )α2
for the given choice of K().
Assume that
Zτ ≥ 2(1 − )α2
∗
and suppose Xτ = λW . It implies
Yτ ≥ (2(1 − )α2 − λ)N.
Suppose |Sτ | = δn, then it follows
δ(1 − δ) ≥ (2(1 − )α2 − λ)/4.
Then by few lines of algebra, we see that
W(S˜τ : V\S˜τ ) ≥ W(Sτ : V\Sτ )/(2δ) ≥ λW ∗ /(2δ)
≥ (2(1 − )α2 − 4δ(1 − δ)W ∗ /(2δ)
p
≥ 2( 2(1 − )α2 − 1)W ∗ .
E.6. APPENDIX(PROOF OF THEOREMS)
191
The last inequality follows from simple calculus.
Hence,
p
1 − α2
)K )W ∗
E[W(S˜τ )] ≥ 2( 2(1 − )α2 − 1)(1 − (
1 − (1 − )α2
p
∗
≥ 2( 2(1 − 3)α2 − 1)W .
Finally note that the partition output by BISECT is at least as
good as S˜τ We divide above by 3 to get the precise result.
E.6
Appendix(Proof of Theorems)
Let g(x) = (2π)−1/2 exp(−x2 /2) be the probability density function
of the univariate normal distribution. For i = 1, 2, · · · , the normalised H ermit polynomials φi (·) are defined by
√
di g(x)
(−1)i i!φi (x)g(x) =
.
dxi
Let hi = hi (k) denote the expection of φi (xm ax), where xmax is
distributed as the maximum of a sequence of k independent normally distributed random variables.
Lemma 29 Suppose u, v ∈ Rn are unit length vectors at angle θ,
and r1 , r2 , · · · , rk is a sequence of random vectors. Let ρ = cosθ =
uv, and denote by Nk (ρ) = 1 − Φk (ρ) the probability that u and v
are not separated by r1 , · · · , rk . Then the Taylor series expansion
Nk (ρ) = a0 + a1 ρ + a3 ρ2 + a4 ρ3 + · · ·
of Nk (ρ) about the point ρ = 0 converges for all ρ in the range
|ρ| ≤ 1. The coefficients ai of the expansion are all non-negative,
and their sum converges to Nk (1) = 1. The first three coefficients
are a0 = 1/k, a1 = h21 /(k − 1) and a2 = kh22 /(k − 1)(k − 2).
The proof of this lemma is complicated, so it is omited.
We’ll prove the Theorem 1 by three corollaries.
192
APPENDIX E. SDP RELAXATIONS
Corollary 10 αk > 1 − k−1 , for all k ≥ 2.
Proof: At ρ = 0, the numerator and dnominator of Ak (ρ) are both
k − 1; at ρ = 1 they are both 0. Since the power series expansion of Nk (ρ) has only positive terms, the numberator is a concave
function in the range 0 ≤ ρ ≤ 1, and hence Ak (ρ) ≥ 1 in that range.
Turning
P to the case ρ < 0, note that Nk (1) = 1 and Nk (−1) = 0
implies ieven a√
i = 1/2; futhermore, since h1 (k) increases with k
and h1 (3) = 3/2 π, we have a1 ≥ 9/4π(k − 1). Therefore
Nk (ρ) ≤
1
9(−ρ)
ρ2
1
(−ρ)
−
+
≤ −
,
k 4π(k − 1)
2
k 5(k − 1)
where the second inequality is valid over the range −1/(k − 1) ≤
ρ ≤ 0, since 9/4π − 1/2 ≥ 1/5; hence
Ak (ρ) ≥
1
k(−ρ)
(1 +
)
1−ρ
5(k − 1)2
. It is easily verified that the above expression is strictly greater
than 1 − k−1 over the closed interval −1/(k − 1) ≤ ρ ≤ 0.
Corollary 11 αk − (1 − k−1 ) ∼ 2K−2 ln k.
√
Proof: We know h1 (k) ∼ 2 ln k. Thus we have the asymptotic
estimate
1
2 ln k
Nk (ρ) = + (1 + (k))
ρ + O(ρ2 ),
k
k
where (k) is a function tending to 0, as k → ∞. The result follows
by arguments used in the proof of the previous corollary.
Corollary 12 α2 ≥ 0.878567, α3 ≥ 0.800217, α4 ≥ 0.850304, α5 ≥
0.874243, α10 ≥ 0.926642, and α100 ≥ 0.990625.
Proof:We can get all the numbers by numerically estimating the
coefficients a0 , a1 , a2 · · · .
Appendix F
Nesterov’s Extension of
Goemans–Williamson
Anaylysis
David Phillips
12/12/2001
F.1
Overview
As covered in class, the Goemans-Williamson approach to the maximum cut problem (MAXCUT) used a relaxation of a quadratic
formulation to approximate the solution. Nesterov addresses a generalization of this formulation and derives bounds for both the maximization and the minimization problem. Note that the GoemansWilliamson approach is not able to do this. Except where cited,
these notes are results from [?], [?], and [?].
F.2
Motivation
Let G = (V, E) be an undirected graph, where V = {1, . . . , n} and
|E| = m. Enumerate the vertices 1, . . . , n. Further, let wij ≥ 0 be
a weight on edge, (i, j), ∀(i, j) ∈ E. The MAXCUT problem is to
partition V into S ⊂ V and S̄ = V − S such that the sum of the
weights on the edges from the subset S to S̄ is maximized. Recall
that the Goemans-Williamson approach uses the objective function:
max (1/2) Σ_{i<j} w_ij (1 − x_i x_j)
subject to x_i ∈ {−1, +1},   i = 1, . . . , n

This is equivalent to the formulation

max (1/4) Σ_{i=1}^n Σ_{j=1}^n w_ij (1 − x_i x_j)
subject to x_i ∈ {−1, +1},   i = 1, . . . , n
or, more succinctly,

max (1/4)(e^T W e − x^T W x)
subject to x ∈ {−1, +1}^n

where

W = [ w_11  w_12  . . .  w_1n ]
    [ w_12  w_22  . . .  w_2n ]
    [  ...                    ]
    [ w_1n  w_2n  . . .  w_nn ]
The matrix W represents the undirected weights on each edge, and by this fact has the feature that W = W^T. Note, however, that W is possibly indefinite. With this in mind, Nesterov considered the
following optimization problems:
P^* = max x^T C x
subject to x ∈ {−1, +1}^n

where C is an arbitrary n×n symmetric matrix. The corresponding minimization problem is:

P_* = min x^T C x
subject to x ∈ {−1, +1}^n
Since C is possibly indefinite, these problems are NP-hard, and it
seems unlikely that they can be solved in polynomial time. Hence, in
the spirit of Goemans-Williamson, Nesterov considered semidefinite
relaxations.
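To make the pair of problems concrete, here is a small Matlab sketch (added to these notes; not from Nesterov's paper) that evaluates P^* and P_* by brute-force enumeration for a tiny random symmetric C; this is of course only feasible for very small n.

% Brute-force evaluation of P^* = max x'Cx and P_* = min x'Cx over x in {-1,+1}^n.
n = 10;
C = randn(n); C = (C + C')/2;            % arbitrary symmetric, possibly indefinite
Pmax = -inf; Pmin = inf;
for k = 0:2^n-1
    x = 2*bitget(k, 1:n)' - 1;           % the k-th vector in {-1,+1}^n
    v = x'*C*x;
    Pmax = max(Pmax, v); Pmin = min(Pmin, v);
end
fprintf('P^* = %g,  P_* = %g\n', Pmax, Pmin);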
F.3
The approach

F.3.1
A further generalization
Nesterov actually derived his results for the following pair of quadratically constrained problems:
Q^* = max x^T C x
subject to A[x]² = b
           [x]² ∈ K

and

Q_* = min x^T C x
subject to A[x]² = b
           [x]² ∈ K

where C is an arbitrary n×n symmetric matrix, A ∈ R^{m×n}, b ∈ R^m, and K is a proper cone. [x]² denotes the vector whose components are the squared components of x, i.e., ([x]²)^T = (x_1², x_2², . . . , x_n²). This formulation is actually quite general. For example, letting m = n, A = I, b = e, and K be the positive orthant shows that P^* and P_* are special cases of Q^* and Q_* (where e denotes a vector of ones,
appropriately dimensioned).
F.3.2
SDP relaxations
Nesterov made the following assumptions about this problem:
Assumption 1 The set F = {v ∈ Rn : Av ≤ b} is bounded.
Assumption 2 There exists v ∈ F such that v > 0.
It should be noted here that Nesterov assumed K was a proper
cone. He was also able to relax the full dimensionality of K, and
derive the same results.
He then considered the following relaxations (which he calls the
conic relaxations):
S^* = max C • X
subject to A diag(X) = b
           X ⪰ 0
           diag(X) ∈ K

and

S_* = min C • X
subject to A diag(X) = b
           X ⪰ 0
           diag(X) ∈ K
Nice duals exist for these formulations. Recall that the dual cone to K is K^* = {u ∈ R^n : u^T v ≥ 0, ∀v ∈ K}.

Lemma 30

S^* = D(S^*) = min b^T y
subject to Diag(A^T y) − Diag(u) ⪰ C
           u ∈ K^*

and

S_* = D(S_*) = max b^T y
subject to C ⪰ Diag(A^T y) + Diag(u)
           u ∈ K^*
Proof: Using the Lagrangian, and in view of the assumptions:

S^* = max_{X,v} {C • X : A diag(X) = b, diag(X) = v, X ⪰ 0, v ∈ K}
    = max_{X,v} min_{y∈R^m, u∈R^n} {C • X + y^T(b − A diag(X)) + u^T(diag(X) − v) : X ⪰ 0, v ∈ K}
    = min_{y∈R^m, u∈R^n} max_{X,v} {(C + Diag(u − A^T y)) • X + y^T b − u^T v : X ⪰ 0, v ∈ K}
    = min_{y∈R^m, u∈R^n} {y^T b : Diag(A^T y) − Diag(u) ⪰ C, u ∈ K^*},

where the passage from the second to the third equality follows because, for any two n × n matrices A, B, and n × 1 vector c, we have

A • B + c^T diag(B) = A • B + Diag(c) • B = (A + Diag(c)) • B.

The second result follows similarly.
F.4
The bound analysis via trigonometric forms

Consider any vector x feasible for Q_*, and let X = xx^T. Noting that diag(X) = [x]² shows immediately that S_* ≤ Q_*, i.e., S_* is indeed a relaxation. By the same logic, S^* ≥ Q^*. Thus we have the following relationship immediately:

Lemma 31

S^* ≥ Q^* ≥ Q_* ≥ S_*

Nesterov's main goal is to try and establish a better bound between Q^* and Q_*. To do this, he required another form for Q^* and Q_*.
F.4.1
Trigonometric form
In order to prove the validity of the forms, some lemmas are needed.
Nesterov emphasizes that the following techniques are based on the
Goemans-Williamson approach ([?]). Indeed, he uses the following
lemma from it:
Lemma 32 For a number a ∈ R, let sgn(a) = 1 if a ≥ 0, and
−1 otherwise. Given vectors x, y ∈ Rn , and unit vector u drawn
uniformly from the unit sphere,
Pr[sgn(x^T u) ≠ sgn(y^T u)] = (1/π) arccos(x^T y)

This lemma was proved in class (see lecture notes 1, Proposition 1), and in [?].
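As a quick numerical sanity check of Lemma 32 (an addition to these notes, not part of the original), the following Matlab sketch estimates the separation probability for two unit vectors by sampling random directions and compares it with arccos(x^T y)/π:

% Monte Carlo check of Lemma 32 for unit vectors x, y.
n = 5;  trials = 1e5;
x = randn(n,1); x = x/norm(x);
y = randn(n,1); y = y/norm(y);
cnt = 0;
for t = 1:trials
    u = randn(n,1); u = u/norm(u);       % u uniform on the unit sphere
    cnt = cnt + (sign(x'*u) ~= sign(y'*u));
end
fprintf('empirical: %.4f   arccos(x''y)/pi: %.4f\n', cnt/trials, acos(x'*y)/pi);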
For the following, for a vector v, let

σ(v) = (sgn(v_1), . . . , sgn(v_n))^T,

and let the matrix V = [v_1 . . . v_n], where each of the vectors v_i is of dimension n. For a feasible point x^* of problem P^*, let I^* = {i : x_i^* = 1}.
Lemma 33 For an arbitrary unit vector u ∈ R^n,

P^* = max (Cσ(V^T u))^T σ(V^T u)
subj. to ‖v_i‖ = 1,   i = 1, . . . , n
Proof: Denote the right hand side by µ. Letting x = σ(V^T u), we have x_i = ±1; thus, µ ≤ P^*. On the other hand, choose an arbitrary unit vector u, and for i ∈ I^* let v_i = u, and for i ∉ I^* let v_i = −u. Then σ(V^T u) = x^*, and hence µ ≥ P^*.
Now, let Eu [f(u)] be the average value of f(u) with u ranging over
the n-dimensional unit sphere.
Lemma 34

P^* = max E_u[(Cσ(V^T u))^T σ(V^T u)]
subj. to ‖v_i‖ = 1,   i = 1, . . . , n
Proof: Denote the right-hand side by ω. Then, letting f(u) = (Cσ(V^T u))^T σ(V^T u), and by Lemma 33, P^* ≥ ω. On the other hand,

ω = max_{‖v_i‖=1, i=1,...,n} Σ_{i=1}^n Σ_{j=1}^n c_ij E_u[sgn(v_i^T u) sgn(v_j^T u)],

where the c_ij are the components of C. Using the same technique, fix a unit vector y ∈ R^n, and let v_i = y for i ∈ I^*, and v_i = −y for i ∉ I^*. Then

E_u[sgn(v_i^T u) sgn(v_j^T u)] = { 1 if i, j ∈ I^* or i, j ∉ I^*;  −1 otherwise } = x_i^* x_j^*,

and ω ≥ (x^*)^T C x^* = P^* by the linearity of expectation.
We’re now ready to prove Nesterov’s main theorem. For any function, f, of one variable, and any matrix X, with components xij , let
f[X] denote the matrix whose components are f(xij ). e denotes the
vector of ones, appropriately dimensioned (in this case, of dimension
n).
Theorem 24

P^* = max (2/π) C • arcsin[X]
subj. to diag(X) = e
         X ⪰ 0

P_* = min (2/π) C • arcsin[X]
subj. to diag(X) = e
         X ⪰ 0
Proof: Choose an X ⪰ 0 which has diag(X) = e, and let V = X^{1/2}. Then x_ij = v_i^T v_j, and ‖v_i‖ = 1, i = 1, . . . , n. Conversely, for any V with ‖v_i‖ = 1, i = 1, . . . , n, we have V^T V ⪰ 0 and diag(V^T V) = e. Thus, by Lemma 34, we only need to show that

E_u[(Cσ(V^T u))^T σ(V^T u)] = (2/π) C • arcsin[X],

where X = V^T V. But, since

E_u[sgn(v_i^T u) sgn(v_j^T u)] = 1 − 2 Pr[sgn(v_i^T u) ≠ sgn(v_j^T u)],

we can use Lemma 32 to obtain

E_u[sgn(v_i^T u) sgn(v_j^T u)] = 1 − (2/π) arccos(v_i^T v_j) = (2/π) arcsin(v_i^T v_j).

The result follows from the linearity of E_u, and the fact that P_* = −P^* when C is replaced by −C.
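The identity in this proof is easy to test numerically. The following Matlab sketch (an addition to these notes, not from Nesterov's paper) compares a Monte Carlo estimate of E_u[(Cσ(V^T u))^T σ(V^T u)] with (2/π) C • arcsin[X] for a random X = V^T V with unit-norm columns:

% Numerical illustration of the identity behind Theorem 24.
n = 4;  trials = 2e5;
V = randn(n); V = V*diag(1./sqrt(sum(V.^2)));   % unit-norm columns, so diag(V'*V) = e
X = V'*V;
C = randn(n); C = (C + C')/2;
acc = 0;
for t = 1:trials
    u = randn(n,1); u = u/norm(u);
    s = sign(V'*u);                              % sigma(V'u)
    acc = acc + s'*C*s;
end
fprintf('Monte Carlo: %.4f   (2/pi) C.arcsin[X]: %.4f\n', acc/trials, (2/pi)*sum(sum(C.*asin(X))));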
For our purposes, we’re interested in the following corollary.
Corollary 13

Q^* = max (2/π) C • Diag(d) arcsin[X] Diag(d)
subj. to diag(X) = e
         X ⪰ 0
         A[d]² = b
         d ≥ 0
         [d]² ∈ K

Q_* = min (2/π) C • Diag(d) arcsin[X] Diag(d)
subj. to diag(X) = e
         X ⪰ 0
         A[d]² = b
         d ≥ 0
         [d]² ∈ K
Proof: Represent x ∈ R^n as

x = (d_1 s_1, . . . , d_n s_n)^T,

where d ≥ 0 and s_i = sgn(x_i), i = 1, . . . , n. So [x]² = [d]². Then Q^* = max_d {Φ(d) : d ≥ 0, A[d]² = b, [d]² ∈ K}, where

Φ(d) = max_s {(Diag(d) C Diag(d) s)^T s : s ∈ {−1, 1}^n}.

But by Theorem 24, we have

Φ(d) = max_X {(2/π) Diag(d) C Diag(d) • arcsin[X] : X ⪰ 0, diag(X) = e}.

This proves the statement for Q^*; the one for Q_* is completely analogous.
F.4.2
Some good facts about positive semidefinite matrices
This section is provided for completeness; the results are well known
(although Nesterov provided the proofs stated).
Lemma 35 For any symmetric matrices A, B ⪰ 0, the matrix C with entries c_ij = a_ij b_ij is also positive semidefinite.

Proof: Fix an arbitrary u ∈ R^n. Then Cu = diag(A Diag(u) B). Therefore,

u^T C u = u^T diag(A Diag(u) B) = A Diag(u) B • Diag(u) = A • Diag(u) B Diag(u) ≥ 0,

since the matrix Diag(u) B Diag(u) ⪰ 0.
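A quick numerical illustration of Lemma 35 (added here, not part of the original notes): the entrywise (Hadamard) product of two positive semidefinite matrices is again positive semidefinite.

% Hadamard product of two PSD matrices remains PSD.
n = 6;
A = randn(n); A = A*A';                  % PSD by construction
B = randn(n); B = B*B';                  % PSD by construction
fprintf('smallest eigenvalue of A.*B: %g\n', min(eig(A.*B)));   % nonnegative up to roundoff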
Corollary 14 Let X ⪰ 0. Then

1. [X]^k ⪰ 0 for any positive integer k.
2. If all |x_ij| ≤ 1 then arcsin[X] ⪰ X.

([X]^k is the matrix whose entries are each raised to the kth power.)

Proof: The first statement follows directly from Lemma 35. The second follows from the Taylor series for arcsin(x) at zero:

arcsin[X] = X + (1/2)·([X]³/3) + ((1·3)/(2·4))·([X]⁵/5) + · · · ,   |x_ij| ≤ 1.

F.4.3
The bounds

With these tools, the bounds can be refined. For α ∈ R, let

S(α) = αS^* + (1 − α)S_*
Theorem 25

S_* ≤ Q_* ≤ S(1 − 2/π) ≤ S(2/π) ≤ Q^* ≤ S^*
Proof: The first and last inequalities follow from Lemma 31. The middle one follows immediately from the fact that S^* ≥ S_*. We now prove the remaining two inequalities. Choose an arbitrary y ∈ R^m and u ∈ K^* that are feasible for D(S_*) (see Lemma 30). Thus,

(u, y) ∈ {(u, y) ∈ K^* × R^m : Diag(A^T y) + Diag(u) ⪯ C}.

Consider a pair (X, d) which satisfies the constraints of the trigonometric representation of Q^* (see Corollary 13):

X ⪰ 0, diag(X) = e, d ≥ 0, A[d]² = b, [d]² ∈ K.

Since X ⪰ 0 and |x_ij| ≤ 1, by Corollary 14, arcsin[X] ⪰ X. Then, using Corollary 13 we have:

Q^* ≥ (2/π) C • Diag(d) arcsin[X] Diag(d)
    = (2/π) Diag(d) C Diag(d) • arcsin[X]
    = (2/π) Diag(d)(C − Diag(u) − Diag(A^T y)) Diag(d) • arcsin[X] + ([d]²)^T (u + A^T y)
    ≥ (2/π) Diag(d)(C − Diag(u) − Diag(A^T y)) Diag(d) • X + ([d]²)^T (u + A^T y)
    = (2/π) C • Diag(d) X Diag(d) + (1 − 2/π)([d]²)^T (u + A^T y)
    ≥ (2/π) C • Diag(d) X Diag(d) + (1 − 2/π) b^T y.

Moreover, letting Y = Diag(d) X Diag(d), we have

Y ⪰ 0,
A diag(Y) = A diag(Diag(d) X Diag(d)) = A[d]² = b,
diag(Y) ∈ K,

that is, Y is feasible for S^*. Since every feasible point of S^* arises in this way, and since (u, y) was an arbitrary feasible point of D(S_*), taking the best choices of (X, d) and of (u, y) gives

Q^* ≥ (2/π) S^* + (1 − 2/π) S_* = S(2/π).

The other inequality is proved analogously.
F.5
Conclusion
Nesterov was able to improve on the quality of these bounds by using some additional information. He also generalized the assumption that the feasible region had full dimension. As he himself states, the quadratic “constraints are rather specific”. He further describes an example of a quadratic program with one linear constraint, for which the approximation scheme is infinitely bad. The interested reader is referred to [?].
Appendix G
Moment Cones and
Nonnegative Polynomials
Haengju Lee
December 12, 2001
This chapter was typeset in PowerPoint and then turned into a PDF file; the slides did not survive the conversion to text, so only an outline of the presentation is reproduced here.

The presentation is based on Yu. Nesterov's papers "Structure of Nonnegative Polynomials and Optimization Problems" and "Squared Functional Systems and Optimization Problems", and on the Handbook of Semidefinite Programming (H. Wolkowicz, R. Saigal, and L. Vandenberghe, editors). The slides cover: some notation and definitions; the cone of nonnegative univariate polynomials and its representation via positive semidefinite matrices; the dual of this cone and Hankel matrices; polynomials nonnegative on a ray and on a segment; squared functional systems and their representation theorems; sums of weighted squares; and moment problems in finance, namely optimal bounds on the price of a European call option given the mean and variance of the underlying stock price, with the corresponding primal, dual, and semidefinite formulations.
Appendix H
Polynomial Regression with
Shape Constraints
Author: Anton Riabov
December 12, 2001
H.1
Overview
The purpose of this project is to study semidefinite and
second order cone programming methods in application
to polynomial regression. Least-squares regression is widely
used to discover functional dependencies based on noisy
data. We describe how the least-squares approximation problem with polynomials can be formulated as an SOCP, and how additional shape constraints on the function can be incorporated into the cone program as semidefinite programming constraints. Finally we present computational results (obtained from simulated data using the SeDuMi SDP solver).
H.2
Polynomial Regression
H.2.1
Problem Formulation
Consider the least-squares polynomial regression problem in R². Given a set of m points {(p_j, q_j)}_{j=1}^m and a maximum polynomial degree n, we need to provide polynomial coefficients t := {t_i}_{i=0}^n such that for the function f(x) = Σ_{i=0}^n t_i x^i the corresponding squared error Σ_{j=1}^m (f(p_j) − q_j)² is minimized. Obviously, linear regression is a special case of polynomial regression. Additional shape constraints may be imposed on f(x), requiring, for example, convexity of f(x) or monotonicity of f(x) on R or on a certain interval. Often shape constraints help to obtain a better approximation of the underlying function.
Since off-the-shelf solvers are used in experiments, we
need to provide a formulation in a format that can be
recognized by these solvers. In this work we give formulations corresponding to the following “standard primal”
formulation of a cone program, which is recognized by
most solvers:
min  c_l^T x_l + Σ_{k=1}^{N_q} (c_q^k)^T x_q^k + Σ_{k=1}^{N_s} C_s^k • X_s^k

s.t. A_l x_l + Σ_{k=1}^{N_q} A_q^k x_q^k + Σ_{k=1}^{N_s} A_s^k · X_s^k = b
     x_l ≥ 0
     x_q^k ⪰_Q 0,   1 ≤ k ≤ N_q
     X_s^k ⪰ 0,     1 ≤ k ≤ N_s
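For concreteness, this block structure is what a solver such as SeDuMi expects. The following sketch (an addition to the report, with block sizes that are placeholders only) shows how the cone description K is assembled; it mirrors the K.q and K.s fields used in the code of Section H.5.

% SeDuMi cone description for the standard primal form above: the variable vector
% stacks x_l, then the quadratic-cone blocks, then the (vectorized) semidefinite blocks.
K.l = 5;           % number of nonnegative variables x_l      (placeholder size)
K.q = [4, 7];      % dimensions of the quadratic-cone blocks  (placeholder sizes)
K.s = [3, 6];      % orders of the semidefinite blocks        (placeholder sizes)
% [x, y, info] = sedumi(A, b, c, K);   % with A, b, c assembled to match this ordering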
H.2.2
Polynomial Regression as an SOCP
Polynomial regression is an easy problem, which reduces to solving a system of linear equations (just setting the derivative of the sum of squares to zero). However, if we want to add shape constraints on polynomials, such as convexity, the cone program formulation becomes very handy. As
we will see in the next subsection, shape constraints can
be expressed as an SDP, which can be given to a solver.
But first we will describe how the problem is formulated
for unconstrained regression.
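For comparison, here is a small sketch (an addition to the report) of the classical unconstrained solution via the least-squares system, assuming p and q are column vectors of the data and n is the chosen degree:

% Unconstrained least-squares fit directly from the matrix of powers,
% without any cone solver (for comparison with the SOCP formulation below).
m = length(p);
Xp = zeros(m, n+1);              % m x (n+1); the transpose of the matrix P defined below
for j = 0:n
    Xp(:, j+1) = p(:).^j;        % column j+1 holds p.^j
end
t = Xp \ q(:);                   % least-squares coefficients t_0, ..., t_n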
Let the matrix P ∈ R^{(n+1)×m} be the matrix of powers of the p_j:

P := (p_j^i)_{0≤i≤n, 1≤j≤m}.

Then the least-squares minimization problem of choosing polynomial coefficients t ∈ R^{n+1} can be written as:

min z
s.t. P^T t + s = q
     z ≥ ‖s‖

In this program the last constraint is a quadratic cone constraint: (z, s^T)^T ⪰_Q 0. We will be solving the dual problem using a solver, but to do this we need the dual program to be a minimization problem. Introduce a variable vector r ∈ R^{m+1} such that r̄ := s ∈ R^m and r_0 := z, and rewrite:

max −r_0
s.t. P^T t + r̄ = q
     r ⪰_Q 0                  (H.1)
Now we can write the dual SOCP. Let u ∈ R^m be the dual variable corresponding to the equality constraint.

min q^T u
s.t. Pu = 0
     [0; I]u + [1; 0] ⪰_{Q^*} 0

(here [0; I] stacks a zero row on top of the m × m identity, and [1; 0] ∈ R^{m+1}). Quadratic cones are self-dual, i.e. Q^* = Q. Introduce a variable v ∈ R^{m+1} and rewrite:

min q^T u
s.t. Pu = 0
     [1; 0] + [0; I]u = v
     v ⪰_Q 0

Note that from the constraints it follows that v̄ = u, and u can be removed from the program:

min q^T v̄
s.t. Pv̄ = 0
     v_0 = 1
     v ⪰_Q 0                  (H.2)

Note that the problem matrix is very dense, but the number of constraints, as well as the number of variables, grows linearly with the size of the problem.
H.2.3
Approximation With Non-Negative Polynomials
Additional constraints can be added to SOCP (H.1) in
order to restrict the shape of the polynomial. The first
example we will consider is approximation with nonnegative polynomials. We will require the polynomial f(x) := Σ_{i=0}^n t_i x^i ≥ 0 for all x ∈ R. We can assume, without loss of generality, that n = 2k for some integer k. It turns out that the coefficients t of such polynomials form a convex cone (denoted P_n in the lecture notes) in R^{n+1}.
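As a quick illustration of this cone (an addition, not part of the report): any polynomial of the form (1, x, . . . , x^k) Y (1, x, . . . , x^k)^T with Y positive semidefinite is a sum of squares, hence nonnegative on R, and expanding it yields a coefficient vector t in P_n. A small Matlab sketch:

% Build a nonnegative polynomial of degree n = 2k from a random PSD Gram matrix Y.
k = 3;  n = 2*k;
R = randn(k+1); Y = R*R';                    % random PSD Gram matrix
t = zeros(n+1, 1);                           % coefficients t_0, ..., t_n
for i = 0:k
    for j = 0:k
        t(i+j+1) = t(i+j+1) + Y(i+1, j+1);   % Y(i,j) contributes to the x^(i+j) term
    end
end
xs = linspace(-5, 5, 101);
fprintf('min of p over sample points: %g\n', min(polyval(flipud(t)', xs)));  % >= 0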
Now we can add the nonnegativity constraint on the polynomial defined by t to (H.1):

max −r_0
s.t. P^T t + r̄ = q
     r ⪰_Q 0
     t ⪰_{P_n} 0              (H.3)
Following the same steps as in the previous subsection, we can write the dual program, which now has a new variable w ∈ R^{n+1}:

min q^T v̄
s.t. Pv̄ − w = 0
     v_0 = 1
     v ⪰_Q 0
     w ⪰_{P_n^*} 0            (H.4)
We will use the results from section 3 of paper [2]
by Yu. Nesterov to define dual cones for cones of restricted polynomials. In this case, the relevant result
tells us that
P_{2k}^* = {w = (w_0, w_1, . . . , w_{2k})^T ∈ R^{2k+1} : Λ(w) ⪰ 0}.
where Λ(w) ∈ R^{(k+1)×(k+1)} is the Hankel matrix defined as:

Λ(w) := (w_{i+j})_{0≤i,j≤k} =
[ w_0      w_1      . . .  w_{k−1}   w_k      ]
[ w_1      w_2      . . .  w_k       w_{k+1}  ]
[ ...      ...      . . .  ...       ...      ]
[ w_{k−1}  w_k      . . .  w_{2k−2}  w_{2k−1} ]
[ w_k      w_{k+1}  . . .  w_{2k−1}  w_{2k}   ]
Therefore the last constraint in (H.4) is essentially a positive semidefinite constraint on Λ(w), combined with some additional constraints needed in order to define the elements of the matrix Λ(w). For most solvers, a total of (k+1)(k+2)/2 upper-diagonal elements of Λ(w) need to be defined, since the constraints forcing a matrix to be positive semidefinite force it to be symmetric as well.
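A small sketch (added for illustration) of how membership of a vector w of length 2k + 1 in P^*_{2k} can be checked numerically, using Matlab's built-in hankel function:

% Check whether w belongs to the dual cone P*_{2k} by testing Lambda(w) >= 0.
k = (length(w) - 1)/2;
Lambda = hankel(w(1:k+1), w(k+1:2*k+1));     % (k+1)x(k+1), entry (i,j) equals w_{i+j-2}
in_dual = min(eig(Lambda)) >= -1e-9;         % PSD up to a small numerical tolerance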
H.2.4
Monotonicity and Convexity Constraints
The polynomial function can be restricted to be nondecreasing or convex by restricting its derivatives to be non-negative. Since taking the derivative of f(x) with respect to x is just a linear transformation of t, this is easy to do. In fact, this operation does not even introduce new variables, if done carefully; see the source code in Section H.5 for more details.
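As a small sketch of this linear transformation (an addition to the report; the implementation in Section H.5 applies the analogous scaling to the power matrix instead), with t = (t_0, . . . , t_n)^T:

% Differentiation acts linearly on the coefficient vector: if f(x) = sum_i t(i+1)*x^i,
% then the coefficients of f'(x) are D*t, so a shape constraint on f' is the same
% kind of cone constraint on D*t as on t itself.
D = [zeros(n,1), diag(1:n)];     % n x (n+1): maps (t_0,...,t_n) to (1*t_1, 2*t_2, ..., n*t_n)
dt = D*t;                        % coefficients of the derivative polynomial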
The inverse constraints (concave or non-increasing functions) can be implemented by reversing the sign of w in
the equality constraint in (H.4).
H.3
Experiment Results
We are using a simulated data model, in which the underlying function is a quadratic: g(x) = 0.3(x − 10)² + 3. The
sample values q were generated for p ∈ {1, 2, ..., 30} and
to each of the function values an independent random
error following N(0, 12) was added. Another experiment
involved an exponential function with the same added
noise structure.
H.3.1
Unconstrained Polynomial Regression
The results in figures H.1 and H.2 were obtained by solving SOCP (H.2) on this data using the SeDuMi SDP solver. The dotted line shows the original function, and the continuous line is the approximating polynomial. On both data sets, for n = 8 SeDuMi encounters some non-critical numerical problems, but finds a solution. For n ≥ 9 SeDuMi is unable to find a solution because of numerical problems. Based on this experiment, we cannot expect SeDuMi to be able to solve the same problem with added shape constraints without running into numerical problems, even for polynomials of degree lower than 8.
H.3.2
Shape-Constrained Polynomial Regression
For positive-constrained polynomials, as in SDP (H.4), we have subtracted 30 from the simulated data. The plots in figure H.3 show that polynomials of degree less than 6 were successfully constrained to be positive. But for degree 6 SeDuMi encountered some numerical problems,
and the solution crosses the horizontal axis.

[Figure H.1: Unconstrained Regression (Quadratic Function). Panels show the fitted polynomials for degrees N = 1 through N = 8.]

[Figure H.2: Unconstrained Regression (Exponential Function). Panels show the fitted polynomials for degrees N = 1 through N = 8.]

In this case
small numerical problems in the primal lead to an infeasible dual solution. For degree 8, non-critical numerical problems result in an even worse solution. The problem for degree 10 and higher could not be solved due to critical numerical problems.

The positive derivative condition is illustrated in figure H.4. The polynomial is forced to be increasing. Small numerical problems occur for polynomials of degree 7, and SeDuMi cannot go higher than 7. The dotted curve is the result of unconstrained polynomial regression (from figure H.1). Because of the numerical problems, the curve goes slightly down for n = 7 in the beginning.

The last series of plots (on figures H.5 and H.6) demonstrates the convexity constraints. The second derivative of the polynomial is restricted to be positive on R_+. The curves obtained without this constraint are given (dashed line) for comparison. Dotted lines show the original function. Some numerical problems occur on polynomials of degree 8, and SeDuMi cannot go higher than 8 on this data.
H.4
Conclusion
Off-the-shelf solvers like SeDuMi cannot solve the problem well enough. However, special algorithms that exploit the specific structure of Λ(w) can be developed. This matrix of size (k + 1) × (k + 1) actually has only 2k + 1 distinct entries, which define the entire matrix. Using different orthogonal systems of polynomials can also help to increase the maximum polynomial degree for which the problem can be solved.
[Figure H.3: Positive Constrained Regression. Panels show degrees N = 2, 4, 6, 8. SeDuMi starts running into numerical problems with polynomials of degree 6; this results in a violation of the non-negativity constraint, which becomes very clear for degree 8.]

[Figure H.4: Non-Negative Derivative Constrained Regression. Panels show degrees N = 3, 5, 7. The dotted curve is the result of unconstrained polynomial regression (from figure H.1); small numerical problems for n = 7 result in the curve going slightly down.]

[Figure H.5: Regression With Convex Polynomials (Quadratic Function). Panels show degrees N = 6 and N = 8. Solid line: regression with convex polynomials; dotted line: original function; dashed line: polynomial regression without shape constraints (from figure H.1).]

[Figure H.6: Regression With Convex Polynomials (Exponential Function). Panels show degrees N = 6 and N = 8. Solid line: regression with convex polynomials; dotted line: original function; dashed line: polynomial regression without shape constraints (from figure H.2).]
Bibliography
[1] F. Alizadeh. Lecture notes for the Second Order Cone and Semidefinite Programming Seminar.

[2] Yu. Nesterov. Squared functional systems and optimization problems. In J.B.G. Frenk, C. Roos, T. Terlaky, and S. Zhang (editors), High Performance Optimization, pages 405–440. Kluwer Academic Publishers, 2000.

[3] J. Sturm. Using SeDuMi 1.02, a Matlab toolbox for optimization over convex cones. http://fewcal.kub.nl/~sturm/
H.5
The Source Code
This is the complete Matlab source code used for the simulations. The function requires the SeDuMi toolbox to be installed on the system.
% [t,pt] = sdpfit(x,y,n,derivative,positive)
%
% this function fits a positive polynomial of degree 'n' to the
% set of random points given in vectors 'x' and 'y' using SeDuMi SDP code.
%
% returns 't' -- the polynomial coefficients, starting with t(0),
% and 'pt' -- the points corresponding to x-es, y-approximation.
%
% 'derivative' can be 0,1,2 to indicate the derivative which is required to be
% positive or negative, depending on the sign of 'positive'. If 'positive' is 0,
% no positive/negative condition on polynomial is enforced.
%
function [t,pt] = sdpfit(x,y,n,derivative,positive)
fprintf('Validating parameters..\n');
% parameter validation
m = validate_params_and_get_size(x,y,n);
%% we are going to use standard primal SeDuMi formulation
fprintf('Generating constraints..\n');
if positive==0
%% no positive polynomial constraints, simple regression
c = [0; y];
X=get_power_matrix_x(x,n);
A=[zeros(n+1,1), X'];
A=[A; 1, zeros(1,m)];
%create vector b
b=[zeros(n+1,1);1];
% call sedumi
[a_n,a_m]=size(A);
fprintf('Calling solver -- %d constraints and %d variables.\n',a_n,a_m);
K.q=[m+1];
[xx,yy,info]=sedumi(A,b,c,K);
t=yy(1:n+1);
else
%% enforcing positive polynomial constraints using moment cones
w1_side_size = floor((n-derivative)/2)+1;
w1_size = w1_side_size*w1_side_size;
c = [0; y; zeros(w1_size,1)];
% the matrix is very sparse, but we are ignoring it here,
% since the reasonable size does not seem too big, and
% depends mostly on the degree of the polynomial and
% the number of points.
X=get_power_derivative_matrix_x(x,n,derivative);
M1 = zeros(n+1,w1_size);
t2=get_corner_indices(w1_side_size);
j=1;
for i=1+derivative:n+1
M1(i,t2(j)) = positive;
j=j+1;
end
if any(M1*ones(w1_size,1)~=[zeros(derivative,1);positive*ones(n+1-derivative,1)])
error('Post-condition on power-derivative constraints failed. Check the code.');
end
A=[zeros(n+1,1), X', -M1]; b=zeros(n+1,1);
M1=[];
A=[A; 1, zeros(1,m+w1_size)]; b=[b;1];
%% add equality constraints to form matrix Lambda(w).
% w1 cross diagonal constraints
M2 = get_cross_diagonal_constraints(w1_side_size);
[m2_n,m2_m]=size(M2);
fprintf(' generated %d cross-diagonal constraints for W1.\n', m2_n);
A = [A;zeros(m2_n,m+1), M2]; b=[b;zeros(m2_n,1)];
M2=[];
% call sedumi
[a_n,a_m]=size(A);
fprintf('Calling SeDuMi -- %d constraints and %d variables.\n',a_n,a_m);
K.q=[m+1];
K.s=[w1_side_size];
[xx,yy,info]=sedumi(A,b,c,K);
t=yy(1:n+1);
% restore the polynomial coefficients
if derivative>=1
for j=2:n
t(j+1)=t(j+1)/j;
end
if derivative>=2
for j=3:n
t(j+1)=t(j+1)/(j-1);
end
end
end
end
%% find the corresponding approximated points for the given values of ’x’
for i=1:m
pt(i)=t(1);
for j=1:n
pt(i)=pt(i)+t(j+1)*x(i)^j;
end
end
%% check the results
if info.numerr==2
fprintf('Results are absolutely untrustworthy, complete numeric failure.\n');
elseif info.numerr==1
fprintf('Some non-critical numeric problems found.\n');
else
fprintf('No numerical problems found.\n');
end
if info.pinf==0 & info.dinf==0
fprintf('The optimal solution is discovered.\n');
else
fprintf('Could not find the optimum.\n');
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function m=validate_params_and_get_size(x,y,n)
[m,m1]=size(x);
if m1~=1
error('parameter x must be a vertical vector');
return;
end
[m_2,m_3]=size(y);
if (m_2~=m) | (m1~=m_3)
error('parameters x and y must have the same dimensions');
return;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function X=get_power_matrix_x(x,n)
[m,m_1]=size(x);
X = zeros(m,n+1);
for i=1:m
for j=0:n
X(i,j+1)=x(i)^j;
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function X=get_power_derivative_matrix_x(x,n,derivative)
[m,m_1]=size(x);
X = zeros(m,n+1);
for i=1:m
for j=0:n
X(i,j+1)=x(i)^j;
end
end
if derivative>=1
for i=1:m
for j=2:n
X(i,j+1)=X(i,j+1)/j;
end
end
if derivative>=2
for i=1:m
for j=3:n
X(i,j+1)=X(i,j+1)/(j-1);
end
end
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% i and j start with 0, returned index starts with 1
function ndx=vec_index(i,j,side)
if i>=side | j>=side | i<0 | j<0
error('Index values out of bounds in vec_index()');
end
ndx=i+side*j+1;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function M=get_cross_diagonal_constraints(side_size)
M = zeros(0,side_size*side_size);
% before crossing main diagonal
for sm=2:side_size-1
for i=floor(sm/2):-1:1
j = sm-i;
t1 = zeros(1,side_size*side_size);
t1(1,vec_index(0,sm,side_size))=-1;
t1(1,vec_index(i,j,side_size))=1;
M=[M; t1];
end
end
% after crossing main diagonal
for sm=side_size:2*side_size-4
for j=floor((sm+1)/2):side_size-2
i = sm-j;
t1 = zeros(1,side_size*side_size);
t1(1,vec_index(sm-(side_size-1),side_size-1,side_size))=-1;
t1(1,vec_index(i,j,side_size))=1;
M=[M; t1];
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function ind=get_corner_indices(side_size)
t1 = zeros(side_size,side_size);
t1(1,1:side_size) = ones(1,side_size);
t1(1:side_size,side_size) = ones(side_size,1);
ind=find(vec(t1));