Application of Quantum Computing principles to Natural Language Processing

B.Tech Project Report
Submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology (Honors)
by
Vipul Singh
Roll No : 100050057
under the guidance of
Prof. Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Acknowledgement
First and foremost, I express my sincere gratitude towards my guide Prof. Pushpak Bhattacharyya for his guidance and for the freedom he has given us in our research work. He is a daily source of inspiration for me to strive harder in the pursuit of my research goals.
Next, I would like to thank Prof. Pranab Sen, Tata Institute of Fundamental Research, Mumbai and Prof. Avatar Tathagat Tulsi, IIT Bombay for their valuable inputs on our work and help with designing the Quantum Viterbi algorithm. Next, I am really thankful to my batchmate Dikkala Sai Nishanth for being a great colleague in this journey of learning. I am grateful to him for being a co-operative co-learner and partner in this project. Last but not least, I would like to thank my family, friends and teachers for their love and kind support.
Abstract
The discovery of quantum mechanics has led to some radical changes in the theory of computation. A quantum theory of computing has emerged and has been applied to give fascinating theoretical results, even for problems considered intractable classically. With quantum computers being a part of the foreseeable future, it is worthwhile to examine whether they can speed up the existing algorithms for common tasks in Natural Language Processing (NLP).
This thesis describes the principles on which quantum computing is based, namely qubits, their superposition and the process of measurement after the application of quantum operations or gates, and also some of the above-mentioned results and algorithms. Then, we explore some search methods pertaining to Machine Learning and Natural Language Processing and see if these can be integrated into the world of quantum computing. Of particular interest to us has been the problem of Part-of-Speech (POS) tagging, for which we develop a quantum counterpart to the classical Viterbi algorithm. We provide results pertaining to our implementation of the same on the British National Corpus (BNC).
Closely related to POS tagging is machine translation among similar languages, for which our quantum counterpart gives a large reduction in the running time of the Viterbi algorithm. Following this, we foray into the realm of quantum ideas applied to other intelligence tasks, for example, quantum random walks for the A-star search algorithm.
Contents
1 Introduction
    1.1 Motivation
    1.2 Aim of the Thesis
    1.3 Experimental Setup
    1.4 Road Map
2 Quantum Computing Principles
    2.1 Qubit - The Quantum Bit
        2.1.1 Bits vs. Qubits
        2.1.2 Superposition
        2.1.3 Representation
    2.2 Quantum States
        2.2.1 Entanglement
        2.2.2 Registers
    2.3 Operators - Quantum Gates
        2.3.1 Reversible Logic Gates
        2.3.2 Matrix Operator Correspondence
        2.3.3 Commonly used gates
        2.3.4 Quantum Fourier Transform
    2.4 Measurement in Quantum Mechanics
        2.4.1 A Qualitative Overview
        2.4.2 The Quantitative Overview
        2.4.3 Collapsing of States
3 Classical Optimization and Search Techniques
    3.1 Hidden Markov Model
        3.1.1 Stochastic Process
        3.1.2 Markov Property and Markov Modelling
        3.1.3 The urn example
        3.1.4 Formal Description of the Hidden Markov Model
        3.1.5 The Trellis Diagram
        3.1.6 Formulating the Part-of-Speech tagging problem using HMM
        3.1.7 The Viterbi Algorithm
        3.1.8 Pseudocode
    3.2 Maximum Entropy Approach
        3.2.1 Entropy - Thermodynamic and Information
        3.2.2 The Maximum Entropy Model
        3.2.3 Application to Statistical Machine Learning
    3.3 The ME Principle and a Solution
        3.3.1 Proof for the ME Formulation
        3.3.2 Generalized Iterative Scaling
    3.4 Improved Iterative Scaling
        3.4.1 The Model in parametric form
        3.4.2 Maximum Likelihood
        3.4.3 The objective to optimize
        3.4.4 Deriving the iterative step
    3.5 Swarm Intelligence
        3.5.1 Foundations
        3.5.2 Example Algorithms and Applications
        3.5.3 Case Study: Ant Colony Optimization applied to the NP-hard Travelling Salesman Problem
    3.6 Boltzmann Machines
        3.6.1 Structure
        3.6.2 Probability of a state
        3.6.3 Equilibrium State
4 Some popular Quantum Computing Ideas
    4.1 Deutsch-Jozsa Algorithm
        4.1.1 Problem Statement
        4.1.2 Motivation and a Classical Approach
        4.1.3 The Deutsch Quantum Algorithm
    4.2 Shor's Algorithm
        4.2.1 The factorization problem
        4.2.2 The integers mod n
        4.2.3 A fast classical algorithm for modular exponentiation
        4.2.4 Reduction of the Factorization problem
        4.2.5 The Algorithm
        4.2.6 An example factorization
    4.3 Grover's Algorithm
        4.3.1 The search problem
        4.3.2 The Oracle
        4.3.3 The Grover Iteration
        4.3.4 Performance of the algorithm
        4.3.5 An example
    4.4 The Quantum Minimum Algorithm
        4.4.1 The Problem
        4.4.2 The Algorithm
        4.4.3 Running Time and Precision
    4.5 Quantum Walks
        4.5.1 Random Walks
              An example: A one-dimensional random walk
        4.5.2 Terminology used with Random Walks
        4.5.3 Quantum Analogue: Quantum Markov Chains or Quantum Walks
        4.5.4 Application to Element-Distinctness Problem
5 Quantum Computing and Intelligence Tasks
    5.1 Quantum Classification
        5.1.1 Learning in a Quantum World
        5.1.2 The Helstrom Oracle
        5.1.3 Binary Classification
        5.1.4 Weighted Binary Classification
    5.2 Quantum Walk for A-star search
        5.2.1 The A∗ Algorithm
              The Heart of A∗: The Heuristic Function
        5.2.2 A Quantum Approach?
6 The Quantum Viterbi
    6.1 The Approach
        6.1.1 Can Grover be used?
    6.2 The Algorithm
        6.2.1 The Classical Version
        6.2.2 Quantum exponential searching
        6.2.3 The Grover Iteration
        6.2.4 The Quantum Approach to Viterbi
    6.3 Experimental Results
        6.3.1 Implementation
        6.3.2 Results
        6.3.3 Tag-wise Precision and Recall Analysis
        6.3.4 Concluding Remarks
7 Machine Translation among Close Languages
    7.1 Machine Translation
        7.1.1 What is machine translation?
        7.1.2 How does machine translation work?
        7.1.3 Advantages of machine translation
    7.2 Similarity to POS tagging for close languages
        7.2.1 The izafat phenomenon
    7.3 Phrase-Book Translation
    7.4 Experiments and Results
        7.4.1 Training corpus
        7.4.2 Issues
        7.4.3 Results
        7.4.4 Analysis
8 Conclusions
9 Future Work
List of Figures
2.1 Sphere representation for a qubit in the state: $\alpha = \cos\frac{\theta}{2}$ and $\beta = e^{i\phi}\sin\frac{\theta}{2}$
2.2 Circuit representation of Hadamard, CNOT and Toffoli gates, respectively
3.1 An example Hidden Markov Model with three urns
3.2 An Example Trellis
3.3 The Pareto Optimal frontier is the set of hollow points. Operational decisions must be restricted along this set if operational efficiency is to be maintained
3.4 The Pareto hypervolume
3.5 Search process for m=1000 ants
3.6 Search process for m=5000 ants
3.7 Graphical representation for a Boltzmann machine with a few labelled weights
5.1 Start State for the 8-puzzle problem
5.2 Goal State for the 8-puzzle problem
Chapter 1
Introduction
1.1 Motivation
With the development of quantum mechanics, new paradigms have opened up in many sciences. One such paradigm is a novel way of performing computation directly using quantum mechanical principles. Quantum computing looks at the act of computing from a viewpoint that is radically different from classical theories of computation, the most popular among the latter being the Turing machine model. Once a quantum theory of computing was developed, the next task was to develop algorithms for various problems using quantum computing, which brings us to the focus of this thesis. A close relation between quantum mechanics, natural language processing and the functioning of the mind has been proposed by many previous works [4, 8]. In this thesis, we study further how quantum computing can be used to give efficient algorithms for common problems in NLP.
1.2 Aim of the Thesis
The aim of the thesis was to develop quantum computing algorithms for classical tasks in Natural Language Processing. Although these algorithms would need a quantum computer to be implemented in practice, here we intend to perform a theoretical study of such algorithms, ignoring the implementation part for the time being, as is the case with most existing quantum algorithms such as Grover [3] and Shor [2]. One problem we studied was that of part-of-speech tagging. Given labeled data (sentences with the part-of-speech tags of the words given), a popular classical algorithm for solving this is the Viterbi algorithm [13], which assumes that the data follows a bigram Markov model. In Chapter 6 we present a quantum version of the Viterbi algorithm which runs faster than the classical Viterbi algorithm.
1.3 Experimental Setup
We also present a discussion of accuracy and precision results obtained from running a classical simulation of the quantum Viterbi algorithm on the BNC English Corpus, which has a large tag set of 57 tags. The experimental setup is briefly described here. Firstly, since we are simulating a quantum algorithm on a classical machine, we face an exponential blow-up in time, which will be described in detail in Chapter 6. To combat this blow-up, we ran the algorithm for different folds of the corpus as different processes on a multi-core machine after performing compiler optimizations. Also, we had to cut down on the suggested number of iterations in the corpus to reduce execution time, losing some amount of accuracy in the process.
1.4 Road Map
The layout of this thesis is as follows. First, the principles of quantum computing are presented, which the familiar reader can skip. This chapter is followed by a broad analysis of various search and optimization techniques used classically in NLP. The intent of this section is two-fold. First, the unfamiliar reader can become familiar with the presented techniques. Second, we study the theoretical aspects of these techniques in detail to gain deeper insight into how quantum versions of them might be developed.
Then we go on to study some popular quantum computing algorithms for classical problems such as factoring and searching in an unsorted database. We present the quantum nature of these algorithms and where they differ from the limitations imposed by classical computation models.
After this, we move on to look at some recent work in quantum algorithms for classification and at the development of a quantum A∗ algorithm. Next, we present the chapter on the Quantum Viterbi algorithm, where we give the quantum algorithm we have developed and describe the performance of a simulation of the algorithm for Part-of-Speech tagging.
Next, in Chapter 7, we look at the problem of Machine Translation, which is a hallmark problem of NLP. We note that machine translation between close languages such as Hindi-Urdu and Hindi-Marathi is simplified by the fact that most sentence translations are simply word-by-word replacements, which allows the problem to be modelled as a POS tagging problem and makes it amenable to the Viterbi algorithm. We also present a study we have undertaken with a relatively small parallel corpus of Hindi-Urdu sentences and present the results of the quantum Viterbi approach to machine translation on this corpus. We end with the conclusions and insights we have gathered from our study and the direction in which future work will proceed.
Chapter 2
Quantum Computing Principles
The massive amount of processing power generated by computer manufacturers
has not yet been able to quench our thirst for speed and computing capacity. In
1947, American computer engineer Howard Aiken said that just six electronic digital computers would satisfy the computing needs of the United States. Others have
made similar errant predictions about the amount of computing power that would
support our growing technological needs. Of course, Aiken didn’t count on the
large amounts of data generated by scientific research, the proliferation of personal
computers or the emergence of the Internet, which have only fuelled our need for
more, more and more computing power.
Will we ever have the amount of computing power we need or want? If, as
Moore’s Law states, the number of transistors on a microprocessor continues to
double every 18 months, the year 2020 or 2030 will find the circuits on a microprocessor measured on an atomic scale. And the logical next step will be to
create quantum computers, which will harness the power of atoms and molecules
to perform memory and processing tasks. Quantum computers have the potential
to perform certain calculations significantly faster than any silicon-based computer.
Scientists have already built basic quantum computers that can perform certain
calculations; but a practical quantum computer is still years away. In this chapter,
we explore what a quantum computer is and how it operates.
2.1 Qubit - The Quantum Bit
In quantum computing, a qubit or quantum bit is a unit of quantum information, the quantum analogue of the classical bit.
2.1.1 Bits vs. Qubits
A bit is the basic unit of information. It is used to represent information by computers. Regardless of its physical realization, a bit is always understood to be either a 0 or a 1. An analogy to this is a light switch, with the off position representing 0 and the on position representing 1.
A qubit is a two-state quantum-mechanical system, such as the polarization of a single photon: here the two states are vertical polarization and horizontal polarization. It has a few similarities to a classical bit, but is overall very different. Like a bit, a qubit can have two possible values, normally a 0 or a 1. The difference is that whereas a bit must be either 0 or 1, a qubit can be 0, 1, or a superposition of both.
2.1.2 Superposition
Think of a qubit as an electron in a magnetic field. The electron's spin may be either in alignment with the field, which is known as a spin-up state, or opposite to the field, which is known as a spin-down state. Changing the electron's spin from one state to another is achieved by using a pulse of energy, such as from a laser; let's say that we use 1 unit of laser energy. But what if we only use half a unit of laser energy and completely isolate the particle from all external influences? According to quantum law, the particle then enters a superposition of states, in which it behaves as if it were in both states simultaneously. Each qubit utilized could take a superposition of both 0 and 1.
The principle of quantum superposition states that if a physical system may be in one of many configurations (arrangements of particles or fields), then the most general state is a combination of all of these possibilities, where the amount in each configuration is specified by a complex number.
2.1.3 Representation
The two states in which a qubit may be measured are known as basis states (or basis vectors). As is the tradition with any sort of quantum states, Dirac, or bra-ket, notation is used to represent them. This means that the two computational basis states are conventionally written as $|0\rangle$ and $|1\rangle$ (pronounced "ket 0" and "ket 1"). A pure qubit state is a linear quantum superposition of the basis states. This means that the qubit can be represented as a linear combination of $|0\rangle$ and $|1\rangle$:
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$$
where α and β are probability amplitudes and can in general both be complex numbers.
The possible states for a single qubit can be visualised using a Bloch sphere as shown in Figure 2.1 (see footnote 1). Represented on such a sphere, a classical bit could only be at the "North Pole" or the "South Pole", in the locations where $|0\rangle$ and $|1\rangle$ are, respectively. The rest of the surface of the sphere is inaccessible to a classical bit, but a pure qubit state can be represented by any point on the surface. For example, the pure qubit state $\frac{|0\rangle + i|1\rangle}{\sqrt{2}}$ would lie on the equator of the sphere, on the positive y-axis.
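The representation above can be sketched in a few lines of Python. This is only an illustration: the function names `qubit_from_angles` and `bloch_vector` are hypothetical, and the parametrisation $\alpha = \cos\frac{\theta}{2}$, $\beta = e^{i\phi}\sin\frac{\theta}{2}$ is the one used in the caption of Figure 2.1.

```python
import cmath
import math

def qubit_from_angles(theta, phi):
    """Return (alpha, beta) for the Bloch-sphere angles theta and phi."""
    alpha = complex(math.cos(theta / 2))
    beta = cmath.exp(1j * phi) * math.sin(theta / 2)
    return alpha, beta

def bloch_vector(alpha, beta):
    """Return the (x, y, z) point on the Bloch sphere for a normalised state."""
    x = 2 * (alpha.conjugate() * beta).real
    y = 2 * (alpha.conjugate() * beta).imag
    z = abs(alpha) ** 2 - abs(beta) ** 2
    return x, y, z

# The state (|0> + i|1>)/sqrt(2) corresponds to theta = pi/2, phi = pi/2.
alpha, beta = qubit_from_angles(math.pi / 2, math.pi / 2)
norm = abs(alpha) ** 2 + abs(beta) ** 2   # squared magnitudes sum to 1
point = bloch_vector(alpha, beta)         # lies on the positive y-axis
```

Under these assumptions, `point` comes out (up to rounding) as (0, 1, 0), matching the example in the text, while θ = 0 gives the North Pole state $|0\rangle$.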
Figure 2.1: Sphere representation for a qubit in the state: $\alpha = \cos\frac{\theta}{2}$ and $\beta = e^{i\phi}\sin\frac{\theta}{2}$
2.2 Quantum States
2.2.1 Entanglement
An important distinguishing feature between a qubit and a classical bit is that multiple qubits can exhibit quantum entanglement. Entanglement is a non-local property that allows a set of qubits to express higher correlation than is possible in classical systems. Take, for example, two entangled qubits in the Bell state $\frac{1}{\sqrt{2}}(|00\rangle + |11\rangle)$.
Imagine that these two entangled qubits are separated, with one each given to Alice and Bob. Alice makes a measurement of her qubit, obtaining $|0\rangle$ or $|1\rangle$. Because of the qubits' entanglement, Bob must now get exactly the same measurement as Alice; i.e., if she measures a $|0\rangle$, Bob must measure the same, as $|00\rangle$ is the only state where Alice's qubit is a $|0\rangle$.
Footnote 1. Source: http://en.wikipedia.org/wiki/Bloch_sphere
This is a real phenomenon (Einstein called it "spooky action at a distance") that has no explanation in classical physics; within quantum mechanics it is simply taken as given. Quantum entanglement means that the measurement outcomes of qubits separated by incredible distances remain correlated, with the correlation appearing instantaneously, although this cannot be used to send information faster than the speed of light. No matter how great the distance between the correlated particles, they will remain entangled as long as they are isolated.
Entanglement also allows multiple states (such as the Bell state mentioned above) to be acted on simultaneously, unlike classical bits that can only have one value at a time. Entanglement is a necessary ingredient of any quantum computation that cannot be done efficiently on a classical computer. Many of the successes of quantum computation and communication, such as quantum teleportation and superdense coding, make use of entanglement, suggesting that entanglement is a resource that is unique to quantum computation.
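The Alice-and-Bob example can be mimicked with a small classical sampling sketch. It only reproduces the outcome statistics of measuring the Bell state in the computational basis and is not a model of the underlying physics; the names `BELL`, `LABELS` and `measure` are illustrative assumptions.

```python
import math
import random

# Amplitudes of the Bell state (|00> + |11>)/sqrt(2) in the order 00, 01, 10, 11.
BELL = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
LABELS = ["00", "01", "10", "11"]

def measure(state, rng):
    """Sample one joint outcome with probability |amplitude|^2."""
    probabilities = [abs(a) ** 2 for a in state]
    return rng.choices(LABELS, weights=probabilities, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
outcomes = [measure(BELL, rng) for _ in range(1000)]

# Alice holds the first qubit and Bob the second; their bits always agree.
always_equal = all(o[0] == o[1] for o in outcomes)
```

Every sampled outcome is either 00 or 11, so Alice's and Bob's results agree in each run even though neither result is fixed in advance.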
2.2.2 Registers
A number of qubits taken together is a qubit register. Quantum computers perform calculations by manipulating qubits within a register. An example of a 3-qubit register:
Consider first a classical computer that operates on a three-bit register. The state of the computer at any time is a probability distribution over the $2^3 = 8$ different three-bit strings 000, 001, 010, 011, 100, 101, 110, 111. If it is a deterministic computer, then it is in exactly one of these states with probability 1. However, if it is a probabilistic computer, then there is a possibility of it being in any one of a number of different states. We can describe this probabilistic state by eight non-negative numbers A, B, C, D, E, F, G, H (where A = probability computer is in state 000, B = probability computer is in state 001, etc.). There is a restriction that these probabilities sum to 1.
The state of a three-qubit quantum computer is similarly described by an eight-dimensional vector (a, b, c, d, e, f, g, h), called a ket. However, instead of the coefficients adding up to one, the sum of the squares of the coefficient magnitudes, $|a|^2 + |b|^2 + \dots + |h|^2$, must equal one. Moreover, the coefficients can have complex values. Since these complex-valued coefficients are probability amplitudes, whose absolute squares give the probabilities of the corresponding states, the phase between any two coefficients (states) represents a meaningful parameter, which presents a fundamental difference between quantum computing and probabilistic classical computing.
Now, an eight-dimensional vector can be specified in many different ways depending on what basis is chosen for the space. The basis of bit strings (e.g., 000, 001, ..., 111) is known as the computational basis. Any other set of eight unit-length, mutually orthogonal vectors is also a possible basis. Ket notation is often used to make the choice of basis explicit.
For example, the state (a, b, c, d, e, f, g, h) in the computational basis can be written as:
$$a|000\rangle + b|001\rangle + c|010\rangle + d|011\rangle + e|100\rangle + f|101\rangle + g|110\rangle + h|111\rangle$$
where, e.g., $|010\rangle = (0, 0, 1, 0, 0, 0, 0, 0)$.
Similarly, the computational basis for a single qubit (two dimensions) is $|0\rangle = (1, 0)$ and $|1\rangle = (0, 1)$.
Taken together, quantum superposition and entanglement create an enormously enhanced computing power. Where a 2-bit register in an ordinary computer can store only one of four binary configurations (00, 01, 10, or 11) at any given time, a 2-qubit register in a quantum computer can be in a superposition of all four configurations simultaneously. As more qubits are added, the number of configurations that can be held in superposition grows exponentially.
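As a small illustration of the register description in this section, the sketch below stores a 3-qubit state as a list of eight complex amplitudes and checks the normalisation condition. The particular amplitudes and the helper names `basis_ket` and `probabilities` are assumptions made for the example, not values from the thesis.

```python
import math

def basis_ket(bits):
    """Computational-basis vector for a bit string, e.g. '010' -> (0,0,1,0,0,0,0,0)."""
    index = int(bits, 2)
    return [1.0 if i == index else 0.0 for i in range(2 ** len(bits))]

def probabilities(state):
    """Measurement probabilities |c_i|^2 of a normalised state vector."""
    return [abs(c) ** 2 for c in state]

# A hypothetical 3-qubit state: equal magnitudes but different phases.
amplitude = 1 / math.sqrt(8)
state = [amplitude, -amplitude, 1j * amplitude, amplitude,
         amplitude, -1j * amplitude, amplitude, -amplitude]

total = sum(probabilities(state))   # sum of squared magnitudes, should be 1
plain_sum = sum(state)              # the amplitudes themselves need not sum to 1
ket_010 = basis_ket("010")
```

Here the squared magnitudes sum to one while the amplitudes themselves do not, which is exactly the difference from a classical probability vector noted above.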
2.3 Operators - Quantum Gates
2.3.1 Reversible Logic Gates
Ordinarily, in a classical computer, the logic gates other than the NOT gate are not reversible. Thus, for instance, for an AND gate one cannot recover the two input bits from the output bit; for example, if the output bit is 0, we cannot tell from this whether the input bits are 0,1 or 1,0 or 0,0.
In quantum computing, and specifically the quantum circuit model of computation, a quantum gate (or quantum logic gate) is a basic quantum circuit operating on a small number of qubits. Quantum gates are the building blocks of quantum circuits, as classical logic gates are for conventional digital circuits. Unlike many classical logic gates, quantum logic gates are reversible. However, classical computing can be performed using only reversible gates. For example, the reversible Toffoli gate can implement all Boolean functions. This gate has a direct quantum equivalent, showing that quantum circuits can perform all operations performed by classical circuits.
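The claim about the Toffoli gate can be checked on classical bits with a short sketch. It assumes the usual definition of the gate, $(a, b, c) \mapsto (a, b, c \oplus ab)$, which is also given later in Section 2.3.3; the helper `nand` is a hypothetical name used to show that a universal classical gate can be obtained by fixing the target bit to 1.

```python
from itertools import product

def toffoli(a, b, c):
    """Reversible Toffoli gate on classical bits: flips c when a and b are both 1."""
    return a, b, c ^ (a & b)

def nand(a, b):
    """NAND computed reversibly: the target bit starts at 1."""
    return toffoli(a, b, 1)[2]

inputs = list(product([0, 1], repeat=3))
outputs = [toffoli(*bits) for bits in inputs]

# Reversibility: the gate is a permutation of the 8 inputs and is its own inverse.
is_permutation = sorted(outputs) == inputs
is_self_inverse = all(toffoli(*toffoli(*bits)) == bits for bits in inputs)
nand_table = {(a, b): nand(a, b) for a, b in product([0, 1], repeat=2)}
```

Since NAND is universal for Boolean logic, this is one way to see why Toffoli gates alone suffice for classical computation, unlike the irreversible AND gate discussed above.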
2.3.2 Matrix Operator Correspondence
We can treat an n-qubit state as a vector consisting of $2^n$ complex numbers, each representing the coefficient of a state from the computational basis. Now, a gate operates on such a state and yields another of the same dimension. So, a gate can be seen as a function that transforms a $2^n$-dimensional vector to another. Hence, in the vector-matrix representation in n-qubit space, a gate is a square matrix of dimensions $2^n \times 2^n$, whose ith column is the vector that results when we apply the gate on the ith element of the computational basis.
For a quantum computer gate, we require a very special kind of reversible function, namely a unitary mapping, that is, a mapping on the state-space that preserves the inner product. So, if H is a gate and $|\psi\rangle$ and $|\phi\rangle$ represent two quantum states in n-qubit space, then $|\psi'\rangle = H|\psi\rangle$ and $|\phi'\rangle = H|\phi\rangle$ will also be n-qubit states and will satisfy the property that $\langle\psi'|\phi'\rangle = \langle\psi|\phi\rangle$, where $\langle\cdot|\cdot\rangle$ denotes the inner product in bra-ket notation.
Hence, quantum logic gates are represented by unitary matrices. Note: a complex square matrix U is unitary if $U^* U = U U^* = I$, where I is the identity matrix and $U^*$ is the conjugate transpose of U. The real analogue of a unitary matrix is an orthogonal matrix.
The most common quantum gates operate on spaces of one, two or three qubits. This means that as matrices, quantum gates can be described by $2 \times 2$, $4 \times 4$ or $8 \times 8$ unitary matrices.
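The two properties stated in this section, that the ith column of a gate matrix is the image of the ith basis vector and that a unitary gate preserves inner products, can be checked numerically. The gate and the two states below are hypothetical examples chosen for illustration.

```python
import cmath
import math

def apply(matrix, vector):
    """Multiply a square matrix (list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def inner(u, v):
    """Inner product <u|v>, conjugating the first argument."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def column(matrix, i):
    """i-th column of the matrix, i.e. the image of the i-th basis vector."""
    return [row[i] for row in matrix]

# A hypothetical single-qubit unitary: a phase shift followed by a Hadamard.
s = 1 / math.sqrt(2)
phase = cmath.exp(1j * math.pi / 4)
gate = [[s, s * phase],
        [s, -s * phase]]

psi = [0.6, 0.8j]
phi = [s, s]

before = inner(psi, phi)
after = inner(apply(gate, psi), apply(gate, phi))   # equals `before` up to rounding
image_of_ket0 = apply(gate, [1, 0])                 # equals column(gate, 0)
```

A non-unitary matrix used in place of `gate` would in general change the inner product, which is why only unitary matrices qualify as quantum gates.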
2.3.3 Commonly used gates
Quantum gates are usually represented as matrices. A gate which acts on k qubits is represented by a $2^k \times 2^k$ unitary matrix. The numbers of qubits in the input and output of the gate have to be equal. The action of the quantum gate is found by multiplying the matrix representing the gate with the vector which represents the quantum state.
• Hadamard gate
The Hadamard gate acts on a single qubit. It maps the basis state $|0\rangle$ to $\frac{|0\rangle + |1\rangle}{\sqrt{2}}$ and $|1\rangle$ to $\frac{|0\rangle - |1\rangle}{\sqrt{2}}$, and represents a rotation of π about the axis $(\hat{x} + \hat{z})/\sqrt{2}$. It is represented by the Hadamard matrix:
$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$
Since $HH^* = I$, where I is the identity matrix, H is indeed a unitary matrix.
• Controlled Gates
Controlled gates act on 2 or more qubits, where one or more qubits act as a control for some operation. For example, the controlled NOT gate (or CNOT) acts on 2 qubits, and performs the NOT operation on the second qubit only when the first qubit is $|1\rangle$, and otherwise leaves it unchanged. It is represented by the matrix:
$$\mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
More generally, if U is a gate that operates on single qubits with matrix representation
$$U = \begin{pmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{pmatrix},$$
then the controlled-U gate is a gate that operates on two qubits in such a way that the first qubit serves as a control. It maps the basis states as follows:
$$\begin{aligned}
|00\rangle &\mapsto |00\rangle \\
|01\rangle &\mapsto |01\rangle \\
|10\rangle &\mapsto |1\rangle U|0\rangle = |1\rangle (x_{00}|0\rangle + x_{10}|1\rangle) \\
|11\rangle &\mapsto |1\rangle U|1\rangle = |1\rangle (x_{01}|0\rangle + x_{11}|1\rangle)
\end{aligned}$$
The matrix representing the controlled U is:
$$C(U) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & x_{00} & x_{01} \\ 0 & 0 & x_{10} & x_{11} \end{pmatrix}$$
Figure 2.2: Circuit representation of Hadamard, CNOT and Toffoli gates, respectively
• Toffoli Gate
The Toffoli gate, also called the CCNOT gate, is a 3-bit gate which is universal for classical computation. The quantum Toffoli gate is the same gate, defined for 3 qubits. If the first two bits are in the state $|1\rangle$, it applies a Pauli-X (bit inversion) on the third bit, else it does nothing. It is an example of a controlled gate. It swaps the states $|110\rangle$ and $|111\rangle$; it is an identity map for the other 6 states in the computational basis for a 3-qubit space. The matrix representation is:
$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$$
It can also be described as the gate which maps $|a, b, c\rangle$ to $|a, b, c \oplus ab\rangle$.
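The three gates in the list above can be written down as plain matrices and applied to basis states. The following is a minimal sketch using nested Python lists; the helper names `apply`, `controlled` and `toffoli_matrix` are illustrative, and basis states are ordered 00, 01, 10, 11 (and 000 through 111) as in the matrices above.

```python
import math

def apply(matrix, vector):
    """Multiply a square matrix (list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def controlled(u):
    """4x4 controlled-U matrix with the first qubit as control."""
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, u[0][0], u[0][1]],
            [0, 0, u[1][0], u[1][1]]]

def toffoli_matrix():
    """8x8 identity with the rows for |110> and |111> swapped."""
    m = [[1 if i == j else 0 for j in range(8)] for i in range(8)]
    m[6], m[7] = m[7], m[6]
    return m

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
X = [[0, 1], [1, 0]]                       # Pauli-X (NOT)
CNOT = controlled(X)

plus = apply(H, [1, 0])                    # H|0> = (|0> + |1>)/sqrt(2)
back = apply(H, plus)                      # H is its own inverse, so this is |0>
cnot_on_10 = apply(CNOT, [0, 0, 1, 0])     # |10> -> |11>
toffoli_on_110 = apply(toffoli_matrix(), [0, 0, 0, 0, 0, 0, 1, 0])  # |110> -> |111>
```

Building CNOT as `controlled(X)` mirrors the general C(U) construction: the lower-right block of the matrix is simply U.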
2.3.4 Quantum Fourier Transform
This is a linear transformation on quantum bits, and is the quantum analogue of the discrete Fourier transform. The quantum Fourier transform is a part of many quantum algorithms, notably Shor's algorithm for factoring and computing the discrete logarithm, the quantum phase estimation algorithm for estimating the eigenvalues of a unitary operator, and algorithms for the hidden subgroup problem.
The quantum Fourier transform can be performed efficiently on a quantum computer, with a particular decomposition into a product of simpler unitary matrices. Using a simple decomposition, the discrete Fourier transform on $2^n$ amplitudes can be implemented as a quantum circuit consisting of only $O(n^2)$ Hadamard gates and controlled phase shift gates, where n is the number of qubits. This can be compared with the classical discrete Fourier transform, which takes $O(n 2^n)$ gates (where n is the number of bits), which is exponentially more than $O(n^2)$.
The quantum Fourier transform is the classical discrete Fourier transform applied to the vector of amplitudes of a quantum state. The classical (unitary) Fourier
transform acts on a vector (x_0, ..., x_{N-1}) and maps it to the vector (y_0, ..., y_{N-1})
according to the formula:
\[
y_k = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} x_j \, \omega^{jk}
\]
where $\omega = e^{2\pi i / N}$ is a primitive Nth root of unity.
Similarly, the quantum Fourier transform acts on a quantum state $\sum_{i=0}^{N-1} x_i |i\rangle$ and
maps it to a quantum state $\sum_{i=0}^{N-1} y_i |i\rangle$ according to the formula:
\[
y_k = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} x_j \, \omega^{jk}
\]
This can also be expressed as the map
\[
|j\rangle \mapsto \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} \omega^{jk} \, |k\rangle
\]
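The transform can be checked numerically on a small register. The sketch below is a classical simulation only (it stores all N amplitudes, so it says nothing about the efficient quantum circuit); the helper names qft_matrix and apply are assumptions made for illustration.

```python
import cmath
import math


def qft_matrix(n_qubits):
    """Return the N x N array of omega**(j*k) / sqrt(N), with N = 2**n_qubits."""
    N = 2 ** n_qubits
    omega = cmath.exp(2j * cmath.pi / N)  # primitive N-th root of unity
    scale = 1 / math.sqrt(N)
    return [[scale * omega ** (j * k) for k in range(N)] for j in range(N)]


def apply(matrix, amplitudes):
    """Multiply a matrix by a vector of amplitudes."""
    return [sum(row[j] * amplitudes[j] for j in range(len(amplitudes))) for row in matrix]


F = qft_matrix(2)
state = apply(F, [0, 1, 0, 0])  # transform of the basis state |01>
print(state)
```

For the basis state |01⟩ on two qubits the resulting amplitudes are ω^k / 2, so every outcome is equally likely and only the phases differ.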
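Equivalently, the quantum Fourier transform on an n-qubit vector (N = 2^n) can
be viewed as a unitary matrix acting on quantum state vectors, where the unitary
matrix F_N is given by
\[
F_N = \frac{1}{\sqrt{N}}
\begin{pmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^{2} & \omega^{3} & \cdots & \omega^{N-1} \\
1 & \omega^{2} & \omega^{4} & \omega^{6} & \cdots & \omega^{2(N-1)} \\
1 & \omega^{3} & \omega^{6} & \omega^{9} & \cdots & \omega^{3(N-1)} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega^{N-1} & \omega^{2(N-1)} & \omega^{3(N-1)} & \cdots & \omega^{(N-1)(N-1)}
\end{pmatrix}
\]
2.4
Measurement in Quantum Mechanics
2.4.1
A Qualitative Overview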
One of the most difficult and controversial problems in quantum mechanics is the
so-called measurement problem. Opinions on the significance of this problem vary
widely. At one extreme the attitude is that there is in fact no problem at all, while
at the other extreme the view is that the measurement problem is one of the great
unsolved puzzles of quantum mechanics. The issue is that quantum mechanics
only provides probabilities for the different possible outcomes in an experiment; it provides no mechanism by which the actual, finally observed result
comes about. Of course, probabilistic outcomes feature in many areas of classical
physics as well, but in that case, probability enters the picture simply because there
is insufficient information to make a definite prediction. In principle, that missing
information is there to be found; it is just that accessing it may be a practical impossibility. In contrast, there is no missing information for a quantum system: what
we see is all that we can get, even in principle.
In Dirac's words: "The intermediate character of the state formed by superposition thus expresses itself through the probability of a particular result for an observation being intermediate between the corresponding probabilities for the original
states, not through the result itself being intermediate between the corresponding
results for the original states."
2.4.2
The Quantitative Overview
For an ideal measurement in quantum mechanics, also called a von Neumann measurement, the only possible measurement outcomes are equal to the eigenvalues
(say O_k) of the operator representing the observable. Consider a system prepared in
state |ψ⟩. Since the eigenstates of the observable Ô form a complete basis, called the
eigenbasis, the state vector |ψ⟩ can be written in terms of the eigenstates as
\[
|\psi\rangle = c_1 |1\rangle + c_2 |2\rangle + c_3 |3\rangle + \cdots
\]
where c_1, c_2, ... are complex numbers in general. The eigenvalues O_1, O_2, O_3, ...
are all possible values of the measurement. The corresponding probabilities are
given by
\[
\Pr(O_n) = \frac{|\langle n|\psi\rangle|^2}{\langle\psi|\psi\rangle} = \frac{|c_n|^2}{\sum_k |c_k|^2}
\]
Usually |ψ⟩ is assumed to be normalized, i.e. ⟨ψ|ψ⟩ = 1. Therefore, the expression
above reduces to
\[
\Pr(O_n) = |\langle n|\psi\rangle|^2 = |c_n|^2 .
\]
A quantum computer operates by setting the n qubits in a controlled initial state
that represents the problem at hand and by manipulating those qubits with a fixed
sequence of quantum logic gates. The sequence of gates to be applied is called a
quantum algorithm. The calculation ends with measurement of all the states, collapsing each qubit into one of the two pure states, so the outcome can be at most n
classical bits of information.
For example, if we prepare a 2-qubit system in the state |ψ⟩ = (1/√2)|00⟩ +
(1/√3)|01⟩ + (1/√6)|11⟩, then a measurement on the system will yield results corresponding to the state |00⟩ with probability 1/2, state |01⟩ with probability 1/3 and state
|11⟩ with probability 1/6.
Partial measurement We can even perform a measurement on just one register.
Then, the probability of the state |0⟩ being measured on the register is just the sum
of the probabilities of all states wherein this particular register is in the |0⟩ state.
So, in the above example, a measurement on the first register will yield |0⟩ with
probability 1/2 + 1/3 = 5/6.
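The arithmetic of the partial measurement can be reproduced with exact fractions. This is a small sketch; the dictionary layout and the function name marginal are illustrative choices, not notation from the report.

```python
from fractions import Fraction

# Squared amplitudes |c|^2 of the example state, keyed by basis label.
probabilities = {"00": Fraction(1, 2), "01": Fraction(1, 3), "11": Fraction(1, 6)}


def marginal(probabilities, qubit, value):
    """Probability that the given qubit (0 = first register) is measured as `value`."""
    return sum(p for label, p in probabilities.items() if label[qubit] == value)


print(marginal(probabilities, 0, "0"))  # 5/6
```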
2.4.3
Collapsing of States
A postulate of quantum mechanics states that the process of measurement formally
causes an instantaneous collapse of the quantum state to the eigenstate corresponding to the measured value of the observable. A consequence of this is that the
results of a subsequent measurement are essentially unrelated to the form of the pre-collapse quantum state (unless the eigenstates of the operators representing the
observables coincide). So, in the example mentioned in the previous subsection,
if a measurement on the system had yielded the result corresponding to eigenstate
|00⟩, then all subsequent measurements would have given the same result too, because the system would have collapsed to this state.
The scenario is slightly different in the case of partial measurement. Here,
the measured register collapses entirely into a particular state and then, the states
that remain in the system must all have this register in the measured state. Also,
as expected, the mutual ratio of the probabilities associated with these states stays
conserved. So, in the example where we did a measurement on the first register
only and obtained |0⟩, the resultant state would be
\[
\frac{\frac{1}{\sqrt{2}}}{\sqrt{\left(\frac{1}{\sqrt{2}}\right)^2 + \left(\frac{1}{\sqrt{3}}\right)^2}} \, |00\rangle
+ \frac{\frac{1}{\sqrt{3}}}{\sqrt{\left(\frac{1}{\sqrt{2}}\right)^2 + \left(\frac{1}{\sqrt{3}}\right)^2}} \, |01\rangle
= \sqrt{\tfrac{3}{5}} \, |00\rangle + \sqrt{\tfrac{2}{5}} \, |01\rangle
\]
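The renormalisation after a partial measurement can be sketched as follows, assuming real amplitudes stored in a dictionary keyed by basis label; the function name collapse is hypothetical.

```python
import math


def collapse(amplitudes, qubit, outcome):
    """Keep the basis states whose `qubit` equals `outcome`, then renormalise."""
    kept = {label: a for label, a in amplitudes.items() if label[qubit] == outcome}
    norm = math.sqrt(sum(abs(a) ** 2 for a in kept.values()))
    return {label: a / norm for label, a in kept.items()}


psi = {"00": 1 / math.sqrt(2), "01": 1 / math.sqrt(3), "11": 1 / math.sqrt(6)}
post = collapse(psi, 0, "0")  # first register measured as 0
print({label: round(a ** 2, 6) for label, a in post.items()})
```

The squared amplitudes of the surviving states are 3/5 and 2/5, which keeps the original 3:2 ratio between |00⟩ and |01⟩.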
Chapter 3
Classical Optimization and
Search Techniques
In this chapter we discuss a few popular optimization techniques in use in current
day natural language processing algorithms. First we present the Hidden Markov
Model (HMM) used for part-of-speech tagging (POS-tagging) among other tasks.
Then we formulate the POS-tagging problem using HMM and present its classical
solution which is due to the Viterbi algorithm.
Then we present the Maximum Entropy approach, which is a heuristic used
in problems related to finding probability distributions. Next up is the Maximum
Entropy Markov Model (MEMM), a discriminative model that extends a standard
maximum entropy classifier by assuming that the unknown values to be learnt are
connected in a Markov chain rather than being conditionally independent of each
other. MEMMs find applications in information extraction, segmentation and in
natural language processing, specifically in part-of-speech tagging.
This is followed by an overview of Generalised Iterative
Scaling and its successor, Improved Iterative Scaling, which are used to solve the
training objectives of many problems that use maximum likelihood estimation
on the training data to obtain the parameters.
Then comes the concept of swarm intelligence, which is inspired by the behaviour
of insects such as ants. Finally we briefly discuss Boltzmann machines. They
were one of the first examples of a neural network capable of learning internal
representations, and are able to represent and (given sufficient time) solve difficult
combinatorial problems.
3.1
Hidden Markov Model
The Hidden Markov model is a stochastic model in which the system being modelled is assumed to be a Markov process with unobserved (hidden) states. A key
aspect of the HMM is its Markov property which is described in brief below along
with other background definitions required.
3.1.1
Stochastic Process
Definition 1. Stochastic Process: A stochastic process is a collection of random
variables often used to represent the evolution of some random value over time.
There is indeterminacy in a stochastic process. Even if we know the initial
conditions, the system can evolve in possibly many different ways.
3.1.2
Markov Property and Markov Modelling
Definition 2. Markov Property: A stochastic process has the Markov property if
the conditional probability distribution of future states of the process (conditional
on both past and present values) depends only upon the present state, not on the
sequence of events that preceded it. That is, the process is memoryless.
A Markov model is a stochastic model that follows the Markov property. Next
we present the HMM through the urn problem which eases the exposition. In a
further sub-section the formal description of the HMM is given.
3.1.3
The urn example
There are N urns, each containing balls of different colours mixed in known proportions. An urn is chosen and a ball is taken out of it. The colour of the ball is
noted and the ball is replaced. The choice of the urn from which the nth ball will
be picked is determined by a random number and the urn from which the (n − 1)th
ball was picked. Hence, the process becomes a Markov process.
The problem to be solved is the following: Given the ball colour sequence find
the underlying urn sequence. Here the urn sequence is unknown (hidden) from
us and hence the name Hidden Markov Model. The diagram below (see footnote 1) shows the
architecture of an example HMM. The quantities marked on the transition arrows
represent the transition probabilities.
1
Source:http://en.wikipedia.org/wiki/Hidden_Markov_model
Figure 3.1: An example Hidden Markov Model with three urns
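The urn process can be simulated directly. In the sketch below all probabilities and colour names are made-up illustration values (they are not read off Figure 3.1), and the function name sample_urn_sequence is hypothetical.

```python
import random


def sample_urn_sequence(start, transition, emission, length, rng):
    """Draw `length` (urn, colour) pairs; each urn depends only on the previous urn."""
    urns, colours = [], []
    weights = start
    for _ in range(length):
        urn = rng.choices(range(len(weights)), weights=weights)[0]
        palette = emission[urn]
        colour = rng.choices(list(palette), weights=list(palette.values()))[0]
        urns.append(urn)
        colours.append(colour)
        weights = transition[urn]  # Markov property: next choice uses only this urn
    return urns, colours


# Hypothetical three-urn model.
start = [0.5, 0.3, 0.2]
transition = [[0.1, 0.4, 0.5], [0.6, 0.2, 0.2], [0.3, 0.4, 0.3]]
emission = [
    {"red": 0.3, "green": 0.5, "blue": 0.2},
    {"red": 0.1, "green": 0.4, "blue": 0.5},
    {"red": 0.6, "green": 0.1, "blue": 0.3},
]
urns, colours = sample_urn_sequence(start, transition, emission, 10, random.Random(0))
print(colours)  # what an observer sees
print(urns)     # the hidden sequence to be recovered
```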
3.1.4
Formal Description of the Hidden Markov Model
The hidden Markov model can be mathematically described as follows:
N                     = number of states
T                     = number of observations
θ_{i=1...N}           = emission parameter for an observation associated with state i
φ_{i=1...N, j=1...N}  = probability of transition from state i to state j
φ_{i=1...N}           = N-dimensional vector, composed of φ_{i,1...N}; must sum to 1
x_{t=1...T}           = state of observation at time t
y_{t=1...T}           = observation at time t
F(y|θ)                = probability distribution of an observation, parametrized on θ
x_{t=2...T}           ∼ Categorical(φ_{x_{t−1}})
y_{t=1...T}           ∼ F(θ_{x_t})
3.1.5
The Trellis Diagram
Given the set of states in the HMM, we can draw a linear representation of the state
transitions given an input sequence by repeating the set of states at every stage.
This gives us the trellis diagram. A sample trellis is shown in Figure 3.2 (see footnote 2). Each
level of the trellis contains all the possible states and transitions from each state
onto the states in the next level. Along with every transition, an observation is
emitted simultaneously (in the figure a time unit is crossed and observations vary
with time).
2
Source: Prof. Pushpak Bhattacharyya’s lecture slides on HMM from the course CS 344 - Artificial Intelligence at IIT Bombay, spring 2013
Figure 3.2: An Example Trellis
3.1.6
Formulating the Part-of-Speech tagging problem using HMM
The POS tagging problem can be described as follows. We are given a sentence
which is a sequence of words. Each word has a POS tag which is unknown. The
task is to find the POS tags of each word and return the POS tag sequence corresponding to the sentence. Here the POS tags constitute the hidden states. As in the
urn problem, we again assume that words (balls) are emitted by POS tags (urns), a
property called the lexical assumption. That is, the probability of seeing a particular word depends only on the POS tag previously seen. Also, as was the case in the
urn problem, the probability of a word having a particular POS tag is dependent
only on the POS tag of the previous word (urn to urn probability). Having modelled the problem as given above, we need to explain how the transition tables are
constructed. The transition probabilities come from data. This is a data-driven approach to POS tagging, and using data on sentences which are already POS tagged
we construct the transition tables. Given this formulation, we next present an algorithm which given an input sentence and the transition tables outputs the most
probable POS tag sequence.
3.1.7
The Viterbi Algorithm
The Viterbi algorithm[13] is a dynamic programming algorithm for finding the
most likely sequence of hidden states that results in the sequence of observations.
Here the hidden states are the POS tags (or urns in the example) and the observed
sequence is the word sequence (ball colours).
The state transition probabilities are known (in practice these are estimated
from labelled data) and so are the probabilities of emitting each word in the sentence given the POS tag of the previous word. We start at the start of the input
sentence. We define two additional POS tags ˆ and $ to represent the tag for the
start of the sentence and the terminal character at the end of the sentence (full stop,
exclamation mark and question mark).
A straightforward algorithm to find the most probable POS tag sequence (hidden sequence) would be to just try all possibilities starting from the beginning of
the sentence, which takes time exponential in the length of the sentence. Here, our problem has more structure. We will exploit the Markov
assumption we made earlier to get a much more efficient algorithm, which is precisely the Viterbi algorithm.
In the trellis for the POS tagging problem, the following are the major changes to
be made.
• The observations (words) do not vary with time. Instead they vary with the
position of the pointer in the input sentence.
• The states are the POS tags. The state transition probabilities are pre-computed
using a POS-tagged corpus.
Next, we observe that due to the Markov assumption, once we have traversed
a part of the sentence, the transition probabilities do not depend on the entire sentence seen so far. They depend only on the previous POS tag. This crucial observation gives rise to the Viterbi algorithm:
Suppose we are given an HMM with a set S of possible POS tags (states), initial probabilities π_i of being in state i, the transition probabilities P(s_j|s_i) of going from
state i to j and the emission probabilities P(y_t|s_i) of emitting y_t from the state s_i.
If the input sentence is y_1, y_2, ..., y_T then the most probable state sequence that
produces the sentence is given by the recurrence relations
\[
V_{1,k} = P(y_1|s_k)\,\pi_k \tag{3.1}
\]
\[
V_{t,k} = P(y_t|s_k)\,\max_{s_x \in S}\bigl(P(s_k|s_x)\,V_{t-1,x}\bigr) \tag{3.2}
\]
where V_{t,k} is the probability of the most probable state sequence which emitted
the first t words and has k as the final state. The Viterbi path (most likely state
sequence) can be recovered by storing back pointers which record the state
s_x that was chosen in the second equation. The complexity of the algorithm is
O(|T||S|^2), where T is the input word sequence and S is the set of POS
tags.
3.1.8
Pseudocode
Pseudocode for the Viterbi algorithm is given below:
# Given
#   Set of states: Array S
#   Start state: s0
#   End state: se
#   Symbol sequence: Array w
#   State transition probabilities: Matrix a
#   Symbol emission probabilities: Matrix b
#   Best-path probabilities: Matrix V
#   Back pointers: Matrix back
# All indices in arrays start on 1 in this pseudocode
# Returns
#   Probability of the most probable state sequence: p
#   (the sequence itself is read off by following back from best)
# Initialisation
foreach s in S do
  V[1][s] := a[s0][s]*b[s][w[1]]
done
# Induction: keep the maximum over predecessors, not the sum
for i := 1 to length(w)-1 do
  foreach s in S do
    V[i+1][s] := 0
    foreach s' in S do
      if V[i][s']*a[s'][s] > V[i+1][s] then
        V[i+1][s] := V[i][s']*a[s'][s]
        back[i+1][s] := s'
      fi
    done
    V[i+1][s] *= b[s][w[i+1]]
  done
done
# Termination
p := 0
foreach s in S do
  if V[length(w)][s]*a[s][se] > p then
    p := V[length(w)][s]*a[s][se]
    best := s
  fi
done
return p
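The recurrences (3.1) and (3.2) can be sketched in runnable Python as follows. This is an illustrative version under simplifying assumptions: probabilities are held in nested dictionaries, the explicit end state is omitted, and the two-tag model with its numbers is invented for the example.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return (probability, tag sequence) of the most probable hidden path."""
    # V[t][s]: probability of the best path that ends in state s after t + 1 words.
    V = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, V[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            V[t][s] = score * emit_p[s].get(words[t], 0.0)
            back[t][s] = prev  # back pointer to the chosen predecessor
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path


# Hypothetical model with two tags, N (noun) and V (verb).
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.7}}
prob, tags = viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p)
print(tags, prob)
```

For this toy input the best path is N followed by V, with probability 0.8 × 0.6 × 0.7 × 0.7 = 0.2352. Each position costs |S|^2 multiplications, which matches the stated complexity.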
In the next section, we present the concept of Maximum Entropy and see how
it is applied to NLP tasks via an example for Statistical Machine Learning.
3.2
Maximum Entropy Approach
"Gain in entropy always means loss of information, and nothing more."
- G.N. Lewis (1930)
3.2.1
Entropy - Thermodynamic and Information
In statistical mechanics, entropy is of the form:
\[
S = -k \sum_i p_i \log p_i ,
\]
where p_i is the probability of the microstate i taken from an equilibrium ensemble.
The defining expression for entropy in Shannon's theory of information is of the
form:
\[
H = -\sum_i p_i \log p_i ,
\]
where p_i is the probability of the message m_i taken from the message space M.
Mathematically H may also be seen as an average information, taken over the
message space, because when a certain message occurs with probability p_i, the information −log p_i will be obtained.
A connection can be made between the two. If the probabilities in question
are the thermodynamic probabilities p_i, the (reduced) Gibbs entropy σ can then be
seen as simply the amount of Shannon information needed to define the detailed
microscopic state of the system, given its macroscopic description. To be more
concrete, in the discrete case using base two logarithms, the reduced Gibbs entropy
is equal to the minimum number of yes/no questions needed to be answered in
order to fully specify the microstate, given that we know the macrostate.
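The yes/no-question reading of entropy is easy to check for a uniform distribution. A minimal sketch, using base-two logarithms; the function name shannon_entropy is an illustrative choice.

```python
import math


def shannon_entropy(probabilities):
    """H = -sum p_i log2 p_i, in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```

Four equally likely microstates give 2 bits, i.e. two yes/no questions are enough to single one out.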
3.2.2
The Maximum Entropy Model
Language modelling is the attempt to characterize, capture and exploit regularities
in natural language. In statistical language modelling, large amounts of text are
used to automatically determine the model's parameters, in a process known as
training. While building models, we may use each knowledge source separately
and then combine the results. Under the Maximum Entropy approach, one does not construct
separate models. Instead, we build a single, combined model, which attempts to
capture all the information provided by the various knowledge sources. Each such
knowledge source gives rise to a set of constraints, to be imposed on the combined
model. The intersection of all the constraints, if not empty, contains a (possibly
infinite) set of probability functions, which are all consistent with the knowledge
sources. Once the desired knowledge sources have been incorporated, no other
features of the data are assumed about the source. Instead, the worst (flattest)
of the remaining possibilities is chosen. Let us illustrate these ideas with a simple
example.
3.2.3
Application to Statistical Machine Learning
Suppose we wish to predict the next word in a document[11], given the history,
i.e., what has been read so far. Assume we wish to estimate P(BANK|h), namely
the probability of the word BANK given the document's history. One estimate
may be provided by a conventional bigram. The bigram would partition the event
space (h, w) based on the last word of the history. Consider one such equivalence
class, say, the one where the history ends in THE. The bigram assigns the same
probability estimate to all events in that class:
\[
P_{\mathrm{BIGRAM}}(\mathrm{BANK}\,|\,\mathrm{THE}) = K_{\{\mathrm{THE},\mathrm{BANK}\}}
\]
That estimate is derived from the distribution of the training data in that class.
Specifically, it is derived as:
\[
K_{\{\mathrm{THE},\mathrm{BANK}\}} = \frac{C(\mathrm{THE},\mathrm{BANK})}{C(\mathrm{THE})}
\]
Another estimate may be provided by a particular trigger pair, say (LOAN → BANK).
Assume we want to capture the dependency of BANK on whether or not LOAN occurred before it in the same document. Thus a different partition of the event space
will be added. Similarly to the bigram case, consider now one such equivalence
class, say, the one where LOAN did occur in the history. The trigger component
assigns the same probability estimate to all events in that class:
\[
P_{\mathrm{LOAN}\to\mathrm{BANK}}(\mathrm{BANK}\,|\,\mathrm{LOAN}\in h) = K_{\{\mathrm{BANK}|\mathrm{LOAN}\in h\}}
\]
That estimate is derived from the distribution of the training data in that class.
Specifically, it is derived as:
\[
K_{\{\mathrm{BANK}|\mathrm{LOAN}\in h\}} = \frac{C(\mathrm{BANK},\mathrm{LOAN}\in h)}{C(\mathrm{LOAN}\in h)}
\]
These estimates are clearly mutually inconsistent. How can they be reconciled?
Linear interpolation solves this problem by averaging the two answers. The backoff method solves it by choosing one of them. The Maximum Entropy approach,
on the other hand, does away with the inconsistency by relaxing the conditions imposed by the component sources.
Consider the bigram. Under Maximum Entropy, we no longer insist that P(BANK|h)
always have the same value K_{THE,BANK} whenever the history ends in THE. Instead, we acknowledge that the history may have other features that affect the
probability of BANK. Rather, we only require that, in the combined estimate,
P(BANK|h) be equal to K_{THE,BANK} on average in the training data:
\[
\mathop{E}_{h \text{ ends in THE}} \bigl[ P_{\mathrm{COMBINED}}(\mathrm{BANK}\,|\,h) \bigr] = K_{\{\mathrm{THE},\mathrm{BANK}\}}
\]
where E stands for an expectation, or average. The constraint expressed by this
equation is much weaker. There are many different functions P_COMBINED that
would satisfy it. Similarly,
\[
\mathop{E}_{\mathrm{LOAN}\in h} \bigl[ P_{\mathrm{COMBINED}}(\mathrm{BANK}\,|\,h) \bigr] = K_{\{\mathrm{BANK}|\mathrm{LOAN}\in h\}}
\]
In general, we can define any subset S of the event space, and any desired expectation K, and impose the constraint:
\[
\sum_{(h,w)\in S} P(h,w) = K
\]
The subset S can be specified by an index function, also called selector function,
f_S, an indicator for the membership of the pair (h, w) in S. So, we have
\[
\sum_{(h,w)} P(h,w)\, f_S(h,w) = K
\]
We need not restrict ourselves to index functions. Any real-valued function f(h, w)
can be used. We call f(h, w) a constraint function, and the associated K the desired
expectation. So, we have
\[
\langle f, P \rangle = K
\]
3.3
The ME Principle and a Solution
Now, we give a general description of the Maximum Entropy model and its solution. The Maximum Entropy (ME) Principle can be stated as follows[6]
1. Reformulate the different information sources as constraints to be satisfied
by the target (combined) estimate.
2. Among all probability distributions that satisfy these constraints, choose the
one that has the highest entropy.
Given a general event space {x}, to derive a combined probability function P(x),
each constraint j is associated with a constraint function f_j(x) and a desired expectation K_j. The constraint is then written as:
\[
E_P f_j = \sum_x P(x)\, f_j(x) = K_j
\]
Given consistent constraints, a unique ME solution is guaranteed to exist, and to
be of the form:
\[
P(x) = \prod_j \mu_j^{f_j(x)}
\]
where the µ_j s are some unknown constants, to be found.
3.3.1
Proof for the ME Formulation
Here, we give a proof for the unique ME solution that we proposed in the previous
subsection. Suppose there are N different points in the event space, and we assign
a probability p_i to each. Then, the objective to be maximised is the entropy, given
by $H = -\sum_{i=1}^{N} p_i \ln p_i$. The constraints are:
\[
\sum_i p_i = 1
\]
\[
\sum_i p_i f_j(x_i) = K_j \quad \forall j \in \{1, 2, ..., m\}
\]
So, we introduce Lagrange multipliers and now maximise
\begin{align*}
F &= -\sum_{i=1}^{N} p_i \ln p_i + \lambda \Bigl( \sum_{i=1}^{N} p_i - 1 \Bigr) + \sum_{j=1}^{m} \lambda_j \Bigl( \sum_{i=1}^{N} p_i f_j(x_i) - K_j \Bigr) \\
\frac{\partial F}{\partial p_i} &= -\ln p_i - 1 + \lambda + \sum_{j=1}^{m} \lambda_j f_j(x_i) = 0 \\
\ln p_i &= \lambda - 1 + \sum_{j=1}^{m} \lambda_j f_j(x_i) \\
p_i &= e^{\lambda - 1} \, e^{\sum_{j=1}^{m} \lambda_j f_j(x_i)} \\
p_i &= e^{\lambda - 1} \prod_{j=1}^{m} e^{\lambda_j f_j(x_i)} \\
p_i &= a \prod_{j=1}^{m} \mu_j^{f_j(x_i)}
\end{align*}
where $a = e^{\lambda - 1}$ is a normalization constant and $e^{\lambda_j} = \mu_j$.
3.3.2
Generalized Iterative Scaling
To search the exponential family defined by $p_i = \prod_{j=1}^{m} \mu_j^{f_j(x_i)}$ for the µ_j s that will
make P(x) satisfy all the constraints, an iterative algorithm exists, which is guaranteed to converge to the solution. GIS[5] starts with some arbitrary $\mu_j^{(0)}$ values,
which define the initial probability estimate:
\[
P^{(0)}(x) = \prod_j \bigl( \mu_j^{(0)} \bigr)^{f_j(x)}
\]
Each iteration creates a new estimate, which is improved in the sense that it matches
the constraints better than its predecessor. Each iteration (say k) consists of the
following steps:
1. Compute the expectations of all the f_j's under the current estimate function.
Namely, compute $E_{P^{(k)}} f_j = \sum_x P^{(k)}(x) f_j(x)$
2. Compare the actual values $E_{P^{(k)}} f_j$ to the desired values K_j, and update
the µ_j's according to the following formula:
\[
\mu_j^{(k+1)} = \mu_j^{(k)} \cdot \frac{K_j}{E_{P^{(k)}} f_j}
\]
3. Define the next estimate function based on the new µ_j s:
\[
P^{(k+1)}(x) = \prod_j \bigl( \mu_j^{(k+1)} \bigr)^{f_j(x)}
\]
Iterating is continued until convergence or near-convergence.
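The three steps can be sketched on a tiny event space. Two assumptions in this sketch go beyond the description above and are labelled as such: the estimate is explicitly renormalised to sum to one at every step, and the constraint functions are chosen to be non-negative and to sum to one at every point, which is the setting in which the plain update K_j / E f_j is the standard GIS step. The feature values and targets are invented for illustration.

```python
def distribution(features, mu):
    """P(x) proportional to prod_j mu_j ** f_j(x); normalisation is an added assumption."""
    weights = []
    for x in range(len(features[0])):
        w = 1.0
        for j, f in enumerate(features):
            w *= mu[j] ** f[x]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def gis(features, targets, iterations=200):
    """features[j][x] = f_j(x); targets[j] = K_j. Returns the fitted distribution."""
    mu = [1.0] * len(features)  # arbitrary starting values
    for _ in range(iterations):
        p = distribution(features, mu)                       # current estimate
        for j, f in enumerate(features):
            expected = sum(px * fx for px, fx in zip(p, f))  # step 1
            mu[j] *= targets[j] / expected                   # step 2
    return distribution(features, mu)                        # step 3


# Hypothetical example: three events, two constraint functions summing to one.
features = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
targets = [0.6, 0.4]
p = gis(features, targets)
print(p)
print([sum(px * fx for px, fx in zip(p, f)) for f in features])
```

The second printed line shows the expectations of the two constraint functions under the fitted distribution, which should agree with the targets to within numerical precision.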
3.4
Improved Iterative Scaling
Iterative Scaling and its variants are all based on the central idea of the Gradient
Descent algorithm for optimizing convex training objectives. It is presented here
using a model which occurs at many places in a maximum entropy approach to
natural language processing.
3.4.1
The Model in parametric form
The problem we consider is a language modelling problem[9], which is to define
the distribution P(y|x), where y and x are sequences. For example, y can be the POS tag
sequence and x the input sequence. Henceforth the boldface indicating that x is a
sequence will be dropped unless the context demands further elucidation.
Given just the above information, the maximum entropy approach maximises
the entropy of the model, giving us a model of the following form:
\[
P_\Lambda(y|x) = \frac{1}{Z_\Lambda(x)} \exp\Bigl( \sum_{i=1}^{n} \lambda_i f_i(x, y) \Bigr) \tag{3.3}
\]
where
• f_i(x, y) is a binary-valued function, called a feature of (x, y), associated with
the model. The model given above has n features.
• λ_i is a real-valued weight attached to f_i whose absolute value measures
the 'importance' of the feature f_i. Λ is the vector of the weights: Λ =
{λ_1, λ_2, ..., λ_n}.
• Z_Λ(x) is the normalizing factor which ensures that P_Λ is a probability distribution:
\[
Z_\Lambda(x) = \sum_y \exp\Bigl( \sum_{i=1}^{n} \lambda_i f_i(x, y) \Bigr)
\]
3.4.2
Maximum Likelihood
The next thing to do would be to train the model, i.e. find the parameters λ_i so as to
maximize some objective over the training data. Here, we choose to maximize the
likelihood of the training data. The likelihood is computed by assuming that the
model is the correct underlying distribution and hence is a function of the parameters of the model. The likelihood of the training data is expressed as follows (N is
the number of training instances):
\begin{align*}
M(\Lambda) &= \prod_{i=1}^{N} P(x_i, y_i) \\
&= \prod_{i=1}^{N} P_\Lambda(y_i|x_i)\, P(x_i)
\end{align*}
Now, we note that log(x) is a strictly increasing function for x > 0. Therefore the value of
x which maximizes f(x) is the same as that which maximizes log(f(x)). Henceforth we work with the logarithm of the likelihood expression as it is mathematically easier to work with. The log-likelihood expression denoted by L(Λ) is given
below:
\begin{align*}
L(\Lambda) &= \log(M(\Lambda)) \\
&= \sum_{i=1}^{N} \log\bigl(P_\Lambda(y_i|x_i)\bigr) + C
\end{align*}
where C is independent of Λ and is hence treated as a constant. It is dropped from
the expression henceforth as it does not affect the maximization problem.
Now, we express the log-likelihood expression in terms of the empirical probability
distribution p̃(x, y) obtained from the training data as follows:
\[
\tilde{p}(x, y) = \frac{c(x, y)}{\sum_{x,y} c(x, y)}
\]
where c(x, y) is the number of times the instance (x, y) occurs in the training data.
The log-likelihood expression becomes the following:
\begin{align*}
L_{\tilde{p}}(\Lambda) &= \sum_{x,y} \log P_\Lambda(y|x)^{c(x,y)} \\
&= \sum_{x,y} \tilde{p}(x, y) \log\bigl(P_\Lambda(y|x)\bigr)
\end{align*}
In the last step we have divided by $\sum_{x,y} c(x, y)$, which is constant for a given training set (= N) and so does not change the maximizer.
3.4.3
The objective to optimize
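Hence we arrive at the objective to be maximized. The maximum likelihood problem
is to discover $\Lambda^* \equiv \operatorname{argmax}_\Lambda L_{\tilde{p}}(\Lambda)$ where
\begin{align*}
L_{\tilde{p}}(\Lambda) &= \sum_{x,y} \tilde{p}(x, y) \log\bigl(P_\Lambda(y|x)\bigr) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \lambda_i f_i(x, y) - \sum_{x,y} \tilde{p}(x, y) \log\Bigl( \sum_{y'} \exp\Bigl( \sum_{i=1}^{n} \lambda_i f_i(x, y') \Bigr) \Bigr) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \lambda_i f_i(x, y) - \sum_x \tilde{p}(x) \log\Bigl( \sum_y \exp\Bigl( \sum_{i=1}^{n} \lambda_i f_i(x, y) \Bigr) \Bigr)
\end{align*}
3.4.4
Deriving the iterative step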
Suppose we have a model with some arbitrary set of parameters Λ = {λ_1, λ_2, ..., λ_n}.
We would like to find a new set of parameters Λ + ∆ = {λ_1 + δ_1, λ_2 + δ_2, ..., λ_n +
δ_n} which yields a model of higher log-likelihood. The change in log-likelihood is
\begin{align*}
L_{\tilde{p}}(\Lambda + \Delta) - L_{\tilde{p}}(\Lambda) &= \sum_{x,y} \tilde{p}(x, y) \log P_{\Lambda+\Delta}(y|x) - \sum_{x,y} \tilde{p}(x, y) \log P_\Lambda(y|x) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) - \sum_x \tilde{p}(x) \log \frac{Z_{\Lambda+\Delta}(x)}{Z_\Lambda(x)}
\end{align*}
Now, we make use of the inequality −log(α) ≥ 1 − α to establish a lower
bound on the above change in likelihood expression.
\begin{align*}
L_{\tilde{p}}(\Lambda + \Delta) - L_{\tilde{p}}(\Lambda) &\geq \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \frac{Z_{\Lambda+\Delta}(x)}{Z_\Lambda(x)} \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \frac{\sum_y \exp\bigl( \sum_i (\lambda_i + \delta_i) f_i(x, y) \bigr)}{\sum_y \exp\bigl( \sum_i \lambda_i f_i(x, y) \bigr)} \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y \frac{\exp\bigl( \sum_i \lambda_i f_i(x, y) \bigr)}{Z_\Lambda(x)} \exp\Bigl( \sum_i \delta_i f_i(x, y) \Bigr) \\
&= \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x) \exp\Bigl( \sum_i \delta_i f_i(x, y) \Bigr) \\
&= A(\Delta|\Lambda)
\end{align*}
Now we know that if we can find a ∆ such that A(∆|Λ) > 0 then we have an
improvement in the likelihood. Hence, we try to maximize A(∆|Λ) with respect
to each δ_i. Unfortunately the derivative of A(∆|Λ) with respect to δ_i yields an
equation containing all of {δ_1, δ_2, ..., δ_n} and hence the constraint equations for
δ_i are coupled.
To get around this, we first observe that the coupling is due to the summation
of the δ_i s present inside the exponentiation function. We consider a counterpart
expression with the summation placed outside the exponentiation and compare the
two expressions. We find that we can indeed establish an inequality using an important property called Jensen's inequality. First, we define the quantity
\[
f^{\#}(x, y) = \sum_i f_i(x, y)
\]
If f_i are binary-valued then f#(x, y) just gives the total number of features which
are non-zero (applicable) at the point (x, y). We rewrite A(∆|Λ) in terms of f#(x, y)
as follows:
\[
A(\Delta|\Lambda) = \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x) \exp\Bigl( f^{\#}(x, y) \sum_i \frac{\delta_i f_i(x, y)}{f^{\#}(x, y)} \Bigr)
\]
Now, we note that $f_i(x, y) / f^{\#}(x, y)$, viewed as a function of i, is a p.d.f. Jensen's inequality states that for a p.d.f. p(x),
\[
\exp\Bigl( \sum_x p(x)\, q(x) \Bigr) \leq \sum_x p(x) \exp\bigl( q(x) \bigr)
\]
Now, using Jensen's inequality, we get,
\begin{align*}
A(\Delta|\Lambda) &\geq \sum_{x,y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x) \sum_i \frac{f_i(x, y)}{f^{\#}(x, y)} \exp\bigl( \delta_i f^{\#}(x, y) \bigr) \\
&= B(\Delta|\Lambda)
\end{align*}
where B(∆|Λ) is a new lower-bound on the change in likelihood. B(∆|Λ) can be
maximized easily because there is no coupling of variables in its derivative. The
derivative of B(∆|Λ) with respect to δ_i is,
\[
\frac{\partial B(\Delta)}{\partial \delta_i} = \sum_{x,y} \tilde{p}(x, y) f_i(x, y) - \sum_x \tilde{p}(x) \sum_y P_\Lambda(y|x)\, f_i(x, y) \exp\bigl( \delta_i f^{\#}(x, y) \bigr)
\]
Notice that in the expression for ∂B(∆)/∂δ_i, δ_i appears alone without the other parameters. Therefore, we can solve for each δ_i individually. The final IIS algorithm is as
follows,
• Start with some arbitrary values for λ_i s.
• Repeat until convergence
– Solve ∂B(∆)/∂δ_i = 0 for δ_i, for each i.
– Set λ_i = λ_i + δ_i for each i.
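The loop can be sketched for a very small conditional model. Everything concrete here is an assumption made for illustration: the two binary features, the six training pairs, the use of bisection on a fixed bracket to solve ∂B/∂δ_i = 0, and the fixed number of rounds in place of a convergence test.

```python
import math


def iis(data, xs, ys, features, rounds=50):
    """data: list of (x, y) pairs; features: functions f_i(x, y) returning 0 or 1."""
    n = len(data)
    lam = [0.0] * len(features)
    p_xy = {(x, y): data.count((x, y)) / n for x in xs for y in ys}  # empirical p~(x, y)
    p_x = {x: sum(p_xy[x, y] for y in ys) for x in xs}               # empirical p~(x)

    def model(x):
        scores = [math.exp(sum(l * f(x, y) for l, f in zip(lam, features))) for y in ys]
        z = sum(scores)
        return {y: s / z for y, s in zip(ys, scores)}

    def log_likelihood():
        return sum(p * math.log(model(x)[y]) for (x, y), p in p_xy.items() if p > 0)

    history = [log_likelihood()]
    for _ in range(rounds):
        cond = {x: model(x) for x in xs}
        deltas = []
        for f in features:
            empirical = sum(p * f(x, y) for (x, y), p in p_xy.items())

            def dB(delta):
                # Derivative of the lower bound B with respect to this feature's delta.
                return empirical - sum(
                    p_x[x] * cond[x][y] * f(x, y)
                    * math.exp(delta * sum(g(x, y) for g in features))
                    for x in xs for y in ys
                )

            lo, hi = -20.0, 20.0  # dB is decreasing in delta; bisect for its root
            for _ in range(60):
                mid = (lo + hi) / 2
                lo, hi = (mid, hi) if dB(mid) > 0 else (lo, mid)
            deltas.append((lo + hi) / 2)
        lam = [l + d for l, d in zip(lam, deltas)]
        history.append(log_likelihood())
    return lam, history


# Hypothetical toy problem.
features = [lambda x, y: int(y == 1), lambda x, y: int(x == 1 and y == 1)]
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
lam, history = iis(data, [0, 1], [0, 1], features)
print(lam)
print(history[0], history[-1])
```

Because each update maximises a lower bound that is zero at δ = 0, the recorded log-likelihood should never decrease from one round to the next.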
3.5
Swarm Intelligence
Swarm Intelligence (SI)[10] is a relatively new paradigm being applied in a host of
research settings to improve the management and control of large numbers of interacting entities such as communication, computer and sensor networks, satellite
constellations and more. Attempts to take advantage of this paradigm and mimic
the behaviour of insect swarms however often lead to many different implementations of SI. Here, we provide a set of general principles for SI research and development. A precise definition of self-organized behaviour is described and provides
the basis for a more axiomatic and logical approach to research and development as
opposed to the more prevalent ad hoc approach in using SI concepts. The concept
of Pareto optimality is utilized to capture the notions of efficiency and adaptability.
3.5.1
Foundations
The use of swarm intelligence principles makes it possible to control and manage
complex systems of interacting entities even though the interactions between and
among the entities are minimal.
As an example, consider how ants actually solve shortest path problems. Their
motivation for solving these problems stems from their need to find sources of
food. Many ants set out in search of a food source by apparently randomly choosing several different paths. Along the way they leave traces of pheromone. Once
ants find a food source, they retrace their path back to their colony by following
their scent back to their point of origin. Since many ants go out from their colony
in search of food, the ants that return first are presumably those that have found
the food source closest to the colony or at least have found a source that is in some
sense more accessible. In this way, an ant colony can identify the shortest or best
path to the food source.
The cleverness and simplicity of this scheme is highlighted when this process
is examined from what one could conceive of as the ants' perspective: they simply
follow the path with the strongest scent (or so it seems). The shortest path will
have the strongest scent because less time has elapsed between when the ants set
out in search of food and when they arrive back at the colony, hence there is less
time for the pheromone to evaporate. This leads more ants to go along this path
further strengthening the pheromone trail and thereby reinforcing the shortest path
to the food source and so exhibits a form of reinforcement learning.
But this simple method of reinforcement or positive feedback also exhibits important characteristics of efficient group behaviour. If, for instance, the shortest
path is somehow obstructed, then the second best shortest path will, at some later
point in time, have the strongest pheromone, hence will induce ants to traverse it
thereby strengthening this alternate path. Thus, the decay in the pheromone level
leads to redundancy, robustness and adaptivity, i.e., what some describe as emergent behaviour.
Efficiency via Pareto Optimality
Optimization problems are ubiquitous and even social insects must face them. Certainly, the efficient allocation of resources presents problems where some goal or
objective must be maintained or achieved. Such goals or objectives are often mathematically modelled using objective functions, functions of decision variables or
parameters that produce a scalar value that must be either minimized or maximized.
The challenge presented in these often difficult problems is to find the values of
those parameters that either minimize or maximize, i.e., optimize, the objective
function value subject to some constraints on the decision variables.
In multi-objective optimization problems (MOPs), system efficiency in a mathematical sense is often based on the definition of Pareto optimality, a well-established way of characterizing a set of optimal solutions when several objective
functions are involved. Each operating point or vector of decision variables (operational parameters) produces several objective function values corresponding to a
single point in objective function space (this implies a vector of objective function
values). A Pareto optimum corresponds to a point in objective function space with
the property that when it is compared to any other feasible point in objective function space, at least one objective function value (vector component) is superior to
the corresponding objective function value (vector component) of this other point.
Pareto optima therefore constitute a special subset of points in objective function
space that lie along what is referred to as the Pareto optimal frontier, the set of
points that together dominate (are superior to) all other points in objective function
space.
Figure 3.3: The Pareto Optimal frontier is the set of hollow points. Operational decisions must be restricted along this set if operational efficiency is to be maintained
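As an illustration of the definition above, the following minimal sketch extracts the non-dominated points from a finite set of objective vectors. It assumes that every objective is to be minimized; the names `dominates` and `pareto_front` and the route data are hypothetical and are not taken from the report.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better
    in at least one (assumption: all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (distance, congestion) values for five candidate routes.
routes = [(1.0, 9.0), (2.0, 4.0), (3.0, 5.0), (4.0, 1.0), (5.0, 2.0)]
front = pareto_front(routes)
print(front)
```

The route (3.0, 5.0) is dropped because (2.0, 4.0) is better in both objectives, and (5.0, 2.0) is dropped because of (4.0, 1.0).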
Determining several Pareto optima can be quite valuable for enhancing the
survival value of a species (or managing a complex system) because it enables
adaptive behaviour. Thus, if in an ant colony a path to a food source becomes congested, then other routes must be utilized. Although the distances to food sources
are generally minimized as is the level of congestion, these often conflicting objectives can be efficiently traded off when the shortest distance is sacrificed to lessen
the level of congestion.
The Measure of Pareto Optima: A rather intuitive yet surprisingly little known
aspect of Pareto optima is its measure. This measure is based on the size of the
set of points in objective function space that are dominated by the Pareto optimal
frontier - in essence a Lebesgue measure or hypervolume.
Figure 3.4: The Pareto hypervolume
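For two objectives the hypervolume reduces to an area and can be computed with a simple sweep. The sketch below assumes that both objectives are minimized and that the dominated region is bounded by a reference point `ref` chosen by the user; the function name and the data are illustrative only.

```python
from typing import List, Tuple

def hypervolume_2d(front: List[Tuple[float, float]],
                   ref: Tuple[float, float]) -> float:
    """Area dominated by a 2-objective minimization front, bounded by ref."""
    # Keep only points that are at least as good as the reference point.
    pts = sorted(p for p in front if p[0] <= ref[0] and p[1] <= ref[1])
    area = 0.0
    best_y = ref[1]
    for x, y in pts:            # sweep in order of increasing first objective
        if y < best_y:          # point adds a new horizontal strip
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area

front = [(1.0, 9.0), (2.0, 4.0), (4.0, 1.0)]
area = hypervolume_2d(front, ref=(10.0, 10.0))
print(area)
```

A larger value means that the frontier dominates a larger part of objective function space.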
3.5.2
Example Algorithms and Applications
• Ant colony optimization
A class of optimization algorithms modelled on the actions of an ant colony,
ACO is a probabilistic technique useful in problems that deal with finding
better paths through graphs. Artificial ’ants’ (simulation agents) locate optimal solutions by moving through a parameter space representing all possible
solutions. Natural ants lay down pheromones directing each other to resources while exploring their environment. The simulated ’ants’ similarly
record their positions and the quality of their solutions, so that in later simulation iterations more ants locate better solutions.
• Artificial bee colony algorithm
Artificial bee colony algorithm (ABC) is a meta-heuristic algorithm that
simulates the foraging behaviour of honey bees. The algorithm has three
phases: employed bee, onlooker bee and scout bee. In the employed bee and
the onlooker bee phases, bees exploit the sources by local searches in the
neighbourhood of the solutions selected based on deterministic selection in
the employed bee phase and the probabilistic selection in the onlooker bee
phase. In the scout bee phase which is an analogy of abandoning exhausted
food sources in the foraging process, solutions that are not beneficial any
more for search progress are abandoned, and new solutions are inserted instead of them to explore new regions in the search space. The algorithm has
a well-balanced exploration and exploitation ability.
• Particle swarm optimization
PSO is a global optimization algorithm for dealing with problems in which
a best solution can be represented as a point or surface in an n-dimensional
space. Hypotheses are plotted in this space and seeded with an initial velocity, as well as a communication channel between the particles. Particles then
move through the solution space, and are evaluated according to some fitness
criterion after each time-step. Over time, particles are accelerated towards
those particles within their communication grouping which have better fitness values. The main advantage of such an approach over other global minimization strategies such as simulated annealing is that the large number of
members that make up the particle swarm make the technique impressively
resilient to the problem of local minima.
3.5.3
Case Study: Ant Colony Optimization applied to the NP-hard
Travelling Salesman Problem
The travelling salesman problem (TSP) consists of finding the shortest route in a complete weighted graph G with n nodes and n(n-1) arcs, so that the start node and
the end node are identical and all other nodes in this tour are visited exactly once.
We apply the Ant Colony[12] heuristic to obtain an approximate solution to the
problem. We use virtual ants to traverse the graph and discover paths for us. Their
movement depends on the amount of pheromone on the graph edges. We assume
the existence of an internal memory for each ant. In symbols, what we have is:
• Complete weighted graph G = (N, A)
• N = set of nodes representing the cities
• A = set of arcs
• Each arc (i, j) in A is assigned a value (length) dij , which is the distance
between cities i and j.
Tour Construction
$\tau_{ij}$ refers to the desirability of visiting city j directly after city i. Heuristic information is chosen as $\eta_{ij} = 1/d_{ij}$.
We apply the following constructive procedure to each ant:
1. Choose, according to some criterion, a start city at which the ant is positioned;
2. Use pheromone and heuristic values to probabilistically construct a tour by
iteratively adding cities that the ant has not visited yet, until all cities have
been visited;
3. Go back to the initial city;
4. After all ants have completed their tour, they may deposit pheromone on the
tours they have followed.
Continue for a fixed number of iterations or till the pheromone distribution becomes almost constant.
Ant System
The Ant System (proposed in 1991) uses the following heuristics and formulae for
probability propagation
• Initialize the pheromone trails with a value slightly higher than the expected
amount of pheromone deposited by the ants in one iteration; a rough estimate
of this value can be obtained by setting
$$\tau_{ij} = \tau_0 = \frac{m}{C^{nn}}$$
where m is the number of ants, and $C^{nn}$ is the length of a tour generated by
the nearest-neighbour heuristic.
• In AS, these m artificial ants concurrently build a tour of the TSP.
• Initially, put ants on randomly chosen cities. At each construction step, ant
k applies a probabilistic action choice rule, called random proportional rule,
to decide which city to visit next.
$$p_{ij}^{k} = \frac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{l \in N_i^k} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}}, \quad \text{if } j \in N_i^k$$
• Each ant k maintains a memory $M^k$ which contains the cities already visited,
in the order they were visited. This memory is used to define the feasible
neighbourhood $N_i^k$ in the construction rule.
• Either of the following two schemes can be adopted. Parallel implementation: at each
construction step all ants move from their current city to the next one. Sequential implementation: one ant builds a complete tour before the next one starts to build another.
Update of Pheromone Trails
• Forget bad decisions:
$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} \quad \forall i, j, \quad \text{where } 0 < \rho \le 1$$
• So, if an arc is not chosen by the ants, its pheromone value decreases exponentially.
• $\Delta\tau_{ij}^{k}$ is the amount of pheromone ant k deposits on the arcs it has visited and
$C^k$ is the length of the tour $T^k$ built by the k-th ant. They are related as
follows:
$$\Delta\tau_{ij}^{k} = \begin{cases} 1/C^k, & \text{if arc } (i, j) \text{ belongs to tour } T^k \\ 0, & \text{otherwise} \end{cases}$$
• The update then happens as follows:
$$\tau_{ij} \leftarrow \tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k}, \quad \forall (i, j)$$
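The tour construction and pheromone update rules of the Ant System can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments reported here: it uses the sequential scheme, assumes a symmetric distance matrix (so pheromone is deposited in both directions), and the function names, the default parameter values and the four-city test instance are all hypothetical.

```python
import math
import random

def tour_length(tour, dist):
    """Length of the closed tour, including the return to the start city."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def nearest_neighbour_length(dist):
    """C^nn: length of the tour built by the nearest-neighbour heuristic."""
    n, tour = len(dist), [0]
    while len(tour) < n:
        i = tour[-1]
        tour.append(min((j for j in range(n) if j not in tour),
                        key=lambda j: dist[i][j]))
    return tour_length(tour, dist)

def ant_system(dist, m=10, alpha=1.0, beta=5.0, rho=0.5, iterations=50, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    tau0 = m / nearest_neighbour_length(dist)          # tau_0 = m / C^nn
    tau = [[tau0] * n for _ in range(n)]
    eta = [[0.0 if i == j else 1.0 / dist[i][j] for j in range(n)]
           for i in range(n)]                          # eta_ij = 1 / d_ij
    best_tour, best_len = None, math.inf
    for _ in range(iterations):
        tours = []
        for _ant in range(m):
            tour = [rng.randrange(n)]                  # random start city
            while len(tour) < n:
                i = tour[-1]
                feasible = [j for j in range(n) if j not in tour]   # N_i^k
                weights = [tau[i][j] ** alpha * eta[i][j] ** beta
                           for j in feasible]
                # random proportional rule
                tour.append(rng.choices(feasible, weights=weights)[0])
            tours.append(tour)
        for i in range(n):                             # evaporation
            for j in range(n):
                tau[i][j] *= (1.0 - rho)
        for tour in tours:                             # deposit 1 / C^k
            length = tour_length(tour, dist)
            if length < best_len:
                best_tour, best_len = tour, length
            for a in range(n):
                i, j = tour[a], tour[(a + 1) % n]
                tau[i][j] += 1.0 / length
                tau[j][i] += 1.0 / length              # symmetric instance assumed
    return best_tour, best_len

# Hypothetical instance: four cities on the corners of a unit square.
cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
dist = [[math.hypot(ax - bx, ay - by) for (bx, by) in cities] for (ax, ay) in cities]
best_tour, best_len = ant_system(dist)
print(best_tour, best_len)
```

On such a small instance the ants are expected to settle on the perimeter of the square; larger instances need more ants and iterations, as the experiments below indicate.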
Computational Experiments: As an experiment, the problem of 32 cities in Slovakia was solved using ACO. The optimal solution to that problem is a route of length 1453 km. The parameters were α = 1, β = 5, and the number of iterations
was set to 1000.
With m = 1000, the result was a tour of length 1621 km, found in the 34th iteration
(a difference of 11.56% from the optimal route).
Figure 3.5: Search process for m=1000 ants
With m = 5000, ACO finds a tour of length 1532 km in the 21st
iteration (a difference of 5.44% from the optimal route).
Figure 3.6: Search process for m=5000 ants
3.6
Boltzmann Machines
One of the first examples of a neural network capable of learning internal representations, Boltzmann machines³ are able to represent and (given sufficient time) solve
difficult combinatoric problems. They are named after the Boltzmann distribution
in statistical mechanics, which is used in their sampling function.
Figure 3.7: Graphical representation for a Boltzmann machine with a few labelled
weights
3.6.1
Structure
A Boltzmann machine is a network of stochastic units with an energy defined for
the network. The global energy E in a Boltzmann machine is:
$$E = -\left( \sum_{i<j} w_{ij} s_i s_j + \sum_i \theta_i s_i \right)$$
where $w_{ij}$ is the connection strength between unit j and unit i; $s_i \in \{0, 1\}$ is the
state of unit i; $\theta_i$ is the bias of unit i in the global energy function.
³ Content and figure from http://en.wikipedia.org/wiki/Boltzmann_machine
The connections in a Boltzmann machine have two restrictions:
• $w_{ii} = 0 \;\; \forall i$. (No unit has a connection with itself.)
• $w_{ij} = w_{ji} \;\; \forall i, j$. (All connections are symmetric.)
3.6.2
Probability of a state
The difference in the global energy that results from a single unit i being 0 (off)
versus 1 (on), written $\Delta E_i$, is given by:
$$\Delta E_i = \sum_j w_{ij} s_j + \theta_i$$
This can be expressed as the difference of energies of two states:
∆Ei = Ei=off − Ei=on
We then substitute the energy of each state with its relative probability according
to the Boltzmann Factor (the property of a Boltzmann distribution that the energy
of a state is proportional to the negative log probability of that state):
∆Ei = −kB T ln(pi=off ) − (−kB T ln(pi=on ))
where kB is Boltzmann’s constant and is absorbed into the artificial notion of temperature T . We then rearrange terms and consider that the probabilities of the unit
being on and off must sum to one:
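$$\begin{aligned}
\frac{\Delta E_i}{T} &= \ln(p_{i=\text{on}}) - \ln(p_{i=\text{off}}) \\
\frac{\Delta E_i}{T} &= \ln(p_{i=\text{on}}) - \ln(1 - p_{i=\text{on}}) \\
\frac{\Delta E_i}{T} &= \ln\left(\frac{p_{i=\text{on}}}{1 - p_{i=\text{on}}}\right) \\
-\frac{\Delta E_i}{T} &= \ln\left(\frac{1 - p_{i=\text{on}}}{p_{i=\text{on}}}\right) \\
-\frac{\Delta E_i}{T} &= \ln\left(\frac{1}{p_{i=\text{on}}} - 1\right) \\
\exp\left(-\frac{\Delta E_i}{T}\right) &= \frac{1}{p_{i=\text{on}}} - 1
\end{aligned}$$
We can now solve for $p_{i=\text{on}}$, the probability that the ith unit is on.
$$p_{i=\text{on}} = \frac{1}{1 + \exp\left(-\frac{\Delta E_i}{T}\right)}$$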
where the scalar T is referred to as the temperature of the system. This relation is
the source of the logistic function found in probability expressions in variants of
the Boltzmann machine.
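The relations above can be checked numerically. The sketch below evaluates the global energy, the energy gap of one unit and the resulting logistic probability for a small network; the weights, biases and function names are made up for illustration.

```python
import math

def global_energy(w, theta, s):
    """E = -(sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i)."""
    n = len(s)
    pair = sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    bias = sum(theta[i] * s[i] for i in range(n))
    return -(pair + bias)

def energy_gap(w, theta, s, i):
    """Delta E_i = sum_j w_ij s_j + theta_i (w_ii = 0, so s_i does not matter)."""
    return sum(w[i][j] * s[j] for j in range(len(s))) + theta[i]

def p_on(delta_e, temperature):
    """Probability that a unit is on, given its energy gap and temperature."""
    return 1.0 / (1.0 + math.exp(-delta_e / temperature))

# Hypothetical symmetric weights with zero diagonal, and biases.
w = [[0.0, 1.5, -0.5],
     [1.5, 0.0, 2.0],
     [-0.5, 2.0, 0.0]]
theta = [0.1, -0.3, 0.2]

gap = energy_gap(w, theta, [1, 0, 1], 1)
e_off = global_energy(w, theta, [1, 0, 1])   # unit 1 off
e_on = global_energy(w, theta, [1, 1, 1])    # unit 1 on
print(gap, e_off - e_on, p_on(gap, 1.0))
```

The printed gap agrees with the difference of the two global energies, as the derivation requires.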
3.6.3
Equilibrium State
The network is run by repeatedly choosing a unit and setting its state according
to the above formula. After running for long enough at a certain temperature, the
probability of a global state of the network will depend only upon that global state’s
energy, according to a Boltzmann distribution. This means that log-probabilities of
global states become linear in their energies. This relationship is true when the machine is at thermal equilibrium, meaning that the probability distribution of global
states has converged. If we start running the network from a high temperature, and
gradually decrease it until we reach a thermal equilibrium at a low temperature, we
may converge to a distribution where the energy level fluctuates around the global
minimum. This process is called simulated annealing.
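A rough sketch of this procedure is given below: units are chosen at random and resampled with the logistic probability while the temperature is lowered geometrically. The cooling schedule, the parameter values and the tiny network are assumptions made for illustration; the sketch makes no claim about reaching the global minimum in general.

```python
import math
import random

def p_on(delta_e, temperature):
    """Numerically stable form of 1 / (1 + exp(-delta_e / T))."""
    z = delta_e / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def energy(w, theta, s):
    n = len(s)
    return -(sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
             + sum(theta[i] * s[i] for i in range(n)))

def anneal(w, theta, t_start=10.0, t_end=0.05, cooling=0.95, sweeps=20, seed=0):
    """Run the network while lowering the temperature (assumed geometric schedule)."""
    rng = random.Random(seed)
    n = len(theta)
    s = [rng.randint(0, 1) for _ in range(n)]
    t = t_start
    while t > t_end:
        for _ in range(sweeps * n):
            i = rng.randrange(n)                       # choose a unit
            gap = sum(w[i][j] * s[j] for j in range(n)) + theta[i]
            s[i] = 1 if rng.random() < p_on(gap, t) else 0
        t *= cooling                                   # cool down
    return s

w = [[0.0, 1.5, -0.5],
     [1.5, 0.0, 2.0],
     [-0.5, 2.0, 0.0]]
theta = [0.1, -0.3, 0.2]
state = anneal(w, theta)
print(state, energy(w, theta, state))
```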
Chapter 4
Some popular Quantum
Computing Ideas
A quantum algorithm is a step-by-step procedure such that each of the steps can
be performed on a quantum computer. Quantum computers can execute algorithms
that sometimes dramatically outperform classical computation. The best-known
example of this is Shor’s discovery of an efficient quantum algorithm for factoring
integers, whereas the same problem appears to be intractable on classical computers. Understanding what other computational problems can be solved significantly
faster using quantum algorithms is one of the major challenges in the theory of
quantum computation. To gain some insight into this question, we study a few
of the existing quantum algorithms.
The first among them is the Deutsch-Jozsa algorithm used to determine the
nature of a function, followed by Shor’s algorithm for factoring integers and then
Grover’s algorithm which efficiently searches for an element in an unsorted database.
4.1
Deutsch-Jozsa Algorithm
4.1.1
Problem Statement
In the Deutsch-Jozsa problem, we are given a black box quantum computer known
as an oracle that implements the function $f : \{0, 1\}^n \to \{0, 1\}$. In layman’s terms,
it takes n-digit binary values as input and produces either a 0 or a 1 as output for
each such value. We are promised that the function is either constant (0 on all
inputs or 1 on all inputs) or balanced (returns 1 for half of the input domain and 0
for the other half); the task then is to determine if f is constant or balanced by using
the oracle.
4.1.2
Motivation and a Classical Approach
The Deutsch-Jozsa problem[1] is specifically designed to be easy for a quantum
algorithm and hard for any deterministic classical algorithm. The motivation is to
show a black box problem that can be solved efficiently by a quantum computer
with no error, whereas a deterministic classical computer would need exponentially
many queries to the black box to solve the problem.
For a conventional deterministic algorithm, where n is the number of bits,
$2^{n-1} + 1$ evaluations of f will be required in the worst case. To prove that f is
constant, just over half the set of inputs must be evaluated and their outputs found
to be identical.
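The classical query bound can be illustrated with a short sketch. It queries inputs in lexicographic order and stops as soon as two outputs differ or once $2^{n-1} + 1$ identical outputs have been seen; the function name and the example functions are illustrative, and f is assumed to satisfy the promise.

```python
from itertools import product

def classify_classically(f, n):
    """Return ('constant' or 'balanced', number of queries made to f)."""
    limit = 2 ** (n - 1) + 1          # just over half of the 2^n inputs
    first = None
    queries = 0
    for bits in product((0, 1), repeat=n):
        value = f(bits)
        queries += 1
        if first is None:
            first = value
        elif value != first:
            return "balanced", queries
        if queries == limit:          # more than half agree, so f is constant
            return "constant", queries
    return "constant", queries

n = 3
print(classify_classically(lambda x: 0, n))      # constant function
print(classify_classically(lambda x: x[0], n))   # balanced, worst case for this order
print(classify_classically(lambda x: x[-1], n))  # balanced, detected early
```

The second example needs the full $2^{n-1} + 1 = 5$ queries because the first four inputs all have leading bit 0.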
4.1.3
The Deutsch Quantum Algorithm
1. The algorithm begins with the n + 1 qubit state $|0\rangle^{\otimes n}|1\rangle$. That is, the first n
qubits are each in the state $|0\rangle$ and the final one is $|1\rangle$.
2. Apply a Hadamard transformation to each qubit to obtain the state
$$\frac{1}{\sqrt{2^{n+1}}} \sum_{x=0}^{2^n-1} |x\rangle (|0\rangle - |1\rangle).$$
3. We have the function f implemented as a quantum oracle. The oracle maps
the state $|x\rangle|y\rangle$ to $|x\rangle|y \oplus f(x)\rangle$, where ⊕ is addition modulo 2.
4. Applying the quantum oracle gives
$$\frac{1}{\sqrt{2^{n+1}}} \sum_{x=0}^{2^n-1} |x\rangle (|f(x)\rangle - |1 \oplus f(x)\rangle).$$
5. For each x, f(x) is either 0 or 1. A quick check of these two possibilities
yields
$$\frac{1}{\sqrt{2^{n+1}}} \sum_{x=0}^{2^n-1} (-1)^{f(x)} |x\rangle (|0\rangle - |1\rangle).$$
6. At this point, ignore the last qubit. Apply a Hadamard transformation to each
remaining qubit to obtain
$$\frac{1}{2^n} \sum_{x=0}^{2^n-1} (-1)^{f(x)} \sum_{y=0}^{2^n-1} (-1)^{x \cdot y} |y\rangle = \frac{1}{2^n} \sum_{y=0}^{2^n-1} \left[ \sum_{x=0}^{2^n-1} (-1)^{f(x)} (-1)^{x \cdot y} \right] |y\rangle$$
where $x \cdot y = x_0 y_0 \oplus x_1 y_1 \oplus \cdots \oplus x_{n-1} y_{n-1}$ is the sum of the bitwise product.
7. Finally we examine the probability of measuring $|0\rangle^{\otimes n}$,
$$\left| \frac{1}{2^n} \sum_{x=0}^{2^n-1} (-1)^{f(x)} \right|^2,$$
which evaluates to 1 if f(x) is constant (constructive interference) and 0 if
f(x) is balanced (destructive interference).
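The final amplitudes from steps 6 and 7 can be evaluated directly on a classical machine for small n. The sketch below is only a numerical check of those formulas, not a simulation of the quantum circuit; the function names and test functions are illustrative.

```python
from itertools import product

def final_amplitude(f, n, y):
    """Amplitude of |y> after step 6: (1/2^n) sum_x (-1)^(f(x) + x.y)."""
    total = 0
    for x in product((0, 1), repeat=n):
        dot = sum(xi * yi for xi, yi in zip(x, y)) % 2   # bitwise product, mod 2
        total += (-1) ** (f(x) + dot)
    return total / 2 ** n

def probability_all_zeros(f, n):
    """Probability of measuring |0...0> in step 7."""
    return final_amplitude(f, n, (0,) * n) ** 2

constant_one = lambda x: 1
parity = lambda x: x[0] ^ x[1]        # a balanced function on 2 bits

print(probability_all_zeros(constant_one, 2))   # constant case
print(probability_all_zeros(parity, 2))         # balanced case
print(final_amplitude(parity, 2, (1, 1)))       # where the amplitude ends up
```

For the parity function all of the amplitude is concentrated on y = (1, 1), so the all-zeros outcome is never observed.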
The Deutsch-Jozsa algorithm provided inspiration for Shor’s algorithm and Grover’s
algorithm, two of the most revolutionary quantum algorithms, which are described
now.
4.2
Shor’s Algorithm
Shor’s algorithm, given in 1994 by mathematician Peter Shor, is an algorithm for
integer factorization. On a quantum computer, Shor’s algorithm runs in polynomial time. First, we describe the problem of factorization more formally followed
by an overview of some mathematical concepts required to understand the algorithm. The familiar reader can skip these subsections and continue reading from
the subsection ’Reduction of the Factorization problem’.
4.2.1
The factorization problem
The factorization problem definition is given below.
Problem Definition: Given an integer n, factorize n as a product of primes.
Typically the integer n is very large (a few hundred digits long). Hence the
brute force approach of checking whether each number between 2 and n − 1 is a
factor of n, which takes time exponential in the number of digits of n, is not efficient and it can take
many years for the computation to finish. In fact, there is no known classical algorithm that can factorize n in polynomial time. This limitation is exploited
by the famous Rivest-Shamir-Adleman encryption scheme (RSA).
We will assume (both for simplicity and with a view to RSA cryptanalysis) that
n = pq where p and q are large unknown primes. We must determine p and q.
4.2.2
The integers mod n
Let R = {0, 1, 2, . . . , n − 1} with addition and multiplication modulo n. For a, b ∈
R we compute a + b mod n and ab mod n by first computing the sum or product
as an ordinary integer, then taking the remainder upon division by n.
These operations are easily performed in polynomial time in the input size
l = log(n) using a classical logical circuit of size polynomial in l. For x ∈ R and
a ≥ 0, the value of $x^a \bmod n$ can also be determined in polynomial time and space
via the square-and-multiply algorithm, which is described in brief below.
4.2.3
A fast classical algorithm for modular exponentiation
The method is based on the following observation:
$$x^a = \begin{cases} x\,(x^2)^{\frac{a-1}{2}}, & \text{if } a \text{ is odd} \\ (x^2)^{\frac{a}{2}}, & \text{if } a \text{ is even.} \end{cases} \tag{4.1}$$
Now, due to the modular nature of squaring, the number of digits of $x^2 \bmod n$ is
limited by the length of n. We compute $x^a$ by repeated squaring, taking the result
modulo n each time before proceeding to the next iteration, which gives rise to the
following recursive algorithm for exponentiation.
Function exp-by-squaring(x,a)
if a<0 then return exp-by-squaring(1/x, -a);
else if a=0 then return 1;
else if a=1 then return x;
else if a is even then return exp-by-squaring(x*x, a/2);
else if a is odd then return x*exp-by-squaring(x*x, (a-1)/2).
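A runnable version of the recursion, with every product reduced modulo n as the text describes, might look as follows. The name `mod_exp` is illustrative, and only non-negative exponents are handled, since those are the only ones needed here.

```python
def mod_exp(x: int, a: int, n: int) -> int:
    """Compute x**a mod n by repeated squaring (a >= 0)."""
    if a < 0:
        raise ValueError("only non-negative exponents are supported in this sketch")
    if a == 0:
        return 1 % n
    if a % 2 == 0:
        # x^a = (x^2)^(a/2)
        return mod_exp((x * x) % n, a // 2, n)
    # x^a = x * (x^2)^((a-1)/2)
    return (x * mod_exp((x * x) % n, (a - 1) // 2, n)) % n

print(mod_exp(13, 10, 55))
print(mod_exp(13, 8191, 55))
```

The recursion depth is about log2(a), so even exponents with hundreds of digits need only a few thousand multiplications. Python's built-in `pow(x, a, n)` computes the same value.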
4.2.4
Reduction of the Factorization problem
Using randomization, factorization can be reduced to finding the order of an element in the multiplicative group (mod n), where the order r of x is the smallest r ≥ 1
such that $x^r \bmod n$ is 1.
Suppose we choose x randomly from {2, . . . , n − 1} and find the order r of x
with respect to n. Then, if r is even,
$$(x^{r/2} - 1)(x^{r/2} + 1) = x^r - 1 \equiv 0 \pmod n$$
Now consider $\gcd(x^{r/2} - 1, n)$. This fails to be a non-trivial divisor of n only if
$x^{r/2} \equiv -1 \pmod n$ or r is odd. This procedure, when applied to a random x (mod
n), yields a factor of n with probability at least $1 - \frac{1}{2^{k-1}}$, where k is the number of
distinct odd prime factors of n. We will accept this statement without proof.
It can be seen that the above probability is at least $\frac{1}{2}$ if k ≥ 2. If k = 1, implying that n
has only one odd prime factor, it can be easily factored in polynomial time using
classical algorithms. (Reference here)
Shor’s algorithm finds the factors of n indirectly by first choosing a random
x and then finding the order r of x with respect to n. Then it finds $\gcd(x^{r/2} - 1, n)$,
which will be a factor of n with high probability. It continues doing the same until
n has been completely factorized. The algorithm requires a quantum computer
only for finding the order of x in polynomial time. This part of the algorithm is
presented next.
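The classical part of this reduction can be sketched as follows. The order is found here by brute force, which takes exponential time in general and merely stands in for the quantum order-finding step; the function names are illustrative.

```python
from math import gcd

def order(x: int, n: int) -> int:
    """Smallest r >= 1 with x**r % n == 1 (brute force; needs gcd(x, n) == 1)."""
    r, value = 1, x % n
    while value != 1:
        value = (value * x) % n
        r += 1
    return r

def factor_from_order(x: int, n: int):
    """Try to split n using the order of x; return a factor or None on failure."""
    common = gcd(x, n)
    if common != 1:
        return common                 # lucky guess: x already shares a factor
    r = order(x, n)
    if r % 2 == 1:
        return None                   # odd order: choose another x
    half = pow(x, r // 2, n)
    if half == n - 1:
        return None                   # x^(r/2) = -1 (mod n): choose another x
    return gcd(half - 1, n)

print(order(13, 55), factor_from_order(13, 55))
```

When `None` is returned, the procedure is simply repeated with a fresh random x.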
4.2.5
The Algorithm
We present only the quantum part of the algorithm in this section. The complete
algorithm is presented at the end of the section. The algorithm uses two quantum
registers which hold integers represented in binary and some additional workspace.
1. Find q such that $q = 2^l$ for some integer l and $n^2 \le q < 2n^2$. In a quantum
gate array we need not even keep the values of n, x and q in memory, as they
can be built into the structure of the gate array.
2. Next, the first register is put in the uniform superposition of states representing numbers a (mod q). This leaves the registers in the following state.
$$\frac{1}{q^{1/2}} \sum_{a=0}^{q-1} |a\rangle|0\rangle. \tag{4.2}$$
3. Next $x^a \bmod n$ is computed using the square-and-multiply algorithm. This
can be done reversibly. This leaves our registers in the following state.
$$\frac{1}{q^{1/2}} \sum_{a=0}^{q-1} |a\rangle|x^a \ (\mathrm{mod}\ n)\rangle. \tag{4.3}$$
4. Then the Fourier transform is performed on the first register, as described in
Chapter 2, which maps $|a\rangle$ to
$$\frac{1}{q^{1/2}} \sum_{c=0}^{q-1} \exp(2\pi i a c/q)\,|c\rangle. \tag{4.4}$$
This leaves the registers in the following state
$$\frac{1}{q} \sum_{a=0}^{q-1} \sum_{c=0}^{q-1} \exp(2\pi i a c/q)\,|c\rangle|x^a \ (\mathrm{mod}\ n)\rangle. \tag{4.5}$$
5. Finally we observe the system. We now compute the probability that our machine ends in a particular state $|c, x^k \bmod n\rangle$, where 0 ≤ k < r. Summing
up over all possible ways to reach this state, this probability is
$$\left| \frac{1}{q} \sum_{a:\, x^a \equiv x^k} \exp(2\pi i a c/q) \right|^2$$
Since the order is r, this sum is over all a such that a ≡ k (mod r). Therefore, the above sum can be written as
$$\left| \frac{1}{q} \sum_{b=0}^{\lfloor (q-k-1)/r \rfloor} \exp(2\pi i (br + k) c/q) \right|^2$$
Since exp(2πikc/q) factors out of the sum and has magnitude 1, we drop
it. Shor then performs an estimation analysis of the remaining part of the
expression and derives the
following lemma, which we present without proof.
Lemma 1. The probability of seeing a given state $|c, x^k \ (\mathrm{mod}\ n)\rangle$ is at least
$\frac{1}{3r^2}$ if there is a d such that
$$-\frac{r}{2} \le rc - dq \le \frac{r}{2}. \tag{4.6}$$
Next, Shor proceeds to prove that the probability of obtaining r via the above
algorithm is at least $\frac{\delta}{\log\log r}$ for some constant δ. We will accept the above statement without
proof. Hence by repeating the experiment O(log log r) times, we are assured
of a high probability of success.
4.2.6
An example factorization
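We show the running of Shor’s algorithm on the factorization of n = 55. The condition $n^2 \le q < 2n^2$
with $q = 2^l$ gives the range 3025 ≤ q < 6050, which contains $2^{12} = 4096$; the worked example below uses the larger register size $q = 2^{13} = 8192$. Suppose we choose x = 13. The running of the
algorithm on this input is described below.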
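1. We initialize the state to be a superposition of states representing a
(mod 8192).
$$|\psi\rangle = \frac{1}{\sqrt{8192}} \left( |0, 0\rangle + |1, 0\rangle + \ldots + |8191, 0\rangle \right)$$
2. Next the modular exponentiation gate is applied.
$$\begin{aligned}
|\psi\rangle &= \frac{1}{\sqrt{8192}} \left( |0, 1\rangle + |1, 13\rangle + |2, 13^2 \bmod 55\rangle + \ldots + |8191, 13^{8191} \bmod 55\rangle \right) \\
&= \frac{1}{\sqrt{8192}} \left( |0, 1\rangle + |1, 13\rangle + |2, 4\rangle + \ldots + |8191, 2\rangle \right)
\end{aligned}$$
3. Next we perform the Fourier transform on the first register.
$$|\psi\rangle = \frac{1}{8192} \sum_{a=0}^{8191} \sum_{c=0}^{8191} \exp(2\pi i a c/8192)\,|c\rangle|13^a \ (\mathrm{mod}\ 55)\rangle.$$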
4. Now we observe the registers. Register 2 can be in any of its possible states with
almost equal probability: all powers of x mod 55 are almost equally likely to
be observed if r ≪ q. Suppose we observe 28 as a power of x mod 55.
This occurs 410 times in the series as a varies from 0 to 8191. Then the
probability of observing register 1 to be in state c is
$$\Pr(c) = \frac{1}{8192} \cdot \frac{1}{410} \left| \sum_{d=0}^{409} \exp(2\pi i\, r d c/8192) \right|^2$$
Here r = 20. Among the states which can be observed with reasonably high
probability is $|4915\rangle$, which is observed with a probability of 4.4%.
5. Now $\frac{c}{q} = \frac{4915}{8192}$. Shor’s algorithm uses the method of continued fractions
to find d/r from c/q. Applying it here would give us r to be a multiple of
$r_1 = 5$, and on trying $r_1, 2r_1, \ldots, \lfloor \log(n)^{1+\epsilon} \rfloor r_1$ as values for r we are
guaranteed to find r with a very high probability. Here, we find that r = 20.
6. Now, the algorithm uses the Euclidean algorithm to find the factors of 55.
$$m = 13^{20/2} \bmod 55 = 13^{10} \bmod 55 = 34$$
and the factors of n = 55 are
$$\gcd(m + 1, 55) = \gcd(35, 55) = 5$$
$$\gcd(m - 1, 55) = \gcd(33, 55) = 11$$
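The numbers in this example can be reproduced classically. The sketch below counts the occurrences of 28, evaluates the probability of observing c = 4915 given that register 2 showed 28, and then recovers r and the factors. `Fraction.limit_denominator` from the Python standard library stands in for the continued-fraction step, and candidate multiples of $r_1$ are simply tried in increasing order, which is a simplification of the bound quoted above.

```python
import cmath
from fractions import Fraction
from math import gcd

n, q, x = 55, 8192, 13

# All exponents a in [0, q) with 13^a = 28 (mod 55).
hits = [a for a in range(q) if pow(x, a, n) == 28]

def prob(c: int) -> float:
    """Probability of reading c in register 1, given 28 in register 2."""
    amplitude = sum(cmath.exp(2j * cmath.pi * a * c / q) for a in hits)
    return abs(amplitude) ** 2 / (q * len(hits))

p_4915 = prob(4915)

# Continued-fraction step: best approximation of c/q with denominator < n.
estimate = Fraction(4915, q).limit_denominator(n - 1)
r1 = estimate.denominator

# Try r1, 2*r1, 3*r1, ... until the order is found.
r = next(k * r1 for k in range(1, n) if pow(x, k * r1, n) == 1)

m = pow(x, r // 2, n)
factors = (gcd(m + 1, n), gcd(m - 1, n))
print(len(hits), round(p_4915, 4), estimate, r, m, factors)
```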
4.3
Grover’s Algorithm
The Grover algorithm, given by Lov Grover in 1996, solves the problem of searching for an element in an unsorted database with N entries in $O(N^{1/2})$ time. Note
that with classical computation models this problem cannot be solved in less than
linear time (O(N)).
4.3.1
The search problem
Assume $N = 2^n$. Suppose that we have a function f(x) from $\{0, 1\}^n$ to {0, 1} which
is zero on all inputs except for a single (marked) item $x_0$: $f(x) = \delta_{x,x_0}$. By querying this function we wish to find the marked item $x_0$. If we have no information
about the particular $x_0$, then finding this marked item is very difficult. In the worst
case it will take $2^n - 1$ queries to find $x_0$ for a deterministic algorithm. In general,
if the search problem has M solutions, then the classical algorithm might take as
many as $2^n - M$ steps.
For large N, the Grover algorithm could yield very large performance increases.
The key idea is that although finding a solution to the search problem is hard,
recognising a solution is easy. We wish to search through a list of N elements; let us
index the elements by x ∈ {0, . . . , N − 1} and call them $y_x$.
4.3.2
The Oracle
Rather than dealing with the list itself, we focus on the index of the list, x. Given
some value of x, we can tell whether yx solves the search problem. We assume
that we can construct some device to tell us if yx solves the search problem. This
device is called an Oracle.
• The Oracle takes as input an index value in a qubit register $|x\rangle$. It also takes
a single Oracle qubit, $|q\rangle$. The state given to the Oracle is thus $|\psi\rangle = |x\rangle|q\rangle$.
• The Oracle is represented by a unitary operator, O. We set f(x) = 1 if x indexes a solution
to the search problem, and f(x) = 0 if it doesn’t index a
solution.
• If f(x) = 1, the Oracle flips the state of $|q\rangle$. We write this as $O|x\rangle|q\rangle = |x\rangle X^{f(x)}|q\rangle$. X is just our quantum NOT operator.
• So if f(x) = 1, $|q\rangle \mapsto X|q\rangle$, else $|q\rangle \mapsto |q\rangle$.
• We choose to initially program $|q\rangle = \frac{1}{\sqrt{2}}(|0\rangle - |1\rangle)$. Then $X|q\rangle = \frac{1}{\sqrt{2}}(-|0\rangle + |1\rangle) = -|q\rangle$. And $O|x\rangle|q\rangle = (-1)^{f(x)}|x\rangle|q\rangle$.
• The Oracle therefore takes $|x\rangle \mapsto (-1)^{f(x)}|x\rangle$. So the term indexing the
solution is marked with a − sign.
The Oracle does not find the solution to the problem, it simply recognises the
answer when presented with one. The key to quantum search is that we can look
at all solutions simultaneously: the Oracle just manipulates the state coefficients
using a unitary operator.
4.3.3
The Grover Iteration
1. Begin with $|x\rangle = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} |j\rangle$.
2. Apply the Oracle to $|x\rangle$:
$$|x\rangle \mapsto \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} (-1)^{f(j)} |j\rangle$$
3. Apply the QFT to $|x\rangle$.
4. Reverse the sign of all terms in $|x\rangle$ except for the term $|0\rangle$.
5. Apply the Inverse QFT.
6. Return to step 2 and repeat.
4.3.4
Performance of the algorithm
The point at which we terminate Grover’s algorithm and measure the result is critical. This is because the probability associated with the correct state rises to a value close to 1
after a certain number of iterations and then oscillates periodically between the
two extremes, 0 and 1. It has been shown that the optimum number of iterations is
$\sim \frac{\pi}{4}\sqrt{\frac{N}{M}}$, where M is the number of solutions. It has also been shown that this is
the best that any quantum search algorithm can do.
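The oscillation can be made concrete with the standard closed form for the success probability after k iterations, $\sin^2((2k+1)\theta)$ with $\sin\theta = \sqrt{M/N}$. This expression is the usual textbook result and is not derived in this report; the function names below are illustrative.

```python
import math

def optimal_iterations(N: int, M: int = 1) -> int:
    """Number of Grover iterations, roughly (pi/4) * sqrt(N/M), rounded down."""
    return math.floor(math.pi / 4 * math.sqrt(N / M))

def success_probability(N: int, M: int, k: int) -> float:
    """Probability of measuring a solution after k iterations."""
    theta = math.asin(math.sqrt(M / N))
    return math.sin((2 * k + 1) * theta) ** 2

k = optimal_iterations(1024)
print(k, success_probability(1024, 1, k))
print(success_probability(1024, 1, 2 * k))   # running too long lowers the probability
```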
4.3.5
An example
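Apply Grover’s algorithm to N = 4 with solution x = 2.
• We start with $|x\rangle = \frac{1}{2}(|0\rangle + |1\rangle + |2\rangle + |3\rangle)$.
• Apply the Oracle: $|x\rangle \mapsto \frac{1}{2}(|0\rangle + |1\rangle - |2\rangle + |3\rangle)$.
• Apply the QFT: $F|x\rangle = \frac{1}{2}(|0\rangle + |1\rangle - |2\rangle + |3\rangle)$.
• Flip the signs of all terms except $|0\rangle$: $F|x\rangle \mapsto \frac{1}{2}(|0\rangle - |1\rangle + |2\rangle - |3\rangle)$.
• Inverse QFT: $|x\rangle = |2\rangle$.
• So when we measure $|x\rangle$, we are guaranteed the right answer.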
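The example can be verified with a small state-vector simulation that follows the steps of the Grover iteration literally, using a dense discrete Fourier transform for the QFT. This is a classical check for small N only; the function names are illustrative.

```python
import cmath
import math

def qft(state, inverse=False):
    """Dense quantum Fourier transform of an amplitude list (O(N^2) time)."""
    n = len(state)
    sign = -1 if inverse else 1
    return [sum(state[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j in range(n)) / math.sqrt(n)
            for k in range(n)]

def grover_iteration(state, marked):
    # Oracle: flip the sign of the marked terms.
    state = [-a if j in marked else a for j, a in enumerate(state)]
    # QFT, sign reversal of everything except |0>, inverse QFT.
    state = qft(state)
    state = [a if k == 0 else -a for k, a in enumerate(state)]
    return qft(state, inverse=True)

N = 4
state = [1 / math.sqrt(N)] * N              # uniform superposition
state = grover_iteration(state, marked={2})
probabilities = [abs(a) ** 2 for a in state]
print([round(p, 6) for p in probabilities])
```

The middle three steps together implement the inversion about the mean, so the same code works for other values of N, where several iterations are needed.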
4.4
The Quantum Minimum Algorithm
We now present a quantum algorithm for finding the minimum value among a given
set of numbers. This algorithm is faster than the fastest possible classical algorithm
and, as usual, is probabilistic. This algorithm uses the Grover search algorithm
repeatedly to find the minimum with a high probability. First, we formally present
the problem with notation and then we present the algorithm.
4.4.1
The Problem
Let T [0..N − 1] be an unsorted table of N items, each holding a value from an
ordered set. The minimum searching problem is to find the index y such that T [y]
is minimum. This clearly requires a linear number of probes on a classical probabilistic Turing machine. [16] gave a simple quantum algorithm which solves the
problem using $O(N^{1/2})$ probes. The algorithm makes repeated calls to Grover’s
search algorithm to find the index of a smaller item than the value determined by
a particular threshold index. If there are t ≥ 1 marked entries, Grover’s algorithm
will return one of them with equal probability after an expected number of
$O(\sqrt{N/t})$ iterations. If no entry is marked, it will run forever.
4.4.2
The Algorithm
The algorithm is as follows:
1. Choose threshold index 0 ≤ y ≤ N − 1 uniformly at random.
2. Repeat the following and interrupt it when the total running time is more
than $22.5\sqrt{N} + 1.4 \log^2 N$. Then go to 2c.
(a) Initialize the register as a uniform superposition over the N states, i.e.,
give each state a coefficient of $\frac{1}{\sqrt{N}}$. Mark every item j for which
T[j] < T[y]. This would be an O(N) operation on a classical computer, but here the entire state, which is a superposition of the N basis
states, is acted upon at once by a quantum operator.
(b) Apply the quantum exponential searching algorithm of [15].
(c) Observe the register: let y′ be the outcome. If T[y′] < T[y], then set
threshold index y to y′.
3. Return y.
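The control flow of the algorithm can be illustrated with a purely classical stand-in. In the sketch below, the Grover search of steps 2a to 2c is replaced by drawing a marked index uniformly at random, which mimics only the outcome distribution of the quantum search and not its cost; the time limit of step 2 is ignored and the loop simply stops when no item is marked. All names are illustrative.

```python
import random

def minimum_index_sketch(table, rng):
    """Classical stand-in for the threshold-lowering loop; returns (index, rounds)."""
    n = len(table)
    y = rng.randrange(n)                    # step 1: random threshold index
    rounds = 0
    while True:
        # Items that the quantum algorithm would mark: T[j] < T[y].
        marked = [j for j in range(n) if table[j] < table[y]]
        if not marked:                      # nothing smaller: y is the minimum
            return y, rounds
        y = rng.choice(marked)              # stand-in for observing Grover's output
        rounds += 1

table = [7, 3, 9, 1, 4, 8]
index, rounds = minimum_index_sketch(table, random.Random(0))
print(index, table[index], rounds)
```

Each round replaces the threshold by a uniformly random smaller item, so the number of marked items shrinks quickly and few rounds are needed.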
4.4.3
Running Time and Precision
By convention, we assume that stage 2a takes log(N) time steps and that one iteration in Grover’s algorithm takes one time step. The expected number of iterations
used by Grover to find the index of a marked item among N items where t items
are marked is at most $\frac{9}{2}\sqrt{N/t}$. The expected total time before y holds the index
of the minimum is at most $m_0 = \frac{45}{4}\sqrt{N} + \frac{7}{10}\log^2 N$.
The algorithm given above finds the minimum with probability at least $\frac{1}{2}$. This
probability can be improved to $1 - \frac{1}{2^c}$ by running the algorithm c times.
4.5
Quantum Walks
A generalization of Grover’s search technique, quantum walks[17] have led to a
number of quantum algorithms for problems such as element distinctness (which
will be described later). In this section, we present the basics of quantum walks and
also an application of quantum walks due to Ambainis, namely, element distinctness. First, we describe random walks, which are the classical analogue of quantum
walks. Random walks provided the inspiration for quantum walks.
4.5.1
Random Walks
A random walk is a mathematical formulation of a path that consists of a succession
of random steps. For example, the path traced by a gas molecule and the path traced
by an animal foraging for food are both random walks. Often, random walks are
assumed to be Markov chains or Markov processes in discrete time, although there
can be other types of random walks too. A classical Markov chain is said to be
a random walk on an underlying graph if the nodes of the graph are the states in
S, and a state s has a non-zero probability to go to state t if and only if the edge
(s, t) exists in the graph. A simple random walk on a graph G(V, E) is described
by repeated applications of a stochastic matrix P, where $P(u, v) = \frac{1}{d_u}$ if (u, v)
is an edge in G and $d_u$ is the degree of the vertex u. If G is connected and non-bipartite, the distribution of the random walk $D_t = P^t D_0$ converges to a stationary
distribution π which is independent of the initial distribution $D_0$.
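The convergence claim can be observed numerically on a small graph. The sketch below applies the simple random walk to a triangle with one extra pendant vertex, which is connected and non-bipartite, and compares the result with the distribution proportional to the vertex degrees, which is the stationary distribution of a simple random walk on an undirected graph. The graph and the function name are chosen for illustration.

```python
def walk_step(dist, adj):
    """One step of the simple random walk: each vertex spreads its
    probability mass evenly over its neighbours."""
    new = [0.0] * len(adj)
    for u, neighbours in enumerate(adj):
        share = dist[u] / len(neighbours)      # P(u, v) = 1 / d_u
        for v in neighbours:
            new[v] += share
    return new

# Adjacency lists: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]

dist = [1.0, 0.0, 0.0, 0.0]                    # start at vertex 0
for _ in range(200):
    dist = walk_step(dist, adj)

total_degree = sum(len(nb) for nb in adj)
stationary = [len(nb) / total_degree for nb in adj]
print([round(p, 6) for p in dist], stationary)
```

Starting from any other vertex gives the same limit, which is the independence from $D_0$ stated above.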
An example: A one-dimensional random walk
The elementary one-dimensional random walk is a walk on the integer line Z which
starts at 0 and at each time step moves +1 or -1 with equal probability.
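For this walk the position distribution after t steps is binomial and can be written down exactly. The sketch below tabulates it; the function name is illustrative.

```python
from math import comb

def position_distribution(t: int) -> dict:
    """Exact distribution of the position after t steps of the +1/-1 walk.
    If k of the t steps are +1, the walker stands at 2k - t."""
    return {2 * k - t: comb(t, k) / 2 ** t for k in range(t + 1)}

print(position_distribution(2))
print(position_distribution(4))
```

After an even number of steps the walker can only be at an even position, and the origin is the most likely one.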
4.5.2
Terminology used with Random Walks
There are many definitions which capture the rate of convergence to the limiting
distribution in a random walk. Some important terms are defined here.
Definition 3. Mixing Time:
$$M_\epsilon = \min\{T \mid \forall t \ge T, D_0 : \|D_t - \pi\| \le \epsilon\} \tag{4.7}$$
where the distance between two distributions $d_1$ and $d_2$ is given by $\|d_1 - d_2\| = \sum_i |d_1(i) - d_2(i)|$.
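For a small graph the mixing time can be computed by brute force. The sketch below tracks the walk from every single-vertex starting distribution and returns the first time at which all of them are within ε of π in the distance defined above. It relies on two standard facts that are assumed here rather than proved: this distance is largest for point-mass starting distributions, and it never increases from one step to the next. The function names are illustrative.

```python
def walk_step(dist, adj):
    """One step of the simple random walk on adjacency lists adj."""
    new = [0.0] * len(adj)
    for u, neighbours in enumerate(adj):
        for v in neighbours:
            new[v] += dist[u] / len(neighbours)
    return new

def l1_distance(d1, d2):
    return sum(abs(a - b) for a, b in zip(d1, d2))

def mixing_time(adj, eps, max_steps=10_000):
    """Smallest T such that every point-mass start is within eps of pi."""
    n = len(adj)
    total_degree = sum(len(nb) for nb in adj)
    pi = [len(nb) / total_degree for nb in adj]        # stationary distribution
    dists = [[1.0 if v == u else 0.0 for v in range(n)] for u in range(n)]
    for t in range(max_steps + 1):
        if all(l1_distance(d, pi) <= eps for d in dists):
            return t
        dists = [walk_step(d, adj) for d in dists]
    raise RuntimeError("walk did not mix within max_steps")

triangle = [[1, 2], [0, 2], [0, 1]]
print(mixing_time(triangle, eps=0.5), mixing_time(triangle, eps=0.01))
```

On the triangle the distance to π halves at every step, so the mixing time grows only logarithmically in 1/ε.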
Definition 4. Filling Time:
$$\tau_\epsilon = \min\{T \mid \forall t \ge T, D_0, X \subseteq V : D_t(X) \ge (1 - \epsilon)\pi(X)\} \tag{4.8}$$
Definition 5. Dispersion Time:
$$\xi_\epsilon = \min\{T \mid \forall t \ge T, D_0, X \subseteq V : D_t(X) \le (1 + \epsilon)\pi(X)\} \tag{4.9}$$
4.5.3
Quantum Analogue: Quantum Markov Chains or Quantum Walks
Let G(V, E) be a graph, and let $H_V$ be the Hilbert space spanned by the states
$|v\rangle$ where v ∈ V. We denote by n, or |V|, the number of vertices in G. Assume
that G is d-regular. Let $H_A$ be the auxiliary Hilbert space of dimension d spanned
by the states $|1\rangle$ through $|d\rangle$. Let C be a unitary transformation on $H_A$. Label
each directed edge with a number between 1 and d, such that the edges labelled a
form a permutation. Now we can define a shift operator S on $H_A \otimes H_V$ such that
$S|a, v\rangle = |a, u\rangle$ where u is the a-th neighbour of v. Hence, one step of the quantum
walk is given by the operator $U = S \cdot (C \otimes I)$. This is called a coined quantum
walk.
Example: Consider the cycle graph with n nodes. This graph is 2-regular. The
Hilbert space for the walk would be $\mathbb{C}^2 \otimes \mathbb{C}^n$. We choose C to be the Hadamard
transform
$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$
and the shift S is defined as
$$S|R, i\rangle = |R, i + 1 \bmod n\rangle$$
$$S|L, i\rangle = |L, i - 1 \bmod n\rangle$$
where R denotes a move to the node on the right of the node indexed i and L
denotes a move to the left. The quantum walk is then defined to be the repeated application of the Hadamard transform on the first register followed by an application
of the shift operator S.
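The coined walk on the cycle is small enough to simulate directly with a state vector. The sketch below is illustrative only: the function name, the choice of starting in the coin state $|R\rangle$ at node 0, and the encoding R = 0, L = 1 are assumptions, not taken from the report. Each step applies the Hadamard coin at every node and then the shift S exactly as defined above.

```python
import math


def hadamard_walk_on_cycle(n, steps, start=0):
    """Position probabilities after `steps` applications of U = S (H (x) I).

    Assumption: the walker starts at node `start` with coin state |R>.
    """
    h = 1.0 / math.sqrt(2.0)
    # amp[c][i] is the amplitude of |c, i>, with c = 0 for R and c = 1 for L.
    amp = [[0j] * n for _ in range(2)]
    amp[0][start] = 1 + 0j
    for _ in range(steps):
        # Coin: H|R> = (|R> + |L>)/sqrt(2), H|L> = (|R> - |L>)/sqrt(2).
        coin_r = [h * (amp[0][i] + amp[1][i]) for i in range(n)]
        coin_l = [h * (amp[0][i] - amp[1][i]) for i in range(n)]
        # Shift: |R, i> moves to i + 1 mod n, |L, i> moves to i - 1 mod n.
        amp[0] = [coin_r[(i - 1) % n] for i in range(n)]
        amp[1] = [coin_l[(i + 1) % n] for i in range(n)]
    return [abs(amp[0][i]) ** 2 + abs(amp[1][i]) ** 2 for i in range(n)]


probs_after_one_step = hadamard_walk_on_cycle(8, 1)
probs_after_many_steps = hadamard_walk_on_cycle(8, 25)
```

After a single step the walker is found on either neighbour of the start node with probability 1/2, as in the classical walk; the two processes only diverge later, once amplitudes from different paths interfere.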
Having this general idea of quantum walks in mind, we now proceed to examine their applications to algorithmic problems. We first remark that
Grover’s search algorithm is a special case of a quantum walk. Next we describe
the application of quantum walks to the element distinctness problem.
4.5.4 Application to Element-Distinctness Problem
The element distinctness problem is as follows. Given numbers $x_1, x_2, \ldots, x_N \in [M]$, are there $i, j \in [N]$ such that $i \ne j$ and $x_i = x_j$? Any classical solution to
this problem will need Ω(N) queries. Ambainis gave a quantum walk algorithm
for this problem that gives the answer in $O(N^{2/3})$ queries. The main idea is as follows. We have vertices $v_S$ corresponding to sets $S \subseteq \{1, 2, \ldots, N\}$. Two vertices
$v_S$ and $v_T$ are connected by an edge if S and T differ in one variable. A vertex is
marked if S contains i, j such that $x_i = x_j$. At each moment of time, we know
$x_i$ for all i ∈ S. This enables us to check if the vertex is marked with no queries.
Also, it enables us to move to an adjacent vertex $v_T$ by querying just one variable
$x_i$ for $i \notin S$ and i ∈ T.
Then we define a quantum walk on subsets of the type S. Ambainis shows that
if $x_1, x_2, \ldots, x_N$ are not distinct, this walk finds a set S containing i, j such that
$x_i = x_j$ within $O(N^{2/3})$ steps.
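The query accounting behind this walk can be illustrated classically. The sketch below is only a bookkeeping illustration under stated assumptions; it is not Ambainis's quantum algorithm. The input `xs`, the helper names and the convention that a move swaps one index out of S and one in are all hypothetical. It shows that checking whether $v_S$ is marked needs no query, while moving to a neighbouring $v_T$ needs exactly one.

```python
def is_marked(known):
    """v_S is marked when two indices in S carry the same value."""
    values = list(known.values())
    return len(set(values)) < len(values)


def move_to_neighbour(known, drop, add, query):
    """Step from v_S to an adjacent v_T: forget `drop`, learn x_add with one query."""
    neighbour = {i: x for i, x in known.items() if i != drop}
    neighbour[add] = query(add)  # the only query made by this move
    return neighbour


xs = [7, 3, 9, 3, 5]  # hypothetical input; positions 1 and 3 collide (0-based)
queries = []


def query(i):
    """Oracle access to x_i that records every index asked for."""
    queries.append(i)
    return xs[i]


S = {i: query(i) for i in (0, 1, 2)}  # setting up S costs |S| queries
T = move_to_neighbour(S, drop=0, add=3, query=query)
```

Here `S` is unmarked and `T` is marked, and the whole run uses four queries: three to set up S and one for the move.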
With that, we conclude the current chapter on quantum computing literature.
Next, we move on to applications of quantum computing to more complex
algorithmic problems involving intelligence tasks.
Chapter 5
Quantum Computing and Intelligence Tasks
Given the quantum computing techniques we have seen in the previous chapter,
we would like to see if any of them can be applied to natural language processing tasks. There exist approaches in literature for applying quantum principles to
machine learning tasks such as classification [?]. Since NLP relies heavily on machine learning, we would like to study quantum machine learning algorithms too.
In this chapter, we first study a quantum approach to classification proposed by
Sébastien Gambs in 2008. Then we present our approach to a quantum Viterbi
algorithm using a modified version of Grover’s search algorithm. Later, we present
other possible approaches currently being investigated by us for the same problem.
5.1 Quantum Classification
Quantum classification is defined as the task of predicting the associated class of
an unknown quantum state drawn from an ensemble of pure states given a finite
number of copies of this state.
5.1.1 Learning in a Quantum World
Definition 6. (Quantum training dataset). A quantum training dataset containing
n pure quantum states can be described as $D_n = \{(|\psi_1\rangle, y_1), \ldots, (|\psi_n\rangle, y_n)\}$, where $|\psi_i\rangle$ is the i-th quantum state of the training dataset and $y_i$ is the
class associated with this state.
Example: (Quantum training dataset composed of pure states defined on d
qubits). In the context where all the pure states in the training dataset live in a
Hilbert space formed by d qubits and we are interested in the task of binary classification, $|\psi_i\rangle \in \mathbb{C}^{2^d}$ and $y_i \in \{-1, +1\}$.
Definition 7. (Training error). The training error (or error rate) of a classifier
f is defined as the probability that this classifier predicts the wrong class $y_i$ on a
quantum state $|\psi_i\rangle$ drawn randomly from the states of the quantum training dataset
$D_n$. Formally:
$\epsilon_f = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Prob}(f(|\psi_i\rangle) \ne y_i)$
In the context of quantum classification, the notion of regret also takes a particular importance.
Definition 8. (Regret). The regret r of a classifier f is defined as the difference
between its error rate $\epsilon_f$ and the smallest error $\epsilon_{opt}$ that can be achieved
on the same problem. Formally:
$r_f = \epsilon_f - \epsilon_{opt}$
The regret of a classifier, as well as its error, can potentially take any value in
the range between zero and one. The concept of regret is particularly meaningful
in the context of hard learning problems, where the raw error rate alone is not an
appropriate measure to characterize the inherent difficulty of the learning.
Definition 9. (Classification cost). The classification cost corresponds to the number of copies of the unknown quantum state $|\psi_?\rangle$ that will be used by the classifier
to predict the class $y_?$ of this state.
5.1.2 The Helstrom Oracle
For the purpose of quantum classification, we will be using an abstraction called
the Helstrom Oracle.
Definition 10. The Helstrom oracle is an abstract construction that takes as input:
Version 1: a classical description of the density matrices $\rho_-$ and $\rho_+$ corresponding to the -1 and +1 tagged states, and their a priori probabilities $p_-$ and $p_+$
or
Version 2: a finite number of copies of each state of the quantum training dataset
$D_n$.
From this input, the oracle can be trained to produce an efficient implementation (exact or approximate) of the POVM (Positive-Operator Valued Measure) of the Helstrom measurement $f_{hel}$, in the form of a quantum circuit that can
distinguish between $\rho_-$ and $\rho_+$. In the second version of the oracle, its training
cost corresponds to the minimum number of copies of each state of the training
dataset that the oracle has to sacrifice in order to construct $f_{hel}$.
5.1.3 Binary Classification
Let $m_-$ be the number of quantum states in $D_n$ for which $y_i = -1$ (negative class),
and its complement $m_+$ be the number of states for which $y_i = +1$ (positive class),
such that $m_- + m_+ = n$, the total number of data points in $D_n$. Moreover, $p_-$ is
the a priori probability of observing the negative class and is equal to $p_- = \frac{m_-}{n}$,
and $p_+$ its complementary probability for the positive class such that $p_- + p_+ = 1$.
Definition 11. (Statistical mixture of the negative class). The statistical mixture
representing the negative class, $\rho_-$, is defined as $\rho_- = \frac{1}{m_-} \sum_{i=1}^{n} I\{y_i = -1\} |\psi_i\rangle\langle\psi_i|$,
where I{.} is the indicator function which equals 1 if its premise is true and 0
otherwise.
Definition 12. (Statistical mixture of the positive class). In the same manner, the
statistical mixture representing the positive class, $\rho_+$, is defined as $\rho_+ = \frac{1}{m_+} \sum_{i=1}^{n} I\{y_i = +1\} |\psi_i\rangle\langle\psi_i|$.
Theorem 1. (Helstrom measurement). The error rate of distinguishing between the
two classes $\rho_-$ and $\rho_+$ is bounded from below by $\epsilon_{hel} = \frac{1}{2} - \frac{D(\rho_-, \rho_+)}{2}$,
where $D(\rho_-, \rho_+) = \mathrm{Tr}|p_-\rho_- - p_+\rho_+|$ is a distance measure between $\rho_-$ and
$\rho_+$ called the trace distance (here, $p_-$ and $p_+$ represent the a priori probabilities
of classes $\rho_-$ and $\rho_+$, respectively). Moreover, this bound can be achieved exactly
by the optimal POVM called the Helstrom measurement.
The Helstrom measurement is a binary classifier that has a null regret, which
means $r_{hel} = 0$.
Remark. Error rate of the Helstrom measurement for extreme cases:
• Consider the case where both the negative class and the positive class are
equiprobable. If $\rho_-$ and $\rho_+$ are two density matrices which correspond to
the same state, their trace distance $D(\rho_-, \rho_+)$ is equal to zero, which means
that the error $\epsilon_{hel}$ of the Helstrom measurement is 1/2, i.e. no better than a random guess.
• On the other hand, if $\rho_-$ and $\rho_+$ are orthogonal, this means that $D(\rho_-, \rho_+) = 1$
and that the Helstrom measurement has an error $\epsilon_{hel} = 0$.
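The bound in Theorem 1 is easy to evaluate for a single qubit, where the trace norm of a 2 × 2 Hermitian matrix follows from its two eigenvalues in closed form. The sketch below is illustrative: the function names and the three test states are assumptions, and it only evaluates the formula $\epsilon_{hel} = \frac{1}{2} - \frac{1}{2}\mathrm{Tr}|p_-\rho_- - p_+\rho_+|$; it does not construct the measurement itself.

```python
import math


def pure_state_density(alpha, beta):
    """Density matrix |psi><psi| for |psi> = alpha|0> + beta|1> (assumed normalized)."""
    a, b = complex(alpha), complex(beta)
    return [[a * a.conjugate(), a * b.conjugate()],
            [b * a.conjugate(), b * b.conjugate()]]


def trace_norm_2x2(m):
    """Tr|M| for a 2x2 Hermitian matrix: sum of absolute eigenvalues."""
    a, d = m[0][0].real, m[1][1].real
    off = abs(m[0][1])
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + off ** 2)
    return abs(mean + radius) + abs(mean - radius)


def helstrom_error(rho_minus, rho_plus, p_minus=0.5, p_plus=0.5):
    """Lower bound 1/2 - D/2 with D = Tr|p_- rho_- - p_+ rho_+|."""
    diff = [[p_minus * rho_minus[r][c] - p_plus * rho_plus[r][c]
             for c in range(2)] for r in range(2)]
    return 0.5 - trace_norm_2x2(diff) / 2.0


zero = pure_state_density(1, 0)  # |0>
one = pure_state_density(0, 1)   # |1>, orthogonal to |0>
plus = pure_state_density(1 / math.sqrt(2), 1 / math.sqrt(2))  # (|0> + |1>)/sqrt(2)

same_state_error = helstrom_error(zero, zero)
orthogonal_error = helstrom_error(zero, one)
overlapping_error = helstrom_error(zero, plus)
```

The first two values reproduce the extreme cases of the remark (1/2 and 0). The third lies in between, at $(1 - 1/\sqrt{2})/2 \approx 0.146$ for equiprobable $|0\rangle$ and $|+\rangle$.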
5.1.4 Weighted Binary Classification
(Reduction from weighted binary classification to standard binary classification via
Helstrom oracle). Given access to a Helstrom oracle that takes as inputs the
description of the density matrices $\rho_-$ and $\rho_+$ (and their a priori probabilities $p_-$
and $p_+$), it is possible to reduce the task of weighted binary classification to the
task of standard binary classification.
Training cost: null
Classification cost: Θ(1).
Proof: The weight $w_i$ of a particular state can be converted to a probability $p_i$
reflecting its importance by setting $p_i = \frac{w_i}{\sum_{j=1}^{n} w_j}$.
Let $\hat{p}_-$ be the new a priori probability of the negative class, which is equal to $\hat{p}_- = \sum_{i=1}^{n} p_i I\{y_i = -1\}$, and $\hat{p}_+$ its complementary probability such that $\hat{p}_- + \hat{p}_+ = 1$.
The Helstrom measurement which discriminates between the density matrices in
which the weights are incorporated is precisely the POVM which minimizes the
weighted error. Therefore, it suffices to call the Helstrom oracle with inputs
$\hat{\rho}_- = \sum_{i=1}^{n} p_i I\{y_i = -1\} |\psi_i\rangle\langle\psi_i|$
$\hat{\rho}_+ = \sum_{i=1}^{n} p_i I\{y_i = +1\} |\psi_i\rangle\langle\psi_i|$
This reduction makes only one call to the Helstrom oracle and requires only one
copy of the unknown quantum state at classification.
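The classical part of this reduction, turning weights into the priors handed to the oracle, is plain arithmetic. A minimal sketch follows; the function name and the example weights are hypothetical.

```python
def weighted_priors(weights, labels):
    """Return (p, p_hat_minus, p_hat_plus) with p_i = w_i / sum_j w_j."""
    total = sum(weights)
    probs = [w / total for w in weights]
    p_hat_minus = sum(p for p, y in zip(probs, labels) if y == -1)
    p_hat_plus = sum(p for p, y in zip(probs, labels) if y == +1)
    return probs, p_hat_minus, p_hat_plus


# Hypothetical dataset of three states: one negative, two positive,
# with the last state counting twice as much as the others.
probs, p_hat_minus, p_hat_plus = weighted_priors([1, 1, 2], [-1, +1, +1])
```

With these weights the priors are 0.25 and 0.75, against 1/3 and 2/3 for the unweighted dataset.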
5.2 Quantum Walk for A-star search
The A∗ search algorithm is a heuristic based searching/graph traversal algorithm
which has vast applications in the field of artificial intelligence. In this section, we
first describe the A∗ algorithm in detail and then present our ideas of a quantum A∗
using the idea of quantum walks.
5.2.1 The A∗ Algorithm
A∗ is a graph search algorithm which uses a best-first search to find the least cost
path from a given initial node to a goal node. As A∗ traverses the graph, it follows
a path of lowest expected cost and keeps a sorted priority queue of alternate path
segments along the way. It uses a knowledge-plus-heuristic cost function to determine the order in which the search visits nodes in the tree. A∗ is primarily a search
algorithm and the basic building blocks of any search algorithm are the following:
1. State Space: This is the space of states of the graph among which we are
searching for a solution.
2. Start State: The start state is the state from which our search starts. The start
state is denoted by S0 .
3. Goal State: The goal state is the state we intend to find via the search. The
goal state is denoted by G.
4. Operator: This consists of a transformation between states. It is a function
which takes a state as input and gives as output another state. This is used to
move from state to state thereby traversing the graph. Also each use of the
operator adds to the cost of taking that particular path.
5. Cost: The amount of effort involved in using the operator. This can be a
different function for different search problems.
6. Optimal Path: The path with the least cost to move from the start state to the
goal state.
Now we give an example to illustrate the above building blocks in a concrete
setting. We look at the 8-puzzle problem. We have a 3x3 square with the numbers
1 through 8 randomly arranged in 8 of the spaces of the square. This is our initial
configuration $S_0$. The empty square is regarded as free space and we can slide the
other 8 blocks upward, downward, left or right into the empty space. The aim is
to arrive at the goal configuration by sliding the blocks and making the minimum
number of moves (each movement of any block in any direction is regarded as one
move). This can be modelled in our search paradigm as follows. The state space is the
set of all possible configurations of the 8 numbered blocks. The start state is the
initial configuration given to us and the goal state is the one shown in Figure 5.2.
The available operators are move left, move right, move up and move down if the
move is permitted. The cost is simply the total number of times the operators are
used in arriving at the goal state.
Figure 5.1: Start State for the 8-puzzle problem
Figure 5.2: Goal State for the 8-puzzle problem
We now present the A∗ search algorithm.
1. Create a search graph G consisting solely of the start node S. Put S on a list
called OPEN.
2. Create a list called CLOSED that is initially empty.
3. Loop: if OPEN is empty, exit with failure.
4. Select the first node on OPEN, remove it from OPEN and put it on CLOSED. Call this
node n.
5. If n is the goal node, exit with the solution obtained by tracing a path along
the pointers from n back to S. (Pointers are established in step 7.)
6. Expand node n, generating the set M of its successors that are not ancestors
of n. Install these members of M as successors of n in G.
7. Establish a pointer to n from those members of M not already in G (i.e. not
already on either OPEN or CLOSED). Add these members of M to OPEN.
For each member of M that was already on OPEN or CLOSED, decide
whether or not to redirect its pointer to n. For each member of M already on
CLOSED, decide for each of its descendants in G whether or not to redirect
its pointer.
8. Reorder OPEN using the cost function f, which is a mix of the knowledge function g and the heuristic function h. Reorder in ascending order of cost.
9. Go to step 3.
The Heart of A∗ : The Heuristic Function
The cost function f is maintained at every node in the search graph. For a node n,
f(n) = g(n) + h(n), where g(n) is the cost of the least-cost path to n from $S_0$ found so far
and h(n) is a function which satisfies the property that $h(n) \le h^*(n)$, where $h^*(n)$
is the actual cost of the optimal path to G from n, which is to be found. If $g^*(n)$ is
the cost of the least-cost path from $S_0$ to n, then g and h satisfy the following relations.
$g(n) \ge g^*(n)$ (5.1)
$h(n) \le h^*(n)$ (5.2)
For example, in the 8-puzzle problem a possible heuristic function is to let h(n)
be the total number of tiles displaced. Since any path to the goal
configuration has to make at least as many moves as there are displaced
tiles, this is a valid heuristic function. A property of A∗ is that if we choose an
admissible heuristic such as the one above, the algorithm always terminates having found
the optimal path. Of course, the better the heuristic, the faster the algorithm. Now, we give
our insights on how quantum computing ideas can possibly be applied to the A∗
algorithm.
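The classical algorithm and the displaced-tiles heuristic can be made concrete for the 8-puzzle. What follows is a compact sketch rather than a line-by-line transcription of the nine steps: OPEN is a binary heap ordered by f = g + h, and the pointer redirection of step 7 is replaced by re-inserting a node whenever a cheaper path to it is found and skipping stale heap entries. The goal layout `GOAL` is an assumption, since Figure 5.2 is not reproduced in the text.

```python
import heapq

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout; 0 stands for the blank


def misplaced_tiles(state, goal=GOAL):
    """h(n): number of tiles (blank excluded) that are not in their goal cell."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)


def successors(state):
    """All states reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            swap = r * 3 + c
            nxt = list(state)
            nxt[blank], nxt[swap] = nxt[swap], nxt[blank]
            yield tuple(nxt)


def a_star(start, goal=GOAL, h=misplaced_tiles):
    """Return a least-cost list of states from start to goal, or None."""
    open_heap = [(h(start, goal), 0, start)]  # entries are (f, g, state)
    best_g = {start: 0}
    parent = {start: None}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if g > best_g[node]:
            continue  # stale entry: a cheaper path to this node was found later
        if node == goal:
            path = []
            while node is not None:  # follow the pointers back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in successors(node):
            new_g = g + 1  # every move costs 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(open_heap, (new_g + h(nxt, goal), new_g, nxt))
    return None


start = (1, 2, 3, 4, 5, 6, 0, 7, 8)  # hypothetical start, two moves from GOAL
solution = a_star(start)
```

For this start state the heuristic value is 2 and the returned path has three states, i.e. two moves, so the heuristic is exact here. On unsolvable configurations the search exhausts the reachable half of the state space and returns `None`.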
5.2.2 A Quantum Approach?
We notice that A∗, at its heart, is a graph traversal scheme. We know from the previous chapter that quantum walks can also be used as a graph traversal scheme. It is
worthwhile to look at whether a quantum counterpart to the classical A∗ algorithm
can be designed. There already exist quantum walk algorithms for performing a
multi-dimensional grid search which offer a speed-up over their classical counterparts. We believe we can apply these algorithms in the A∗ setting to get a quantum
A∗ which is faster than the classical one.
Chapter 6
The Quantum Viterbi
Now we present the quantum Viterbi algorithm developed by us and analyse the
precision and accuracy results on the BNC corpus.
6.1 The Approach
If we look at the trellis of the states on which Viterbi runs, we can say that the
Viterbi algorithm is searching for the ’best’ possible path from the first level of the
trellis to the last. Hence, we could call it a search problem. Classically, we would
have to search each path one after the other and compare their fitness values. Due
to the Markov assumption in the problem, our search is now split into stages. To
make the next step in the path, we still need to search among all possible transitions
from our current position and then choose the best one. The breakthrough quantum
computing brings is the ability to perform computation on many variables simultaneously. For example, in Shor’s algorithm, the routine for finding the period
applies the Fourier transform on all the states simultaneously. In contrast, classical
period finding algorithms would have to search among the different $x^a \bmod n$ to
find the period.
Hence, a natural approach to a quantum Viterbi would be to try and model
it as a result of an observation over a quantum superposition of the various
paths in the trellis. However, Viterbi is not a random search among all the paths.
There is greater structure to the problem lent by the Markov assumption which
reduces the search over all possible paths to a search over all the possible next
states into which we can transition. So instead of a quantum superposition over all
possible trellis paths, we generate a new quantum superposition of states (possible
transitions) at each level of the trellis. Now, our task is to find an operator or a sequence of operators such that their application on a uniform quantum superposition
of the possible transitions from a state will lead to a quantum superposition which,
when observed, has a very high probability of yielding the desired state.
6.1.1 Can Grover be used?
In the Grover algorithm, we see that the oracle is a fixed operator which has knowledge of the solution state and does not vary across iterations. The Grover search
is an $O(\sqrt{N})$ time algorithm when we have N states to search among. At each
tag, we spend O(T) time to select one out of the T possible paths ending there
from the previous stage. So, at each stage of the classical Viterbi trellis, we spend
$O(T^2)$ time. Via the Grover search, we wish to bring this time down to $O(T\sqrt{T})$. But,
in our case, we do not know beforehand which element we are searching for,
as we want to find the maximum among T numbers. To do this, we need to modify
Grover as is done in the quantum Minimum finding algorithm. Using this insight, we
next present the quantum Viterbi algorithm.
6.2 The Algorithm
The classical Viterbi algorithm is first presented again for reference.
6.2.1 The Classical Version
Firstly, given a tagged corpus and a tag-set $t_1, t_2, \ldots, t_T$ of size T, we learn the tag-wise probabilities of starting the sentence $\pi_i$; the tag-to-tag transition probabilities
$P(t_j|t_i)$; and the tag-to-word generation probabilities $P(w_k|t_i)$. Then, given a sentence $w_1, w_2, \ldots, w_L$ of length L,
1. Initialize a 2-dimensional array V of size T × (L + 1) to all zeroes.
2. Initialize a 2-dimensional array B of size T × L to null.
3. Set V[i][0] to $\pi_i$ ∀ i ∈ {1, ..., T}.
4. For k in 1 to L,
For j in 1 to T,
$B[j][k] = \arg\max_i P(t_j|t_i) \times P(w_k|t_i) \times V[i][k-1]$
m = B[j][k]
$V[j][k] = P(t_j|t_m) P(w_k|t_m) V[m][k-1]$
5. In the BNC corpus used by us, all sentences end in punctuation. So, we
assign the corresponding tag to the last word in the sentence. Say, the index
for this tag is p. Then, $tag_L = p$.
6. Now, we use the back-pointers stored in B to find the path that gave us the
maximum score and ended in p.
For k in L − 1 to 1, $tag_k = B[tag_{k+1}][k+1]$
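For reference, a runnable sketch of Viterbi decoding is given below. It is the standard textbook formulation for a first-order HMM, which charges the emission $P(w_k|t_j)$ to the destination tag and picks the best final tag by score, so its indexing differs slightly from the steps listed above. The toy tag set, probabilities and sentence are hypothetical. The inner `max` over previous tags is the O(T) maximum-finding step that Section 6.2.4 proposes to replace.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence for `words` under a first-order HMM.

    start_p[t]      : probability that a sentence starts with tag t
    trans_p[ti][tj] : P(tj | ti)
    emit_p[t][w]    : P(w | t); unseen words get probability 0 here
    """
    # V[k][t]: best score of any tagging of words[:k+1] that ends in tag t.
    V = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for k in range(1, len(words)):
        V.append({})
        back.append({})
        for tj in tags:
            # O(T) maximum over the previous tag; O(T^2) per word in total.
            best_prev = max(tags, key=lambda ti: V[k - 1][ti] * trans_p[ti][tj])
            V[k][tj] = (V[k - 1][best_prev] * trans_p[best_prev][tj]
                        * emit_p[tj].get(words[k], 0.0))
            back[k][tj] = best_prev
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for k in range(len(words) - 1, 0, -1):  # follow the back-pointers
        path.append(back[k][path[-1]])
    return path[::-1]


# Hypothetical two-tag model.
TAGS = ["N", "V"]
START = {"N": 0.8, "V": 0.2}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"dogs": 0.9, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.9}}

tagged = viterbi(["dogs", "bark"], TAGS, START, TRANS, EMIT)
```

On this toy model the decoder tags "dogs" as N and "bark" as V.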
6.2.2 Quantum exponential searching
This search algorithm [15] receives as input a superposition of N states, of which
say t are marked. It gives as output one of those t states. The steps are:
1. Initialize m = 1 and λ = 6/5.
2. Choose j uniformly at random from the whole numbers smaller than m.
3. Apply j iterations of Grover’s algorithm starting from the initial state $|\psi_0\rangle = \sum_i \frac{1}{\sqrt{N}} |i\rangle$.
4. Observe the register; let i be the outcome. If T[i] = x (that is, if i is one of the marked states), then the problem is
solved; exit and return i.
5. Otherwise, set m to $\min(\lambda m, \sqrt{N})$ and go back to step 2.
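The control flow of these five steps can be simulated classically without a state vector, because the success probability after j Grover iterations has the known closed form $\sin^2((2j+1)\theta)$ with $\sin^2\theta = t/N$. The sketch below relies on that identity and samples the measurement outcome accordingly; the function name and return values are illustrative assumptions, and at least one state must be marked.

```python
import math
import random


def exponential_search(n_states, marked, rng=random, lam=6 / 5):
    """Simulate quantum exponential searching; returns (index, grover_iterations).

    `marked` is a non-empty set of marked indices in range(n_states).
    The measurement in step 4 is sampled from the closed-form success
    probability instead of evolving amplitudes.
    """
    theta = math.asin(math.sqrt(len(marked) / n_states))
    m = 1.0
    iterations_used = 0
    while True:
        j = rng.randrange(math.ceil(m))  # step 2: whole number smaller than m
        iterations_used += j             # step 3 costs j Grover iterations
        p_success = math.sin((2 * j + 1) * theta) ** 2
        if rng.random() < p_success:     # step 4: observed a marked state
            return rng.choice(sorted(marked)), iterations_used
        m = min(lam * m, math.sqrt(n_states))  # step 5


rng = random.Random(0)  # fixed seed so the run is repeatable
index, used = exponential_search(64, {5}, rng=rng)
```

The returned index is always a marked one; what varies from run to run is the number of Grover iterations consumed, which is $O(\sqrt{N/t})$ in expectation.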
6.2.3 The Grover Iteration
[3] shows that the unitary transform G, defined below, efficiently implements what
we called an iteration above. $S_0$ is an operator that inverts the sign of the coefficient
of the $|0\rangle$ state. Similarly, $S_t$ inverts the sign of the coefficients of all marked states. T
is defined by its action on the states $|0\rangle, |1\rangle, \ldots, |N-1\rangle$ as
$T|j\rangle = \frac{1}{\sqrt{N}} \sum_{i=0}^{N-1} (-1)^{i \cdot j} |i\rangle$
where $i \cdot j$ denotes the bitwise dot product of the two binary strings denoting i and
j. Then the transform G is given by
$G = -T S_0 T S_t$
Grover considers only the case when N is a power of 2 since the transform T is
well-defined only in this case.
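The operator $G = -T S_0 T S_t$ can be applied to an explicit amplitude vector, which is how a classical simulation of the search proceeds. The following is a minimal sketch with assumed function names; T is implemented with the usual fast Walsh-Hadamard butterfly, and operators are applied right to left.

```python
import math


def walsh_hadamard(amps):
    """Apply T: amplitude i becomes N^{-1/2} * sum_j (-1)^{i.j} * amps[j]."""
    a = list(amps)
    n = len(a)  # must be a power of 2
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in a]


def grover_iteration(amps, marked):
    """One application of G = -T S0 T St."""
    a = [-v if i in marked else v for i, v in enumerate(amps)]  # St
    a = walsh_hadamard(a)                                       # T
    a[0] = -a[0]                                                # S0
    a = walsh_hadamard(a)                                       # T
    return [-v for v in a]                                      # overall sign


def success_probability(n_qubits, marked, iterations):
    """Probability of observing a marked state after `iterations` of G."""
    n = 2 ** n_qubits
    amps = [1.0 / math.sqrt(n)] * n  # uniform superposition |psi_0>
    for _ in range(iterations):
        amps = grover_iteration(amps, marked)
    return sum(amps[i] ** 2 for i in marked)


p_six = success_probability(6, {5}, 6)  # N = 64, one marked state
```

For N = 64 and a single marked state, six iterations (close to $\frac{\pi}{4}\sqrt{64} \approx 6.3$) already give a success probability above 0.99, and the simulated value agrees with the closed form $\sin^2((2j+1)\theta)$, $\sin\theta = 1/\sqrt{N}$.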
6.2.4 The Quantum Approach to Viterbi
Note that step 4 in 6.2.1 has an inner loop that runs T times and needs to find the
maximum among T quantities each time, leading to the $O(T^2)$ component in the
running time on a classical computer. On a quantum computer, the T quantities
to be compared can be prepared together into a superposition of states in log T
time (because we need log T qubits to represent T states) and then, we modify the
Quantum Minimum Algorithm to get a Quantum Maximum Algorithm by changing the < comparison operator to ≥ in 2a. Since the T quantities for a fixed j, k are
in a superposed state, we use an operator that fetches the required values from the
V table and the probability values learnt during the training phase, simultaneously
for all possible i in constant time. Then, the triplets are multiplied together to give
the T quantities, again in constant time. Now, the Quantum Maximum Algorithm
takes $O(T^{1/2})$ time to find the maximum among the T values, hence
giving the reduction in overall running time from $O(T^2 L)$ to $O(T^{3/2} L)$.
6.3 Experimental Results
6.3.1 Implementation
We implemented a simulation of the quantum algorithm on a classical computer
and assigned part-of-speech tags to the British National Corpus, which has a 61-strong tagset available at http://www.natcorp.ox.ac.uk/docs/c5spec.html. We padded this with 3 dummy tags in order to work with 64 states, i.e., 6-qubit quantum registers.
Additionally, instead of running the Quantum Minimum Algorithm for a time
of $22.5\sqrt{N} + 1.4\log^2 N$, we restricted the number of iterations of step 2 to 10,
thus giving a total running time of the same order (because the Quantum Exponential Searching over N states is an $O(\sqrt{N})$ operation). The rationale behind this
was that the probability of getting the correct result from the Grover search algorithm (of which
the Quantum Exponential Search is a generalization) oscillates with the
number of iterations, rising quickly and peaking periodically. Thus, by performing
slightly fewer iterations, we do not lose out much on precision but save
on execution time. (Note that the simulation of a quantum algorithm on a classical
Turing machine incurs an exponential blow-up in execution time.)
We used smoothing in our implementation where the tag-to-word probability
is boosted by $10^{-8}$ for all words. This ensures that even for words not present in
the training corpus, a positive probability value is assigned. Pushing up the probabilities of solely the words absent from the training corpus can end up changing
the tags of other words. To avoid this, the tag-to-word probabilities of all the words
are increased by the same amount regardless of their original value. If there were
no smoothing, the algorithm would end up leaving some words untagged. Due
to smoothing, no word is left untagged and hence the overall precision and recall
values become the same.
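The smoothing rule amounts to adding one constant to every emission lookup. A minimal sketch, assuming the learnt probabilities are stored in a nested dictionary (the storage layout, function name and example entry are hypothetical):

```python
def smoothed_emission(emit_p, tag, word, boost=1e-8):
    """P(word | tag) with the same constant `boost` added to every entry."""
    return emit_p.get(tag, {}).get(word, 0.0) + boost


emit_p = {"nn1": {"dog": 0.02}}  # hypothetical learnt tag-to-word probabilities

seen = smoothed_emission(emit_p, "nn1", "dog")
unseen = smoothed_emission(emit_p, "nn1", "zyzzyva")
```

Because the constant is added to seen and unseen words alike, the ordering of scores among seen words is barely disturbed, while every path through the trellis keeps a strictly positive score.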
6.3.2 Results
Although the quantum Viterbi algorithm is probabilistic, the probability of success can be made high by setting the running time of the algorithm appropriately.
Hence, the precision of the algorithm can be brought as close to that of the classical version as needed. For our implementation the classical version yielded a
precision and recall of 0.9289 whereas the quantum counterpart yielded 0.9067 as
the precision and recall.
Tag   Classical   Quantum     Difference
nn0   0.867388    0.909962    -0.0425745
unc   0.224541    0.258896    -0.0343547
ajc   0.892725    0.897757    -0.00503225
pun   1           1           0
np0   0.808852    0.807513    0.001339
dt0   0.889688    0.887663    0.00202525
vbd   0.999746    0.993134    0.00661225
at0   0.997748    0.989039    0.0087095
pnp   0.9847      0.975611    0.009089
nn2   0.910331    0.897774    0.0125575
dtq   0.988764    0.975926    0.0128385
vhz   0.999684    0.98396     0.0157243
aj0   0.902669    0.884891    0.0177777
vbb   0.992438    0.973092    0.0193463
xx0   0.994644    0.975089    0.0195548
av0   0.859771    0.840036    0.0197355
dps   0.995861    0.974889    0.020972
vm0   0.974927    0.953286    0.0216412
vvn   0.877453    0.855244    0.0222092
avq   0.641234    0.61802     0.0232142
cjc   0.999177    0.975847    0.0233302
ord   0.984701    0.961115    0.0235863
vhg   0.978126    0.953638    0.0244878
cjs   0.83856     0.813751    0.0248092
nn1   0.930382    0.904409    0.0259725
vbz   0.99572     0.96692     0.0288
vhb   0.955453    0.926358    0.0290948
crd   0.937628    0.907157    0.0304707
vbn   1           0.967927    0.0320727
vvz   0.839655    0.807522    0.0321325
pnq   0.983776    0.948785    0.03499
vdd   1           0.96372     0.03628
prp   0.965542    0.928909    0.0366335
prf   0.997954    0.959898    0.0380552
vvi   0.895492    0.854277    0.0412152
vdg   0.918877    0.877193    0.0416842
avp   0.736335    0.692918    0.0434172
vvg   0.863298    0.816867    0.046431
vvd   0.859617    0.813175    0.0464418
pnx   0.945212    0.895753    0.049459
cjt   0.964409    0.914131    0.050278
vhd   0.98311     0.928726    0.054384
pni   0.881366    0.826813    0.0545532
vdn   0.996154    0.941558    0.0545952
to0   0.966268    0.909248    0.0570202
vbg   0.982608    0.921845    0.0607627
vbi   0.997867    0.928526    0.0693417
ex0   0.968648    0.89689     0.0717575
vvb   0.590293    0.512389    0.077904
vhi   0.916013    0.836266    0.0797468
zz0   0.532633    0.444551    0.088082
vdz   0.997071    0.89678     0.100291
ajs   0.878247    0.773294    0.104954
vdb   0.97538     0.85216     0.12322
vhn   0.817375    0.616667    0.200709
itj   0.615165    0.372024    0.243142
vdi   0.926695    0.628504    0.298191
Table 6.1: Tag-wise comparison of precision values obtained by
both the classical and quantum Viterbi algorithms.
Tag   Classical   Quantum     Difference
vdg   0.964209    0.985294    -0.0210847
ajs   0.955312    0.969154    -0.0138413
xx0   0.958409    0.96795     -0.009541
vdz   0.970937    0.98        -0.00906325
cjt   0.887344    0.894839    -0.00749475
dps   0.984227    0.986229    -0.00200175
pun   1           1           0
vdn   1           1           0
vbd   0.989565    0.988928    0.0006365
dtq   0.991827    0.989443    0.002385
crd   0.976715    0.972657    0.00405825
at0   0.961239    0.956836    0.0044035
vbi   0.989855    0.984467    0.005388
cjc   0.991317    0.985674    0.00564275
vbn   0.990903    0.983238    0.00766475
prf   0.979257    0.970943    0.00831375
nn0   0.967757    0.959004    0.008752
vhg   1           0.991071    0.0089285
ajc   0.899988    0.890287    0.0097015
vdd   0.997205    0.9875      0.0097055
vvi   0.923541    0.911903    0.011638
vbz   0.995352    0.982435    0.0129175
ord   0.982694    0.969212    0.0134827
vm0   0.985806    0.971792    0.0140138
prp   0.906497    0.891614    0.0148828
vbg   0.986774    0.971668    0.0151065
nn1   0.906681    0.890365    0.0163158
vvd   0.871606    0.854049    0.0175568
vhd   0.98567     0.967624    0.018046
nn2   0.972807    0.954353    0.0184537
pnq   0.991778    0.969166    0.0226122
np0   0.8974      0.874631    0.022769
vhz   0.991969    0.968629    0.0233397
to0   0.962689    0.939272    0.023417
aj0   0.89451     0.870274    0.0242367
dt0   0.941652    0.909601    0.0320515
pnx   0.964999    0.931361    0.0336375
avq   0.865876    0.832125    0.0337505
vvn   0.827202    0.788522    0.0386807
pni   0.887096    0.847107    0.039989
pnp   0.953722    0.912573    0.0411495
zz0   0.308415    0.266807    0.041608
vhi   0.937314    0.889001    0.0483137
ex0   0.971073    0.921361    0.0497125
vvg   0.88181     0.826416    0.0553935
cjs   0.803035    0.74663     0.0564043
vdi   0.943283    0.866551    0.0767322
vvz   0.936522    0.856728    0.0797933
vhb   0.938971    0.855368    0.0836035
av0   0.901366    0.792911    0.108455
vbb   0.995897    0.869538    0.126359
itj   0.780903    0.65        0.130903
avp   0.809851    0.656315    0.153536
unc   0.730885    0.569445    0.161441
vvb   0.736583    0.535853    0.20073
vdb   0.965909    0.6866      0.279309
vhn   0.767851    0.440196    0.327655
Table 6.2: Tag-wise comparison of recall values obtained by both
the classical and quantum Viterbi algorithms.
Tag1  Tag2  Classical    Quantum     Difference
vdi   vdb   0.0696285    0.323111    -0.253482
itj   av0   0.0927187    0.217262    -0.124543
itj   vvn   0.0349679    0.125       -0.0900321
zz0   pnp   0.0383333    0.115625    -0.0772917
ex0   av0   0.0264066    0.0976915   -0.0712849
vdb   vdi   0.0246204    0.0939719   -0.0693515
itj   nn2   0.0149679    0.0833332   -0.0683653
vhn   vhd   0.182625     0.25        -0.0673752
vbi   vbb   0.00183934   0.0647824   -0.062943
vvb   nn1   0.261705     0.32404     -0.0623347
vhi   vhb   0.0825578    0.143788    -0.0612307
vbg   nn1   0.0173924    0.078155    -0.0607626
to0   prp   0.0329976    0.0854874   -0.0524897
vdn   vvn   0            0.0508658   -0.0508658
vhn   av0   0            0.05        -0.05
cjt   dt0   0.0344765    0.0779157   -0.0434392
nn0   nn0   0.867388     0.909962    -0.0425745
vdg   nn1   0.0811227    0.122807    -0.0416843
itj   vdd   0            0.0416668   -0.0416668
vhn   at0   0            0.0416668   -0.0416668
vhn   vvn   0            0.0416668   -0.0416668
vvd   vvn   0.0992355    0.137942    -0.0387068
ajs   nn1   0.0174963    0.0561497   -0.0386534
prf   av0   0.00163094   0.0387577   -0.0371267
pnx   dt0   0.0209213    0.0558179   -0.0348965
zz0   at0   0.256993     0.291853    -0.0348605
unc   unc   0.224541     0.258896    -0.0343547
vhd   vhn   0.0159773    0.0497156   -0.0337384
ajs   av0   0.0565788    0.0900432   -0.0334643
itj   at0   0.0687203    0.10119     -0.0324701
pnq   np0   0.00742375   0.0397013   -0.0322776
vbz   pnp   0.00165125   0.0315426   -0.0298913
zz0   np0   0.0877019    0.11672     -0.0290181
avp   prp   0.232996     0.260796    -0.0278002
ajs   aj0   0.0349714    0.0626435   -0.0276721
vdb   vbz   0            0.0263157   -0.0263157
vhb   vhi   0.0433937    0.0693684   -0.0259747
vdd   vbz   0            0.0252422   -0.0252422
vdz   vm0   0            0.0240275   -0.0240275
vdi   vvi   0.00183823   0.0241926   -0.0223543
vdz   vdb   0            0.0217391   -0.0217391
Table 6.3: Tag-to-tag pairwise comparison of confusion values obtained by both the classical and quantum Viterbi algorithms. Listed
above are those pairs for which the quantum algorithm increased
the confusion by at least 2% w.r.t. the classical one.
6.3.3 Tag-wise Precision and Recall Analysis
For the tags NN0, UNC and AJC the quantum Viterbi algorithm yielded a higher
precision value. And for the tags VDG, AJS, XX0 and VDZ the quantum algorithm
gives a higher recall. On the other hand, there are also a number of tags for which
the quantum algorithm performs particularly badly. For the tags VDI, ITJ, VHN,
VDB, AJS and VDZ the quantum algorithm gave a significantly lower precision
value. And for the tags VHN, VDB, VVB, AVP, ITJ, AV0, CJS and PNP the recall
is significantly lowered by the quantum algorithm.
The quantum Viterbi is a probabilistic algorithm and does badly in the case of
tags with specific word forms. For example, VDI is the infinitive form of the verb
DO, i.e. ’do’, whereas VDZ is the -s form of the verb DO, i.e. ’does’. These are word
forms with specific tags and the classical algorithm yields high accuracy in these
cases. Here, the introduction of randomization by the quantum algorithm ends up
lowering the accuracy significantly. For the tags VDI, VHN, VDB, VDZ, VHI,
VVB, VVI and EX0, the quantum implementation yielded a precision which was
at least 6% lower than that given by the classical algorithm.
It is worth noticing that AV0 and AVP, both adverb forms, suffered losses of
10 − 15% in recall on using the quantum Viterbi, which suggests that for these,
the probabilistic quantum algorithm for maximum-finding doesn’t do as well as
the deterministic classical version on scores for words tagged AV0 or AVP. One
possible reason is that the values among which the maximum is to be determined are
close to each other, in which case a deterministic algorithm will go through but a
probabilistic one has higher chances of failing. This is consistent with the precision
values being on the lower side (0.85 for AV0 and 0.73 for AVP) for these tags in the classical algorithm.
6.3.4 Concluding Remarks
The current implementation of the quantum Viterbi algorithm does just 2% worse
on overall precision than its classical counterpart. We know that the Grover search
algorithm gives maximum accuracy when the running time is $\sim \frac{\pi}{4}\sqrt{\frac{N}{M}}$, where
M is the number of solutions. Here, M = 1 and N = 64, the closest power of
2 to the number of tags (61). The accuracy will increase marginally as we inch
closer to this value of the running time. The larger gain is in the reduction in time
complexity of the algorithm. Of course, this is small when N is of the order of
just $2^6$. To see an application where such a decrease can have a significant impact
on the overall algorithm, we delve into the problem of Machine Translation among
close languages in the next chapter.
Chapter 7
Machine Translation among Close Languages
7.1 Machine Translation
7.1.1 What is machine translation?
Machine translation (MT) is the translation of text by a computer, with no human
involvement. Pioneered in the 1950s, machine translation can also be referred to
as automated translation, automatic or instant translation. On a basic level, MT
performs simple substitution of words in one natural language for words in another, but that alone usually cannot produce a good translation of a text because
recognition of whole phrases and their closest counterparts in the target language
is needed. Solving this problem with corpus and statistical techniques is a rapidly
growing field that is leading to better translations, handling differences in linguistic
typology, translation of idioms, and the isolation of anomalies.
7.1.2 How does machine translation work?
There are two types of machine translation system:
• Rule-based systems use a combination of language and grammar rules plus
dictionaries for common words. Specialist dictionaries are created to focus on certain industries or disciplines. Rule-based systems typically deliver
consistent translations with accurate terminology when trained with specialist dictionaries. The basic approach involves linking the structure of the
input sentence with the structure of the output sentence using a parser and
an analyser for the source language, a generator for the target language, and
a transfer lexicon for the actual translation. Its biggest downfall is that everything must be made explicit: orthographical variation and erroneous input
must be made part of the source language analyser in order to cope with
it, and lexical selection rules must be written for all instances of ambiguity.
Adapting to new domains in itself is not that hard, as the core grammar is
the same across domains, and the domain-specific adjustment is limited to
lexical selection adjustment.
• Statistical systems have no knowledge of language rules. Instead they "learn"
to translate by analysing large amounts of data for each language pair. They
can be trained for specific industries or disciplines using additional data
relevant to the sector needed. Typically statistical systems deliver more
fluent-sounding but less consistent translations. Google Translate and similar statistical translation programs work by detecting patterns in hundreds
of millions of documents that have previously been translated by humans
and making intelligent guesses based on the findings. Generally, the more
human-translated documents available in a given language, the more likely
it is that the translation will be of good quality. SMT's biggest downfalls
are its dependence on huge amounts of parallel text, its problems with morphology-rich languages (especially when translating into such
languages), and its inability to correct singleton errors.
7.1.3 Advantages of machine translation
Some advantages, owing to which research in this field should be pursued, are:
• When time is a crucial factor, machine translation can save the day. The
software can translate content quickly and provide a quality output to the
user in no time at all.
• The next benefit is that it is comparatively cheap. It might look like an unnecessary investment but in the long run it is a very small cost considering
the return it provides.
• Confidentiality: Giving sensitive data to a translator might be risky while
with machine translation your information is protected.
7.2 Similarity to POS tagging for close languages
Among pairs of close languages such as Hindi and Urdu, which almost follow word-for-word translation, we can treat the words of one language as part-of-speech tags
for the corresponding words of the other and then simply employ the Viterbi POS-tagging algorithm to obtain machine translation. Note that here, the number of
states involved in the Viterbi trellis, i.e., T, would be of the order of $10^4$, and hence
an $O(T^{3/2} L)$ algorithm will be much more efficient than an $O(T^2 L)$ one. This is
a problem where a quantum version of the POS-tagger can drastically bring down
computing time, if deployed on a quantum computer.
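To make the words-as-tags idea concrete, here is a minimal classical Viterbi sketch in which the states are Urdu words and the observations are Hindi words. The toy probabilities (uniform start and transition values, one-to-one emissions) are assumptions chosen purely for illustration and are not the values learnt in our experiments. The inner maximum over all T predecessor states is the step that costs O(T) classically; replacing it with a quantum maximum-finding routine of cost $O(\sqrt{T})$ is what would take the total from $O(T^2 L)$ to $O(T^{3/2} L)$. For $T = 10^4$ this is a factor of $\sqrt{T} = 100$.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # score[s] = probability of the best path that ends in state s
    score = {s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}
    backpointers = []
    for word in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Classical O(T) scan over predecessors; this is the search step
            # that a quantum maximum-finding routine would replace.
            best = max(states, key=lambda r: prev[r] * trans_p[r][s])
            score[s] = prev[best] * trans_p[best][s] * emit_p[s].get(word, 0.0)
            pointers[s] = best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model (illustrative values only): Urdu words act as the tags.
hindi = "amreeki kaanoon saral nahin hai".split()
urdu_tags = ["amreeki", "kaanoon", "aasaan", "nahi", "hai"]
start_p = {t: 1 / len(urdu_tags) for t in urdu_tags}
trans_p = {a: {b: 1 / len(urdu_tags) for b in urdu_tags} for a in urdu_tags}
emit_p = {u: {h: 1.0} for u, h in zip(urdu_tags, hindi)}

translation = viterbi(hindi, urdu_tags, start_p, trans_p, emit_p)
print(" ".join(translation))
```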
7.2.1 The izafat phenomenon
Talking of Hindi-Urdu translation, it is imperative to discuss the phenomenon of
izafat, a feature of Urdu orthography derived from Persian. Most of the time, it
indicates either description or possession, which are explained via examples below:
1. To indicate that the word following the izafat describes the word preceding
it. That is, the second word is being used as an adjective. So, whereas
the normal adjective noun pair is simply adjective + noun, the structure of
the descriptive izafat construct is the other way around: noun + izafat +
adjective. Example: mughal means ’Moghul’, and azam means ’greatest’.
To say ’greatest Moghul’ using the izafat, we would say: mughal-i-azam
2. To express the idea that the word preceding the izafat is possessed by the word
following it. In other words, it does the same thing as ka, ke and ki from
Hindi but in the reverse order. For instance, the word gham (noun) means
’sadness’, and the word dil (noun) means ’heart’. If we wanted to say ’the
heart’s sadness’ in regular Urdu, we would put in the correct form of ka, ke or
ki, and say dil ka gham. But if we wanted to say the same thing using an
izafat construction, we would reverse the order of the two nouns, and place
the izafat between them: gham-i-dil.
Hindi-Urdu translation is then like a POS tagging problem, modulo izafat,
which can be dealt with to some extent in post-processing.
7.3 Phrase-Book Translation
A phrase book is a collection of ready-made phrases, usually for a foreign language
along with a translation, indexed and often in the form of questions and answers.
To test our modelling of statistical machine translation among close languages as
a part-of-speech tagging problem, we use Urdu-English and Hindi-English phrasebooks and
build a corpus from the translations of English sentences that are common to both
phrasebooks.
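The corpus construction can be sketched as follows, assuming each phrasebook is available as a mapping from an English sentence to its translation. The English keys and the helper name build_parallel_corpus are hypothetical and serve only as an illustration.

```python
def build_parallel_corpus(urdu_book, hindi_book):
    """Pair Hindi and Urdu translations of the English sentences found in both books."""
    common = sorted(set(urdu_book) & set(hindi_book))
    return [(hindi_book[e].split(), urdu_book[e].split()) for e in common]

# Toy phrasebooks; the English keys are assumed for illustration.
urdu_book = {
    "welcome": "khush-amdid",
    "good morning": "subha bakhair",
    "where are you from": "ap ka taluq kahan se hai",
    "thank you": "shukriya",
}
hindi_book = {
    "welcome": "svagat",
    "good morning": "subh prabhat",
    "where are you from": "ap ka vaasta kaha se hai",
    "good night": "subh ratri",
}

corpus = build_parallel_corpus(urdu_book, hindi_book)
# The distinct Urdu words of the corpus become the tagset.
tags = {word for _, urdu in corpus for word in urdu}
print(len(corpus), "sentence pairs,", len(tags), "tags")
```

Only sentences present in both phrasebooks survive, so the size of the corpus is bounded by the overlap between the two books.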
7.4 Experiments and Results
7.4.1 Training corpus
We built a parallel corpus of 54 sentences, containing 237 distinct Urdu words,
which act as tags for our Viterbi algorithm. A few examples:
1. urdu: mukhtalif ravaayat aur akaaeed ke log ek saath aate hai
hindi: vibhinn paramparaaon aur dharmon ke log ek saath aate hai
2. urdu: bambaari se unka achha hona imkaan nahi hai
hindi: bambaari se unka bhala hona sambhaavit nahin hai
3. urdu: Sadar Bush ne shayad wohi hasil kiya jisme woh sabase zyada dilchaspi
rakhte the
hindi: Rashtrapati Bush ne sambhavtah wahi paaya jisme woh sabse adhik
dilchaspi rakhte the
4. urdu: jab tak hum kaarrawaahi shuru nahi karenge , hamaare shahari jamhooriyat
ki taakat me ummeed khote rahenge
hindi: jab tak hum kaarrawaahi aarambh nahi karenge , hamaare naagrik
loktantra ki shakti me aashaa khote rahenge
5. urdu: hume ilaakaai ta-aavun aur yakajahatee ko mazboot banaane ke amal
ki raaftaar badaana chaahiye
hindi: hume kshetriya sahayog aur ekeekaran ko mazboot banaane ki prakriyaa
ki gati badaana chaahiye
6. urdu: hukoomaton aur bain-ul-akvaamee tanzeemon ko un taaleem policiyon
ki himaayat karna chahiye jinka moassar hona saabit kiya gaya hai
hindi: sarkaaron aur antarraashtriya sangathanon ko un shiksha neetiyon ka
samarthan karna chahiye jinka prabhavi hona siddh kiya gaya hai
7. urdu: ek hoshiyaar nayi takneek chand hafte kabal ek akhbaar me bayaan ki
gayi thi
hindi: ek nayi chatur takneek kuchh saptaah poorv ek samaachaar-patr me
varnit ki gayi thi
8. urdu: isme koi tajzub nahi ki bahut log khud ko anmol samajhte hai
hindi: isme koi aashchary nahi ki kai log khud ko amoolya samajhte hai
9. urdu: khush-amdid
hindi: svagat
10. urdu: ap ka taluq kahan se hai
hindi: ap ka vaasta kaha se hai
11. urdu: subha bakhair
hindi: subh prabhat
12. urdu: kya zara ahistah kehenge
hindi: kripaya thode dhire boliye
7.4.2 Issues
Since the corpus is not too dense, we face the following issues:
• There are 54 sentences and 237 tags. Hence, for most tags, the probability
that they can start a sentence, obtained via the usual learning routine is 0.
When we construct a test corpus sentence starting with a word that does not
occur at the beginning of any sentence in the training corpus, the score for
the correct path of tags stays at 0 in the Viterbi algorithm. Hence, we add a
small constant, 0.005 to the start-probability for each tag.
• Since most tags have only one Hindi word corresponding to them in the training corpus, the tag→word probabilities in many cases turn out to be zero,
often leaving just a single non-zero value in the vector over which the maximum is
to be found. This takes away the possibility of a Hindi word being translated as different Urdu words that may not all have occurred with it in
the training corpus. Since most words do have a one-to-one mapping and, if
found in the corpus, mostly come with their correct counterparts, we use only a small
smoothing factor of 0.0001.
• Since Urdu words from the training corpus are the tags themselves, the tag→tag
transition probabilities are zero for most pairs. So, if we build sentences
using Hindi words from different sentences and ask for a translation, it is
highly likely that their corresponding Urdu words would never have been
adjacent to each other in the training corpus. This issue is tackled by adding
a smoothing factor of 0.1 to all the transition probabilities. This factor is
quite large because we wish to account for the limited information
contained in the corpus regarding which words can follow a particular word
in the Urdu language.
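The three smoothing constants above can be sketched as follows. This is a minimal illustration that assumes each constant is simply added to the relative-frequency estimate without renormalising (Viterbi only compares scores, so the result is a score rather than a true probability). The function name and the two-sentence toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

START_SMOOTH = 0.005   # added to every start probability
EMIT_SMOOTH = 0.0001   # added to every tag -> word probability
TRANS_SMOOTH = 0.1     # added to every tag -> tag probability

def estimate(corpus):
    """corpus: list of (hindi_words, urdu_words) pairs of equal length."""
    tags = sorted({u for _, urdu in corpus for u in urdu})
    start_counts = Counter(urdu[0] for _, urdu in corpus)
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for hindi, urdu in corpus:
        for h, u in zip(hindi, urdu):
            emit_counts[u][h] += 1          # tag u emitted word h
        for a, b in zip(urdu, urdu[1:]):
            trans_counts[a][b] += 1         # tag a was followed by tag b

    start_p = {t: start_counts[t] / len(corpus) + START_SMOOTH for t in tags}

    def trans_p(a, b):
        total = sum(trans_counts[a].values())
        mle = trans_counts[a][b] / total if total else 0.0
        return mle + TRANS_SMOOTH

    def emit_p(u, h):
        total = sum(emit_counts[u].values())
        mle = emit_counts[u][h] / total if total else 0.0
        return mle + EMIT_SMOOTH

    return tags, start_p, trans_p, emit_p

toy_corpus = [
    ("amreeki kaanoon saral nahin hai".split(), "amreeki kaanoon aasaan nahi hai".split()),
    ("svagat".split(), "khush-amdid".split()),
]
tags, start_p, trans_p, emit_p = estimate(toy_corpus)
print(start_p["hai"], trans_p("hai", "amreeki"), emit_p("aasaan", "svagat"))
```

With these constants an unseen transition scores 0.1 while an unseen emission scores only 0.0001, so the decoder is far more willing to accept an unseen word order than an unseen word pairing.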
Since the tagset has increased in size from 57 to 237 as compared to the experiments on the BNC corpus, we increase the number of iterations in step 2 of the
Quantum Minimum Algorithm from 10 to 20.
7.4.3 Results
For testing, we use a manually constructed corpus of 11 Hindi sentences
containing words occurring in the training data, and run the quantum Viterbi algorithm. The
inputs and their corresponding outputs are as follows:
1. hindi: saubhagya
urdu: allah-ka-fazal-ho
2. hindi: ratri me milenge
urdu: bakhair me ek
3. hindi: namaste ap ka svagat hai
urdu: mein ap se khush-amdid hay
4. hindi: sone ka moolya seemit nahin hai
urdu: sone ki tanzeemon tajzub nahi hai
5. hindi: hamaare naagrik loktantra ki shakti me aashaa rakhte the
urdu: hamaare bain-ul-akvaamee jamhooriyat ki mukaable me Sweden rakhte
the
6. hindi: sriman ko mera dhanyavad
urdu: sahib ko anmol shukriya
7. hindi: ap ne acchi sehat ke lie ek takneek varnit ki thi
urdu: ap ho ache sehat safr leyae amreeki takneek amal ki thi
8. hindi: ruko mai samajha nahin thode dhire boliye
urdu: roko hain samjha nahi zara the kehenge
9. hindi: pradhanmantri ko prabhavi sangathanon ka samarthan karna chahiye
urdu: vazeer-e-aazam ko sath kitne ka himaayat isme chahiye
10. hindi: mai tum se kaphi ummeed karta hu
urdu: main ap shab kafi tavakko hum hoon
11. hindi: amreeki kaanoon saral nahin hai
urdu: amreeki kaanoon aasaan nahi hai
7.4.4 Analysis
We observe that many words are translated correctly individually but the sentence
as a whole goes wrong because of some incorrect tags. This is a disadvantage of
modelling Machine Translation as POS-tagging using a first-order (1-level) Markov Model,
as we lose context beyond the adjacent word in the sentence.
Looking at specific examples, we see that ratri me milenge gets translated as
bakhair me ek. The translation goes wrong in the first word itself. This is because
the only instance of ratri (meaning: night) in the training corpus is when subh ratri
(meaning: good night) is translated as shab bakhair where subh corresponds to
bakhair and ratri to shab, i.e., the order gets interchanged. Our algorithm learns
bakhair as the translation for ratri and uses it when run on the test data, yielding
wrong output.
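This failure mode can be reproduced with a minimal sketch of position-wise learning, assuming the tag for each Hindi word is taken from the Urdu word at the same position in the sentence pair. The helper name is hypothetical.

```python
from collections import Counter, defaultdict

def learn_word_to_tag(pairs):
    """Map each Hindi word to the Urdu word most often seen at the same position."""
    counts = defaultdict(Counter)
    for hindi, urdu in pairs:
        for h, u in zip(hindi.split(), urdu.split()):
            counts[h][u] += 1
    return {h: c.most_common(1)[0][0] for h, c in counts.items()}

# The only training instance of "ratri" has its word order interchanged in Urdu.
mapping = learn_word_to_tag([("subh ratri", "shab bakhair")])
print(mapping)
```

Because the pairing is purely positional, ratri is paired with bakhair and subh with shab, which is the reverse of the intended correspondence.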
sriman ko mera dhanyavad is translated as sahib ko anmol shukriya instead of
sahib ko mera shukriya because the words ko and mera do not occur adjacent to
each other in training data while ko and anmol do, hence the tag→tag transition
probability is higher in the latter case.
ke to safr, aashaa to Sweden, ek to amreeki, etc. are some examples of absurd
translations that we come across. This is because the tag→tag transition
probabilities have been shifted considerably by the smoothing factor of 0.1,
and hence some paths in the Viterbi trellis receive higher scores than they
should. This suggests learning the smoothing factor itself with a separate
procedure, which is something to be looked into in the future.
Chapter 8
Conclusions
Over the course of the past two semesters, we have come across various intriguing
and novel aspects of both quantum computing and natural language processing. We
have studied why a quantum computer allows us to gain a speed-up, in some cases
exponential, over a classical one, the reason being its ability to operate on a
superposition of all basis states in one quantum step. Also,
a recurring feature in our study of various quantum algorithms has been the Fourier
Transform, which is central to finding hidden subgroups of Abelian groups.
The Grover algorithm is a quantum approach to the search problem which uses
an Oracle that can recognize a solution and applies a Fourier-type (Hadamard) transform and its inverse
iteratively to propel the probability of the desired state upwards.
We have also investigated a whole array of classical optimization techniques
and gained an in-depth understanding of their working. We have thoroughly studied the quantum Viterbi algorithm and implemented it on
the BNC corpus, achieving satisfactory results, just 2% shy of the classical precision. To show applicability to real-life tasks where a time reduction from $O(T^2)$
to $O(T^{3/2})$ would be significant, we have dealt with the problem of statistical machine translation among close languages, which can be modelled as a POS-tagging
problem with the words of one language acting as the tagset. |T| here would therefore be of the order of thousands.
There are other classical search techniques too wherein ideas can be sought
from the quantum realm. For example, quantum random walks can be used for
the A-star algorithm. These, and quantum versions of the other classical algorithms
presented in this report, could be investigated further.
Chapter 9
Future Work
We have done an extensive literature survey on various classical optimization techniques in this thesis. We have also looked at various quantum computing techniques. We further developed a quantum version of the Viterbi algorithm. There
are numerous other problems open for study as to how they can be implemented
more efficiently on a quantum computer. If the advent of commercial quantum
computers does occur, then quantum algorithms developed in this way will find
great applicability in all areas where computation is done. Among the other problems we are looking at for quantum counterparts are the A∗ search algorithm and
gradient descent. For A∗ search, we are investigating the
work done on quantum walks and their algorithmic applications [14].
Another line of future work is to run simulations of the quantum Viterbi algorithm for machine translation of close languages and analyse the quality of the
results. We have made forays in this area and have presented our results and analysis in this report, but we believe a more extensive analysis of the same can be made
to obtain greater insights into the performance of the algorithm.
References
[1] D. Deutsch and R. Jozsa. Rapid solutions of problems by quantum computation. Proceedings of the Royal Society of London A, 1992.
[2] P. Shor. Polynomial-Time Algorithms for Prime Factorization and Discrete
Logarithms on a Quantum Computer. In Proc. 35th Annual ACM Symposium
on Foundations of Computer Science, 1994.
[3] L.K. Grover. A fast quantum mechanical algorithm for database search. In
Proc. 28th Annual ACM Symposium on Theory of Computing, STOC-96, page
212, 1996.
[4] J.C.H. Chen. Quantum Computing and Natural Language Processing. Master’s Thesis, Universität Hamburg, 2002.
[5] J.N. Darroch and D. Ratcliff. Generalized Iterative Scaling for Log-Linear
Models. The Annals of Mathematical Statistics, Volume 43, pages
1470-1480, 1972.
[6] E.T. Jaynes. Information Theory and Statistical Mechanics. Physical Review
106, pages 620-630, 1957.
[7] S. Gambs. Quantum Classification. 2008.
[8] S. Clark, B. Coecke, E. Grefenstette, S. Pulman and M. Sadrzadeh. A quantum teleportation inspired algorithm produces sentence meaning from word
meaning and grammatical structure. October, 2013.
[9] A. Berger. The Improved Iterative Scaling Algorithm: A gentle introduction.
December, 1997.
[10] M. Fleischer. Foundations of Swarm Intelligence - From Principles to Practice. 2003.
[11] R. Rosenfeld. A Maximum Entropy Approach to Adaptive Statistical Language Modeling. 1996.
[12] I. Brezina Jr. and Z. Cickova. Solving the Travelling Salesman Problem Using the Ant Colony Optimization, Management Information Systems, Vol.6,
2011.
[13] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory 13, pages
260-269, 1967.
[14] S. Aaronson and A. Ambainis. Quantum search of spatial regions. Theory of
Computing, pages 200-209, 2005.
[15] M. Boyer, G. Brassard, P. Hoyer and A. Tapp. Tight bounds on quantum
searching. Fortschritte Der Physik, 1998.
[16] C. Durr and P. Hoyer. A quantum algorithm for finding the minimum.
http://arxiv.org/abs/quant-ph/9607014, 1996.
[17] A. Ambainis. Quantum walks and their algorithmic applications.
http://arxiv.org/abs/quant-ph/0403120, 2008.
Appendix
The BNC Basic (C5) Tagset used for POS tagging
1. AJ0 Adjective (general or positive) (e.g. good, old, beautiful)
2. AJC Comparative adjective (e.g. better, older)
3. AJS Superlative adjective (e.g. best, oldest)
4. AT0 Article (e.g. the, a, an, no) [N.B. no is included among articles, which
are defined here as determiner words which typically begin a noun phrase,
but which cannot occur as the head of a noun phrase.]
5. AV0 General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest). [Note that adverbs, unlike
adjectives, are not tagged as positive, comparative, or superlative. This is because of the relative rarity of comparative and superlative adverbs.]
6. AVP Adverb particle (e.g. up, off, out) [N.B. AVP is used for such "prepositional adverbs", whether or not they are used idiomatically in a phrasal verb:
e.g. in ’Come out here’ and ’I can’t hold out any longer’, the same AVP tag
is used for out.]
7. AVQ Wh-adverb (e.g. when, where, how, why, wherever) [The same tag is
used, whether the word occurs in interrogative or relative use.]
8. CJC Coordinating conjunction (e.g. and, or, but)
9. CJS Subordinating conjunction (e.g. although, when)
10. CJT The subordinating conjunction that [N.B. that is tagged CJT when it
introduces not only a nominal clause, but also a relative clause, as in ’the day
that follows Christmas’. Some theories treat that here as a relative pronoun,
whereas others treat it as a conjunction. We have adopted the latter analysis.]
11. CRD Cardinal number (e.g. one, 3, fifty-five, 3609)
12. DPS Possessive determiner (e.g. your, their, his)
13. DT0 General determiner: i.e. a determiner which is not a DTQ. [Here a
determiner is defined as a word which typically occurs either as the first
word in a noun phrase, or as the head of a noun phrase. E.g. This is tagged
DT0 both in ’This is my house’ and in ’This house is mine’.]
14. DTQ Wh-determiner (e.g. which, what, whose, whichever) [The category
of determiner here is defined as for DT0 above. These words are tagged as
wh-determiners whether they occur in interrogative use or in relative use.]
15. EX0 Existential there, i.e. there occurring in the there is ... or there are ...
construction
16. ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)
17. NN0 Common noun, neutral for number (e.g. aircraft, data, committee)
[N.B. Singular collective nouns such as committee and team are tagged NN0,
on the grounds that they are capable of taking singular or plural agreement
with the following verb: e.g. ’The committee disagrees/disagree’.]
18. NN1 Singular common noun (e.g. pencil, goose, time, revelation)
19. NN2 Plural common noun (e.g. pencils, geese, times, revelations)
20. NP0 Proper noun (e.g. London, Michael, Mars, IBM) [N.B. the distinction
between singular and plural proper nouns is not indicated in the tagset, plural
proper nouns being a comparative rarity.]
21. ORD Ordinal numeral (e.g. first, sixth, 77th, last) [N.B. The ORD tag is
used whether these words are used in a nominal or in an adverbial role. Next
and last, as "general ordinals", are also assigned to this category.]
22. PNI Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
[N.B. This tag applies to words which always function as [heads of] noun
phrases. Words like some and these, which can also occur before a noun
head in an article-like function, are tagged as determiners (see DT0 and AT0
above).]
23. PNP Personal pronoun (e.g. I, you, them, ours) [Note that possessive pronouns like ours and theirs are tagged as personal pronouns.]
24. PNQ Wh-pronoun (e.g. who, whoever, whom) [N.B. These words are tagged
as wh-pronouns whether they occur in interrogative or in relative use.]
25. PNX Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
26. POS The possessive or genitive marker ’s or ’ (e.g. for ’Peter’s or somebody
else’s’, the sequence of tags is: NP0 POS CJC PNI AV0 POS)
27. PRF The preposition of. Because of its frequency and its almost exclusively
postnominal function, of is assigned a special tag of its own.
28. PRP Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)
29. PUL Punctuation: left bracket - i.e. ( or [
30. PUN Punctuation: general separating mark - i.e. . , ! , : ; - or ?
31. PUQ Punctuation: quotation mark - i.e. ’ or ”
32. PUR Punctuation: right bracket - i.e. ) or ]
33. TO0 Infinitive marker to
34. UNC Unclassified items which are not appropriately classified as items of the
English lexicon. [Items tagged UNC include foreign (non-English) words,
special typographical symbols, formulae, and (in spoken language) hesitation fillers such as er and erm.]
35. VBB The present tense forms of the verb BE, except for is, ’s: i.e. am, are,
’m, ’re and be [subjunctive or imperative]
36. VBD The past tense forms of the verb BE: was and were
37. VBG The -ing form of the verb BE: being
38. VBI The infinitive form of the verb BE: be
39. VBN The past participle form of the verb BE: been
40. VBZ The -s form of the verb BE: is, ’s
41. VDB The finite base form of the verb DO: do
42. VDD The past tense form of the verb DO: did
43. VDG The -ing form of the verb DO: doing
44. VDI The infinitive form of the verb DO: do
45. VDN The past participle form of the verb DO: done
46. VDZ The -s form of the verb DO: does, ’s
47. VHB The finite base form of the verb HAVE: have, ’ve
48. VHD The past tense form of the verb HAVE: had, ’d
49. VHG The -ing form of the verb HAVE: having
50. VHI The infinitive form of the verb HAVE: have
51. VHN The past participle form of the verb HAVE: had
52. VHZ The -s form of the verb HAVE: has, ’s
53. VM0 Modal auxiliary verb (e.g. will, would, can, could, ’ll, ’d)
54. VVB The finite base form of lexical verbs (e.g. forget, send, live, return)
[Including the imperative and present subjunctive]
55. VVD The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
56. VVG The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)
57. VVI The infinitive form of lexical verbs (e.g. forget, send, live, return)
58. VVN The past participle form of lexical verbs (e.g. forgotten, sent, lived,
returned)
59. VVZ The -s form of lexical verbs (e.g. forgets, sends, lives, returns)
60. XX0 The negative particle not or n’t
61. ZZ0 Alphabetical symbols (e.g. A, a, B, b, c, d)