Experimenting with the Path Ranking Algorithm
Matt Gardner, Abhishek Bhowmick, Karishma Agrawal, Dheeru Dua

1 Introduction
The Path Ranking Algorithm (Lao and Cohen, 2010)
is a general technique for performing link prediction
in a graph. PRA has mainly been used for knowledge
base completion (Lao et al., 2011; Gardner et al., 2013;
Gardner et al., 2014), though the technique is applicable to any kind of link prediction task. To learn a
prediction model for a particular edge type in a graph,
PRA finds sequences of edge types (or paths) that frequently connect nodes that are instances of the edge
type being predicted. PRA then uses those path types
as features in a logistic regression model to infer missing edges in the graph.
In this class project, we performed three separate
experiments relating to different aspects of PRA: improving the efficiency of the algorithm, exploring the
use of multiple languages in a knowledge base completion task, and using PRA-style features in sentence-level prediction models.
The first experiment looks at improving the efficiency and performance of link prediction in graphs
by removing unnecessary steps from PRA. We introduce a simple technique that extracts features from the
subgraph centered around a pair of nodes in the graph,
and show that this method is an order of magnitude
faster than PRA while giving significantly better performance. Additionally, this new model is more expressive than PRA, as it can handle arbitrary features
extracted from the subgraphs, instead of only the relation sequences connecting the node pair. The new feature types we experimented with did not generally lead
to better predictions, though further feature engineering may yield additional performance improvements.
The second experiment we did with PRA extends
recent work that performs knowledge base completion
using a large parsed English corpus in conjunction with
random walks over a knowledge base (Gardner et al.,
2013; Gardner et al., 2014). This prior work showed
significant performance gains when using the corpus
along with the knowledge base, and even further gains
by using abstract representations of the textual relations
extracted from the corpus. In this experiment, we attempt to extend these results to a multilingual setting,
with textual relations extracted from 10 different languages. We discuss the challenges that arise when dealing with data in languages for which parsers and entity
linkers are not readily available, and show that previous techniques for obtaining abstract relation representations do not work in this setting.
The final experiment takes a step towards a longstanding goal in artificial intelligence research: using a large collection of background knowledge to improve natural language understanding. We present a
new technique for incorporating information from a
knowledge base into sentence-level prediction tasks,
and demonstrate its usefulness in one task in particular: relation extraction. We show that adding PRAstyle features generated from Freebase to an off-theshelf relation extraction model significantly improves
its performance. This simple and general technique
also outperforms prior work that learns knowledge base
embeddings to improve prediction performance on the
same task.
In the remainder of this paper, we first give a brief
introduction to the path ranking algorithm. Then we
discuss each experiment in turn, with each section introducing the new methods, describing related work,
and presenting experimental results.
2 The Path Ranking Algorithm
PRA is a two-step process for generating a feature matrix over node pairs in a graph. The first step finds a set
of potentially useful path types that connect the node
pairs, which become the columns of the feature matrix.
The second step then computes the values in the feature matrix by finding random walk probabilities as described below. Once the feature matrix has been computed, it can be used with whatever classification model
is desired (or even incorporated as one of many factors
in a structured prediction model, as we discuss in Section 5), though almost all prior work with PRA simply
uses logistic regression.
More formally, consider a graph G with nodes N,
edges E, and edge labels R, and a set of node pairs (s_i,
t_i) ∈ D that are instances of some relationship of interest. PRA will generate a feature vector for each (s_i,
t_i) pair, where each feature is some sequence of edges
<e_1, e_2, ..., e_l>. If the edge sequence corresponding to the feature exists between the source and target
nodes in the graph, the value of that feature in the feature vector will be non-zero.
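To make this concrete, the following is a minimal Python sketch of path features over a toy graph; the graph representation, helper names, and example relations are our own illustration, not the implementation used in our experiments.

    from collections import defaultdict

    # graph[node] is a list of (edge_label, neighbor) pairs
    graph = defaultdict(list)

    def add_edge(source, label, target):
        graph[source].append((label, target))
        graph[target].append(("_" + label, source))  # inverse edge

    add_edge("Pittsburgh", "cityInState", "Pennsylvania")
    add_edge("Pennsylvania", "stateInCountry", "USA")

    def path_exists(s, t, path_type):
        """True if some walk from s to t follows exactly this edge label sequence."""
        frontier = {s}
        for label in path_type:
            frontier = {v for u in frontier for (l, v) in graph[u] if l == label}
        return t in frontier

    # The feature <cityInState, stateInCountry> fires for (Pittsburgh, USA):
    assert path_exists("Pittsburgh", "USA", ["cityInState", "stateInCountry"])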
Because the feature space considered by PRA is so
large (the set of all possible edge label sequences, with
cardinality $\sum_{i=1}^{l} |R|^i$, assuming a bound l on the maximum path length), and computing the feature values is
so computationally intensive, the first step PRA must
perform is feature selection, which is done using random walks over the graph. In this step of PRA, we find
path types π that are likely to be useful in predicting
new instances of the relation represented by the input
node pairs. These path types are found by performing random walks on the graph G starting at the source
and target nodes in D and recording which paths connect some source node with its target.[1] Note that these
are two-sided, unconstrained random walks: the walks
from sources and targets can be joined on intermediate
nodes to get a larger set of potential paths that connect
the source and target nodes. Once connectivity statistics have been computed in this way, k path types are
selected as features. Lao et al. (2011) use measures of
the precision and recall of each feature in this selection,
while Gardner et al. (2014) simply pick those most frequently seen.
Once a set of path features has been selected, the second step of PRA is to compute values for each cell in
the feature matrix. Recall that rows in this matrix correspond to node pairs, and the columns correspond to
the path types found in the first step. The value computed by PRA is the probability of arriving at the target
node of a node pair, given that a random walk began at
the source node and was constrained to follow the path
type: p(t|s, π). Computing this probability takes time
proportional to the average out degree of each node to
the power of the path length, and can be very expensive. Lao (2012) showed that dynamic programming
can be used to make this computation somewhat faster
when the target nodes of a query are known, but it still
can take a significant amount of time.
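The sketch below shows the dynamic program for a single path type on the toy graph above: at each step the walker chooses uniformly among edges of the required label, and probability mass that cannot continue is dropped. This is our simplified reading of the computation, not Lao's optimized implementation.

    from collections import defaultdict

    def path_probability(s, path_type):
        """Distribution over nodes reached from s by a walk constrained to
        follow `path_type`; p(t|s, path) = dist.get(t, 0.0)."""
        dist = {s: 1.0}
        for label in path_type:
            next_dist = defaultdict(float)
            for node, prob in dist.items():
                matches = [v for (l, v) in graph[node] if l == label]
                for v in matches:
                    next_dist[v] += prob / len(matches)  # uniform over matching edges
            dist = next_dist
        return dist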
As mentioned above, once the feature matrix has
been computed in the second step of PRA, one can use
any kind of classifier desired to learn a model and make
predictions on test data.
One important practical issue for most uses of PRA
is the selection of negative examples for training a
model. Typically a knowledge base only contains positive examples of a relation, and it is not clear a priori what the best method is for obtaining negative evidence. Prior work with PRA makes a closed world
assumption, treating any (s, t) pair not seen in the
knowledge base as a negative example. Negative instances are selected when performing the second step
of PRA—if a random walk from a source ends at a target that is not a known correct target for that source,
that source-target pair is used as a negative example.
[1] A deterministic algorithm, such as a breadth-first search, could obviously be used here instead of random walks. Random walks have been preferred over a BFS in prior work because they allow for a bound on the computation required that is independent of the size of the graph, which may be too large or too densely connected to allow for an efficient BFS.
Dataset    Method         MAP
Freebase   Probabilities  .337
           Binarized      .344
NELL       Probabilities  .303
           Binarized      .319

Table 1: Using binary feature values instead of random walk probabilities gives statistically indistinguishable performance. The p-value on the Freebase data is .55, while it is .25 for NELL.
3 More Efficient and Expressive Link Prediction
Our first experiment with PRA attempts to mitigate
some of the computational issues discussed in Section 2. The second step of PRA, where feature values
are calculated using path-constrained random walks,
requires a lot of computation. In this project we consider whether this computational effort is well-spent, or
whether we might more profitably spend computation
in other ways. We propose a new way of generating
feature matrices over node pairs in a graph that uses an
order of magnitude less time than PRA while achieving
significantly better performance.
We first motivate this technique with three observations. First, it appears that binarizing the feature matrix
produced by PRA, removing most of the information
gained in PRA’s second step, has no significant impact
on prediction performance in knowledge base completion tasks. We show this on the NELL KB and the Freebase KB in Table 1. The fact that random walk probabilities carry no additional information for this task
over binary features is a surprising result, and it shows
that the second step of PRA spends a lot of its computation for no discernible gain in performance.
Second, Neelakantan et al. (2015) presented experiments
showing a substantial increase in performance from using a much larger set of features in a PRA model. All
of their experiments used binary features, so this is not
a direct comparison of random walk probabilities versus binarized features, but it shows that increasing the
feature size beyond the point that is computationally
feasible with random walk probabilities seems useful.
Additionally, they showed that using path bigram features, where each sequential pair of edge types in each
path was added as an additional feature to the model,
gave a significant increase in performance. These kinds
of features are not representable in the traditional formulation of PRA.
Lastly, the method used to compute the random
walk probabilities—rejection sampling—makes the inclusion of more expressive features problematic. Consider the path bigrams mentioned above; one could
conceivably compute a probability for a path type that
only specifies that the last edge type in the path must be
r, but it would be incredibly inefficient with rejection
sampling, as most of the samples would end up rejected
(leaving aside the additional issues of an unspecified
path length). In contrast, if the features simply signify
whether a particular path type exists in the graph, without any associated probability, these kinds of features
are very easy to compute.
Given this motivation, our project attempts to improve both the efficiency and the expressivity of PRA
by removing the second step of the algorithm. Efficiency is improved because the second step is the
most computationally expensive, and expressivity is
improved by allowing features that cannot be reasonably computed with rejection sampling. We show experimentally that the techniques we introduce do indeed improve performance quite substantially.
3.1 Removing the second step of PRA
In this section we discuss how we construct feature matrices using only the first step of PRA. As outlined in
Section 2, the first step does a series of random walks
from each source and target node (s_i, t_i) in a dataset D.
In the original PRA algorithm these random walks are
used to find a relatively small set of potentially useful
path types for which more specific random walk probabilities are then computed. In our modification to the
algorithm we stop after this first set of random walks
and instead construct a binary feature matrix.
More formally, for each node s in the data (where
s could be either a source node or a target node), we
construct a subgraph centered around that node. Each
random walk that leaves s follows some path type π
and ends at some intermediate node i. We keep all of
these (π, i) pairs as the characterization of the subgraph
around s, and we will refer to this subgraph as G_s. To
construct a feature vector for a source-target pair (s_i,
t_i), we take the subgraphs G_{s_i} and G_{t_i} and merge them
on the intermediate nodes i. That is, if an intermediate
node i is present in both G_{s_i} and G_{t_i}, we take the path
types π corresponding to i and combine them (reversing
the path type coming from the target node t_i). If some
intermediate node for the source s_i happens to be t_i,
no combination of path types is necessary (and similarly if an intermediate node for the target t_i is s_i—the
path only needs to be reversed in this case). This creates a feature space that is exactly the same as that constructed by PRA: sequences of edge types that connect
a source node to a target node. To construct the feature
vector we just take all of these combined path types
as binary features for (s_i, t_i). Note, however, that we
need not restrict ourselves to only using the same feature space as PRA; Section 3.2 will examine extracting
more expressive features from these subgraphs.
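A minimal sketch of this construction, reusing random_walk and reverse_path from the earlier sketches; the walk counts are again arbitrary choices for illustration.

    from collections import defaultdict

    def subgraph(node, num_walks=200, max_len=3):
        """Characterize the subgraph around `node` as a set of (path_type, end_node) pairs."""
        g = set()
        for _ in range(num_walks):
            g.update(random_walk(node, max_len))
        return g

    def sfe_features(s, t):
        gs, gt = subgraph(s), subgraph(t)
        features = set()
        features.update(p for (p, end) in gs if end == t)   # a walk from s reached t
        features.update(reverse_path(p) for (p, end) in gt if end == s)
        by_end = defaultdict(list)
        for p, end in gt:
            by_end[end].append(p)
        for p, end in gs:   # merge the two subgraphs on intermediate nodes
            for q in by_end[end]:
                features.add(p + reverse_path(q))
        return features     # all features are binary: present or absent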
One other important issue is that of generating negative examples. PRA uses the second step of the algorithm to find negative examples for use during training.
We can no longer do this, as our method only generates features for given node pairs, and does not do a
search from source nodes to target nodes. Instead, we
devised a technique to find negative examples from a
Method   MAP     Ave. Features
PRA      .3704   835
SFE      .4007   8,275

Table 2: Comparison of PRA and SFE on 10 NELL relations. The difference shown is not statistically significant.
graph given a set of positive examples, and we used
this to obtain the training and testing data used in the
experiments below. Our technique takes each source
and target node in the given positive examples and finds
other nodes in the same category that are close in terms
of personalized page rank (PPR). We then sample new
(source, target) pairs from these lists of similar nodes,
weighted by their PPR score (while also allowing the
original source and target to be sampled). These become our negative examples, both at training and at
testing time.
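A sketch of this sampling scheme follows, assuming the personalized PageRank scores have been computed separately; the weight we give the original node and the number of samples per positive instance are arbitrary choices for illustration.

    import random

    def weighted_choice(scored_nodes):
        """Sample one node from a list of (node, weight) pairs."""
        total = sum(w for _, w in scored_nodes)
        r = random.uniform(0, total)
        for node, w in scored_nodes:
            r -= w
            if r <= 0:
                return node
        return scored_nodes[-1][0]

    def sample_negatives(positives, ppr_neighbors, per_positive=10):
        """ppr_neighbors[x]: list of (same-category node, PPR score) pairs near x."""
        known = set(positives)
        negatives = set()
        for s, t in positives:
            for _ in range(per_positive):
                s2 = weighted_choice(ppr_neighbors[s] + [(s, 1.0)])  # original allowed
                t2 = weighted_choice(ppr_neighbors[t] + [(t, 1.0)])
                if (s2, t2) not in known:
                    negatives.add((s2, t2))
        return negatives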
This method for generating a feature matrix over
node pairs in a graph is much simpler and less computationally expensive than PRA, and, if the results in
Table 1 are to be believed, should perform on par with
PRA with drastically reduced computation costs. Or,
we could increase the computation a little and get a
much larger feature space, presumably outperforming
PRA.
Some experimentation shows that it is not that simple. Table 2 shows a comparison between PRA and this
subgraph feature extraction technique (labeled SFE) on
10 NELL relations. SFE has a higher mean average
precision, but the difference is not statistically significant. There is a large variance in the performance of
SFE, and on some relations PRA performs better.
We examined the feature matrices computed by these
methods and discovered that the reason for the inconsistency of SFE's improvement is that its random
walks are all unconstrained. Consider the case of a
node with a very high degree, say 1000. If we only
do 200 random walks from this node, we cannot possibly get a complete characterization of the graph even
one step away from the node. If a particularly informative path is <C ITY I N S TATE, S TATE I N C OUNTRY>,
and both the city from which a random walk starts and
the intermediate state node have very high degree, the
probability of actually finding this path type using unconstrained random walks is quite low. This is the benefit gained by the path-constrained random walks performed by PRA; PRA leverages training instances with
relatively low degree and aggregation across a large
number of instances to find path types that are potentially useful. Once they are found, significant computational effort goes into discovering whether each
path type exists for all (s, t) pairs. It is this computational effort that allows the path type <cityInState,
stateInCountry> to have a non-zero value even for
very highly connected nodes.
Method    MAP     Ave. Features   Time
PRA       .3704   835             44 min.
SFE-RW    .4007   8,275           6 min.
SFE-BFS   .480    237,853         5 min.

Table 3: Comparison of PRA and SFE on 10 NELL relations. SFE-RW is not statistically better than PRA, but SFE-BFS is (p < 0.05).

How do we mitigate this issue, so that SFE can con-
sistently find these path types? It seems the only option without resorting to a similar two-step process to
what PRA uses is to do a more exhaustive search. PRA
uses random walks to improve scalability on very large
graphs, particularly because the second step of the algorithm is so expensive. However, if we are only doing
a single search, and the graph fits in memory, a few
steps of a breadth-first search (BFS) per node is not
infeasible. We can make the BFS more tractable by
excluding edge types whose fan out is too high. For
example, at a category node in Freebase, there could
be thousands of edges of type /type/object/type;
if there are a large number of edges of the same type
leaving a node, we do not include those edges in the
BFS. Note that because the category node will still be
counted as an intermediate node in the subgraph, we
can still find paths that go through that node; we just
do not continue searching if the out-degree of a particular edge type is too high.
When using a BFS instead of random walks to obtain
the subgraphs G_{s_i} and G_{t_i} for each node pair, we saw a
dramatic increase in the number of features found and a
substantial increase in performance, while also achieving a slight decrease in running time.2 These results
are shown in Table 3; SFE-RW is our SFE implementation using random walks, and SFE-BFS uses a breadthfirst search. The running time decrease for SFE-BFS
was due to the fact that we were using GraphChi for
SFE-RW, which does a number of scans through the
whole graph, which is stored on disk. SFE-BFS loads
the graph into memory, instead of loading it from disk
several times, and then only processes nodes as necessary in the BFS.
3.2 Adding more expressive features
Having shown that doing subgraph feature extraction
with binary features significantly outperforms computing random walk probabilities with PRA, we now look
into using a richer set of features than those allowed
by the rejection sampling procedure used by PRA. In
addition to using path types connecting node pairs as
features, we investigated edge sequences that do not
necessarily start and end at nodes of interest. Such path
types may capture useful information about either the
start or target node individually, implicitly pointing to
the existence of the relation that we are trying to predict. We also consider paths that are similar to PRA-generated path types and may potentially exist between
the node pair. In this section, we discuss the new feature types that were added to the PRA implementation,
and the effect they had on performance on top of subgraph feature extraction.

[2] These results were using L1 and L2 parameters tuned for PRA on a separate but related dataset. When the parameters are tuned for SFE, performance increases further, as we show below.

3.2.1 One-sided paths
A one-sided path is a sequence of edges that starts at
a source or target node, but terminates at neither of the
two. Following the notation introduced in Section 3.1,
we use as features each (π, i) pair in the subgraph
characterizations G_s and G_t, along with whether the
feature came from the source node or the target node.
The motivation for these one-sided path types is to better model which sources and targets are good candidates for participating in a particular relation (because,
e.g., not all cities participate in the relation cityCapitalOfCountry, even though the domain of the relation is all cities).
Figure 1: One-sided paths: ‘cityInState’ and ‘flows in’ are one-sided paths starting at the source and target, respectively; they help detect the relation ‘overlooks’.

3.2.2 Comparisons of one-sided paths
Comparisons of the intermediate nodes at which onesided paths terminate may yield useful insights about
the existence of a relation between the node pair.
Meaningful comparisons can be made when the intermediate nodes are categorical or numerical in nature.
Drawing again on the notation from Section 3.1, we
will describe these features as analogous to the pairwise PRA features. To get the PRA features, we intersect the intermediate nodes i from the subgraphs G_s
and G_t, and combine the path types π when we find
common intermediate nodes. To get these comparison features, we instead intersect the subgraphs on the
path types, and combine the intermediate nodes when
there are common path types. That is, if we see a common path type, such as / PEOPLE / PERSON / RELIGION,
we will construct a feature representing a comparison
between the intermediate node for the source and the
target. If the values are the same, this information can
be captured with a PRA feature, but it cannot be easily
captured by PRA when the values are different.
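A sketch of these comparison features over the subgraph characterizations from Section 3.1; the ‘same’/‘different’ encoding is one simple choice, and comparisons for numeric nodes could be added analogously.

    from collections import defaultdict

    def comparison_features(gs, gt):
        """Intersect the two subgraphs on path types and compare the nodes at
        which matching one-sided paths end."""
        src_ends, tgt_ends = defaultdict(set), defaultdict(set)
        for p, end in gs:
            src_ends[p].add(end)
        for p, end in gt:
            tgt_ends[p].add(end)
        features = set()
        for p in set(src_ends) & set(tgt_ends):  # common path types
            for a in src_ends[p]:
                for b in tgt_ends[p]:
                    features.add((p, "same" if a == b else "different"))
        return features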
Feature Types                    MAP
PRA                              .375
PRA + One-sided                  .425
PRA + One-sided + Comparisons    .4002

Table 4: Feature ablation study using a random walk implementation of SFE.
Figure 2: Comparisons of paths: the nodes 10 and 73 at the end of the one-sided paths ‘age’ point negatively to the relation ‘sibling of’. The nodes RUSSIA at the end of the path ‘nationality’ count favorably towards the existence of the relation ‘speaks same language as’.
3.2.3 Vector Similarity paths
Gardner et al. (2014) introduced a modification of
PRA’s random walks to incorporate vector space similarity between the relations in the graph. On the data
they were using, a graph that combined a formal knowledge base with textual relations extracted from text,
they found that this technique gave a substantial performance improvement. The vector space random walks
only affected the second step of PRA, however, and we
have removed that step in SFE. While it is not as conceptually clean as the vector space random walks, we
can obtain a similar effect with a simple feature transformation using the vectors for each relation. We obtain vector representations of relations through factorization of the knowledge base tensor as did Gardner et
al., and exploit the vector space similarities to obtain
paths that are close to the original PRA or SFE generated paths. We also introduce a special “any edge”
symbol, and say that all other edge types are similar to
this edge type.
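One plausible implementation of this feature transformation is sketched below: each edge type in a path feature is replaced, one position at a time, by relations whose embeddings are nearby, plus the special any-edge symbol. The similarity threshold, the any-edge token, and the assumption of unit-normalized embedding vectors are all ours.

    import numpy as np

    ANY_EDGE = "@ANY_REL@"  # special symbol treated as similar to every edge type

    def similar_relations(rel, embeddings, threshold=0.8):
        """Relations whose cosine similarity to `rel` exceeds the threshold;
        `embeddings` maps relation name -> unit-normalized numpy vector."""
        v = embeddings[rel]
        close = [r for r, u in embeddings.items()
                 if r != rel and float(np.dot(v, u)) >= threshold]
        return close + [ANY_EDGE]

    def expand_feature(path_type, embeddings):
        """All variants of a path feature with one edge type swapped for a similar one."""
        expanded = set()
        for i, rel in enumerate(path_type):
            for r in similar_relations(rel, embeddings):
                expanded.add(path_type[:i] + (r,) + path_type[i + 1:])
        return expanded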
Figure 3: Vector similarity paths: the path <rival of, builds product> is similar in vector space to the original path type <competes with, economic sector>.
3.2.4 Experiments
Here we present the results of some experiments using these new features. For these experiments, we used
10 relations from NELL (the same 10 used by Gardner
et al. (2014)). We first experimented with adding the
features to our random walk implementation of SFE,
and found that the features seemed to improve performance. However, because there is so much variance in
the random walk SFE algorithm, the differences shown
Feature Types           MAP      Ave. Features
PRA                     .5000    238k
+ One-sided             .4908    1,261k
+ Comparisons           .4675    554k
+ One-sided + Comps.    .4901    1,578k
+ Vector similarity     .5942    3,954k

Table 5: Feature ablation study using a BFS implementation of SFE. All rows use PRA features. PRA + vector similarity is statistically better than all other methods (highest p-value is 0.033), and none of the other differences are significant.
in Table 4 are not statistically significant. Seeing this
result is what initially drove us to implement SFE with
a breadth-first search.
We also present in Table 5 the result of a similar ablation study with the BFS implementation of
SFE. While the one-sided and categorical comparison
features seemed like they helped a little in the random walk implementation, they do not improve performance with a BFS implementation. The vector similarity features do, however, significantly improve performance, giving a final MAP score of .594.
Finally, we present a comparison between PRA,
PRA with vector space random walks, and our BFS
SFE implementation. This is shown in Table 6. SFE
significantly outperforms PRA, and does so in less
time.[3]

[3] For the final line (SFE with both PRA and vector similarity features), almost all of the time was spent in MALLET's logistic regression optimizer, which is single-threaded and did not scale well to the greatly expanded feature size. A more efficient optimizer would drastically cut down the running time of this method, while not affecting the running time of the other methods much, as little of that time was spent in the optimizer.
Method                       MAP      Time
PRA                          .3704    44 min
Vector space PRA             .4601    46 min
SFE (PRA features)           .5000    7 min
SFE (PRA and VS features)    .5942    54 min

Table 6: Results of final comparison between PRA and SFE, with and without vector space similarity features. The final line is statistically better than all other methods.
Method                       MAP
Logistic regression          .4799
SVM with quadratic kernel    .3895
SVM with linear kernel       .3726
SVM with RBF kernel          .2875

Table 7: Experimenting with the learning algorithm. All methods use a breadth-first search SFE to generate the same feature matrix. Logistic regression is statistically better than all other methods, with p < 0.005.
3.3 Beyond logistic regression
One kind of feature we wanted to add was a conjunction between some of the feature types introduced in
the previous section. Lao has seen some improvement
from a limited application of these kinds of conjunctions[4], but finding useful conjunctions can be problematic, especially in our SFE setting where we do not have
an explicit feature selection step. However, one should
be able to capture something similar to these conjunction features using a kernelized classifier. We used the
kernel SVM from the popular libSVM library (Chang
and Lin, 2011) for the task of predicting relations. We
evaluated kernels such as linear, quadratic and Gaussian RBF.
Table 7 shows the results of these experiments. The
quadratic kernel did improve over the linear kernel,
slightly, though the difference was not quite statistically significant (perhaps testing with a few more relations would have pushed the result to statistical significance). Contrary to our expectation, however, we
did not see any improvement from using any kind of
SVM over logistic regression. It could be that using
some kind of ranking loss function is more appropriate for our task, which is evaluated using mean average
precision, a metric over ranked lists. It could also be
that the ratio of number of features to number of training examples was too high for the SVM (there were
240k features per relation on average, and 5k training instances). Further experimentation and analysis
is necessary to understand why SVMs did not perform
well here.
4 Multilingual PRA
Our second project looks at performing PRA over
a knowledge base graph augmented by relations extracted from text in multiple languages. Prior work
by Lao et al. (2012) and Gardner et al. (2014) established that incorporating dependency-parsed sentences
into a knowledge base (KB) results in stronger connectivity of the graph. In particular, performing random
walks over this augmented KB has shown significant
improvement in learning relationships and semantic inference rules. In this experiment, the focus is on incorporating lexicalized syntactic labels from multiple
languages. The underlying
[4] From personal correspondence.
Language     # Triples
Arabic       637k
Chinese      2,722k
French       6,214k
Georgian     464k
Hindi        358k
Indonesian   1,623k
Latvian      467k
Russian      6,893k
Swahili      107k
Tagalog      97k

Table 8: Number of SVO triples per language in the dataset.
idea is to identify the merits of a KB reinforced with
Subject-Verb-Object (SVO) triples from a multilingual
text corpus, and to test different methods of establishing
connections between different languages.
4.1 Data
For the purpose of our experiment we have used the dependency-parsed multilingual Wikipedia corpus generated by Faruqui and Kumar (2015). The corpus consists of SVO triples from 10 languages – French, Russian, Chinese, Arabic, Hindi, Indonesian, Tagalog, Latvian, Swahili and Georgian. Table 8 lists the number of
these triples in thousands. The knowledge base used is
a subset of Freebase (Bollacker et al., 2008), extracted
with focus on relations that seemed applicable to the
text corpus.
4.2 Experiment Setup
The text corpus, though dependency parsed, was not
in a format suitable as input to PRA. The major challenges outlined here deal with transforming this data
and generating suitable mappings to the KB. The major tasks of the data-preprocessing pipeline consist of
alias mapping and generation of translation tables.
4.2.1 Alias Mapping
The first step in the data processing pipeline was to extract the SVO triples from the text corpus. This was
followed by identifying ALIAS relations with the KB. The initial method for generating these relations was to look
for mappings of KB entities to any unigram obtained
from the subject or object of each triple. This quickly
produced a large number of noisy mappings. For example, unigrams such as “Ben” and
“Affleck” would be mapped separately; while in
this case we were referring to the actor, “Ben” could
be mapped to other unrelated entities. To refine the
ALIAS relation we looked into bigrams and trigrams
generated from the SVO and their mappings with the
KB. Amongst others, one of the challenges faced here
was the computational power required. This part of the
pipeline took around 9 hours on a 24-core machine with
60 GB of RAM.

/location/location/contains
/people/person/nationality
/music/album/artist
/film/film/country
/location/administrative_division/country
/book/written_work/subjects
/tv/tv_program/country_of_origin
/biology/organism_classification/higher_classification
/language/human_language/main_country
/language/human_language/region
/people/ethnicity/languages_spoken

Table 9: Relations used in our experiments.
Additionally, the aliases in Freebase are heavily
skewed towards Latin scripts. For 5 of the languages
under consideration—Chinese, Georgian, Hindi, Arabic and Russian—there were few direct links to KB entities. This increased the difficulty faced in the actual
experiments.
4.2.2 Translation Tables
The problems mentioned in the previous section
prompted us to look at smarter methods for enhancing the connections between the multilingual corpora.
To this end, the method we explored in this experiment was to incorporate translation relations. One of
the methods to deal with this is implicitly taken care of
by Freebase: entities in different languages that refer
to the same object get mapped to a single unique key. However, this is mostly limited to languages with Latin
scripts, where the spelling of the proper noun does not
change. The coverage for the other languages is much
smaller. The alternate method is to introduce external
translation tables. For this purpose, we obtained parallel text from Tiedemann (2012) and used the fast_align
software provided by Dyer et al. (2013) to generate
word alignments. Due to the diversity of the languages,
their scripts and structures, the generated alignments
had some noise. For each word, we took the top three
aligned translations.
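A sketch of how such a table can be built from fast_align's output, which gives one line of Pharaoh-format alignment links (e.g. "0-0 1-2") per sentence pair; the tokenization and the top-three cutoff follow the description above, while the function and variable names are ours.

    from collections import Counter, defaultdict

    def build_translation_table(src_sents, tgt_sents, alignment_lines, top_k=3):
        """src_sents/tgt_sents: tokenized parallel sentences;
        alignment_lines: fast_align output, one "i-j i-j ..." line per pair."""
        counts = defaultdict(Counter)
        for src, tgt, line in zip(src_sents, tgt_sents, alignment_lines):
            for link in line.split():
                i, j = map(int, link.split("-"))
                counts[src[i]][tgt[j]] += 1
        # keep the top_k most frequently aligned translations per word
        return {word: [t for t, _ in c.most_common(top_k)]
                for word, c in counts.items()}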
4.2.3 Relationship Selection
The alias mapping had identified multiple Freebase relationships present in the text corpora. For the purpose
of this experiment we selected 11 relationships, listed in
Table 9. Relations that had more than 2000 instances
associated with them were undersampled. On average,
each relation had 1240 instances, which were split 80%-20% into training and test sets.
4.3 Experiments
To conduct experiments, we used graphs constructed
with a combination of edges from the Freebase KB
and edges from the extracted SVO triples. In all cases
where we used SVO triples, all of the triples from all
Experiment                     MAP      MRR
KB                             0.4665   0.6664
SVOs                           0.2562   0.7424
SVOs (vector space)            0.3453   0.7652
SVOs (single rel.)             0.3747   0.8667
KB + SVOs                      0.4987   0.7094
KB + SVOs + translations       0.4922   0.6024
KB + SVOs (vector space)       0.5470   0.7094
KB + SVOs (single relation)    0.5603   0.7538

Table 10: Results for experiment 1, mixed language training data. Each experiment uses the same training and test data on graphs constructed with the relation sets indicated. No method is statistically better than all other methods.
languages were added to the graph. We used three different ways to represent these triples, however. The
first was to use the original surface relation as present in
the data we obtained from Faruqui and Kumar (2015).
The second was to use a factorization of the graph to
obtain vector representations of these relations and use
the vector space random walks presented by Gardner
et al. (2014). Lastly, we tried replacing all of the surface relations with a single placeholder relation indicating that there was some connection between the noun
phrases in the SVO data.
There were two sets of experiments that we conducted. In the first case, we took all of the relation
instances found from all languages and randomly split
them into training and test sets. In the second case,
we used the relation instances found from 9 of the languages as training data and tried to predict the instances in the 10th language (French). To evaluate our
results, we have considered two metrics: mean average precision (MAP) and mean reciprocal rank (MRR).
MAP is equivalent to the area under the precision-recall
curve. MRR is defined as the precision of the top prediction for each relation.
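For concreteness, here is a minimal sketch of the standard per-query computations behind these metrics (average precision, and the reciprocal rank of the first correct prediction), which are then averaged over queries; this illustrates the usual formulation rather than our exact evaluation code.

    def average_precision(ranked_predictions, correct):
        hits, total = 0, 0.0
        for rank, item in enumerate(ranked_predictions, start=1):
            if item in correct:
                hits += 1
                total += hits / rank  # precision at this recall point
        return total / max(len(correct), 1)

    def reciprocal_rank(ranked_predictions, correct):
        for rank, item in enumerate(ranked_predictions, start=1):
            if item in correct:
                return 1.0 / rank
        return 0.0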
4.3.1 Mixed Language Training Data
In this set of experiments we aggregated all the instances from all the 10 languages and randomly split
them into training and testing sets. We then trained models and scored the test data on several different graphs
with various combinations of KB and SVO edges.
The results are shown in Table 10. While there are
differences in the scores on these graphs, performance
varied enough between relations that almost none of the
differences are statistically significant, and it is hard to
draw many conclusions just from this table (thus nothing is bolded in the table, as there is no method that is
statistically better than all of the others). However, with
the caveat that these results do not achieve statistical
significance, it appears that just using the KB to infer
new edges is better than using the extracted SVO data.
This seems to indicate that there is a great deal of noise
in our method for obtaining relation instances from the
Experiment                  MAP      MRR
KB                          0.2555   0.6535
SVOs                        0.0543   0.4727
SVOs (vector space)         0.1835   0.9099
SVOs (single rel.)          0.1877   0.7508
KB + SVOs                   0.2688   0.6455
KB + SVOs (vector space)    0.3168   0.7744
KB + SVOs (single rel.)     0.2918   0.6883

Table 11: Results from experiment 2, cross-lingual corpus.
SVO data—had our method for alignment been cleaner,
we should have been able to construct better models
from the data. The table also shows that using vector space representations of SVO data is better than not
using them, and that just using a single placeholder relation instead of the actual surface text does better still
(and for the case where there is no KB, those two differences are just about the only statistically significant
differences in the table). From this we can see that the
mere fact that two entities of particular types are mentioned together in text is a pretty strong indication of
those entities participating in a particular relationship,
at least for the relations we tested. We can also conclude from this that a tensor factorization for learning
vector space representations does not work very well
in this setting, as we could do better simply by just replacing all of the relation labels with a single dummy
relation.
We should also mention that adding the translation
edges did not improve performance. In fact, there were
no translation edges found in any of the random walks
for any of the relations. We believe that this is due to
a mis-alignment of the corpora used for obtaining the
translation edges and the multilingual SVO corpora we
used to obtain training data. Most of the aliases we
found were proper nouns, while the translation edges
consisted mostly of common nouns.
4.3.2 Cross Language Training Data
For our second set of experiments, we held out the relation instances obtained from the French SVO from
the training set (Table 11), and then tried to predict
these instances from the models learned on the other
languages. We used French as our held-out language
because it had the highest number of relation instances,
giving us a reasonably-sized test set.
Here we see results that are broadly similar to what
we saw in the mixed language experiment. The difference in MAP in the KB setting should allow for some
calibration of the difficulty of these two different test
sets, in general—the fact that MAP is lower even when
only using the KB means that this is a harder test set.
Additionally, we see that using SVO triples themselves resulted in very poor performance, which should
be expected due to the language difference at training
and at test time. However, using vector space or single
relation representations allowed us to make much better predictions (though still not as good as in the mixed
language setting). From this we can see that we do get
some limited amount of cross-lingual transfer in this
setting, but it is incredibly shallow and leaves a lot to
be desired, and a lot of room for improvement.
4.4 Future Work
We faced several challenges in performing knowledge
base completion in multiple languages. One of the primary challenges is to obtain better translation relations.
We saw that the potential benefit to be gained from the
translation tables was not apparent in our experiment
due to a mismatch in the data. Going forward, we intend
to improve this relation. One method we are looking at
is to use a common source for both the text corpora and
the parallel texts.
We also intend to carry out more experiments to
see whether the vector embeddings in the cross-lingual
experiment learned anything, or whether they merely
captured some potential SVO connection between two
instances.
Finally, we plan to study how large the cross-lingual
lift can be. For this purpose we will conduct another
set of experiments with additional baselines: one where
only the SVO triples from the testing language are used
to form the knowledge graph, and another where the
KB is augmented with only the testing language.
5 Adding KB features to sentence-level models
A long-standing focus of Artificial Intelligence has
been to make machines understand and comprehend
text like humans do. This has induced active research
in the field of Information Extraction for better document understanding. The availability of datasets like
Freebase and Wikipedia has facilitated this research
on a large scale. Because there are so many different individual prediction tasks (e.g., one per relation
in Freebase), the annotation effort required for completely supervised relation extraction techniques is very
expensive. This is why “weak” or “distant” supervision
seems to be a very promising technique to learn relation
extractors from text.
In this paper we propose an approach for predicting overlapping relations between two entities by biasing the prediction decision on inferences drawn from
a large scale knowledge repository, Freebase. The
method we present is a very generic way of injecting features into any graphical model representing pairwise factors between any two entities. In this project,
we experiment with relation extraction using the MultiR algorithm (Hoffmann et al., 2011) at the sentence
level. As per (Hoffmann et al., 2011), it is possible
for a pair of entities to have multiple relations between
them; for instance, capitalOf(Delhi, India) and locatedIn(Delhi, India). As per Lao et al. (2011), the
entities are also indirectly connected via other entity
Figure 4: The MultiR model.

Figure 5: Comparison between the methods we introduced: precision-recall curves for PRA per mention, PRA per entity, and PRA only.
nodes in a graphical representation of the knowledge
base, which can be used to factor in weak supervision between an entity pair. We employ the above two facts to
learn a joint model with the MultiR algorithm, which follows an online perceptron-style approach to update the
parameter vector per sentence mention for each pair of
entities. The dataset that we use was prepared by Sebastian Riedel for his work (Riedel et al., 2010); it
contains 52 relations, with 67,946 entity pair-relation tuples in the training set and 96,678 entity pair-relation tuples
in the testing set.
5.1 Methodology
The graphical notation of both the models in reference
to the baseline is shown in Figure 4. The MultiR model
is the baseline model proposed by (Hoffmann et al.,
2011). Y corresponds to the relation between an entity pair for a particular sentence mention S. The latent
variable zi are predicted by the model, if the prediction is incorrect updates are made to the parameter vector. P represents the PRA features that were injected
into the MultiR model in two different manner. Since
two entities can be related to each other via multiple
relations, Y is a vector of length equal to the number
of possible relations in the dataset. If a relation exists
between the two entities under consideration the corresponding bit is activated in the relation vector Y .
5.1.1 Per Entity Pair
In this model, we add features extracted by the Path
Ranking Algorithm (Lao et al., 2011), which are direct or indirect paths between any two entities. These
are obtained by performing random walks on the Free-
base graph. “Per Entity pair” means we add PRA features once for an entity pair. Each entity pair contains
a set of sentence mentions with the syntactic features
(obtained from a dependency parse) per sentence defining the sentential construction. This approach performs
better than using only PRA features to train the model
(Figure 5), as it incorporates syntactic patterns describing how language-specific constructs can define a relation between two entities; for instance, “Italy, Europe”
indicates that Italy is locatedIn Europe.
5.1.2 Per Mention
In this model, we add PRA features per sentence mention, in contrast to the previous case where we added the
features per entity pair. This model performs better
than the previous one for two reasons.
First, it makes the feature space less sparse. Second,
it trains a better representation of the model by bringing
the semantic and syntactic components of the language
together in each parameter update. The precision for
PRA per entity goes down quickly at higher values of
recall, which is because we do not encounter PRA
features with every parameter update. This makes the
denominator grow faster than the numerator in the recall calculation, recall = TP / (TP + FN).
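The difference between the two schemes can be sketched as follows; syntactic_features and the pra_features lookup are hypothetical stand-ins for MultiR's actual feature extractors, not its real API.

    def syntactic_features(mention):
        # hypothetical placeholder for MultiR's dependency-path features
        return ["dep:" + mention]

    def featurize(entity_pair, mentions, pra_features, mode="per_mention"):
        """PRA features for the pair are either appended to every sentence
        mention's feature vector ("per_mention") or emitted once at the
        entity-pair level ("per_entity")."""
        kb_feats = pra_features.get(entity_pair, [])
        mention_vectors = []
        for m in mentions:
            feats = list(syntactic_features(m))
            if mode == "per_mention":
                feats.extend(kb_feats)
            mention_vectors.append(feats)
        pair_feats = kb_feats if mode == "per_entity" else []
        return mention_vectors, pair_feats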
5.3 Related Work
The first attempt toward supervised Information Extraction was made by Soderland (1997). However,
with the growth in the amount of un-annotated data it became very expensive to adopt new Supervised Learning techniques. Craven and Kumlien (1999) introduced a
weak or distant supervision technique to match the Yeast
Protein Database with abstracts in PubMed using a
naive Bayes extractor.
The KYLIN system (Wu and Weld, 2007) employed
weak supervision on Wikipedia infoboxes to learn relationships between entities by considering Infoboxes as
the database. Mintz et al. (2009) learned 100 relational
Figure 6: Comparison with prior work: precision-recall curves for Hoffmann, Weston, and our approach.
extractors on Wikipedia data with the help of Freebase
facts.
Riedel et al. (2010) used multi-instance learning to
learn a graphical model assuming that there is at least
one sentence mention in a corpus that expresses the arguments of a Freebase fact. This helps eliminate the
effects of noisily labeled training data.
Hoffmann et al. (2011) proposed a weakly supervised method for relation extraction, where they used
syntactic features like dependency parses between the
entity pairs which vary as per their usage in a sentence.
Surdeanu et al. (2012) also proposed a similar model
which used textual information for relation extraction.
The current state-of-the-art algorithm was proposed
by Weston et al. (2013). They first learn low-dimensional vector space representations, or embeddings,
for entities and relationships. Then, a function of sentence mentions and the relation between entities is learned
based on the features in the embedding space. These
functions are then projected in the embedding space to
predict the relationship between two entities present in a
sentence. This model leverages weak supervision from
text as well as knowledge base triples.
In our approach, not only do we consider the sentential features together with the knowledge base, we also
bias our relation predictions toward the global knowledge extracted from the knowledge base.
5.4 Results and Conclusion
Figure 6 displays the aggregate precision/recall curves
of our approach (MultiR augmented with PRA features)
and the existing state-of-the-art approaches HOFFMANN (Hoffmann et al., 2011) and WESTON (Weston et al., 2013).
It also shows how both the ways of PRA feature
augmentation perform in comparison to the baseline.
We used the dataset developed by Sebastian Riedel
for (Riedel et al., 2010). We removed the direct and
inverse relation edges from the Freebase graph included
with the dataset.
There were some problems reproducing the results
of the algorithm proposed by Weston et al., so
we used a plot digitizer to obtain the results presented
in (Weston et al., 2013).
First, we trained a model directly with the source code
used in (Hoffmann et al., 2011), but we found that, due
to the increase in feature space, the model parameters
were not generalizing well. Due to the noise in parameter learning caused by the relatively small number of
training instances, we resorted to regularization. We
did 5-fold cross-validation with an 80%-20% training
and testing split to choose the parameter for L2 regularization. After regularization we observed an overall
performance gain in the aggregate precision-recall curve.
6 Conclusion
We have done three separate experiments improving
the path ranking algorithm in various ways. We showed
that subgraph feature extraction gave better performance than PRA in terms of both running time and
mean average precision, and allows for the addition of
much more expressive features than can be computed
by PRA. We gave an exploration of the challenges involved with using multilingual data to perform knowledge base completion. And we showed that a simple
and general technique for incorporating features from
knowledge bases into sentence-level prediction models
can greatly improve performance over previous relation
extraction techniques.
References
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim
Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD.
Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology
(TIST), 2(3):27.
Mark Craven and Johan Kumlien. 1999. Constructing
biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh
International Conference on Intelligent Systems for
Molecular Biology, 3:993–1022, March.
Chris Dyer, Victor Chahuneau, and Noah A Smith.
2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of HLT-NAACL, pages 644–648.
Manaal Faruqui and Shankar Kumar. 2015. Multilingual open relation extraction using cross-lingual projection. In Proceedings of ACL.
Matt Gardner, Partha Talukdar, Bryan Kisiel, and Tom
Mitchell. 2013. Improving learning and inference in
a large knowledge-base using latent syntactic cues.
In Proceedings of EMNLP. Association for Computational Linguistics.
Matt Gardner, Partha Talukdar, Jayant Krishnamurthy,
and Tom Mitchell. 2014. Incorporating vector space
similarity in random walk inference over knowledge
bases. In Proceedings of EMNLP. Association for
Computational Linguistics.
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke
Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of
overlapping relations. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550. Association for Computational Linguistics.
Ni Lao and William W Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1):53–67.
Ni Lao, Tom Mitchell, and William W Cohen. 2011.
Random walk inference and learning in a large scale
knowledge base. In Proceedings of EMNLP. Association for Computational Linguistics.
Ni Lao, Amarnag Subramanya, Fernando Pereira, and
William W Cohen. 2012. Reading the web with
learned syntactic-semantic inference rules. In Proceedings of EMNLP-CoNLL.
Ni Lao. 2012. Efficient Random Walk Inference with
Knowledge Bases. Ph.D. thesis, Carnegie Mellon
University.
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP, pages
1003–1011. Association for Computational Linguistics.
Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. 2015. Compositional vector space models
for knowledge base completion. In Proceedings of ACL. Association for Computational Linguistics.
Sebastian Riedel, Limin Yao, and Andrew McCallum.
2010. Modeling relations and their mentions without
labeled text. In Machine Learning and Knowledge
Discovery in Databases, pages 148–163. Springer.
Stephen Soderland. 1997. Learning to extract text-based information from the World Wide Web. In Proceedings of KDD-97, pages 251–254.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati,
and Christopher D Manning. 2012. Multi-instance
multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–
465. Association for Computational Linguistics.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of LREC, pages 2214–2218.
Jason Weston, Antoine Bordes, Oksana Yakhnenko,
and Nicolas Usunier. 2013. Connecting language
and knowledge bases with embedding models for relation extraction. In Proceedings of EMNLP.
Fei Wu and Daniel S Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM), pages 41–50. ACM.