Experimenting with the Path Ranking Algorithm

Matt Gardner, Abhishek Bhowmick, Karishma Agrawal, Dheeru Dua
1 Introduction

The Path Ranking Algorithm (Lao and Cohen, 2010) is a general technique for performing link prediction in a graph. PRA has mainly been used for knowledge base completion (Lao et al., 2011; Gardner et al., 2013; Gardner et al., 2014), though the technique is applicable to any kind of link prediction task. To learn a prediction model for a particular edge type in a graph, PRA finds sequences of edge types (or paths) that frequently connect nodes that are instances of the edge type being predicted. PRA then uses those path types as features in a logistic regression model to infer missing edges in the graph.

In this class project, we performed three separate experiments relating to different aspects of PRA: improving the efficiency of the algorithm, exploring the use of multiple languages in a knowledge base completion task, and using PRA-style features in sentence-level prediction models.

The first experiment looks at improving the efficiency and performance of link prediction in graphs by removing unnecessary steps from PRA. We introduce a simple technique that extracts features from the subgraph centered around a pair of nodes in the graph, and show that this method is an order of magnitude faster than PRA while giving significantly better performance. Additionally, this new model is more expressive than PRA, as it can handle arbitrary features extracted from the subgraphs, instead of only the relation sequences connecting the node pair. The new feature types we experimented with did not generally lead to better predictions, though further feature engineering may yield additional performance improvements.

The second experiment we did with PRA extends recent work that performs knowledge base completion using a large parsed English corpus in conjunction with random walks over a knowledge base (Gardner et al., 2013; Gardner et al., 2014).
This prior work showed significant performance gains when using the corpus along with the knowledge base, and even further gains by using abstract representations of the textual relations extracted from the corpus. In this experiment, we attempt to extend these results to a multilingual setting, with textual relations extracted from 10 different languages. We discuss the challenges that arise when dealing with data in languages for which parsers and entity linkers are not readily available, and show that previous techniques for obtaining abstract relation representations do not work in this setting.

The final experiment takes a step towards a longstanding goal in artificial intelligence research: using a large collection of background knowledge to improve natural language understanding. We present a new technique for incorporating information from a knowledge base into sentence-level prediction tasks, and demonstrate its usefulness in one task in particular: relation extraction. We show that adding PRA-style features generated from Freebase to an off-the-shelf relation extraction model significantly improves its performance. This simple and general technique also outperforms prior work that learns knowledge base embeddings to improve prediction performance on the same task.

In the remainder of this paper, we first give a brief introduction to the path ranking algorithm. Then we discuss each experiment in turn, with each section introducing the new methods, describing related work, and presenting experimental results.

2 The Path Ranking Algorithm

PRA is a two-step process for generating a feature matrix over node pairs in a graph. The first step finds a set of potentially useful path types that connect the node pairs, which become the columns of the feature matrix. The second step then computes the values in the feature matrix by finding random walk probabilities as described below.
Once the feature matrix has been computed, it can be used with whatever classification model is desired (or even incorporated as one of many factors in a structured prediction model, as we discuss in Section 5), though almost all prior work with PRA simply uses logistic regression.

More formally, consider a graph G with nodes N, edges E, and edge labels R, and a set of node pairs (si, ti) ∈ D that are instances of some relationship of interest. PRA will generate a feature vector for each (si, ti) pair, where each feature is some sequence of edges <e1, e2, ..., el>. If the edge sequence corresponding to the feature exists between the source and target nodes in the graph, the value of that feature in the feature vector will be non-zero. Because the feature space considered by PRA is so large (the set of all possible edge label sequences, with cardinality Σ_{i=1}^{l} |R|^i, assuming a bound l on the maximum path length), and computing the feature values so computationally intensive, the first step PRA must perform is feature selection, which is done using random walks over the graph.

In this step of PRA, we find path types π that are likely to be useful in predicting new instances of the relation represented by the input node pairs. These path types are found by performing random walks on the graph G starting at the source and target nodes in D and recording which paths connect some source node with its target.[1] Note that these are two-sided, unconstrained random walks: the walks from sources and targets can be joined on intermediate nodes to get a larger set of potential paths that connect the source and target nodes. Once connectivity statistics have been computed in this way, k path types are selected as features. Lao et al. (2011) use measures of the precision and recall of each feature in this selection, while Gardner et al. (2014) simply pick those most frequently seen.
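To make the two-sided walks concrete, this first step can be sketched as follows. This is our own illustrative code, not the authors' implementation: the adjacency-list graph format, all function names, and the "_inv" convention for marking reversed target-side edges are assumptions.

```python
import random
from collections import Counter

# Hypothetical sketch of PRA's first step.  `graph` maps each node to a
# list of (edge label, neighbor) pairs; all names here are ours.
def random_walk(graph, start, max_len, rng):
    """Take one unconstrained walk from `start`; return the sequence of
    edge labels followed and the node where the walk ended."""
    path, node = [], start
    for _ in range(rng.randint(1, max_len)):
        if not graph.get(node):
            break  # dead end
        label, node = rng.choice(graph[node])
        path.append(label)
    return tuple(path), node

def find_path_types(graph, pairs, k, num_walks=100, max_len=2, seed=0):
    """Walk from both sources and targets, join walks that meet on a
    shared intermediate node, and keep the k most frequent path types."""
    rng = random.Random(seed)
    counts = Counter()
    for s, t in pairs:
        from_s, from_t = {}, {}
        for _ in range(num_walks):
            p, n = random_walk(graph, s, max_len, rng)
            from_s.setdefault(n, set()).add(p)
            p, n = random_walk(graph, t, max_len, rng)
            from_t.setdefault(n, set()).add(p)
        # join the two sides; the target-side path is reversed, with each
        # of its edges marked as inverted
        for n in from_s.keys() & from_t.keys():
            for ps in from_s[n]:
                for pt in from_t[n]:
                    joined = ps + tuple(l + "_inv" for l in reversed(pt))
                    if joined:
                        counts[joined] += 1
    return [p for p, _ in counts.most_common(k)]
```

On a toy graph containing pittsburgh -cityInState-> pennsylvania -stateInCountry-> usa, walks from the pair (pittsburgh, usa) recover the path type <cityInState, stateInCountry>.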
Once a set of path features has been selected, the second step of PRA is to compute values for each cell in the feature matrix. Recall that rows in this matrix correspond to node pairs, and the columns correspond to the path types found in the first step. The value computed by PRA is the probability of arriving at the target node of a node pair, given that a random walk began at the source node and was constrained to follow the path type: p(t|s, π). Computing this probability takes time proportional to the average out-degree of each node raised to the power of the path length, and can be very expensive. Lao (2012) showed that dynamic programming can be used to make this computation somewhat faster when the target nodes of a query are known, but it still can take a significant amount of time.

As mentioned above, once the feature matrix has been computed in the second step of PRA, one can use any kind of classifier desired to learn a model and make predictions on test data. One important practical issue for most uses of PRA is the selection of negative examples for training a model. Typically a knowledge base only contains positive examples of a relation, and it is not clear a priori what the best method is for obtaining negative evidence. Prior work with PRA makes a closed world assumption, treating any (s, t) pair not seen in the knowledge base as a negative example. Negative instances are selected when performing the second step of PRA: if a random walk from a source ends at a target that is not a known correct target for that source, that source-target pair is used as a negative example.

[1] A deterministic algorithm, such as a breadth-first search, could obviously be used here instead of random walks. Random walks have been preferred over a BFS in prior work because they allow for a bound on the computation required that is independent of the size of the graph, which may be too large or too densely connected to allow for an efficient BFS.
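The feature computation in this second step can be sketched as a sampling estimate of p(t|s, π), where walks that cannot follow the next required edge label are rejected. This is an illustrative sketch with hypothetical names, not the dynamic-programming computation of Lao (2012); the adjacency-list graph format is the same assumption as elsewhere in our examples.

```python
import random

# Hypothetical sketch of PRA's second step for one source node and one
# path type.  `graph` maps node -> list of (edge label, neighbor) pairs.
def path_probability(graph, source, path_type, num_walks=1000, seed=0):
    """Estimate p(t | s, pi) for every target t: the fraction of walks
    from `source` that follow the edge labels of `path_type` in order
    and end at t.  A walk with no edge of the required label is rejected."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_walks):
        node = source
        for label in path_type:
            nxt = [n for (l, n) in graph.get(node, []) if l == label]
            if not nxt:
                node = None  # walk cannot continue: reject this sample
                break
            node = rng.choice(nxt)
        if node is not None:
            counts[node] = counts.get(node, 0) + 1
    return {t: c / num_walks for t, c in counts.items()}
```

When the graph branches, the returned dictionary spreads probability mass over all reachable targets; when no walk can complete the path, it is empty, which is exactly the inefficiency with sparse path types discussed in Section 3.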
  Dataset    Method          MAP
  Freebase   Probabilities   .337
  Freebase   Binarized       .344
  NELL       Probabilities   .303
  NELL       Binarized       .319

Table 1: Using binary feature values instead of random walk probabilities gives statistically indistinguishable performance. The p-value on the Freebase data is .55, while it is .25 for NELL.

3 More Efficient and Expressive Link Prediction

Our first experiment with PRA attempts to mitigate some of the computational issues discussed in Section 2. The second step of PRA, where feature values are calculated using path-constrained random walks, requires a lot of computation. In this project we consider whether this computational effort is well spent, or whether we might more profitably spend computation in other ways. We propose a new way of generating feature matrices over node pairs in a graph that uses an order of magnitude less time than PRA while achieving significantly better performance.

We first motivate this technique with three observations. First, it appears that binarizing the feature matrix produced by PRA, removing most of the information gained in PRA's second step, has no significant impact on prediction performance in knowledge base completion tasks. We show this on the NELL KB and the Freebase KB in Table 1. The fact that random walk probabilities carry no additional information for this task over binary features is a surprising result, and it shows that the second step of PRA spends a lot of its computation for no discernible gain in performance. Second, Neelakantan (2015) presented experiments showing a substantial increase in performance from using a much larger set of features in a PRA model. All of their experiments used binary features, so this is not a direct comparison of random walk probabilities versus binarized features, but it shows that increasing the feature set beyond the size that is computationally feasible with random walk probabilities seems useful.
Additionally, they showed that using path bigram features, where each sequential pair of edge types in each path was added as an additional feature to the model, gave a significant increase in performance. These kinds of features are not representable in the traditional formulation of PRA. Lastly, the method used to compute the random walk probabilities (rejection sampling) makes the inclusion of more expressive features problematic. Consider the path bigrams mentioned above; one could conceivably compute a probability for a path type that only specifies that the last edge type in the path must be r, but it would be incredibly inefficient with rejection sampling, as most of the samples would end up rejected (leaving aside the additional issues of an unspecified path length). In contrast, if the features simply signify whether a particular path type exists in the graph, without any associated probability, these kinds of features are very easy to compute.

Given this motivation, our project attempts to improve both the efficiency and the expressivity of PRA by removing the second step of the algorithm. Efficiency is improved because the second step is the most computationally expensive, and expressivity is improved by allowing features that cannot be reasonably computed with rejection sampling. We show experimentally that the techniques we introduce do indeed improve performance quite substantially.

3.1 Removing the second step of PRA

In this section we discuss how we construct feature matrices using only the first step of PRA. As outlined in Section 2, the first step does a series of random walks from each source and target node (si, ti) in a dataset D. In the original PRA algorithm these random walks are used to find a relatively small set of potentially useful path types, for which more specific random walk probabilities are then computed. In our modification to the algorithm we stop after this first set of random walks and instead construct a binary feature matrix.
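A minimal sketch of this modification follows. This is our own illustrative code with hypothetical names, not the authors' implementation; the merge of source-side and target-side walks described below is what produces the PRA-style path features, and "_inv" marking a reversed edge is an assumed convention.

```python
import random
from collections import defaultdict

# Hypothetical sketch of the one-step, binary-feature variant.  `graph`
# maps each node to a list of (edge label, neighbor) pairs.
def node_subgraph(graph, node, num_walks=200, max_len=2, seed=0):
    """Characterize the subgraph around `node` as the set of
    (path type, end node) pairs seen during unconstrained random walks."""
    rng = random.Random(seed)
    pairs = set()
    for _ in range(num_walks):
        path, cur = [], node
        for _ in range(max_len):
            if not graph.get(cur):
                break
            label, cur = rng.choice(graph[cur])
            path.append(label)
            pairs.add((tuple(path), cur))  # keep every prefix of the walk
    return pairs

def binary_features(graph, source, target, **kw):
    """Binary feature set for (source, target): merge the two subgraphs
    on shared end nodes, reversing the target-side path types."""
    gs = node_subgraph(graph, source, **kw)
    gt = node_subgraph(graph, target, **kw)
    by_end_s, by_end_t = defaultdict(set), defaultdict(set)
    for p, n in gs:
        by_end_s[n].add(p)
    for p, n in gt:
        by_end_t[n].add(p)
    features = set()
    for n in by_end_s.keys() & by_end_t.keys():
        for ps in by_end_s[n]:
            for pt in by_end_t[n]:
                features.add(ps + tuple(l + "_inv" for l in reversed(pt)))
    # walks that reach the other node directly need no combination
    features.update(p for p, n in gs if n == target)
    features.update(tuple(l + "_inv" for l in reversed(p))
                    for p, n in gt if n == source)
    return features
```

The resulting set of path types is used directly as a binary feature vector for the pair, with no probability computation at all.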
More formally, for each node s in the data (where s could be either a source node or a target node), we construct a subgraph centered around that node. Each random walk that leaves s follows some path type π and ends at some intermediate node i. We keep all of these (π, i) pairs as the characterization of the subgraph around s, and we will refer to this subgraph as Gs. To construct a feature vector for a source-target pair (si, ti), we take the subgraphs Gsi and Gti and merge them on the intermediate nodes i. That is, if an intermediate node i is present in both Gsi and Gti, we take the path types π corresponding to i and combine them (reversing the path type coming from the target node ti). If some intermediate node for the source si happens to be ti, no combination of path types is necessary (and similarly if an intermediate node for the target ti is si; the path only needs to be reversed in this case). This creates a feature space that is exactly the same as the one constructed by PRA: sequences of edge types that connect a source node to a target node. To construct the feature vector we just take all of these combined path types as binary features for (si, ti). Note, however, that we need not restrict ourselves to only using the same feature space as PRA; Section 3.2 will examine extracting more expressive features from these subgraphs.

  Method   MAP     Ave. Features
  PRA      .3704   835
  SFE      .4007   8275

Table 2: Comparison of PRA and SFE on 10 NELL relations. The difference shown is not statistically significant.

One other important issue is that of generating negative examples. PRA uses the second step of the algorithm to find negative examples for use during training. We can no longer do this, as our method only generates features for given node pairs, and does not do a search from source nodes to target nodes. Instead, we devised a technique to find negative examples from a
graph given a set of positive examples, and we used this to obtain the training and testing data used in the experiments below. Our technique takes each source and target node in the given positive examples and finds other nodes of the same category that are close in terms of personalized PageRank (PPR). We then sample new (source, target) pairs from these lists of similar nodes, weighted by their PPR scores (while also allowing the original source and target to be sampled). These become our negative examples, both at training and at testing time.

This method for generating a feature matrix over node pairs in a graph is much simpler and less computationally expensive than PRA, and, if the results in Table 1 are to be believed, should perform on par with PRA at drastically reduced computational cost. Or, we could increase the computation a little and get a much larger feature space, presumably outperforming PRA. Some experimentation shows that it is not that simple. Table 2 shows a comparison between PRA and this subgraph feature extraction technique (labeled SFE) on 10 NELL relations. SFE has a higher mean average precision, but the difference is not statistically significant. There is a large variance in the performance of SFE, and on some relations PRA performs better.

We examined the feature matrices computed by these methods and discovered that the reason for the inconsistency of SFE's improvement is that its random walks are all unconstrained. Consider the case of a node with a very high degree, say 1000. If we only do 200 random walks from this node, we cannot possibly get a complete characterization of the graph even one step away from the node. If a particularly informative path is <cityInState, stateInCountry>, and both the city from which a random walk starts and the intermediate state node have very high degree, the probability of actually finding this path type using unconstrained random walks is quite low.
This is the benefit gained by the path-constrained random walks performed by PRA; PRA leverages training instances with relatively low degree and aggregation across a large number of instances to find path types that are potentially useful. Once they are found, significant computational effort goes into discovering whether each path type exists for all (s, t) pairs. It is this computational effort that allows the path type <cityInState, stateInCountry> to have a non-zero value even for very highly connected nodes.

How do we mitigate this issue, so that SFE can consistently find these path types? It seems the only option, without resorting to a two-step process similar to what PRA uses, is to do a more exhaustive search. PRA uses random walks to improve scalability on very large graphs, particularly because the second step of the algorithm is so expensive. However, if we are only doing a single search, and the graph fits in memory, a few steps of a breadth-first search (BFS) per node is not infeasible. We can make the BFS more tractable by excluding edge types whose fan-out is too high. For example, at a category node in Freebase, there could be thousands of edges of type /type/object/type; if there are a large number of edges of the same type leaving a node, we do not include those edges in the BFS.

  Method    MAP     Ave. Features   Time
  PRA       .3704   835             44 min.
  SFE-RW    .4007   8275            6 min.
  SFE-BFS   .480    237853          5 min.

Table 3: Comparison of PRA and SFE on 10 NELL relations. SFE-RW is not statistically better than PRA, but SFE-BFS is (p < 0.05).
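The fan-out exclusion just described can be sketched as follows. This is our own illustrative code, not the authors' implementation; the adjacency-list graph format, the function names, and the threshold value are all assumptions.

```python
from collections import deque

# Hypothetical sketch of the BFS subgraph construction.  `graph` maps
# each node to a list of (edge label, neighbor) pairs.
def bfs_subgraph(graph, start, max_steps=2, max_fan_out=100):
    """Collect (path type, node) pairs by breadth-first search from
    `start`, skipping any edge type whose fan-out at a node exceeds
    `max_fan_out` (the threshold here is arbitrary)."""
    pairs = set()
    frontier = deque([((), start)])
    for _ in range(max_steps):
        next_frontier = deque()
        for path, node in frontier:
            # group outgoing edges by label so that high fan-out labels
            # can be skipped as a group
            by_label = {}
            for label, nbr in graph.get(node, []):
                by_label.setdefault(label, []).append(nbr)
            for label, nbrs in by_label.items():
                if len(nbrs) > max_fan_out:
                    continue  # too many edges of this type: do not expand them
                for nbr in nbrs:
                    pair = (path + (label,), nbr)
                    if pair not in pairs:
                        pairs.add(pair)
                        next_frontier.append(pair)
        frontier = next_frontier
    return pairs
```

A node reached through a low fan-out edge type is still expanded normally, so skipping one edge type at one node does not prevent paths from passing through that node via other edges.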
Note that because the category node will still be counted as an intermediate node in the subgraph, we can still find paths that go through that node; we just do not continue searching if the out-degree of a particular edge type is too high.

When using a BFS instead of random walks to obtain the subgraphs Gsi and Gti for each node pair, we saw a dramatic increase in the number of features found and a substantial increase in performance, while also achieving a slight decrease in running time.[2] These results are shown in Table 3; SFE-RW is our SFE implementation using random walks, and SFE-BFS uses a breadth-first search. The running time decrease for SFE-BFS was due to the fact that we were using GraphChi for SFE-RW, which does a number of scans through the whole graph, which is stored on disk. SFE-BFS loads the graph into memory, instead of loading it from disk several times, and then only processes nodes as necessary in the BFS.

[2] These results were using L1 and L2 parameters tuned for PRA on a separate but related dataset. When the parameters are tuned for SFE, performance increases further, as we show below.

3.2 Adding more expressive features

Having shown that doing subgraph feature extraction with binary features significantly outperforms computing random walk probabilities with PRA, we now look into using a richer set of features than those allowed by the rejection sampling procedure used by PRA. In addition to using path types connecting node pairs as features, we investigated edge sequences that do not necessarily start and end at the nodes of interest. Such path types may capture useful information about either the source or target node individually, implicitly pointing to the existence of the relation that we are trying to predict. We also consider paths that are similar to PRA-generated path types and may potentially exist between the node pair. In this section, we discuss the new feature types that were added to the PRA implementation, and the effect they had on performance on top of subgraph feature extraction.

Figure 1: One-sided paths: 'cityInState' and 'flows in' are one-sided paths starting at the source and target, respectively, that help detect the relation 'overlooks'.

3.2.1 One-sided paths

A one-sided path is a sequence of edges that starts at a source or target node, but terminates at neither of the two.
Following the notation introduced in Section 3.1, we use as features each (π, i) pair in the subgraph characterizations Gs and Gt, along with whether the feature came from the source node or the target node. The motivation for these one-sided path types is to better model which sources and targets are good candidates for participating in a particular relation (because, e.g., not all cities participate in the relation cityCapitalOfCountry, even though the domain of the relation is all cities).

3.2.2 Comparisons of one-sided paths

Comparisons of the intermediate nodes at which one-sided paths terminate may yield useful insights about the existence of a relation between the node pair. Meaningful comparisons can be made when the intermediate nodes are categorical or numerical in nature. Drawing again on the notation from Section 3.1, we will describe these features as analogous to the pairwise PRA features. To get the PRA features, we intersect the intermediate nodes i from the subgraphs Gs and Gt, and combine the path types π when we find common intermediate nodes. To get these comparison features, we instead intersect the subgraphs on the path types, and combine the intermediate nodes when there are common path types. That is, if we see a common path type, such as /people/person/religion, we will construct a feature representing a comparison between the intermediate node for the source and the target. If the values are the same, this information can be captured with a PRA feature, but it cannot be easily captured by PRA when the values are different.

  Feature Types                   MAP
  PRA                             .375
  PRA + One-sided                 .425
  PRA + One-sided + Comparisons   .4002

Table 4: Feature ablation study using a random walk implementation of SFE.

Figure 2: Comparisons of paths: nodes 10 and 73 at the end of the one-sided paths 'age' point negatively to the relation 'sibling of'.
Nodes 'RUSSIA' at the end of the path 'nationality' count favourably towards the existence of the relation 'speaks same language as'.

3.2.3 Vector similarity paths

Gardner et al. (2014) introduced a modification of PRA's random walks to incorporate vector space similarity between the relations in the graph. On the data they were using, a graph that combined a formal knowledge base with textual relations extracted from text, they found that this technique gave a substantial performance improvement. The vector space random walks only affected the second step of PRA, however, and we have removed that step in SFE. While it is not as conceptually clean as the vector space random walks, we can obtain a similar effect with a simple feature transformation using the vectors for each relation. We obtain vector representations of relations through factorization of the knowledge base tensor, as Gardner et al. did, and exploit the vector space similarities to obtain paths that are close to the original PRA- or SFE-generated paths. We also introduce a special "any edge" symbol, and say that all other edge types are similar to this edge type.

Figure 3: Vector similarity paths: the path '- rival of - builds product -' is similar in vector space to the original path type '- competes with - economic sector -'.

3.2.4 Experiments

Here we present the results of some experiments using these new features. For these experiments, we used 10 relations from NELL (the same 10 used by Gardner et al. (2014)). We first experimented with adding the features to our random walk implementation of SFE, and found that the features seemed to improve performance. However, because there is so much variance in the random walk SFE algorithm, the differences shown in Table 4 are not statistically significant.

  Feature Types          MAP     Ave. Features
  PRA                    .5000   238k
  + One-sided            .4908   1,261k
  + Comparisons          .4675   554k
  + One-sided + Comps.   .4901   1,578k
  + Vector similarity    .5942   3,954k

Table 5: Feature ablation study using a BFS implementation of SFE. All rows use PRA features.
Seeing this result is what initially drove us to implement SFE with a breadth-first search. We also present in Table 5 the result of a similar ablation study with the BFS implementation of SFE; there, PRA + vector similarity is statistically better than all other methods (highest p-value is 0.033), and none of the other differences are significant. While the one-sided and categorical comparison features seemed to help a little in the random walk implementation, they do not improve performance with a BFS implementation. The vector similarity features do, however, significantly improve performance, giving a final MAP score of .594.

Finally, we present a comparison between PRA, PRA with vector space random walks, and our BFS SFE implementation. This is shown in Table 6. SFE significantly outperforms PRA, and does so in less time.[3]

[3] In the final line, SFE with both PRA and vector similarity features, almost all of the time was spent in MALLET's logistic regression optimizer, which is single-threaded and did not scale well to the greatly expanded feature size. A more efficient optimizer would drastically cut down the running time of this method, while not affecting the running time of the other methods much, as little of that time was spent in the optimizer.

  Method                      MAP     Time
  PRA                         .3704   44 min
  Vector space PRA            .4601   46 min
  SFE (PRA features)          .5000   7 min
  SFE (PRA and VS features)   .5942   54 min

Table 6: Results of the final comparison between PRA and SFE, with and without vector space similarity features. The final line is statistically better than all other methods.

  Method                      MAP
  Logistic regression         .4799
  SVM with quadratic kernel   .3895
  SVM with linear kernel      .3726
  SVM with RBF kernel         .2875

Table 7: Experimenting with the learning algorithm. All methods use a breadth-first search SFE to generate the same feature matrix.
Logistic regression is statistically better than all other methods, with p < 0.005.

3.3 Beyond logistic regression

One kind of feature we wanted to add was a conjunction between some of the feature types introduced in the previous section. Lao has seen some improvement from a limited application of these kinds of conjunctions,[4] but finding useful conjunctions can be problematic, especially in our SFE setting where we do not have an explicit feature selection step. However, one should be able to capture something similar to these conjunction features using a kernelized classifier. We used the kernel SVM from the popular libSVM library (Chang and Lin, 2011) for the task of predicting relations, evaluating linear, quadratic, and Gaussian RBF kernels. Table 7 shows the results of these experiments. The quadratic kernel did improve slightly over the linear kernel, though the difference was not quite statistically significant (perhaps testing with a few more relations would have pushed the result to statistical significance). Contrary to our expectation, however, we did not see any improvement from using any kind of SVM over logistic regression. It could be that some kind of ranking loss function is more appropriate for our task, which is evaluated using mean average precision, a metric over ranked lists. It could also be that the ratio of the number of features to the number of training examples was too high for the SVM (there were 240k features per relation on average, and 5k training instances). Further experimentation and analysis is necessary to understand why SVMs did not perform well here.

[4] From personal correspondence.

4 Multilingual PRA

Our second project looks at performing PRA over a knowledge base graph augmented by relations extracted from text in multiple languages. Prior work (Lao et al., 2012; Gardner et al., 2014) established that incorporating dependency-parsed sentences into a knowledge base (KB) results in a stronger connectivity of the graph.
In particular, performing random walks over this augmented KB has shown significant improvement in learning relationships and semantic inference rules. In this experiment, the focus has been on incorporating lexicalized syntactic labels from multiple languages. The underlying idea is to identify the merits of a KB reinforced with Subject-Verb-Object (SVO) triples from a multilingual text corpus, and to test different methods of establishing connections between the languages.

4.1 Data

For our experiment we used the dependency-parsed multilingual Wikipedia corpus generated by Faruqui and Kumar (2015). The corpus consists of SVO triples from 10 languages: French, Russian, Chinese, Arabic, Hindi, Indonesian, Tagalog, Latvian, Swahili and Georgian. Table 8 lists the number of these triples (in thousands) per language. The knowledge base used is a subset of Freebase (Bollacker et al., 2008), extracted with a focus on relations that seemed applicable to the text corpus.

  Language     # Triples
  Arabic       637k
  Chinese      2,722k
  French       6,214k
  Georgian     464k
  Hindi        358k
  Indonesian   1,623k
  Latvian      467k
  Russian      6,893k
  Swahili      107k
  Tagalog      97k

Table 8: Number of SVO triples per language in the dataset.

4.2 Experiment Setup

The text corpus, though dependency parsed, was not in a format that could be fed directly to PRA. The major challenges outlined here deal with the transformation of this data and the generation of suitable mappings to the KB. The main tasks of the data-preprocessing pipeline are alias mapping and the generation of translation tables.

4.2.1 Alias Mapping

The first step in the data processing pipeline was to extract the SVO triples from the text corpus, followed by identifying ALIAS relations with the KB. The initial method for generating these relations was to look for mappings of KB entities to any unigram obtained from the subject or object of each triple. This quickly produced a large number of noisy mappings. Additionally, unigrams such as "Ben" and "Affleck" would be mapped separately.
While, in this case, we were referring to the actor, "Ben" could also be mapped to other, unrelated entities. To refine the ALIAS relation we looked into bigrams and trigrams generated from the SVO triples and their mappings to the KB. One of the challenges faced here was the computational power required: this part of the pipeline took around 9 hours on a 24-core machine with 60 GB of RAM. Additionally, the aliases in Freebase are heavily skewed towards Latin scripts. For 5 of the languages under consideration (Chinese, Georgian, Hindi, Arabic and Russian) there were few direct links to KB entities, which increased the difficulty of the actual experiments.

  /location/location/contains
  /people/person/nationality
  /music/album/artist
  /film/film/country
  /location/administrative_division/country
  /book/written_work/subjects
  /tv/tv_program/country_of_origin
  /biology/organism_classification/higher_classification
  /language/human_language/main_country
  /language/human_language/region
  /people/ethnicity/languages_spoken

Table 9: Relations used in our experiments.

4.2.2 Translation Tables

The problems mentioned in the previous section prompted us to look at smarter methods for enhancing the connections between the multilingual corpora. To this end, the method we explored in this experiment was to incorporate translation relations. One such method is implicitly provided by Freebase: entities in different languages referring to the same object get mapped to a single unique key. However, this is mostly limited to languages with Latin scripts, where the spelling of a proper noun does not change; the coverage for the other languages is much smaller. The alternative is to introduce external translation tables. For this purpose, we obtained parallel text from Tiedemann (2012) and used the fast_align software of Dyer et al. (2013) to generate word alignments.
Due to the diversity of the languages, their scripts, and their structures, the generated alignments had some noise. For each word, we took the top three aligned translations.

4.2.3 Relationship Selection

The alias mapping identified multiple Freebase relationships present in the text corpora. For this experiment we selected the 11 relationships listed in Table 9. Relations that had more than 2000 instances associated with them were undersampled. On average each relation had 1240 instances, which were split 80%/20% into training and test sets.

4.3 Experiments

To conduct experiments, we used graphs constructed with a combination of edges from the Freebase KB and edges from the extracted SVO triples. In all cases where we used SVO triples, all of the triples from all languages were added to the graph. We used three different ways to represent these triples, however. The first was to use the original surface relation as present in the data we obtained from Faruqui and Kumar (2015). The second was to use a factorization of the graph to obtain vector representations of these relations and use the vector space random walks presented by Gardner et al. (2014). Lastly, we tried replacing all of the surface relations with a single placeholder relation indicating that there was some connection between the noun phrases in the SVO data.

  Experiment                    MAP      MRR
  KB                            0.4665   0.6664
  SVOs                          0.2562   0.7424
  SVOs (vector space)           0.3453   0.7652
  SVOs (single rel.)            0.3747   0.8667
  KB + SVOs                     0.4987   0.7094
  KB + SVOs + translations      0.4922   0.6024
  KB + SVOs (vector space)      0.5470   0.7094
  KB + SVOs (single relation)   0.5603   0.7538

Table 10: Results for experiment 1, mixed language training data. Each experiment uses the same training and test data on graphs constructed with the relation sets indicated. No method is statistically better than all other methods.

There were two sets of experiments that we conducted.
In the first case, we took all of the relation instances found from all languages and randomly split them into training and test sets. In the second case, we used the relation instances found from 9 of the language pairs as training data and tried to predict the instances in the 10th language (French). To evaluate our results, we considered two metrics: mean average precision (MAP) and mean reciprocal rank (MRR). MAP is equivalent to the area under the precision-recall curve. MRR we compute as the precision of the top prediction for each relation, averaged over relations.

4.3.1 Mixed Language Training Data

In this set of experiments we aggregated the instances from all 10 languages and randomly split them into training and test sets. We then trained models and scored the test data on several different graphs with various combinations of KB and SVO edges. The results are shown in Table 10. While there are differences in the scores on these graphs, performance varied enough between relations that almost none of the differences are statistically significant, and it is hard to draw many conclusions from this table alone (thus nothing is bolded in the table, as no method is statistically better than all of the others). However, with the caveat that these results do not achieve statistical significance, it appears that just using the KB to infer new edges is better than using the extracted SVO data. This seems to indicate that there is a great deal of noise in our method for obtaining relation instances from the SVO data; had our method for alignment been cleaner, we should have been able to construct better models from the data.

Experiment                    MAP     MRR
KB                            0.2555  0.6535
SVOs                          0.0543  0.4727
SVOs (vector space)           0.1835  0.9099
SVOs (single rel.)            0.1877  0.7508
KB + SVOs                     0.2688  0.6455
KB + SVOs (vector space)      0.3168  0.7744
KB + SVOs (single rel.)       0.2918  0.6883

Table 11: Results from experiment 2, cross lingual corpus
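The two evaluation metrics can be sketched as follows, assuming each relation's test predictions are sorted by model score, with 1 marking a correct prediction:

```python
def average_precision(ranked_labels):
    """AP for one relation: mean of precision@k over the positions k
    at which a correct prediction appears."""
    hits, total = 0, 0.0
    for k, correct in enumerate(ranked_labels, start=1):
        if correct:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

def map_and_mrr(per_relation_rankings):
    """MAP averages AP over relations; MRR here averages the
    correctness of the top-ranked prediction per relation, matching
    the definition used in the experiments above."""
    aps = [average_precision(r) for r in per_relation_rankings]
    top = [1.0 if r and r[0] else 0.0 for r in per_relation_rankings]
    n = len(per_relation_rankings)
    return sum(aps) / n, sum(top) / n

rankings = [[1, 0, 1, 0], [0, 1, 1]]  # two toy relations
print(map_and_mrr(rankings))
```

Averaging AP per relation in this way is what makes relation-level variance dominate the significance tests discussed above.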
The table also shows that using vector space representations of SVO data is better than not using them, and that using a single placeholder relation instead of the actual surface text does better still (in the no-KB case, those two differences are nearly the only statistically significant ones in the table). From this we can see that the mere fact that two entities of particular types are mentioned together in text is a fairly strong indication that those entities participate in a particular relationship, at least for the relations we tested. We can also conclude that tensor factorization for learning vector space representations does not work very well in this setting, as we could do better simply by replacing all of the relation labels with a single dummy relation. We should also mention that adding the translation edges did not improve performance; in fact, no translation edges were found in any of the random walks for any of the relations. We believe this is due to a mismatch between the corpora used for obtaining the translation edges and the multilingual SVO corpora we used to obtain training data: most of the aliases we found were proper nouns, while the translation edges consisted mostly of common nouns.

4.3.2 Cross Language Training Data

For our second set of experiments, we held out the relation instances obtained from the French SVO data from the training set (Table 11), and then tried to predict these instances with models learned on the other languages. We used French as our held-out language because it had the highest number of relation instances, giving us a reasonably sized test set. Here we see results that are broadly similar to what we saw in the mixed language experiment.
The difference in MAP in the KB setting allows for some calibration of the difficulty of the two test sets: the fact that MAP is lower even when only using the KB means that this is a harder test set. Additionally, we see that using the SVO triples themselves resulted in very poor performance, which should be expected due to the language difference between training and test time. However, using vector space or single-relation representations allowed us to make much better predictions (though still not as good as in the mixed language setting). From this we can see that we do get some limited amount of cross-lingual transfer in this setting, but it is quite shallow and leaves a lot of room for improvement.

4.4 Future Work

We faced several challenges in performing knowledge base completion in multiple languages. One of the primary challenges is to obtain better translation relations. We saw that the potential benefit of the translation tables was not apparent in our experiment due to a mismatch in the data. Going forward, we intend to improve this relation. One method we are considering is to use a common source for both the text corpora and the parallel texts. We also intend to carry out more experiments to determine whether the vector embeddings in the cross-lingual experiment learned anything, or whether they merely reflected some potential SVO connection between two instances. Finally, we plan to study how large the cross-lingual lift can be. For this purpose we will conduct another set of experiments with additional baselines: one where only the SVO triples from the testing language are used to form the knowledge graph, and one where the KB is augmented with the testing language.

5 Adding KB features to sentence-level models

A long-standing focus of Artificial Intelligence has been to make machines understand and comprehend text as humans do.
This has spurred active research in the field of Information Extraction for better document understanding. The availability of datasets like Freebase and Wikipedia has facilitated this research on a large scale. Because there are so many different individual prediction tasks (e.g., one per relation in Freebase), the annotation effort required for completely supervised relation extraction techniques is very expensive. This is why "weak" or "distant" supervision seems to be a very promising technique for learning relation extractors from text. In this paper we propose an approach for predicting overlapping relations between two entities by biasing the prediction decision with inferences drawn from a large-scale knowledge repository, Freebase. The method we present is a very generic way of injecting features into any graphical model representing pairwise factors between any two entities. In this project, we experiment with relation extraction using the MultiR algorithm (Hoffmann et al., 2011) at the sentence level. As per (Hoffmann et al., 2011), it is possible for a pair of entities to have multiple relations between them; for instance, capitalOf(Delhi, India) and locatedIn(Delhi, India). As per (Lao et al., 2011), the entities are also indirectly connected via other entity nodes in a graphical representation of the knowledge base, which can be used to factor weak supervision into the model for an entity pair.

[Figure 4: The MultiR model]

[Figure 5: Comparison between the methods we introduced]

We employ these two facts to learn a joint model with the MultiR algorithm, which follows an online perceptron-style approach to update the parameter vector per sentence mention for each pair of entities.
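The perceptron-style update can be sketched as follows; this is a deliberate simplification of MultiR's actual inference-driven update, and the feature names are hypothetical:

```python
def perceptron_update(weights, predicted_feats, gold_feats, lr=1.0):
    """When the predicted relation for a sentence mention is wrong,
    demote the features that fired for the prediction and promote
    those of the gold relation."""
    for f in predicted_feats:
        weights[f] = weights.get(f, 0.0) - lr
    for f in gold_feats:
        weights[f] = weights.get(f, 0.0) + lr
    return weights

# One mistaken prediction: the wrong relation's features go down,
# the gold relation's features go up.
w = perceptron_update({}, ["dep:appos&rel:containedBy"],
                      ["dep:appos&rel:locatedIn"])
print(w)
```

Because the update fires once per mention, any features attached at the mention level (as in the per-mention variant below) participate in every such correction.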
The dataset that we use was prepared by Sebastian Riedel for his work (Riedel et al., 2010); it contains 52 relations, with 67,946 entity pair-relation tuples in the training set and 96,678 entity pair-relation tuples in the test set.

5.1 Methodology

The graphical notation of both models, in reference to the baseline, is shown in Figure 4. The MultiR model is the baseline model proposed by (Hoffmann et al., 2011). Y corresponds to the relation between an entity pair for a particular sentence mention S. The latent variables z_i are predicted by the model; if a prediction is incorrect, updates are made to the parameter vector. P represents the PRA features, which were injected into the MultiR model in two different manners. Since two entities can be related to each other via multiple relations, Y is a vector of length equal to the number of possible relations in the dataset. If a relation exists between the two entities under consideration, the corresponding bit is activated in the relation vector Y.

5.1.1 Per Entity Pair

In this model, we add features extracted by the Path Ranking Algorithm (Lao et al., 2011), which are direct or indirect paths between any two entities. These are obtained by performing random walks on the Freebase graph. "Per entity pair" means we add PRA features once for an entity pair. Each entity pair contains a set of sentence mentions, with syntactic features (obtained from a dependency parse) per sentence defining the sentential construction. This approach performs better than using only PRA features to train the model (Figure 5), as it incorporates syntactic patterns capturing how language-specific constructs can define a relation between two entities; for instance, "Italy, Europe" describes that Italy is locatedIn Europe.

5.2 Per Mention

In this model, we add PRA features per sentence mention, in contrast to the previous case where we added the features per entity pair.
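The two injection strategies can be sketched as follows; the feature strings are illustrative stand-ins, not the actual feature templates, and the per-entity-pair case is modeled here as one extra pair-level feature set:

```python
def augment_features(mentions, pra_features, per_mention=True):
    """Two ways to inject PRA features for an entity pair into the
    sentence-level feature vectors used by MultiR."""
    if per_mention:
        # Every mention sees syntactic features plus KB evidence, so
        # each perceptron update combines both signals.
        return [m + pra_features for m in mentions]
    # Pair-level injection: the PRA features form one extra feature
    # set for the pair; updates on individual mentions never see them.
    return mentions + [pra_features]

mentions = [["dep:nsubj", "word:capital"], ["dep:appos"]]
pra = ["path:/location/contains^-1"]
print(augment_features(mentions, pra, per_mention=True))
```

The per-mention variant trades a larger (but denser, in terms of feature co-occurrence) training signal for some redundancy across mentions of the same pair.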
This model performs better than the previous one for two reasons. First, it makes the feature space less sparse. Second, it trains a better representation of the model by bringing the semantic and syntactic components of the language together in each parameter update. The precision for PRA per entity pair drops quickly at higher values of recall, which is because we don't encounter PRA features with every parameter update. This makes the denominator grow faster than the numerator in the recall calculation, (True Positives)/(True Positives + False Negatives).

5.3 Related Work

The first attempt toward supervised Information Extraction was made by (Soderland, 1997). However, with the growth in the amount of unannotated data it became very expensive to adopt new supervised learning techniques. (Craven and Kumlien, 1999) introduced a weak or distant supervision technique to match the Yeast Protein Database with abstracts in PubMed using a naive Bayes extractor. The KYLIN system (Wu and Weld, 2007) employed weak supervision on Wikipedia infoboxes to learn relationships between entities by treating infoboxes as the database. (Mintz et al., 2009) learned 100 relational extractors on Wikipedia data with the help of Freebase facts.

[Figure 6: Comparison with prior work]

(Riedel et al., 2010) used multi-instance learning to learn a graphical model under the assumption that there is at least one sentence mention in a corpus that expresses the arguments of a Freebase fact. This helps eliminate the effects of noisily labeled training data. (Hoffmann et al., 2011) proposed a weakly supervised method for relation extraction that used syntactic features such as dependency paths between entity pairs, which vary with their usage in a sentence. (Surdeanu et al., 2012) also proposed a similar model that used textual information for relation extraction.
The state-of-the-art algorithm to date was proposed by (Weston et al., 2013). They first learn low-dimensional vector space representations, or embeddings, for entities and relationships. Then, a function of sentence mentions and the relation between entities is learned based on features in the embedding space. These functions are then projected into the embedding space to predict the relationship between two entities present in a sentence. This model leverages weak supervision from text as well as knowledge base triples. In our approach, not only do we consider sentential features together with the knowledge base, we also bias our relation predictions toward the global knowledge extracted from the knowledge base.

5.4 Results and Conclusion

Figure 6 displays the aggregate precision/recall curves of our approach, MultiR augmented with PRA features, and the existing state-of-the-art approaches HOFFMANN (Hoffmann et al., 2011) and WESTON (Weston et al., 2013). It also shows how both ways of PRA feature augmentation perform in comparison to the baseline. We used the dataset developed by Sebastian Riedel for (Riedel et al., 2010), from which we removed the direct and inverse relation edges of the Freebase graph mentioned in the dataset. There were some problems reproducing the results of the algorithm proposed by Weston et al., so we used a plot digitizer to obtain the results presented in (Weston et al., 2013). First, we trained a model directly with the source code used in (Hoffmann et al., 2011), but we found that, due to the increase in feature space, the model parameters were not generalizing well. Because of noise in the parameter learning caused by the smaller number of training instances, we resorted to regularization. We performed 5-fold cross-validation with an 80%-20% training/testing split to choose the L2 regularization parameter. After regularization we observed an overall performance gain in the aggregate precision-recall curve.
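The regularization-strength search just described can be sketched as a generic k-fold procedure; `train_and_score` is a hypothetical callback standing in for training MultiR with a given L2 penalty and scoring it on the held-out fold:

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle n example indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def choose_l2(candidates, train_and_score, n_examples, k=5):
    """Return the L2 strength with the best mean held-out score
    across k folds (each fold holds out ~1/k of the data, giving the
    80%/20% split described above when k=5)."""
    folds = k_fold_indices(n_examples, k)
    best_lam, best_score = None, float("-inf")
    for lam in candidates:
        scores = []
        for i, test_fold in enumerate(folds):
            train = [j for fi, fold in enumerate(folds) if fi != i
                     for j in fold]
            scores.append(train_and_score(lam, train, test_fold))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_lam, best_score = lam, mean
    return best_lam

# Dummy scorer that peaks at lam = 0.1, just to exercise the search.
print(choose_l2([0.01, 0.1, 1.0],
                lambda lam, tr, te: -abs(lam - 0.1), 100))  # → 0.1
```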
6 Conclusion

We have performed three separate experiments improving the path ranking algorithm in various ways. We showed that subgraph feature extraction gives better performance than PRA in terms of both running time and mean average precision, and allows for the addition of much more expressive features than can be computed by PRA. We explored the challenges involved in using multilingual data to perform knowledge base completion. And we showed that a simple and general technique for incorporating features from knowledge bases into sentence-level prediction models can greatly improve performance over previous relation extraction techniques.

References

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In HLT-NAACL, pages 644–648.

Manaal Faruqui and Shankar Kumar. 2015. Multilingual open relation extraction using cross-lingual projection. In Proceedings of ACL.

Matt Gardner, Partha Talukdar, Bryan Kisiel, and Tom Mitchell. 2013. Improving learning and inference in a large knowledge-base using latent syntactic cues. In Proceedings of EMNLP. Association for Computational Linguistics.

Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, and Tom Mitchell. 2014. Incorporating vector space similarity in random walk inference over knowledge bases. In Proceedings of EMNLP.
Association for Computational Linguistics.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 541–550. Association for Computational Linguistics.

Ni Lao and William W. Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1):53–67.

Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of EMNLP. Association for Computational Linguistics.

Ni Lao, Amarnag Subramanya, Fernando Pereira, and William W. Cohen. 2012. Reading the web with learned syntactic-semantic inference rules. In Proceedings of EMNLP-CoNLL.

Ni Lao. 2012. Efficient Random Walk Inference with Knowledge Bases. Ph.D. thesis, Carnegie Mellon University.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011. Association for Computational Linguistics.

Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. 2015. Compositional vector space models for knowledge base completion. In ACL 2015. Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

Stephen Soderland. 1997. Learning to extract text-based information from the world wide web. 97:251–254.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction.
In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214–2218.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. In Proceedings of EMNLP.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 41–50. ACM.