Distant Supervision: Mike Mintz, Steven Bills, Rion Snow and Dan Jurafsky

In the research paper "Distant supervision for relation extraction without labeled data",

the authors Mike Mintz, Steven Bills, Rion Snow and Dan Jurafsky investigate an alternate paradigm

(called distant supervision) for relation extraction. This algorithm combines the advantages of supervised

information extraction and unsupervised information extraction to achieve greater precision.

Apart from this, they also analyze feature performance for better understanding of the roles of lexical

and syntactic features. Some of the key observations from this research are:

1) A combination of syntactic and lexical features offers a substantial improvement in relation

extraction precision over either of these feature sets on its own.

2) Syntactic features may help tease apart difficult relations. They are more useful in cases where

the individual patterns are particularly ambiguous, and where the two entities are close in the dependency

structure but far apart in the word sequence.

The intuition of distant supervision is that any sentence that contains a pair of entities that participate in a known Freebase relation is likely to express that relation in some way. This intuition

ignores the fact that two entities can have multiple relations between them. This can result in poor

relation extraction precision.

For example, consider the two statements "Rafael Nadal lives in Spain" and "Rafael Nadal likes

playing tennis in Spain". The first sentence expresses the country of residence relation between

Rafael Nadal and Spain. Although the second sentence doesn't express the country of residence

relation, distant supervision wrongly labels it that way nevertheless.
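
The labeling heuristic described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the pipeline used in the paper: the dictionary `KNOWN_RELATIONS` is a hypothetical one-entry stand-in for Freebase, the relation name `country_of_residence` and the helper `distant_label` are made up for this example, and entity matching is done by plain substring search rather than a named-entity tagger.

```python
# Hypothetical one-entry stand-in for Freebase: (entity1, entity2) -> relation.
KNOWN_RELATIONS = {("Rafael Nadal", "Spain"): "country_of_residence"}


def distant_label(sentence, kb):
    """Return every KB relation whose two entities both appear in the sentence.

    Entity matching is naive substring search (an assumption made for brevity).
    """
    labels = []
    for (entity1, entity2), relation in kb.items():
        if entity1 in sentence and entity2 in sentence:
            labels.append((entity1, relation, entity2))
    return labels


sentences = [
    "Rafael Nadal lives in Spain",
    "Rafael Nadal likes playing tennis in Spain",
]

# Both sentences mention the same entity pair, so both receive the same label,
# even though only the first one actually expresses country of residence.
labeled = {s: distant_label(s, KNOWN_RELATIONS) for s in sentences}
for sentence, labels in labeled.items():
    print(sentence, "->", labels)
```

Because the heuristic only checks that both entities co-occur, the second sentence becomes a noisy training example for the country of residence relation.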

This can be avoided by incorporating multiple relations between two ent...

... middle of paper ...

...er ever to play tennis". (Assume that these are the only answer sentences.)

When the model developed in the paper tries to extract an answer from these answer sentences, it

comes across three potential answers - football, golf and tennis.

To resolve such conflicts, the paper resorts to an overproduce-and-vote mechanism. Instead,

I suggest we use the existing Freebase database and look for [Rafael Nadal, Football],

[Rafael Nadal, Golf] and [Rafael Nadal, Tennis] relations in it. This should give us a confirmation

that Rafael Nadal is related to tennis. Hence, tennis can be considered as the answer to our original

question.
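
The lookup I am suggesting could be sketched roughly as follows. This is a toy illustration under stated assumptions, not a real Freebase client: `TOY_KB` is a hypothetical in-memory set of entity pairs, and `filter_candidates` is a made-up helper name. A real system would query the knowledge base itself and would still need a fallback (for example, the voting mechanism) when no candidate, or more than one, is confirmed.

```python
# Hypothetical in-memory stand-in for Freebase: a set of related entity pairs.
TOY_KB = {
    ("rafael nadal", "tennis"),
    ("rafael nadal", "spain"),
}


def filter_candidates(subject, candidates, kb):
    """Keep only the candidate answers that the KB links to the subject.

    Matching is case-insensitive; the KB pairs are assumed to be lowercase.
    """
    subject_key = subject.lower()
    return [c for c in candidates if (subject_key, c.lower()) in kb]


# The three conflicting answers extracted from the answer sentences.
candidates = ["football", "golf", "tennis"]
confirmed = filter_candidates("Rafael Nadal", candidates, TOY_KB)
print(confirmed)
```

With this toy knowledge base, only the [Rafael Nadal, Tennis] pair is found, so tennis is the single confirmed candidate.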

Answer Extraction as a sequence tagging problem is a relatively new area of study in Natural

Language Processing and has much scope for improvement. Subtle changes in the approach can

produce state-of-the-art models.
