TextJoiner Improvements


Project By: Tejaswinee Sohoni
tejaswineesohoni2015@u.northwestern.edu
EECS 349 - Northwestern University
Professor Doug Downey



Overview

Automatically extracting information from the web is called Web Information Extraction. On-demand Web Information Extraction systems allow users to search the web with textual queries, for instance, “Nobel Laureates from Austria”. This is done by specifying the query as a relation. However, such systems have to make a trade-off between precision and recall. Therefore, researchers in Prof. Downey’s group have proposed a new approach, in which queries are treated as conjunctions and disjunctions of multiple contexts instead of a single context. This offers high precision as well as high recall. The approach has been implemented in a system called TextJoiner.

However, the existing TextJoiner system does not take into account every mention of a particular entity in a piece of text: it misses the mentions where an entity is referred to with pronouns or equivalent words. This makes the task of coreference resolution interesting to the TextJoiner system. Coreference resolution means finding all the expressions that refer to the same entity in a text. If TextJoiner can resolve these references, it can discover more sentences about the entity being queried, and hence extract more information from the text.
In this project, I performed coreference resolution on text with the help of Stanford's NLP library, which has a specific module for coreference resolution. I processed the output to form a dataset, which I then used to train and evaluate 5 different learning algorithms. I compared their accuracy and found the one that worked best at predicting whether the output obtained using the Stanford coref library is accurate or not.

Coreferencing

For coreferencing, I used the Stanford NLP library, which provides a module for coreference resolution. While its usage is not easy to figure out, the module works quite well once it is set up. As input, I used some sentences from a parsed Wikipedia page that is used by the actual TextJoiner system; the article was about Chicago. I used the module to clean the text, tokenize it, and annotate it. The annotations are then used to build a parse tree of each sentence and its corresponding dependency graph. From these, I built a coreferenced graph of chains, where each chain stores the mentions of a particular entity. A minimal sketch of this pipeline appears below.
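
The sketch below shows how such a pipeline can be set up with the standard Stanford CoreNLP dcoref annotators; the input string and the tab-separated output format are illustrative placeholders, not the exact code in my module.

  import java.util.Map;
  import java.util.Properties;

  import edu.stanford.nlp.dcoref.CorefChain;
  import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
  import edu.stanford.nlp.pipeline.Annotation;
  import edu.stanford.nlp.pipeline.StanfordCoreNLP;

  public class CorefDemo {
      public static void main(String[] args) {
          // The dcoref annotator requires the full chain of annotators below.
          Properties props = new Properties();
          props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
          StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

          // Tokenize, parse, and coreference the text in one pass.
          Annotation document = new Annotation(
              "Chicago is the third most populous city in the United States. "
              + "It was incorporated as a city in 1837.");
          pipeline.annotate(document);

          // Each CorefChain groups every mention of one entity.
          Map<Integer, CorefChain> chains =
              document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
          for (CorefChain chain : chains.values()) {
              for (CorefChain.CorefMention m : chain.getMentionsInTextualOrder()) {
                  System.out.println(chain.getChainID() + "\t" + m.sentNum + "\t" + m.mentionSpan);
              }
          }
      }
  }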

Dataset

After further processing the chains, I extracted the individual words or phrases (the mentions) from each chain, along with their corresponding sentence and chain numbers. From this data, I formed every possible pair of phrases and recorded, for each pair, the corresponding sentence and chain numbers, whether the phrases belonged to the same sentence or chain, and, if they belonged to the same chain, the size of that chain. So, essentially, the features of the dataset I formed were the following (a sketch of the pair generation follows the list):

  1. Phrase A
  2. Phrase B
  3. Sentence Number of Phrase A
  4. Sentence Number of Phrase B
  5. Chain Number of Phrase A
  6. Chain Number of Phrase B
  7. Whether A and B belong to the same chain
  8. If A and B belong to the same chain, size of the chain
  9. Result (Whether the Stanford library coreferences the pairs correctly or not)
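
The following sketch shows how such pairs could be generated; the Mention class and the comma-separated output are hypothetical stand-ins for my actual processing code.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  // Hypothetical record for one mention pulled out of a coref chain.
  class Mention {
      String phrase;
      int sentence;
      int chain;
      Mention(String phrase, int sentence, int chain) {
          this.phrase = phrase;
          this.sentence = sentence;
          this.chain = chain;
      }
  }

  public class PairBuilder {
      // Emit one comma-separated feature row per unordered pair of mentions.
      // The hand-labeled Result column (feature 9) is appended afterwards.
      static List<String> buildPairs(List<Mention> mentions, Map<Integer, Integer> chainSizes) {
          List<String> rows = new ArrayList<>();
          for (int i = 0; i < mentions.size(); i++) {
              for (int j = i + 1; j < mentions.size(); j++) {
                  Mention a = mentions.get(i), b = mentions.get(j);
                  boolean sameChain = a.chain == b.chain;
                  int chainSize = sameChain ? chainSizes.get(a.chain) : 0;
                  rows.add(a.phrase + "," + b.phrase + ","
                          + a.sentence + "," + b.sentence + ","
                          + a.chain + "," + b.chain + ","
                          + (sameChain ? "Y" : "N") + "," + chainSize);
              }
          }
          return rows;
      }
  }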


I hand-labeled the Result field. Initially, for a paragraph of about 6 sentences, I got about 20,000 pairs, which was too large a dataset for me to hand-label, so I cut down the text and shortened some sentences. The text I finally used was:

"Chicago is the third most populous city in the United States . Its metropolitan area is sometimes called Chicagoland. Chicago is the seat of Cook County , a small part of the city extends into DuPage County. It was incorporated as a city in 1837. Chicago is listed as an alpha+ global city. The best-known nicknames for the city include Windy City and Second City. The City of Chicago was the fastest growing city in the world for several decades."

With this text I got about 300 pairs, which I hand-labeled with Y if the coreferencing is correct and N if it is not.

Training/Testing

On the labeled dataset, I ran 5 algorithms, all with 10-fold cross validation. They gave different levels of accuracy, as shown below (a sketch of the evaluation setup follows the list).

  1. J48 (Decision Tree)- 90.4762% accuracy
  2. Random Forest - 89.1156% accuracy
  3. IBk (k Nearest Neighbors) - 85.034% accuracy
  4. Naive Bayes - 79.932% accuracy
  5. Multilayer Perceptron - 95.2531% accuracy
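
The classifier names above are those used by the Weka toolkit, so the evaluation presumably looked something like the sketch below; the ARFF filename is a placeholder, and only two of the five classifiers are shown.

  import java.util.Random;

  import weka.classifiers.Classifier;
  import weka.classifiers.Evaluation;
  import weka.classifiers.functions.MultilayerPerceptron;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class EvalDemo {
      public static void main(String[] args) throws Exception {
          // Load the hand-labeled pair dataset (filename is a placeholder).
          Instances data = DataSource.read("coref_pairs.arff");
          data.setClassIndex(data.numAttributes() - 1); // Result is the class attribute

          for (Classifier c : new Classifier[] { new J48(), new MultilayerPerceptron() }) {
              Evaluation eval = new Evaluation(data);
              eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold cross validation
              System.out.printf("%s: %.4f%% accuracy%n",
                      c.getClass().getSimpleName(), eval.pctCorrect());
          }
      }
  }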

From the above data, I think the Multilayer Perceptron, followed by the J48 decision tree, is the best algorithm for predicting whether the coreferences produced by the Stanford library are accurate.

Conclusion

Overall, I think that the Stanford coref library works quite well at finding coreferences in a text, and I believe the module I have created is ready to be used with TextJoiner, so that TextJoiner can coreference text and thereby improve its performance.