From: Phil Burns
Subject: Re: Named Entity Extraction
Date: February 1, 2008
At 09:38 PM 1/23/2008, Martin Mueller wrote:
Loretta and Phil,
Putting together strands of conversation from here and there, it seems to me that you should be in a conversation about the details of named entity extraction. The simplest statement of the relevant facts and questions may go something like this:
1. MorphAdorner does a pretty good job of identifying single word names in texts from various genres and places
2. Modern named entity extraction procedures do not work very well on earlier texts because the assumptions about what multi-word expressions are name phrases are drawn from the contemporary world and don't translate easily backwards.
Modern named entity extraction procedures operate most successfully in limited subject domains, usually non-fiction, such as medicine, business, or newswire texts. A named entity extractor trained in one subject area generally doesn't work well for other subject areas. However, it is usually possible to retrain an extractor with data specific to a new subject area or genre and improve its recognition performance for that genre.
Recognizing names in literature is more difficult than in non-fiction. For example, the name (gazetteer) lists which prove so helpful for matching names in non-fiction are much less helpful for fiction. Place names in fiction may be, well, fictional, and embody structural forms differing considerably from "real-world" names. Assuming one can recognize that Humpty-Dumpty is a name, is it a person name or a place name? What about Numenor? Nyarlathotep? You will be unlikely to get any guidance from your friendly neighborhood gazetteer.
3. Other projects have had limited success in this area. We may want to talk with the Perseus folks at Tufts. They have done a fair amount of fairly sophisticated work with 19th century texts and worry about whether they can resolve the string 'Washington' to the state, the city, or the president.
Steve mentioned that while Tufts' recognizer did a great job on a collection of Civil War texts (Steve can correct me as needed), the recognizer's performance was not so good on other collections. This is typical for named entity extractors.
4. Other things being equal, named entity extraction may be easier if you have a morphosyntactically annotated text in which proper names have been identified with some certainty. (is that a fact?)
Some named entity extractors work with part of speech tagged texts. Others do not. Some use rules (e.g., Gate) in combination with part of speech tagging. Some use statistical models: BBN's commercial IdentiFinder uses a hidden Markov model similar to the one MorphAdorner uses for part of speech tagging. Some systems use machine learning techniques to build classifiers of the "is a name, is not a name" form by looking for surface features.
Given a part-of-speech tagged text in which potential names are marked with reasonable accuracy, you can perform a baseline name extraction by pulling out noun phrases containing at least one potential proper noun. A noun phrase can be defined in a number of different ways. The simplest is to consider the longest series of nouns bracketed by non-nouns as a noun phrase.
As an example, I have attached below the potential names extracted from Austen's "The Watsons" using noun phrases as just defined. There are some obvious errors here. A good named entity extractor should be able to improve on this simple procedure. This procedure works equally well or poorly on early texts. I have also attached the names extracted from one of Kristin's witchcraft texts published in 1673.
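To make the baseline concrete, here is a minimal sketch in Python. The token format (a list of word/part-of-speech pairs) and the tag names ("np" for proper noun, other tags starting with "n" for common nouns) are simplifying assumptions for illustration; MorphAdorner's actual tag set and file format differ.

```python
# Baseline name extraction sketch: collect the longest runs of nouns
# bracketed by non-nouns, and keep the runs containing at least one
# proper noun. POS tags here are illustrative, not MorphAdorner's own.

def extract_candidate_names(tagged_words):
    """tagged_words: list of (word, pos) pairs.
    Returns candidate name phrases as strings."""
    phrases = []
    current = []          # the noun run being collected
    has_proper = False    # does the run contain a proper noun?
    for word, pos in tagged_words:
        if pos.startswith("n"):            # any noun extends the run
            current.append(word)
            if pos == "np":                # proper noun marks a candidate
                has_proper = True
        else:                              # non-noun closes the run
            if current and has_proper:
                phrases.append(" ".join(current))
            current, has_proper = [], False
    if current and has_proper:             # flush a run at end of text
        phrases.append(" ".join(current))
    return phrases

sample = [("Miss", "np"), ("Emma", "np"), ("Watson", "np"),
          ("visited", "vvd"), ("the", "dt"), ("town", "n1"),
          ("of", "pp"), ("Croydon", "np"), (".", ".")]
print(extract_candidate_names(sample))
# -> ['Miss Emma Watson', 'Croydon']
```

Note that "town" is dropped because the run "town" contains no proper noun, while "Miss Emma Watson" survives as a single phrase; this is exactly the behavior that produces the errors visible in the attached lists.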
One possible approach to developing an improved named entity extractor is to transform MorphAdorned files into training files for rule-based or statistical-based named entity extractors. We can look at name extraction as a sequential tagging process, just like part of speech tagging. Thus we should be able to train any part of speech tagger to perform name extraction if we define a proper tag set.
A named entity tag set that has been used extensively employs three types of tags.
B-X: First word of an entity of type X.
I-X: Non-first word of an entity of type X.
O: Word is outside any type of entity.
"X" takes on values such as "PER" for person, "LOC" for location, "ORG" for organization, and so on.
To generate training data, we start by taking an adorned file and adding these entity tags to each word automatically using the simple noun phrase extraction algorithm described above. Then a human performs the arduous task of iteratively correcting this raw training data. Then the usual training procedures for the selected part of speech tagger are used to generate the probability matrices, rules, or classifiers required by the tagger. Running the tagger on a new text produces a tagging for entities which can be back transformed into whatever representation is convenient, e.g., an XML "<rs>" tag.
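The first step of that pipeline can be sketched as follows. This is a hypothetical illustration: it reuses the simple noun-run heuristic above to emit raw BIO labels with a placeholder entity type ("MISC"), which a human annotator would then correct; the POS tags are again illustrative rather than MorphAdorner's.

```python
# Sketch: generate raw BIO training labels from a POS-tagged token
# stream. Every noun run containing a proper noun ("np") is labeled as
# an entity of placeholder type "MISC"; a human corrects types and
# boundaries afterwards.

def bio_labels(tagged_words, entity_type="MISC"):
    labels = ["O"] * len(tagged_words)
    start = None          # index where the current noun run began
    has_proper = False
    def close(end):       # label a finished run if it held a proper noun
        if start is not None and has_proper:
            labels[start] = "B-" + entity_type
            for i in range(start + 1, end):
                labels[i] = "I-" + entity_type
    for i, (word, pos) in enumerate(tagged_words):
        if pos.startswith("n"):
            if start is None:
                start = i
            if pos == "np":
                has_proper = True
        else:
            close(i)
            start, has_proper = None, False
    close(len(tagged_words))
    return list(zip([w for w, _ in tagged_words], labels))

sample = [("Ranulf", "np"), ("went", "vvd"), ("to", "pp"), ("Chester", "np")]
print(bio_labels(sample))
# -> [('Ranulf', 'B-MISC'), ('went', 'O'), ('to', 'O'), ('Chester', 'B-MISC')]
```

Because the output is one label per token, it drops directly into the training formats most sequence taggers expect.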
There are numerous extant named-entity systems which can be retrained. Examples include OpenNLP and LingPipe. A slightly out-of-date list appears at
An important advantage of a hidden Markov model approach (e.g., as used by MorphAdorner or TnT) is that the training time is very short. A minute or two suffices to build the tagging model from a couple of million words of training data. Training time is measured in hours for conditional random fields or maximum entropy models given the same amount of training data. Fast rule-based training methods such as Florian and Ngai's FastTBL are another possibility.
5. There is a question in my mind where named entities are stored, once identified. Should they be tagged in the TEI-A text? P5 has a quite intelligent system for person and place names. Or are there simpler and equally effective ways of keeping the information?
The usual choices are to mark up the entities in the text using standard tags like <rs>, or to create standoff markup. Either provides an adequate interchange format for creating databases or feeding search engines. The single file approach keeps the entity information in the same place as the text containing the entities. The multiple file approach allows for different, possibly overlapping entity definitions. We could combine both approaches.
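As a small sketch of the inline option, the back-transformation from BIO-labeled tokens to <rs> markup mentioned above might look like this. It is deliberately simplified: a real TEI serialization would need XML escaping, whitespace handling, and the full P5 element set.

```python
# Sketch: back-transform BIO-labeled tokens into inline TEI-style <rs>
# markup. Simplified; real TEI output needs proper XML escaping.

def to_inline_rs(labeled):
    """labeled: list of (word, tag) with tags like B-PER / I-PER / O."""
    out, entity, etype = [], [], None
    def flush():
        if entity:
            out.append('<rs type="%s">%s</rs>' % (etype.lower(), " ".join(entity)))
    for word, tag in labeled:
        if tag.startswith("B-"):
            flush()                       # close any open entity
            entity, etype = [word], tag[2:]
        elif tag.startswith("I-") and entity:
            entity.append(word)           # extend the open entity
        else:
            flush()
            entity, etype = [], None
            out.append(word)              # plain word outside entities
    flush()
    return " ".join(out)

tokens = [("Ranulf", "B-PER"), ("Earl", "I-PER"), ("of", "I-PER"),
          ("Chester", "I-PER"), ("held", "O"), ("Chester", "B-LOC")]
print(to_inline_rs(tokens))
# -> <rs type="per">Ranulf Earl of Chester</rs> held <rs type="loc">Chester</rs>
```

The standoff alternative would instead record (start offset, end offset, type) triples in a separate file, leaving the source text untouched.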
Given either type of external representation, the question arises as to the most effective way of storing these for later retrieval and querying. That is probably best addressed at the database level.
6. There is a whole set of questions about referring different spellings of a name to a standardized version of it. Should we try to tackle this at all? And where does it end? Catherine, Katharine, Catharine, Katherine.....
The issue is not so much standardizing the name as recognizing when two or more names refer to the same person or place within a work. This process is called "co-reference resolution." There has been a lot of work done on this. Standardizing the names may help. A related problem is resolving pronoun references to nouns. Doing this well requires recognizing the gender of person names and nouns such as "father" and "mother." Determining the gender of names automatically can be quite difficult.
The baseline algorithm for pronominal coreference resolution is probably that suggested by Shalom Lappin and Herbert J. Leass in 1994.
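As a first step toward grouping spelling variants like Catherine/Katharine/Catharine/Katherine, one can bucket names by a crude normalization key. The rules below (map k to c, simplify "th", unify vowel runs, collapse doubled letters) are purely illustrative; a real system would use a tuned phonetic algorithm such as Soundex or Metaphone, plus edit distance.

```python
# Crude sketch of grouping spelling variants of a name under one key.
# Normalization rules are illustrative only.

import re
from collections import defaultdict

def variant_key(name):
    s = name.lower()
    s = s.replace("k", "c")          # Katharine -> Catharine
    s = s.replace("th", "t")         # Katherine -> Katerine
    s = re.sub(r"[aeiou]+", "a", s)  # unify vowel runs
    s = re.sub(r"(.)\1+", r"\1", s)  # collapse doubled letters
    return s

def group_variants(names):
    groups = defaultdict(list)
    for n in names:
        groups[variant_key(n)].append(n)
    return dict(groups)

names = ["Catherine", "Katharine", "Catharine", "Katherine", "Margaret"]
print(group_variants(names))
# the four Catherine variants share one key; Margaret gets its own
```

Grouping variants this way does not by itself decide whether two mentions refer to the same person; it only narrows the candidate set that a coreference resolver must consider.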
Another difficult problem is relating names across works. Is the Zeus of the Iliad the same as the Zeus in the Odyssey? Is the Falstaff of Henry IV the same as the Falstaff of The Merry Wives of Windsor? These questions go beyond simple matters of information extraction.
Yet another difficult problem is automatically relating words spoken by a named character to that character.
7. Can you consistently distinguish between places and people? Leicester, Worcester, Northumberland, Kent, and all the other earls, thanes, and dukes of English history.
Human readers have trouble determining whether a name refers to a place or a person. In the sentence "Chester provided arms for the mercenaries", does "Chester" refer to the Earl, the county, an organization, or another person named Chester? Even in context it might be impossible to be sure which is the referent. When a person can't decide, it's unlikely a computer can either. Still, certain patterns, as encoded in rules or in statistical discriminators, can distinguish some person names from place names rather reliably. A form like "Mr. Pib" or "Ranulf, Earl of Chester" almost certainly indicates a person. A form like "Cook County, Illinois" almost certainly indicates a location.
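Such surface patterns are easy to prototype as regular-expression rules. The title and suffix lists below are tiny illustrative samples, not a serious rule set, and a bare "Chester" correctly falls through as undecidable.

```python
# Sketch of surface-pattern rules that push a name phrase toward
# "person" or "place". Title and suffix lists are illustrative only.

import re

PERSON_PAT = re.compile(
    r"\b(?:Mr|Mrs|Miss|Sir|Lady|Lord|Earl of|Duke of)\.?\s+[A-Z]")
PLACE_PAT = re.compile(
    r"\b[A-Z][a-z]+\s+(?:County|Shire|River|Street)\b|,\s*[A-Z][a-z]+$")

def guess_type(phrase):
    if PERSON_PAT.search(phrase):   # honorific or title of nobility
        return "person"
    if PLACE_PAT.search(phrase):    # geographic suffix or ", Region"
        return "place"
    return "unknown"

print(guess_type("Ranulf, Earl of Chester"))  # -> person
print(guess_type("Cook County, Illinois"))    # -> place
print(guess_type("Chester"))                  # -> unknown
```

A statistical discriminator would learn comparable cues (and many weaker ones) from labeled data instead of hand-written patterns.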
In one way or another, we'll need to put names on the agenda and figure out what is desirable and possible within the time frame of a year.
In addition to names and places, it may be useful to pull out other types of entities such as time, date, money, and organization.
I modified the default entity extractor in Gate to use different gazetteers and rules to perform name, place, organization, time, date, and money extraction. I wrapped Gate in a program called AdornWithNamedEntities which knows how to combine entities across splits induced by soft tags (which Gate does not know how to do). The "adornwithne" script and batch file in the MorphAdorner snapshot demonstrate the use of AdornWithNamedEntities. The extraction performance is not very good.
The results would be better if the extractor could be modified to use MorphAdorner generated parts of speech rather than the built-in Hepple tagger generated parts of speech. It is likely that multiple different rule sets would be required to get adequate performance for different periods and genres. I don't know that reworking Gate would provide any improvement over training another extractor as I discussed above.
Whatever approach the Monk project decides to pursue for named entity extraction, we are looking at many months of work. This is not an easy task to accomplish.
– Phil "Pib" Burns
Northwestern University, Evanston, IL. USA
There are 162 proper names.
Country Miss Emma
Country Miss Margt
Lady Osborne's Cassino Table
Lane Miss Watson
Ld Osborne's Hounds
LdOsborne Miss Watson
Master Blake Sir
Miss Emma Watson
Mr Sam Watson
Mr Tom Musgrave
Mrs R. W.
Mrs Robert Watson
There are 13 place names.
Witchcraft text from 1673.
There are 47 proper names.
Mr. R•uland Baugh
Sir Samuel Baldwyn
There are 7 place names.