Welcome to the Home Page of the Automatic Temporal Expression Labeler (ATEL).



This page is maintained by
Kadri Hacioglu (hacioglu@cslr.colorado.edu)


Click here to play with ATEL






Overview of ATEL

Extracting temporal expressions from input text is an important step in several natural language processing tasks, such as information extraction, question answering (QA), and summarization. For example, in summarization, temporal expressions can be used to establish a timeline for all events mentioned in multiple documents, which supports a coherent summary. Recently, there has been growing interest in addressing temporal questions in QA systems. Such systems require a highly accurate temporal expression recognizer or tagger (statistical or rule-based) to handle temporal questions effectively and to achieve high-quality end-to-end performance.

ATEL labels a broad range of temporal mentions in text. It marks information in the source text that states when something happened, how long something lasted, or how often something occurs. These temporal expressions range from explicit references (e.g. June 1, 1995) to implicit references (e.g. last summer), durations (e.g. four years), sets (e.g. every month), and event-anchored expressions (e.g. a year after the earthquake).

For example, given the sentence "That's 30 percent more than the same period a year ago", ATEL yields the following labeled output:

That 's 30 percent more than [TIMEX the same period [TIMEX a year ago ]]

Here TIMEX denotes "time expression" and does not refer to any specific tagging convention. In particular, it is not the TIMEX tag used in the earlier MUC-6/MUC-7 tasks. Instead, the tagging follows the TIMEX2 guidelines (which can be viewed as the second generation of TIMEX), as described in (Ferro et al., 2004), but without any normalization. For more information regarding TIMEX2 click here.
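To make the bracketed output format concrete, the following minimal sketch reads a labeled sentence like the one above and recovers the (possibly nested) time expressions as token spans. It is only an illustration: the function name parse_timex_brackets and the span representation are assumptions made here, not part of ATEL.

```python
from typing import List, Tuple


def parse_timex_brackets(labeled: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return (tokens, spans); each span is (start, end) in token indices, end exclusive."""
    tokens: List[str] = []
    spans: List[Tuple[int, int]] = []
    open_starts: List[int] = []  # stack of start indices for brackets not yet closed
    # Put spaces around every closing bracket so "]]" splits into two pieces.
    for piece in labeled.replace("]", " ] ").split():
        if piece == "[TIMEX":
            open_starts.append(len(tokens))
        elif piece == "]":
            if not open_starts:
                raise ValueError("closing bracket without a matching [TIMEX")
            spans.append((open_starts.pop(), len(tokens)))
        else:
            tokens.append(piece)
    if open_starts:
        raise ValueError("unclosed [TIMEX bracket")
    return tokens, sorted(spans)


example = "That 's 30 percent more than [TIMEX the same period [TIMEX a year ago ]]"
tokens, spans = parse_timex_brackets(example)
for start, end in spans:
    # Prints the outer expression first, then the embedded one.
    print(" ".join(tokens[start:end]))
```

A stack is used because the inner expression "a year ago" is embedded in the outer one, so a flat begin/end scan would not be enough.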

We cast the chunking of text into time expressions as a tagging problem using a bracketed representation at the token level, which takes embedded constructs into account. We adopted a left-to-right, token-by-token, discriminative, and deterministic classification scheme to determine the tag for each token. A number of features are created from a predefined context centered at each token and augmented with decisions from a rule-based time expression tagger and/or a statistical time expression tagger trained on different types of text data, on the assumption that they provide complementary information. We have trained one-versus-all multi-class classifiers using support vector machines.
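The sketch below illustrates one way such a scheme could look. It rests on several assumptions that are not taken from ATEL itself: the tag encoding (a "*" for the token, preceded by one "(" per bracket opened and followed by one ")" per bracket closed), the window and history sizes, the use of earlier tag decisions as features, and the toy rule that stands in for the trained SVM classifiers.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def spans_to_tags(n_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Encode (start, end) spans, end exclusive, as one bracket tag per token."""
    opens = [0] * n_tokens
    closes = [0] * n_tokens
    for start, end in spans:
        opens[start] += 1
        closes[end - 1] += 1
    return ["(" * o + "*" + ")" * c for o, c in zip(opens, closes)]


def tag_left_to_right(
    tokens: Sequence[str],
    classify: Callable[[Dict[str, str]], str],
    window: int = 2,
    history: int = 2,
) -> List[str]:
    """Greedy, deterministic tagging: one decision per token, never revised."""
    tags: List[str] = []
    for i in range(len(tokens)):
        features: Dict[str, str] = {}
        # Context window centered at the current token.
        for offset in range(-window, window + 1):
            j = i + offset
            inside = 0 <= j < len(tokens)
            features[f"tok[{offset}]"] = tokens[j].lower() if inside else "<pad>"
        # Tags already assigned to the left (assumed feature, see lead-in).
        for back in range(1, history + 1):
            features[f"tag[-{back}]"] = tags[i - back] if i - back >= 0 else "<pad>"
        tags.append(classify(features))
    return tags


def toy_classifier(features: Dict[str, str]) -> str:
    """Toy stand-in for the SVMs: open on 'last', close on the following token."""
    if features["tok[0]"] == "last":
        return "(*"
    if features["tag[-1]"] == "(*":
        return "*)"
    return "*"


gold_tags = spans_to_tags(12, [(6, 12), (9, 12)])
predicted = tag_left_to_right(["Sales", "rose", "last", "summer", "."], toy_classifier)
print(gold_tags)
print(predicted)
```

Because each token's tag counts opened and closed brackets separately, embedded expressions that end on the same token (as in "the same period a year ago") can still be represented with a single tag per token.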

We define a number of features for each token. The features fall into two broad classes: lexical and syntactic. The lexical features are the token itself, its lower-case version, its part-of-speech tag, a set of features that indicate specific token patterns (e.g. whether the token is hyphenated, is_XX/XX/XX, etc., where X is a number), and its frequency class (e.g. Rare/Frequent/Unknown) with respect to a lexicon (with counts) created from the training data. The syntactic features are base phrase chunks represented using IOB2 tags, the head words, and the dependency relations between the tokens and their respective heads. We have used a part-of-speech (POS) tagger, trained in-house, to determine the POS tag for each word. This tagger is based on the YamCha SVM toolkit and trained on a relatively large portion of the Penn Treebank. Similarly, the base phrase chunks are obtained using an in-house SVM-based chunker. The dependency features are assembled from the output of Minipar, a rule-based dependency parser. In addition to those features, we have used the decisions from a rule-based time expression tagger and BBN IdentiFinder.
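As an illustration of the lexical features only, the sketch below computes them for a single token. The function names, the regular expression for the XX/XX/XX pattern, and the count threshold that separates Rare from Frequent are all assumptions chosen for this example; the values actually used in ATEL are not given here.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Union

# Assumed reading of the XX/XX/XX pattern: three two-digit groups.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{2}")


def build_lexicon(training_tokens: Iterable[str]) -> Counter:
    """Count lower-cased tokens seen in the training data."""
    return Counter(token.lower() for token in training_tokens)


def frequency_class(token: str, lexicon: Counter, rare_max: int = 1) -> str:
    """Map a token to Unknown/Rare/Frequent; rare_max is a hypothetical threshold."""
    count = lexicon.get(token.lower(), 0)
    if count == 0:
        return "Unknown"
    return "Rare" if count <= rare_max else "Frequent"


def lexical_features(token: str, pos_tag: str, lexicon: Counter) -> Dict[str, Union[str, bool]]:
    """Lexical features for one token; the POS tag comes from an external tagger."""
    return {
        "token": token,
        "lower": token.lower(),
        "pos": pos_tag,
        # A hyphen inside the token, ignoring leading or trailing hyphens.
        "is_hyphenated": "-" in token.strip("-"),
        "is_XX/XX/XX": DATE_PATTERN.fullmatch(token) is not None,
        "frequency": frequency_class(token, lexicon),
    }


lexicon = build_lexicon(["the", "The", "year", "June"])
print(lexical_features("06/01/95", "CD", lexicon))
print(lexical_features("four-year", "JJ", lexicon))
```

The syntactic features (chunk tags, head words, dependency relations) would come from external tools and are therefore left out of this sketch.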

The general architecture of ATEL is shown below:



After SVM classification, we apply a simple post-processing algorithm to restore bracketing consistency that may be violated by tagging errors. The steps the system takes to extract time expressions are summarized below:

Step 1. Sentence segmentation
Step 2. Tokenization
Step 3. Pattern and frequency checking
Step 4. Dependency parsing
Step 5. Third-party time tagging (statistical, HMM)
Step 6. POS tagging
Step 7. Base phrase chunking
Step 8. Third-party time tagging (rule-based)
Step 9. Feature composition
Step 10. Multi-class SVM classification
Step 11. Post-processing
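The post-processing algorithm of Step 11 is not spelled out here, so the following is only one possible sketch of how bracketing consistency could be restored for a sentence. It assumes a hypothetical per-token tag encoding ("*" for the token, with leading "(" for opened brackets and trailing ")" for closed ones) and a simple repair policy: drop closing brackets that have no matching opening bracket, and close any bracket still open at the end of the sentence.

```python
import re
from typing import List, Sequence

# Hypothetical tag shape: zero or more "(", then "*", then zero or more ")".
TAG_RE = re.compile(r"(\(*)\*(\)*)")


def repair_brackets(tags: Sequence[str]) -> List[str]:
    """Return a copy of the tag sequence in which all brackets are balanced."""
    repaired: List[str] = []
    depth = 0  # number of brackets currently open
    for tag in tags:
        match = TAG_RE.fullmatch(tag)
        if match is None:
            raise ValueError(f"unexpected tag: {tag!r}")
        opens = len(match.group(1))
        closes = len(match.group(2))
        depth += opens
        closes = min(closes, depth)  # drop closes that have nothing to match
        depth -= closes
        repaired.append("(" * opens + "*" + ")" * closes)
    if depth > 0:
        # Close whatever is still open on the last token of the sentence.
        repaired[-1] += ")" * depth
    return repaired


print(repair_brackets(["*", "*)", "(*", "*"]))
print(repair_brackets(["(*", "*", "(*", "*))"]))
```

Closing open brackets at the sentence end can produce overly long expressions; an alternative policy under the same assumptions would be to drop the unmatched opening brackets instead.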

For more information see
Hacioglu, Kadri; Chen, Ying; and Douglas, Benjamin. "Automatic Time Expression Labeling for English and Chinese Text", to appear in Proceedings of CICLing-2005, Mexico City, Mexico, Feb. 13-19, 2005.


