A Pattern Matching Approach to Find the IUPAC Names in Chemical Documents


Chemical substances or entities are important terms in chemistry publications and patents. Chemical entities can be represented in several ways, including IUPAC names, trivial names, SMILES, InChI and CAS Registry numbers. Chemical names pose a special challenge in information retrieval because they are typically long, complex expressions that are prone to variation, which in turn may decrease retrieval performance.

The difficulty of obtaining manually annotated data for training named entity recognition (NER) systems has motivated researchers to look for alternative ways of generating annotated data, or of making the best possible use of unlabeled data. Several systems address the recognition of chemical entities with a variety of approaches. In this paper, we present a pattern matching approach to find the IUPAC names in chemical documents.

Alexander Vasserman [11] (2004) identified chemical names in biomedical text using approaches based on substring co-occurrence. In that work, models were built on the difference between the strings that occur in chemical names and the strings that occur in other words. The models were trained on a dictionary of chemical names and on general biomedical text. A new way of interpolating N-grams was introduced that does not require tuning any parameters.
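The idea of comparing substrings of chemical names with substrings of ordinary words can be illustrated with a minimal sketch. It assumes character trigrams and plain add-one smoothing, which is a stand-in and not the parameter-free interpolation introduced in [11]. The word lists, the class `NgramModel` and the function `looks_chemical` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Character n-gram model with add-one smoothing (an assumed stand-in)."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        # One extra slot reserves probability mass for unseen n-grams.
        self.vocab = len(self.counts) + 1

    def log_prob(self, word):
        """Sum of smoothed log-probabilities of the word's n-grams."""
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_ngrams(word, self.n)
        )


def looks_chemical(word, chem_model, text_model):
    """Label a word as chemical if the chemical model scores it higher."""
    return chem_model.log_prob(word) > text_model.log_prob(word)


# Toy training data; a real system would use a chemical dictionary
# and a large sample of general biomedical text.
chemical_names = ["ethanol", "methanol", "propanol", "butanone", "benzene", "toluene"]
general_words = ["patient", "treatment", "study", "result", "clinical", "analysis"]

chem_model = NgramModel(chemical_names)
text_model = NgramModel(general_words)

for word in ["pentanol", "patients"]:
    print(word, looks_chemical(word, chem_model, text_model))
```

With such small training lists the decision rests on a handful of shared trigrams such as "ano" and "nol", so the sketch only shows the mechanism and says nothing about realistic accuracy.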

Zornitsa Kozareva [5] (2006) proposed and implemented a pattern validation search in an unlabeled corpus, through which gazetteer lists were generated automatically. The gazetteers were used as features by a Named Entity Recognition system, and a comparative study showed how much information they contribute to the entity classification process. Andreas Vlachos et al. [3] (2006) demonstrated empirically the efficiency of using automatically created tra...

... middle of paper ...

...

Tim Rocktäschel, Michael Weidlich and Ulf Leser [2] (2012) presented a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. The tool uses a hybrid approach that combines a Conditional Random Field with a dictionary. It achieves an F1 measure of 68.1% on the SCAI corpus, outperforming the OSCAR 4 chemical NER tool.
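For readers unfamiliar with the F1 measure reported above, the following sketch shows how it is commonly computed from a set of predicted mentions and a set of gold-standard mentions, as the harmonic mean of precision and recall. The function name `precision_recall_f1` and the example mentions are hypothetical; exact matching of whole mentions is assumed, and the evaluation protocol used in [2] may differ.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over exact-match mention sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical system output and gold annotations for one document.
predicted_mentions = ["ethanol", "benzene", "water"]
gold_mentions = ["benzene", "water", "toluene", "acetone"]

p, r, f1 = precision_recall_f1(predicted_mentions, gold_mentions)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here two of the three predictions are correct and two of the four gold mentions are found, so precision is 2/3, recall is 1/2 and F1 is 4/7.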

A common problem in chemical NER is the scarcity of annotated corpora for training. In this work, we extract chemical terms from research articles in the Indian Journal of Chemistry (Section B) using pattern matching. The extracted entities are evaluated against the ChEBI dictionary of molecular entities, which follows the nomenclature of the International Union of Pure and Applied Chemistry (IUPAC).
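The two steps described above, pattern-based extraction followed by a dictionary check, can be sketched as follows. The patterns actually used in this work are not shown in this excerpt, so everything here is an assumption made for illustration: the regular expressions, the short stem and suffix lists, the function names, and the tiny `reference_names` set that stands in for ChEBI.

```python
import re

# Candidate tokens: letters and digits plus the punctuation common in systematic names.
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9,()\[\]'-]*[A-Za-z0-9)\]]")

# Assumed cue 1: a locant such as "2-" or "1,3-" followed by a letter or bracket.
LOCANT = re.compile(r"(?:^|[-(\[])\d+(?:,\d+)*-[A-Za-z(\[]")

# Assumed cue 2: a common stem followed later by a common suffix at the end.
STEM_SUFFIX = re.compile(
    r"(?:meth|eth|prop|but|pent|hex|benz|phen)[a-z]*"
    r"(?:ane|ene|yne|ol|one|al|amine|oate|yl)$",
    re.IGNORECASE,
)


def extract_candidates(text):
    """Return tokens that show at least one of the two assumed cues."""
    return [
        token
        for token in TOKEN.findall(text)
        if LOCANT.search(token) or STEM_SUFFIX.search(token)
    ]


def check_against_dictionary(candidates, dictionary):
    """Keep only candidates found in the reference dictionary (case-insensitive)."""
    known = {name.lower() for name in dictionary}
    return [c for c in candidates if c.lower() in known]


# Hypothetical stand-in for the ChEBI name list.
reference_names = {"2-methylpropan-1-ol", "ethanol", "benzene"}

sentence = (
    "Compound 5 was prepared from 2-methylpropan-1-ol and "
    "1,3-dichlorobenzene in ethanol under reflux."
)

candidates = extract_candidates(sentence)
confirmed = check_against_dictionary(candidates, reference_names)
print(candidates)
print(confirmed)
```

Heuristics this small produce false positives (for example "3-fold" matches the locant cue) and miss many valid names, which is one reason a dictionary check on the extracted terms is useful.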
