Human gene name normalization using text matching with automatically extracted synonym dictionaries

Haw Ren Fang, Kevin Murphy, Yang Jin, Jessica Signoff, Peter S. White

Research output: Contribution to conference (paper)

25 Citations (Scopus)

Abstract

The identification of genes in biomedical text typically consists of two stages: identifying gene mentions and normalization of gene names. We have created an automated process that takes the output of named entity recognition (NER) systems designed to identify genes and normalizes them to standard referents. The system identifies human gene synonyms from online databases to generate an extensive synonym lexicon. The lexicon is then compared to a list of candidate gene mentions using various string transformations that can be applied and chained in a flexible order, followed by exact string matching or approximate string matching. Using a gold standard of MEDLINE abstracts manually tagged and normalized for mentions of human genes, a combined tagging and normalization system achieved 0.669 F-measure (0.718 precision and 0.626 recall) at the mention level, and 0.901 F-measure (0.957 precision and 0.857 recall) at the document level for documents used for tagger training.
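The matching step described in the abstract (chainable string transformations followed by exact or approximate lookup in a synonym lexicon) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the transformation functions, the toy lexicon, the function names, and the 0.85 similarity cutoff are all assumptions made for illustration, and `difflib` from the standard library stands in for whatever approximate matcher the paper used.

```python
import difflib
import re
from typing import Callable, Dict, Iterable, List, Optional

Transform = Callable[[str], str]


def lowercase(s: str) -> str:
    return s.lower()


def strip_punctuation(s: str) -> str:
    # Replace hyphens, slashes, and other punctuation with spaces.
    return re.sub(r"[^\w\s]", " ", s)


def collapse_whitespace(s: str) -> str:
    return " ".join(s.split())


def chain(transforms: Iterable[Transform]) -> Transform:
    """Compose transformations so they run in the order given."""
    steps = list(transforms)

    def apply(s: str) -> str:
        for step in steps:
            s = step(s)
        return s

    return apply


def build_index(lexicon: Dict[str, List[str]], normalize: Transform) -> Dict[str, str]:
    """Map each transformed synonym to its gene identifier.

    Assumption: the first identifier seen wins when two genes share a
    synonym; a real system would need to handle that ambiguity.
    """
    index: Dict[str, str] = {}
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            index.setdefault(normalize(synonym), gene_id)
    return index


def normalize_mention(
    mention: str,
    index: Dict[str, str],
    normalize: Transform,
    cutoff: float = 0.85,  # hypothetical threshold, not taken from the paper
) -> Optional[str]:
    """Try an exact match first, then fall back to approximate matching."""
    key = normalize(mention)
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


# Toy lexicon for illustration only; a real lexicon would be extracted
# automatically from online gene databases.
toy_lexicon = {
    "TP53": ["p53", "tumor protein p53"],
    "BRCA1": ["BRCA1", "breast cancer 1"],
}
normalize = chain([lowercase, strip_punctuation, collapse_whitespace])
index = build_index(toy_lexicon, normalize)

print(normalize_mention("Breast Cancer-1", index, normalize))     # BRCA1 (exact match after transformations)
print(normalize_mention("tumour protein p53", index, normalize))  # TP53 (approximate match)
print(normalize_mention("xyz kinase", index, normalize))          # None
```

Because `chain` takes an ordered list, the transformations can be reordered or swapped without touching the matching code, which mirrors the flexible ordering the abstract mentions.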

Original language: English (US)
Pages: 41-48
Number of pages: 8
State: Published - Jan 1 2006
Externally published: Yes
Event: HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006 - New York, United States
Duration: Jun 8 2006 → …

Conference

Conference: HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006
Country: United States
City: New York
Period: 6/8/06 → …

Fingerprint

Glossaries
Names
Genes
Dictionary
Normalization
Gene
Synonyms
MEDLINE
Databases
Strings

ASJC Scopus subject areas

  • Language and Linguistics
  • Information Systems
  • Software
  • Health Informatics
  • Computer Science Applications
  • Biomedical Engineering

Cite this

Fang, H. R., Murphy, K., Jin, Y., Signoff, J., & White, P. S. (2006). Human gene name normalization using text matching with automatically extracted synonym dictionaries. 41-48. Paper presented at HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006, New York, United States.

Human gene name normalization using text matching with automatically extracted synonym dictionaries. / Fang, Haw Ren; Murphy, Kevin; Jin, Yang; Signoff, Jessica; White, Peter S.

2006. 41-48 Paper presented at HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006, New York, United States.

Fang, HR, Murphy, K, Jin, Y, Signoff, J & White, PS 2006, 'Human gene name normalization using text matching with automatically extracted synonym dictionaries', Paper presented at HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006, New York, United States, 6/8/06 pp. 41-48.
Fang HR, Murphy K, Jin Y, Signoff J, White PS. Human gene name normalization using text matching with automatically extracted synonym dictionaries. 2006. Paper presented at HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006, New York, United States.
Fang, Haw Ren ; Murphy, Kevin ; Jin, Yang ; Signoff, Jessica ; White, Peter S. / Human gene name normalization using text matching with automatically extracted synonym dictionaries. Paper presented at HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006, New York, United States. 8 p.
@conference{f52792b3666541d09d43214acb419670,
title = "Human gene name normalization using text matching with automatically extracted synonym dictionaries",
abstract = "The identification of genes in biomedical text typically consists of two stages: identifying gene mentions and normalization of gene names. We have created an automated process that takes the output of named entity recognition (NER) systems designed to identify genes and normalizes them to standard referents. The system identifies human gene synonyms from online databases to generate an extensive synonym lexicon. The lexicon is then compared to a list of candidate gene mentions using various string transformations that can be applied and chained in a flexible order, followed by exact string matching or approximate string matching. Using a gold standard of MEDLINE abstracts manually tagged and normalized for mentions of human genes, a combined tagging and normalization system achieved 0.669 F-measure (0.718 precision and 0.626 recall) at the mention level, and 0.901 F-measure (0.957 precision and 0.857 recall) at the document level for documents used for tagger training.",
author = "Fang, {Haw Ren} and Kevin Murphy and Yang Jin and Jessica Signoff and White, {Peter S.}",
year = "2006",
month = "1",
day = "1",
language = "English (US)",
pages = "41--48",
note = "HLT-NAACL 2006 Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, BioNLP 2006 ; Conference date: 08-06-2006",

}

TY - CONF

T1 - Human gene name normalization using text matching with automatically extracted synonym dictionaries

AU - Fang, Haw Ren

AU - Murphy, Kevin

AU - Jin, Yang

AU - Signoff, Jessica

AU - White, Peter S.

PY - 2006/1/1

Y1 - 2006/1/1

AB - The identification of genes in biomedical text typically consists of two stages: identifying gene mentions and normalization of gene names. We have created an automated process that takes the output of named entity recognition (NER) systems designed to identify genes and normalizes them to standard referents. The system identifies human gene synonyms from online databases to generate an extensive synonym lexicon. The lexicon is then compared to a list of candidate gene mentions using various string transformations that can be applied and chained in a flexible order, followed by exact string matching or approximate string matching. Using a gold standard of MEDLINE abstracts manually tagged and normalized for mentions of human genes, a combined tagging and normalization system achieved 0.669 F-measure (0.718 precision and 0.626 recall) at the mention level, and 0.901 F-measure (0.957 precision and 0.857 recall) at the document level for documents used for tagger training.

UR - http://www.scopus.com/inward/record.url?scp=84888285017&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84888285017&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:84888285017

SP - 41

EP - 48

ER -