Doi:10.1016/j.compbiomed.2007.01.013

Computers in Biology and Medicine 37 (2007) 1511–1521

A knowledge based method for the medical question answering problem

Rafael M. Terol∗, Patricio Martínez-Barco, Manuel Palomar

Department of Software and Computing Systems, The University of Alicante, San Vicente del Raspeig Road, Alicante, Spain

Received 16 March 2006; received in revised form 22 January 2007; accepted 24 January 2007

Abstract
In this paper, a restricted domain question answering (QA) system is described. The design architecture of this QA system and the features that allow the adaptation of the QA system to the medical domain are also presented. The advantages of this QA system include the simple process of defining the question taxonomy answered by the system as well as the possibility of locally or remotely managed document collections. The main computing methods of the QA system are based on the application of natural language processing (NLP) techniques to infer the logic forms and on the treatment of the logic forms. The knowledge of the system is acquired through the use of two different resources: Unified Medical Language System (UMLS) to handle the medical terminology and WordNet to manage the open-domain terminology.
© 2007 Elsevier Ltd. All rights reserved.
Keywords: Bioinformatics; Biomedical; Text mining; Medicine; Medinformatics; Question answering framework; Medical question taxonomies

1. Introduction
Open-domain textual question answering (QA), as defined by the TREC competitions,¹ is the task of extracting the right answer from text snippets identified in large collections of documents where the answer to a natural language question lies.

Open-domain textual QA systems are defined as tools capable of extracting concrete answers to very precise information needs from document collections. For instance, in open domains, a system can respond to society questions such as “where was Marilyn Monroe born?” or “what is the name of Elizabeth Taylor’s fourth husband?”, geography questions such as “where is Halifax located?”, and so on. Examples of these kinds of QA systems in open domains can be found in the work of authors such as Moldovan, Sasaki, Vicedo, Zukerman and so on. These types of QA systems process document collections locally, discarding access to internet information sources.

In restricted domains, frequently asked question (FAQ) systems are often used to obtain common answers to a restricted set of questions that users need. These FAQ systems handle a database where the list of questions and their related answers are stored. Thus, the FAQ system allows users to choose one of the possible questions that the system is able to answer by searching the database for the answers related to that question. These FAQ systems do not handle natural language questions and, as the number of questions treated by the system grows, the user has to compare their question against a large number of answered questions. For these reasons, these FAQ systems are being replaced by QA systems over restricted domains. Nowadays, textual QA is also applied in restricted domains such as clinical, tourism, medical and so on. These systems are described in the Background section.

According to official results of the QA track at the last TREC conference, QA systems in open domains achieve between 30% and 40% precision.² In a restricted domain such as the medical domain, it is necessary to improve this score substantially because of the critical information handled in these medical areas, where erroneous information can pose serious risks to people’s health (no answer is better than an incorrect answer).

This is the reason why our research effort is directed towards textual QA in the medical domain, retrieving the information from internet websites. A great deal of medical information is available on the internet, the largest network in the world. This fact increases the importance of evaluating the quality of information on medical websites, because anyone can create a website and put any medical information on it. This medical information may not be accurate.

In this paper, a QA system is presented. This QA system is capable of working over any restricted domain. The adaptation of the system to the medical domain (the medical QA system) is also presented. The medical QA system is able to answer medical questions according to a generic question taxonomy. In the following sections, the main features of the QA system are described, focusing in detail on question analysis. Section 2 introduces the state of the art of QA systems. In Section 3, we present the motivation for working on QA in the medical domain. Section 4 details the modular architecture of the restricted-domain QA system and its adaptation to the medical domain. In Section 5, we describe the evaluation task and show the results obtained by our medical QA system. Section 6 discusses the contribution of our research work. The last section [...]

∗ Corresponding author. Tel.: +34 965903772; fax: +34 965909326.
E-mail address: (R.M. Terol).
¹ The Text REtrieval Conference (TREC) is a series of workshops organized by the National Institute of Standards and Technology (NIST), designed to advance the state of the art in information retrieval (IR) and QA.
² This evaluation measure gives the accuracy of the QA system.
0010-4825/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.

2. Background

QA requires complex natural language processing (NLP) techniques. The core of our QA system is text processing by way of logic forms. In the following sections, this complex NLP technique is defined. A logic form is a way of representing natural language sentences. Other authors employ logic forms in their QA systems. Concretely, Moldovan developed an open-domain QA system, and Mollá designed an open-domain QA system capable of answering natural language questions about the commands of the UNIX operating system. In Moldovan’s QA system, the identification of the predicates is based on the format of the Logic Form Transformation of eXtended WordNet, while Mollá identifies the predicates using a more complex terminology based on logic treatment. In order to focus their QA systems on open domains, Moldovan and Mollá employ complex inference rules in the treatment of the logic forms.

Moreover, the use of these open-domain textual QA systems in restricted domains such as the medical domain does not produce good results, because these systems use generic NLP resources such as WordNet,³ which is not specialized in medical terminology. When QA systems are directed to restricted domains, it is necessary to acquire rich knowledge resources of the domain that allow the system to understand the meaning of the information in the user’s question and in the documents.

Chung et al. presented a practical QA system in the meteorology domain that extracts information about the weather every hour from the website of the Korea Meteorological Administration. This information is structured and locally stored in a database management system (DBMS). The knowledge is obtained by consulting a domain-dependent ontology for recognizing weather events, and a domain-independent ontology for place names. Rinaldi et al. show the adaptation of an existing QA system to the genomics domain. The knowledge was extracted from several resources such as the Unified Medical Language System (UMLS), SWISS-PROT, OMIM, GeneOntology, GenBank and LocusLink. As an adaptation of the ExtrAns system to the new genomics domain, this system uses minimal logical forms to perform the semantic representation of documents and questions. Niu and Hirst’s previous work showed that current technologies for factoid QA⁴ in open domains were not adequate for clinical questions, whose answers must often be obtained by synthesizing relevant context. To adapt to this new characteristic of QA in the medical domain, they exploited the relations between the semantic [...]

As shown in the present section, different ways of processing logic forms are applied in open-domain QA. Also, open-domain QA systems can be adapted to restricted domains. In the following sections, our QA system based on the processing of logic forms is presented. The features that allow the portability of the QA system to a new domain (the medical domain) are also presented. These portability features imply that our QA system runs as a medical QA system.

³ WordNet is a large lexical database of the English language.
⁴ A factoid question is a fact-based, short answer question such as “When [...]

3. Motivation

There exist several agents that can interact in the clinical domains, such as doctors, patients, laboratories and so on. All of them need quick and easy ways to access electronic information. Access to the latest medical information helps doctors to select better diagnoses, helps patients to know about their conditions, and makes it possible to establish the most effective treatment. These facts produce a lot of information, and different types of information, between these agents that must be electronically processed. For example, people want to find competent medical answers to medical questions: when they have some unknown symptoms and want to know what they could be related to, or when they want another medical opinion about the best way to treat their disease, or when they can ask experienced doctors any medical questions related to any unknown symptoms or their state. All these features lead to the conclusion that the number and the types of medical questions that a medical QA system can respond to are very large.

These reasons motivated us to adapt the QA system to the medical domain. This medical QA system is capable of answering medical questions according to a medical question taxonomy. This question taxonomy is based on the study developed by Ely et al., whose main objective is to develop a taxonomy of doctors’ questions about patient care that could be used to help answer such questions. In this study, the participants were 103 Iowa family doctors and 49 Oregon primary care doctors. The authors concluded that clinical questions in primary care can be categorized into a limited number of generic types. A
moderate degree of interrater reliability was achieved with the taxonomy developed in this study. The taxonomy may enhance the understanding of doctors’ information needs and improve the ability to meet those needs. According to this question taxonomy, the 10 most frequent questions formulated by doctors are ranked in the following enumeration:

(1) What is the drug of choice for condition x?
(2) [...]
(3) What test is indicated in situation x?
(4) What is the dose of drug x?
(5) How should I treat condition x (not limited to drug treatment)?
(6) How should I manage condition x (not specifying diag[...])?
(7) What is the cause of physical finding x?
(8) What is the cause of test finding x?
(9) Can drug x cause (adverse) finding y?
(10) Could this patient have condition x?

Thus, our medical QA system must be able to answer natural language questions according to this set of 10 generic medical questions, discarding other questions (medical and from other domains). The fact that our QA system is only able to answer questions in this question taxonomy produces on the one hand a lower recall but on the other hand a higher precision, with the aim that our system will be very useful in the medical domain according to this question taxonomy.

This adapted domain QA system (in this case, for the medical domain) uses complex NLP techniques such as the treatment of logic forms. The main differences between the logic forms of our QA system and those of Moldovan and Mollá lie in the method of derivation of the logic forms, the method of identifying the predicates in the logic forms and the complexity of the inference rules in the treatment of the logic forms. On the one hand, the QA systems of Moldovan and Mollá derive the logic forms through the syntactic analysis of the sentence while, on the other hand, our QA system derives the logic form through the dependency relationships between the words. As Courtin and Genthial said, processing based on syntactic analysis makes it possible to add some semantic information to words. In open domains, this method of derivation of the logic forms improves the knowledge of the system. On the other hand, in restricted domains where other knowledge resources exist, the derivation of the logic form through the dependency relationships between the words is more concise. Also, in our QA system as in Moldovan’s QA system, the identification of the predicates is based on the format of the Logic Form Transformation of eXtended WordNet. In order to focus our QA system on restricted domains, in the logic forms treatment task, our inference rules are deeper than the inference rules applied by Moldovan.

The next section details the modular architecture of our QA system, capable of answering the questions formulated according to a question taxonomy. Concretely, we show the adaptation to the specific medical domain taxonomy, implemented by means of the medical QA system.

4. QA system architecture

The main components (modules) of our QA system are:

(1) Question analysis.
(2) Document retrieval.
(3) Relevant passages selection.
(4) Answer extraction.

These components are related to each other and process the textual information available on different levels until the QA process is complete.

The natural language questions formulated to the system are processed initially by the question analysis component. This process is very important since the quantity and quality of the information extracted in this analysis will condition the performance of the remaining components and, therefore, the final result.

A part of the information obtained from this question analysis process is used by the document retrieval module to perform a first selection of documents from websites. In a restricted domain the document collections are frequently updated, and this leads to high maintenance costs for locally stored, up-to-date document collections. This is the main reason why this task is remotely performed using the Google search service. The obtained result is a very reduced subset of the document collection.

Subsequently, the relevant passages selection module performs a more detailed analysis of the relevant documents subset with the objective of detecting those reduced text fragments that are likely to contain the searched answer.

Finally, the answer extraction module processes the small set of text fragments obtained from the previous process with the purpose of locating and extracting the searched answer. Fig. 1 graphically shows the execution sequence of these processes and the relationships between the modules.

Fig. 1. Medical QA system modular architecture.
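The execution sequence of the four modules can be sketched as a simple chain of calls. This is only an illustrative sketch, not the authors' Java implementation: the function name answer_question, the four callables and the dictionary used to carry the question analysis are all hypothetical, and returning None for a question outside the taxonomy is an assumption that follows the principle that no answer is better than an incorrect one.

```python
from typing import Callable, List, Optional


def answer_question(
    question: str,
    analyze: Callable[[str], Optional[dict]],
    retrieve: Callable[[dict], List[str]],
    select_passages: Callable[[dict, List[str]], List[str]],
    extract: Callable[[dict, List[str]], Optional[str]],
) -> Optional[str]:
    """Run the four modules in sequence (all callables are hypothetical)."""
    # (1) Question analysis: its output conditions every later module.
    analysis = analyze(question)
    if analysis is None:
        # Assumption: questions outside the taxonomy are discarded.
        return None
    # (2) Document retrieval: first selection of documents.
    documents = retrieve(analysis)
    # (3) Relevant passages selection: reduce documents to text fragments.
    passages = select_passages(analysis, documents)
    # (4) Answer extraction: locate the answer inside the fragments.
    return extract(analysis, passages)
```

Passing the modules as parameters keeps the sketch independent of where the documents and knowledge resources live, locally or remotely.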
The computational cost of this complex process is primarily dependent on two main factors: the speed of the internet connection in the tasks of document retrieval and named entities recognition, and the logic form derivation task. The temporal costs derived from the speed of the internet connection would be lower if the document collection and the knowledge resources (presented in the following subsections) were locally stored, because our system is also able to work locally with these resources. We prefer to work remotely with these resources because they are frequently updated (new drugs, new releases of knowledge resources, and so on). Moreover, with the aim of running this medical QA system on the most common operating systems, the Java™ platform has been used in the development phase. Persistent information is stored in the file system of the operating system. Thus, the dependencies between the DBMSs and the operating systems are avoided. Considering these development features of accessing the resources via internet, the average time needed to answer a question using the QA system is around 8 s.

Section 4.1 presents how the QA system preprocesses the sentences (questions and possible answers). Section 4.2 shows the portability features that allow the QA system to run as a medical QA system. Then, the rest of the sections (from 4.3 to 4.6) describe the main components of the medical QA system.

4.1. Preprocessing of the sentences

This preprocessing of the sentences allows the main modules to infer logic forms of sentences and obtain similarity relationships between verbs in the WordNet lexical database.

4.1.1. Inferring logic forms of sentences

Our medical QA system makes use of the logic forms of the sentences with the aim of simplifying the sentence treatment process. The logic form of a sentence is derived by applying NLP rules to the dependency relationships of the words in the sentence.

4.1.1.1. Getting dependency relationships. The first step necessary to infer the logic form of a sentence is to obtain the dependency relationships between the words of the sentence. The NLP resource used to obtain the dependency relationships between the words of the sentence is MINIPAR, a broad-coverage parser.

According to the definition proposed by Lin, a dependency relationship is an asymmetric binary relationship between a word called head (or governor, parent), and another word called modifier (or dependent, daughter). Normally the dependency relationships constitute a tree that links all the words in the sentence. This dependency tree has different levels of words because a word in the sentence may have different modifiers, but each word may modify at most one word. The root of the dependency tree does not modify any word. It is also called the head of the sentence.

For example, Fig. 2 shows the dependency tree of the sentence “Patient assistance programs help millions get the medications that they need”. The lexical category of each word is shown inside the brackets behind the word. These lexical categories can be noun (N), verb (V), adjective (A), and so on. Each one of the arrows labels the dependency relationship between the modifier and the head. These dependency relationships can be s (subject), mod (modifier), obj (object), and so on. In this example, the verb “to help” is the head of the sentence (the root of the dependency tree).

Fig. 2. Dependency tree of the sentence.

4.1.1.2. Logic form derivation. Once the dependency relationships have been acquired, the next step to automatically infer the logic form of the sentence is the analysis of these dependency relationships between the words of the sentence. The logic form derivation is a compositional process that starts in the leaves of the dependency tree, continues through the ramifications of the dependency tree and ends in the root of the dependency tree. Thus, the logic form is inferred on the one hand by the application of simple NLP rules to the leaves of the dependency tree and, on the other hand, by the application of complex NLP rules to all the pairs (modifier, head) in the dependency tree. This distinction between simple and complex NLP rules arises because in the leaves of the dependency tree there does not exist any dependency relationship in which the word is the head, while in the ramifications and in the root of the dependency tree such dependency relationships do exist.

To design the simple NLP rules only the lexical category of the word has been considered, while in the design of the complex NLP rules the lexical category of the head, the lexical category of the modifier, the type of dependency relationship and the relative position of the modifier (before the head or after the head) have been considered. Table 1 shows some simple NLP rules and Table 2 describes some complex NLP rules. In these tables, the Leaf column expresses the lexical category of the leaf, the LCH column describes the lexical category of the head in the dependency relationship, the LCM column expresses the lexical category of the modifier in the dependency relationship, the DR column shows the type of dependency relationship, the MP column expresses the relative position of the modifier with respect to the head, and the LF column shows the inferred logic form.

Table 1. Subset of simple NLP rules applied to the leaves in the dependency tree

Table 2. Subset of complex NLP rules applied to dependency relationships
LF
modifier LF + lemma of head:JJ(modifier x var)
modifier LF + lemma of head:VB(new e var, modifier x var, new x var)
head LF + Atributo:IN(head e var, modifier x var) + modifier LF

The assignment of predicates and arguments to the lemma of the words is based on the codification applied by the Logic Form Transformation of eXtended WordNet, a lexical resource based on logic forms. This codification depends on the part of speech of the word:

• Noun: An x-type argument is assigned to the predicate of this word. This argument uniquely identifies this predicate in the logic form. For instance, the noun “house” could be codified by the predicate “house:NN(x1)”.
• Verb: An e-type and two x-type arguments are assigned to the predicate of this word. The first one uniquely identifies this predicate (the action of the verb) in the logic form and the other ones denote the subject and the object of the verb. If the verb is intransitive then the object argument must be dummy. As an example, the verb “take” could be codified by the predicate “take:VB(e1, x1, x2)”.
• Adjective: An x-type argument is assigned to the predicate of this word. This argument uniquely identifies this predicate (the property denoted by the adjective) in the logic form. For instance, the adjective “young” could be codified by the predicate “young:JJ(x1)”. When the adjective modifies a noun (there exists a dependency relationship from the adjective to the noun) then both predicates in the logic form are instantiated by the same x-type argument. For instance “young man” could be codified as “young:JJ(x1) man:NN(x1)”.
• Adverb: An e-type argument is assigned to the predicate of this word. This argument uniquely identifies this predicate (the action expressed by the adverb) in the logic form. As an example the adverb “clearly” could be codified by the predicate “clearly:RB(e1)”.
• Preposition: A combination of x-type and e-type arguments can be assigned as the two arguments of this predicate, which only links the dependency relationship between two other predicates. For instance, the expression “south of America” could be codified as “south:NN(x1) of:IN(x1, x2) America:NN(x2)” while the expression “go to the airport” could be codified as “go:VB(e1, x1, x2) [...]

We summarize this complex process of inferring the logic form of a sentence through the following example with the sentence “The aspirin is effective”. The first step is to find the dependency relationships between the words in the sentence. Fig. 3 shows the dependency tree. The second step consists of applying the simple NLP rules to the leaves of this dependency tree and obtaining the predicates of the logic form derived in these leaves (see Table 3). The next step is based on applying the complex NLP rules to the ramifications and the root of the dependency tree, deriving the logic form (see Table 4).

Fig. 3. Dependency tree of the sentence.

Table 3. Simple NLP rules applied to the leaves in the dependency tree

Table 4. Complex NLP rules applied to dependency relationships
DR      MP      LF
subj    Before  aspirin:NN(x2) be:VB(e1, x2, x3)

Once all these rules have been applied to the dependency tree of the sentence “The aspirin is effective”, the logic form is inferred as “aspirin:NN(x2) be:VB(e1, x2, x3) Atributo:IN(e1, x1) effective:JJ(x1)”. Note that the verb “to be” is intransitive. As a result, in the logic form, on the one hand the argument of its predicate that represents the object (x3) is dummy and, on the other hand, the predicate “Atributo” links the dependency relationship between the verb and the adjective.

Table 5
CUI       CN            TUI   STY
C0020538  Hypertension  T047  Disease or syndrome

Our NLP technique used to infer the logic form differs from other techniques that accomplish the same goal, such as Moldovan’s, which takes as input the parse tree of a sentence, or Mollá’s, which introduces the flat form as an intermediate step between the sentence and the logic form.
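The part-of-speech codification can be illustrated with a small sketch that hands out x-type and e-type arguments and formats the predicates. This is a hedged illustration rather than the system's actual derivation: the class LogicFormBuilder and its methods are hypothetical, the dependency tree is not parsed here, and the order in which the leaves are visited (adjective before noun) is an assumption chosen so that the variable numbering matches the aspirin example.

```python
from itertools import count


class LogicFormBuilder:
    """Hypothetical helper that codifies words as logic-form predicates."""

    def __init__(self):
        self._x = count(1)  # counter for x-type arguments (entities)
        self._e = count(1)  # counter for e-type arguments (actions)

    def new_x(self):
        return f"x{next(self._x)}"

    def new_e(self):
        return f"e{next(self._e)}"

    def noun(self, lemma, arg=None):
        # A noun takes one x-type argument; reuse `arg` to share it.
        arg = arg or self.new_x()
        return f"{lemma}:NN({arg})", arg

    def adjective(self, lemma, arg=None):
        # An adjective takes one x-type argument, shared with its noun.
        arg = arg or self.new_x()
        return f"{lemma}:JJ({arg})", arg

    def adverb(self, lemma):
        # An adverb takes one e-type argument.
        arg = self.new_e()
        return f"{lemma}:RB({arg})", arg

    def verb(self, lemma, subject=None, obj=None):
        # A verb takes one e-type argument plus subject and object;
        # a missing object becomes a fresh (dummy) x-type argument.
        event = self.new_e()
        subject = subject or self.new_x()
        obj = obj or self.new_x()
        return f"{lemma}:VB({event}, {subject}, {obj})", event

    def link(self, name, first, second):
        # A preposition-like predicate linking two other predicates.
        return f"{name}:IN({first}, {second})"


# "The aspirin is effective": leaves first, then the root verb.
b = LogicFormBuilder()
effective, x1 = b.adjective("effective")
aspirin, x2 = b.noun("aspirin")
be, e1 = b.verb("be", subject=x2)
print(" ".join([aspirin, be, b.link("Atributo", e1, x1), effective]))
# prints: aspirin:NN(x2) be:VB(e1, x2, x3) Atributo:IN(e1, x1) effective:JJ(x1)
```

Sharing an argument, as in passing the adjective's x-type argument to noun(), is what ties “young:JJ(x1)” to “man:NN(x1)”.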
This generic NLP resource based on inferring the logic forms of the sentences is used by our medical QA system in the biomedical and health terminologies by way of concepts and performance of question analysis (deriving the logic forms of semantic types in the UMLS Metathesaurus. On the one hand, the questions) and answer extraction (deriving the logic forms the CUI column uniquely identifies the concept while the CN of the sentences that would contain the answer).
column shows the name of the concept, and on the other hand,the TUI column uniquely identifies the semantic type while the 4.1.2. Similarity relationships between verbs STY column describes the name of the semantic type associated In spite of the fact that UMLS is a rich resource in medical to the concept. Thus, our medical named entities recognition expressions, it does not contain much information related to module is based on dictionary. This module retrieves from the verbs because the verbs should not be domain independent. For UMLS Metathesaurus all the information relative to the concept this reason our system uses WordNet to extract the similar- and the semantic types of the free-text received as argument.
ity relationships of one verb to another. WordNet is a database The retrieval of this information from the UMLS Metathesaurus of word meanings and lexical relationships that records the se- is performed by consuming the UMLS Metathesaurus webser- mantic relations between the synonym sets, also called synsets.
vice through Simple Object Access Protocol (SOAP), an XML- A synset can be defined as a group of synonym words. These based messaging protocol. The processing of this retrieved in- synsets are related to each other according to different rela- tions: synonymy, hyponymy, hyperonymy, coordinate terms, Even though our QA system is able to locally work with holonymy, meronymy, antonymy, and so on.
the UMLS Metathesaurus, this feature is actually discardedbecause this resource is frequently updated with new releases.
4.2. Portability of the system to the medical domain The fact that the execution time decreases in a few seconds bylocally working with this resource would suppose the following To adapt the QA system to the medical domain it is necessary disadvantages: to detect when a new release has been published, to obtain medical knowledge by way of medical named entities to download this new release, to replace the previous installation recognition, and develop the patterns associated to each one of with the new release, and to make possible changes in the the treated generic medical questions.
software that interacts with the new release.
4.2.1. Medical named entities recognition Our medical QA system needs to recognize the medical enti- This off-line task consists of the definition of the patterns that ties in the sentences focusing on the processing in the different identify each generic question. These patterns are composed by phases of the QA process. The medical named entities recog- a combination of types of medical entities and verbs. These pat- nition performance is developed by using the UMLS a terns can be generated according to two different methods: the resource of the language of biomedicine and health. A great first one consists of the easy process of definition of patterns by number of concepts, relationships and definitions contained in an advanced user of the system, and the second one consists of UMLS have been derived from the Medical Subject Headings the automatic generation of the patterns through the processing (MeSH) vocabulary.Concretely, our system uses the UMLS of questions according to the question taxonomy. We are going knowledge source called Metathesaurus to accomplish this to describe these two different ways of generating patterns: goal. The UMLS Metathesaurus contains information about Manual pattern generation: The manual definition of these biomedical and health related concepts (meanings) facilitating patterns is presented in The advanced user of the system mapping free-text entries to biomedical and health terminolo- has to identify the types of medical entities and verbs that must gies. The UMLS Metathesaurus is organized by concept. These match in the generic question. The automatic expansion of these concepts have assigned, at least, one semantic type (category).
verbs according to their similarity relationships with other ones shows an example of the mapping free-text entries to in the WordNet lexical database is also performed. The follow-ing step consists of setting the medical entities lower threshold (MELT) and the medical entities upper threshold (MEUT) of MeSH is a huge controlled vocabulary created by the United States National Library for the purpose of indexing journal articles and books in each pattern. On the one hand, MELT can be defined as the minimum number of medical entities that must match between R.M. Terol et al. / Computers in Biology and Medicine 37 (2007) 1511 – 1521 the pattern and the question formulated by the user and, on the in the logic form through the similarity relations with other other hand, MEUT can be defined as the maximum number verbs in the WordNet lexical database. The next step consists of medical entities that can match between the pattern and the of the automatic setting of the MELT whose score is set to the question formulated by the user. Finally, the last step consists number of medical entities in the logic form minus one, and of the manual setting of the possible expected answer types.
the automatic setting of the MEUT of which the score is set to Supervised automatic pattern generation: The automatic the number of medical entities in the logic form. Finally, the generation of these patterns by the system is performed through last step consists of the manual setting of the possible expected the processing of questions matched to the question taxon- answer types. This task is supervised by an advanced user of omy as shown in Thus, the first step consists of the the system that can modify the results obtained by the system derivation of the logic form associated to each question. The next step is the medical named entities recognition in the logicform of those predicates whose type is noun (NN) or complex nominal (NNC) including their possible adjective modifiers(JJ). The third step is the recognition of the main verb in the The question analysis performance consists of classifying logic form and the automatic expansion of this main verb and analyzing the natural language questions that users can ask.
This computational process is based on two different tasks: • Question classification: assigning one of the generic pat- terns to each one of the questions that the user asks oursystem.
Question analysis: performing a complex process on the question according to the matched pattern and its respectivematched generic question.
This question classification task starts after the user enters the question into the system. In this implementation of the QA system adapted to the medical domain, 10 classes of user questions are managed according to the 10 generic questions treated by the system. Then, this task has to decide whether the user question belongs to one class (matches one of the generic questions) or not. To accomplish this goal, this task focuses on the treatment of question forms derived from the user questions according to the steps shown in Fig. 6. Thus, the first step consists of inferring the logic form of the question entered into the system. The second step is the extraction of the main verb in the logic form. The next step is the recognition of the medical entities of those predicates whose type is noun (NN) or complex nominal (NNC), including their possible adjective modifiers (JJ). The fourth step is the analysis of the question form, setting the medical entities score in question (MESQ). MESQ can be defined as the number of medical entities in the logic form of the question. The next step consists of finding those patterns of questions of which the list of verbs contains the main verb of the logic form [...] the entities matching measure (EMM), which is defined as the number of medical entities that match between the question and the pattern. Finally, the last step is the selection of the pattern whose difference between EMM and MELT is the lowest one.

Fig. 4. Manual pattern generation task.

Fig. 5. Supervised automatic pattern generation task.

Fig. 6. Question classification task.

Once the user question is matched to a generic question pattern from one of the 10 generic questions treated by the system, this question analysis task firstly captures the semantics of the user question. As mentioned before, WordNet and UMLS Metathesaurus are used in this performance. The following step consists of the recognition of the expected answer type. These medical answer types can be diseases, symptoms, dose of drugs, and so on, according to the possible answers to the 10 generic questions treated by the system. After that, the keywords are identified. These question keywords are directly recognized by applying a set of heuristics to the predicates and the relationships between predicates in the logic form. As question keywords, our QA system identifies complex nominals and nouns recognized as medical expressions (using medical named entities recognition) including their possible adjective modifiers, the rest of the complex nominals and nouns including their possible adjective modifiers, and the main verb in the logic form. For instance, in the part of the logic form “. . . high:JJ(x3) blood:NN(x1) NNC(x3, x1, x2) pressure:NN(x2) . . .”, the predicate x3 is recognized as a Disease or Syndrome and then “high blood pressure” is treated as a keyword. These question keywords can be expanded by applying a set of heuristics. For example, medical expressions can be expanded using similarity relations given by UMLS Metathesaurus. Thus, according to UMLS Metathesaurus, “high blood pressure” can be [...] This set of question keywords is sorted by priority, so if too many keywords are extracted from the question, only a maximum number of keywords are searched in the information [...]

Even though the document retrieval module can retrieve locally stored documents, its remote facility retrieves the relevant documents from medical websites using the Google search service. These medical websites can be sorted from the previously defined medical website classification. This medical website classification is performed before the real-time execution of the Google search engine and consists of defining the different medical website classes where our system can retrieve the medical documents. Once these medical website classes have been defined, an additional task that consists of relating the generic questions and these medical website classes can be defined, but it is not necessary. Note that a medical website class can be related to more than one generic question, and a generic question can be associated to more than one medical website class. Thus, this association relates each one of the generic questions and the medical websites that can answer them. Then, this document retrieval engine can start retrieving the relevant documents from medical websites whether or not there exists an association between the searched generic question and the medical website classes.

4.4.1. Document retrieval by way of medical websites classes
When the treated generic question has been related to at least one medical website class, then the Google search engine retrieves the relevant documents according to the question keywords [...]

4.4.2. Document retrieval by way of MFC algorithm
When the treated generic question has not been related to any medical website class, then we apply our most frequent classes (MFC) algorithm. This algorithm calculates the most frequent medical website classes that rightly answer the treated generic question in the latest searches. Thus, the Google search engine retrieves the relevant documents according to the question keywords in the medical websites that belong to these most frequent medical website classes. The update of the MFC for the treated generic question is produced using an adaptation of the LRU algorithm for database disk buffering [20]. This task consists of updating the MFC for the treated question with the actual medical website classes where the right answer can be found.

Once the medical documents are retrieved, this relevant passage selection process consists of extracting the sentences in these medical documents that could answer the user question. These sentences are extracted by applying a technique based on comparing the question keywords in the documents: those sentences that contain at least one question keyword are extracted from the document and are evaluated by the next answer extraction module, which decides if the sentence rightly answers the question.

This module extracts the answer by analyzing the sentences extracted by the previous relevant passage selection module. This process is performed by applying the following steps to each one of the retrieved sentences: the first one consists of inferring the logic form of the sentence and identifying the main verb in this logic form; the following step is to verify if this main verb belongs to the set of verbs that can answer the generic question; the third step is the recognition of the medical entities in the logic form; the next step consists of checking whether the medical entity searched as the answer is found in the logic form; and finally, the last step is the analysis of the predicates that relate the candidate answer, the main verb and the rest of the medical entities in the logic form (answer form). This process produces an answer ranking. In a valid answer, the verb can uniquely relate two medical entities, considering this feature as a direct link. Also, IN-type predicates can take part in the relation between the two medical entities, considering this feature as a connect link. Our system scores these two links differently: 1 for the direct link, and 0.8 for the connect link. To rank the answers, our system applies the link measure (LM).

Fig. 7. Question classification task.
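The pattern selection at the end of the question classification task (verb filter, EMM, then the smallest difference between EMM and MELT) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: patterns are plain dictionaries with hypothetical keys, medical entities are compared by semantic type name, and the check that EMM lies between MELT and MEUT is inferred from the definitions of the two thresholds rather than spelled out as a step in the text.

```python
from collections import Counter


def entities_matching_measure(question_entities, pattern_entities):
    """EMM: number of medical entities shared by question and pattern."""
    overlap = Counter(question_entities) & Counter(pattern_entities)
    return sum(overlap.values())


def classify_question(main_verb, question_entities, patterns):
    """Return the generic question of the best pattern, or None if no match."""
    best_question, best_diff = None, None
    for pattern in patterns:
        # Keep only patterns whose verb list contains the main verb.
        if main_verb not in pattern["verbs"]:
            continue
        emm = entities_matching_measure(question_entities, pattern["entities"])
        # Assumption: EMM must respect the pattern thresholds.
        if not (pattern["melt"] <= emm <= pattern["meut"]):
            continue
        # Select the pattern with the lowest EMM - MELT difference.
        diff = emm - pattern["melt"]
        if best_diff is None or diff < best_diff:
            best_question, best_diff = pattern["generic_question"], diff
    return best_question


# Hypothetical patterns; verbs and entity types are illustrative only.
patterns = [
    {
        "generic_question": "What is the drug of choice for condition x?",
        "verbs": {"treat", "associate"},
        "entities": ["Pharmacologic_Substance", "Disease_or_Syndrome"],
        "melt": 1,
        "meut": 2,
    },
    {
        "generic_question": "What is the cause of symptom x?",
        "verbs": {"cause"},
        "entities": ["Sign_or_Symptom"],
        "melt": 0,
        "meut": 1,
    },
]

matched = classify_question("associate", ["Disease_or_Syndrome"], patterns)
unmatched = classify_question("eat", ["Disease_or_Syndrome"], patterns)
print(matched)
print(unmatched)
```

Returning `None` when no pattern survives the filters corresponds to the case in which the user question does not belong to any of the generic question classes.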
For example, if a user asks the system the question “Which drugs are associated with the high blood pressure problem?”, this question is classified according to the first generic question “What is the drug of choice for condition x?”. Continuing with the processing, the answer extraction module receives as input the following sentences: “Cozaar treats hypertension” and “Hyzaar is indicated in the management of hypertension”. The logic form associated to the first sentence is “cozaar:NN(x1) treat:VB(e1, x1, x2) hypertension:NN(x2)” while the logic form associated to the second sentence is defined as “hyzaar:NN(x1) indicate:VB(e1, x1, x4) in:IN(e1, x3) management:NN(x3) of:IN(x3, x2) hypertension:NN(x2)”. The answer form associated to the first logic form is instantiated as “Pharmacologic_Substance:NN(x1) treat:VB(e1, x1, x2) Disease_or_Syndrome:NN(x2)”. Only a direct link (the treat verb) relates both medical entities (Pharmacologic_Substance and Disease_or_Syndrome). In this case LM = 1. The answer form associated to the second logic form is instantiated as “Pharmacologic_Substance:NN(x1) indicate:VB(e1, x1, x4) in:IN(e1, x3) management:NN(x3) of:IN(x3, x2) Disease_or_Syndrome:NN(x2)”. A direct link (the indicate verb) and two connect links (in and of) relate both medical entities (Pharmacologic_Substance and Disease_or_Syndrome). In this case LM = 0.8. Then, the answer ranking according to the LM scores would be: Cozaar and Hyzaar. These two answers would be the results returned by the system. LM ranks the answers according to the length of the paths between the treated medical entities. Thus, short paths would be in header [...]

5. Results

The evaluation of the medical QA system is based on the question analysis module, the core of the system, because the good performance of its question classification task (rightly classifying the formulated question into one of the generic questions) ultimately increases the precision of the system. Despite the fact that open-domain QA systems can be evaluated according to TREC and CLEF evaluation tracks, these kinds of evaluation tracks do not exist when a QA system is directed to a restricted domain. This is the main reason why the evaluation of the question classification task is based on the evaluation presented by Chung et al. in their previous research work [11]. Thus, a pilot evaluation task applied to the evaluation of the question classification performance has been developed, involving a group of people who did not work on the design and development phases of the QA system. These people received several instructions about the manual construction of these types of questions in order to manually create 50 questions according to the 10 generic questions answered by the system (GQ1: five questions that are matched to the first generic question; . . .; GQ10: five questions that are adjusted to the tenth generic question). Also, the OQ question set, which is composed of 200 questions of the last QA English Track at the CLEF 2005 conference, is included to evaluate the robustness of the question classification task in a noisy environment.

Fig. 7 shows how the question classifier task is able to classify each one of the given questions in one of the following classes:

• GE: This class of questions includes each one of the 10 generic questions. Thus, GE1 corresponds with the generic question “What is the drug of choice for condition x?”, GE2 is matched with the generic question “What is the cause of symptom x?”, . . . , and GE10 is arranged with the generic question “Could this patient have condition x?”.
• OE: The rest of the questions from other domains.

Then the evaluation task consists of checking if each one of the 250 evaluation questions (GQ1, . . . , GQ10 and OQ) has been correctly classified in the appropriate class of questions (GE1, . . . , GE10 or OE). As an evaluation measure, we apply the precision measure (P) defined as
P = # correctly classified questions / # classified questions.

The results obtained in this question classification task are presented in two tables: the detailed evaluation shows the results obtained in the evaluation of each subset of evaluation questions, while the summarized evaluation presents these results according to the generic set of evaluation questions. The class column expresses the class of questions that we are evaluating. The related class column shows the correct related class associated to each classified class. The questions column presents the number of classified questions. The number of five questions per class and 200 noisy questions has been empirically established in the pilot evaluation task, but this fact does not mean that the classifier is only able to classify this number of questions. The classifier, like the rest of the components of the QA system, does not depend on this number of questions to perform its functions. So, the QA system is able to sequentially manage an unlimited number of questions. The correct column indicates the number of questions that have been correctly classified according to the related class. The precision column shows the precision of this classification task, which agrees with the presented evaluation measure.

[Table: Detailed evaluation of the question classification task]

[Table: Summarized evaluation of the question classification task]

According to the overall row of the evaluation results, the precision score of the question classifier task is 94.4%. This good score will positively condition the right performance of the following parts of this QA process in the medical domain.

6 Similar to TREC, CLEF is another system evaluation campaign where QA systems can be tested, tuned and evaluated.
7 Five questions per class according to the question taxonomy.

6. Discussion

It is well known that there exist many information needs related to the different medical areas and specialities. Most of the on-line information in the health and medical areas is unknown to people outside of these areas, and even to health care professionals. These information needs can be solved by applying the medical QA system, which is capable of answering medical questions by retrieving the information from medical websites, discarding any other wrong medical information that anybody can put on different websites. According to the proposed architecture, the medical QA system can be easily transformed into a client–server application on the web accessed through a web-browser. Thus, the use of the medical QA system would [...]

The main novelty of the medical QA system is that the information can be retrieved from internet websites, in comparison to most QA systems (in open and restricted domains) that only retrieve locally stored information in a known host. Although this article presents a specific medical question taxonomy, the extension to other medical questions can be easily performed. Due to the efficient resources and techniques used by the medical QA system, the average temporal costs are around 8 s.

Also, with the aim of improving the temporal costs in answering the medical questions, each treated medical question can be searched in the medical websites considered by the system administrator. If this feature is not considered, then the system automatically applies an adaptation to our task of the LRU algorithm used by operating systems and DBMSs in memory management. This algorithm considers the medical websites where the system retrieved the documents that rightly responded to this class of question in previous executions of the system, and orders them according to the number of right responses retrieved in each medical website.

The software engineering rules that treat the module coupling and module cohesion properties in an object-oriented context have been applied in the design of the medical QA system architecture. For this reason the medical QA system can also be easily extended to other domains. This only requires adapting to the new domain the system’s submodule that performs the entities recognition task, and indicating which websites in the new domain contain the information from which the answers can be extracted.

7. Summary

QA is applied to medical disciplines in modern QA over restricted domains. It allows users to efficiently obtain a list of answers to medical questions. The medical QA system presented in the present article is able to answer these questions according to a medical question taxonomy. Thus, the medical QA system offers tools to automatically define the functional patterns of a new medical question from a set of questions matched to this new medical question. Once these functional patterns have been automatically created, the new medical question can be answered by the medical QA system, in conjunction with the rest of these generic medical questions. Also, the medical websites where the system can find the right answers to the new question can be given easily as an input of the system. This guide to medical websites will improve the temporal costs of the system in answering this class of medical questions. The core of the medical QA system is the logic form treatment. This complex process is produced by applying advanced NLP techniques. The logic form of a sentence is derived through applying NLP rules to the dependency relationship of the words in the sentence. The NLP resource used to obtain these dependency relationships is MINIPAR [17], a broad coverage parser. Other NLP resources are used in this complex process: on the
one hand the WordNet lexical database is used to extract the similarity relationships between the verbs and, on the other hand, the UMLS Metathesaurus is used to recognize the medical named entities in the text. In spite of the fact that this QA system has been adapted to the medical domain, it can also be adapted to other restricted domains.

Acknowledgments

This work has been partially funded by the Spanish Government under project CICyT number TIN2006-15265-C06-0 and PROFIT number PI051438, by the European Union under project number FP6-IST-2005-33860 and by the Valencia Government under project number GV06/028.

References

[1] D. Moldovan, C. Clark, S. Harabagiu, S. Maiorano, COGEX: a logic prover for question answering, in: Proceedings of HLT-NAACL 2003, Human Language Technology Conference, Edmonton, Canada, 2003, [...]
[2] Y. Sasaki, Question answering as question-biased term extraction: a new approach toward multilingual QA, in: Proceedings of 43rd Annual Meeting of the Association for Computational Linguistics, Michigan, [...]
[3] J.L. Vicedo, M. Saiz, R. Izquierdo, F. Llopis, Does English help question answering in Spanish, in: Proceedings of the Fifth Workshop of the Cross-Language Evaluation Forum, CLEF 2004, Bath, UK, September 2004.
[4] I. Zukerman, B. Raskutti, Lexical query paraphrasing for document retrieval, in: H.-H. Chen, C.-Y. Lin (Eds.), Proceedings of the 19th International Conference on Computational Linguistics, COLING 2002, [...]
[5] D. Demner-Fushman, J. Lin, Knowledge extraction for clinical question answering: preliminary results, in: Proceedings of the AAAI-05 Workshop on Question Answering in Restricted Domains, Pittsburgh, [...]
[6] F. Benamara, Cooperative question answering in restricted domains: the WEBCOOP experiment, in: ACL 2004 Workshop on Question Answering in Restricted Domains, Barcelona, Spain, July 2004.
[7] Y. Niu, G. Hirst, Analysis of semantic classes in medical text for question answering, in: Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics, Workshop on Question Answering in Restricted Domains, Barcelona, Spain, July 2004.
[8] D. Mollá, R. Schwitter, M. Hess, R. Fournier, ExtrAns, an answer extraction system, TAL Special Issue on Information Retrieval Oriented Natural Language Processing, 2002, pp. 495–522.
[9] [...] morphologically and semantically enhanced resource, in: Proceedings of ACL-SIGLEX99: Standardizing Lexical Resources, Maryland, June 1999, pp. 1–8.
[10] G.A. Miller, WordNet: an on-line lexical database, Int. J. Lexicography [...]
[11] H. Chung, Y.-I. Song, K.-S. Han, D.-S. Yoon, J.-Y. Lee, H.-C. Rim, S.-H. Kim, A practical QA system in restricted domains, in: Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics, Workshop on Question Answering in Restricted Domains, Barcelona, [...]
[12] F. Rinaldi, J. Dowdall, G. Schneider, A. Persidis, Answering questions in the genomics domain, in: Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics, Workshop on Question Answering in Restricted Domains, Barcelona, Spain, July 2004.
[13] D.A.B. Lindberg, B.L. Humphreys, A.T. McCray, The Unified Medical Language System, in: Methods of Information in Medicine, vol. 32(4), [...]
[14] Y. Niu, G. Hirst, G. McArthur, P. Rodriguez-Gianolli, Answering clinical questions with role identification, in: Proceedings of 41st Annual Meeting of the Association for Computational Linguistics, Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, July 2003.
[15] J.W. Ely, J.A. Osheroff, P.N. Gorman, M.H. Ebell, M.L. Chambliss, E.A. Pifer, P.Z. Stavri, A taxonomy of generic clinical questions: classification study, Brit. Med. J. 321 (2000) 429–432.
[16] J. Courtin, D. Genthial, Parsing with dependency relations and robust parsing, in: Proceedings of COLING-ACL ’98 Workshop on Processing of Dependency-based Grammars, Montreal, August 1998, pp. 88–94.
[17] D. Lin, Dependency-based evaluation of MINIPAR, in: Workshop on the Evaluation of Parsing Systems, Granada, Spain, 1998.
[18] D. Moldovan, V. Rus, Logic form transformation of WordNet and its applicability to question-answering, in: Proceedings of 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, [...]
[19] B.L. Humphreys, D.A.B. Lindberg, The UMLS project: making the conceptual connection between users and the information they need, Bull. Med. Libr. Assoc. 81 (1993) 170–177.
[20] E.J. O’Neil, P.E. O’Neil, G. Weikum, The LRU-K page replacement algorithm for database disk buffering, in: ACM SIGMOD Record, vol. [...]

Rafael M. Terol (1979) graduated from the University of Alicante, Spain, with a Bachelor of Engineering Degree in Information Technology. With a university rank for his undergraduate degree, he joined the University of Alicante in the year 2002 for his Master of Computer Sciences degree in Natural Language Processing Systems under the supervision of Dr. Patricio Martinez-Barco and Dr. Manuel Palomar. Working in the area of natural language processing, Rafael performed his research work at GPLSI-UA, Spain. He is currently working in the area of question answering (QA) in restricted domains. His research interests include medical QA, textual entailment and information retrieval.

Patricio Martinez-Barco (1968) holds a Ph.D. in Computer Science from the University of Alicante (2001) and a Master in Computer Science from the University of Alicante (1994). He has worked since 1995 in the Department of Software and Computing Science (GPLSI division) at this University as a lecturer. His research interests are focused on Computational Linguistics and Natural Language Processing. His latest projects are related to temporal expression resolution, syntactic-semantic patterns and logical forms applied to Information Extraction, Information Retrieval and Question Answering. He was General Chair of the ESTAL’04 (Alicante) and SEPLN’04 (Barcelona) conferences, as well as Local Chair of the SLPLT’01 (Jaén) workshop. He has edited several books, and contributed with more than 40 papers to several journals [...]

Manuel Palomar (1964) is the pro-vice-chancellor for research at the University of Alicante, and he is the head of the Natural Language Processing Group (GPLSI) of the Language and Information Systems Department at the University of Alicante, Spain. Palomar received a Ph.D. in computer science from the Technical University of Valencia, Spain. His research interests include information extraction, question answering, linguistic resources and, in general, research on Human Language Technologies.

