Articles prepublished February 07, 2012

Chi-square-based Scoring Function for Categorization of MEDLINE Citations

Journal: Methods of Information in Medicine
ISSN: 0026-1270
DOI: http://dx.doi.org/10.3414/ME09-01-0009
Issue: 2010 (Vol. 49), Issue 4
Pages: 371-378

Original Article

A. Kastrin (1), B. Peterlin (1), D. Hristovski (2)
(1) Institute of Medical Genetics, University Medical Centre Ljubljana, Ljubljana, Slovenia; (2) Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

Summary

Objectives: Text categorization has been used in biomedical informatics for identifying documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood of MEDLINE® citations containing genetically relevant topics.

Methods: Our procedure requires construction of a genetic and a nongenetic domain document corpus. We used MeSH® descriptors assigned to MEDLINE citations for this categorization task. We compared frequencies of MeSH descriptors between the two corpora using the chi-square test. A MeSH descriptor was considered to be a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest scores given to those citations containing MeSH descriptors typical for the genetic domain.

Results: Validation was done on a set of 734 manually annotated MEDLINE citations. The method achieved predictive accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine-learning algorithms (support vector machines, decision trees, naïve Bayes). Although the differences were not statistically significant, results showed that our chi-square scoring performs as well as the compared machine-learning algorithms.

Conclusions: We suggest that the chi-square scoring is an effective solution to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process.
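
The scoring idea described under Methods can be illustrated with a minimal Python sketch. The summary does not give the exact scoring formula, so this is an assumption, not the authors' implementation: each MeSH descriptor receives a Pearson chi-square statistic from a 2x2 table (corpus by descriptor present or absent), the statistic is signed positive when the descriptor is relatively more frequent in the genetic corpus, and a citation's score is taken to be the sum of the signed statistics of its descriptors. The function names and the representation of a citation as a list of descriptor strings are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per descriptor (assumed scoring scheme).

    Each document is an iterable of MeSH descriptor strings. A weight is
    positive when the descriptor's relative frequency is greater in the
    genetic corpus than in the nongenetic corpus, negative otherwise.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    if n_gen == 0 or n_non == 0:
        raise ValueError("both corpora must contain at least one document")
    # Count documents (not occurrences) that carry each descriptor.
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a, c = gen_counts[desc], non_counts[desc]
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        sign = 1.0 if a / n_gen > c / n_non else -1.0
        weights[desc] = sign * chi2
    return weights


def score_citation(descriptors, weights):
    """Sum of signed weights; summation is an assumption of this sketch."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))
```

With such weights, citations could be ranked by score and a threshold chosen on annotated data; how the threshold was selected in the original study is not stated in the summary.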

Keywords

Text mining, natural language processing, applied statistics, document categorization

