Assessment of Evaluation Metrics Used for Abstractive Text Summarization and a Proposal of a Novel Metric
Abstract
Background
Automatic text summarization is one of the most efficient techniques for obtaining a shorter, more concise version of unstructured text data. Text summarization algorithms have improved substantially in recent years, driven by advances in Machine Learning, Large Language Models, and Generative AI. Yet even as these algorithms improve, the way we measure the quality of the summaries they generate has remained largely unchanged: evaluation still relies on traditional methods based mostly on word overlap between the generated summary and the reference summary, which constrains how thoroughly a produced summary can be analyzed. This paper examines these measures in the business arena, where they are unable to serve business needs such as summary relevance, coherence, and informativeness. As businesses leverage transformer-based models to turn customer feedback into insights, improved evaluation metrics are needed. This research proposes a new metric that aligns with business objectives and strengthens automated summarization as a source of competitive advantage.
Methods
This study uses quantitative methods to propose and evaluate a metric that improves text summarization evaluation. Traditional metrics such as ROUGE and BLEU serve as baselines, against which a new metric, the Unified Summary Evaluation Score (USES), is proposed. Summaries were generated with the BART and T5 models, both capable of producing abstractive summaries from large datasets. Statistical techniques (t-tests, ANOVA, etc.) were used to assess the performance and consistency of USES against the conventional metrics, with respect to both surface-level and semantic-level accuracy.
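As an illustration of this evaluation pipeline, the Python sketch below generates summaries with off-the-shelf BART and T5 checkpoints and scores them with the ROUGE and BLEU baselines; the specific checkpoints (facebook/bart-large-cnn, t5-base), generation lengths, and scoring settings are illustrative assumptions, not the exact configuration used in the study.

    # Minimal sketch: generate abstractive summaries with BART/T5 and score them
    # with the traditional word-overlap baselines (ROUGE, BLEU).
    # Checkpoint names and generation settings are assumptions.
    from transformers import pipeline
    from rouge_score import rouge_scorer
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    document = "Long unstructured source text ..."    # placeholder input document
    reference = "Human-written reference summary ..." # placeholder reference summary

    # Abstractive summarizers examined in the study (BART and T5).
    bart = pipeline("summarization", model="facebook/bart-large-cnn")
    t5 = pipeline("summarization", model="t5-base")

    candidates = {
        "BART": bart(document, max_length=60, min_length=15)[0]["summary_text"],
        "T5": t5(document, max_length=60, min_length=15)[0]["summary_text"],
    }

    # Surface-level baselines: ROUGE F-measures and smoothed sentence-level BLEU.
    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    smooth = SmoothingFunction().method1
    for name, summary in candidates.items():
        r = rouge.score(reference, summary)
        b = sentence_bleu([reference.split()], summary.split(), smoothing_function=smooth)
        print(name, r["rouge1"].fmeasure, r["rougeL"].fmeasure, b)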
Results
Analysis of the quantitative data shows that USES yields better semantic accuracy and more reliable summary evaluation than traditional metrics. The combination of BERTScore, Wu-Palmer similarity, and cosine similarity enables USES to provide a more holistic evaluation along with better accuracy and consistency. According to the statistical tests, USES evaluates abstractive summaries more consistently and aligns more closely with human judgment.
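To make the composition of USES concrete, the sketch below combines BERTScore, Wu-Palmer similarity, and TF-IDF cosine similarity into a single score; the equal weighting, the word-level Wu-Palmer aggregation, and the TF-IDF representation for cosine similarity are illustrative assumptions rather than the exact formulation reported in the study.

    # Illustrative composition of a USES-style score from BERTScore, Wu-Palmer
    # similarity, and cosine similarity. Equal weights are an assumption.
    import nltk
    from bert_score import score as bert_score
    from nltk.corpus import wordnet as wn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    nltk.download("wordnet", quiet=True)  # WordNet data needed for Wu-Palmer

    def wu_palmer_sim(candidate, reference):
        """Average best Wu-Palmer similarity between the words of the two texts."""
        cand_syns = [wn.synsets(w) for w in candidate.lower().split()]
        ref_syns = [s for w in reference.lower().split() for s in wn.synsets(w)[:1]]
        scores = []
        for syns in cand_syns:
            if not syns or not ref_syns:
                continue
            best = max((syns[0].wup_similarity(r) or 0.0) for r in ref_syns)
            scores.append(best)
        return sum(scores) / len(scores) if scores else 0.0

    def uses_score(candidate, reference, weights=(1/3, 1/3, 1/3)):
        # Semantic similarity from contextual embeddings (BERTScore F1).
        _, _, f1 = bert_score([candidate], [reference], lang="en")
        semantic = f1.item()
        # Lexical-semantic similarity from WordNet (Wu-Palmer).
        wup = wu_palmer_sim(candidate, reference)
        # Surface similarity from TF-IDF cosine similarity.
        tfidf = TfidfVectorizer().fit_transform([candidate, reference])
        cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
        w1, w2, w3 = weights
        return w1 * semantic + w2 * wup + w3 * cosine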
Discussion and Conclusion
The results indicate that USES provides a more complete evaluation of summaries than traditional metrics, which consider only word overlap. By integrating semantics, coherence, and coverage, USES offers a more accurate and fair assessment of summaries generated by different large language models. This study contributes to better assessment frameworks for text summarization that will be useful across natural language processing and machine learning applications. It also lays the groundwork for further investigation of efficient, scalable evaluation strategies for improving summarization.
Keywords
Automatic Text Summarization, Natural Language Processing (NLP), Unstructured Text Data, Text Summarization Algorithms, Machine Learning, Large Language Models, Generative AI, Evaluation Metrics, Word Overlap Methods, Transformer-based Models.