As the context windows of LLMs scale up to millions of tokens, book-length summarization becomes an increasingly relevant problem. Most prior work on summarization has focused on documents of a few hundred to a few thousand tokens, leaving book summarization largely unexplored. Our work at the UMass Amherst NLP group includes completed, ongoing, and future projects focused on the book summarization task.
Authors 👉 Yapei Chang, Kyle Lo, Tanya Goyal, Mohit Iyyer
In this work, we explore coherence evaluation in the book summarization setting. After collecting newly published books unlikely to have been seen by LLMs during training, we adopt a fine-grained human annotation protocol to evaluate book summaries generated by models such as GPT-4 and Mixtral. Annotators read through a summary, highlight confusing spans, and ask clarification questions. We then derive an error taxonomy from these annotations and find that omission is the most common error type. Motivated by the high cost of human evaluation, we propose an automatic evaluation metric, BooookScore, which uses GPT-4 to obtain fine-grained annotations and computes a coherence score as the percentage of summary sentences that are error-free. We find that this metric aligns well with human judgments.
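To make the metric concrete, here is a minimal sketch of how a BooookScore-style coherence score can be computed once sentence-level annotations are available. The annotation format and function name below are illustrative assumptions, not the released BooookScore implementation.

```python
from typing import Dict, List


def coherence_score(sentence_annotations: List[Dict]) -> float:
    """Return the fraction of summary sentences with no flagged error types."""
    if not sentence_annotations:
        return 0.0
    error_free = sum(1 for ann in sentence_annotations if not ann["error_types"])
    return error_free / len(sentence_annotations)


# Toy example: 3 of 4 sentences are error-free, so the score is 0.75.
annotations = [
    {"sentence": "The story follows Mara, a young archivist.", "error_types": []},
    {"sentence": "Her brother vanishes without explanation.", "error_types": ["omission"]},
    {"sentence": "She travels north to find him.", "error_types": []},
    {"sentence": "In the end, the archive burns.", "error_types": []},
]
print(coherence_score(annotations))  # 0.75
```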
Authors 👉 Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, Mohit Iyyer
A major limitation of BooookScore is that it focuses only on coherence. In FABLES, we aim to fill this gap by conducting a human evaluation of faithfulness and content selection on LLM-generated book summaries. After collecting a set of newly published books, we obtain LLM summaries, extract claims from these summaries with GPT-4, and then hire annotators who have already read the books to judge the faithfulness of each claim while citing evidence from the book. Our error analysis suggests that most unfaithful claims concern events or states of characters and relationships. We further experiment with evaluating faithfulness automatically using LLMs, but find that they struggle to identify unfaithful claims. Annotators' high-level comments indicate that verifying unfaithful claims is also hard for humans, since it is often indirect and requires aggregating evidence from multiple parts of the book. Beyond faithfulness, we also analyze how well summaries cover the content of the books: annotators' comments reveal that all LLMs frequently omit important events, details, and themes.
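As an illustration of how claim-level judgments can be aggregated, the sketch below computes a per-model faithfulness score as the fraction of extracted claims that annotators marked faithful. The data layout and function name are assumptions made for illustration, not the FABLES annotation schema.

```python
from collections import defaultdict
from typing import Dict, List


def faithfulness_by_model(claim_labels: List[Dict]) -> Dict[str, float]:
    """Return, per summarizer, the fraction of extracted claims judged faithful."""
    faithful, total = defaultdict(int), defaultdict(int)
    for record in claim_labels:
        total[record["model"]] += 1
        if record["label"] == "faithful":
            faithful[record["model"]] += 1
    return {model: faithful[model] / total[model] for model in total}


# Toy example with made-up claims and labels.
labels = [
    {"model": "gpt-4", "claim": "Mara leaves the city at night.", "label": "faithful"},
    {"model": "gpt-4", "claim": "Her brother betrays her.", "label": "unfaithful"},
    {"model": "mixtral", "claim": "The novel ends in winter.", "label": "faithful"},
]
print(faithfulness_by_model(labels))  # {'gpt-4': 0.5, 'mixtral': 1.0}
```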
Automatic claim verification with LLMs: We delve deeper into automatically evaluating the faithfulness of LLM-generated book summaries by asking LLMs to verify claims extracted from summaries; a rough sketch of this verification step appears below.
Summarizing book series: Instead of standalone books, we scale up to entire series to better utilize and test LLMs' long context windows.
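For the claim verification direction above, the following is a hedged sketch of one way to prompt an LLM to judge a single extracted claim against passages retrieved from the source book. `verify_claim` and `call_llm` are hypothetical helpers, and the prompt wording is illustrative rather than the prompt used in our experiments.

```python
def verify_claim(claim: str, passages: list[str], call_llm) -> str:
    """Format a verification prompt over retrieved book passages and
    return the raw model verdict for later parsing."""
    evidence = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "You are verifying a claim made in a book summary.\n\n"
        f"Claim: {claim}\n\n"
        f"Evidence passages from the book:\n{evidence}\n\n"
        'Does the evidence support the claim? Answer "faithful" or "unfaithful", '
        "then briefly justify your answer by citing passage numbers."
    )
    # call_llm is any chat-completion client that maps a prompt string to a response string.
    return call_llm(prompt)
```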