Build an AI agent that reads, summarizes, and answers questions about long documents using Gist Memory and the Cerebras Inference SDK.
arxiv_parser.py
This script handles document ingestion by fetching an academic paper from ArXiv and converting it into a simple list of paragraphs. Since parsing PDFs is difficult, the script uses a clever workaround: it transforms the ArXiv link into its corresponding ar5iv HTML version, which is much easier to process with standard tools.
The parser’s logic is built around a few key functions:
get_ar5iv_link(url)
: This function takes a standard ArXiv URL for a PDF or abstract page and converts it into the equivalent ar5iv.labs.arxiv.org HTML link. It uses a regular expression to extract the paper’s unique ID to build the new URL.
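The conversion step can be sketched roughly as follows. This is an illustrative version, not the script's exact code: the regex here handles only new-style ArXiv IDs (e.g. 2301.00001), and the actual pattern in arxiv_parser.py may differ.

```python
import re

def get_ar5iv_link(url: str) -> str:
    """Convert an ArXiv abstract/PDF URL into its ar5iv HTML equivalent.

    Sketch of the approach described above; handles new-style IDs only.
    """
    # Capture the paper ID (e.g. "2301.00001") from /abs/ or /pdf/ URLs.
    match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
    if not match:
        raise ValueError(f"Could not extract an ArXiv ID from: {url}")
    return f"https://ar5iv.labs.arxiv.org/html/{match.group(1)}"
```

Because re.search scans the whole string, the same pattern works whether the URL points at the abstract page or a .pdf file.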
get_html_page(url)
: To avoid re-downloading the same paper, this function fetches the HTML and saves it to a local html_cache directory. On subsequent runs, if the file exists in the cache, it’s read directly from the disk.
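A minimal sketch of this read-through cache is below. The cache-file naming scheme (a hash of the URL) is an assumption for illustration; the real script may derive filenames differently.

```python
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path("html_cache")

def get_html_page(url: str) -> str:
    """Fetch a page's HTML, caching it on disk to avoid re-downloading."""
    CACHE_DIR.mkdir(exist_ok=True)
    # Derive a stable filename from the URL (illustrative naming scheme).
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        # Cache hit: read directly from disk on subsequent runs.
        return cache_file.read_text(encoding="utf-8")
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8")
    cache_file.write_text(html, encoding="utf-8")
    return html
```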
get_paragraphs_from_html(html)
: This function does the main work of text extraction. Using the BeautifulSoup library, it finds all paragraph elements in the HTML. It also includes a crucial preprocessing step for scientific content: it finds all mathematical formula tags (<math>
) and replaces them with their readable LaTeX alttext, wrapped in $ symbols so the LLM can understand them.
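The script does this extraction with BeautifulSoup; for a self-contained illustration of the same two steps (math-to-alttext substitution, then paragraph extraction), here is a simplified regex-based sketch. It is not the parser's actual implementation.

```python
import re
from html import unescape

def replace_math_with_alttext(html: str) -> str:
    """Replace <math ... alttext="..."> elements with their LaTeX alttext,
    wrapped in $...$ so the LLM can read each formula as plain text."""
    pattern = re.compile(r'<math\b[^>]*\balttext="([^"]*)"[^>]*>.*?</math>', re.DOTALL)
    return pattern.sub(lambda m: f"${unescape(m.group(1))}$", html)

def get_paragraphs_from_html(html: str) -> list[str]:
    """Extract plain-text paragraphs from ar5iv HTML (simplified sketch)."""
    html = replace_math_with_alttext(html)
    paragraphs = re.findall(r"<p\b[^>]*>(.*?)</p>", html, re.DOTALL)
    # Strip any remaining tags and collapse whitespace.
    cleaned = [re.sub(r"<[^>]+>", "", p) for p in paragraphs]
    return [" ".join(p.split()) for p in cleaned if p.strip()]
```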
To split the document into pages, the agent feeds paragraphs to the LLM with numbered labels (e.g. <57>) inserted between paragraphs. These labels correspond to each paragraph's index in the full document.
The LLM is asked to choose a natural break point by responding with one of these labels (e.g. <57>). If the label is valid, that paragraph index is used as the end of the current page. If the LLM fails to provide a valid break, the agent defaults to ending the page at the end of the accumulated chunk.
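The labeling and break-selection steps can be sketched as below. The exact label placement and parsing in the script may differ; this shows the idea of interleaving index markers and validating the LLM's reply with a fallback.

```python
import re

def insert_labels(paragraphs: list[str], start: int) -> str:
    """Interleave index labels like <57> between paragraphs so the LLM
    can point at a break position. Sketch of the pagination prompt body."""
    parts = []
    for i, para in enumerate(paragraphs, start=start):
        parts.append(para)
        parts.append(f"<{i + 1}>")  # label marks the boundary after paragraph i
    return "\n".join(parts)

def choose_page_end(llm_reply: str, lo: int, hi: int) -> int:
    """Parse the LLM's chosen break label, defaulting to the chunk end.

    Any missing or out-of-range label falls back to hi, the end of the
    accumulated chunk.
    """
    m = re.search(r"<(\d+)>", llm_reply)
    if m:
        idx = int(m.group(1))
        if lo < idx <= hi:
            return idx
    return hi
```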
Each page is then compressed into a gist using PROMPT_SHORTEN_TEMPLATE. This prompt instructs the LLM to “Please shorten the following passage. Just give me a shortened version. DO NOT explain your reason.” The raw output is cleaned by _post_process_summary, which strips away any conversational filler (e.g., “Here is the shortened version:”) to ensure the gist is clean. The agent stores two parallel lists: self.pages (the original, full-text pages) and self.shortened_pages (the corresponding gists). This dual-memory system is the core of the Gist Memory technique and is essential for the final question-answering stage.
To answer a question, the agent first performs a lookup. Using PROMPT_LOOKUP_TEMPLATE, it presents this gist memory and the user’s question to the LLM. The prompt specifically instructs the model not to answer the question yet, but instead to identify which pages it needs to read in full to find the answer. The agent then swaps those pages’ gists for their full text and, using PROMPT_FREE_ANSWER_TEMPLATE, sends this hybrid context and the user’s question to the LLM to generate the final, fully-informed answer.
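The two-step flow can be sketched end to end. The prompt wording and the page-number parsing below are illustrative, not the script's PROMPT_LOOKUP_TEMPLATE / PROMPT_FREE_ANSWER_TEMPLATE verbatim, and `llm` stands in for any prompt-to-completion function (e.g. a Cerebras SDK call).

```python
import re
from typing import Callable

def parse_page_numbers(reply: str) -> list[int]:
    """Extract the page indices the LLM asked to read in full (simplified:
    grabs every integer in the reply)."""
    return sorted({int(n) for n in re.findall(r"\d+", reply)})

def answer_question(question: str,
                    pages: list[str],
                    shortened_pages: list[str],
                    llm: Callable[[str], str]) -> str:
    """Two-step QA over gist memory: look up pages, then answer."""
    gists = "\n".join(f"Page {i}: {g}" for i, g in enumerate(shortened_pages))
    lookup_prompt = (f"{gists}\n\nQuestion: {question}\n"
                     "Do not answer yet. List the page numbers you need to read in full.")
    wanted = parse_page_numbers(llm(lookup_prompt))
    # Hybrid context: full text for requested pages, gists everywhere else.
    context = "\n".join(
        pages[i] if i in wanted else shortened_pages[i]
        for i in range(len(pages))
    )
    return llm(f"{context}\n\nQuestion: {question}\nAnswer:")
```

Only the requested pages are expanded, so the final prompt stays far smaller than the whole document while still containing the passages that matter.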