Njdeh Satourian
July 15, 2025
Special thanks to Rohan Deshpande for the original implementation of this agent during his time at Cerebras!
Reading and reasoning over long documents like academic papers or legal texts presents a major challenge for AI due to the context window limitations of Large Language Models (LLMs). To solve this, we will build an agent that implements Gist Memory, a technique developed by Google DeepMind that mimics human reading patterns. Instead of ingesting a whole document at once, the agent intelligently breaks it into pages, creates high-level summaries (“gists”) of each one, and then selectively re-reads only the most relevant sections to answer questions.
The agent’s workflow is entirely self-contained. It begins with a single ArXiv URL and processes it in memory through a structured sequence of steps. It first parses the document into clean text, then paginates it into semantic episodes. From there, it generates a concise gist for each page, creating a dual-memory structure: the full-text original pages and a parallel list of their corresponding summaries. This allows the agent to hold a compressed version of the entire document in memory.
At the heart of this agent is the Gist Memory technique, which relies on two key LLM-driven stages: intelligent pagination to create coherent chunks and interactive lookup to retrieve relevant information on demand. This multi-step process requires dozens of sequential LLM calls, making fast inference essential for a responsive user experience. To handle document ingestion, we leverage a helper script that converts ArXiv papers into a clean HTML format using the ar5iv service. In this cookbook recipe, we’ll walk you through how to build this Gist Agent step by step.
Architecture Overview
The agent’s architecture can be understood as a sequence of four internal modules that work together to read, remember, and reason about a long document:
- Parser: Fetches an ArXiv paper, converts it to HTML, and extracts a clean list of paragraphs for processing.
- Paginator: Breaks the long list of paragraphs into semantically coherent “pages” by using an LLM to identify natural breakpoints in the text.
- Summarizer: Reads each page and generates a concise “gist” to be stored in the agent’s memory.
- Q&A Engine: When asked a question, it first consults the list of gists to decide which pages are relevant, retrieves the full text for only those pages, and then generates an answer based on the enriched context.
For the entire codebase of this project, please visit its directory in our cookbook repository.
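To make the division of labor concrete, here is a bare-bones skeleton of the GistAgent class that the rest of this recipe fills in. The method names match the ones shown in the steps below; process_document is the ingestion entry point whose full implementation lives in the cookbook repository, so treat this as an orientation sketch rather than the final code.
from typing import List


class GistAgent:
    """Skeleton only: each method is implemented in the steps that follow."""

    # Dual memory built during ingestion:
    #   self.pages           -> full-text pages (lists of paragraphs)
    #   self.shortened_pages -> one gist per page

    def process_document(self, url: str):
        """Parser + Paginator + Summarizer: parse, paginate, and gist an ArXiv paper."""
        ...

    def _get_next_page_break(self, paragraphs: List[str], start_paragraph: int) -> int:
        """Paginator: ask the LLM for the next natural breakpoint (Step 2)."""
        ...

    def _create_summary(self, page: List[str]) -> str:
        """Summarizer: produce the gist for a single page (Step 3)."""
        ...

    def answer(self, question: str):
        """Q&A Engine: gist lookup, then answer from an enriched context (Step 4)."""
        ...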
Prerequisites
Before getting started, please ensure that:
- You have installed the Cerebras Inference SDK
- You have a Cerebras API key and have saved it as an environment variable, as shown below:
export CEREBRAS_API_KEY="your-api-key-here"
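Every stage of the agent calls the LLM through a small helper method, _run_llm, which wraps the Cerebras Inference SDK. The helper itself isn't reproduced in this recipe, so the sketch below shows what a minimal version might look like; the constructor shown here, the model name, and the error handling are assumptions, and the version in the repository likely also records the usage statistics behind the print_metrics call you'll see in Step 4.
import os
from typing import Dict, List, Optional

from cerebras.cloud.sdk import Cerebras


class GistAgent:
    def __init__(self, model: str = "llama-3.3-70b"):  # model name is an assumption
        # The SDK picks up the CEREBRAS_API_KEY environment variable set above
        self.client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))
        self.model = model
        self.pages: List[List[str]] = []        # full-text pages
        self.shortened_pages: List[str] = []    # gist memory

    def _run_llm(self, messages: List[Dict[str, str]]) -> Optional[str]:
        """Sends a single chat-completion request and returns its text, or None on failure."""
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"LLM call failed: {e}")
            return None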
Step 1: Parsing ArXiv Papers
Before the agent can read a document, it needs clean, machine-readable text. The arxiv_parser.py
script handles this by fetching an academic paper from ArXiv and converting it into a simple list of paragraphs. Since parsing PDFs is difficult, the script uses a clever workaround: it transforms the ArXiv link into its corresponding ar5iv HTML version, which is much easier to process with standard tools.
The parser’s logic is built around a few key functions:
- get_ar5iv_link(url): This function takes a standard ArXiv URL for a PDF or abstract page and converts it into the equivalent ar5iv.labs.arxiv.org HTML link. It uses a regular expression to extract the paper’s unique ID to build the new URL.
- get_html_page(url): To avoid re-downloading the same paper, this function fetches the HTML and saves it to a local html_cache directory. On subsequent runs, if the file exists in the cache, it’s read directly from the disk.
- get_paragraphs_from_html(html): This function does the main work of text extraction. Using the BeautifulSoup library, it finds all paragraph elements in the HTML. It also includes a crucial preprocessing step for scientific content: it finds all mathematical formula tags (<math>) and replaces them with their readable LaTeX alttext, wrapped in $ symbols so the LLM can understand them.
The final output of this script is a clean list of text paragraphs, ready to be passed to the main agent for the next stage: pagination.
import os
import re
from typing import List, Optional, Tuple

import requests
from bs4 import BeautifulSoup


def get_ar5iv_link(url: str) -> str:
    """
    Turns an arxiv link into an ar5iv link for HTML processing.
    Args:
        url (str): The original arxiv URL (e.g., https://arxiv.org/pdf/...).
    Returns:
        str: The corresponding ar5iv URL.
    """
    if url.startswith("https://ar5iv.labs.arxiv.org/html/"):
        return url
    # Updated regex to handle different arxiv URL formats (e.g. /abs/, /pdf/)
    match = re.search(r"arxiv\.org\/(?:pdf|abs)\/([\w+.-]+)", url)
    if not match:
        raise ValueError(f"{url} is not a valid arxiv link!")
    paper_id = match.group(1)
    # Remove .pdf if it exists
    if paper_id.endswith('.pdf'):
        paper_id = paper_id[:-4]
    return f"https://ar5iv.labs.arxiv.org/html/{paper_id}"


def get_html_page(url: str) -> str:
    """
    Fetches HTML content from a URL, using a local cache to avoid repeated requests.
    Args:
        url (str): The URL to fetch.
    Returns:
        str: The HTML content of the page.
    """
    cache_dir = "html_cache"
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    # Create a simple cache key from the URL
    cache_key = "".join(c for c in url if c.isalnum()) + ".html"
    file_path = os.path.join(cache_dir, cache_key)
    if os.path.exists(file_path):
        print(f"Cache hit for {url}. Reading from {file_path}")
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    else:
        print(f"Cache miss for {url}. Fetching from web...")
        try:
            response = requests.get(url)
            response.raise_for_status()  # Raises an HTTPError for bad responses (4xx or 5xx)
            html_content = response.text
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(html_content)
            return html_content
        except requests.exceptions.RequestException as e:
            print(f"Failed to fetch {url}: {e}")
            raise


def get_title_from_html(html: str) -> Optional[str]:
    """
    Extracts the document title from the ar5iv HTML.
    Args:
        html (str): The HTML content of the page.
    Returns:
        Optional[str]: The extracted title, or None if not found.
    """
    soup = BeautifulSoup(html, 'html.parser')
    element = soup.find(class_="ltx_title_document")
    if element:
        # Join fragments and strip whitespace for a clean title
        title = " ".join(element.get_text(strip=True).split())
        return title
    return None


def get_paragraphs_from_html(html: str) -> Tuple[List[str], List[str]]:
    """
    Extracts paragraphs from the ar5iv HTML.
    Returns both a clean text version for the LLM and the original HTML
    for potential rendering (though rendering is removed in this version).
    Args:
        html (str): The HTML content of the page.
    Returns:
        Tuple[List[str], List[str]]: A tuple containing:
            - A list of LLM-readable paragraphs (clean text).
            - A list of the original HTML paragraphs.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Start searching for paragraphs after the main title
    title_element = soup.find(class_="ltx_title_document")
    search_area = title_element if title_element else soup
    elements = search_area.find_all_next(class_="ltx_p")
    if not elements:  # Fallback if no paragraphs are found after the title
        elements = soup.find_all(class_="ltx_p")
    original_html = [str(e) for e in elements]
    llm_readable = []
    for e in elements:
        # Create a copy to avoid modifying the original soup object
        e_copy = BeautifulSoup(str(e), 'html.parser')
        # Replace <math> tags with their 'alttext' for better LLM consumption
        for math_tag in e_copy.find_all('math'):
            alttext = math_tag.get("alttext")
            if alttext:
                # Wrap in $ to signify it's a formula
                math_tag.replace_with(f"${alttext.strip()}$")
        text = e_copy.get_text(separator=' ', strip=True)
        if text:  # Only add non-empty paragraphs
            llm_readable.append(text)
    return llm_readable, original_html
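Putting these helpers together, a short driver might look like the following. The URL is just an example and the snippet is illustrative; in the agent, the equivalent calls happen at the start of document ingestion.
# Hypothetical driver for the parser helpers above
url = "https://arxiv.org/abs/2402.09727"                   # example ArXiv link
html = get_html_page(get_ar5iv_link(url))                   # cached under html_cache/ after the first run
title = get_title_from_html(html)
paragraphs, original_html = get_paragraphs_from_html(html)
print(f"{title}: extracted {len(paragraphs)} paragraphs")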
Step 2: Episode Pagination
Once the document is parsed into paragraphs, the next step is to group them into coherent “pages.” Instead of creating naive, fixed-size chunks that might awkwardly split a sentence or idea, the agent uses an LLM to find logical breakpoints. This process, called Episode Pagination, is handled by the _get_next_page_break method.
The pagination logic works as follows:
- Accumulate and Mark Text: The agent gathers paragraphs into a chunk of about 600 words. After a certain threshold, it begins inserting numbered labels (e.g., <57>) between paragraphs. These labels correspond to each paragraph’s index in the full document.
- Ask the LLM for a Breakpoint: This chunk, now containing embedded labels, is sent to the LLM. The prompt asks the model to choose the label that marks a “natural” place to break reading, such as a narrative transition or the end of an argument.
- Set the Page Boundary: The agent parses the LLM’s response to extract the chosen label (e.g., <57>). If the label is valid, that paragraph index is used as the end of the current page. If the LLM fails to provide a valid break, the agent defaults to ending the page at the end of the accumulated chunk.
This iterative process is repeated until the entire document is divided. The result is a list of pages stored in self.pages, where each page is a semantically coherent section of the paper, ready for the next step.
PROMPT_PAGINATION_TEMPLATE = """
You are given a passage that is taken from a larger text (article, book, ...) and some numbered labels between the paragraphs in the passage.
Numbered labels are in angled brackets. For example, if the label number is 19, it shows as <19> in the text.
Please choose one label where it is natural to break reading.
Such a point can be a scene transition, the end of a dialogue, the end of an argument, a narrative transition, etc.
Please answer with the break point label and explain.
For example, if <57> is a good point to break, answer with \"Break point: <57>\n Because ...\"
Passage:
{0}
{1}
{2}
"""
class GistAgent:
    ...

    def _get_next_page_break(self, paragraphs: List[str], start_paragraph: int) -> int:
        """
        Determines the next natural break point in the document.
        Args:
            paragraphs (List[str]): The list of all paragraphs in the document.
            start_paragraph (int): The index of the paragraph to start from.
        Returns:
            The index of the paragraph that marks the end of the new page.
        """
        word_limit = 600
        start_threshold = 280
        i = start_paragraph
        preceding = "" if i == 0 else "...\n" + '\n'.join(self.pages[-1])
        passage = [paragraphs[i]]
        wcount = len(paragraphs[i].split())
        j = i + 1
        while wcount < word_limit and j < len(paragraphs):
            wcount += len(paragraphs[j].split())
            if wcount >= start_threshold:
                passage.append(f"<{j}>")
            passage.append(paragraphs[j])
            j += 1
        passage.append(f"<{j}>")
        end_tag = "" if j == len(paragraphs) else paragraphs[j] + "\n..."
        if wcount < 350:
            return len(paragraphs)
        prompt = PROMPT_PAGINATION_TEMPLATE.format(preceding, '\n'.join(passage), end_tag)
        response = self._run_llm([{"role": "user", "content": prompt}])
        if response:
            pause_point = self._parse_pause_point(response)
            if pause_point and (pause_point > i and pause_point <= j):
                return pause_point
        # Fallback to the max paragraph count in this chunk
        return j
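The method above also depends on a small helper, _parse_pause_point, that is not shown in this recipe. Its job is to pull the chosen label out of a response like “Break point: <57> Because ...”, returning None when no usable label is found. A plausible implementation, written to match how it is called above, could look like this (the repository version may differ):
class GistAgent:
    ...

    def _parse_pause_point(self, response: str) -> Optional[int]:
        """Extracts the paragraph index from a 'Break point: <57>'-style response."""
        match = re.search(r"[Bb]reak point:\s*<(\d+)>", response)
        if not match:
            # Fall back to the first <N> label anywhere in the response
            match = re.search(r"<(\d+)>", response)
        return int(match.group(1)) if match else None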
Step 3: Summarization (Memory Gisting)
With the document now organized into coherent pages, the next step is to create a concise summary for each one. These summaries, or “gists,” form the agent’s Gist Memory—a condensed, high-level version of the entire document that can be quickly scanned later. This process is handled by the _create_summary method.
The summarization logic for each page is straightforward:
- The agent takes the full text of a page and inserts it into a simple, direct prompt defined by PROMPT_SHORTEN_TEMPLATE. This prompt instructs the LLM to “Please shorten the following passage. Just give me a shortened version. DO NOT explain your reason.”
- The LLM’s response is then cleaned by a helper function, _post_process_summary, which strips away any conversational filler (e.g., “Here is the shortened version:”) to ensure the gist is clean.
This process is repeated for every page in the document. At the end of this stage, the agent holds two parallel data structures in its memory: self.pages (a list of the original, full-text pages) and self.shortened_pages (a list of the corresponding gists). This dual-memory system is the core of the Gist Memory technique and is essential for the final question-answering stage.
PROMPT_SHORTEN_TEMPLATE = """
Please shorten the following passage.
Just give me a shortened version. DO NOT explain your reason.
Passage:
{}
"""
class GistAgent:
    ...

    def _create_summary(self, page: List[str]) -> str:
        """
        Creates a summary (gist) for a given page of text.
        """
        prompt = PROMPT_SHORTEN_TEMPLATE.format('\n'.join(page))
        response = self._run_llm([{"role": "user", "content": prompt}])
        if response:
            shortened_text = response.strip()
            return self._post_process_summary(shortened_text)
        return "Failed to generate summary."

    def _post_process_summary(self, text: str) -> str:
        """Removes conversational prefixes from summaries."""
        match = re.match(r"(here[a-z ]+ shortened.*?:)", text.lower())
        if match:
            text = text[len(match.group(1)):].strip()
        return text
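For orientation, here is a hedged sketch of how the pagination loop (Step 2) and the gisting loop (Step 3) might be stitched together inside process_document, the ingestion method that answer() expects to have been called. It reuses the Step 1 parser functions and the two methods above; treat it as an illustration of the control flow rather than the repository's exact implementation.
class GistAgent:
    ...

    def process_document(self, url: str):
        """Parses an ArXiv paper, paginates it into pages, and gists each page."""
        html = get_html_page(get_ar5iv_link(url))
        paragraphs, _ = get_paragraphs_from_html(html)

        # Step 2: episode pagination (repeatedly ask the LLM for the next breakpoint)
        self.pages = []
        start = 0
        while start < len(paragraphs):
            end = self._get_next_page_break(paragraphs, start)
            self.pages.append(paragraphs[start:end])
            start = end

        # Step 3: memory gisting (one concise summary per page)
        self.shortened_pages = [self._create_summary(page) for page in self.pages]
        print(f"Processed {len(self.pages)} pages into gist memory.")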
Step 4: The Q&A Engine with Interactive Lookup
This final stage is where the agent uses its Gist Memory to answer questions about the document. Instead of overwhelming the LLM with the full text, the answer method orchestrates a two-step “lookup then answer” process that allows the agent to focus only on the most relevant information.
The Q&A logic works as follows:
The Lookup Stage: The agent first needs to decide which parts of the document to re-read.
- It compiles all the gists into a single “memory” text, where each gist is labeled with its page number.
- Using the PROMPT_LOOKUP_TEMPLATE, it presents this gist memory and the user’s question to the LLM. The prompt specifically instructs the model not to answer the question yet, but instead to identify which pages it needs to read in full to find the answer.
- The agent parses the page numbers from the model’s response (e.g., “I want to look up Page [2, 5]…”).
The Answering Stage: With the relevant page numbers identified, the agent constructs a final, enriched context to generate the answer.
- It starts with the list of all page gists.
- It then iterates through the page numbers chosen during the lookup stage and replaces their gists with the original, full-text versions from self.pages. The result is a hybrid context containing high-detail excerpts where needed and low-detail summaries everywhere else.
- Finally, using the PROMPT_FREE_ANSWER_TEMPLATE, the agent sends this hybrid context and the user’s question to the LLM to generate the final, fully informed answer.
By following this process, the agent intelligently consults its memory to retrieve only the most pertinent details on demand, allowing it to provide accurate answers based on documents that are far too long to fit in a single context window.
PROMPT_LOOKUP_TEMPLATE = """
The following text is what you remembered from reading an article and a multiple choice question related to it.
You may read 1 to 6 page(s) of the article again to refresh your memory to prepare yourself for the question.
Please respond with which page(s) you would like to read.
For example, if you only need to read Page 8, respond with \"I want to look up Page [8] to ...\";
if you would like to read Page 7 and 12, respond with \"I want to look up Page [7, 12] to ...\";
if you would like to read Page 2, 3, 7, 15 and 18, respond with \"I want to look up Page [2, 3, 7, 15, 18] to ...\";
if you would like to read Page 3, 4, 5, 12, 13 and 16, respond with \"I want to look up Page [3, 4, 5, 12, 13, 16] to ...\".
DO NOT select more pages if you don't need to.
DO NOT answer the question yet.
Text:
{}
Question:
{}
Take a deep breath and tell me: Which page(s) would you like to read again?
"""
PROMPT_FREE_ANSWER_TEMPLATE = """
Read the following article and then answer the question.
Article:
{}
Question:
{}
"""
class GistAgent:
    ...

    def answer(self, question: str):
        """
        Answers a question based on the processed document's gist memory.
        Args:
            question (str): The user's question.
        """
        if not self.pages:
            print("Error: No document has been processed. Please call `process_document` first.")
            return

        print("\n" + "="*20)
        print(f"Question: {question}")
        print("="*20 + "\n")

        shortened_pages_pidx = [f"\nPage {i}:\n{gist}" for i, gist in enumerate(self.shortened_pages)]
        shortened_article = '\n'.join(shortened_pages_pidx)

        # Step 1: Ask the model which pages to look up
        prompt_lookup = PROMPT_LOOKUP_TEMPLATE.format(shortened_article, question)
        print("Asking model for page lookup rationale...")
        intermediate_response = self._run_llm([{"role": "user", "content": prompt_lookup}])
        if not intermediate_response:
            print("Failed to get lookup response from LLM.")
            return

        print("\n--- Model's Lookup Rationale ---")
        print(intermediate_response.strip())
        print("--------------------------------\n")

        page_ids = []
        try:
            match = re.search(r'\[([\d,\s]+)\]', intermediate_response)
            if match:
                page_ids_str = match.group(1).split(',')
                for p in page_ids_str:
                    if p.strip().isnumeric():
                        page_id = int(p.strip())
                        if 0 <= page_id < len(self.pages):
                            page_ids.append(page_id)
                        else:
                            print(f" - (Model requested invalid page index: {page_id})")
        except Exception as e:
            print(f"Could not parse page IDs from response: {e}")

        chosen_pages = sorted(list(set(page_ids))) if page_ids else 'None'
        print(f"Model chose to re-read page(s): {chosen_pages}\n")

        # Step 2: Construct the final context with expanded pages
        expanded_shortened_pages = self.shortened_pages[:]
        if page_ids:
            for page_id in page_ids:
                expanded_shortened_pages[page_id] = '\n'.join(self.pages[page_id])
        expanded_article = '\n'.join(expanded_shortened_pages)

        # Step 3: Ask the final question
        prompt_answer = PROMPT_FREE_ANSWER_TEMPLATE.format(expanded_article, question)
        print("Generating final answer...")
        final_answer = self._run_llm([{"role": "user", "content": prompt_answer}])

        if final_answer:
            print("\n--- Final Answer ---")
            print(final_answer.strip())
            print("--------------------\n")
        else:
            print("Failed to generate a final answer.")

        self.print_metrics()
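To close the loop, an end-to-end run might look like the following. The URL and question are illustrative placeholders; process_document is the ingestion entry point referenced by answer() above.
if __name__ == "__main__":
    agent = GistAgent()
    agent.process_document("https://arxiv.org/abs/2402.09727")  # example paper
    agent.answer("What problem does Gist Memory address, and how does episode pagination work?")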
Conclusion
In this tutorial, we built a Gist Agent capable of reading and reasoning about academic papers far longer than a typical LLM context window can support. By mimicking a human reader’s strategy of semantic pagination, summarization, and targeted lookup, the agent intelligently overcomes the context limitations of standard models. This project serves as a powerful example of how to solve complex problems by breaking them into smaller, manageable parts and using an LLM as a component in a larger workflow.
The Gist Memory agent demonstrates a fundamental shift in designing AI systems. Instead of relying on a single, massive prompt, we created an algorithmic workflow where the LLM is called dozens of times sequentially to paginate, summarize, and retrieve information. The model’s output from one step, such as the list of gists, directly informs the input for subsequent steps, like the lookup stage. This iterative, memory-augmented architecture represents a more sophisticated and capable approach to building AI agents.
This new class of agent architecture is only practical with access to high-speed, low-latency inference. Processing a single document can require over 20 LLM calls, with additional calls needed for each question asked. If each call took several seconds, the user would be left waiting for minutes, rendering the agent useless for interactive Q&A. Fast inference is therefore not just a performance enhancement; it is the core enabling technology that makes complex, multi-step agentic workflows like this one viable.