Njdeh Satourian
July 15, 2025
How Iteration Improves Summarization & Why Fast Inference Matters
Having established iterative summarization as the cornerstone of our report generation pipeline, let’s examine why this approach dramatically enhances quality and why fast inference is critical to making this feasible. The iterative summarization method—Reflect → Elaborate → Critique → Refine—is fundamentally designed to address common pitfalls in automated summarization, such as factual inaccuracies, omission of critical details, and surface-level analysis. Each iteration improves upon the previous one, guided by structured feedback and critique, ensuring that summaries are thorough, accurate, and nuanced.
Research studies underscore the effectiveness of this approach. For instance, the Self-Refine method (Madaan et al., 2023) demonstrated that iterative self-feedback significantly boosts accuracy by as much as 20%. Similarly, SelfCheckGPT (Manakul et al., 2023) confirmed that incorporating a critique phase using a second model or alternative decoding strategies substantially reduces errors, omissions, and hallucinations common in single-pass summarization.
However, such iterative processes are computationally intensive, requiring multiple sequential LLM calls for every article. In our pipeline, summarizing just 12 articles involves nearly 50 sequential model invocations. Traditional inference providers with slower response times would render such an iterative loop impractical due to prohibitively high latency. Our pipeline leverages fast inference—using models like Llama 3.3 70B and Qwen 3 32B, served at speeds exceeding 2100 tokens per second—to execute each summarization step in mere seconds. This rapid response makes the iterative approach not only feasible but highly practical, enabling near-instant feedback loops that significantly enhance final report quality. In this cookbook recipe, we’ll walk you through how to build this multi-agent system step by step.
Architecture Overview
The system’s architecture is modeled after an assembly line for knowledge work, comprising five distinct, specialized agents that each perform a single task before passing their work to the next:
- Interactor: Refines the user’s initial topic by asking clarifying questions and capturing the answers to create a detailed research_brief.json.
- Researcher: Takes the research brief, generates targeted queries, searches the web, and uses an advanced iterative loop (Reflect, Elaborate, Critique, Refine) to produce high-quality article summaries.
- Outliner: Synthesizes all the research summaries into a structured, logical blueprint for the final report, embedding citation placeholders at each step.
- Writer: Composes the full, human-readable narrative in Markdown, transforming the outline’s bullet points into flowing paragraphs and preserving the citation placeholders.
- Citation Manager: Performs the final post-processing step, converting the placeholders into numbered citations and building a professional reference list at the end of the document.
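Before diving into each agent, here is a hypothetical top-level orchestration showing how the five stages chain together. The module names and the save_research_brief helper are illustrative assumptions; the real main script is not reproduced in this recipe (see Prerequisites below).

```python
# Hypothetical top-level orchestration of the five agents. The module names and
# the save_research_brief helper are illustrative; the actual main script is
# omitted from this write-up.
from interactor import ask_follow_up_questions, capture_user_answers, save_research_brief
from researcher import run_research_tasks
from outliner import create_report_outline
from writer import write_report_from_outline
from citation_manager import create_final_report


def main() -> None:
    topic = input("What topic should the report cover? ")

    # 1. Interactor: clarify intent and persist the research brief.
    questions = ask_follow_up_questions(topic)
    answers = capture_user_answers(questions)
    research_brief = save_research_brief(topic, questions, answers)

    # 2. Researcher: search the web, then iteratively summarize each article.
    summaries = run_research_tasks(research_brief)

    # 3. Outliner: turn the summaries into a structured report blueprint.
    outline = create_report_outline(summaries)

    # 4. Writer: expand the outline into a Markdown draft with citation placeholders.
    write_report_from_outline(outline)

    # 5. Citation Manager: renumber citations and append the reference list.
    create_final_report()


if __name__ == "__main__":
    main()
```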
Prerequisites
To keep this recipe concise, we have not included every snippet of code found in the codebase for this agent; you may notice import statements and the main orchestration logic missing. You can find the entirety of the code in its directory here.
Shared Client Architecture
To optimize performance and reduce API initialization overhead, the system uses a shared client architecture. Instead of creating a new Cerebras client instance for each function call, we create it once and reuse it throughout the entire pipeline. The cerebras_client.py module implements a singleton pattern, which provides:
- Performance: Eliminates repeated client initialization overhead
- Resource efficiency: One connection/session instead of many
- Flexibility: Both synchronous and asynchronous clients available
- Consistency: All modules use the same client instance
Other modules simply call get_client() or get_async_client() as needed, with the client being created lazily on first use and reused for all subsequent API calls.
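A minimal sketch of what cerebras_client.py might look like, assuming the Cerebras Cloud SDK’s Cerebras and AsyncCerebras classes and a CEREBRAS_API_KEY environment variable; the actual module in the repo may differ.

```python
# cerebras_client.py — lazy singleton wrappers around the Cerebras SDK clients.
# Sketch only: assumes the cerebras-cloud-sdk package and a CEREBRAS_API_KEY
# environment variable.
import os
from typing import Optional

from cerebras.cloud.sdk import AsyncCerebras, Cerebras

_client: Optional[Cerebras] = None
_async_client: Optional[AsyncCerebras] = None


def get_client() -> Cerebras:
    """Return the shared synchronous client, creating it on first use."""
    global _client
    if _client is None:
        _client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])
    return _client


def get_async_client() -> AsyncCerebras:
    """Return the shared asynchronous client, creating it on first use."""
    global _async_client
    if _async_client is None:
        _async_client = AsyncCerebras(api_key=os.environ["CEREBRAS_API_KEY"])
    return _async_client
```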
Configuration System
The system uses a centralized configuration file (config.yaml) that allows easy customization of models, API limits, word counts, and other parameters without modifying code. This makes the system highly configurable and maintainable.
The config_loader.py module provides access to these settings.
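Since the loader itself is not reproduced here, the following is a minimal sketch of how config_loader.py could expose those settings; the key names (models, api_limits, word_counts) are assumptions about the config layout, not the repo’s actual schema.

```python
# config_loader.py — read config.yaml once and expose a simple accessor.
# Sketch only: the nested key names are assumptions, not the repo's schema.
from functools import lru_cache
from pathlib import Path
from typing import Any

import yaml

CONFIG_PATH = Path(__file__).parent / "config.yaml"


@lru_cache(maxsize=1)
def load_config() -> dict[str, Any]:
    """Parse config.yaml a single time and cache the result."""
    with CONFIG_PATH.open("r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def get_setting(*keys: str, default: Any = None) -> Any:
    """Walk nested keys, e.g. get_setting("models", "summarizer")."""
    value: Any = load_config()
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value
```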
Step 1: Interaction
The first step in our pipeline focuses on clarifying and capturing the user’s precise intent, ensuring the accuracy and relevance of the final output. The script comprises two core functions:
- ask_follow_up_questions(topic) - Takes the user’s provided topic and generates three structured follow-up questions.
- capture_user_answers(questions) - Presents these generated questions one-by-one, interactively capturing the user’s answers.
The topic, generated questions, and captured answers are then saved together as research_brief.json, which serves as the input for the research phase.
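A condensed sketch of these two functions using the shared Cerebras client. The prompt wording, model ID, and research_brief fields are assumptions, and save_research_brief is a hypothetical helper rather than a function named in the repo.

```python
# Interactor sketch — generate clarifying questions, then capture answers.
# The prompt wording, model ID, and research_brief fields are assumptions.
import json

from cerebras_client import get_client


def ask_follow_up_questions(topic: str) -> list[str]:
    """Ask the LLM for three clarifying questions about the user's topic."""
    response = get_client().chat.completions.create(
        model="llama-3.3-70b",  # assumed model ID
        messages=[{
            "role": "user",
            "content": (
                f"The user wants a news report on: {topic}\n"
                "Return a JSON array of exactly three clarifying questions."
            ),
        }],
    )
    # Assumes the model returns bare JSON; the real code may be more defensive.
    return json.loads(response.choices[0].message.content)


def capture_user_answers(questions: list[str]) -> list[str]:
    """Present each question on the console and record the user's answer."""
    return [input(f"{question}\n> ") for question in questions]


def save_research_brief(topic: str, questions: list[str], answers: list[str]) -> dict:
    """Hypothetical helper that persists the brief driving the rest of the pipeline."""
    brief = {"topic": topic, "questions": questions, "answers": answers}
    with open("research_brief.json", "w", encoding="utf-8") as f:
        json.dump(brief, f, indent=2)
    return brief
```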
Step 2: Research and Advanced Summarization
Next, we build out the core research engine of the pipeline. It transforms the detailed research_brief.json from Step 1 into a collection of high-quality, detailed summaries through a combination of web searches and advanced AI summarization.
This script operates around two main functions:
- run_research_tasks(research_brief) - Orchestrates a “General + Specific” query strategy. It first formulates one broad search query based on your initial topic, followed by three specific queries derived from the clarifying questions and answers in the brief. Using the Exa API, it retrieves the top three articles for each query, resulting in 12 selected source articles (a search sketch appears at the end of this step).
- summarize_single_article(article_text, research_brief) - Conducts a sophisticated four-step “Iterative Refinement” summarization loop (sketched just below):
  - Reflect: Identifies key points to structure the summary.
  - Elaborate: Generates an initial detailed summary draft.
  - Critique: Employs a different model (Qwen-3-32B) to review and critique the summary.
  - Refine: Produces a final, refined summary incorporating the critique.
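Here is a condensed sketch of that loop using the shared Cerebras client. The prompts are paraphrased and the model IDs are assumptions based on the models named earlier in this post, so treat it as illustrative rather than the repo’s exact code.

```python
# summarize_single_article sketch — the Reflect → Elaborate → Critique → Refine loop.
# Illustrative only: prompts are paraphrased and model IDs are assumptions.
from cerebras_client import get_client

SUMMARY_MODEL = "llama-3.3-70b"   # drafts and refines the summary (assumed ID)
CRITIQUE_MODEL = "qwen-3-32b"     # provides the independent critique (assumed ID)


def _chat(model: str, prompt: str) -> str:
    """Single chat-completion call against the shared Cerebras client."""
    response = get_client().chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def summarize_single_article(article_text: str, research_brief: dict) -> str:
    topic = research_brief["topic"]

    # 1. Reflect: identify the key points the summary must cover.
    key_points = _chat(
        SUMMARY_MODEL,
        f"List the key points in this article relevant to '{topic}':\n\n{article_text}",
    )

    # 2. Elaborate: produce an initial detailed draft from those key points.
    draft = _chat(
        SUMMARY_MODEL,
        f"Using these key points:\n{key_points}\n\nWrite a detailed summary of:\n\n{article_text}",
    )

    # 3. Critique: a different model reviews the draft for errors and omissions.
    critique = _chat(
        CRITIQUE_MODEL,
        f"Critique this summary for inaccuracies, omissions, and vagueness.\n\n"
        f"Article:\n{article_text}\n\nSummary:\n{draft}",
    )

    # 4. Refine: incorporate the critique into the final summary.
    return _chat(
        SUMMARY_MODEL,
        f"Revise the summary to address this critique.\n\nSummary:\n{draft}\n\nCritique:\n{critique}",
    )
```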
This step saves three output files:
- search_queries.json
- raw_exa_results.json
- summarized_articles.json (most important - sets the stage for subsequent outlining and drafting phases)
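For the search side, run_research_tasks might gather its source articles along the following lines. This assumes the exa_py client, an EXA_API_KEY environment variable, and the research_brief fields from the Interactor sketch; the full function would then call summarize_single_article on each article and save summarized_articles.json.

```python
# run_research_tasks sketch (search portion only) — "General + Specific" queries via Exa.
# Assumptions: the exa_py client, an EXA_API_KEY environment variable, and the
# research_brief fields from the Interactor sketch.
import os

from exa_py import Exa

exa = Exa(api_key=os.environ["EXA_API_KEY"])


def gather_source_articles(research_brief: dict) -> list[dict]:
    # One broad query from the topic, plus one specific query per Q&A pair.
    topic = research_brief["topic"]
    queries = [topic]
    for question, answer in zip(research_brief["questions"], research_brief["answers"]):
        queries.append(f"{topic} {question} {answer}")

    articles = []
    for query in queries:  # 4 queries x 3 results = 12 articles
        response = exa.search_and_contents(query, num_results=3, text=True)
        for result in response.results:
            articles.append({"title": result.title, "url": result.url, "text": result.text})

    # run_research_tasks would then call summarize_single_article() on each
    # article and save the results to summarized_articles.json (omitted here).
    return articles
```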
Step 3: Creating a Structured Outline
Phase 3 transforms the collection of detailed summaries into a single, coherent outline for the final report. This process is managed by the create_report_outline(summaries) function within 3_outliner.py.
The script operates as follows:
- It begins by loading the article summaries from summarized_articles.json.
- The summaries are combined into a single context and sent to the LLM, which is prompted to synthesize the information and organize it into a logical narrative flow.
- To ensure a predictable and usable structure, the model’s output is constrained by a detailed JSON schema.
The output is saved as report_outline.json, a structured file that contains the blueprint for the report, including a title, introduction, body sections, conclusion, and bullet points. Each bullet point is paired with a list of source indices, ensuring every claim is traceable back to its original source material. This file acts as the primary input for the next agent, the Writer.
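A sketch of how create_report_outline could enforce that structure. It assumes the Cerebras client accepts an OpenAI-style response_format with a JSON schema (if not, the schema can simply be embedded in the prompt), and the abbreviated schema and summary fields shown here are assumptions rather than the repo’s exact definitions.

```python
# create_report_outline sketch — constrain the outline to a JSON schema.
# Assumptions: an OpenAI-style response_format parameter, an abbreviated schema,
# and title/summary fields in the summaries; the repo's definitions may differ.
import json

from cerebras_client import get_client

OUTLINE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "introduction": {"type": "string"},
        "sections": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "heading": {"type": "string"},
                    "bullet_points": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "point": {"type": "string"},
                                "source_indices": {"type": "array", "items": {"type": "integer"}},
                            },
                        },
                    },
                },
            },
        },
        "conclusion": {"type": "string"},
    },
}


def create_report_outline(summaries: list[dict]) -> dict:
    context = "\n\n".join(
        f"[Source {i + 1}] {s['title']}\n{s['summary']}" for i, s in enumerate(summaries)
    )
    response = get_client().chat.completions.create(
        model="llama-3.3-70b",  # assumed model ID
        messages=[{
            "role": "user",
            "content": "Synthesize these summaries into a logical report outline:\n\n" + context,
        }],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "report_outline", "schema": OUTLINE_SCHEMA},
        },
    )
    outline = json.loads(response.choices[0].message.content)
    with open("report_outline.json", "w", encoding="utf-8") as f:
        json.dump(outline, f, indent=2)
    return outline
```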
Step 4: Drafting the Report
Phase 4 transforms the structured outline from the previous step into a complete, narrative-driven report. This task is managed by the write_report_from_outline(outline) function.
The script operates by loading the report_outline.json file and using its contents to construct a detailed prompt. The prompt directs the model to:
- Act as an expert journalist, weaving the outline’s bullet points into a flowing, paragraph-based article.
- Use Markdown for all formatting.
- Preserve the source citations using a strict format: [Source 1] for a single source and [Source 1, 3, 5] for multiple sources.
The output is saved as draft_report.md. This file serves as the near-final draft, containing all the generated text and correctly formatted citation placeholders, ready for the final processing step.
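A minimal sketch of the Writer, again assuming the shared Cerebras client and the outline fields from the previous sketch; the prompt wording is paraphrased from the instructions above.

```python
# write_report_from_outline sketch — turn the outline into a Markdown draft.
# Illustrative only: the prompt is paraphrased and the model ID is assumed.
import json

from cerebras_client import get_client


def write_report_from_outline(outline: dict) -> str:
    prompt = (
        "You are an expert journalist. Expand the following outline into a flowing, "
        "paragraph-based article. Use Markdown for all formatting, and preserve the "
        "source citations exactly as given, e.g. [Source 1] or [Source 1, 3, 5].\n\n"
        + json.dumps(outline, indent=2)
    )
    response = get_client().chat.completions.create(
        model="llama-3.3-70b",  # assumed model ID
        messages=[{"role": "user", "content": prompt}],
    )
    draft = response.choices[0].message.content
    with open("draft_report.md", "w", encoding="utf-8") as f:
        f.write(draft)
    return draft
```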
Step 5: Citation Management and Finalization
The final step, executed by citation_manager.py, polishes the report by formatting citations and appending a complete reference list. This entire process is handled by the create_final_report() function.
The script operates as follows:
- It loads the two required inputs: draft_report.md, which contains the text with [Source X] placeholders, and summarized_articles.json, which holds the metadata for each source.
- It parses the draft to find every unique source that was cited, regardless of where it appears.
- It re-numbers the sources to ensure they appear sequentially (1, 2, 3…) in the final document. The original [Source X] placeholders are replaced with these new, ordered numbers.
- A “References” section is generated in Markdown, listing each unique source with its title and URL (see the sketch below).
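A sketch of that post-processing, assuming the [Source X] placeholder format described above, 1-based source indices, and title/url fields in summarized_articles.json; the exact citation and reference styling in the repo may differ.

```python
# create_final_report sketch — renumber [Source X] citations and append references.
# Assumptions: the placeholder format described above, 1-based source indices,
# and title/url keys in summarized_articles.json.
import json
import re

PLACEHOLDER = re.compile(r"\[Source ([\d,\s]+)\]")


def create_final_report() -> None:
    with open("draft_report.md", encoding="utf-8") as f:
        draft = f.read()
    with open("summarized_articles.json", encoding="utf-8") as f:
        sources = json.load(f)

    # Assign new sequential numbers in order of first appearance in the draft.
    new_numbers: dict[int, int] = {}
    for match in PLACEHOLDER.finditer(draft):
        for old in (int(n) for n in match.group(1).split(",")):
            new_numbers.setdefault(old, len(new_numbers) + 1)

    # Replace each placeholder with its renumbered form, e.g. [1, 3].
    def renumber(m: re.Match) -> str:
        olds = (int(n) for n in m.group(1).split(","))
        return "[" + ", ".join(str(new_numbers[o]) for o in olds) + "]"

    final_text = PLACEHOLDER.sub(renumber, draft)

    # Build the References section from the cited sources only.
    reference_lines = ["## References", ""]
    for old, new in sorted(new_numbers.items(), key=lambda kv: kv[1]):
        src = sources[old - 1]  # assumes 1-based [Source X] indices
        reference_lines.append(f"{new}. {src['title']} - {src['url']}")

    with open("final_report.md", "w", encoding="utf-8") as f:
        f.write(final_text.rstrip() + "\n\n" + "\n".join(reference_lines) + "\n")
```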
The fully processed text and the new reference list are then saved together as final_report.md, the completed output of the pipeline.
Conclusion
This tutorial demonstrated a five-phase pipeline that automates the creation of a comprehensive, cited news report from a single user topic. By breaking the process into distinct, manageable steps—from user interaction to final citation management—the system ensures a high-quality and coherent output.
A core takeaway is the power of iterative refinement in the research phase. The “Reflect, Elaborate, Critique, Refine” summarization loop significantly enhances the accuracy and depth of the generated summaries. This technique, which involves using a second model to provide critical feedback, improves the quality of the final content without requiring expensive model fine-tuning.
Such advanced, multi-step agentic workflows are only practical with access to high-speed inference. The research phase alone requires nearly 50 sequential LLM calls to process all the articles. Traditional inference speeds would introduce significant latency, making this iterative approach impractical. Fast, low-latency inference is the enabling technology that allows for the development of more sophisticated and reliable AI agents.
While this pipeline focused on generating a news report, the “generate, critique, refine” pattern is a versatile technique applicable to numerous AI agentic workflows:
- Code Generation: An agent could write a function, a second agent could critique it for bugs and style, and a third could implement the suggested fixes.
- Strategic Planning: An agent could draft a business plan, a critique agent could identify potential risks or logical gaps, and a refine agent could create a more robust final strategy.
- Creative Writing: An agent could write a chapter of a story, a critique agent could check for plot holes or inconsistent character voices, and a refine agent could rewrite the section to address the feedback.