This Thing Can Actually Cut RAG Token Usage by 70%? Unbelievable! Demystifying the Magic of RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)
You have probably heard the term RAG (Retrieval Augmented Generation) plenty of times by now. It works like a super-brain: by retrieving external knowledge, it lets a large language model (LLM) answer questions it never "saw" during training. From simple Q&A bots to complex knowledge-base assistants, RAG is everywhere.
However, RAG is far from perfect. As the knowledge base you feed it grows, the problems start piling up:
- With so much information in the knowledge base, how do you pinpoint the handful of passages that are truly relevant?
- What about "context" that looks unrelated but is actually tightly connected?
- When different people describe the same concept in different ways, how do you get RAG to "recognize" that they are the same thing?
We usually attack these problems with a stack of optimization tricks: query transformation, reranking models, multi-path retrieval, and so on. But every extra layer makes the system more complex and adds more LLM calls, until the whole architecture feels like a wobbly tower of building blocks.
So is there a way to fix the problem at the source, one that adds no query-time complexity and instead makes the knowledge base itself "smarter"?
That is exactly the core idea behind the RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) pipeline. It does not change how you query; instead, it builds a hierarchical index so that the knowledge base, much like a human brain, understands and organizes information layer by layer, from fine details up to high-level concepts.
How good is it? It can substantially improve retrieval quality without adding any query-time complexity, and it may even let your RAG system use up to 70% fewer tokens, saving both compute and money.
Today we take a deep dive into RAPTOR and see how its clever "recursive abstraction" process builds a "smart" knowledge index that upends traditional RAG.
1. RAPTOR: From "Leaves" to "Trunk", Building a Smarter Knowledge Base
Put simply, RAPTOR acts like a librarian, but one who does more than pile books (documents) together: it first takes the books apart, groups the pages by topic, writes a summary for each topic, and finally files those summaries together with the original pages into one "digest collection".
The process breaks down into a few core steps:
- Create "leaf nodes": first, split all the raw documents into small, detailed text chunks. These chunks are the "leaves" of a big tree, the most basic units of knowledge. Traditional RAG retrieval usually stops here, embedding and searching these "leaves" directly.
- Cluster and abstract: next, use a machine-learning clustering algorithm to automatically group semantically related "leaves". For example, every leaf that discusses "model training arguments" ends up in the same group.
- Summarize with an LLM: use an LLM to generate a concise, high-quality summary for each cluster. These summaries become the next, higher level of "branches" in the knowledge tree; each one captures the core ideas of its group of "leaves".
- Recurse upward: repeat steps 2 and 3, clustering and summarizing the newly generated summaries, building upward until you reach a "root node" that represents the highest-level concepts of the entire corpus.
- Index everything: finally, index all of it, the original "leaf nodes" plus every generated summary, into a single vector database, so that retrieval searches a "multi-resolution" knowledge base.
This bottom-up construction lets the RAPTOR pipeline handle queries at different granularities. For a specific question, it can retrieve the exact "leaves"; for a big-picture question, it can retrieve the high-level summaries directly, avoiding the classic problem of seeing the leaves but missing the forest.
Next, we explore RAPTOR's strengths through a real case: building a knowledge base from Hugging Face's official documentation.
2. Setting Up the RAG Environment: Preparation
To evaluate RAPTOR fairly, we use an older, quantized model released about a year ago rather than the newest LLM. The point is to make sure the evaluation really tests retrieval quality, not whether the model already "knows" the answer.
First, let's import the required libraries and configure the LLM and the embedding model.
# Import the core PyTorch library for tensor operations
import torch
# Import LangChain's wrappers for Hugging Face models
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFacePipeline
# Import core components from the transformers library for model loading and configuration
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
# Import LangChain's tools for prompt engineering and output handling
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
We will use sentence-transformers/all-MiniLM-L6-v2 as the embedding model: it is lightweight and efficient, which makes it a good fit for indexing documents at scale.
# --- Configure Embedding Model ---
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
# Use GPU if available, otherwise fall back to CPU
model_kwargs = {"device": "cuda" if torch.cuda.is_available() else "cpu"}
# Initialize embeddings with LangChain's wrapper
embeddings = HuggingFaceEmbeddings(
model_name=embedding_model_name,
model_kwargs=model_kwargs
)
For text generation we choose Mistral-7B-Instruct-v0.2, a powerful yet compact instruction-tuned model. To run it on hardware with limited VRAM, we load it with 4-bit quantization.
# --- Configure LLM for Summarization and Generation ---
llm_id = "mistralai/Mistral-7B-Instruct-v0.2"
# Quantization: reduces memory footprint while preserving performance
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4"
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_id)
# Load LLM with quantization
model = AutoModelForCausalLM.from_pretrained(
llm_id,
torch_dtype=torch.float16,
device_map="auto",
quantization_config=quantization_config
)
With the model and tokenizer loaded, we wrap them in a Hugging Face pipeline for text generation.
# Create a text-generation pipeline using the loaded model and tokenizer.
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=512 # Controls the max length of the generated summaries and answers
)
# Wrap pipeline for LangChain compatibility
llm = HuggingFacePipeline(pipeline=pipe)
At this point the two core components of the RAG pipeline are configured (an optional sanity check is sketched below); next, we prepare the knowledge base used for testing.
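Before moving on, a quick sanity check, not part of the original walkthrough, can confirm that both components respond; embed_query and invoke are standard LangChain methods on these wrappers:
# Optional smoke test (illustrative): verify the embedding model and the LLM both work.
vec = embeddings.embed_query("What is a transformer model?")
print(f"Embedding dimension: {len(vec)}")  # all-MiniLM-L6-v2 produces 384-dimensional vectors
print(llm.invoke("In one sentence, what does a tokenizer do?"))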
3. Data Preparation: Scraping and Analyzing the Hugging Face Docs

To really show off RAPTOR's strengths, we need a complex, challenging knowledge base. We chose to scrape the official Hugging Face documentation because it is full of overlapping information and subtle distinctions.
For example, Hugging Face describes saving a ZeRO-3 checkpoint in several different ways: trainer.save_model(), unwrap_model().save_pretrained(), and zero_to_fp32(). All of them point to the same underlying concept, consolidating the model shards into one complete checkpoint. A simple RAG pipeline may retrieve only one of these variants and therefore return incomplete information.
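To make the overlap concrete, here is a hedged sketch of how those three variants typically appear in the docs; the trainer, accelerator, and model objects and the paths are illustrative assumptions, not code from this article:
# Variant 1: via the Trainer API (assumes an existing `trainer`)
trainer.save_model("my_checkpoint")
# Variant 2: via Accelerate, unwrapping the sharded model before saving
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("my_checkpoint")
# Variant 3: offline, consolidating ZeRO-3 shards with DeepSpeed's helper script
# python zero_to_fp32.py my_checkpoint/ pytorch_model.bin
All three end up with a usable full checkpoint, which is exactly why a retriever that finds only one of them gives an incomplete answer.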
We will scrape the documentation content of the following five core guides:
# Define the documentation sections to scrape, with varying crawl depths.
urls_to_load = [
{"url": "https://huggingface.co/docs/transformers/index", "max_depth": 3},
{"url": "https://huggingface.co/docs/datasets/index", "max_depth": 2},
{"url": "https://huggingface.co/docs/tokenizers/index", "max_depth": 2},
{"url": "https://huggingface.co/docs/peft/index", "max_depth": 1},
{"url": "https://huggingface.co/docs/accelerate/index", "max_depth": 1}
]
We use RecursiveUrlLoader together with BeautifulSoup to scrape the content:
from langchain_community.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup as Soup
# Empty list to append components
docs = []
# Iterate through the list and crawl each documentation section.
for item in urls_to_load:
# Initialize the loader with the specific URL and parameters.
loader = RecursiveUrlLoader(
url=item["url"],
max_depth=item["max_depth"],
extractor=lambda x: Soup(x, "html.parser").text, # Use BeautifulSoup to extract text
prevent_outside=True, # Ensure we stay within the documentation pages
use_async=True, # Use asynchronous requests for faster crawling
timeout=600, # Set a generous timeout for slow pages
)
# Load the documents and add them to our master list.
loaded_docs = loader.load()
docs.extend(loaded_docs)
print(f"Loaded {len(loaded_docs)} documents from {item['url']}")運(yùn)行后,我們得到了145個(gè)文檔,總計(jì)??312,566??個(gè)Token。對(duì)文檔的Token分布進(jìn)行分析后,我們發(fā)現(xiàn)很多文檔都非常長(zhǎng)(最大達(dá)到12,453個(gè)Token),這表明需要進(jìn)行合理的分塊(chunking)。
import numpy as np
# We need a consistent way to count tokens, using the LLM's tokenizer is the most accurate method.
def count_tokens(text: str) -> int:
"""Counts the number of tokens in a text using the configured tokenizer."""
# Ensure text is not None and is a string
if not isinstance(text, str):
return 0
return len(tokenizer.encode(text))
# Extract the text content from the loaded LangChain Document objects
docs_texts = [d.page_content for d in docs]
# Calculate token counts for each document
token_counts = [count_tokens(text) for text in docs_texts]
# Print statistics to understand the document size distribution
print(f"Total documents: {len(docs_texts)}")
print(f"Total tokens in corpus: {np.sum(token_counts)}")
print(f"Average tokens per document: {np.mean(token_counts):.2f}")
print(f"Min tokens in a document: {np.min(token_counts)}")
print(f"Max tokens in a document: {np.max(token_counts)}")
# Output
# Total documents: 145
# Total tokens in corpus: 312566
# Average tokens per document: 2155.59
# Min tokens in a document: 312
# Max tokens in a document: 12453
Judging from the token-distribution histogram, the median document length sits at roughly 1,000 tokens, so we set chunk_size to 1000.
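The article refers to a histogram without showing the plotting code; here is a minimal sketch that reproduces it from the token_counts computed above (matplotlib is an assumed extra dependency, not imported elsewhere in this walkthrough):
import matplotlib.pyplot as plt
# The median is what motivates the chunk_size of 1000 tokens used below.
print(f"Median tokens per document: {np.median(token_counts):.0f}")
plt.hist(token_counts, bins=30)
plt.xlabel("Tokens per document")
plt.ylabel("Number of documents")
plt.title("Token distribution of the scraped Hugging Face docs")
plt.show()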
4. The Weak Spot of Plain RAG: Why Does It Miss the Point?
To demonstrate RAPTOR's advantage we need a baseline: the most basic, unoptimized RAG system. It uses the same models and the same knowledge base as RAPTOR; the only difference is that it indexes nothing but the raw document chunks (the "leaf nodes" described earlier).

First, we use RecursiveCharacterTextSplitter to split the documents into leaf nodes:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# We join all the documents into a single string for more efficient processing.
# The '---' separator helps maintain document boundaries if needed later.
concatenated_content = "\n\n --- \n\n".join(docs_texts)
# Create the text splitter using our LLM's tokenizer for accuracy.
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
tokenizer=tokenizer,
chunk_size=1000, # The max number of tokens in a chunk
chunk_overlap=100  # The number of tokens to overlap between chunks
)
# Split the text into chunks, which will be our leaf nodes.
leaf_texts = text_splitter.split_text(concatenated_content)
print(f"Created {len(leaf_texts)} leaf nodes (chunks) for the RAPTOR tree.")
# Output
# Created 412 leaf nodes (chunks) for the RAPTOR tree.
Next, we build a simple RAG pipeline and use FAISS for vector storage.
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnablePassthrough
# In a simple RAG, the vector store is built only on the leaf-level chunks.
vectorstore_normal = FAISS.from_texts(
texts=leaf_texts,
embedding=embeddings
)
# Create a retriever from this vector store that fetches the top 5 results.
retriever_normal = vectorstore_normal.as_retriever(
search_kwargs={'k': 5}
)
print(f"Built Simple RAG vector store with {len(leaf_texts)} documents.")
# Output
# Built Simple RAG vector store with 412 documents.
Now we assemble the full RAG chain and ask it a broad, conceptual question:
# This prompt template instructs the LLM to answer based ONLY on the provided context.
final_prompt_text = """You are an expert assistant for the Hugging Face ecosystem.
Answer the user's question based ONLY on the following context. If the context does not contain the answer, state that you don't know.
CONTEXT:
{context}
QUESTION:
{question}
ANSWER:"""
final_prompt = ChatPromptTemplate.from_template(final_prompt_text)
# A helper function to format the retrieved documents.
def format_docs(docs):
return"\n\n".join(doc.page_content for doc in docs)
# Construct the RAG chain for the simple approach.
rag_chain_normal = (
{"context": retriever_normal | format_docs, "question": RunnablePassthrough()}
| final_prompt
| llm
| StrOutputParser()
)
# Let's ask a broad, conceptual question.
question = "What is the core philosophy of the Hugging Face ecosystem?"
answer = rag_chain_normal.invoke(question)
print(f"Question: {question}\n")
print(f"Answer: {answer}")我們來(lái)看看這個(gè)簡(jiǎn)單RAG的輸出:
Question: What is the core philosophy of the Hugging Face ecosystem?
Answer: The Hugging Face ecosystem is built around the `transformers`
library, which provides APIs to easily download and use pretrained models.
The core idea is to make these models accessible. For example, the `pipeline`
function is a key part of this, offering a simple way to use models for
inference. It also includes libraries like `datasets` for data loading and
`accelerate` for training.
The answer is not wrong, but it feels fragmented and lacks any sense of the whole: a pile of "correct" facts stitched together with no grasp of the larger story behind them. This is the classic "lost in the details" failure. The retriever latched onto keywords but missed the core idea, and that is precisely the pain point RAPTOR sets out to fix.
5. Inside RAPTOR's "Brain": The Hierarchical Clustering Engine

RAPTOR's magic lies in organizing scattered "leaves" into meaningful "branches", and that organization depends on a carefully designed hierarchical clustering engine built from three core components:
1. UMAP: Reduce Dimensions to See the Data's True Shape

Our text embeddings live in a high-dimensional space (384 dimensions here). In that space, data points crowd together and their real relationships are hard to see; this is the so-called "curse of dimensionality".

UMAP (Uniform Manifold Approximation and Projection) acts like a powerful lens: it projects the high-dimensional data down to a much lower dimension (say 10) while preserving the semantic relationships between points as much as possible. It is like turning a complicated 3D map into a clear flat map, so the clustering algorithm can more easily recognize the data's "shape" and natural groupings.
from typing import Dict, List, Optional, Tuple
import numpy as np
import pandas as pd
import umap
from sklearn.mixture import GaussianMixture
RANDOM_SEED = 42
def global_cluster_embeddings(embeddings: np.ndarray, dim: int, n_neighbors: Optional[int] = None, metric: str = "cosine") -> np.ndarray:
"""Perform global dimensionality reduction on the embeddings using UMAP."""
# Heuristically set n_neighbors if not provided
if n_neighbors is None:
n_neighbors = int((len(embeddings) - 1) ** 0.5)
# Return the UMAP-transformed embeddings
return umap.UMAP(
n_neighbors=n_neighbors,
n_components=dim,
metric=metric,
random_state=RANDOM_SEED
).fit_transform(embeddings)
2. GMM + BIC: Let the Data Decide How Many Groups

Should the documents be split into 5, 10, or 50 topics? Plucking a number (the K value) out of thin air is clearly not a reasonable approach.

RAPTOR takes a more principled route: it uses a GMM (Gaussian Mixture Model) together with the BIC (Bayesian Information Criterion).
The process works like an experiment:
- Step 1: try fitting the data with different numbers of clusters (say from 1 to 50).
- Step 2: after each fit, compute the BIC score. BIC rewards how well the model fits the data while penalizing overly complex models (i.e., penalizing the number of clusters).
- Step 3: pick the cluster count with the lowest BIC score, because that is where the model strikes the best balance between goodness of fit and simplicity.
def get_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 50) -> int:
"""Determine the optimal number of clusters using the Bayesian Information Criterion (BIC)."""
# Limit the max number of clusters to be less than the number of data points
max_clusters = min(max_clusters, len(embeddings))
# If there's only one point, there can only be one cluster
if max_clusters <= 1:
return 1
# Test different numbers of clusters from 1 to max_clusters
n_clusters_range = np.arange(1, max_clusters)
bics = []
for n in n_clusters_range:
# Initialize and fit the GMM for the current number of clusters
gmm = GaussianMixture(n_components=n, random_state=RANDOM_SEED)
gmm.fit(embeddings)
# Calculate and store the BIC for the current model
bics.append(gmm.bic(embeddings))
# Return the number of clusters that resulted in the lowest BIC score
return n_clusters_range[np.argmin(bics)]
3. GMM: Probabilistic Soft Assignment, So a Chunk Can "Belong to Two Camps"

Traditional clustering algorithms such as K-Means do "hard clustering": every data point is forced into exactly one cluster. In reality, though, a document chunk might discuss both "model training" and "data preprocessing", and it should belong to both topics.

A GMM enables "soft clustering". Instead of assigning points outright, it computes the probability that each chunk belongs to each cluster. By setting a probability threshold, we can let a single chunk be assigned to several related clusters at once, which neatly models how pieces of knowledge overlap and interconnect.
def GMM_cluster(embeddings: np.ndarray, threshold: float) -> Tuple[List[np.ndarray], int]:
"""Cluster embeddings using a GMM and a probability threshold."""
# Find the optimal number of clusters for this set of embeddings
n_clusters = get_optimal_clusters(embeddings)
# Fit the GMM with the optimal number of clusters
gmm = GaussianMixture(n_components=n_clusters, random_state=RANDOM_SEED)
gmm.fit(embeddings)
# Get the probability of each point belonging to each cluster
probs = gmm.predict_proba(embeddings)
# Assign a point to a cluster if its probability is above the threshold
# A single point can be assigned to multiple clusters.
labels = [np.where(prob > threshold)[0] for prob in probs]
return labels, n_clusters
By weaving UMAP, BIC, and GMM together, this hierarchical clustering engine ensures that RAPTOR can understand and organize knowledge accurately and in depth.
6. Building and Running the RAPTOR Tree: From Theory to Practice

With this powerful clustering engine in hand, we can apply it to the RAPTOR build process, which happens in two stages:
- Global clustering: first, reduce and cluster all 412 leaf nodes in one pass. The goal is to surface the highest-level topics in the corpus, such as "the Transformers library", "the Datasets library", and "training and optimization".
- Local clustering: next, we "zoom in" on each global cluster. Inside the "training and optimization" cluster, for instance, we reduce and cluster its documents again, this time uncovering finer-grained sub-topics such as "PEFT", "Accelerate", and "Trainer arguments".
This "overview first, details later" RAPTOR strategy mirrors how people think: establish the macro framework of the knowledge first, then fill in the details.
We write a function named perform_clustering that ties all the steps above together and implements this layered clustering logic.
def perform_clustering(embeddings: np.ndarray, dim: int = 10, threshold: float = 0.1) -> List[np.ndarray]:
"""Perform hierarchical clustering (global and local) on the embeddings."""
# Handle cases with very few documents to avoid errors during dimensionality reduction.
if len(embeddings) <= dim + 1:
return [np.array([0]) for _ in range(len(embeddings))]
# --- Global Clustering Stage ---
# First, reduce the dimensionality of all embeddings globally.
reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
# Then, perform GMM clustering on the reduced-dimensional data.
global_clusters, n_global_clusters = GMM_cluster(reduced_embeddings_global, threshold)
# --- Local Clustering Stage ---
# Initialize a list to hold all final local cluster assignments for each document.
all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
# Keep track of the total number of clusters found so far to ensure unique IDs.
total_clusters = 0
# Iterate through each global cluster to find sub-clusters.
for i in range(n_global_clusters):
# Get all original indices for embeddings that are part of the current global cluster.
global_cluster_indices = [idx for idx, gc in enumerate(global_clusters) if i in gc]
if not global_cluster_indices:
continue
# Get the actual embeddings for this global cluster.
global_cluster_embeddings_ = embeddings[global_cluster_indices]
# Perform local clustering on this subset of embeddings.
if len(global_cluster_embeddings_) <= dim + 1:
local_clusters, n_local_clusters = ([np.array([0])] * len(global_cluster_embeddings_)), 1
else:
# We don't need a separate 'local_cluster_embeddings' function.
# The global one works, as it adapts n_neighbors to the input
reduced_embeddings_local = global_cluster_embeddings(global_cluster_embeddings_, dim)
local_clusters, n_local_clusters = GMM_cluster(reduced_embeddings_local, threshold)
# Map the local cluster IDs back to the original document indices.
for original_idx, local_cluster_ids in zip(global_cluster_indices, local_clusters):
# We add 'total_clusters' to ensure each cluster ID is globally unique.
all_local_clusters[original_idx] = local_cluster_ids + total_clusters
total_clusters += n_local_clusters
# Return the final, globally unique cluster assignments for each document.
return all_local_clusters
Finally, we index everything, the original leaf nodes plus the summaries from every level, into one and the same vector database; a hedged sketch of this build-and-index step follows the list below. When a user then asks something like "How do I do distributed training with the PEFT and Accelerate libraries?", the RAPTOR retriever will:
- retrieve the summaries that carry the high-level concept of "distributed training";
- and, at the same time, retrieve the concrete code examples and API descriptions related to PEFT and Accelerate (i.e., the leaf nodes).
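The article does not show the recursive build itself, so here is a minimal sketch under stated assumptions: embed_texts, summary_chain, build_raptor_levels, and the three-level cap are illustrative names and choices rather than the original implementation; it reuses the embeddings, llm, perform_clustering, and leaf_texts defined above.
def embed_texts(texts):
    # Assumption: reuse the LangChain embedding model configured earlier.
    return np.array(embeddings.embed_documents(texts))

summary_prompt = ChatPromptTemplate.from_template(
    "Summarize the following documentation passages into one concise, faithful paragraph:\n\n{context}"
)
summary_chain = summary_prompt | llm | StrOutputParser()

def build_raptor_levels(texts, n_levels=3):
    """Recursively embed, cluster, and summarize texts; returns {level: [summaries]}."""
    all_summaries = {}
    current_texts = texts
    for level in range(1, n_levels + 1):
        if len(current_texts) <= 1:
            break  # reached the root
        vecs = embed_texts(current_texts)
        cluster_labels = perform_clustering(vecs)  # soft cluster IDs per text
        # Group texts by cluster ID; a single text may land in several clusters.
        clusters = {}
        for text, labels in zip(current_texts, cluster_labels):
            for c in labels:
                clusters.setdefault(int(c), []).append(text)
        # Summarize each cluster with the LLM; in practice the joined context
        # should be truncated to fit the model's context window.
        summaries = [
            summary_chain.invoke({"context": "\n\n".join(members)})
            for members in clusters.values()
        ]
        all_summaries[level] = summaries
        current_texts = summaries  # the summaries become the next level's input
    return all_summaries

raptor_summaries = build_raptor_levels(leaf_texts)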

This pairing of "high-level overview" and "low-level detail" lets RAPTOR hand the LLM a context that has both the big picture and the specifics. Given such a context, the LLM can far more easily produce an answer that is well structured, complete, and coherent, rather than a patchwork of scattered facts.
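To complete the picture, here is a hedged sketch of indexing every level together and querying it; vectorstore_raptor and rag_chain_raptor are illustrative names, the raptor_summaries come from the sketch above, and the query-time chain is deliberately identical to the simple RAG chain, since RAPTOR only changes the index.
# Index the leaf chunks and all generated summaries in a single FAISS store.
all_texts = list(leaf_texts)
for level_summaries in raptor_summaries.values():
    all_texts.extend(level_summaries)

vectorstore_raptor = FAISS.from_texts(texts=all_texts, embedding=embeddings)
retriever_raptor = vectorstore_raptor.as_retriever(search_kwargs={"k": 5})

# Query-time chain: same prompt, same LLM; only the retriever (the index) changed.
rag_chain_raptor = (
    {"context": retriever_raptor | format_docs, "question": RunnablePassthrough()}
    | final_prompt
    | llm
    | StrOutputParser()
)

print(rag_chain_raptor.invoke("What is the core philosophy of the Hugging Face ecosystem?"))
The design choice worth noting is that no query-time machinery was added at all: the richer, multi-resolution index does the heavy lifting.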
7. Outlook and Summary: A Revolution in RAG
The RAPTOR pipeline represents an important trend in the RAG space: moving from plain "retrieval" toward deeper "knowledge organization". It tells us that merely chunking and embedding documents is not enough; like humans, we must actively classify, summarize, and hierarchically organize knowledge to truly unlock the potential of LLMs.
RAPTOR does require extra computation at index-build time, including multiple LLM calls for summarization and a fairly involved clustering procedure, but the payoff is large: at query time it can sharply reduce the number of chunks that need to be retrieved, saving a great deal of token consumption while delivering better answers. Over the long run in production, that means higher efficiency at lower cost.
RAPTOR's success also leaves us with more to think about:
- Can this idea of "recursive abstraction" be applied to other domains? For example, could it organize a company's internal document base so that new employees can quickly find everything from high-level strategy down to concrete procedures?
- Can the RAPTOR pipeline be combined with other RAG optimizations (such as query transformation and reranking) to create an even more powerful "super RAG"?
Without a doubt, RAPTOR is more than a technique; it is a new way of thinking that is shaping the next generation of RAG.
What changes do you think RAPTOR will bring to your own work or studies?
This article is reposted from Halo咯咯. Author: 基咯咯.