LlamaIndex Study Notes: Feeding Data with IngestionPipeline

These notes are built on examples from the official LlamaIndex documentation, which I excerpted, reimplemented, and traced through.

The source code for this experiment is at https://github.com/unclefomotw/llamaindex-try/blob/main/src/rag_8.py

To load data into a database offline with LlamaIndex (that is, data ingestion), you can use the IngestionPipeline that LlamaIndex provides and store the results in an external vector DB.

The simplest code snippet looks like this:

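    # Imports assume the llama-index >= 0.10 package layout
    from llama_index.core import Settings
    from llama_index.core.ingestion import IngestionPipeline
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.vector_stores.qdrant import QdrantVectorStore
    from qdrant_client import QdrantClient

    # Example: a transformation that splits documents into chunks / nodes
    # It exists purely for retrieval quality; you can have any number of transformations
    node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=128)

    # The embed model is also a kind of transformation (it adds an embedding to each node)
    # Setting it globally is only preparation for the later online queries
    # The model can come from OpenAI or from HuggingFace; it makes no difference here
    Settings.embed_model = HuggingFaceEmbedding(model_name="DMetaSoul/Dmeta-embedding-zh-small")

    # Put the transformations in a list, in the order you want them to run (embedding included!)
    transformations = [
        node_parser,
        Settings.embed_model
    ]

    # Load the documents, e.g. with SimpleDirectoryReader and load_data
    # (my_load_doc is a placeholder for your own loader)
    documents = my_load_doc()

    # Set up the vector database
    db_client = QdrantClient(host="localhost", port=6333)
    vector_store = QdrantVectorStore(
        client=db_client, collection_name="my_collection",
    )

    # Finally, the main event: feed in the data
    pipeline = IngestionPipeline(
        transformations=transformations,
        vector_store=vector_store
    )
    pipeline.run(documents=documents)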

This has the same effect as the Offline part of the earlier post, but it is more declarative and easier to read:

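IngestionPipeline is a self-contained object; everything related to ingestion is configured there.
All the transformations you want are specified together in one place.
It explicitly names a vector "store" rather than relying on a LlamaIndex "index" (that belongs to the online side).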
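
To make the "ordered list of transformations" idea concrete, here is a toy sketch in plain Python. It is not how LlamaIndex implements IngestionPipeline; `Node`, `split_by_size`, `toy_embed`, and `run_pipeline` are hypothetical names invented for illustration, and the "embedding" is a fake two-number vector. It only shows the shape of the process: each transformation takes nodes and returns nodes, they run in list order, and the embedded nodes end up in the store.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """Toy stand-in for a node: a chunk of text plus an optional embedding."""
    text: str
    embedding: Optional[List[float]] = None


# A transformation maps a list of nodes to a new list of nodes.
Transformation = Callable[[List[Node]], List[Node]]


def split_by_size(chunk_size: int) -> Transformation:
    """Hypothetical splitter: cut each node's text into fixed-size character chunks."""
    def transform(nodes: List[Node]) -> List[Node]:
        return [
            Node(text=n.text[i:i + chunk_size])
            for n in nodes
            for i in range(0, len(n.text), chunk_size)
        ]
    return transform


def toy_embed(nodes: List[Node]) -> List[Node]:
    """Hypothetical embedder: a fake vector of (text length, vowel count)."""
    return [
        Node(
            text=n.text,
            embedding=[float(len(n.text)), float(sum(c in "aeiou" for c in n.text))],
        )
        for n in nodes
    ]


def run_pipeline(
    texts: List[str],
    transformations: List[Transformation],
    store: List[Node],
) -> List[Node]:
    """Apply the transformations in order, then save embedded nodes to the store."""
    nodes = [Node(text=t) for t in texts]
    for transform in transformations:
        nodes = transform(nodes)
    store.extend(n for n in nodes if n.embedding is not None)
    return nodes


store: List[Node] = []
nodes = run_pipeline(["hello world"], [split_by_size(5), toy_embed], store)
```

Order matters here for the same reason it does in the real pipeline: if the embedder ran before the splitter, the chunks produced afterwards would carry no embedding and nothing would be stored.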

Online querying and chat work the same way as in the earlier post: use VectorStoreIndex.from_vector_store(vector_store).
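
As a rough sketch of the online side, the helper below wraps that call and returns a retriever. `build_retriever` and its `top_k` parameter are hypothetical names of mine, not part of LlamaIndex. It assumes llama-index >= 0.10 is installed and that `vector_store` is the same store the ingestion step wrote to. The import sits inside the function only so that the snippet can be loaded without LlamaIndex present.

```python
def build_retriever(vector_store, embed_model=None, top_k=3):
    """Build a retriever on top of an already-populated vector store.

    Assumption: if embed_model is None, LlamaIndex falls back to
    Settings.embed_model, which should be the same model used at ingestion.
    """
    # Imported lazily so this module loads even where llama_index is absent.
    from llama_index.core import VectorStoreIndex

    index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
    return index.as_retriever(similarity_top_k=top_k)
```

With a live store, usage would be along the lines of `build_retriever(vector_store).retrieve("your question")`. A retriever needs only the embed model, whereas a query engine or chat engine also needs an LLM to be configured.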

IngestionPipeline also offers more features; see the official documentation:

Cache (local / remote)
Dedup / upsert to handle duplicated or updated documents
Async
Parallel execution via multiprocessing
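
The dedup / upsert item is easier to picture with a small example. The sketch below is plain Python and only illustrates the general idea of remembering a content hash per document id; `content_hash` and `select_docs_to_ingest` are hypothetical names, not LlamaIndex APIs. As far as I understand from the official documentation, the real feature is enabled by attaching a docstore to the pipeline, so treat this as an illustration of the concept rather than of the library's internals.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Fingerprint a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_docs_to_ingest(docs: Dict[str, str], seen_hashes: Dict[str, str]) -> List[str]:
    """Return ids of documents that are new or changed.

    seen_hashes maps doc id -> hash from earlier runs and is updated in place,
    so unchanged documents are skipped on the next call.
    """
    to_ingest = []
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if seen_hashes.get(doc_id) != digest:
            to_ingest.append(doc_id)      # new id, or same id with new content
            seen_hashes[doc_id] = digest  # remember it for the next run
    return to_ingest


seen: Dict[str, str] = {}
first = select_docs_to_ingest({"a": "alpha", "b": "beta"}, seen)
second = select_docs_to_ingest({"a": "alpha", "b": "beta v2", "c": "gamma"}, seen)
```

On the first run both documents are ingested; on the second run only the changed document `b` and the new document `c` are selected, while the unchanged `a` is skipped.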
