

Hacker, Builder, Founder, CocoIndex
Index-as-a-service (or RAG-as-a-service) tends to package a predesigned pipeline and expose two endpoints to users: one to configure the source, and one to read from the index. Many predefined pipelines for unstructured documents do this. Their scope is fairly simple: parse PDFs, perform some chunking and embedding, and dump the results into a vector store. This works well if your requirements are simple and primarily focused on document parsing.
We've talked to many developers across various verticals who need data indexing, and the ability to customize logic is essential for high-quality data retrieval. For example:
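To make the "chunk, embed, store" shape concrete, here is a minimal sketch in plain Python. It is not CocoIndex's API or any particular service's implementation: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, the embedding is a toy hashed bag-of-words standing in for a real embedding model, and the "vector store" is just a list of rows in memory.

```python
import hashlib
import math


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs: dict[str, str]) -> list[dict]:
    """Chunk every document, embed each chunk, and collect rows for a vector store."""
    rows = []
    for doc_id, text in docs.items():
        for chunk_no, chunk in enumerate(chunk_text(text)):
            rows.append({
                "doc_id": doc_id,
                "chunk_no": chunk_no,
                "text": chunk,
                "embedding": toy_embed(chunk),
            })
    return rows


index_rows = build_index({"report-1": "parsed text of a document " * 30})
```

Every choice in this sketch (chunk size, overlap, embedding model, what gets stored per row) is fixed in advance, which is exactly what a predesigned pipeline does for you.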
Basic choices for pipeline components
What should the pipeline do?
What additional work is needed to improve pipeline quality?
Here we'll walk through some examples of index pipeline topologies, and we can explore more in the future!
Basic embedding pipeline
In this example, we do the following:
Anthropic has published a great article about Contextual Retrieval that suggests combining vector-based search with TF-IDF-style keyword search (the article uses BM25, which builds on TF-IDF).
The way to think about a data flow for the pipeline is:
A combination of TF-IDF and vector search
In addition to preparing the vector embeddings as in the basic embedding example above, after parsing the source data we can do the following:
At query time, we can query both the vector index and the keyword index, and combine the results from both.
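One common way to combine the two result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one possible sketch: the function name `reciprocal_rank_fusion`, the constant `k = 60`, and the sample document IDs are all assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each list contributes 1 / (k + rank) for every document it contains,
    so documents ranked highly by both indexes rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results, best match first, from each index.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF only needs ranks, not raw scores, which is convenient here because vector similarity and keyword scores are on different scales.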
Sometimes, you want to enrich your data with metadata looked up from other sources. For example, if we want to create an index on diagnostic reports, which use ICD-10 (International Classification of Diseases, 10th Revision) codes to describe diseases, we can have a pipeline like this:
Simple Data Lookup/Enrichment
In this example, we do the following:
On the first path, build an ICD-10 dictionary by
On the second path, for each report
Now we have a vector index, built based on diagnostic reports enriched with ICD-10 descriptions.
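The lookup step on the second path can be sketched as follows. This is a simplified illustration, not the pipeline's actual implementation: `enrich_report` and `ICD10_PATTERN` are hypothetical names, the regular expression only covers codes shaped like `E11.9` or `I10` (real ICD-10 variants allow more forms), and the two-entry dictionary stands in for the full dictionary built on the first path.

```python
import re

# Simplified pattern (assumption): one letter, two digits, optional ".digits".
ICD10_PATTERN = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,4})?\b")

# Tiny stand-in for the dictionary built on the first path.
icd10_dictionary = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
}


def enrich_report(report: str, icd10: dict[str, str]) -> str:
    """Append the description after each known ICD-10 code found in the report."""

    def add_description(match: re.Match) -> str:
        code = match.group(0)
        description = icd10.get(code)
        # Leave unknown codes untouched rather than guessing a description.
        return f"{code} ({description})" if description else code

    return ICD10_PATTERN.sub(add_description, report)


enriched = enrich_report("Patient diagnosed with E11.9 and I10.", icd10_dictionary)
```

The enriched text, rather than the raw report, is what would then be chunked and embedded, so a query such as "hypertension" can match a report that only contained the code.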
CocoIndex is an open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing. It provides various built-ins that simplify customizing data pipelines for AI. ❤️ CocoIndex is open source under the Apache License 2.0. If you like our work, please support us with a GitHub star ⭐ at https://github.com/cocoindex-io/cocoindex.