Retrieval-Augmented Generation (RAG) systems are becoming the backbone of modern AI applications, enabling more accurate, context-aware responses by combining language models with your own custom data. If you’ve been exploring ways to set up a local, production-ready RAG stack, RagBits might be exactly what you’re looking for.
In this post, we’ll walk through how to install and use RagBits, an open-source Python toolbox designed for building RAG applications. We’ll pair it with Ollama, a popular tool for running local large language models. Whether you’re looking to run multi-agent workflows, do document search, or spin up AI-powered assistants, RagBits offers an end-to-end, developer-friendly pipeline.
What Is RagBits?
RagBits is a lightweight, open-source toolkit aimed at making RAG application development fast and flexible. It supports:
- 100+ LLMs through integration with LiteLLM
- Pydantic-based type-safe schema validation
- Built-in observability, testing, and monitoring
- Parsing of 20+ document formats
- Ray-based parallel processing for large-scale ingestion
- Compatibility with chat UI interfaces and custom deployment flows
You can seamlessly swap between embedding models, language models, and vector stores—all while using local resources.
Below is a complete quickstart: installation, setup, and running your first GenAI + RAG app.
Requirements
Ragbits is compatible with:
- Python 3.9+
- Local or remote LLMs via LiteLLM
- Optional: GPU/CPU acceleration for better performance
- Optional: Ollama for running local models
Make sure to install and configure your embedding and language models (such as `text-embedding-3-small` and `gpt-4.1-nano`) using LiteLLM or Ollama before use.
1. Installation
Install the full Ragbits stack:
pip install ragbits
Alternatively, install just what you need:
pip install ragbits-core ragbits-document-search ragbits-chat
2. Define and Run an LLM Prompt
Create a simple script like `qa_prompt.py`:
```python
import asyncio

from pydantic import BaseModel

from ragbits.core.llms import LiteLLM
from ragbits.core.prompt import Prompt

# Typed input and output schemas for the prompt.
class QuestionAnswerPromptInput(BaseModel):
    question: str

class QuestionAnswerPromptOutput(BaseModel):
    answer: str

# The prompt class ties the schemas to Jinja-templated system/user messages.
class QuestionAnswerPrompt(Prompt[QuestionAnswerPromptInput, QuestionAnswerPromptOutput]):
    system_prompt = "You are a question answering agent."
    user_prompt = "Question: {{ question }}"

# Hosted model via LiteLLM; see the Ollama note after this step for a fully local setup.
llm = LiteLLM(model_name="gpt-4.1-nano", use_structured_output=True)

async def main():
    prompt = QuestionAnswerPrompt(QuestionAnswerPromptInput(question="What is RAG in AI?"))
    response = await llm.generate(prompt)
    print(response.answer)

if __name__ == "__main__":
    asyncio.run(main())
```
Run it:
python qa_prompt.py
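The script above talks to a hosted OpenAI model. Since this post pairs RagBits with Ollama, here is a minimal sketch of pointing the same classes at local models through LiteLLM's `ollama/` provider prefix. The model names below are only examples; pull whichever models you like with `ollama pull` and make sure the Ollama server is running (it listens on `http://localhost:11434` by default).

```python
from ragbits.core.embeddings import LiteLLMEmbedder
from ragbits.core.llms import LiteLLM

# Example local chat model served by Ollama (assumes `ollama pull llama3.2` has been run).
local_llm = LiteLLM(model_name="ollama/llama3.2")

# Example local embedding model served by Ollama (assumes `ollama pull nomic-embed-text`).
local_embedder = LiteLLMEmbedder(model_name="ollama/nomic-embed-text")
```

Substituting these for the LLM above and the embedder in the next step keeps the whole pipeline on your machine.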
3. Ingest & Search a Document
Next, ingest a PDF straight from the web and run a semantic search over it:

```python
import asyncio

from ragbits.core.embeddings import LiteLLMEmbedder
from ragbits.core.vector_stores import InMemoryVectorStore
from ragbits.document_search import DocumentSearch

# Embed chunks with LiteLLM and keep the vectors in memory (no external database needed).
embedder = LiteLLMEmbedder(model_name="text-embedding-3-small")
vector_store = InMemoryVectorStore(embedder=embedder)
document_search = DocumentSearch(vector_store=vector_store)

async def run():
    # Download, parse, chunk, and embed the "Attention Is All You Need" paper.
    await document_search.ingest("web://https://arxiv.org/pdf/1706.03762")
    result = await document_search.search("What is the transformer model?")
    print(result)

if __name__ == "__main__":
    asyncio.run(run())
```
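Each text element returned by `search()` exposes the retrieved content through its `text_representation` attribute, which is exactly what the RAG prompt in the next step consumes. If you want to eyeball what retrieval found, a small hedged addition does the trick:

```python
# Add this inside run(), right after the ingest call, to preview what retrieval returns.
results = await document_search.search("What is the transformer model?")
for element in results:
    print(element.text_representation[:200])  # first 200 characters of the chunk
    print("---")
```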
4. Create a RAG Pipeline (Prompt + Context)
This step combines the previous two: retrieve context for a question, then feed it to the model through a context-aware prompt.

```python
import asyncio

from pydantic import BaseModel

from ragbits.core.embeddings import LiteLLMEmbedder
from ragbits.core.llms import LiteLLM
from ragbits.core.prompt import Prompt
from ragbits.core.vector_stores import InMemoryVectorStore
from ragbits.document_search import DocumentSearch

# Same building blocks as in steps 2 and 3.
embedder = LiteLLMEmbedder(model_name="text-embedding-3-small")
vector_store = InMemoryVectorStore(embedder=embedder)
document_search = DocumentSearch(vector_store=vector_store)
llm = LiteLLM(model_name="gpt-4.1-nano", use_structured_output=True)

class RAGInput(BaseModel):
    question: str
    context: list[str]

class RAGPrompt(Prompt[RAGInput, str]):
    system_prompt = "You are a QA agent. Use the provided context to answer."
    user_prompt = """
Question: {{ question }}
Context:
{% for item in context %}{{ item }}{% endfor %}
"""

async def run_rag():
    question = "What are the key findings in the paper?"
    await document_search.ingest("web://https://arxiv.org/pdf/1706.03762")
    result = await document_search.search(question)
    prompt = RAGPrompt(RAGInput(
        question=question,
        context=[r.text_representation for r in result],
    ))
    response = await llm.generate(prompt)
    print(response)

asyncio.run(run_rag())
```
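Retrieval may return more chunks than you want to pack into a single prompt. A small tweak, assuming `search()` returns an ordered list as used above, keeps only the top few hits:

```python
# Plain list slicing to cap how much context goes into the prompt; tune top_k to taste.
top_k = 5
prompt = RAGPrompt(RAGInput(
    question=question,
    context=[r.text_representation for r in result[:top_k]],
))
```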
5. Optional: Launch Chat UI
Tie everything together behind RagBits' chat interface:

```python
from ragbits.chat.api import RagbitsAPI
from ragbits.chat.interface import ChatInterface
from ragbits.core.embeddings import LiteLLMEmbedder
from ragbits.core.llms import LiteLLM
from ragbits.core.vector_stores import InMemoryVectorStore
from ragbits.document_search import DocumentSearch

# RAGPrompt and RAGInput are the prompt classes defined in step 4.

class MyChat(ChatInterface):
    async def setup(self):
        # Build the retrieval stack and the LLM once, when the app starts.
        self.embedder = LiteLLMEmbedder(model_name="text-embedding-3-small")
        self.vector_store = InMemoryVectorStore(embedder=self.embedder)
        self.document_search = DocumentSearch(vector_store=self.vector_store)
        self.llm = LiteLLM(model_name="gpt-4.1-nano", use_structured_output=True)
        await self.document_search.ingest("web://https://arxiv.org/pdf/1706.03762")

    async def chat(self, message, history=None, context=None):
        # Retrieve context for the user's message and stream the answer back.
        result = await self.document_search.search(message)
        prompt = RAGPrompt(RAGInput(
            question=message,
            context=[r.text_representation for r in result],
        ))
        async for chunk in self.llm.generate_streaming(prompt):
            yield self.create_text_response(chunk)

if __name__ == "__main__":
    RagbitsAPI(MyChat).run()
```
This starts a full-stack chatbot backed by your document.
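Save the script under any name you like (say `chat_app.py`, a filename chosen here purely for illustration) and run it the same way as the earlier examples:

python chat_app.py

RagbitsAPI serves the chat app locally; check the console output for the address to open in your browser.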
6. Scaffold a New Project (Optional)
Use the official starter template:
uvx create-ragbits-app
Why This Matters
The ability to run fast, modular, and observable RAG systems locally means:
- You own your data end to end
- You can build private chatbots and assistants
- You cut latency by avoiding external API calls
- It's a great fit for prototyping enterprise apps (support, internal docs, agents)
And thanks to RagBits’ integration with Pydantic, Ray, and Ollama, it’s not just another hobby tool—it’s production-ready.
Final Thoughts
RagBits stands out in the RAG space because it makes developer experience a priority. The design is modular, observability is built in, and it plays well with Ollama for running local models and LiteLLM for unified model access.
If you’ve been looking for a way to run your own mini ChatGPT grounded in your own documents, this is one of the most approachable ways to get started.