ArXiv CS RAG on 🤗 Space
A resource-efficient retrieval system for searching computer science papers on ArXiv, powered by ColBERTv2 and large language models. Hosted on Hugging Face Spaces.
Project Overview
The Challenge: Searching 4GB+ of Research on a Budget
The ArXiv dataset is a treasure trove of scientific knowledge, but its large size (4GB+) makes building a personal search engine computationally expensive. The challenge was to design and deploy a high-performance, accurate retrieval system without relying on GPUs or expensive cloud infrastructure.
The Solution: A Resource-Efficient RAG Pipeline
I engineered an end-to-end Retrieval-Augmented Generation (RAG) system with a focus on efficiency. The core idea was to intelligently pre-process the data and use a state-of-the-art “late-interaction” model, ColBERTv2, to create a highly accurate yet compact search index. And it runs on CPU!
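
To make "late interaction" concrete, here is a minimal sketch of the MaxSim scoring rule that ColBERT-style models use: each query token embedding is matched against its most similar document token embedding, and those per-token maxima are summed. The two-dimensional vectors and the `maxsim_score` helper are illustrative assumptions, not code from this project; a real ColBERTv2 model produces the embeddings with a BERT encoder and the index stores them in compressed form.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query token, take the best-matching
    document token, then sum those maxima over all query tokens.

    Vectors are assumed to be unit-length, so the dot product is the cosine.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values, already unit-length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has an exact match per query token
doc_b = [[0.6, 0.8]]                          # only a partial match for each token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because documents are encoded token by token ahead of time, only the cheap max-and-sum step runs at query time, which is part of what makes CPU-only serving plausible.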

Key Technical Decisions & My Role
As the sole developer, I owned the entire project lifecycle, making key technical decisions to ensure both performance and resource efficiency.

1. Efficient data filtering: Used Polars pre-processing to shrink the 4GB dataset to a manageable size, focusing on relevant CS abstracts.

2. State-of-the-art indexing: Implemented ColBERTv2 via ragatouille for highly accurate retrieval without needing complex query expansion.

3. Deployment: Built a custom Gradio UI and deployed the entire application on Hugging Face Spaces for public use.
The Impact: A Community-Recognized Tool
The project was not just a technical exercise; it delivered real-world value and was recognized by the community. The deployed application was featured as a top Hugging Face Space for a week, a sign of public interest in the tool and of its usability and retrieval quality.
This project showcases my ability to independently design, build, and deploy a complete, efficient ML system from the ground up.
For more details, please refer to the project resources:
Hugging Face Space | GitHub Code | Kaggle Notebook