ArXiv CS RAG on 🤗 Space

A resource-efficient retrieval system for searching computer science papers on ArXiv, powered by ColBERTv2 and large language models. Hosted on Hugging Face Spaces.

Project Overview

The final application: a live, deployed RAG system for searching computer science research on 🤗 Space

The Challenge: Searching 4GB+ of Research on a Budget

The ArXiv dataset is a treasure trove of scientific knowledge, but its large size (4GB+) makes building a personal search engine computationally expensive. The challenge was to design and deploy a high-performance, accurate retrieval system without relying on GPUs or expensive cloud infrastructure.

The Solution: A Resource-Efficient RAG Pipeline

I engineered an end-to-end Retrieval-Augmented Generation (RAG) system with a focus on efficiency. The core idea was to filter the data down to the relevant CS abstracts and index them with ColBERTv2, a state-of-the-art “late-interaction” model, to create a highly accurate yet compact search index. And it runs on CPU!
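To make “late interaction” concrete, here is a minimal sketch of the MaxSim scoring rule that ColBERT-style models use: every query token is matched against its most similar document token, and those best matches are summed. The 2-dimensional vectors, the document names, and the helper functions are illustrative assumptions, not the project's code. In the real system the token embeddings come from the ColBERTv2 encoder and are stored in compressed form, which this sketch does not attempt to show.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best match in the document."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


def rank(query_vecs, docs):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [(name, maxsim_score(query_vecs, vecs)) for name, vecs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy token embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # covers both query tokens
    "doc_b": [[1.0, 0.0], [-1.0, 0.0]],             # covers only the first
}

ranking = rank(query, docs)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because each query token is scored independently, a document only ranks highly when it covers all parts of the query, which is one reason this style of retrieval can do well without query expansion.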

End-to-end RAG architecture

Key Technical Decisions & My Role

As the sole developer, I owned the entire project lifecycle, making key technical decisions to ensure both performance and resource efficiency.

Filtering with Polars reduces the ArXiv dataset to relevant CS abstracts.

1. Efficient data filtering: Used Polars pre-processing to shrink the 4GB dataset to a manageable size, focusing on relevant CS abstracts.

Late-interaction ColBERTv2 indexing produces compact, accurate search indices for efficient retrieval.

2. State-of-the-art indexing: Implemented ColBERTv2 via ragatouille for highly accurate retrieval without needing complex query expansion.

Gradio UI with a CPU-only backend deployed on Hugging Face Spaces for online inference.

3. Deployment: Built a custom Gradio UI and deployed the entire application on Hugging Face Spaces for public use.
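The filtering step from item 1 can be sketched roughly as follows. The project itself used Polars; this version uses only the Python standard library so that it stays self-contained, and it should be read as an illustration of the idea rather than the actual pre-processing code. The field names (id, title, abstract, categories) and the space-separated category format are assumptions modeled on the public arXiv metadata snapshot, and the sample records are made up.

```python
import json


def is_cs_paper(categories):
    """True if any space-separated arXiv category is in the cs.* group."""
    return any(cat.startswith("cs.") for cat in categories.split())


def filter_cs_abstracts(lines):
    """Yield compact records for CS papers from JSON-lines metadata."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if is_cs_paper(record.get("categories", "")):
            yield {
                "id": record.get("id"),
                # Collapse the line breaks and padding found in raw metadata.
                "title": " ".join(record.get("title", "").split()),
                "abstract": " ".join(record.get("abstract", "").split()),
            }


# Hypothetical sample records standing in for the full metadata file.
sample_lines = [
    json.dumps({"id": "0001", "categories": "cs.CL cs.IR",
                "title": "Late interaction\n  retrieval",
                "abstract": "  We study retrieval.\n"}),
    json.dumps({"id": "0002", "categories": "math.CO",
                "title": "Graph colorings", "abstract": "A combinatorics paper."}),
    json.dumps({"id": "0003", "categories": "stat.ML cs.LG",
                "title": "Learning bounds", "abstract": "A cross-listed paper."}),
    json.dumps({"id": "0004", "categories": "physics.comp-ph",
                "title": "Simulations", "abstract": "Not a cs.* category."}),
]

cs_papers = list(filter_cs_abstracts(sample_lines))
print([paper["id"] for paper in cs_papers])
```

Keeping only the identifier, title, and abstract of CS papers is what shrinks the data before indexing. Streaming the file one line at a time, as above, also avoids loading the whole dataset into memory.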

The Impact: A Community-Recognized Tool

The project was more than a technical exercise: it delivered real-world value and was recognized by the community. The deployed application was featured as a top Hugging Face Space for a week, which demonstrates significant public interest and offers a verifiable signal of the project’s quality, usability, and retrieval accuracy.

This project showcases my ability to independently design, build, and deploy a complete, efficient ML system from the ground up.

For more details, please refer to the project resources:

Hugging Face Space | GitHub Code | Kaggle Notebook