ArXiv CS RAG on 🤗 Space

Project Overview

The final application: a live, deployed RAG system for searching computer science research on 🤗 Space

The Challenge: Searching 4GB+ of Research on a Budget

The ArXiv dataset is a treasure trove of scientific knowledge, but at more than 4GB it is computationally expensive to turn into a personal search engine. The challenge was to design and deploy a fast, accurate retrieval system without relying on GPUs or expensive cloud infrastructure.

The Solution: A Resource-Efficient RAG Pipeline

I engineered an end-to-end Retrieval-Augmented Generation (RAG) system with a focus on efficiency. The core idea was to filter the data down to the relevant records before indexing, then use a state-of-the-art “late-interaction” model, ColBERTv2, to create a highly accurate yet compact search index. And it all runs on CPU!
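To make "late interaction" concrete, here is a minimal sketch of the MaxSim scoring rule that ColBERT-style models use: the query and the document are embedded token by token, and each query token is matched against its most similar document token. The embeddings below are random stand-ins rather than real ColBERTv2 outputs, and the function names are illustrative only.

```python
import numpy as np


def normalise(x: np.ndarray) -> np.ndarray:
    """Scale every row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim), rows L2-normalised
    doc_emb:   (num_doc_tokens, dim), rows L2-normalised
    """
    # Similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    # Each query token keeps only its best-matching document token,
    # and the per-token maxima are summed into a single score.
    return float(sim.max(axis=1).sum())


# Toy data (assumption): random vectors standing in for token embeddings.
rng = np.random.default_rng(0)
query = normalise(rng.normal(size=(4, 8)))
doc_a = normalise(rng.normal(size=(20, 8)))
# doc_b also contains the query tokens verbatim, so every query token
# finds a perfect match and the score reaches its maximum of 4.0.
doc_b = np.vstack([doc_a, query])

print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

Because documents are embedded independently of the query, their token embeddings can be computed once at indexing time; only the cheap matrix product and max are needed per search.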

High-Level System Architecture — End-to-end RAG architecture

Key Technical Decisions & My Role

As the sole developer, I owned the entire project lifecycle, making key technical decisions to ensure both performance and resource efficiency.

Data Filtering Diagram — Filtering with Polars reduces the ArXiv dataset to relevant CS abstracts.

1. Efficient data filtering: Used Polars to pre-process the 4GB+ dataset and shrink it to a manageable size, keeping only the relevant CS abstracts.

ColBERTv2 Indexing Diagram — Late-interaction ColBERTv2 indexing produces compact, accurate search indices for efficient retrieval.

2. State-of-the-art indexing: Implemented ColBERTv2 via RAGatouille for highly accurate retrieval without needing complex query expansion.
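As a rough illustration of the indexing step, the sketch below shows how paper records might be passed to RAGatouille. The checkpoint name `colbert-ir/colbertv2.0`, the index name, and the choice to join title and abstract into one document are assumptions made for this example, not details confirmed by the project.

```python
def to_collection(rows: list[dict]) -> tuple[list[str], list[str]]:
    """Turn paper records into parallel lists of document texts and ids."""
    texts = [f"{row['title']}\n{row['abstract']}" for row in rows]
    ids = [row["id"] for row in rows]
    return texts, ids


def build_and_search(rows: list[dict], query: str, k: int = 3) -> list[dict]:
    """Index the records with ColBERTv2 and return the top-k hits for the query."""
    # Imported inside the function because loading RAGatouille pulls in
    # the full model stack, which the helper above does not need.
    from ragatouille import RAGPretrainedModel

    texts, ids = to_collection(rows)
    rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")  # assumed checkpoint
    rag.index(
        collection=texts,
        document_ids=ids,
        index_name="arxiv_cs_demo",  # hypothetical index name
        split_documents=True,
    )
    return rag.search(query=query, k=k)


sample_rows = [
    {"id": "0001", "title": "Neural retrieval", "abstract": "We study dense retrieval models."},
    {"id": "0003", "title": "Graph algorithms", "abstract": "We analyse shortest-path algorithms."},
]
texts, ids = to_collection(sample_rows)
print(ids)
```

Calling `build_and_search(sample_rows, "dense retrieval")` downloads the checkpoint and builds an index on disk, so it is left out of the snippet's top-level code.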

Deployment Diagram — Gradio UI with a CPU-only backend deployed on Hugging Face Spaces for online inference.

3. Deployment: Built a custom Gradio UI and deployed the entire application on Hugging Face Spaces for public use.
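A minimal sketch of what such a Gradio front end could look like is shown here. The retriever is a placeholder and the function names are hypothetical; the deployed app's actual layout and search logic are not described in this write-up.

```python
def format_results(hits: list[dict]) -> str:
    """Render retrieval hits as a numbered Markdown list."""
    if not hits:
        return "No results found."
    lines = [
        f"{rank}. **{hit['title']}** (score {hit['score']:.2f})"
        for rank, hit in enumerate(hits, start=1)
    ]
    return "\n".join(lines)


def search(query: str) -> str:
    """Placeholder retriever: a real app would query the ColBERTv2 index here."""
    if not query.strip():
        return format_results([])
    hits = [{"title": f"Placeholder result for '{query}'", "score": 0.0}]
    return format_results(hits)


def build_demo():
    """Wire the search function into a simple text-in, Markdown-out interface."""
    # Imported lazily so the helpers above work without the UI dependency.
    import gradio as gr

    return gr.Interface(
        fn=search,
        inputs=gr.Textbox(label="Query"),
        outputs=gr.Markdown(),
        title="ArXiv CS RAG",
    )
```

Running `build_demo().launch()` starts the app locally; on Hugging Face Spaces the same call typically lives in the Space's `app.py`.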

The Impact: A Community-Recognized Tool

The project was not just a technical exercise; it delivered real-world value and was recognized by the community. The deployed application was featured as a top Hugging Face Space for a week. That recognition demonstrated significant public interest and provided a strong, verifiable signal of the project’s quality, usability, and retrieval accuracy.

This project showcases my ability to independently design, build, and deploy a complete, efficient ML system from the ground up.

For more details, please refer to the project resources:

Hugging Face Space

GitHub Code

Kaggle Notebook