Feb 5, 2025

How to Set Up and Run DeepSeek R1 Locally

In this tutorial, I’ll guide you through running DeepSeek-R1 locally, step by step, and setting it up with Ollama. We’ll also build a simple RAG application that runs on your laptop using the R1 model, LangChain, and Gradio.

If you're looking for an overview of the R1 model, check out this DeepSeek-R1 article. For fine-tuning instructions, refer to this tutorial on fine-tuning DeepSeek-R1.

Why Run DeepSeek-R1 Locally?

Running DeepSeek-R1 on your machine gives you full control over execution without relying on external servers. Key benefits include:

  • Privacy & Security: Keeps all data on your system.

  • Uninterrupted Access: Avoids rate limits, downtime, and service disruptions.

  • Performance: Delivers faster responses by eliminating API latency.

  • Customization: Allows parameter adjustments, prompt fine-tuning, and local application integration.

  • Cost Efficiency: Removes API costs by running the model locally.

  • Offline Availability: Enables use without an internet connection once downloaded.

Setting Up DeepSeek-R1 Locally with Ollama

Ollama simplifies local LLM execution by managing model downloads, quantization, and deployment.

Step 1: Install Ollama

Download and install Ollama from the official website.

Once the download is complete, install the Ollama application as you would any other software.

Step 2: Download and Run DeepSeek-R1

Now, let's test the setup and download the model. Open a terminal and run the following command:
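Assuming the model is published under the default deepseek-r1 tag on Ollama's registry, the command looks like this:

ollama run deepseek-r1

The first run downloads the model weights; after that, Ollama drops you straight into an interactive chat session.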

Model Variants

Ollama provides multiple versions of DeepSeek-R1, ranging from 1.5B to 671B parameters. The 671B model is the original DeepSeek-R1, while the smaller models are distilled versions based on Qwen and Llama architectures.

If your hardware cannot support the 671B model, you can run a smaller version by replacing X in the command below with the desired parameter size (1.5b, 7b, 8b, 14b, 32b, 70b, or 671b):
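For example, the pattern below uses X as a placeholder for the size tag:

ollama run deepseek-r1:X

So ollama run deepseek-r1:7b pulls the 7B distilled model.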

This flexibility allows you to use DeepSeek-R1 even without high-end hardware.

Step 3: Running DeepSeek-R1 in the Background

To keep DeepSeek-R1 running continuously and make it available via an API, start the Ollama server:
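The standard way to do this is with ollama serve:

ollama serve

By default, the server listens on http://localhost:11434, which is the endpoint used in the API examples later in this tutorial.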

This will enable integration with other applications.


Using DeepSeek-R1 Locally

Step 1: Running Inference via CLI

Once the model is downloaded, you can interact with DeepSeek-R1 directly from the terminal.
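For example, you can open an interactive chat session, or pass a one-off prompt as an argument:

ollama run deepseek-r1
ollama run deepseek-r1 "Solve: 25 * 25"

Type /bye to exit the interactive session.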

Step 2: Accessing DeepSeek-R1 via API

To integrate DeepSeek-R1 into applications, use the Ollama API with curl:

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1",
  "messages": [{ "role": "user", "content": "Solve: 25 * 25" }]
}'
Note: curl is a command-line tool, available by default on Linux, macOS, and recent versions of Windows, that lets you make HTTP requests directly from the terminal, making it useful for interacting with APIs.

Step 3: Accessing DeepSeek-R1 via Python

You can run Ollama in any integrated development environment (IDE) of your choice. First, install the Ollama Python package:
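The package is published on PyPI as ollama:

pip install ollama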

Once installed, use the following script to interact with the model:

import ollama

response = ollama.chat(
    model="deepseek-r1",
    messages=[
        {"role": "user", "content": "Explain Newton's second law of motion"},
    ],
)

print(response["message"]["content"])

The ollama.chat() function processes the user’s input as a conversational exchange with the model. The script then extracts and prints the model’s response.

(Screenshot: running DeepSeek-R1 locally inside VSCode.)

Running a Local Gradio App for RAG With DeepSeek-R1

Now, let's build a simple demo app using Gradio to query and analyze documents with DeepSeek-R1.

Step 1: Prerequisites

Before implementation, ensure you have the following tools and libraries installed:

  • Python 3.8+

  • LangChain – A framework for building LLM-powered applications, facilitating easy retrieval, reasoning, and tool integration.

  • ChromaDB – A high-performance vector database for efficient similarity searches and embedding storage.

  • Gradio – For creating a user-friendly web interface.

Install the necessary dependencies using:
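One way to install them in a single command (the langchain_community and pymupdf packages cover the document loader and embeddings used below; adjust to your environment if needed):

pip install langchain langchain_community chromadb gradio ollama pymupdf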


Once installed, import the required libraries:
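A typical set of imports for this app looks like the following; treat the exact module paths as a starting point, since they can shift between LangChain versions:

import re

import gradio as gr
import ollama
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma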


Step 2: Processing the Uploaded PDF

Now, let's process the uploaded PDF:
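Here is a minimal sketch of process_pdf(), matching the steps summarized below; the chunk size, overlap, and the choice of deepseek-r1 for the embeddings are assumptions you can tune:

def process_pdf(pdf_bytes):
    # Nothing to do if no PDF was uploaded.
    if pdf_bytes is None:
        return None, None, None

    # gr.File hands the function a temporary file path, which PyMuPDFLoader accepts.
    loader = PyMuPDFLoader(pdf_bytes)
    data = loader.load()

    # Split the document into overlapping chunks that fit comfortably in context.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunks = text_splitter.split_documents(data)

    # Embed each chunk locally with Ollama and store the vectors in Chroma.
    embeddings = OllamaEmbeddings(model="deepseek-r1")
    vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)
    retriever = vectorstore.as_retriever()

    return text_splitter, vectorstore, retriever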


How it Works:

The process_pdf() function:
✔ Loads and prepares PDF content for retrieval-based answering.
✔ Extracts text using PyMuPDFLoader.
✔ Splits text into chunks using RecursiveCharacterTextSplitter.
✔ Generates vector embeddings using OllamaEmbeddings.
✔ Stores embeddings in a Chroma vector store for efficient retrieval.

Step 3: Combining Retrieved Document Chunks

After retrieving document chunks, we need to merge them for better readability:
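A small helper is enough here; it simply joins the page_content of each retrieved chunk:

def combine_docs(docs):
    # Concatenate the text of the retrieved chunks, separated by blank lines.
    return "\n\n".join(doc.page_content for doc in docs)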


Since retrieval-based models pull relevant excerpts rather than entire documents, this function ensures extracted content is properly formatted before being passed to DeepSeek-R1.

Step 4: Querying DeepSeek-R1 Using Ollama

Now, let’s set up DeepSeek-R1 for processing queries:
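Here is a sketch of the ollama_llm() helper described below; the prompt template and the <think> tag pattern stripped by re.sub() are assumptions based on how DeepSeek-R1 reports its reasoning:

def ollama_llm(question, context):
    # Build a single structured prompt from the question and the retrieved context.
    formatted_prompt = f"Question: {question}\n\nContext: {context}"

    # Send the prompt to the local DeepSeek-R1 model via Ollama.
    response = ollama.chat(
        model="deepseek-r1",
        messages=[{"role": "user", "content": formatted_prompt}],
    )
    response_content = response["message"]["content"]

    # DeepSeek-R1 wraps its chain of thought in <think>...</think>; strip it from the answer.
    final_answer = re.sub(r"<think>.*?</think>", "", response_content, flags=re.DOTALL).strip()
    return final_answer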


How it Works:

✔ Formats the user’s question and retrieved document context into a structured prompt.
✔ Sends the input to DeepSeek-R1 via ollama.chat().
✔ Processes the question in context and returns a relevant answer.
✔ Strips unnecessary thinking output using re.sub().

Step 5: The RAG Pipeline

Now, let’s build the full RAG pipeline:
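Sketched out, rag_chain() is a thin wrapper around the retriever and the two helpers defined above:

def rag_chain(question, text_splitter, vectorstore, retriever):
    # Pull the chunks most relevant to the question from the vector store.
    retrieved_docs = retriever.invoke(question)

    # Merge the retrieved chunks into a single context string.
    formatted_content = combine_docs(retrieved_docs)

    # Answer the question with DeepSeek-R1, grounded in the retrieved context.
    return ollama_llm(question, formatted_content)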


How it Works:

✔ Searches the vector store using retriever.invoke(question).
✔ Retrieves and formats the most relevant document excerpts.
✔ Passes structured content to ollama_llm() for context-aware responses.

Step 6: Creating the Gradio Interface

Now, let's build a Gradio web interface to allow users to upload PDFs and ask questions:

def ask_question(pdf_bytes, question):
    text_splitter, vectorstore, retriever = process_pdf(pdf_bytes)

    if text_splitter is None:
        return None  # No PDF uploaded

    result = rag_chain(question, text_splitter, vectorstore, retriever)
    return result


interface = gr.Interface(
    fn=ask_question,
    inputs=[
        gr.File(label="Upload PDF (optional)"),
        gr.Textbox(label="Ask a question"),
    ],
    outputs="text",
)

interface.launch()

How it Works:

✔ Checks if a PDF is uploaded.
✔ Processes the PDF using process_pdf() to extract text and generate embeddings.
✔ Retrieves relevant information using rag_chain().
✔ Sets up a Gradio interface with gr.Interface().
✔ Enables document-based Q&A in a web browser with interface.launch().
