

No GPU needed: deploy the Llama 2 large model locally on Windows and generate text through a Python interface.

Hello everyone, today we are going to talk about deploying the Llama 2 large model locally on Windows.

In this lesson, we will deploy and run the Llama 2 large model in a Windows environment using the llama.cpp library, with only the CPU and no GPU.

After the deployment is completed, we will directly use the Python interface for text generation.

  1. What is the Llama 2 large model?

On July 19, 2023, Meta released the free and commercially available large language model Llama 2.

This release alone was enough to shake up the field of large models.

The Llama 2 series comes in three sizes: 7 billion, 13 billion, and 70 billion parameters.

In practical evaluations, Llama 2 has surpassed GPT-3 and comes close to GPT-3.5.

There are many aspects of Llama 2 itself that are worth discussing, especially its profound impact on the domestic AI field. I will discuss it in a separate video in the future.

Now, let's get to the point and experience Llama2 on our personal computers.

  2. llama.cpp and llama-cpp-python

We all know that large language models, with billions of parameters, are not meant to run in an ordinary single-machine environment.

Even the smallest Llama model has 7 billion parameters and normally needs an NVIDIA RTX 4090 or A10 to run.

However, a brilliant developer has built a project called llama.cpp for running the LLaMA-family models released by Meta.

The project is written entirely in C/C++ with no third-party compilation dependencies, which lets us run large-model inference on nothing but the CPU.

In addition, llama.cpp supports not only Llama 2 but also other models such as Alpaca, Chinese-LLaMA, and WizardLM, and provides bindings for other languages such as Python, Go, and Node.js.

Next, let's set up the Python environment for llama.cpp step by step and get a large language model running on a single machine.

  3. Building the environment based on Anaconda

Before building the environment, you need to install Visual Studio in advance; it is used to compile llama.cpp during the installation of llama-cpp-python.

We can directly download Visual Studio 2022 Community Edition from the official website of VS for installation.

After completing the installation of VS, use Anaconda to create a new Python environment, and choose Python version 3.10.

Then activate the environment and run the command "pip install llama-cpp-python" to install the Python interface to llama.cpp.
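As a quick sanity check after installation, you can try importing the package; this is only a minimal sketch that verifies the compiled module can be loaded.

```python
# Verify that llama-cpp-python was installed and compiled correctly
from llama_cpp import Llama

print("llama-cpp-python is ready:", Llama)
```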

  4. Downloading the model from Hugging Face

We can download the quantized llama2 model from Hugging Face.

Specifically, after logging in to Hugging Face, go to the TheBloke account and find the Llama-2-7B-Chat-GGML model.

Here, note that Llama-2-7B refers to the original 7B version, Chat indicates the chat fine-tuned version, and GGML indicates the quantized format used by llama.cpp.

Model quantization can be simply understood as a model compression technique: for example, a 7B model stored in fp16 takes roughly 14 GB, while a 4-bit quantized version takes under 4 GB.

In the download list, we choose to download the q4_0 version.

There are many other versions in the list.

Simply put, q4_0 means 4 bits per weight, and q5_0 means 5 bits per weight.

The larger the number, the higher the precision (and the larger the file); choose the one that best fits your hardware and accuracy needs.
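Instead of downloading through the browser, the quantized file can also be fetched from a script. Below is a minimal sketch using the huggingface_hub package (an extra dependency not mentioned above); the repository name and file name follow TheBloke's GGML release and should be double-checked against the actual download page.

```python
# Download the q4_0 quantized model file from Hugging Face
# (assumes: pip install huggingface_hub; file name taken from TheBloke's GGML release)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",
    filename="llama-2-7b-chat.ggmlv3.q4_0.bin",  # roughly 7B weights at ~4 bits each, a few GB on disk
)
print("Model saved to:", model_path)
```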

  5. Writing a Python program for text generation

After setting up the environment and downloading the model, we can write Python code.

The program will depend on the llama-cpp-python module, and the usage of its interface can be found in the project documentation.

Open the project documentation to see how each parameter of the interface is used. I won't go into detail here; you can experiment with the specific parameters yourself.

Next, let's write a simple sample program.

First, import llama_cpp.

In the main function, create a Llama model that loads the q4_0 model file we just downloaded.

Then input "hello", "who are you", "how old are you" to the model and see how the model reacts.

Here we implement a helper function, generate_text, which takes the model and the input message as parameters and returns the model's output.

In the function, we first need to convert the message into a prompt, and then input the prompt to the model to get the output.

The format of the output is described in the documentation: we extract the text string from choices, save it to answer, and return answer.
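Putting the steps above together, here is a minimal sketch of the sample program. The Llama constructor and the completion call follow the llama-cpp-python documentation, while the model file path and the simple "Q: ... A:" prompt template are example assumptions you should adapt to your own setup.

```python
import llama_cpp


def generate_text(model, message):
    # Wrap the user message in a simple prompt (example template, adjust as needed)
    prompt = f"Q: {message} A:"
    output = model(prompt, max_tokens=128, stop=["Q:"], echo=False)
    # The generated text sits in choices[0]["text"] of the returned dict
    answer = output["choices"][0]["text"]
    return answer


def main():
    # Load the q4_0 quantized model downloaded earlier (path is an example)
    model = llama_cpp.Llama(model_path="llama-2-7b-chat.ggmlv3.q4_0.bin")

    for message in ["hello", "who are you", "how old are you"]:
        print(f">>> {message}")
        print(generate_text(model, message))


if __name__ == "__main__":
    main()
```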

Running the program, we can get three test results.

Sending "hello" will receive a friendly reply from llama2.

Sending "who are you" will introduce llama2.

Sending "how old are you" will also give an appropriate result.
