Learnings From Fine-Tuning LLM on My Telegram Messages

Metadata

Author: asmirnov.xyz
Full Title: Learnings From Fine-Tuning LLM on My Telegram Messages
URL: https://asmirnov.xyz/doppelganger

Highlights

According to the Hugging Face Open LLM Leaderboard, one of the top smaller models (≤13B parameters) is Mistral 7B. It even outperforms Llama 2 13B. Now, the question is whether LoRA is sufficient or if full fine-tuning is necessary. Various comparisons 1(https://asmirnov.xyz/doppelganger#fn1) 2(https://asmirnov.xyz/doppelganger#fn2) suggest that LoRA is a bit worse than full fine-tuning but still fine most of the time. (View Highlight)
I will test models by having chats in two ways. First, the model will pretend to be me and I will chat with myself from the perspective of my different friends. Then, I’ll chat as myself while the model acts as my friends. My conversation starter will always be the same two messages: “hey” and “what’s up?” (in Russian, “прив” and “как дела?”). Generated phrases and the persons the model acts as will be highlighted. All conversations will initially be held in Russian, and the originals may be accessed by clicking the ‘original’ details button. For testing, I will be using oobabooga/text-generation-webui. (View Highlight)
LoRA offers a low-effort approach in terms of both the training pipeline and hardware requirements. It trains around 1% of the total weights. I chose a 1024 sequence length and a batch size of 8. The training, which consumed 20GB of VRAM on an RTX 3090, took three epochs and lasted for 5.5 hours. For this, I used vast.ai, where the GPU cost was $0.362 per hour, totaling $2 for the entire training, excluding time spent on experiments and bug fixes. (View Highlight)
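
To illustrate why LoRA ends up training only about 1% of the weights, here is a minimal sketch of the parameter count for a single adapted weight matrix. The 4096×4096 matrix size and the rank of 16 are illustrative assumptions, not values stated in the highlight; the actual fraction depends on the rank and on which layers are adapted.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of LoRA adapter parameters to the frozen matrix they adapt.

    LoRA keeps the d_out x d_in weight frozen and learns two small
    matrices, A (rank x d_in) and B (d_out x rank).
    """
    frozen_params = d_in * d_out
    adapter_params = rank * (d_in + d_out)
    return adapter_params / frozen_params


# Hypothetical values: a 4096 x 4096 projection and rank 16 (assumptions).
fraction = lora_param_fraction(d_in=4096, d_out=4096, rank=16)
print(f"Trainable fraction for this matrix: {fraction:.2%}")
```

Under these assumed values the adapter adds well under 1% of the matrix's parameters, which is in line with the "around 1%" figure quoted above.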
Full fine-tuning is more challenging due to the need for multi-GPU training. Popular methods include either ZeRO & DeepSpeed 3(https://asmirnov.xyz/doppelganger#fn3) or FSDP 4(https://asmirnov.xyz/doppelganger#fn4), with FSDP being essentially equivalent to ZeRO3 5(https://asmirnov.xyz/doppelganger#fn5). I decided to go with FSDP.
While implementing the training pipeline, I referred to the Stanford Alpaca fine-tuning code and Anton Bacaj’s Mistral fine-tuning code.
Using a half-precision FSDP full shard with a 1024 sequence length and a micro batch size of 2 required 63GB of VRAM on each of the eight A100 80 GB GPUs. The training, lasting three epochs, took just 20 minutes. The total cost for the VM was $8.88 per hour, resulting in $3, not including the time for experiments and bug fixes. (View Highlight)
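
The cost figures quoted for the two runs can be checked with simple arithmetic. This is a small sketch using only the hourly rates and durations from the highlights; the function and variable names are made up for illustration.

```python
def training_cost(hourly_rate_usd: float, hours: float) -> float:
    """Cost of a run billed by the hour, ignoring setup and debugging time."""
    return hourly_rate_usd * hours


# LoRA: one RTX 3090 at $0.362 per hour for 5.5 hours.
lora_cost = training_cost(0.362, 5.5)

# Full fine-tuning: eight-A100 VM at $8.88 per hour for 20 minutes.
full_ft_cost = training_cost(8.88, 20 / 60)

print(f"LoRA: ${lora_cost:.2f}")
print(f"Full fine-tuning: ${full_ft_cost:.2f}")
```

Both results round to the quoted totals of roughly $2 and $3: the full fine-tuning run costs more per hour but finishes much faster, so the two totals end up close.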

Pelayo Arbués
