import transformers
import trl # trl = Transformers Reinforcement Learning -> https://github.com/huggingface/trl
import datasets
import accelerate
import gradio as grFully Fine-tune a Small Language Model with Hugging Face Tutorial
Note: If you’re running in Google Colab, make sure to enable GPU usage by going to Runtime -> Change runtime type -> select GPU.
1 Overview
Why fine-tune a Small Language Model (SLM)?
Because we want small model for a specific task (e.g. don’t to pay API credits or leak our data online).
If we have our own model… we can run it anywhere and everywhere we like.
2 How to fine-tune an LLM model
There are several ways to fine-tune an LLM including reinforcement learning (RL) and Supervised Fine-tuning (SFT).
We are going to do SFT because it’s the most straightforward.
Basically SFT = give samples of inputs and outputs and the model learns to map a given input to a given output.
For example if our goal was to extract names:
- Input: Hello my name is Daniel
- Output: Daniel
Note:
- Our inputs can be any kind of input string.
- Our outputs will be fine-tuned to conform to a structured data pattern.
In LLM world, data inputs are tokens and data outputs are tokens.
A token is a numerical representation of some kind of data.
Computers like numbers (not images, text, videos, etc).
Everything must be turned into numbers.
And data = a very broad term.
It could be text, images, video (series of images), audio, DNA sequences, Excel spreadsheets, you name it.
The goal of the LLM is to be given an input sequence of tokens and then predict the following tokens.
So with this mindset, you can think of any problem as tokens in, tokens out.
Ask yourself: What tokens do I want to put in and what tokens do I want my model to return?
In our case, we want to put in almost any string input. And we want to get back structured information specifically related to food and drinks.
This a very specific use case, however, the beauty of LLMs being so general is that you can apply this tokens in, tokens out mindset to almost anything.
If you’ve got an existing dataset (no problem if you don’t, you can create one, let me know if you’d like a guide on this), chances are, you can fine-tune an LLM to do pretty well on it.
3 What we’re cooking
We’re going to build a SLM (Small Language Model) to extract food and drink items from text.
Why?
If we needed to go over a large dataset of image captions and filter them for food items (we could then use these filtered captions for a food app).
TK image - example of what we’re doing.
4 Ingredients
- Model (Gemma-3-270M)
- Dataset (a pre-baked dataset to extract foods and drinks from text)
- Training code (Hugging Face Transformers + TRL)
- Eval code
- Demo
5 Method
- Download model - Hugging Face
transformers - Download dataset - Hugging Face
datasets - Inspect dataset - Hugging Face
datasets - Train model on dataset - Hugging Face
trl(TRL = Transformers Reinforcement Learning) - Eval model - basically just look at a bunch of samples
- Create an interactive demo - Hugging Face
gradio - Bonus: Make the demo public so other people can use it - Hugging Face Spaces
6 Fine-tuning LLM vs RAG
- Fine-tuning = To do a very specific task, e.g. structured data extraction.
- An example would be you’re an insurance company who gets 10,000 emails a day and you want to extract structured data directly from these emails to JSON.
- RAG = You want to inject custom knowledge into an LLM.
- An example would be you’re an insurance company wanting to send automatic responses to people but you want the responses to include information from your own docs.
7 Why fine-tune your own model?
- Own the model, can run on own hardware
- Our task is simple enough to just use a small language model
- No API calls needed
- Can run in batch mode to get much faster inference than API calls
- Model by default wasn’t very good at our task but now since fine-tuning, is very good
8 Definitions
Some quick definitions of what we’re doing.
- Full fine-tuning - All weights on the model are updated. Often takes longer and requires larger hardware capacity, however, if your model is small enough (e.g. 270M parameter or less), you can often do full fine-tuning.
- LORA fine-tuning (also known as partial fine-tuning) - Low Rank Adaptation or training a small adapter to attach to your original model. Requires significantly less resources but can perform on par with full fine-tuning.
- SLM (Small Language Model) - A subjective definition but to me a Small Language Model is a model with under 1B parameters, with added bonus for being under 500M parameters. Less parameters generally means less performance. However, when you have a specific task, SLMs often shine because they can be tailored for that specific task. If your task is “I want to create a chatbot capable of anything”, you’ll generally want the biggest model you can reasonably serve. If your task is “I want to extract some structured data from raw text inputs”, you’ll probably be surprised how well a SLM can perform.
9 Import dependencies
Note: If you’re in Google Colab, you may have to install
trl,accelerateandgradio.
For Google Colab:
!pip install trl accelerate gradio# Check the amount of GPU memory available (we need at least ~16GB)
import torch
if torch.cuda.is_available():
device = torch.cuda.current_device()
gpu_name = torch.cuda.get_device_name(device)
total_memory = torch.cuda.get_device_properties(device).total_memory
allocated_memory = torch.cuda.memory_allocated(device)
reserved_memory = torch.cuda.memory_reserved(device)
free_memory = total_memory - reserved_memory
print(f"GPU: {gpu_name}")
print(f"Total Memory: {total_memory / 1e6:.2f} MB | {total_memory / 1e9:.2f} GB")
print(f"Allocated Memory: {allocated_memory / 1e6:.2f} MB | {allocated_memory / 1e9:.2f} GB")
print(f"Reserved Memory: {reserved_memory / 1e6:.2f} MB | {reserved_memory / 1e9:.2f} GB")
print(f"Free Memory: {free_memory / 1e6:.2f} MB | {free_memory / 1e9:.2f} GB")
else:
print("No CUDA GPU available")GPU: NVIDIA GB10
Total Memory: 128524.03 MB | 128.52 GB
Allocated Memory: 0.00 MB | 0.00 GB
Reserved Memory: 0.00 MB | 0.00 GB
Free Memory: 128524.03 MB | 128.52 GB
/home/mrdbourke/miniforge3/envs/ai/lib/python3.12/site-packages/torch/cuda/__init__.py:283: UserWarning:
Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
Minimum and Maximum cuda capability supported by this version of PyTorch is
(8.0) - (12.0)
warnings.warn(
10 Setup Base Model
The base model we’ll be using is Gemma 3 270M from Google.
It’s the same architecture style as larger LLMs such as Gemini but at a much smaller scale.
This is why we refer to it as a “Small Language Model” or SLM.
We can load our model using transformers.
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_NAME = "google/gemma-3-270m-it" # note: "it" stands for "instruction tuned" which means the model has been tuned for following instructions
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
dtype="auto",
device_map="auto", # put the model on the GPU
attn_implementation="eager" # could use flash_attention_2 but ran into issues... so stick with Eager for now
)tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"[INFO] Model on device: {model.device}")
print(f"[INFO] Model using dtype: {model.dtype}")[INFO] Model on device: cuda:0
[INFO] Model using dtype: torch.bfloat16
Our model requires numbers (tokens) as input.
We can turn strings into tokens via a tokenizer!
tokenizer("Hello my name is Daniel"){'input_ids': [2, 9259, 1041, 1463, 563, 13108], 'attention_mask': [1, 1, 1, 1, 1, 1]}
import torch
outputs = model(torch.tensor(tokenizer("Hello my name is Daniel")["input_ids"]).unsqueeze(0).to("cuda"))
outputs.keys()odict_keys(['logits', 'past_key_values'])
11 Get dataset
Our dataset is located here: https://huggingface.co/datasets/mrdbourke/FoodExtract-1k.
It was created from image captions + random strings and then using gpt-oss-120b (a powerful open-source LLM) to do synthetic labelling.
For more on the dataset you can read the README.md file explaining it.
The main thing we are concerned about is that we want the input to our model to be the "sequence" column and the output to be the "gpt-oss-120b-label-condensed" column.
We’ll explore these below.
from datasets import load_dataset
dataset = load_dataset("mrdbourke/FoodExtract-1k")
print(f"[INFO] Number of samples in the dataset: {len(dataset['train'])}")[INFO] Number of samples in the dataset: 1420
import json
import random
def get_random_idx(dataset):
"""Returns a random integer index based on the number of samples in the dataset."""
random_idx = random.randint(0, len(dataset)-1)
return random_idx
random_idx = get_random_idx(dataset["train"])
random_sample = dataset["train"][random_idx]
example_input = random_sample["sequence"]
example_output = random_sample["gpt-oss-120b-label"]
example_output_condensed = random_sample["gpt-oss-120b-label-condensed"]
print(f"[INFO] Input:\n{example_input}")
print()
print(f"[INFO] Example structured output (what we want our model to learn to predict):")
print(eval(example_output))
print()
print(f"[INFO] Example output condensed (we'll train our model to predict the condensed output since it uses less tokens than JSON):")
print(example_output_condensed)[INFO] Input:
Nutrition information label for Sliced 5 Seeds Bread. Serving size: 78g (2 slices). Energy: 850 kJ (10% daily intake), Protein: 12.2g (22% daily intake), Fat: 8g (11% daily intake), Saturated Fat: 1.1g, Carbohydrate: 15.9g (5% daily intake), Sugars: 1.2g, Dietary Fibre: 1.6g (1% daily intake), Sodium: 300mg (13% daily intake). Ingredients include water, wheat flour, mixed seeds (18%), sunflower seeds, psyllium seeds, golden wheat seeds, and torch.
[INFO] Example structured output (what we want our model to learn to predict):
{'is_food_or_drink': True, 'tags': ['np', 'il', 'fi'], 'food_items': ['Sliced 5 Seeds Bread', 'water', 'wheat flour', 'mixed seeds', 'sunflower seeds', 'psyllium seeds', 'golden wheat seeds', 'torch'], 'drink_items': []}
[INFO] Example output condensed (we'll train our model to predict the condensed output since it uses less tokens than JSON):
food_or_drink: 1
tags: np, il, fi
foods: Sliced 5 Seeds Bread, water, wheat flour, mixed seeds, sunflower seeds, psyllium seeds, golden wheat seeds, torch
drinks:
Because we’d like to use our model to potentially filter a large corpus of data, we get it to assign various tags to the text as well.
These are as follows.
# Our fine-tuned model will assign tags to text so we can easily filter them by type in the future
tags_dict = {'np': 'nutrition_panel',
'il': 'ingredient list',
'me': 'menu',
're': 'recipe',
'fi': 'food_items',
'di': 'drink_items',
'fa': 'food_advertistment',
'fp': 'food_packaging'}11.1 Format the dataset into LLM-style inputs/outputs
Right now we have examples of string-based inputs and structured outputs.
However, our LLMs generally want things in the format of:
{"user": "Hello my name is Daniel",
"system": "Hi Daniel, I'm an LLM"}
In other words, they want structure around the intputs and outputs rather than just raw information.
Resource: See the dataset formats and types in the TRL docs: https://huggingface.co/docs/trl/en/dataset_formats
random_sample{'sequence': 'Nutrition information label for Sliced 5 Seeds Bread. Serving size: 78g (2 slices). Energy: 850 kJ (10% daily intake), Protein: 12.2g (22% daily intake), Fat: 8g (11% daily intake), Saturated Fat: 1.1g, Carbohydrate: 15.9g (5% daily intake), Sugars: 1.2g, Dietary Fibre: 1.6g (1% daily intake), Sodium: 300mg (13% daily intake). Ingredients include water, wheat flour, mixed seeds (18%), sunflower seeds, psyllium seeds, golden wheat seeds, and torch.',
'image_url': None,
'class_label': 'food',
'source': 'manual_taken_images',
'char_len': 451.0,
'word_count': 67.0,
'syn_or_real': 'syn',
'uuid': '3666577f-e1d1-40ed-9325-d1b3808fe15c',
'gpt-oss-120b-label': "{'is_food_or_drink': True, 'tags': ['np', 'il', 'fi'], 'food_items': ['Sliced 5 Seeds Bread', 'water', 'wheat flour', 'mixed seeds', 'sunflower seeds', 'psyllium seeds', 'golden wheat seeds', 'torch'], 'drink_items': []}",
'gpt-oss-120b-label-condensed': 'food_or_drink: 1\ntags: np, il, fi\nfoods: Sliced 5 Seeds Bread, water, wheat flour, mixed seeds, sunflower seeds, psyllium seeds, golden wheat seeds, torch\ndrinks:',
'target_food_names_to_use': None,
'caption_detail_level': None,
'num_foods': None,
'target_image_point_of_view': None}
def sample_to_conversation(sample):
"""Helper function to convert an input sample to conversation style."""
return {
"messages": [
{"role": "user", "content": sample["sequence"]}, # Load the sequence from the dataset
{"role": "system", "content": sample["gpt-oss-120b-label-condensed"]} # Load the gpt-oss-120b generated label
]
}
sample_to_conversation(random_sample){'messages': [{'role': 'user',
'content': 'Nutrition information label for Sliced 5 Seeds Bread. Serving size: 78g (2 slices). Energy: 850 kJ (10% daily intake), Protein: 12.2g (22% daily intake), Fat: 8g (11% daily intake), Saturated Fat: 1.1g, Carbohydrate: 15.9g (5% daily intake), Sugars: 1.2g, Dietary Fibre: 1.6g (1% daily intake), Sodium: 300mg (13% daily intake). Ingredients include water, wheat flour, mixed seeds (18%), sunflower seeds, psyllium seeds, golden wheat seeds, and torch.'},
{'role': 'system',
'content': 'food_or_drink: 1\ntags: np, il, fi\nfoods: Sliced 5 Seeds Bread, water, wheat flour, mixed seeds, sunflower seeds, psyllium seeds, golden wheat seeds, torch\ndrinks:'}]}
# Map our sample_to_conversation function to dataset
dataset = dataset.map(sample_to_conversation,
batched=False)
dataset["train"][42]{'sequence': 'another optional quest takes place on windfall island during the night time play the song of passing a number of times and each time, glance towards the sky',
'image_url': 'https://portforward.com/games/walkthroughs/The-Legend-of-Zelda-The-Wind-Waker/The-Legend-of-Zelda-The-Wind-Waker-large-430.jpg',
'class_label': 'not_food',
'source': 'qwen2vl_open_dataset',
'char_len': 156.0,
'word_count': 28.0,
'syn_or_real': 'real',
'uuid': 'bbac79ce-df1f-48b8-891c-752809be11c7',
'gpt-oss-120b-label': "{'is_food_or_drink': 'false', 'tags': [], 'food_items': [], 'drink_items': []}",
'gpt-oss-120b-label-condensed': 'food_or_drink: 0\ntags: \nfoods: \ndrinks:',
'target_food_names_to_use': None,
'caption_detail_level': None,
'num_foods': None,
'target_image_point_of_view': None,
'messages': [{'content': 'another optional quest takes place on windfall island during the night time play the song of passing a number of times and each time, glance towards the sky',
'role': 'user'},
{'content': 'food_or_drink: 0\ntags: \nfoods: \ndrinks:', 'role': 'system'}]}
Notice how we now have a "messages" key in our dataset samples.
# Create a train/test split
dataset = dataset["train"].train_test_split(test_size=0.2,
shuffle=False,
seed=42)
# Number #1 rule in machine learning
# Always train on the train set and test on the test set
# This gives us an indication of how our model will perform in the real world
datasetDatasetDict({
train: Dataset({
features: ['sequence', 'image_url', 'class_label', 'source', 'char_len', 'word_count', 'syn_or_real', 'uuid', 'gpt-oss-120b-label', 'gpt-oss-120b-label-condensed', 'target_food_names_to_use', 'caption_detail_level', 'num_foods', 'target_image_point_of_view', 'messages'],
num_rows: 1136
})
test: Dataset({
features: ['sequence', 'image_url', 'class_label', 'source', 'char_len', 'word_count', 'syn_or_real', 'uuid', 'gpt-oss-120b-label', 'gpt-oss-120b-label-condensed', 'target_food_names_to_use', 'caption_detail_level', 'num_foods', 'target_image_point_of_view', 'messages'],
num_rows: 284
})
})
11.2 Try the model with a pipeline
easy_sample = {"role": "user",
"content": "Hi my name is Daniel"}
def create_easy_sample(input):
template = {"role": "user", "content": input}
return templatefrom transformers import pipeline
# Load model and use it as a pipeline
pipe = pipeline("text-generation",
model=model,
tokenizer=tokenizer)
input_text = "Hi my name is Daniel. Please reply to me with a machine learning poem."
easy_sample = create_easy_sample(input=input_text)
input_prompt = pipe.tokenizer.apply_chat_template([easy_sample], # pipeline tokenizer wants a list of inputs
tokenize=False,
add_generation_prompt=True)
print(f"[INFO] This is the input prompt: {input_prompt}")
default_outputs = pipe(input_prompt,
max_new_tokens=512,
disable_compile=True)
print(f"[INFO] Input:\n{input_text}")
print()
print(f"[INFO] Output from {MODEL_NAME}:")
print()
print(default_outputs[0]["generated_text"][len(input_prompt):])Device set to use cuda:0
[INFO] This is the input prompt: <bos><start_of_turn>user
Hi my name is Daniel. Please reply to me with a machine learning poem.<end_of_turn>
<start_of_turn>model
[INFO] Input:
Hi my name is Daniel. Please reply to me with a machine learning poem.
[INFO] Output from google/gemma-3-270m-it:
Okay, Daniel! Here's a machine learning poem about the power of data:
A silent hum, a learning stream,
Data whispers, a vibrant dream.
From pixels bright to patterns deep,
Secrets that the data keeps to keep.
Each pattern a story untold,
Of connections, brave and bold.
Algorithms weave, a clever art,
To shape the world, a beating heart.
The world's data, a boundless sea,
Where insights bloom, and futures be.
With each data point, a new quest,
To understand, to be best blessed.
So let the data guide your way,
Through logic's logic, day and through day.
A powerful tool, a guiding hand,
To build a better, better land.
Example machine learning poem generated by Gemma 3 270M (not too bad):
Okay, Daniel, here's a machine learning poem. I've tried to capture a feeling of wonder and a bit of mystery.
The algorithm learns,
A silent, tireless quest.
Through data streams, it flows,
A symphony of thought.
Each point a new layer,
A learning bloom,
A future bright and clear.
It analyzes the data,
No single clue it knows.
It weaves a pattern true,
A story in the hue.
The world unfolds anew,
With subtle, complex view.
It's not just numbers,
But feeling, a soul.
A tapestry of grace,
A hopeful, vibrant space.
A learning, growing deep,
Secrets it will keep.
11.3 Try the model on one of our sequences
# Get a random sample
random_idx = get_random_idx(dataset["train"])
random_train_sample = dataset["train"][random_idx]
# Apply the chat template
input_prompt = pipe.tokenizer.apply_chat_template(conversation=random_train_sample["messages"][:1],
tokenize=False,
add_generation_prompt=True)
# Let's run the default model on our input
default_outputs = pipe(text_inputs=input_prompt, max_new_tokens=256)
# View and compare the outputs
print(f"[INFO] Input:\n{input_prompt}\n")
print(f"[INFO] Output:\n{default_outputs[0]['generated_text'][len(input_prompt):]}")[INFO] Input:
<bos><start_of_turn>user
A hand is holding a bottle of McCormick Foods Australia Pty Ltd's imported hot sauce. The label prominently displays the ingredients: reconstituted tomato puree (18%), distilled vinegar (6%), sugar, molasses (6%), maltodextrin, salt, spices (including black pepper 0.4%), corn starch, brown sugar, thickeners (guar gum, xanthan gum), onion, garlic, and smoke flavor. The nutrition information per 100g includes 464 kcal, 14g of fat, less than 1g of saturated fat, 233.3mg of sodium, and 15.4g of carbohydrates. The product is made in the USA and imported by McCormick Foods Australia Pty Ltd, located at 71 Fairbank Road, Clayton South, VIC, 3169, Australia. The best before date is March 11, 2026, and the bottle advises to shake it before pouring and store in the fridge after opening, not to use if the shrink wrap is broken or missing. The label also features a 'Born Here' badge from Texas.<end_of_turn>
<start_of_turn>model
[INFO] Output:
The product is a **hot sauce** made in the USA.
By default the model produces a fairly generic response.
This is expected and good.
It means the model has a good baseline understanding of language.
If it responded with pure garbage, we might have an uphill battle.
However, this response type is not what we want. We want our model to respond with structured data based on the input.
Good news is we can adjust the patterns in our model to do just that.
11.4 Let’s try to prompt the model
We want a model to extract food and drink items from text.
By default the model will just reply to any text input with a generic response.
However, we can try and get our ideal outputs via prompting.
prompt_instruction = """Given the following target input text from an image caption, please extract the food and drink items to a list.
If there are no food or drink items, return an empty list.
Return in the following format:
food_items: [item_1, item_2, item_3]
drink_items: [item_4, item_5]
For example:
Input text: Hello my name is Daniel.
Output:
food_items: []
drink_items: []
Example 2:
Input text: A plate of rice cakes, salmon, cottage cheese and small cherry tomatoes with a cup of tea.
Output:
food_items: ['rice cakes', 'salmon', 'cottage cheese', 'cherry tomatoes']
drink_items: ['cup of tea']
Return only the formatted output and nothing else.
Target input text: <targ_input_text>"""
def update_input_message_content(input):
original_content = input["messages"][:1][0]["content"]
new_content = prompt_instruction.replace("<targ_input_text>", original_content)
new_input = [{"content": new_content,
"role": "user"}]
return new_input
print(f'[INFO] Original content:\n{random_train_sample["messages"][:1][0]["content"]}')
print()
print(f'[INFO] New content with instructions in prompt:')
print(update_input_message_content(input=random_train_sample)[0]["content"])[INFO] Original content:
A hand is holding a bottle of McCormick Foods Australia Pty Ltd's imported hot sauce. The label prominently displays the ingredients: reconstituted tomato puree (18%), distilled vinegar (6%), sugar, molasses (6%), maltodextrin, salt, spices (including black pepper 0.4%), corn starch, brown sugar, thickeners (guar gum, xanthan gum), onion, garlic, and smoke flavor. The nutrition information per 100g includes 464 kcal, 14g of fat, less than 1g of saturated fat, 233.3mg of sodium, and 15.4g of carbohydrates. The product is made in the USA and imported by McCormick Foods Australia Pty Ltd, located at 71 Fairbank Road, Clayton South, VIC, 3169, Australia. The best before date is March 11, 2026, and the bottle advises to shake it before pouring and store in the fridge after opening, not to use if the shrink wrap is broken or missing. The label also features a 'Born Here' badge from Texas.
[INFO] New content with instructions in prompt:
Given the following target input text from an image caption, please extract the food and drink items to a list.
If there are no food or drink items, return an empty list.
Return in the following format:
food_items: [item_1, item_2, item_3]
drink_items: [item_4, item_5]
For example:
Input text: Hello my name is Daniel.
Output:
food_items: []
drink_items: []
Example 2:
Input text: A plate of rice cakes, salmon, cottage cheese and small cherry tomatoes with a cup of tea.
Output:
food_items: ['rice cakes', 'salmon', 'cottage cheese', 'cherry tomatoes']
drink_items: ['cup of tea']
Return only the formatted output and nothing else.
Target input text: A hand is holding a bottle of McCormick Foods Australia Pty Ltd's imported hot sauce. The label prominently displays the ingredients: reconstituted tomato puree (18%), distilled vinegar (6%), sugar, molasses (6%), maltodextrin, salt, spices (including black pepper 0.4%), corn starch, brown sugar, thickeners (guar gum, xanthan gum), onion, garlic, and smoke flavor. The nutrition information per 100g includes 464 kcal, 14g of fat, less than 1g of saturated fat, 233.3mg of sodium, and 15.4g of carbohydrates. The product is made in the USA and imported by McCormick Foods Australia Pty Ltd, located at 71 Fairbank Road, Clayton South, VIC, 3169, Australia. The best before date is March 11, 2026, and the bottle advises to shake it before pouring and store in the fridge after opening, not to use if the shrink wrap is broken or missing. The label also features a 'Born Here' badge from Texas.
# Apply the chat template
updated_input_prompt = update_input_message_content(input=random_train_sample)
input_prompt = pipe.tokenizer.apply_chat_template(conversation=updated_input_prompt,
tokenize=False,
add_generation_prompt=True)
# Let's run the default model on our input
default_outputs = pipe(text_inputs=input_prompt,
max_new_tokens=256)
# View and compare the outputs
print(f"[INFO] Input:\n{input_prompt}\n")
print(f"[INFO] Output:\n{default_outputs[0]['generated_text'][len(input_prompt):]}")[INFO] Input:
<bos><start_of_turn>user
Given the following target input text from an image caption, please extract the food and drink items to a list.
If there are no food or drink items, return an empty list.
Return in the following format:
food_items: [item_1, item_2, item_3]
drink_items: [item_4, item_5]
For example:
Input text: Hello my name is Daniel.
Output:
food_items: []
drink_items: []
Example 2:
Input text: A plate of rice cakes, salmon, cottage cheese and small cherry tomatoes with a cup of tea.
Output:
food_items: ['rice cakes', 'salmon', 'cottage cheese', 'cherry tomatoes']
drink_items: ['cup of tea']
Return only the formatted output and nothing else.
Target input text: A hand is holding a bottle of McCormick Foods Australia Pty Ltd's imported hot sauce. The label prominently displays the ingredients: reconstituted tomato puree (18%), distilled vinegar (6%), sugar, molasses (6%), maltodextrin, salt, spices (including black pepper 0.4%), corn starch, brown sugar, thickeners (guar gum, xanthan gum), onion, garlic, and smoke flavor. The nutrition information per 100g includes 464 kcal, 14g of fat, less than 1g of saturated fat, 233.3mg of sodium, and 15.4g of carbohydrates. The product is made in the USA and imported by McCormick Foods Australia Pty Ltd, located at 71 Fairbank Road, Clayton South, VIC, 3169, Australia. The best before date is March 11, 2026, and the bottle advises to shake it before pouring and store in the fridge after opening, not to use if the shrink wrap is broken or missing. The label also features a 'Born Here' badge from Texas.<end_of_turn>
<start_of_turn>model
[INFO] Output:
food_items: ['rice cakes','salmon', 'cottage cheese', 'cherry tomatoes']
drink_items: ['cup of tea']
# This is our input
print(random_train_sample["messages"][0]["content"])
print()
# This is our ideal output:
print(random_train_sample["messages"][1]["content"])A hand is holding a bottle of McCormick Foods Australia Pty Ltd's imported hot sauce. The label prominently displays the ingredients: reconstituted tomato puree (18%), distilled vinegar (6%), sugar, molasses (6%), maltodextrin, salt, spices (including black pepper 0.4%), corn starch, brown sugar, thickeners (guar gum, xanthan gum), onion, garlic, and smoke flavor. The nutrition information per 100g includes 464 kcal, 14g of fat, less than 1g of saturated fat, 233.3mg of sodium, and 15.4g of carbohydrates. The product is made in the USA and imported by McCormick Foods Australia Pty Ltd, located at 71 Fairbank Road, Clayton South, VIC, 3169, Australia. The best before date is March 11, 2026, and the bottle advises to shake it before pouring and store in the fridge after opening, not to use if the shrink wrap is broken or missing. The label also features a 'Born Here' badge from Texas.
food_or_drink: 1
tags: np, il, fi, fp
foods: hot sauce
drinks:
Okay looks like our small LLM doesn’t do what we want it to do… it starts to reply with Python text or unreliably extracts foods and drinks from text in a non-uniform format.
No matter, we can fine-tune it so it does our specific task!
12 Fine-tuning our model
Steps:
- Setup SFTConfig (Supervised Fine-tuning Config) - https://huggingface.co/docs/trl/v0.26.2/en/sft_trainer#trl.SFTConfig
- Use SFTTrainer to train our model on our supervised samples (from our dataset above) - https://huggingface.co/docs/trl/v0.26.2/en/sft_trainer#trl.SFTTrainer
# Setting up our SFTConfig
from trl import SFTConfig
torch_dtype = model.dtype
CHECKPOINT_DIR_NAME = "./checkpoint_models"
BASE_LEARNING_RATE = 5e-5
print(f"[INFO] Using dtype: {torch_dtype}")
print(f"[INFO] Using learning rate: {BASE_LEARNING_RATE}")
sft_config = SFTConfig(
output_dir=CHECKPOINT_DIR_NAME,
max_length=512,
packing=False,
num_train_epochs=3,
per_device_train_batch_size=16, # Note: you can change this depending on the amount of VRAM your GPU has
per_device_eval_batch_size=16,
gradient_checkpointing=False,
optim="adamw_torch_fused", # Note: if you try "adamw", you will get an error
logging_steps=1,
save_strategy="epoch",
eval_strategy="epoch",
learning_rate=BASE_LEARNING_RATE,
fp16=True if torch_dtype == torch.float16 else False,
bf16=True if torch_dtype == torch.float16 else False,
lr_scheduler_type="constant",
push_to_hub=False,
report_to=None
)
# There are a lot of settings in the sft_config, so feel free to uncomment this and inspect it if you want
#sft_config[INFO] Using dtype: torch.bfloat16
[INFO] Using learning rate: 5e-05
Config setup, now we can train our model with SFTTrainer!
# Supervised Fine-Tuning = provide input and desired output samples
from trl import SFTTrainer
# Create Trainer object
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
processing_class=tokenizer
)
trainer.train()The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 2, 'pad_token_id': 0}.
| Epoch | Training Loss | Validation Loss | Entropy | Num Tokens | Mean Token Accuracy |
|---|---|---|---|---|---|
| 1 | 1.761000 | 2.275403 | 2.180556 | 170969.000000 | 0.572724 |
| 2 | 1.500400 | 2.291959 | 1.943576 | 341938.000000 | 0.574516 |
| 3 | 1.542500 | 2.369414 | 1.751302 | 512907.000000 | 0.573615 |
TrainOutput(global_step=213, training_loss=1.8445255733991452, metrics={'train_runtime': 320.1481, 'train_samples_per_second': 10.645, 'train_steps_per_second': 0.665, 'total_flos': 873751312465920.0, 'train_loss': 1.8445255733991452, 'epoch': 3.0})
Woohoo! Looks like our training accuracy went up.
Let’s inspect the loss curves.
import matplotlib.pyplot as plt
# Access the log history
log_history = trainer.state.log_history
# Extract training / validation loss
train_losses = [log["loss"] for log in log_history if "loss" in log]
epoch_train = [log["epoch"] for log in log_history if "loss" in log]
eval_losses = [log["eval_loss"] for log in log_history if "eval_loss" in log]
epoch_eval = [log["epoch"] for log in log_history if "eval_loss" in log]
# Plot the training loss
plt.plot(epoch_train, train_losses, label="Training Loss")
plt.plot(epoch_eval, eval_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training and Validation Loss per Epoch")
plt.legend()
plt.grid(True)
plt.show()
# Save the model
trainer.save_model()# Remove all the checkpoint folders (since we've already saved the best model)
!rm -rf ./checkpoint_models/checkpoint-*/*
!rm -rf ./checkpoint_models/checkpoint-*huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
13 Load the trained model back in and see how it performs
We’ve now fine-tuned our own Gemma 3 270M to do a specific task, let’s load it back in and see how it performs.
CHECKPOINT_DIR_NAME'./checkpoint_models'
# Load the fine-tuned model and see how it goes
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load trained model
loaded_model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=CHECKPOINT_DIR_NAME,
dtype="auto",
device_map="auto",
attn_implementation="eager"
);loaded_model_pipeline = pipeline("text-generation",
model=loaded_model,
tokenizer=tokenizer)
loaded_model_pipelineDevice set to use cuda:0
<transformers.pipelines.text_generation.TextGenerationPipeline at 0xf3ee89376c60>
dataset["test"]Dataset({
features: ['sequence', 'image_url', 'class_label', 'source', 'char_len', 'word_count', 'syn_or_real', 'uuid', 'gpt-oss-120b-label', 'gpt-oss-120b-label-condensed', 'target_food_names_to_use', 'caption_detail_level', 'num_foods', 'target_image_point_of_view', 'messages'],
num_rows: 284
})
Let’s now perform inference with our fine-tuned model on a random sample from the test dataset (our model has never seen these samples).
# Get a random sample
random_test_idx = get_random_idx(dataset["test"])
random_test_sample = dataset["test"][random_test_idx]
# Apply the chat template
input_prompt = pipe.tokenizer.apply_chat_template(conversation=random_test_sample["messages"][:1],
tokenize=False,
add_generation_prompt=True)
# Let's run the default model on our input
default_outputs = loaded_model_pipeline(text_inputs=input_prompt,
max_new_tokens=256)
# View and compare the outputs
print(f"[INFO] Input:\n{input_prompt}\n")
print(f"[INFO] Output:\n{default_outputs[0]['generated_text'][len(input_prompt):]}")[INFO] Input:
<bos><start_of_turn>user
The image shows a close-up of a snack package with detailed nutrition information and ingredients. The package is orange with a black label that reads "SNACK GOOD™." The nutrition panel provides the following information:
- Servings per package: Approx. 12
- Serving size: 33g (3 Balls)
**Nutrition Information:**
- Energy: 620 kJ (148 Cal) per serving, 1,880 kJ (449 Cal) per 100g
- Protein: 4.4 g per serving, 13.4 g per 100g
- Gluten: Not detected
- Fat, total: 7.1 g per serving, 21.6 g per 100g
- Saturated: 1.6 g per serving, 4.8 g per 100g
- Carbohydrate: 14.9 g per serving, 45.1 g per 100g
- Sugars: 9.8 g per serving, 29.7 g per 100g
- Dietary fibre: 3.6 g per serving, 11.0 g per 100g
- Sodium: 90 mg per serving, 274 mg per 100g
**Ingredients:**
- Dates, Peanuts, Cacao Powder (7%), Peanut Butter (5%), Sea Salt, Coconut.
**Contains:**
- Peanuts
**May Contain:**
- Tree Nuts, Soy and Milk, Fruit Pit or Nut Shell Fragments.
**Storage Instructions:**
- Please store in a dark, cool, dry place.
**Additional Information:**
- Made in New Zealand from imported and local ingredients.
- Recycle at store.
The package is displayed on a shelf, with other similar packages visible in the background.<end_of_turn>
<start_of_turn>model
[INFO] Output:
food_or_drink: 1
tags: np, il, fi, fp
foods: Dates, Peanuts, Cacao Powder, Peanut Butter, Sea Salt, Coconut
drinks: Peanuts, Tree Nuts, Soy and Milk, Fruit Pit or Nut Shell Fragments
print(random_test_sample["gpt-oss-120b-label-condensed"])food_or_drink: 1
tags: np, il, fi, fp
foods: Dates, Peanuts, Cacao Powder, Peanut Butter, Sea Salt, Coconut, Tree Nuts, Soy, Milk, Fruit Pit, Nut Shell Fragments
drinks:
14 Counting the number of parameters in our model
def get_model_num_params(model):
"""
Returns the number of trainable, non-trainable and total parameters of a PyTorch model.
"""
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
non_trainable_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)
total_params = trainable_params + non_trainable_params
return {"trainable_params": trainable_params,
"non_trainable_params": non_trainable_params,
"total_params": total_params}
# Get parameters of our fine-tuned model
model_params = get_model_num_params(loaded_model)
print(f"Trainable parameters: {model_params['trainable_params']:,}")
print(f"Non-trainable parameters: {model_params['non_trainable_params']:,}")
print(f"Total parameters: {model_params['total_params']:,}")Trainable parameters: 268,098,176
Non-trainable parameters: 0
Total parameters: 268,098,176
# Our model is 270M parameters, GPT-OSS-120B is 120B parameters
120_000_000_000 / 270_000_000444.44444444444446
By fine-tuning Gemma 3 270M we distill the capabilities of a 120B parameter model into a model 444x smaller.
15 Uploading our fine-tuned model to the Hugging Face Hub
We can upload our fine-tuned model to the Hugging Face Hub so other people can use it and test it out.
First, let’s load it in ourselves and confirm it works how we’d like it to.
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
MODEL_PATH = "./checkpoint_models/"
# Load the model into a pipeline
loaded_model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=MODEL_PATH,
dtype="auto",
device_map="auto",
attn_implementation="eager"
)
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH
)
# Create a pipeline from the loaded model
loaded_model_pipeline = pipeline("text-generation",
model=loaded_model,
tokenizer=tokenizer)
# Test the loaded model on raw text (this won't work as well as formatted text)
test_input_message = "Hello my name is Daniel!"
loaded_model_pipeline(test_input_message)The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The tokenizer you are loading from './checkpoint_models/' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
Device set to use cuda:0
[{'generated_text': 'Hello my name is Daniel! We are an American-owned, family-owned business that provides world-class cooking services and culinary arts education to students from all over the world. We focus on providing high-quality, hands-on culinary experiences that go beyond simple recipes. From cooking classes and culinary arts workshops to cooking competitions and culinary school programs, Daniel offers an unparalleled culinary journey for every student. We believe that food should be more than just a source of energy; it should be a catalyst for creativity and self-discovery. With Daniel, you can unlock your culinary potential and discover your passion. Visit our website at www.danespot.com for more information.'}]
Let’s create a helper function to format our input text into message format.
def format_message(input):
return [{"role": "user", "content": input}]
input_formatted = format_message(input=test_input_message)
input_formatted[{'role': 'user', 'content': 'Hello my name is Daniel!'}]
Now we can turn it into a prompt with our tokenizer and the apply_chat_template method.
input_prompt = loaded_model_pipeline.tokenizer.apply_chat_template(conversation=input_formatted,
tokenize=False,
add_generation_prompt=True)
input_prompt'<bos><start_of_turn>user\nHello my name is Daniel!<end_of_turn>\n<start_of_turn>model\n'
Beautiful! Time to run our fine-tuned model!
loaded_model_outputs = loaded_model_pipeline(text_inputs=input_prompt,
max_new_tokens=256)
# View and compare the outputs
print(f"[INFO] Input:\n{input_prompt}\n")
print(f"[INFO] Output:\n{loaded_model_outputs[0]['generated_text'][len(input_prompt):]}")[INFO] Input:
<bos><start_of_turn>user
Hello my name is Daniel!<end_of_turn>
<start_of_turn>model
[INFO] Output:
food_or_drink: 0
tags:
foods:
drinks:
Okay let’s make another helper function to predict on any given sample input.
We’ll also return the inference time of our model so we can see how long things take.
import time
def pred_on_text(input_text):
start_time = time.time()
raw_output = loaded_model_pipeline(text_inputs=[{"role": "user",
"content": input_text}],
max_new_tokens=256,
disable_compile=True)
end_time = time.time()
total_time = round(end_time - start_time, 4)
generated_text = raw_output[0]["generated_text"][1]["content"]
return generated_text, raw_output, total_time
pred_on_text(input_text="British Breakfast with baked beans, fried eggs, black pudding, sausages, bacon, mushrooms, a cup of tea and toast and fried tomatoes")('food_or_drink: 1\ntags: fi, di\nfoods: baked beans, fried eggs, black pudding, sausages, bacon, mushrooms, toast, fried tomatoes\ndrinks: tea',
[{'generated_text': [{'role': 'user',
'content': 'British Breakfast with baked beans, fried eggs, black pudding, sausages, bacon, mushrooms, a cup of tea and toast and fried tomatoes'},
{'role': 'assistant',
'content': 'food_or_drink: 1\ntags: fi, di\nfoods: baked beans, fried eggs, black pudding, sausages, bacon, mushrooms, toast, fried tomatoes\ndrinks: tea'}]}],
0.6088)
Nice! Looks like our model is working well enough (of course we could always improve it over time with more testing and different samples).
Let’s upload it to the Hugging Face Hub.
We can do so using the huggingface_hub library.
from huggingface_hub import HfApi, create_repo
api = HfApi()
# Give our model a name (this is in the format [Hugging Face Username]/[Target Model Name]
repo_id = "mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1"
# Create the repo
create_repo(repo_id,
repo_type="model",
exist_ok=True)
# Upload the entire model folder containing our model files
api.upload_folder(
folder_path="./checkpoint_models/",
repo_id=repo_id,
repo_type="model"
)CommitInfo(commit_url='https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1/commit/7707671b42ca5333905062e89b33020ab5812df0', commit_message='Upload folder using huggingface_hub', commit_description='', oid='7707671b42ca5333905062e89b33020ab5812df0', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1', endpoint='https://huggingface.co', repo_type='model', repo_id='mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1'), pr_revision=None, pr_num=None)
Woohoo! Our model is officially on the Hugging Face Hub.
You can download it here: https://huggingface.co/mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1/tree/main
Now not only can we redownload it and use it again, others can download it and use it for themselves (of course you can make the model private if you like too).
16 Turn our model into a demo
Right now our model seems to be working quite well for our specific use case.
However, it takes some coding to be able to use it.
What if we wanted to allow someone who wasn’t familiar with programming to try it out?
To do so, we can turn our model into a Gradio demo and upload it to Hugging Face Spaces (a place to share all kinds of small applications).
Gradio allows us to turn our model into an easy to use and sharable demo anyone can try.
Gradio demos work on the premise of: input (text) -> function (our model) -> output (text)
We’ve already go a function ready with pred_on_text so we can wrap this with some Gradio code.
To create a sharable demo, we’ll need the following files:
app.py- Entry point for our app, all of our application code will go in here.README.md- Tells people what our app does.- Note: Hugging Face Spaces use a special “front matter” (text at the start of a
README.mdfile) to add various attributes to a Hugging Face Space, we’ll see this below.
- Note: Hugging Face Spaces use a special “front matter” (text at the start of a
requirements.txt- Tells Hugging Face Spaces what our app requires.torch,transformers,gradio,accelerate
Let’s make a folder to store our demo application.
!mkdir demos/
!mkdir demos/FoodExtracthuggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
16.1 Creating the app.py file
When running our app on Hugging Face Spaces, we have the option to run our model on a GPU thanks to Hugging Face’s ZeroGPU feature.
This is optional, however, it’s highly recommend you run a model such as Gemma 3 270M on the GPU as we’ll see significant speedups.
You can run a function on a GPU by importing spaces and then using the @spaces.GPU decorator on your target function.
import spaces
@spaces.GPU
def function_to_run_on_the_gpu():
passTo ensure your model runs on the GPU, be sure to select a ZeroGPU instance in your Hugging Face Space settings.
%%writefile demos/FoodExtract/app.py
# Load dependencies
import time
import transformers
import torch
import spaces # Optional: run our model on the GPU (this will be much faster inference)
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
@spaces.GPU # Optional: run our model on the GPU (this will be much faster inference)
def pred_on_text(input_text):
start_time = time.time()
raw_output = loaded_model_pipeline(text_inputs=[{"role": "user",
"content": input_text}],
max_new_tokens=256,
disable_compile=True)
end_time = time.time()
total_time = round(end_time - start_time, 4)
generated_text = raw_output[0]["generated_text"][1]["content"]
return generated_text, raw_output, total_time
# Load the model (from our Hugging Face Repo)
# Note: You may have to replace my username `mrdbourke` for your own
MODEL_PATH = "mrdbourke/FoodExtract-gemma-3-270m-fine-tune-v1"
# Load the model into a pipeline
loaded_model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=MODEL_PATH,
dtype="auto",
device_map="auto",
attn_implementation="eager"
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH
)
# Create model pipeline
loaded_model_pipeline = pipeline("text-generation",
model=loaded_model,
tokenizer=tokenizer)
# Create the demo
description = """Extract food and drink items from text with a fine-tuned SLM (Small Language Model) or more specifically a fine-tuned [Gemma 3 270M](https://huggingface.co/google/gemma-3-270m-it).
Our model has been fine-tuned on the [FoodExtract-1k dataset](https://huggingface.co/datasets/mrdbourke/FoodExtract-1k).
* Input (str): Raw text strings or image captions (e.g. "A photo of a dog sitting on a beach" or "A breakfast plate with bacon, eggs and toast")
* Output (str): Generated text with food/not_food classification as well as noun extracted food and drink items and various food tags.
For example:
* Input: "For breakfast I had eggs, bacon and toast and a glass of orange juice"
* Output:
```
food_or_drink: 1
tags: fi, di
foods: eggs, bacon, toast
drinks: orange juice
```
"""
demo = gr.Interface(fn=pred_on_text,
inputs=gr.TextArea(lines=4, label="Input Text"),
outputs=[gr.TextArea(lines=4, label="Generated Text"),
gr.TextArea(lines=7, label="Raw Output"),
gr.Number(label="Generation Time (s)")],
title="🍳 Structured FoodExtract with a Fine-Tuned Gemma 3 270M",
description=description,
examples=[["Hello world! This is my first fine-tuned LLM!"],
["A plate of food with grilled barramundi, salad with avocado, olives, tomatoes and Italian dressing"],
["British Breakfast with baked beans, fried eggs, black pudding, sausages, bacon, mushrooms, a cup of tea and toast and fried tomatoes"],
["Steak tacos"],
["A photo of a dog sitting on a beach"]]
)
if __name__ == "__main__":
demo.launch(share=False)Writing demos/FoodExtract/app.py
16.2 Create the README.md file
The README.md file will tell people what our app does.
We could add more information here if we wanted to but for now we’ll keep it simple.
Notice the special text at the top of the file below (the text between the ---), these are some settings for the Space, you can see the settings for these in the docs.
%%writefile demos/FoodExtract/README.md
---
title: FoodExtract Fine-tuned LLM Structued Data Extractor
emoji: 📝➡️🍟
colorFrom: green
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
---
"""
Fine-tuned Gemma 3 270M to extract food and drink items from raw text.
Input can be any form of real text and output will be a formatted string such as the following:
```
food_or_drink: 1
tags: fi, re
foods: tacos, milk, red apple, pineapple, cherries, fried chicken, steak, mayonnaise
drinks: iced latte, matcha latte
```
The tags map to the following items:
```
tags_dict = {'np': 'nutrition_panel',
'il': 'ingredient list',
'me': 'menu',
're': 'recipe',
'fi': 'food_items',
'di': 'drink_items',
'fa': 'food_advertistment',
'fp': 'food_packaging'}
```
You can see walkthrough step by step code details at: https://www.learnhuggingface.com/notebooks/hugging_face_llm_full_fine_tune_tutorial
"""Overwriting demos/FoodExtract/README.md
16.3 Create the requirements.txt file
This will tell the Hugging Face Space what libraries we’d like it to run inside.
%%writefile demos/FoodExtract/requirements.txt
transformers
gradio
torch
accelerateOverwriting demos/FoodExtract/requirements.txt
16.4 Upload our demo to the Hugging Face Hub
We can upload our demo to the Hugging Face Hub in a similar way to uploading our model.
We could also upload it file by file via the Hugging Face Spaces interface.
But let’s stick to the code-first approach.
# 1. Import the required methods for uploading to the Hugging Face Hub
from huggingface_hub import (
create_repo,
get_full_repo_name,
upload_file, # for uploading a single file (if necessary)
upload_folder # for uploading multiple files (in a folder)
)
# 2. Define the parameters we'd like to use for the upload
LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD = "demos/FoodExtract/"
HF_TARGET_SPACE_NAME = "FoodExtract-v1"
HF_REPO_TYPE = "space" # we're creating a Hugging Face Space
HF_SPACE_SDK = "gradio"
HF_TOKEN = "" # optional: set to your Hugging Face token (but I'd advise storing this as an environment variable as previously discussed)
# 3. Create a Space repository on Hugging Face Hub
print(f"[INFO] Creating repo on Hugging Face Hub with name: {HF_TARGET_SPACE_NAME}")
create_repo(
repo_id=HF_TARGET_SPACE_NAME,
# token=HF_TOKEN, # optional: set token manually (though it will be automatically recognized if it's available as an environment variable)
repo_type=HF_REPO_TYPE,
private=False, # set to True if you don't want your Space to be accessible to others
space_sdk=HF_SPACE_SDK,
exist_ok=True, # set to False if you want an error to raise if the repo_id already exists
)
# 4. Get the full repository name (e.g. {username}/{model_id} or {username}/{space_name})
full_hf_repo_name = get_full_repo_name(model_id=HF_TARGET_SPACE_NAME)
print(f"[INFO] Full Hugging Face Hub repo name: {full_hf_repo_name}")
# 5. Upload our demo folder
print(f"[INFO] Uploading {LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD} to repo: {full_hf_repo_name}")
folder_upload_url = upload_folder(
repo_id=full_hf_repo_name,
folder_path=LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD,
path_in_repo=".", # upload our folder to the root directory ("." means "base" or "root", this is the default)
# token=HF_TOKEN, # optional: set token manually
repo_type=HF_REPO_TYPE,
commit_message="Uploading FoodExtract demo app.py"
)
print(f"[INFO] Demo folder successfully uploaded with commit URL: {folder_upload_url}")[INFO] Creating repo on Hugging Face Hub with name: FoodExtract-v1
[INFO] Full Hugging Face Hub repo name: mrdbourke/FoodExtract-v1
[INFO] Uploading demos/FoodExtract/ to repo: mrdbourke/FoodExtract-v1
[INFO] Demo folder successfully uploaded with commit URL: https://huggingface.co/spaces/mrdbourke/FoodExtract-v1/tree/main/.
Nice!
It looks like our demo upload worked!
We can try it out via the URL (in my case, it’s: https://huggingface.co/spaces/mrdbourke/FoodExtract-v1). And we can even embed it right in our notebook.
from IPython.display import HTML
html_code = """<iframe
src="https://mrdbourke-foodextract-v1.hf.space"
frameborder="0"
width="850"
height="1000"
></iframe>"""
display(HTML(html_code))How cool!
We’ve now got a sharable demo of a fine-tuned LLM which anyone can try out for themselves.
17 Bonus: Speeding up our model with batched inference
Right now our model only inferences on one sample at a time but as is the case with many machine learning models, we could perform inference on multiple samples (also referred to as a batch) to significantly improve throughout.
In batched inference mode, your model performs predictions on X number of samples at once, this can dramatically improve sample throughput.
The number of samples you can predict on at once will depend on a few factors:
- The size of your model (e.g. if your model is quite large, it may only be able to predict on 1 sample at time)
- The size of your compute VRAM (e.g. if your compute VRAM is already saturated, add multiple samples at a time may result in errors)
- The size of your samples (if one of your samples is 100x the size of others, this may cause errors with batched inference)
To find an optimal batch size for our setup, we can run an experiment:
- Loop through different batch sizes and measure the throughput for each batch size.
- Why do we do this?
- It’s hard to tell the ideal batch size ahead of time.
- So we experiment from say 1, 2, 4, 8, 16, 32, 64 batch sizes and see which performs best.
- Just because we may get a speed up from using batch size 8, doesn’t mean 64 will be better.
- Why do we do this?
from datasets import load_dataset
dataset = load_dataset("mrdbourke/FoodExtract-1k")
print(f"[INFO] Number of samples in the dataset: {len(dataset['train'])}")
def sample_to_conversation(sample):
return {
"messages": [
{"role": "user", "content": sample["sequence"]}, # Load the sequence from the dataset
{"role": "system", "content": sample["gpt-oss-120b-label-condensed"]} # Load the gpt-oss-120b generated label
]
}
# Map our sample_to_conversation function to dataset
dataset = dataset.map(sample_to_conversation,
batched=False)
# Create a train/test split
dataset = dataset["train"].train_test_split(test_size=0.2,
shuffle=False,
seed=42)
# Number #1 rule in machine learning
# Always train on the train set and test on the test set
# This gives us an indication of how our model will perform in the real world
dataset[INFO] Number of samples in the dataset: 1420
DatasetDict({
train: Dataset({
features: ['sequence', 'image_url', 'class_label', 'source', 'char_len', 'word_count', 'syn_or_real', 'uuid', 'gpt-oss-120b-label', 'gpt-oss-120b-label-condensed', 'target_food_names_to_use', 'caption_detail_level', 'num_foods', 'target_image_point_of_view', 'messages'],
num_rows: 1136
})
test: Dataset({
features: ['sequence', 'image_url', 'class_label', 'source', 'char_len', 'word_count', 'syn_or_real', 'uuid', 'gpt-oss-120b-label', 'gpt-oss-120b-label-condensed', 'target_food_names_to_use', 'caption_detail_level', 'num_foods', 'target_image_point_of_view', 'messages'],
num_rows: 284
})
})
# Step 1: Need to turn our samples into batches (e.g. lists of samples)
print(f"[INFO] Formatting test samples into list prompts...")
test_input_prompts = [
loaded_model_pipeline.tokenizer.apply_chat_template(
item["messages"][:1],
tokenize=False,
add_generation_prompt=True
)
for item in dataset["test"]
]
print(f"[INFO] Number of test sample prompts: {len(test_input_prompts)}")
test_input_prompts[0][INFO] Formatting test samples into list prompts...
[INFO] Number of test sample prompts: 284
'<bos><start_of_turn>user\nLiving Planet Goat Milk Whole Milk, 1 Litre, GMO Free, Australian Dairy, 8.75g Protein Per Serve, Good Source of Calcium.<end_of_turn>\n<start_of_turn>model\n'
# Step 2: Need to perform batched inference and time each step
import time
from tqdm.auto import tqdm
all_outputs = []
# Let's write a list of batch sizes to test
chunk_sizes_to_test = [1, 4, 8, 16, 32, 64, 128]
timing_dict = {}
# Loop through each batch size and time the inference
for CHUNK_SIZE in chunk_sizes_to_test:
print(f"[INFO] Making predictions with batch size: {CHUNK_SIZE}")
start_time = time.time()
for chunk_number in tqdm(range(round(len(test_input_prompts) / CHUNK_SIZE))):
batched_inputs = test_input_prompts[(CHUNK_SIZE * chunk_number): CHUNK_SIZE * (chunk_number + 1)]
batched_outputs = loaded_model_pipeline(text_inputs=batched_inputs,
batch_size=CHUNK_SIZE,
max_new_tokens=256,
disable_compile=True)
all_outputs += batched_outputs
end_time = time.time()
total_time = end_time - start_time
timing_dict[CHUNK_SIZE] = total_time
print()
print(f"[INFO] Total time for batch size {CHUNK_SIZE}: {total_time:.2f}s")
print("="*80 + "\n\n")[INFO] Making predictions with batch size: 1
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
[INFO] Total time for batch size 1: 144.40s
================================================================================
[INFO] Making predictions with batch size: 4
[INFO] Total time for batch size 4: 52.49s
================================================================================
[INFO] Making predictions with batch size: 8
[INFO] Total time for batch size 8: 36.89s
================================================================================
[INFO] Making predictions with batch size: 16
[INFO] Total time for batch size 16: 29.80s
================================================================================
[INFO] Making predictions with batch size: 32
[INFO] Total time for batch size 32: 33.65s
================================================================================
[INFO] Making predictions with batch size: 64
[INFO] Total time for batch size 64: 31.30s
================================================================================
[INFO] Making predictions with batch size: 128
[INFO] Total time for batch size 128: 48.09s
================================================================================
Batched inference complete! Let’s make a plot comparing different batch sizes.
import matplotlib.pyplot as plt
# Data
data = timing_dict
total_samples = len(dataset["test"])
batch_sizes = list(data.keys())
inference_times = list(data.values())
samples_per_second = [total_samples / time for bs, time in data.items()]
# Create side-by-side plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# --- Left plot: Total Inference Time ---
ax1.bar([str(bs) for bs in batch_sizes], inference_times, color='steelblue')
ax1.set_xlabel('Batch Size')
ax1.set_ylabel('Total Inference Time (s)')
ax1.set_title('Inference Time by Batch Size')
for i, v in enumerate(inference_times):
ax1.text(i, v + 1, f'{v:.1f}', ha='center', fontsize=9)
# --- ARROW LOGIC (Left) ---
# 1. Identify Start (Slowest) and End (Fastest)
start_val = max(inference_times)
end_val = min(inference_times)
start_idx = inference_times.index(start_val)
end_idx = inference_times.index(end_val)
speedup = start_val / end_val
# 2. Draw Arrow (No Text)
# connectionstyle "rad=-0.3" arcs the arrow upwards
ax1.annotate("",
xy=(end_idx, end_val+(0.5*end_val)),
xytext=(start_idx+0.25, start_val+10),
arrowprops=dict(arrowstyle="->", color='green', lw=1.5, connectionstyle="arc3,rad=-0.3"))
# 3. Place Text at Midpoint
mid_x = (start_idx + end_idx) / 2
# Place text slightly above the highest point of the two bars
text_y = max(start_val, end_val) + (max(inference_times) * 0.1)
ax1.text(mid_x+0.5, text_y-150, f"{speedup:.1f}x speedup",
ha='center', va='bottom', fontweight='bold',
bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.8))
ax1.set_ylim(0, max(inference_times) * 1.35) # Increase headroom for text
# --- Right plot: Samples per Second ---
ax2.bar([str(bs) for bs in batch_sizes], samples_per_second, color='coral')
ax2.set_xlabel('Batch Size')
ax2.set_ylabel('Samples per Second')
ax2.set_title('Throughput by Batch Size')
for i, v in enumerate(samples_per_second):
ax2.text(i, v + 0.05, f'{v:.2f}', ha='center', fontsize=9)
# --- ARROW LOGIC (Right) ---
# 1. Identify Start (Slowest) and End (Fastest)
start_val_t = min(samples_per_second)
end_val_t = max(samples_per_second)
start_idx_t = samples_per_second.index(start_val_t)
end_idx_t = samples_per_second.index(end_val_t)
speedup_t = end_val_t / start_val_t
# 2. Draw Arrow (No Text)
ax2.annotate("",
xy=(end_idx_t-(0.05*end_idx_t), end_val_t+(0.025*end_val_t)),
xytext=(start_idx_t, start_val_t+0.6),
arrowprops=dict(arrowstyle="->", color='green', lw=1.5, connectionstyle="arc3,rad=-0.3"))
# 3. Place Text at Midpoint
mid_x_t = (start_idx_t + end_idx_t) / 2
text_y_t = max(start_val_t, end_val_t) + (max(samples_per_second) * 0.1)
ax2.text(mid_x_t-0.5, text_y_t-4.5, f"{speedup_t:.1f}x speedup",
ha='center', va='bottom', fontweight='bold',
bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.8))
ax2.set_ylim(0, max(samples_per_second) * 1.35) # Increase headroom
plt.suptitle("Inference with Fine-Tuned Gemma 3 270M on NVIDIA DGX Spark")
plt.tight_layout()
plt.savefig('inference_benchmark.png', dpi=150)
plt.show()
We get a 4-6x speedup when using batches!
At 10 samples per second, that means we can inference on ~800k samples in a day.
samples_per_second = round(len(dataset["test"]) / min(timing_dict.values()), 2)
seconds_in_a_day = 86_400
samples_per_day = seconds_in_a_day * samples_per_second
print(f"[INFO] Number of samples per second: {samples_per_second} | Number of samples per day: {samples_per_day}")[INFO] Number of samples per second: 9.53 | Number of samples per day: 823392.0
18 Bonus: Performing evaluations on our model
We can evaluate our model directly against the original labels from gpt-oss-120b and see how it stacks up.
For example, we could compare the following:
- F1 Score
- Precision
- Recall
- Pure accuracy
Have these metrics as well as samples which are different to the ground truth would allow us to further explore where our model needs improvements.
19 Next steps
There are several avenues we could approach next.
Improve the data sampling - if our model makes mistakes, could we improve the input data? For example, more samples on certain tasks where the model is weaker.
Quantisation - right now our model is ~500MB, could we make this smaller with the help of INT8 quantisation? (this could also help speed up inference time)
Output compression - our output is compressed to a simple YAML-like format, but is there a better compression option to generate even less tokens?
Test the model on a large unseen sample - practice using the model on a large corpus of image captions (e.g. 10,000+) and randomly inspect them to find failure cases to boost the model performance