Gemma is a collection of lightweight, state-of-the-art open models built with the same technology that powers our Gemini models. Available in a range of sizes, they can be adapted and run by anyone on their own infrastructure. This combination of performance and accessibility has resulted in more than 250 million downloads and 85,000 published community variants for a wide range of tasks and domains.
You don’t need expensive hardware to create highly specialized custom models. The compact size of Gemma 3 270M lets you quickly fine-tune it for new use cases and then deploy it on-device, giving you flexibility in model development and full control over a powerful tool.
To show how simple this is, this article walks through an example of training your own model to translate text to emoji and testing it in a web application. You can even teach it the specific emojis you use in real life, resulting in a personal emoji generator. Try it in the live demo.
We’ll walk you through the process of creating a task-specific model in less than an hour. You will learn how to:
- Fine-tune the model: Train Gemma 3 270M on a custom dataset to create a personal “emoji translator”
- Quantize and convert the model: Optimize the model for on-device inference, reducing its memory footprint to less than 300 MB
- Deploy it in a web application: Run the model client-side in a simple web application using MediaPipe or Transformers.js
Step 1: Customize Model Behavior Using Fine-Tuning
Out of the box, LLMs are generalists. If you ask Gemma to translate text to emoji, you might get more than you asked for, like conversational filler.
Prompt:
Translate the following text into a creative combination of 3-5 emojis: “what a fun party”

Model output (example):
Of course! Here is your emoji: 🥳🎉🎈
For our app, Gemma should output only emojis. Although you can try complex prompt engineering, the most reliable way to enforce a specific output format and teach the model new knowledge is fine-tuning on example data. So, to teach the model to use specific emojis, you need to train it on a dataset of example text and emoji pairs.
Models learn better the more examples you provide, so you can easily make your dataset more robust by prompting an AI to generate different text sentences for the same emoji output. For fun, we did this with emojis that we associate with pop songs and fandoms:
If you want the model to remember specific emoji, provide more examples in the dataset.
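As a concrete illustration, here is a minimal sketch of what such a dataset could look like as JSONL prompt/completion pairs. The file name, field names, and example rows are placeholders; match them to whatever format the fine-tuning notebook expects.

import json

# Hypothetical examples and file name; swap in the emojis you actually use.
examples = [
    {"prompt": "Translate this text to emoji: what a fun party!", "completion": "🥳🎉🎈"},
    {"prompt": "Translate this text to emoji: heading to the gym", "completion": "🏋️💪🔥"},
    {"prompt": "Translate this text to emoji: pizza night with friends", "completion": "🍕🫶🌙"},
]

# Write one JSON object per line (JSONL), keeping the emoji characters as-is.
with open("emoji_dataset.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")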
Fine-tuning a model used to require huge amounts of VRAM. However, with Quantized Low-Rank Adaptation (QLoRA), a parameter-efficient fine-tuning (PEFT) technique, we only update a small number of weights. This significantly reduces memory requirements, allowing you to fine-tune Gemma 3 270M in minutes using the free T4 GPU acceleration in Google Colab.
Start with a sample dataset or fill in the template with your own emojis. You can then run the fine-tuning notebook to load the dataset, train the model, and test your new model’s performance against the original.
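For a rough idea of what happens inside the notebook, here is a minimal QLoRA sketch using the Hugging Face transformers, peft, and trl libraries. The model id, hyperparameters, and dataset file are assumptions, and the official Colab may differ in its details.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "google/gemma-3-270m-it"  # assumed instruction-tuned checkpoint id

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Train only small low-rank adapter matrices; the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Prompt/completion pairs like the ones sketched above (hypothetical file name).
dataset = load_dataset("json", data_files="emoji_dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
    args=SFTConfig(
        output_dir="gemma-emoji-lora",
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
)
trainer.train()

After training, generate from both the adapted model and the original base model on a held-out sentence to confirm the fine-tuned version now returns emoji only.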
Step 2: Quantize and convert the model for the web
Now that you have a custom model, what can you do with it? Since we typically use emojis on mobile devices or computers, it makes sense to deploy your model in an on-device app.
The original model, although small, is still more than 1 GB. To ensure a fast-loading user experience, we need to shrink it. We can do this with quantization, a process that reduces the precision of the model’s weights (e.g., from 16 bits per weight down to 4). This significantly reduces file size with minimal impact on performance for many tasks.
Smaller models mean a faster-loading app and a better experience for end users.
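As a back-of-the-envelope illustration (weights only; real files add tokenizer data and runtime overhead, so actual sizes differ), you can see how precision drives the download size:

# Rough weights-only size estimate for a ~270M-parameter model.
params = 270_000_000

for bits in (32, 16, 8, 4):
    size_mb = params * bits / 8 / 1_000_000
    print(f"{bits:>2}-bit weights: ~{size_mb:,.0f} MB")

# Roughly 1,080 MB at 32-bit, 540 MB at 16-bit, and 135 MB at 4-bit:
# quantization is what brings the download under the ~300 MB target.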
To prepare your model for a web application, quantize and convert it in a single step using either the LiteRT conversion notebook for use with MediaPipe or the ONNX conversion notebook for use with Transformers.js. These frameworks make it possible to run LLMs client-side in the browser by leveraging WebGPU, a modern web API that gives applications access to a local device’s hardware for compute, eliminating the need for complex server setups and per-call inference costs.
Step 3: Run the model in the browser
You can now run your custom model directly in the browser! Download our web application example and modify a line of code to plug in your new model.
MediaPipe and Transformers.js make this simple. Here is an example of an inference task running with MediaPipe in a web worker:
import { FilesetResolver, LlmInference } from '@mediapipe/tasks-genai';

// Initialize the MediaPipe LLM Inference task
// (the WASM asset path below is the package's published CDN location)
const genai = await FilesetResolver.forGenAiTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm'
);
const llmInference = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: 'path/to/yourmodel.task' } // your converted model
});

// Format the prompt and generate a response
const prompt = `Translate this text to emoji: what a fun party!`;
const response = await llmInference.generateResponse(prompt);
Once the model is cached on the user’s device, subsequent queries run locally with low latency, user data remains completely private, and your app even works offline.
Do you like your app? Share it by uploading it to Hugging Face Spaces (just like the demo).
What’s next
You don’t need to be an AI expert or data scientist to create a specialized AI model. You can improve a Gemma model’s performance on your task with a relatively small dataset, and it takes minutes, not hours.
We hope you will be inspired to create your own model variations. Using these techniques, you can build powerful AI applications that are not only personalized to your needs, but also deliver a superior user experience: fast, private, and accessible to anyone, anywhere.
The full source code and resources for this project are available to help you get started:
- Efficiently fine-tune Gemma with QLoRA in Colab
- Convert Gemma 3 270M for use with MediaPipe LLM Inference API in Colab
- Convert Gemma 3 270M for use with Transformers.js in Colab
- Download the demo code on GitHub
- Check out more Web AI demos in the Gemma Cookbook and on chrome.dev
- Learn more about the Gemma 3 model family and its on-device capabilities