
MamayLM v1.0
Creating a high-performing multimodal LLM for Ukrainian and English

MamayLM Thumbnail

MamayLM can now see! We are releasing MamayLM v1.0, the best-performing efficient Ukrainian language model that surpasses all similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models.

We are delighted to announce the release of MamayLM v1.0, a new state-of-the-art LLM targeting the Ukrainian language. We are releasing the model in two sizes - 4B and 12B - both of which are cost-efficient, fast, multimodal and can run on a single GPU, yet are effective in both Ukrainian and English. The model's strong capabilities outpace open models of similar size in both languages, while matching or comparing favourably against much larger models. MamayLM is the result of a collaboration between researchers at INSAIT and ETH Zurich.

The new version has the following updates:

  1. Stronger base model: Using Gemma 3 models as the base model provides improved performance and capabilities for Ukrainian language tasks.
  2. Multimodality: The model handles multiple modalities, including text and images, enabling a wider range of applications and use cases in both English and Ukrainian.
  3. Longer context: The model supports longer context lengths, allowing it to better understand and generate text with more complex dependencies and relationships.

Enriching the Training Data for Ukrainian

In v0.1 we successfully adapted the Gemma 2 model to Ukrainian, based on our earlier research on language transfer. Now, taking Gemma 3 as the base model, with its even more powerful multilingual (and multimodal!) capabilities, we applied a similar pipeline of data curation, continual pretraining and instruction fine-tuning, with notable improvements at various stages, to adapt Gemma 3 4B and 12B to Ukrainian using a total of 81B tokens of Ukrainian and English text.

Pre-Training Phase

In the previous version, our Ukrainian pre-training data was based on the FineWeb2, Malyuk, and CulturaX datasets. For the current v1.0 release, we switched to the Kobza dataset, which builds upon the same sources while integrating HPLT. Kobza also includes fuzzy deduplication and leverages a wider range of web data, as HPLT follows a different pipeline and collects multilingual content from diverse sources. Since FineWeb2 and CulturaX rely on overlapping data and share a similar knowledge cut-off date, we selected the FineWeb2 and UberText (Ukrainian news) subsets within Kobza to maximize coverage. This approach provides a larger and more diverse foundation for our pre-training corpus. Additionally, we applied a data rehydration technique by incorporating the Ukrainian Wikipedia subset, ensuring greater emphasis on high-quality content.

During pretraining, we used best-fit packing to pack sequences at the desired context length, preserving data structure and coherence with minimal disruption. This approach enhances context learning and improves language reasoning capabilities. To prevent catastrophic forgetting, we include a small proportion of English-centric data, such as English Wikipedia, Smoltalk and Mixture of Thoughts.
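As a rough illustration, best-fit packing can be sketched as a greedy bin-packing pass over document lengths. This is a simplified sketch under stated assumptions - the real pipeline operates on token sequences rather than bare lengths, and `best_fit_pack` is a hypothetical helper, not the actual implementation:

```python
def best_fit_pack(doc_lengths, context_len):
    """Greedy best-fit packing: place each document into the training
    sequence with the least remaining space that still fits it, opening a
    new sequence when none fits. Documents longer than the context window
    are split into context-sized chunks first, so whole documents are kept
    together whenever possible."""
    # Split oversized documents into context-length chunks
    pieces = []
    for n in doc_lengths:
        while n > context_len:
            pieces.append(context_len)
            n -= context_len
        if n:
            pieces.append(n)
    # Sort descending: the classic best-fit-decreasing heuristic
    pieces.sort(reverse=True)
    capacities = []  # remaining capacity per sequence
    packed = []      # piece lengths assigned to each sequence
    for p in pieces:
        # pick the fullest sequence that still fits this piece
        best = -1
        for i, cap in enumerate(capacities):
            if p <= cap and (best == -1 or cap < capacities[best]):
                best = i
        if best == -1:
            capacities.append(context_len - p)
            packed.append([p])
        else:
            capacities[best] -= p
            packed[best].append(p)
    return packed

sequences = best_fit_pack([900, 700, 600, 300, 250, 120], context_len=1024)
# → [[900, 120], [700, 300], [600, 250]]
```

Compared to naive concatenate-and-chunk packing, this keeps documents intact far more often, which is what preserves data structure and coherence at training time.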

Post-Training Phase

As in v0.1, for the post-training stage we extracted topics relevant to Ukrainian history and culture, which enabled the generation of a synthetic dataset of Ukrainian QA pairs via knowledge distillation from a larger model. We also employed our LLM-based translation pipeline to translate domain-specific data into Ukrainian, enhancing both quantity and quality in the target language.

Our instruction-tuning dataset incorporates various open-source datasets, such as the Nemotron SFT and Post-Training datasets, OpenCoder (OPC) SFT dataset, Aya Collection and more. We acknowledge the significant contributions of the Ukrainian open-source community, particularly creators of Spivavtor, UAlpaca, UA-Squad, Ukrainian StackExchange, Crimean Tatar Parallel Corpora and UA-Lawyer QA, which amplify the potential of Ukrainian post-training.

Adapting Gemma 3 to Ukrainian Language

In the pre-training stage, we split the dataset into two parts built around the different massive web-sourced datasets, re-introducing the smaller domain-specific datasets into both splits. We then trained on each split separately and applied the model souping technique to the resulting checkpoints, which increased pre-training performance dramatically.
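At its core, model souping is parameter averaging across checkpoints. A minimal sketch, with scalars standing in for the weight tensors of a real model (the `soup` helper and the key names are illustrative, not the actual training code):

```python
def soup(state_dicts, weights=None):
    """Average the parameters of several checkpoints trained on different
    data splits -- the 'model soup' technique. Defaults to a uniform
    average; pass `weights` for a weighted soup."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# toy checkpoints from two pre-training splits
split_a = {"layer.weight": 1.0, "layer.bias": 0.0}
split_b = {"layer.weight": 3.0, "layer.bias": 2.0}
souped = soup([split_a, split_b])
# → {"layer.weight": 2.0, "layer.bias": 1.0}
```

In practice the same averaging is applied tensor-by-tensor over full model state dicts, so the souped model costs nothing extra at inference time.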

In the post-training stage, we trained English- and Ukrainian-focused instruction-tuned models separately and later combined them into a final, stronger version. This separation lets us raise performance in both languages even further, since each model is trained on data targeted at its language. We also applied an advanced model merging technique inspired by Layer Swapping to extract linguistic capabilities more precisely. Further, we considered findings on language imbalances and model merging, which highlight the impact of data mixing proportions on model performance.
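A Layer Swapping-style merge takes the outermost transformer layers from the language expert and the remaining parameters from the other expert. The sketch below is an assumption-laden toy (the `layers.{i}.` key pattern, the helper names, and the choice of which layers to swap are illustrative, not MamayLM's exact merge recipe):

```python
import re

def layer_index(key):
    """Return the transformer layer index encoded in a parameter name,
    or None for non-layer parameters (embeddings, head, ...)."""
    m = re.match(r"layers\.(\d+)\.", key)
    return int(m.group(1)) if m else None

def layer_swap(task_sd, lang_sd, n_layers, k=1):
    """Layer Swapping-style merge: take the bottom-k and top-k transformer
    layers from the language expert (`lang_sd`) and keep the remaining
    parameters from the task expert (`task_sd`)."""
    merged = {}
    for key, val in task_sd.items():
        idx = layer_index(key)
        if idx is not None and (idx < k or idx >= n_layers - k):
            merged[key] = lang_sd[key]
        else:
            merged[key] = val
    return merged

# toy 4-layer "models": 0.0 marks task-expert params, 1.0 language-expert params
keys = ["embed.weight"] + [f"layers.{i}.weight" for i in range(4)]
task = {k: 0.0 for k in keys}
lang = {k: 1.0 for k in keys}
merged = layer_swap(task, lang, n_layers=4, k=1)
# outer layers (0 and 3) come from the language expert, the rest stay
```

The intuition behind swapping the outer layers is that they carry much of a model's language-specific input and output behaviour, while the middle layers carry more transferable task skills.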

The chosen pipeline allows us not just to preserve visual and long-context capabilities, but even to improve them in both languages without datasets targeted at those domains. We believe that visual multilingual performance depends strongly on the model's linguistic capabilities in the given languages; accordingly, we observe improvements on visual benchmarks without training on text-image data.

Evaluating MamayLM v1.0 12B Capabilities

We evaluated MamayLM on a set of standard English benchmarks, a translated version of them in Ukrainian, as well as Ukrainian-specific benchmarks we collected:

  1. ZNO: mandatory testing of knowledge of the Ukrainian high school curriculum in Ukrainian language & literature, history, mathematics and geography
  2. Winogrande challenge: testing commonsense reasoning through pronoun resolution
  3. Hellaswag: testing sentence completion
  4. ARC Easy/Challenge: testing logical reasoning
  5. TriviaQA: testing trivia knowledge
  6. GSM-8K: solving grade-school mathematics word problems
  7. MMLU: testing knowledge on a multitude of topics
  8. IFEval: testing instruction-following skills

We also took on the challenge of determining the best translation method for the English-only benchmarks. Although some effort has been made in this direction, we found that it was not extensive enough and that the Ukrainian translations could be improved. We identified two key issues in benchmark translation:

  1. the separation of question and answer during translation;
  2. translation quality heavily relying on few-shot prompting or additional model output verification.

To address these issues, we developed a translation framework that preserves the context of both questions and answers. It also employs multisampling and scoring of translation candidates to balance machine translation quality against human involvement, ensuring maximum efficiency. All adapted benchmarks for Ukrainian are available in the accompanying GitHub repository.
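The two ideas above - translating question and answers as one unit, and multisampling with scoring - can be sketched roughly as follows. `sample_translation` and `score` are hypothetical stand-ins for an LLM translation call and a translation-quality metric; the real framework is more involved:

```python
def translate_qa(question, answers, sample_translation, score, n=2):
    """Translate a question together with its answer options as a single
    block (so the translator sees full context), sample n candidate
    translations, and keep the highest-scoring one."""
    block = "\n".join([question] + answers)
    candidates = [sample_translation(block) for _ in range(n)]
    best = max(candidates, key=score)
    lines = best.split("\n")
    return lines[0], lines[1:]  # translated question, translated answers

# toy demo: canned candidates; the "metric" here simply prefers longer text
cands = iter(["Q?\nA\nB", "Питання?\nВідповідь А\nВідповідь Б"])
q, opts = translate_qa("Question?", ["Answer A", "Answer B"],
                       lambda _: next(cands), score=len, n=2)
# → q == "Питання?", opts == ["Відповідь А", "Відповідь Б"]
```

Keeping the answers in the same translation call matters for multiple-choice items: translating each option in isolation can break agreement (case, gender, register) between the question and its options.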

Performance Against Similarly Sized Models

As illustrated by the figures below, MamayLM outperforms all similarly sized models across all benchmarks (even outperforming much larger 70B models on Ukrainian!). It does so in both English and Ukrainian, thanks to the training method described above.

MamayLM English evaluation
Average score across reported English benchmarks
MamayLM Ukrainian evaluation
Average score across reported Ukrainian benchmarks

Performance Against Larger Models

We also evaluated MamayLM v1.0 against current state-of-the-art LLMs. Impressively, our model outperforms models up to 6 times larger across various benchmarks, including those specific to Ukrainian contexts, as shown in the figure below.

MamayLM Ukrainian evaluation
Scores across specific reported Ukrainian benchmarks and comparison with larger models

Performance on Mandatory National Ukrainian Exams (ZNO)

Importantly, as the figure below shows, MamayLM v1.0 achieves the highest score on the ZNO (National Ukrainian) high school exams amongst similarly sized models, while outperforming much larger models, including Gemma 2 27B, Llama 3.1 70B and Qwen 2.5 72B.

MamayLM ZNO evaluation
ZNO evaluation results

Performance on Visual Benchmarks

We also evaluated MamayLM v1.0 on visual benchmarks, where it demonstrates strong performance in both Ukrainian and English. The model's ability to understand and generate text based on visual inputs highlights its versatility and effectiveness across different modalities.

To assess English visual performance, we use the original MMMU benchmark, where our trained model shows improved performance compared to the base version.

MamayLM MMMU evaluation
MMMU evaluation results

To monitor Ukrainian visual performance, we used ZNO-Vision to evaluate the model's understanding of local cultural and historical knowledge, together with other domain-specific capabilities in Ukrainian. Our model again improves over the base model after training.

MamayLM MMZNO evaluation
ZNO-Vision/MMZNO evaluation results

Generative Performance vs. Larger Models

Beyond benchmark evaluations, we assessed the generative capabilities of MamayLM v1.0 on a set of 500 complex questions. The results demonstrate that our model consistently outperforms significantly larger models, excelling both in the linguistic quality of the generated Ukrainian text and in the accuracy of its content. To ensure unbiased and high-quality evaluations, we relied on Gemini 2.0 Flash as the judge, as it has strong proficiency in Ukrainian and a deep understanding of its cultural and linguistic nuances.
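A common way to run such an evaluation is pairwise LLM-as-judge comparison, querying the judge with both answer orders to cancel position bias. This is a generic sketch of that pattern, not our exact evaluation harness; `judge` is a hypothetical callable (e.g. wrapping a Gemini API call) that returns 'A' or 'B':

```python
def judge_pair(question, answer_a, answer_b, judge):
    """Pairwise LLM-as-judge: ask which answer is better in both orders.
    Only a consistent preference counts as a win; a preference that flips
    with answer order is recorded as a tie (position bias)."""
    prompt = ("Question: {q}\nAnswer A: {a}\nAnswer B: {b}\n"
              "Which answer is better, A or B?")
    first = judge(prompt.format(q=question, a=answer_a, b=answer_b))
    second = judge(prompt.format(q=question, a=answer_b, b=answer_a))
    if first == "A" and second == "B":
        return "first"
    if first == "B" and second == "A":
        return "second"
    return "tie"

# toy judge that always prefers the answer containing "Київ"
toy_judge = lambda p: "A" if "Answer A: Київ" in p else "B"
verdict = judge_pair("Столиця України?", "Київ", "Львів", toy_judge)
# → "first"
```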

We evaluate model performance on factual Ukrainian QA data, where our model performs favourably against much larger models, as well as GPT-4o and Claude 3.7 Sonnet.

MamayLM Ukrainian QA evaluation
Chat performance comparison against proprietary models on custom Ukrainian QA benchmark

We also evaluate model performance on m-ArenaHard (Ukrainian subset), which targets more domain-specific knowledge in math and coding; here too our model performs well against much larger models.

MamayLM ArenaHard-M UKR evaluation
Chat performance comparison against proprietary models on m-ArenaHard benchmark

Evaluating MamayLM v1.0 4B Capabilities

We assess the capabilities of MamayLM v1.0 4B using the same benchmarks, targeted to evaluate text generation, comprehension, and domain-specific knowledge for both Ukrainian and English. The model shows strong performance against similarly sized models, demonstrating its effectiveness across a range of tasks.

MamayLM 4B Ukrainian evaluation
MamayLM v1.0 4B Ukrainian evaluation results (comparison with similarly sized models)

Furthermore, MamayLM v1.0 4B achieves 50% accuracy on the ZNO benchmark, showing promising performance on Ukrainian-focused tasks for a small model.

Benefits of MamayLM

In the current technological landscape, the need for fast, adaptable, and locally optimized solutions has become critical. Available in 4B and 12B sizes, MamayLM is relatively compact and consistently outperforms models up to 10x larger in both English and Ukrainian. Its ability to operate on a single GPU allows for faster adaptation, lower operational costs, and simpler deployment, making it particularly well-suited for environments with limited resources and evolving demands. Moreover, the new version now has visual and long-context capabilities, with increased performance in both languages.

This offers significant advantages for Ukrainian local businesses and government institutions, which can integrate advanced AI technologies without the prohibitive costs or complex technical requirements typically associated with larger systems. The smaller size option allows more flexibility in deployment and scaling for smaller businesses that lack extensive infrastructure. Additionally, the model's bilingual capabilities support its application in sectors such as education and healthcare, where addressing language barriers can have a meaningful impact. In particular, it helps meet immediate needs in Ukraine by enhancing service delivery across critical areas.

Download Models and Benchmarks

We make standard and quantized versions of MamayLM available on HuggingFace, alongside a detailed description of how to use them for inference.
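As a quick-start sketch, the model can be loaded with the HuggingFace `transformers` library. Note that the model id below is an assumption - check the MamayLM HuggingFace page for the exact repository name - and this text-only example omits the image inputs the model also supports:

```python
def chat(prompt,
         model_id="INSAIT-Institute/MamayLM-Gemma-3-12B-IT-v1.0",  # assumed id
         max_new_tokens=256):
    """Generate a reply with MamayLM via transformers (text-only sketch).
    Imports are kept inside the function so the file loads without the
    heavy dependencies installed."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto")
    messages = [{"role": "user", "content": prompt}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # decode only the newly generated tokens
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
```

Running the 12B model this way fits on a single modern GPU; the quantized variants reduce memory requirements further.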

The Ukrainian benchmarks are available in the accompanying GitHub repository.

If you use our models, please consider citing our work (citation below).

Contact Us

For any questions on MamayLM, please contact us at contact@insait.ai.

INSAIT is a world-class computer science and AI research institute, which is part of Sofia University, located in Sofia, Bulgaria. INSAIT was created in 2022, in partnership with Switzerland's ETH Zurich and EPFL. It is a strategic institution for Bulgaria, funded with an initial endowment of around 100M USD by the Bulgarian government, over a period of 10 years, and is generously supported with donations of roughly 15M USD from SiteGround, Google, AWS, VMware and other big-tech companies. INSAIT is the first center of its kind in Eastern Europe, structured according to top Western computer science and AI institutions – it provides world-class packages and conditions for outstanding tenure-track and tenured faculty, research scientists, post-docs, PhDs and many other positions. Currently, INSAIT hosts researchers from more than 23 nationalities and does research in areas spanning foundational models, safe and secure AI, robotics, computer vision, quantum computing, algorithms, information security, and other key areas.

Citation

For attribution in academic contexts, please cite this work as

"MamayLM v1.0: An efficient state-of-the-art multimodal Ukrainian LLM", 2025.

BibTeX citation

@misc{MamayLMv1,
  title={MamayLM v1.0: An efficient state-of-the-art multimodal Ukrainian LLM},
  author={Yukhymenko, Hanna and Alexandrov, Anton and Vechev, Martin},
  year={2025}
}

Distill Template

This blog was based on The Distill Template by Leandro von Werra.