Large Language Models

Developing Vietnamese LLMs and applications

Large language models have recently achieved state-of-the-art results on many natural language processing tasks, and they are believed to acquire broad knowledge of the world from their text training data. As the era of AGI (Artificial General Intelligence) approaches, large language models are widely regarded as one of the key technologies for achieving it.

Key Projects

URA-LLaMa Family

Vietnamese LLMs fine-tuned from Meta’s LLaMa-2 model, available in 7B, 13B, and 70B versions.
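As a minimal sketch of how such fine-tuned checkpoints are typically used, the snippet below loads a model through the Hugging Face `transformers` library. The repository ID `ura-hcmut/ura-llama-7b` and the instruction template are illustrative assumptions, not confirmed here; consult the actual model card for the correct values.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a Vietnamese instruction in a simple instruction-style template.
    NOTE: this template is a hypothetical example; the real template, if any,
    would be documented on the model card."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def generate(instruction: str, max_new_tokens: int = 128) -> str:
    """Load a (hypothetical) URA-LLaMa checkpoint and generate a completion.
    Imports are kept inside the function so prompt building works without
    transformers installed."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "ura-hcmut/ura-llama-7b"  # assumed repository name
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

    inputs = tokenizer(build_prompt(instruction), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

The 70B variant would additionally need quantization or multi-GPU sharding to fit in memory, which `device_map="auto"` only partially addresses.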

MixSUra & GemSUra

Vietnamese language models based on Mixtral and Gemma architectures.

Vision-Language Models

Integrating vision capabilities with LLMs using the LLaVA architecture:

  • MixSUraV
  • GemSUraV 7B, 2B

Research Interests

  • Developing new large language models for low-resource languages
  • Applying LLMs to solve real-world problems
  • Cross-lingual transfer learning

Citations

If you use our models in your research, please cite this paper (Truong et al., 2024).

References

2024

  1. Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models
    Sang Truong, Duc Nguyen, Toan Nguyen, and 4 more authors
    In Findings of the Association for Computational Linguistics: NAACL 2024, Jun 2024