Large language models (LLMs) have recently achieved state-of-the-art results on many natural language processing tasks. They are believed to acquire broad knowledge of the world from text data, and as the field moves toward artificial general intelligence (AGI), LLMs are widely regarded as one of the key enabling technologies.
Key Projects
URA-LLaMa Family
Fine-tuned Vietnamese LLMs from Meta’s LLaMa-2 model, including 7B, 13B, and 70B versions.
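As a quick illustration of how such a checkpoint might be used, the minimal sketch below loads one of the finetuned models with the Hugging Face transformers library. The repository name ura-hcmut/ura-llama-7b and the example prompt are assumptions for illustration only and should be checked against the actually published checkpoints.

    # Minimal sketch: loading a URA-LLaMa checkpoint with Hugging Face transformers.
    # NOTE: the model identifier below is an assumed placeholder, not a confirmed repository name.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ura-hcmut/ura-llama-7b"  # assumed identifier; replace with the published repo name

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

    # Example Vietnamese prompt: "Briefly introduce the city of Hanoi."
    prompt = "Hãy giới thiệu ngắn gọn về thành phố Hà Nội."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))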
Recent advancements in large language models (LLMs) have underscored their importance in the evolution of artificial intelligence. However, despite extensive pretraining on multilingual datasets, available open-source LLMs exhibit limited effectiveness in processing Vietnamese. The challenge is exacerbated by the absence of systematic benchmark datasets and metrics tailored for Vietnamese LLM evaluation. To mitigate these issues, we have finetuned LLMs specifically for Vietnamese and developed a comprehensive evaluation framework encompassing 10 tasks and 31 metrics. We observe that finetuning can help LLMs transfer knowledge across languages, serving as an efficient way to bolster their capabilities in non-English languages. Moreover, our analysis indicates that larger models can introduce more biases and uncalibrated outputs, and that the key factor influencing LLM performance is the quality of the training or finetuning datasets. These insights underscore the significance of meticulous finetuning with high-quality datasets in enhancing LLM performance.
@inproceedings{truong_crossing_2024,
  title     = {Crossing {Linguistic} {Horizons}: {Finetuning} and {Comprehensive} {Evaluation} of {Vietnamese} {Large} {Language} {Models}},
  author    = {Truong, Sang and Nguyen, Duc and Nguyen, Toan and Le, Dong and Truong, Nhi and Quan, Tho and Koyejo, Sanmi},
  editor    = {Duh, Kevin and Gomez, Helena and Bethard, Steven},
  booktitle = {Findings of the {Association} for {Computational} {Linguistics}: {NAACL} 2024},
  publisher = {Association for Computational Linguistics},
  address   = {Mexico City, Mexico},
  month     = jun,
  year      = {2024},
  pages     = {2849--2900},
  url       = {https://aclanthology.org/2024.findings-naacl.182/},
  doi       = {10.18653/v1/2024.findings-naacl.182},
}