2025-06-02 17:20:33
:::info Authors:
(1) Vladislav Trifonov, Skoltech ([email protected]);
(2) Alexander Rudikov, AIRI, Skoltech;
(3) Oleg Iliev, Fraunhofer ITWM;
(4) Ivan Oseledets, AIRI, Skoltech;
(5) Ekaterina Muravleva, Skoltech.
:::
2 Neural design of preconditioner
3 Learn correction for ILU and 3.1 Graph neural network with preserving sparsity pattern
5.2 Comparison with classical preconditioners
5.4 Generalization to different grids and datasets
7 Conclusion and further work, and References
Large linear systems are ubiquitous in modern computational science. The main recipe for solving them is iterative solvers with well-designed preconditioners. Deep learning models can be used to precondition residuals during the iterations of linear solvers such as the conjugate gradient (CG) method. In this setup, neural network models require an enormous number of parameters to approximate well. Another approach is to take advantage of small graph neural networks (GNNs) to construct preconditioners with a predefined sparsity pattern. In our work, we recall well-established preconditioners from linear algebra and use them as a starting point for training the GNN. Numerical experiments demonstrate that our approach outperforms both classical methods and neural network-based preconditioning. We also provide a heuristic justification for the loss function used and validate our approach on complex datasets.
\ A well-designed preconditioner should approximate A, be easy to invert, and be sparse. Constructing a preconditioner is therefore typically a trade-off between the quality of the approximation and the cost of storing and inverting the preconditioner Saad [2003].
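\ To make this trade-off concrete, the following minimal sketch (our illustration, not code from the paper; it uses SciPy and a 2-D Poisson matrix as a stand-in for the SPD systems discussed here) compares the iteration counts of plain CG and CG preconditioned with an incomplete factorization:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 2-D Poisson matrix (SPD), a common stand-in for the systems discussed here.
n = 64
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsc()
b = np.ones(A.shape[0])

iters = {"plain": 0, "ilu": 0}

def counter(key):
    def cb(xk):
        iters[key] += 1
    return cb

# Plain CG.
spla.cg(A, b, callback=counter("plain"))

# CG preconditioned with an incomplete LU factorization of A.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, ilu.solve)
spla.cg(A, b, M=M, callback=counter("ilu"))

print(iters)  # the preconditioned run should need far fewer iterations
```

\ Allowing more fill-in in the factorization lowers the iteration count further, but raises the cost of building, storing, and applying the preconditioner, which is exactly the trade-off described above.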
\ Recent papers on the application of neural networks to speed up iterative solvers include the use of neural operators as nonlinear preconditioner functions Rudikov et al. [2024], Shpakovych [2023], hybrid approaches that address low frequencies Kopaničáková and Karniadakis [2024], Cui et al. [2022], and learning preconditioner decompositions with graph neural networks (GNNs) Li et al. [2023], Häusner et al. [2023].
\ We suggest a GNN-based construction of preconditioners that produces better preconditioners than their classical analogues. Our contributions are as follows:
\ • We propose a novel scheme for preconditioner design based on learning a correction to well-established preconditioners from linear algebra with a GNN.
\ • We suggest a novel interpretation of the loss function used, with an emphasis on low-frequency components, and provide experimental justification for this interpretation.
\ • We propose a novel approach for dataset generation with a measurable complexity metric that addresses real-world problems.
\ • We provide extensive studies with varying matrix sizes and dataset complexities to demonstrate the superiority of the proposed approach and loss function over classical preconditioners.
\
\ \ This loss function previously appeared in related research Li et al. [2023], but it was interpreted there as an inductive bias from the PDE data distribution. In the experiments section we provide evidence for our hypothesis that loss (5) indeed mitigates low-frequency components.
\
Our main goal is to construct preconditioners that reduce the condition number of an SPD matrix more than classical preconditioners with the same sparsity pattern do. Since we work with SPD matrices, ILU, ILU(p) and ILUt(p) reduce to the incomplete Cholesky factorizations IC, IC(p) and ICt(p).
Following the idea from Li et al. [2023], we use a GNN architecture Zhou et al. [2020] that preserves the sparsity pattern and predicts a lower triangular matrix, producing a preconditioner in the form of an IC decomposition.
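\ As an illustration only (the layer sizes, update rule, and feature choices below are our assumptions, not the authors' exact architecture), a sparsity-preserving GNN can be sketched in PyTorch as a network that assigns one learned value to every nonzero of A and keeps only the lower-triangular entries as the predicted factor L:

```python
import torch
import torch.nn as nn

class SparsityPreservingGNN(nn.Module):
    """Illustrative sketch: predict one value per nonzero of A via message passing."""

    def __init__(self, hidden=32, rounds=3):
        super().__init__()
        self.node_enc = nn.Linear(1, hidden)   # encode diag(A) as node features
        self.edge_enc = nn.Linear(1, hidden)   # encode nonzero values as edge features
        self.msg = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.upd = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(3 * hidden, 1)
        self.rounds = rounds

    def forward(self, edge_index, edge_vals, diag):
        # edge_index: (2, nnz) row/col indices of the nonzeros of A
        # edge_vals:  (nnz,)   values of those nonzeros
        # diag:       (n,)     diagonal of A
        src, dst = edge_index
        h = self.node_enc(diag.unsqueeze(-1))        # (n, hidden)
        e = self.edge_enc(edge_vals.unsqueeze(-1))   # (nnz, hidden)
        for _ in range(self.rounds):
            m = self.msg(torch.cat([h[src], e], dim=-1))       # messages along nonzeros only
            agg = torch.zeros_like(h).index_add_(0, dst, m)    # aggregate at receiving nodes
            h = self.upd(agg, h)                               # update node states
        # One scalar per nonzero; keeping only entries with row >= col yields a factor L
        # whose sparsity pattern matches the lower triangle of A, as in an IC decomposition.
        return self.out(torch.cat([h[src], h[dst], e], dim=-1)).squeeze(-1)
```

\ In the approach described above, such per-edge predictions would serve as a learned correction to the classical IC factor, and the resulting M = LLᵀ is applied inside CG via two triangular solves.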
\
:::info This paper is available on arxiv under CC BY 4.0 DEED (Attribution 4.0 International) license.
:::
\
2025-06-02 17:10:04
:::info Authors:
(1) Pavan L. Veluvali, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstr. 1, 39106 Magdeburg;
(2) Jan Heiland, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstr. 1, 39106 Magdeburg;
(3) Peter Benner, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstr. 1, 39106 Magdeburg.
:::
Spinodal decomposition in a binary A-B alloy
Summary and Outlook, Acknowledgments, Data Availability, and References
Computational workflows have taken hold as the standard way to perform data- and software-intensive tasks. The rapid growth in their uptake and their application to computer-based experiments present a crucial opportunity to advance the development of reproducible scientific software. As a part of the MaRDI consortium [MaR21] on research data management in mathematical sciences, in this work we presented a novel computational workflow framework, namely MaRDIFlow, a prototype that focuses on automatically abstracting the meta-data embedded in an ontology of mathematical objects while hiding the underlying execution and environment dependencies in multi-layered vertical descriptions. Additionally, the different components are characterized by their input and output relations such that they can be used interchangeably and, in most cases, redundantly.
\ The design specification as well as the working prototype of our RDM tool were presented through different use cases. In the present version, MaRDIFlow acts as a command-line tool that enables users to handle workflow components as abstract objects described by their input-to-output behavior. At its core, MaRDIFlow ensures that the generated output is detailed and that a comprehensive description facilitates the reproduction of computational experiments. We first illustrated the conversion rates of CO2 using a methanization reactor model, and later demonstrated the two-dimensional spinodal decomposition of a virtual A-B alloy using the Cahn-Hilliard model. Our RDM tool adheres to the FAIR principles, such that the abstracted workflow components are Findable, Accessible, Interoperable and Reusable. Overall, the ongoing development of MaRDIFlow aims at covering heterogeneous use cases and acting as a scientific tool in the field of mathematical sciences.
\ Apart from this, we are also working towards developing an Electronic Lab Notebook (ELN) in order to visualize as well as execute the MaRDIFlow tool. The ELN will provide researchers with a user-friendly interface to interact with the tool efficiently and seamlessly. Lastly, although the present manuscript introduces our RDM tool as a working proof of concept, we plan to publish a detailed manuscript with technical details and use cases in the near future.
The authors are supported by NFDI4Cat and MaRDI, funded by the Deutsche Forschungsgemeinschaft (DFG), project 441926934 “NFDI4Cat – NFDI für Wissenschaften mit Bezug zur Katalyse” and project 460135501 “MaRDI – Mathematische Forschungsdateninitiative”.
Results presented in this work are part of an ongoing investigation; however, a working prototype with the second use case is available and documented at https://doi.org/10.5281/zenodo.10608764
[AGMT17] M. Atkinson, S. Gesing, J. Montagnat, and I. Taylor. Scientific workflows: Past, present and future, 2017.
\ [BCG+19] A. Brinckman, K. Chard, N. Gaffney, M. Hategan, M. B. Jones, K. Kowalik, S. Kulasekaran, B. Ludäscher, B. D. Mecum, J. Nabrzyski, V. Stodden, I. J. Taylor, M. J. Turk, and K. Turner. Computing environments for reproducibility: Capturing the “whole tale”. Future Generation Computer Systems, 94:854–867, 2019.
\ [BHBS21] J. Bremer, J. Heiland, P. Benner, and K. Sundmacher. Non-intrusive time-pod for optimal control of a fixed-bed reactor for co2 methanation. IFAC-PapersOnLine, 54(3):122–127, 2021.
\ [BOA+11] T. Blochwitz, M. Otter, M. Arnold, C. Bausch, C. Clauß, H. Elmqvist, A. Junghanns, J. Mauss, M. Monteiro, T. Neidhold, et al. The functional mockup interface for tool independent exchange of simulation models. In Proceedings of the 8th international Modelica conference, pages 105–114. Linköping University Press, 2011.
\ [BTK+21] M. Beg, J. Taka, T. Kluyver, A. Konovalov, M. Ragan-Kelley, NM. Thiéry, and H. Fangohr. Using jupyter for reproducible scientific workflows. Computing in Science & Engineering, 23(2):36–46, 2021.
\ [CAI+22a] M. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijanić, H. Ménager, S. Soiland-Reyes, B. Gavrilović, C. Goble, et al. Methods included: Standardizing computational reuse and portability with the common workflow language. Communications of the ACM, 65(6):54–63, 2022.
\ [CAI+22b] MR. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijanić, H. Ménager, S. Soiland-Reyes, B. Gavrilović, C. Goble, et al. Methods included: Standardizing computational reuse and portability with the common workflow language. Communications of the ACM, 65(6):54–63, 2022.
\ [CH58] JW. Cahn and JE. Hilliard. Free energy of a nonuniform system. i. interfacial free energy. The Journal of chemical physics, 28(2):258–267, 1958.
\ [Com22] The Galaxy Community. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Research, 50(W1):W345–W351, 04 2022.
\ [CSFG19] A. Clyburne-Sherin, X. Fei, and SA. Green. Computational reproducibility via containers in social psychology. Meta-Psychology, 3, 2019.
\ [DGST09] E. Deelman, D. Gannon, M. Shields, and I. Taylor. Workflows and e-science: An overview of workflow system features and capabilities. Future Generation Computer Systems, 25(5):528–540, 2009.
\ [DHM+20] A. Devaraju, R. Huber, M. Mokrane, P. Herterich, L. Cepinskas, J. de Vries, H. L’Hours, J. Davidson, and A. Whyte. FAIRsFAIR data object assessment metrics, October 2020.
\ [FHHS16] J. Fehr, H. Heiland, C. Himpe, and J. Saak. Best practices for replicability, reproducibility and reusability of computer-based experiments exemplified by model reduction software. AIMS Mathematics, 1(3):261–281, 2016.
\ [For22] Deutsche Forschungsgemeinschaft. Guidelines for Safeguarding Good Research Practice. Code of Conduct, April 2022. Available in German and in English.
\ [GCBSR+20] C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, MR. Crusoe, K. Peters, and D. Schober. Fair computational workflows. Data Intelligence, 2(1- 2):108–121, 2020.
\ [HW09] M. A. Heroux and J. M. Willenbring. Barely sufficient software engineering: 10 practices to improve your cse software. In Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, pages 15–21, 2009.
\ [KRKP+16a] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, and C. Willing. Jupyter notebooks - a publishing format for reproducible computational workflows. In F. Loizides and B. Schmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90, 2016.
\ [KRKP+16b] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, and C. Willing. Jupyter notebooks – a publishing format for reproducible computational workflows. In F. Loizides and B. Schmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90. IOS Press, 2016.
\ [MaR21] MaRDI. Mathematic research data initiative, 2021. URL: https://www.mardi4nfdi.de.
\ [Nat22] National Academies of Sciences, Engineering, and Medicine. Automated Research Workflows for Accelerated Discovery: Closing the Knowledge Discovery Loop. The National Academies Press, Washington, DC, 2022.
\ [PMBF17] JF. Pimentel, L. Murta, V. Braganholo, and J. Freire. noworkflow: a tool for collecting, analyzing, and managing provenance from python scripts. Proceedings of the VLDB Endowment, 10(12), 2017.
\ [PMBF21] JF. Pimentel, L. Murta, V. Braganholo, and J. Freire. Understanding and improving the quality and reproducibility of Jupyter notebooks. Empirical Software Engineering, 26(4):65, 2021.
\ [SM24] S. Samuel and D. Mietchen. Computational reproducibility of Jupyter notebooks from biomedical publications. GigaScience, 13:giad113, 01 2024.
\ [UHY+21] M. Uhrin, SP. Huber, J. Yu, N. Marzari, and G. Pizzi. Workflows in aiida: Engineering a high-throughput, event-based engine for robust and modular computational workflows. Computational Materials Science, 187:110086, 2021.
\ [VHB23] PL. Veluvali, J. Heiland, and P. Benner. Mardiflow: A workflow framework for documentation and integration of fair computational experiments. In Proceedings of the Conference on Research Data Infrastructure, volume 1, 2023.
\ [WDA+16] MD. Wilkinson, M. Dumontier, IJ. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, JW. Boiten, LB. da Silva Santos, PE. Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific data, 3(1):1–9, 2016.
\
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::
\
2025-06-02 17:08:28
Related Works
2.3 Evaluation benchmarks for code LLMs and 2.4 Evaluation metrics
Methodology
Evaluation
Democratization of AI, and of Large Language Models (LLMs) in particular, is becoming an increasingly relevant topic. Thanks to training on extremely large amounts of data, supported by extensive computational infrastructure, LLMs can demonstrate impressive performance on a variety of tasks. However, these very same factors have become the main obstacles preventing individual users from benefiting from LLMs. This threatens to deepen the digital divide in society.
\ Quantization is aimed at making LLMs more accessible on consumer devices by trading off performance for lower computational demand. In this study, we evaluated the feasibility of running code LLMs with 7 billion parameters on consumer devices. The quantized code LLMs were evaluated on Lua programming tasks as a benchmark and with respect to several metrics, including pass@1 rate, inference time, error types, and lines of code generated. The quantized code LLMs were also compared to non-quantized code LLMs with lower parameter counts.
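\ For reference, the pass@1 rate presumably follows the standard unbiased pass@k estimator introduced with HumanEval [28]; a minimal sketch of that computation (our illustration, not code from this study):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples generated per task, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per task (k = 1), pass@1 reduces to the fraction of
# tasks whose one generated solution passes all unit tests.
print(pass_at_k(n=1, c=1, k=1))   # 1.0
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```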
\ The overall results suggest that code LLMs quantized at 4-bit integer precision can be comfortably run on an average CPU-only consumer laptop while maintaining good performance relative to other quantized and non-quantized code LLMs. The 4-bit quantized models also outperformed the non-quantized models with lower parameter numbers.
\ However, the study also revealed that the exact effects of quantization are not homogeneous among the five tested models. The performance of a quantized model may also depend on the model architecture, pre-training dataset, training procedure, and follow-up fine-tuning. Therefore, a more in-depth study is necessary to explore how these factors interact with quantization. Furthermore, beyond the four general categories, the exact nature of errors in code generated by the code LLMs was not explored in this study. Such an understanding can give greater insight into the effects of quantization on code generation and how to mitigate the corresponding performance degradation. This also needs to be addressed in a follow-up study.
\ Finally, using Lua, a low-resource language, as a benchmark further emphasizes the need to improve LLMs aimed at precision tasks such as code generation. This highlights the need for enabling consumers to not only quantize but also fine-tune models for specific tasks. Furthermore, accessibility of fine-tuning is not only an algorithmic problem but also the issue of data availability. Therefore, the democratization of LLMs also involves the democratization of training data, which also needs to be addressed in future work.
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
\ [2] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI preprint, 2018. URL https://cdn.openai.com/research-covers/language-unsupervised/languageunderstandingpaper.pdf. Accessed 22-Oct-2024.
\ [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
\ [4] Muhammad Usman Hadi, Qasem Al Tashi, Abbas Shah, Rizwan Qureshi, Amgad Muneer, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, et al. Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints, 2024.
\ [5] Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169, 2023.
\ [6] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. Chatgpt for good? on opportunities and challenges of large language models for education. Learning and individual differences, 103:102274, 2023.
\ [7] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930–1940, 2023.
\ [8] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36, 2024.
\ [9] Sophie Lythreatis, Sanjay Kumar Singh, and Abdul-Nasser El-Kassar. The digital divide: A review and future research agenda. Technological Forecasting and Social Change, 175:121359, 2022.
\ [10] John Lai and Nicole O Widmar. Revisiting the digital divide in the covid-19 era. Applied economic perspectives and policy, 43(1):458–464, 2021.
\ [11] Adel Ben Youssef, Mounir Dahmani, and Ludovic Ragni. Ict use, digital skills and students’ academic performance: Exploring the digital divide. Information, 13(3):129, 2022.
\ [12] Anjun Chen, Lei Liu, and Tongyu Zhu. Advancing the democratization of generative artificial intelligence in healthcare: a narrative review. Journal of Hospital Management and Health Policy, 8, 2024.
\ [13] Nur Ahmed and Muntasir Wahed. The de-democratization of ai: Deep learning and the compute divide in artificial intelligence research. arXiv preprint arXiv:2010.15581, 2020.
\ [14] Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Anders Freeman, Carolyn Jane Anderson, Molly Q Feldman, Michael Greenberg, Abhinav Jangda, and Arjun Guha. Knowledge transfer from high-resource to low-resource programming languages for code llms, 2024. URL https://arxiv.org/abs/2308.09895.
\ [15] AI@Meta. Llama 3 model card. HuggingFace.co, 2024. URL https://github.com/meta-llama/llama3/ blob/main/MODEL_CARD.md.
\ [16] Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, and Deyi Xiong. A comprehensive evaluation of quantization strategies for large language models. arXiv preprint arXiv:2402.16775, 2024.
\ [17] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2024. URL https://arxiv.org/abs/2308.12950.
\ [18] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
\ [19] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024. URL https://arxiv.org/abs/2401.14196.
\ [20] CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, Jingyue Shen, Joe Kelley, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Zhitao Gong, Jane Fine, Tris Warkentin, Ale Jakse Hartman, Bin Ni, Kathy Korevec, Kelly Schaefer, and Scott Huffman. Codegemma: Open code models based on gemma, 2024. URL https://arxiv.org/abs/2406.11409.
\ [21] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder 2 and the stack v2: The next generation, 2024. URL https://arxiv.org/abs/2402.19173.
\ [22] Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, et al. Mceval: Massively multilingual code evaluation. arXiv preprint arXiv:2406.07436, 2024.
\ [23] Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633, 2023.
\ [24] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Optq: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, 2022.
\ [25] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2024. URL https://arxiv.org/abs/2306.00978.
\ [26] Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, and Yongin Kwon. A comprehensive evaluation of quantized instruction-tuned large language models: An experimental analysis up to 405b. arXiv preprint arXiv:2409.11055, 2024.
\ [27] Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, and Yu Wang. Evaluating quantized large language models, 2024. URL https://arxiv.org/abs/2402.18158.
\ [28] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
\ [29] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
\ [30] Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt. arXiv preprint arXiv:2304.10778, 2023.
\ [31] Lianghong Guo, Yanlin Wang, Ensheng Shi, Wanjun Zhong, Hongyu Zhang, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. When to stop? towards efficient code generation in llms with excess token prevention. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1073–1085, 2024.
\ [32] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515, 2024.
\ [33] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, MingHo Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al. Multipl-e: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering, 49(7):3675–3691, 2023.
\ [34] Russell A Poldrack, Thomas Lu, and Gašper Beguš. Ai-assisted coding: Experiments with gpt-4. arXiv preprint arXiv:2304.13187, 2023.
\ [35] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D Le, and David Lo. Refining chatgpt-generated code: Characterizing and mitigating code quality issues. ACM Transactions on Software Engineering and Methodology, 33(5):1–26, 2024.
\ [36] Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. Spoc: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32, 2019.
\ [37] Roberto Ierusalimschy, Luiz Henrique de Figueiredo, and Waldemar Celes. The evolution of lua. In Proceedings of the third ACM SIGPLAN conference on History of programming languages, pages 2–1, 2007.
\ [38] Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Yu-Yang Liu, and Li Yuan. Llm lies: Hallucinations are not bugs, but features as adversarial examples, 2024. URL https://arxiv.org/abs/2310.01469.
\ [39] Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and evaluating hallucinations in llm-powered code generation, 2024. URL https://arxiv.org/abs/2404.00971.
\ [40] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source be with you!, 2023. URL https://arxiv.org/abs/2305.06161.
\ [41] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295.
\
:::info Author:
(1) Enkhbold Nyamsuren, School of Computer Science and IT University College Cork Cork, Ireland, T12 XF62 ([email protected]).
:::
:::info This paper is available on arxiv under CC BY-SA 4.0 license.
:::
\
2025-06-02 17:08:22
Related Works
2.3 Evaluation benchmarks for code LLMs and 2.4 Evaluation metrics
Methodology
Evaluation
The results suggest that 4-bit integer quantization provides the best balance between model performance and model size. This is consistent with the conclusion of an earlier study that evaluated quantized LLMs on general reasoning and knowledge tasks [16]. Furthermore, while still being smaller in size, quantized 4-bit models with 7 billion parameters performed better than non-quantized half-precision models with 3 billion or fewer parameters.
\ On the other hand, 2-bit integer quantization resulted in significant performance degradation. In the extreme case of 2-bit CodeGemma, there was a complete breakdown in the model’s ability to generate coherent responses. This is likely an effect of hallucination [38; 39]: the low-precision rounding likely degraded the model’s next-token prediction (its underlying probability distributions), resulting in sequences of repetitive, out-of-context tokens.
\ According to [14], StarCoderBase 1B, StarCoderBase 15B, and CodeLlama 34B demonstrated MultiPL-E pass@1 percentages of 12.1, 26.6, and 43.9 for Lua. In another study [33], the InCoder 6.7B, CodeGen 16.1B, and Codex 12B models demonstrated MultiPL-HumanEval pass@1 rates of approximately 0.05, 0.08, and 0.41 for Lua. In the same study, the corresponding pass@1 rates in the MultiPL-MBPP Lua benchmark were 0.13, 0.09, and 0.49. These values can be compared with the pass@1 rates in Table 4. The 4-bit and 8-bit quantized models with 7B parameters generally do not perform much worse than the non-quantized models in [14; 33] with higher parameter numbers. This may be explained by advances made in LLM training and fine-tuning since those studies were published. For example, we used StarCoder2 7B while [14] evaluated StarCoderBase 15B. Compared to the original StarCoder released in 2023 [40], StarCoder2 was released in 2024 and trained on the Stack V2 dataset [21], which is seven times larger than the dataset the original StarCoder was trained on.
\ In Chai, Liu, Yang, et al. [22], CodeQwen 1.5 7B Chat, DeepSeek Coder 1.5 7B Instruct, CodeLlama 7B Instruct, and CodeGemma 7B IT (instruction-tuned) demonstrated pass@1 percentages of 48%, 48%, 30%, and 48% on the MCEVAL Lua benchmark. According to Fig. 2, the 4-bit and 8-bit quantized CodeLlama Instruct models performed comparably to the non-quantized CodeLlama Instruct model. Similarly, the 4-bit and 8-bit quantized CodeQwen 1.5 Chat models only slightly underperform the non-quantized model. There is a greater discrepancy between the quantized and non-quantized CodeGemma and DeepSeek Coder models.
\ This demonstrates that, performance-wise, the effect of quantization differs between models. Many factors may influence quantization, such as model architecture, training datasets, training process, underlying foundational models, etc. For example, the underperforming quantized models are heavily instruction fine-tuned models. The DeepSeek Coder Instruct models were pre-trained from scratch on 2 trillion tokens and further instruction-tuned on 2 billion tokens of instruction data [19]. CodeGemma 7B [20] was pre-trained on 500 billion tokens but is based on the Gemma 7B model, which was pre-trained on 6 trillion tokens and further instruction-tuned on an undisclosed amount of data [41]. In contrast, CodeQwen 1.5 is based on Qwen 1.5, pre-trained on 3 trillion tokens, and was further pre-trained on 90 billion tokens [18]. It can therefore be hypothesized that quantization may disproportionately affect the performance gained from instruction fine-tuning. Overall, further studies are necessary to granularize the effects of quantization on different facets of code LLMs.
\ The study demonstrates that quantized 4-bit and 8-bit models with 7 billion parameters can be run reasonably well on laptops even without a dedicated GPU. From the perspective of computational demand, it is becoming increasingly feasible to use code LLMs for everyday programming tasks. This is especially true with the introduction of integrated environments, such as LM Studio and Ollama, that provide convenient and easy-to-use UIs for locally deploying LLMs. Generating 20 lines of correct code may require 30-50 seconds (Fig. 7), which is a reasonable amount of time on a laptop oriented toward business rather than coding productivity. The main problem lies with the increasing inference time for generating incorrect solutions: higher inference time does not necessarily result in better-quality code. This particularly applies to 8-bit models. Ironically, this can be another argument for using 4-bit models, which fail sooner rather than unproductively spending more time generating incorrect solutions.
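\ As a rough illustration of such a local setup (our sketch, not the configuration used in this study; the model tag and the use of Ollama’s local HTTP endpoint are assumptions), a quantized model served on the same machine can be queried and timed as follows:

```python
import time
import requests  # assumes an Ollama server running locally on its default port

MODEL = "codellama:7b-instruct-q4_0"  # hypothetical 4-bit tag; adjust to what is installed
PROMPT = "-- Lua: return the factorial of n\nlocal function factorial(n)\n"

t0 = time.perf_counter()
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": MODEL, "prompt": PROMPT, "stream": False},
    timeout=300,
)
elapsed = time.perf_counter() - t0

completion = r.json().get("response", "")
loc = len([line for line in completion.splitlines() if line.strip()])
print(f"{elapsed:.1f} s wall time, {loc} non-empty lines generated")
```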
\ Performance-wise, both non-quantized code LLMs geared toward consumer devices and quantized code LLMs leave a lot to be desired. In most cases, these models demonstrate pass@1 rates lower than 50% in Lua. This is very low for precision tasks such as programming. The problem is further deepened by the difficulty of detecting errors. Ironically, code LLMs are quite good at generating incorrect code that is otherwise syntactically correct and does not produce runtime errors (see Table 6). Therefore, any code generated by code LLMs requires extensive testing and supervision, which may negate any advantages of using code LLMs.
\ In this study, the Lua programming language was used for benchmarking the code LLMs. Lua is a low-resource language [14] characterized by a lower amount of available training data compared to high-resource languages like Python and Java. Moreover, Lua has programming patterns and constructs, such as metatables and class emulations, typically not found in other languages. Hence, it is not straightforward for code LLMs to leverage generic knowledge from other languages while generating Lua code. In other words, there is no bias imposed by high-resource programming languages. Therefore, performance in Lua is more representative of the real-life performance of code LLMs on a variety of tasks. It can be further argued that real-life professional programming that code LLMs ideally need to support is about writing efficient code for specific or even niche tasks. Similar to Lua, these kinds of tasks can be seen as ‘low-resource tasks’ even within high-resource programming languages. As such, Lua, as a niche language [33], may arguably be a better representative of these ‘low-resource’ tasks.
\ Likely, both proprietary and permissively licensed foundational models such as GPT-4o and Llama 3.1 405B can demonstrate significantly better performance, but accessibility is a major issue for these models. On the one hand, proprietary models like GPT-4o are pay-walled. On the other hand, permissive models like Llama 3.1 405B require a computational infrastructure that may also demand considerable financial commitments. Therefore, further research is necessary to advance the democratization of code LLMs. While quantization remains a highly relevant topic, making fine-tuning feasible on consumer devices is also essential. The ability to both fine-tune and quantize smaller LLMs for specific tasks is the gap that must be addressed to enable greater consumer adoption. It should be noted that fine-tuning is not only a technical challenge: the datasets on which LLMs are pre-trained usually remain obscure and inaccessible to the public. Therefore, greater transparency and democratization of datasets is a necessary step toward the democratization of LLMs.
\
:::info Author:
(1) Enkhbold Nyamsuren, School of Computer Science and IT University College Cork Cork, Ireland, T12 XF62 ([email protected]).
:::
:::info This paper is available on arxiv under CC BY-SA 4.0 license.
:::
\
2025-06-02 17:08:15
Related Works
2.3 Evaluation benchmarks for code LLMs and 2.4 Evaluation metrics
Methodology
Evaluation
As shown in Fig. 6, the lines of code generated by the models do not differ much between the quantization levels. In generating incorrect solutions, CodeQwen and CodeGemma tended to be more verbose. The correct solutions in HumanEval require more lines of code than in the other two benchmarks. Interestingly, for the correct solutions, MBPP requires slightly more lines of code than MCEVAL while needing less inference time (Fig. 4). Overall, there is no effect of quantization on the number of lines of code generated. However, as depicted in Fig. 7, the time required to generate the same number of lines of code increases with higher-precision quantization. This is observed for both the correct and incorrect solutions. It indicates that the increase in inference time in higher-precision models is mainly due to longer forward-pass time (calculations at the layers) rather than longer output generation. In simpler terms, the higher-precision models spend more time ‘thinking’ before generating output. However, this additional thinking time does not effectively translate into better performance when the 4-bit and 8-bit models are compared (Fig. 1).
Instead of using quantized models, it may be better to use a non-quantized model with a smaller number of parameters. For this reason, we raised research question RQ4. We performed the same tests on DeepSeek Coder 1.3B Instruct, CodeGemma 2B, and StarCoder2 3B. The three models were tested at half precision (FP16). The storage requirements for these models are 2.69GB, 4.40GB, and 6.06GB, respectively. When loaded into memory, these models require 2.53GB, 4.44GB, and 5.79GB, respectively. These sizes roughly correspond to the sizes of the 2-bit, 4-bit, and 8-bit models. No low-parameter models were available for CodeLlama and CodeQwen.
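\ A back-of-envelope calculation (a sketch counting weights only and ignoring runtime overhead such as activations, the KV cache, and file metadata) shows why these FP16 sizes land near the quantized 7B footprints:

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only footprint in GB; ignores activations, KV cache and metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at different quantization levels.
for bits in (2, 4, 8, 16):
    print(f"7B @ {bits:2d}-bit ~ {approx_size_gb(7, bits):.1f} GB")

# The FP16 low-parameter models from the comparison above.
for p in (1.3, 2.0, 3.0):
    print(f"{p}B @ FP16 ~ {approx_size_gb(p, 16):.1f} GB")
```

\ For example, 7B weights at 4 bits come to roughly 3.5 GB, close to the 4.40GB reported for CodeGemma 2B at FP16, which is why the two are comparable in memory terms.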
\
\ As Fig. 8 suggests, the low-parameter models at the FP16 half-precision performed roughly at the level of 2-bit quantized models. The low-parameter models performed considerably worse than the 4-bit quantized models.
\
:::info Author:
(1) Enkhbold Nyamsuren, School of Computer Science and IT University College Cork Cork, Ireland, T12 XF62 ([email protected]).
:::
:::info This paper is available on arxiv under CC BY-SA 4.0 license.
:::
\
2025-06-02 17:08:08
Related Works
2.3 Evaluation benchmarks for code LLMs and 2.4 Evaluation metrics
Methodology
Evaluation
Fig. 4 depicts the inference times broken down by the models and benchmarks. The figure also shows the inference times separately for the correct (pass@1) and incorrect (fail@1) solutions.
\ For all models, the inference time increased with higher precision (more q-bits). The effect is stronger for the failed solutions. Overall, the failed solutions took longer to generate than the correct solutions. CodeGemma, StarCoder, and DeepSeek Coder share the same pattern. CodeQwen spent more time than the other models inferring both correct and failed solutions. CodeGemma similarly demanded longer inference times, but this did not translate into better results as effectively as it did for CodeQwen (Fig. 2).
\ For the benchmarks, the inference times also increased with q-bits. For the failed solutions in MBPP and MCEVAL, the inference times demonstrate a V-shaped pattern. This is due to CodeGemma’s inflated inference times at 2-bit quantization, where it had a complete breakdown of its inference ability.
\ Fig. 5 offers a more detailed view of the inference times. For the correct solutions, the divergence in inference times for CodeQwen from the other models is mainly observed in the HumanEval benchmark. Higher precision resulted in more divergence. According to Fig. 2, 8-bit CodeQwen is the second-best-performing model on HumanEval. However, the 4-bit CodeQwen model was not able to take similar advantage of longer inference time as its performance on HumanEval was not better than that of the other models. This demonstrates a variable effect of quantization on different models and benchmarks. For example, a longer inference time does not necessarily translate to better performance and may not compensate for the lower precision of a model. This is especially evident in CodeGemma, which consistently took longer inference time but demonstrated poorer performance across all three quantization levels than the other models.
\ A regression analysis was performed on the inference times with solution correctness (boolean), benchmark, model, and q-bits as nominal predictors. The regression model also included all possible two-way interactions. The data from CodeGemma was excluded to prevent its skewed results from influencing the analysis. The model’s adjusted R2 is 0.38, compared to an adjusted R2 of 0.31 for the baseline regression model without any interactions; its AIC is 63743.23, compared to the baseline’s 64471.65.
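\ For readers who want to reproduce this kind of analysis, a sketch using the statsmodels formula API is given below; the column names and the CSV file are illustrative assumptions, not artifacts from this study:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative columns: inference_s (float), correct (bool),
# benchmark, model, qbits (categorical: "2bit", "4bit", "8bit").
df = pd.read_csv("inference_times.csv")      # hypothetical file
df = df[df["model"] != "CodeGemma"]          # excluded, as in the analysis above

# Main effects plus every two-way interaction between the nominal predictors.
formula = "inference_s ~ (C(correct) + C(benchmark) + C(model) + C(qbits)) ** 2"
fit = smf.ols(formula, data=df).fit()

print(fit.rsquared_adj)  # compare against a no-interaction baseline model
print(fit.aic)
print(fit.summary())     # per-coefficient estimates, as reported in Table 7
```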
\ Table 7 lists the coefficients with corresponding statistics. Insignificant two-way interactions are not listed in Table 7. The intercept represents an average inference time in seconds for an incorrect solution generated by 2-bit CodeLlama in the HumanEval benchmark. The regression model confirms the effects observed in the descriptive statistics. The correct solutions required less time to generate than the incorrect solutions. For both the correct and incorrect solutions, the inference time increases with higher q-bits. The tasks in MBPP generally require less inference time than the tasks in the two other benchmarks. This discrepancy increases with higher quantization bits. Many of the interactions account for the effects related to CodeQwen: the rate of increase in inference time with more q-bits is higher than for the other models, and this effect is even more inflated in the HumanEval benchmark.
\
:::info Author:
(1) Enkhbold Nyamsuren, School of Computer Science and IT University College Cork Cork, Ireland, T12 XF62 ([email protected]).
:::
:::info This paper is available on arxiv under CC BY-SA 4.0 license.
:::
\