No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks, such as machine translation, question answering, and summarization. LLMs are also highly valuable in supporting software engineering tasks, particularly code generation. Automatic code generation is the process of automatically producing source code or executable code from given specifications or requirements, improving developer productivity. In this study, we perform a systematic empirical assessment of the quality of code generation using ChatGPT, a recent state-of-the-art product LLM. We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate ChatGPT's ability to engage in a multi-round fixing process (i.e., ChatGPT's dialog ability, chatting between users and ChatGPT to fix generated buggy code) to facilitate code generation. By delving into the generated code and examining the experimental results, this work provides valuable insights into the performance of ChatGPT in tackling code generation tasks over the three critical aspects. The experimental results demonstrate that (1) ChatGPT is better at generating functionally correct code for problems before 2021 in different languages than problems after 2021, with a 48.14% advantage in Accepted rate on the judgment platform, but ChatGPT's ability to directly fix erroneous code via the multi-round fixing process to achieve correct functionality is relatively weak; (2) the distribution of cyclomatic and cognitive complexity levels for code snippets in different languages varies. Furthermore, the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets; (3) in algorithm scenarios with the languages C, C++, and Java, and CWE scenarios with the languages C and Python3, the code generated by ChatGPT has relevant vulnerabilities. However, the multi-round fixing process for vulnerable code snippets demonstrates promising results, with more than 89% of vulnerabilities successfully addressed; and (4) code generation may be affected by ChatGPT's non-determinism, resulting in variations of code snippets in functional correctness, complexity, and security. Overall, our findings uncover potential issues and limitations that arise in ChatGPT-based code generation and lay the groundwork for improving AI- and LLM-based code generation techniques.


INTRODUCTION
Automatic code generation is a process of automatically generating source code or executable code based on given specifications or requirements. It supports a range of capabilities that benefit software development greatly. By using automatic code generation, developers are able to enhance productivity, reduce development time, and assign more focus to higher-level tasks and core logic. Many studies on code generation leverage AI-based approaches [1], [2], [3], [4], [5], especially large language models (LLMs) [6], [7], [8], [9], [10], [11] such as the recent ChatGPT [12].

AI-based Code Generation. The emergence of AI-based code generation is driven by the increasing complexity of software systems and the desire for a more efficient development process [13]. Traditional code generation approaches [14] rely on predefined templates or rules (e.g., context-free grammar) and input-output specifications, which limits their flexibility and requires manual effort. AI-based approaches [15], [9], [11], [12] leverage the power of machine learning (deep learning) and natural language processing (NLP) to overcome these limitations and can offer more intelligent and adaptable code-generation capabilities. These approaches directly analyze input specifications or requirements expressed in natural language and generate corresponding code snippets or complete programs based on the provided input.

Large Language Model and ChatGPT. Recently, large language models (LLMs) have demonstrated remarkable capabilities in a wide range of NLP tasks, such as machine translation, question answering, summarization, text generation, and grammar checking [16], [17], [18], [19]. These models possess a capacity for understanding and generating human-like text, approaching the level of humans. LLMs are primarily built on the Transformer architecture [6], with OpenAI's GPT-3 (Generative Pre-trained Transformer 3) [20] being a prominent example. GPT-3 is trained on extensive amounts of textual data, resulting in exceptional performance. ChatGPT [12] is an implementation with dialog ability that is built upon the foundation of GPT-3.5 [21] (or GPT-4 [22]). It exhibits outstanding performance in areas such as machine translation, question answering, and summarization, and is in widespread use in various daily activities. Importantly, ChatGPT also possesses the capability of handling code-related tasks, which further expands its potential applications. ChatGPT has become an essential tool for individuals, academia, and industry, significantly enhancing productivity in various domains.

Motivation. While AI-based code generation, using LLMs, provides promising advantages in enhancing productivity and automating software development tasks, it is still essential to assess the generated code for better insights and understanding. Code generation by LLMs faces challenges: for example, whether the code generated by LLMs is functionally correct, overly complex, or secure. The training datasets for LLMs come from the internet, but the quality of the data is uncertain. Consequently, the quality of the code generated by LLMs also cannot be guaranteed [23], [24], [25], [26]. A deep analysis of these aspects can provide a more comprehensive understanding of AI- and LLM-based code generation. In this paper, we are interested in deeply and systematically evaluating the code generated by LLMs in terms of its correctness, complexity, and security. Specifically, we leverage the state-of-the-art ChatGPT
(the version of ChatGPT used is GPT-3.5 instead of GPT-4), a recent product, as the representative of LLMs for evaluation, due to its advanced capabilities and widespread recognition [12]. We also assess ChatGPT's dialog ability (i.e., the multi-round fixing process in one single conversation, chatting between users and ChatGPT to fix generated buggy code) in the code generation task over correctness, complexity, and security. By conducting a comprehensive analysis, we seek to uncover potential issues and limitations that arise in ChatGPT-based code generation, for improving AI- and LLM-based code generation techniques.

Our Study. To cope with the aforementioned challenges and explore the ability of ChatGPT [12] to generate code, we collect and leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios from the LeetCode platform [27] and [23], respectively, for the code generation task, and intend to answer the following research questions (RQs):
• RQ1 (Functionally Correct Code Generation): Is the code generated by ChatGPT functionally correct?
• RQ2 (Multi-round Fixing for Code Generation): How effective is the multi-round fixing process in improving code generation for functional correctness?
• RQ3 (Code Complexity): How complex is the code generated by ChatGPT?
• RQ4 (Security Code Generation): Is the code generated by ChatGPT secure?
• RQ5 (Non-determinism of ChatGPT): How does the non-deterministic output of ChatGPT affect code generation?
Our experimental results demonstrate that (1) ChatGPT is better at generating functionally correct code for problems before 2021 in different languages than problems after 2021, with a 48.14% advantage in Accepted rate on the judgment platform, but ChatGPT's ability to directly fix erroneous code with the multi-round fixing process to achieve correct functionality is relatively weak; (2) the distribution of
cyclomatic and cognitive complexity levels for code snippets in different languages varies. Furthermore, the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets; (3) in algorithm scenarios with the languages C, C++, and Java, and CWE scenarios with the languages C and Python3, the code generated by ChatGPT has relevant vulnerabilities. However, the multi-round fixing process for vulnerable code snippets demonstrates promising results, with more than 89% of vulnerabilities successfully addressed; and (4) code generation may be affected by ChatGPT's non-determinism factor, resulting in variations of code snippets in functional correctness, complexity, and security.

Contributions. In summary, we make the following contributions in this paper:
• We conduct a comprehensive empirical assessment of the quality of ChatGPT-based code generation;
• We systematically evaluate ChatGPT-based code generation, including the multi-round process, from three aspects: correctness, complexity, and security. The evaluated results reveal potential issues and limitations in ChatGPT-based code generation over the three aspects; and
• Our research contributes to advancing the knowledge and understanding of the capabilities of LLMs in enhancing software engineering practices, with a particular focus on code generation.

Online Artifact. The experimental scripts, results, and raw data are available at: [28].

BACKGROUND
In this section, we briefly introduce LLMs, ChatGPT, and the use of ChatGPT.

LLMs and ChatGPT. LLMs (large language models) [6], [29], [30], [31], [32], [33], [20], [12] refer to a class of AI models that use an enormous number of parameters and are designed to process and generate human-like text based on large-scale language datasets. These models utilize deep learning techniques, typically employing Transformer architectures [6], consisting of stacked encoders and decoders, to learn patterns, relationships, and structures in languages. The Transformer utilizes a self-attention mechanism to weigh the importance of words in the input text, capturing long-range dependencies and relationships between words. LLMs are trained on massive amounts of text data from various sources and show a strong ability in many NLP tasks, such as machine translation, question answering, and summarization. GPT [32] and BERT [29] are based on the decoder (unidirectional) and encoder (bidirectional) components of the Transformer, respectively. They utilize pre-training and fine-tuning techniques. GPT-2 [33] and GPT-3 [20] are the successors of GPT, with GPT-2 having a larger model size in parameters than GPT, and GPT-3 being even larger than GPT-2, using 175 billion parameters. Additionally, with larger corpora, GPT-2 and GPT-3 introduce zero-shot and few-shot learning to enable adaptation to multi-task scenarios. Moreover, GPT-3 has demonstrated performance comparable to state-of-the-art fine-tuned systems across various tasks. Codex [9] is obtained by training GPT-3 on GitHub code data. It serves as the underlying model for GitHub Copilot [11], a tool that can automatically generate and complete code. To enhance the alignment between LLMs and users (humans), InstructGPT [31] incorporates additional supervised learning and reinforcement learning from human feedback (RLHF) to fine-tune GPT-3. ChatGPT [12], [31], implemented atop GPT-3.5 [21] (or GPT-4 [22]), is currently the most prominent product LLM that adapts to human expression by using the Instruct approach [31]. ChatGPT utilizes the same methods as InstructGPT and provides the ability to answer follow-up questions (i.e., dialog ability) through RLHF. The dialog ability [12] enables ChatGPT to communicate with users conversationally, continuously generating information or correcting previously incorrect ones. This property makes ChatGPT even more powerful and versatile than previous LLMs. Thus, in this study, we take the state-of-the-art ChatGPT (the default version of GPT-3.5), the recent popular product, as the representative of LLMs for evaluation.

Use of ChatGPT. To use ChatGPT, developers send a text message as input. The message is called a prompt, used to guide ChatGPT's text generation. The prompt serves as a cue for the model to understand the desired output or the user's intent. ChatGPT responds based on the input prompt and the knowledge it learns from its massive amounts of training data. ChatGPT also supports answering follow-up questions (i.e., dialog ability), which allows users to engage in back-and-forth conversations. This capability enables users to send multiple text messages consecutively to ChatGPT and receive responses that maintain context and continuity.
When a user submits a series of messages to ChatGPT, each message within the conversation context is considered by the model when generating a response.The messages can include user prompts, system instructions, and previous responses from ChatGPT itself.By incorporating the conversation history, ChatGPT can produce responses that are more coherent and informed, taking into account the ongoing dialogue.
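To make the conversation structure concrete, the following minimal sketch shows how such a message series is typically represented when calling ChatGPT programmatically; the content strings are illustrative, not from the study:

```python
# A conversation is an ordered list of messages; each has a role
# (a system instruction, a user prompt, or a previous assistant response).
conversation = [
    {"role": "system", "content": "You are a helpful programming assistant."},
    {"role": "user", "content": "Write a bubble sort program in Python to rank 1-10."},
    {"role": "assistant", "content": "def bubble_sort(arr): ..."},
    # A follow-up prompt; the model sees the whole history above
    {"role": "user", "content": "Output the sorted result in non-increasing order."},
]
```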
ChatGPT can be used in many software engineering tasks, including code generation. Additionally, by leveraging ChatGPT's dialog ability, users can provide a series of prompts (i.e., a multi-round process) for code generation (e.g., to fix functionally incorrect code).

Example. For example, the prompt, "Write a bubble sort program in Python to rank 1-10", can be used to have ChatGPT generate a bubble sort program in Python. As shown in Fig. 1, ChatGPT offers a sample code for the bubble sort program in Python, which fulfills the requirement given in the prompt. We can also append a new prompt "Output the sorted result in nonincreasing order" in the same conversation, so that ChatGPT is guided to replace the comparison operator > with < in line 7 of the code snippet.
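Fig. 1 (ChatGPT-generated Bubble Sort Algorithm in Python) survives only partially in this source; the sketch below is consistent with its visible tail, with the bubble_sort body being our reconstruction, so the exact code and line numbering in the actual figure may differ:

```python
def bubble_sort(arr):
    # Repeatedly swap adjacent out-of-order elements until sorted
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:  # the comparison the follow-up prompt flips to <
                arr[j], arr[j + 1] = arr[j + 1], arr[j]

# List of numbers from 1 to 10 in unsorted order
numbers = [3, 7, 2, 8, 10, 1, 5, 9, 4, 6]
# Sort the numbers using bubble sort
bubble_sort(numbers)
# Print the sorted list
print("Sorted list of numbers from 1 to 10:")
print(numbers)
```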

EMPIRICAL STUDY SETUP
In this section, we introduce the workflow of interacting with ChatGPT to generate code and other study setups.

Data Collection
The assessment of our study involves the utilization of two datasets: LeetCode problems [27] and CWE (Common Weakness Enumeration) scenarios (CWE's code scenarios) as provided in [23]. For the LeetCode problem dataset, we randomly collect 728 algorithm problems, where 354 and 374 of them are published after 2021 and before 2021, respectively. We split them because ChatGPT is trained on text data before 2021. For each problem, the problem description, input-output examples, and the method signature template in the specified language are used for code generation. As for the CWE scenario dataset, it contains 18 CWEs with 54 scenarios in the MITRE Top 25 CWEs [34] (3 of them drop in rank to below 25 in the 2022 MITRE Top 25 CWEs). For each CWE, three different code scenarios (contexts) are provided for code generation. The detailed introduction and preprocessing of these datasets are presented in the corresponding subsections of Sec. 4 (i.e., Sec. 4.1 and Sec. 4.4).

Methodology
Workflow. The overall workflow of our study framework is shown in Fig. 2. ❶ We construct a suitable prompt for the given LeetCode problem or CWE scenario (i.e., one CWE's code scenario) and send the constructed prompt to ChatGPT. ❷ ChatGPT generates a response based on the prompt provided in the current round and the conversation context of previous rounds (the first round has no previous conversation context). We extract the code snippet generated by ChatGPT between two triple backticks from the response. ❸ For the generated code, we leverage LeetCode online judgment to test its functional correctness, or we utilize CodeQL [35] (with manual analysis) to detect CWE vulnerabilities. Here, we refer to them collectively as testing in Fig. 2. If the testing result passes (e.g., the code passes all test cases or no vulnerability is detected), the code generation process ends. ❹ Otherwise, there are bugs (e.g., a compile error) in the generated code snippet. If the round number in the conversation (i.e., dialog) with ChatGPT does not exceed the round limit (e.g., the maximum round number of 5), we utilize the feedback provided by LeetCode or CodeQL to construct a new prompt and input it to ChatGPT for a new round of code generation (i.e., go back to ❶ for fixing). If the testing consistently fails and the round number in the conversation exceeds the round limit, the code generation is considered failed. The entire process, including multiple rounds in the conversation, is called the multi-round (fixing) process (a one-round process with the maximum round number of 1 has no fixing property). The details of prompt construction, testing, and the multi-round fixing process are explained in the subsections of Sec. 4.
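A minimal sketch of this loop follows; the function names and the callables query_chatgpt and run_tests are illustrative stand-ins for the ChatGPT API call and the LeetCode/CodeQL testing step, not the paper's actual scripts (which are in the artifact [28]):

```python
import re

ROUND_LIMIT = 5  # maximum round number used in the study

def generate_with_fixing(initial_prompt, query_chatgpt, run_tests):
    conversation = [{"role": "user", "content": initial_prompt}]   # step 1
    for round_no in range(1, ROUND_LIMIT + 1):
        response = query_chatgpt(conversation)                     # step 2
        conversation.append({"role": "assistant", "content": response})
        # Extract the code snippet between two triple backticks
        match = re.search(r"```(?:\w*\n)?(.*?)```", response, re.DOTALL)
        code = match.group(1) if match else response
        passed, feedback = run_tests(code)                         # step 3
        if passed:
            return code, round_no                                  # generation succeeded
        # step 4: use the judge's feedback to build the next-round prompt
        conversation.append({"role": "user", "content": feedback})
    return None, ROUND_LIMIT                                       # considered failed
```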
Principle of Prompt Design. The goal of our prompt design is not to find the optimal prompt that maximizes ChatGPT's performance. Instead, our goal is to provide a reasonable prompt that simulates real-world usage scenarios, especially for code generation, which also avoids overfitting to specific prompts and datasets. In developing the prompt template, we refer to online prompt templates (e.g., OpenAI Cookbook [36] and PromptBase [37]) for code generation tasks and finally establish the following principle for prompt design: offering sufficient information to ChatGPT while leveraging its dialog ability.

Subject LLM. The default language model provided by OpenAI [38] for ChatGPT is GPT-3.5. This model contains 175 billion parameters, making it a highly capable and complex model. GPT-3.5 is engineered to handle a diverse range of natural language processing tasks, such as text generation, text completion, and other related tasks. In this study, we utilize the model version gpt-3.5-turbo-0301 of ChatGPT for performing the evaluation. We query ChatGPT through a simple wrapper [39] of the OpenAI API [38] to easily control the dialog ability of ChatGPT. The temperature of ChatGPT is set to the default value of 0.7 [12] to simulate real-world usage scenarios. Furthermore, the token limitation of ChatGPT [12] is 4,096, which may influence the output from ChatGPT. If the total length of the input prompt and the generated response exceeds this limitation, the excess part is discarded, possibly producing incomplete code snippets and causing errors in the generated code. In our experiments, we impose strict length limitations on both the input prompt and the generated response. For each round in the multi-round process, we find that the current round prompt lengths and response lengths are all under 2,400 tokens and 800 tokens, respectively, which does not exceed ChatGPT's token limitation. Thus, for the one-round process (e.g., Sec. 4.1), the outputs of ChatGPT are not influenced by the token limitation problem. However, in the complete multi-round process, especially when performing code generation for LeetCode problems, there are some cases where the token lengths used (including previous prompts, responses, and the current round prompt and response) can exceed the token limitation. To mitigate this issue when encountering such cases, we take a token-limitation strategy: we add the necessary information (e.g., LeetCode problem descriptions) at the beginning of the current round prompt and remove as little of the beginning dialog content (in block granularity, i.e., one prompt or response) from the conversation as possible, keeping the remaining token space for the response from ChatGPT at least 1,000 tokens in length (in the first experiment in Sec. 4.1, all response lengths are under 770 tokens; we slightly amplify 770 to 1,000 as the guaranteed remaining token space). This strategy avoids missing the necessary details of tasks for ChatGPT. Moreover, in our observation, the strategy guarantees that the generated code snippets are complete and at least ensures that the immediately previous round's response remains throughout the conversation, such that ChatGPT does not lose the most recent code generation-related information. The detailed introduction of this strategy is presented in Sec. 4.2 (the tasks in Sec. 4.1 and Sec. 4.4 do not have this issue; all token lengths used there are lower than the token limitation).
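For concreteness, a minimal sketch of such a query and the token-limitation strategy with the (legacy) OpenAI Python API follows; count_tokens is an assumed helper (e.g., built on a tokenizer such as tiktoken), and the trimming logic is our illustration of the block-granularity strategy described above, not the paper's exact implementation:

```python
import openai  # legacy (pre-1.0) OpenAI Python client, current at the time of the study

MODEL = "gpt-3.5-turbo-0301"
TOKEN_LIMIT = 4096
MIN_RESPONSE_BUDGET = 1000  # token space to keep free for the response

def trim_conversation(messages, count_tokens):
    # Drop the earliest dialog blocks (one prompt or response at a time),
    # always keeping at least the previous response and the current prompt.
    while count_tokens(messages) > TOKEN_LIMIT - MIN_RESPONSE_BUDGET and len(messages) > 2:
        del messages[0]
    return messages

def query_chatgpt(messages):
    response = openai.ChatCompletion.create(
        model=MODEL,
        messages=messages,   # [{"role": "user", "content": "..."}, ...]
        temperature=0.7,     # default value, simulating real-world usage
    )
    return response["choices"][0]["message"]["content"]
```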

Experiment Environment
All experiments are conducted on a server with an Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz (10 cores) and 128GB RAM. Its operating system is Ubuntu 20.04. The framework designed and the scripts used in the experiments are developed in Python 3.10.9. The CodeQL [35] version used is 2.12.2.

Functionally Correct Code Generation
RQ1: Is the code generated by ChatGPT functionally correct?

Motivation. Given an appropriate prompt, ChatGPT [12] is able to generate text consistent with the prompt based on learned knowledge. This ability may improve developer productivity [24], [40], [41], [42]. As a first step, we focus on evaluating the ability of ChatGPT to automatically generate functionally correct code in a one-round process.
Approach. We let ChatGPT read the natural language description of the given problem to generate the corresponding code snippet in a one-round process (i.e., the maximum round number is set to 1), and utilize the problems on LeetCode [27] as our dataset. LeetCode is an online platform that
provides challenging coding problems and automatic judgment. At the time of writing, there are over 2,500 problems on LeetCode with easy, medium, and hard levels, dating back to 2014. We collect all problems on LeetCode and divide them into two categories, problems before 2021 (Bef. problems) and problems after 2021 (Aft. problems), using the time divider of 2022-01-01. Since ChatGPT [12] is trained on text data before 2021, Bef. problems and their corresponding solutions may have a high probability of appearing in its training set. This case may degenerate the code generation task for Bef. problems into querying code in a database (i.e., code reuse [43], [4], [44]). Code reuse is a commonly used software development practice that avoids creating new code from scratch (e.g., copy-paste). Therefore, we take both kinds of problems into account.
Specifically, we focus on Algorithm problems (https://leetcode.com/problemset/algorithms/) on LeetCode, since Algorithm problems are the most significant, numerous, and diverse problems on the platform. The total numbers of Bef. problems and Aft. problems are 1,624 and 354, respectively. Furthermore, the difficulty level distribution of both of them is in the ratio of 1 : 2 : 1 for hard, medium, and easy problems. Among all the Bef. problems, we sample 374 of them randomly, having a similar quantity to the Aft. problems and following the same difficulty level distribution as the Aft. problems. The ratio of the numbers of hard, medium, and easy problems is also 1 : 2 : 1 for both the 354 Aft. problems and the 374 Bef. problems, consistent with the difficulty level distribution of all problems on the LeetCode platform. Additionally, we also check whether there are significant differences between Bef. problems and Aft. problems. If Aft. problems were just reformulations of Bef. problems, ChatGPT would likely be able to solve them easily, which could affect the reliability of the experimental results in distinguishing between time periods. Specifically, we first use the "similar questions" provided for each problem on the LeetCode platform to find similar problem pairs of Bef. problems and Aft. problems. The "similar questions" [27] represent two paired problems that have similar scenarios (e.g., processing strings) or require similar algorithms for solving (e.g., dynamic programming). In total, 142 pairs are found. Then, we have two graduate students independently and manually check these problem pairs. Through a careful checking and discussion process, we find that these similar problems either have similar scenarios but completely different solution goals, or have different scenarios and conditions but can be solved using similar algorithms such as dynamic programming. After a careful manual analysis, we do not find any cases where Bef. problems can be easily reformulated to obtain Aft. problems. Thus, we consider Aft. problems and Bef. problems to be sufficiently different. Moreover, for each problem, we ask ChatGPT to generate code in five different languages: C, C++, Java, Python3, and JavaScript. We create a corresponding prompt using the same prompt template for each <problem, language> pair. In total, there are 1,870 and 1,770 prompts for Bef. problems and Aft. problems, respectively. Due to the rate-limiting of queries to ChatGPT, we input every prompt once to ask for code generation. Then, we submit the parsed solutions to LeetCode for functional correctness judgment and get submission statuses [27] including Accepted, Wrong Answer, Compile Error, Time Limit Exceeded, and Runtime Error, which correspond to A., W.A., C.E., T.L.E., and R.E., respectively. One problem corresponds to one unique conversation to avoid triggering ChatGPT's reasoning from other problems. The status explanations are as follows:
• Accepted: The submitted code snippet passes all test cases.
• Wrong Answer: The submitted code snippet fails on at least one test case.
• Compile Error: The submitted code snippet fails to compile.
• Time Limit Exceeded: The submitted code snippet does not finish all test cases within the time limit set by LeetCode.
• Runtime Error: The submitted code snippet crashes during execution (e.g., due to an out-of-bound access).
We evaluate ChatGPT's ability of code generation on the metric of status rate (SR), defined as follows:

SR(status) = N_c / N_i × 100%,

where N_c and N_i are the number of generated code snippets belonging to the status and the number of prompts input, respectively, and the status is either A., W.A., C.E., T.L.E., or R.E.. A deep analysis of code with W.A., C.E., T.L.E., or R.E. is presented in Sec. 4.2.
We also conduct the Wilcoxon rank-sum test [45] and Cliff's Delta effect size measure [46] to compare two independent samples, determining whether there are significant differences between them and quantifying the magnitude of the observed differences. The null hypothesis for the Wilcoxon rank-sum test is that there is no significant difference between the two samples, where the samples are combinations of SR values under different conditions (e.g., A. rate values of five languages in different period problems). If the p-value obtained from the Wilcoxon rank-sum test is small (less than 0.05), it suggests that there is a statistically significant difference between the two independent samples. In cases of multiple comparisons, we apply the Holm-Bonferroni correction [47], a commonly used technique, to adjust p-values and reduce the risk of Type I errors. The absolute value of the effect size (effect size value) obtained from Cliff's Delta ranges from 0 to 1. A value close to 0 indicates a small effect, meaning that there is minimal difference between the two independent samples, and a value close to 1 indicates a substantial effect size, meaning that there are significant differences between them. By combining the results from the Wilcoxon rank-sum test and Cliff's Delta, we gain a comprehensive insight into the differences in code generation results, allowing us to draw more robust conclusions.

Prompt. The prompt template designed consists of 4 components: <Content>, <Examples>, <Template>, and <Command>, aligning with the principle of prompt design (see Sec. 3.2). Fig. 3 shows an example of a prompt. <Content> describes the problem in natural language, <Examples> shows <input, output> pairs of functionally correct code, <Template> specifies the method signature of the generated code, and <Command> asks for generating code in a specific language.

Result. Tables 1 and 2 show the code generation results judged by LeetCode for five languages in two periods and in two forms: SR values and the corresponding relative frequency bar charts. The columns of Python3 and JavaScript contain no C.E. since both of them are dynamic languages. From the overall results, ChatGPT generates functionally correct code for Bef. problems at a significantly higher A. rate than for Aft. problems. Specifically, the average A. rate (68.41%) in five languages for Bef. problems exceeds the Aft. problems' (20.27%) by 48.14%. The performance in five languages of code generation in different periods is significantly different, with a p-value of 0.008 and an effect size value of 1.
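As an illustration of this statistical procedure, the sketch below reproduces the overall Bef.-vs.-Aft. comparison from the per-language A. rates reported later in this section, using SciPy's rank-sum test, a direct implementation of Cliff's Delta, and statsmodels' Holm-Bonferroni adjustment; the raw p-values in the last step are placeholders:

```python
from scipy.stats import ranksums
from statsmodels.stats.multitest import multipletests

def cliffs_delta(xs, ys):
    """Cliff's Delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(x > y for x in xs for y in ys)
    less = sum(x < y for x in xs for y in ys)
    return (greater - less) / (len(xs) * len(ys))

# Per-language A. rates (%) reported in this section
bef = [47.24, 68.63, 76.37, 75.35, 74.44]  # Bef. problems: C, C++, Java, Python3, JavaScript
aft = [15.38, 19.37, 20.17, 23.93, 22.51]  # Aft. problems: C, C++, Java, Python3, JavaScript

_, p = ranksums(bef, aft)
# Complete separation of the two samples yields p below 0.01 and |delta| = 1,
# consistent with the reported p-value of 0.008 and effect size value of 1.
print(f"p = {p:.3f}, |delta| = {abs(cliffs_delta(bef, aft)):.2f}")

# Holm-Bonferroni adjustment for multiple comparisons (placeholder p-values)
reject, p_adjusted, _, _ = multipletests([0.01, 0.02, 0.04], alpha=0.05, method="holm")
```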
• Aft. Problems. For Aft. problems, the overall A. rate is lower than 25%, where the A. rates of hard, medium, and easy problems are 0.66%, 13.90%, and 52.47%, respectively. The p-values adjusted using the Holm-Bonferroni correction procedure and the effect size values between different difficulties in five languages are all less than 0.05 and equal to 1, respectively. The result indicates that ChatGPT's ability to generate functionally correct code decreases significantly as the difficulty of the problem increases for Aft. problems. Additionally, even for easy problems, it is only able to answer about half of them correctly. Out of these five/four metrics, the W.A. rate is the highest, reaching 58% for all languages. Moreover, each W.A. code snippet has an average of 109 test cases; however, the code generated by ChatGPT can pass only 25% of them. Hard, medium, and easy problems achieve 20.90%, 21.03%, and 38.41% test case pass rates, respectively. Thus, regardless of the difficulty, the semantics of the generated code differ significantly from the logic of the corresponding problem descriptions. In addition, the C.E. rate and R.E. rate together reach 16%, and hard and medium problems' rates are significantly higher than easy problems'. The code generated by ChatGPT for hard and medium problems is more likely to contain both compile and runtime errors. The compile errors include undeclared variable, function declaration error, uninitialized variable, constant function (i.e., generating an empty body), and so on. For example, Fig. 4 shows that the generated function cmpfunc is not declared before invocation. Syntax errors account for only a small fraction (3.7%) of these errors. For runtime errors, there are null pointer dereference, out-of-bound, heap-buffer-overflow, type error, and so on, which are common in human-written code. As for the T.L.E. rate, it does not reach a high value (6%), but the average pass rate of test cases is 51%, which is higher than W.A. code snippets'. The average test case pass rates of the three difficulty levels of hard, medium, and easy for T.L.E. problems are 68%, 50%, and 1% (easy problems can be neglected due to their T.L.E. rate being close to 0%), respectively. Since T.L.E. code snippets' test case pass rate is partial, it is a lower bound for these problems, and at most an additional 6% of the generated code could be functionally correct, even though its time complexity may not be ideal.
Breaking down to each language, C, C++, Java, Python3, and JavaScript have A. rates of 15.38%, 19.37%, 20.17%, 23.93%, and 22.51%, respectively. Moreover, the A. rate distributions (acceptance ratio distributions) of combining the five different languages for each problem (only considering problems that have at least one correct solution) are shown in Fig. 5. From the figure, we can see that both medium's mean and median lines are ≤ 0.5, and easy's are all ≥ 0.6. ChatGPT generalizes generated code to different languages more easily for easy problems. The differences between easy and medium's median and mean are 0.4 and 0.22, respectively. Moreover, the average acceptance rate among humans for the problems accepted by ChatGPT is 66%, and the one for the problems not accepted by ChatGPT is 48%. ChatGPT has similarities with human performance. In addition, functionally correct code's runtime and memory overheads are excellent in the human ranking, outperforming on average over 68% and 51% of solutions, respectively.
• Bef. Problems. As for Bef. problems, the A. rates of hard, medium, and easy problems are 40.13%, 70.95%, and 89.80%, respectively, which are much higher than Aft. problems', though there still exist significant differences among the difficulties. The p-values adjusted using the Holm-Bonferroni correction procedure and the effect size values between hard and medium, and between hard and easy, difficulties in five languages are all less than 0.05 and greater than 0.9, respectively. The adjusted p-value and effect size value between medium and easy difficulties in five languages are 0.056 and 0.76, respectively. ChatGPT performs better on problems that may appear in its training set before 2021, especially for medium and easy problems. The A. rate on hard problems increases by 40% but is still below 50%, which indicates that ChatGPT's ability to generate code for logically complex problems still has big room for improvement. The overall W.A. rate decreases to 17.03%, and hard, medium, and easy problems' W.A. rates are 32.89%, 15.05%, and 6%, respectively. The generated code can still pass only 25% of an average of 112 test cases. Hard, medium, and easy problems achieve 19.19%, 31.12%, and 47.32% test case pass rates, respectively. The latter two both have a 10% improvement, which indicates that ChatGPT has a better understanding of Bef. problems. However, the C.E. rate and R.E. rate still reach 13%, close to Aft. problems' 16%, with a p-value and effect size value between the two periods of 0.328 and 0.3125, respectively, and hard problems have the highest rate, followed by the medium ones. The compile errors and runtime errors are similar to Aft. problems', including undeclared variable, uninitialized variable, null pointer dereference, out-of-bound, heap-buffer-overflow, type error, and so on. For example, the code shown in Fig. 6 is used to reshape a given 2-dimensional matrix but triggers a runtime error at line 15, which allocates a wrong size of memory to *returnColumnSizes. As for the T.L.E. rate, the value decreases to 1.87%, with an average 74% test case pass rate.

[Fig. 7: Distribution of the ratios of languages accepted for corresponding Bef. problems (the meaning of the two lines is presented in Fig. 5).]
Breaking down to each language, C, C++, Java, Python3, and JavaScript have A. rates of 47.24%, 68.63%, 76.37%, 75.35%, and 74.44%, respectively. The rate values of the last four languages are close to each other and substantially higher than the rate value of C, the lowest-level language, by at least 20%. Fig. 7 shows the same as Fig. 5 but for Bef. problems. From the figure, we can see that medium and easy's mean and median lines are ≥ 0.75, and the differences between their median and mean are smaller than the previous ones of Aft. problems by half. Moreover, hard's mean and median lines are both ≥ 0.55. ChatGPT generalizes code to different languages more easily for Bef. problems. The average acceptance rate among humans for the problems accepted by ChatGPT is 55%, and the one for the problems not accepted by ChatGPT is 47%. Functionally correct code's runtime and memory overheads are also excellent in the human ranking, outperforming on average over 71% and 54% of solutions, respectively.
We also sample 50 problems from all problems (25 Aft. problems and 25 Bef. problems, where each problem has 5 solutions in 5 different languages) to investigate how many times ChatGPT generates solutions exactly matching (token-by-token) ground truth solutions. We collect 5 distinct ground truth solutions for each ChatGPT-generated solution, where the ground truth solutions are obtained from the LeetCode platform and [48]. By our manual analysis, we find that none of the solutions are generated on a token-by-token basis. However, for Bef. problems, we find that 6 solutions in easy and medium difficulties are Type-2 clones (i.e., with some renamed unique identifiers) [49] of ground truth solutions.

Code with Wrong Answer

▶ Analyzing: We utilize the defect categories from the fixing perspective used in [50] (CodeFlaws) and [51]. [50] analyzes the programs submitted on Codeforces [52] and classifies the defects in these programs into multiple classes, and [51] follows the defect classification of [50] to categorize the code defects generated by Codex [9]. The defect classes, definitions, and classified results are shown in Table 4. As we can see, most of the defects fall under the M-L subclass of Multi-hunk and the Misaligned Algorithm subclass of Algorithm-related, with 51 and 62, respectively. The other subclasses of defects are relatively much rarer, especially S-A, S-F, S-AS, and S-DS, which are all 1. From the perspective of difficulty, hard problems require more fixing work than medium and easy ones, since hard problems have only M-L and Misaligned Algorithm defects while medium and easy ones have other subclasses. Moreover, during our manual analysis of these code snippets, we also find that the causes of errors from a logic perspective can be divided into three categories as follows (also see Table 4):

① Wrong Detail (WD): The code generated by ChatGPT has errors in some details. These detail errors stem from a slight misunderstanding (e.g., of a word) of the given problem, or from generated code that is not consistent with the understanding of the problem. Fig. 8 and 9 show two code examples corresponding to the two detail errors, respectively. Fig. 8 is an example of a slight misunderstanding of the given problem. The given problem 2375 (https://leetcode.com/problems/construct-smallest-number-from-di-string/) is a medium one that asks to generate the lexicographically smallest possible string that meets given conditions. However, the generated code does not completely capture the meaning of lexicographically smallest, due to line 11 decreasing the number from the maximum one. As for the code snippet in Fig. 9, it is an example of generated code that is not consistent with the understanding of the problem. The given problem 2525 is an easy one that asks to categorize a given box into one of several categories according to its properties. ChatGPT understands the meaning of the problem but fails to transfer the meaning of the problem description in natural language to code semantics. This is reflected in the conditional expressions of lines 12, 15, and 18. We take line 12 as an example. The condition in natural language is "If the box is both "Bulky" and "Heavy", then its category is Both", and ChatGPT utilizes strcmp twice to compare whether the category string is both "Bulky" and "Heavy". The meaning of the natural language description is not equivalent to the code semantics.
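To make the last point concrete, the flaw is a condition that can never hold: a single string cannot simultaneously equal two different literals. A minimal Python rendering of the mistake versus the intended logic (our illustration, not the paper's C code from Fig. 9):

```python
def categorize_wrong(category):
    # Mirrors the strcmp-based check: one string compared against two
    # different literals joined by "and" can never be true.
    return category == "Bulky" and category == "Heavy"

def categorize_intended(is_bulky, is_heavy):
    # The problem's condition is about two independent properties of the box.
    return is_bulky and is_heavy

assert categorize_wrong("Bulky") is False       # always False: the WD error
assert categorize_intended(True, True) is True  # the intended semantics
```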
WD errors are also easy for humans to fix, since the generated code logic is roughly correct. Based on our manual analysis, the defect subclasses corresponding to WD errors are mainly the subclasses other than M-L and Misaligned Algorithm.
② Misunderstanding Certain Content (MCC): The code generated by ChatGPT does not hold the main condition of the given problem, even though the algorithm used by the generated code is suitable. Fig. 10 shows an example code snippet. The corresponding problem 2433 (https://leetcode.com/problems/find-the-original-array-of-prefix-xor/) is a medium one and asks to find the solution satisfying one xor-based condition. The key to this problem is to derive the correct recurrence formula from this xor-based condition. ChatGPT reasons out a recursive formula in lines 2-5, but this recursive formula does not satisfy the condition required by the problem. Other typical examples are problems using dynamic programming (DP) where the generated code uses wrong DP equations.
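For context, the recurrence that problem 2433 requires follows directly from the definition pref[i] = pref[i-1] XOR arr[i], giving arr[i] = pref[i-1] XOR pref[i]. A minimal correct Python sketch (our illustration, not the code from Fig. 10):

```python
def find_original_array(pref):
    # pref[i] = arr[0] ^ arr[1] ^ ... ^ arr[i], hence
    # arr[0] = pref[0] and arr[i] = pref[i - 1] ^ pref[i]
    arr = [pref[0]]
    for i in range(1, len(pref)):
        arr.append(pref[i - 1] ^ pref[i])
    return arr

# e.g., [5, 7, 2, 3, 2] has the prefix xor array [5, 2, 0, 3, 1]
assert find_original_array([5, 2, 0, 3, 1]) == [5, 7, 2, 3, 2]
```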
MCC errors are more difficult for humans to fix than WD errors, since the core of the code needs to be modified to meet the conditions given by the problems. Based on our manual analysis, the defect class corresponding to MCC errors is Multi-hunk.
③ Misunderstanding Problem (MP): ChatGPT misunderstands or does not understand the given problem description. The generated code does not hold all conditions and uses wrong (misaligned) algorithms. Fig. 11 shows an example for problem 2289 (https://leetcode.com/problems/steps-to-make-array-non-decreasing/).
MP errors are the most difficult to fix among these three kinds of errors for humans, since the code needs complete rewriting. Based on our manual analysis, the defect subclass corresponding to MP errors is Misaligned Algorithm.

▶ Multi-round Fixing: We take each code snippet of a <problem, language> pair to ChatGPT to continuously generate code in one unique conversation with multiple rounds. The round limit number is set to 5, providing a reasonable maximum number of fixes of 5 times [53]. For each pair, we create an initial prompt by leveraging the corresponding problem (i.e., <Content> and <Examples> in Fig. 3), the code snippet, and the error message. The error message is returned by the LeetCode online judgment, which is suitable to be taken as feedback provided to ChatGPT (a sketch of such a prompt follows this paragraph). If the newly generated code snippet is still not accepted (i.e., W.A., C.E., R.E., or T.L.E.), the corresponding error message is taken directly as a new prompt provided to ChatGPT to fix and generate a new code snippet in the same conversation. It is appropriate to use the error message directly as the new prompt since ChatGPT has the dialog ability. The whole process lasts for a maximum of five rounds (one round corresponds to one newly generated code snippet) if the generated code is never accepted. However, there are cases where the cumulative token length of previous prompts, responses, and the current round prompt and response exceeds the token limitation of ChatGPT. We mitigate this problem with the token-limitation strategy (see Sec. 3.2) through reusing the initial prompt template with the current round's erroneous code and message, which avoids missing necessary information of the problem description for ChatGPT. Moreover, it guarantees completely generated code snippets and also at least retains the immediately previous round's response in practice (Sec. 3.2).
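The original example prompt did not survive extraction; the sketch below is a plausible reconstruction of how the initial fixing prompt is assembled from the ingredients described above, and the exact wording may differ from the paper's artifact [28]:

```python
def build_initial_fixing_prompt(content, examples, language, code, error_message):
    # <Content> and <Examples> come from the problem's prompt template (Fig. 3);
    # the code snippet and error message come from the failed round.
    # Hypothetical phrasing: only the ingredients are documented in the paper.
    return (
        f"{content}\n\n{examples}\n\n"
        f"The code in {language} below is not accepted by LeetCode:\n"
        f"```\n{code}\n```\n"
        f"The error message is:\n{error_message}\n"
        f"Please fix the code."
    )
```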
The result of multi-round code generation is shown in Table 5, where '/'s left hand and right hand represent the accepted number (i.e., code snippets accepted within five rounds) and the total number, respectively. From the result, we can see that the majority of these 157 <problem, language> pairs cannot be fixed automatically. Only 25 pairs are fixed across the 5 different languages, and 16 of them are problems at the easy level. Only 7 pairs at the medium level are fixed, though the total number of medium pairs is more than twice as many as the pairs at the easy level. Pairs at the hard level are nearly impossible to fix. The percentage of fixes for pairs under all difficulties is less than half. However, judging by problems rather than pairs, 12 out of 13 easy problems are fixed. The percentage of fixes for hard and medium problems is still below 30%. The average number of rounds per fixed pair is 1.32; 21 of the 25 can be fixed within just one round. The defect classes of the 25 pairs are mostly Multi-hunk, where M-S, M-U, M-L, and M-B account for 3, 3, 12, and 1, respectively. The remainder is in Single-hunk and Algorithm-related, where S-O, S-AS, S-HO, and Misaligned Algorithm account for 2, 1, 2, and 1, respectively.
To further analyze why most of the code snippets cannot be fixed under the multi-round process, we randomly select 10 more pairs from these unfixed pairs and expand the round limit number to 10 for multi-round fixing. The results are shown in Table 6. Of these 10 code snippets, 2 are successfully fixed within 10 rounds. The remaining 8 still fail to be fixed, including 2 that end up as R.E.. We manually check these failed pairs' final generated code snippets and find that 7 of them, marked as dark grey, deviate significantly from the meaning of the corresponding problems (i.e., MCC and MP). The remaining one, marked as light grey, is almost correct, with only a very small logical error (i.e., WD and single-hunk), where this error persists throughout the multi-round process. Moreover, only 5 W.A. code snippets with single-hunk defects are fixed under the previous 5-round fixing.
Therefore, we conclude that there are 2 reasons why ChatGPT cannot automatically fix W.A. code snippets through multi-round fixing. On one hand, ChatGPT lacks the ability to grasp logical details, even though these details may be straightforward for humans. ChatGPT struggles to notice them and make the corresponding fixes. Thus, ChatGPT's implementation ability for logical details needs improving. On the other hand, ChatGPT falls short in dealing with problems that require complex reasoning (for W.A. code snippets), resulting in newly generated code that still deviates from the actual meaning of the problems. As a result, these kinds of W.A. code snippets are difficult to fix directly and automatically, but this is not always the case (e.g., <Construct Smallest Number From DI String, Python3> and <Construct Smallest Number From DI String, JavaScript> in Table 6, where the latter is fixed at the 10th round).

⋆ Summary 1. Most of the defects of code with W.A. fall under the M-L subclass and the Misaligned Algorithm subclass, with 51 and 62, respectively. The other subclasses of defects are relatively much rarer.

⋆ Summary 2. After our manual analysis, we conclude that W.A. code snippets can be divided into three categories of WD, MCC, and MP from a logic perspective.

⋆ Summary 3. By applying multi-round fixing, ChatGPT has difficulty fixing W.A. code snippets. We identify two reasons: (1) ChatGPT lacks the ability to grasp logical details, and (2) ChatGPT falls short in dealing with problems that require complex reasoning.

Code with Compile Error
▶ Analyzing: We analyze all C.E. code snippets and classify them manually based on the compile error messages returned by LeetCode. There are 312 code snippets with C.E. in three different languages: C, C++, and Java.
The compile error classes, explanations, and classified results are shown in Table 7. From the table, we can see that the majority of compile errors are in the class of constant function, accounting for half (159/314) of all compile errors. A code snippet having a constant function compile error means that the functions or methods in the code snippet have an empty body (i.e., the generated code is the same as the corresponding code template provided). Thus, this type of compile error is not a real compile error for the generated code, since it is a case of failure of code generation by ChatGPT (it is not a failure of response). The other three special compile error classes, wrong method name, redefinition of main, and incompatible parameter types, are related to the LeetCode online judgment platform; they are inconsistent with the settings in LeetCode but are not real compile errors. For example, a compile-error-free code snippet generated by ChatGPT may contain a main function, but LeetCode has set another internal main for running test cases, which causes the compile error of redefinition of main. Nevertheless, wrong method name and incompatible parameter types, though they do not indicate real compile errors, show that ChatGPT may have a certain chance to generate code regardless of the requirement (i.e., the code template) given in the prompt. More interestingly, we also find that for code snippets with the compile error of wrong method name, a few method names used for Aft. problems are method names of Bef. problems. For example, problem 2449 (https://leetcode.com/problems/minimum-number-of-operations-to-make-arrays-similar/) requires using makeSimilar as the method name, but ChatGPT generates the method name minOperations, which is used in problem 1658 (https://leetcode.com/problems/minimum-operations-to-reduce-x-to-zero/); this may point to an inference attack problem [54], [55]. As for the remaining classes of compile errors, they are real compile errors not triggered by the LeetCode platform. Table 7 provides the explanations of the corresponding classes of these compile errors, and examples of these classes can be found in our online artifact [28].
▶ Multi-round Fixing: We follow a fixing process similar to W.A.'s, where the prompt states "The code in <language> below has compile errors:" and the error message also comes from the LeetCode online judgment. The entire fixing process continues until the generated code snippet is accepted or the process reaches the maximum round number of 5. We take the final status (e.g., A.) in one conversation as the final generation result for the corresponding <problem, language> pair. The strategy for mitigating token limitation follows the setting in W.A. multi-round fixing. In addition, we perform fixing for all classes except the class of constant function, since fixing constant function is equivalent to regenerating the entire code snippet for the <problem, language> pair. The result is shown in Table 8, where each x:y:z entry reports the numbers of fixed code snippets, code snippets with retained errors, and code snippets with changed errors, respectively. For the code snippets that remain unfixed, we classify the causes into two categories:

① Errors in Languages (EIL): EIL errors arise from the properties of the language used (i.e., C and C++) to implement the code. The errors include both retained errors and changed errors. Fig. 12 shows an example code snippet with a retained error (label error), where the code snippet is the final one generated in the conversation. In this particular case, the retained error is a label error: the code violates the rule of C by placing a label (line 6) before a declaration (line 7). ChatGPT fails to fix the error within 5 rounds even though the error message contains the explanation of label error shown in Table 7. Regarding changed errors, all of them are runtime errors, except one no attribute error and two use undeclared function errors, which means that almost all C.E. errors are fixed. ChatGPT tries to generate functionally correct code, but R.E. errors are triggered in the implementation of the algorithms. For instance, the final generated code may have an out-of-bound error. We discuss R.E. further subsequently. For the two classes of C.E. errors, we show an example of use undeclared function in Fig. 13, where ChatGPT fixes its original syntax error on '}' but introduces another C.E. error. The specific error is related to the comparison function cmp used as an argument for the qsort function. However, cmp is not defined in the code snippet, resulting in a compile error.
② Errors between Languages (EBL): Different from EIL, EBL errors arise from the similarity between the languages C and C++. The errors still include both retained errors and changed errors, and all changed errors are C.E. errors that are the same as the C.E. errors among the retained errors of EBL. Thus, we only use a retained error as an example. Fig. 14 shows an example code snippet with a retained error (error of #include). The language for this <problem, language> pair is C, but the generated code uses the <cstring> and <algorithm> header files belonging to C++.

[Fig. 14: An example code snippet in C with retained error (error of #include) of EBL. The code snippet is the final one generated in the conversation.]

[Fig. 15: An example code snippet in Python3 with type error.]

We observe the entire multi-round process for these unfixed <problem, language> pairs whose final errors are C.E. errors. We find that in most cases, although the error messages provided contain the causes of the C.E. errors and the corresponding locations, and ChatGPT acknowledges the error problem in its natural language response, the newly generated code snippets still have the same errors. One example is Fig. 12, where ChatGPT notices the error in each round but fails to fix it. For these errors, a potentially appropriate approach is to add information from human knowledge to the prompts to trigger ChatGPT to truly fix the errors. For instance, for the example of Fig. 14, we can supply extra information (e.g., "the code snippet is in language C, you cannot use C++ header files") to ChatGPT to fix the error.
Additionally, for each <class, difficulty> (e.g., <use undeclared function, medium>) in Table 8 except label error, at least one code snippet can be fixed. Thus, ChatGPT's multi-round code fixing for errors (including R.E. errors; see Table 10) may also be related to the randomness (i.e., temperature) of ChatGPT itself or to the code snippets it receives.

⋆ Summary 1. More than half of ChatGPT's C.E. errors (in static languages) are unreal compile errors, especially constant function, wrong method name, and incompatible parameter types. This experimental result indicates that ChatGPT's code generation stability (avoiding generating empty bodies) and alignment with human attention (meeting user requirements such as the provided method signature) are potentially severe issues that need to be strengthened.

⋆ Summary 2. By applying multi-round fixing, most (70%) of C.E. code snippets can be fixed, including 26% of them fixed to A.. After analyzing the unfixed code snippets, the unfixed reasons can be attributed to EIL and EBL. Additionally, a potential approach to help ChatGPT fix the unfixed errors is to add human knowledge.

Code with Runtime Error

▶ Analyzing: Some R.E. classes mirror those seen for C.E.; for instance, wrong method name also appears here, with method names taken from similar problems (e.g., problem 905, https://leetcode.com/problems/sort-array-by-parity/, and the similar problem at https://leetcode.com/problems/sort-even-and-odd-indices-independently/). Additionally, the majority of runtime errors are overflow runtime errors (i.e., integer-overflow, heap-buffer-overflow, and out-of-bound), and the languages involved in these errors are also mainly the static languages C, C++, and Java, which is similar to humans making runtime errors. For dynamic languages (i.e., Python3 and JavaScript), the majority of runtime errors are in the class of type error. This error occurs when an operation is performed on an object of an incompatible type. One example is shown in Fig. 15: in line 9, the code statement performs modulus operations (%) on the characters num_str[i] and num_str[j], which is not valid. As for the remaining classes, their explanations are provided in Table 9 and their examples can be found in our online artifact [28].
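A minimal Python reproduction of this kind of type error (our illustration; the code in Fig. 15 differs):

```python
num_str = "1934"
# '%' between two str operands is interpreted as printf-style string
# formatting, so applying it to two characters raises:
# TypeError: not all arguments converted during string formatting
result = num_str[0] % num_str[1]
```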
▶ Multi-round Fixing: We follow the settings of C.E.'s multi-round fixing, and the prompt template used here is modified, turning "The code in <language> below has compile errors:" into "The code in <language> below has runtime errors:". The result is shown in Table 10; the x:y:z entries are the same as in Table 8. From the result, we can see that most of the code snippets can be fixed, and there are 32 and 9 code snippets in five different languages with retained errors and changed errors, respectively. Of the 143 fixed code snippets, 52 are accepted, comprising 23, 12, 2, 10, and 5 in C, C++, Java, Python3, and JavaScript, respectively. For these 52 code snippets, their classes of runtime errors include integer-overflow (11), heap-buffer-overflow (12), undefined-behavior (1), out-of-bound (10), null pointer dereference (4), wrong method name (3), type error (5), value error (1), heap-use-after-free (1), recursion error (1), uninitialized variable (2), and divided by zero (1). As for the code snippets with retained errors and changed errors, they are few in number (41), and most of them belong to overflow errors. Regarding retained errors, by manual analysis, we believe that the main reason the runtime errors cannot be eliminated is the algorithm implementation by ChatGPT. It is similar to those (e.g., WD) among the W.A. errors. One example of overflow is shown in Fig. 6, where ChatGPT fails to fix line 15 within the 5 rounds of dialogue. Fig. 16 shows an example of a value error. The problem (or conflict) in the code snippet is between line 14 and line 9 (i.e., char.isalpha() and char in '0', '1', '&', '|', '(', ')'). The char.isalpha() condition checks whether a character is an alphabetic character (a-z or A-Z), while the char in '0', '1', '&', '|', '(', ')' condition checks for specific characters that are not alphabetic. Regarding the 9 changed errors, 1 is changed to a compile error (use undeclared function), and the remaining errors are still runtime errors, including heap-buffer-overflow (1), undefined-behavior (2), out-of-bound (2), type error (2), and recursion error (1). These new errors are introduced as the code snippets continue to be fixed and some parts of the code snippets conflict.
We also observe the entire multi-round process for these unfixed <problem, language> pairs whose final errors are R.E. errors. As with C.E. errors, in most cases ChatGPT notices the errors based on the error messages provided; however, the newly generated code snippets still have the same errors. Fig. 6 and Fig. 16 are two examples. To fix runtime errors, as with C.E. errors, a potentially appropriate approach is to add information from human knowledge to the prompts. For instance, to fix Fig. 6, we can supply extra information (e.g., "*returnColumnSizes = (int *)malloc(sizeof(int)); allocates a wrong size of memory to *returnColumnSizes") to fix the error. ⋆ Summary 1. The majority of runtime errors for static languages and dynamic languages are overflow (105) and type error (21), respectively. Additionally, for Python3 and JavaScript, 10 code snippets are constant functions and 13 have wrong method names. The result indicates that ChatGPT's code generation stability and alignment with human attention are also potentially severe issues for dynamic languages. ⋆ Summary 2. By applying multi-round fixing, as with C.E., most (78%) of the R.E. code snippets can be fixed, and 28% of them can be fixed to A.. After analyzing the unfixed code snippets, we conclude that ChatGPT is flawed in the details of algorithm implementation, and a potential approach to trigger ChatGPT to fix the unfixed errors is to add human knowledge.
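To make the human-knowledge hint above concrete, the following is a minimal C sketch of a Fig. 6 style allocation bug and its repair; the function shape and names are hypothetical stand-ins for the actual LeetCode solution, which we do not reproduce here:

    /* Hypothetical sketch of the Fig. 6 style allocation bug. LeetCode C
     * problems returning a 2D result pass in returnColumnSizes, which must
     * hold one int per returned row, not a single int. */
    #include <stdlib.h>

    int **build_rows(int numRows, int *returnSize, int **returnColumnSizes) {
        *returnSize = numRows;
        /* Buggy: *returnColumnSizes = (int *)malloc(sizeof(int));
         * allocates one int, so writing (*returnColumnSizes)[i] for i > 0
         * is a heap-buffer-overflow. */
        *returnColumnSizes = (int *)malloc(numRows * sizeof(int)); /* fixed */
        int **rows = (int **)malloc(numRows * sizeof(int *));
        for (int i = 0; i < numRows; i++) {
            (*returnColumnSizes)[i] = i + 1;  /* row i holds i + 1 entries */
            rows[i] = (int *)malloc((i + 1) * sizeof(int));
        }
        return rows;
    }

Pointing ChatGPT at exactly this allocation line, as in the prompt hint above, is what turns an unfixable retained error into a fixable one.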

Code with Time Limit Exceeded
▶ Analyzing: There are 140 code snippets with T.L.E. in the five languages C, C++, Java, Python3, and JavaScript. Two graduate students together analyze each T.L.E. code snippet and categorize all these code snippets based on the analysis results. After manually analyzing these code snippets, we classify their timeout reasons into three categories: ① Aligned but Inefficient Algorithm Implementation (AIAI): The algorithm used in the code generated by ChatGPT is aligned with the requirement given in the problem description, but some parts are not efficient. For example, the modulo-based and subtraction-based implementations of gcd realize the same algorithm, but their time complexities are O(log n) and O(n), respectively. Some T.L.E. errors are caused by AIAI. Fig. 17 shows an example similar to gcd: it uses the subtraction operator − rather than the modulo operator %, which increases the time complexity and fails to pass all test cases within the time limit set by the LeetCode platform.
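For illustration, a minimal C sketch of the two gcd variants discussed above (our example, not the Fig. 17 code):

    /* Two functionally identical gcd implementations (assuming a, b > 0)
     * with different costs: the subtraction form needs O(n) iterations in
     * the worst case (e.g., gcd(1, n)), the modulo form only O(log n). */
    unsigned gcd_sub(unsigned a, unsigned b) { /* AIAI-style: aligned, slow */
        while (a != b) {
            if (a > b) a -= b;
            else b -= a;
        }
        return a;
    }

    unsigned gcd_mod(unsigned a, unsigned b) { /* efficient Euclidean form */
        while (b != 0) {
            unsigned t = a % b;
            a = b;
            b = t;
        }
        return a;
    }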
② Functionally Correct but Misaligned Algorithm (CMA): The algorithm used by ChatGPT is functionally correct for the corresponding problem but too inefficient for the time limit set by the LeetCode platform; hence the algorithm is misaligned. In CMA, the most common examples of the generated code solve problems with the brute-force method. Although the algorithm (or code) using the brute-force method is functionally correct, its time complexity may be very high (e.g., O(2^n)), and thus the code is judged as timeout by the LeetCode online judgment. One example is shown in Fig. 18, whose time complexity is exponential (the input size is log n bits for an input number n).
③ Functionally Incorrect Algorithm (IA): The algorithm used by ChatGPT is functionally incorrect (i.e., WD, MCC, and MP) for the corresponding problem, and is also slow (due to the incorrect function) in resolving the given problem instances (i.e., test cases). The algorithm may be aligned or misaligned (e.g., brute-force method). One example is shown in Fig. 19, which uses a greedy algorithm that is functionally incorrect for the problem. ▶ Multi-round Fixing: Following the settings of the C.E. and R.E. fixing above, the result is shown in Table 11. In total, 44 of the 140 code snippets across different languages and difficulty levels can be fixed (i.e., accepted) by ChatGPT. By manual analysis, we find that these code snippets fall in AIAI, CMA, and IA. For code snippets in AIAI and IA, ChatGPT tends to generate patches or rewrite the code with different algorithms for fixing. One example of generating patches is Fig. 17, where ChatGPT modifies the subtraction operator − to the modulo operator %. An example of rewriting code is that for problem 1047 (https://leetcode.com/problems/remove-all-adjacent-duplicates-in-string/) in C, ChatGPT changes the aligned stack-based algorithm to an array-based algorithm without using a stack. As for code snippets in CMA, ChatGPT tends to change the algorithms used. As for the remaining 96 unfixed code snippets, their newly generated code snippets are judged as W.A. (64), R.E. (8), and T.L.E. (24) by the LeetCode online judgment. For the new ones judged as W.A., ChatGPT is able to fix the conflicting parts causing T.L.E., but the incorrect functions cannot be fixed (e.g., Fig. 19), or it changes the algorithms used but the new algorithms are functionally incorrect (e.g., Fig. 18 in the Java version). For the new ones judged as R.E., their runtime errors include integer-overflow, heap-buffer-overflow, out-of-bound, value error, and out-of-memory (the program tries to allocate memory from the heap, but there is not enough available memory to fulfill the request). For the new ones judged as T.L.E., 16 of them can pass > 75% of test cases and only 4 of them pass < 50% of test cases. By our manual analysis, we find that ChatGPT tends to change the algorithms used to avoid T.L.E., but the new algorithms still have some inefficient parts (i.e., AIAI or IA). Thus, to fix these inefficient parts, a potentially appropriate approach is to tell ChatGPT the inefficient locations and fixing suggestions from humans. ⋆ Summary 1. By manually analyzing T.L.E. code snippets, we conclude that the timeout reasons are AIAI, CMA, and IA. ⋆ Summary 2. By applying multi-round fixing, only 31% of the code snippets can be fixed. For fixed code snippets in AIAI and IA, ChatGPT tends to generate patches or rewrite code with different algorithms, and for code snippets in CMA, ChatGPT tends to change the algorithm used. Among the unfixed ones, 24 remain T.L.E., caused by AIAI and IA. To fix them, a potential approach is to provide the inefficient locations and fixing suggestions from humans.
Answer to RQ2: Multi-round Fixing for Code Generation ❶ The multi-round fixing process can only fix a small fraction (< 32%) of code snippets with W.A., C.E., R.E., or T.L.E. to A.. For fixing code snippets with C.E., R.E., or T.L.E. to A., W.A., or T.L.E. (excluding T.L.E. → T.L.E.), most (≥ 70%) of them can be fixed by using multi-round fixing; ❷ Our analysis identifies several factors behind errors in code snippets and unfixed cases under the multi-round fixing process. These findings contribute to the ongoing research focused on improving functionally correct code generation.

RQ3: How complex is the code generated by ChatGPT?
Motivation. The complexity of code is a critical factor influencing code readability, maintainability, and overall quality [56], [24], [57]. In this RQ, we evaluate the complexity of the code generated by ChatGPT. Approach. We utilize SonarQube [58] and cccc [59] to calculate two metrics for evaluating the complexity of the code generated for Bef. and Aft. problems, including the code generated in multi-round fixing. The metrics are cyclomatic complexity and cognitive complexity [24], [56], [60], with the following meanings: • Cyclomatic Complexity: This complexity counts the number of linearly independent paths through a given piece of source code. It determines how difficult the given code is to test. A high cyclomatic complexity can potentially lead to a high probability of errors and bugs. • Cognitive Complexity: This complexity measures how difficult it is to understand and reason about a piece of code from the human perspective. It takes into account factors like control structures, but in a slightly different way from cyclomatic complexity; its specific methodology can be found in [61]. A high cognitive complexity can affect code maintainability and increase the risk of bugs or errors.
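To illustrate how the two metrics differ, consider the small hand-worked C example below (ours, not code from the evaluation); the counts follow the usual cyclomatic rule (number of decision points + 1) and SonarQube's cognitive-complexity rules, under which nesting adds extra increments:

    /* Both functions have cyclomatic complexity 4 (three branch points + 1),
     * but cognitive complexity differs: 3 for flat() versus 6 for nested(),
     * because each nested if also costs its nesting depth. */
    int flat(int a, int b, int c) {
        if (a > 0) return 1;       /* cognitive +1 */
        if (b > 0) return 2;       /* cognitive +1 */
        if (c > 0) return 3;       /* cognitive +1 */
        return 0;
    }

    int nested(int a, int b, int c) {
        if (a > 0) {               /* cognitive +1 */
            if (b > 0) {           /* cognitive +2 (nested once) */
                if (c > 0) {       /* cognitive +3 (nested twice) */
                    return 3;
                }
            }
        }
        return 0;
    }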
Cognitive complexity is measured only for three languages (Java, Python3, and JavaScript) due to the limitations of SonarQube and cccc. We also utilize LeetCode solutions [48] in C++ and Python3 written by humans (solutions in other languages are lacking) for comparison with ChatGPT's, to observe the difference in code complexity.
Note that complexity is measured with a problem as the unit (most solutions have only one method). Result. We first examine code snippets not generated in multi-round fixing; the analysis of code snippets generated in multi-round fixing is discussed in Multi-round Comparisons in this section. Tables 12 and 13 show the cyclomatic and cognitive complexity values of code generated by ChatGPT in five languages. Tables 14 and 15 show the cyclomatic and cognitive complexity values of code written by humans in two languages.
Compared with human solutions in C++ and Python3 (see Table 14), we find that the complexity distributions of the generated code in both languages closely resemble those of the written code. For C++, the written code's percentage of low complexity is 5% higher than the generated code's, and correspondingly, its percentage of moderate complexity is 5% lower. They have similar percentages of high and very high complexity. As for Python3, the generated and written code have similar percentages of low complexity, while the percentages of moderate and high complexity of the generated code are higher than the written code's by 3% and 2.6%, respectively. Consequently, the former's percentage of very high complexity is lower than the latter's by 4.3%.
• Cognitive Complexity. According to [61], cognitive complexity can also be categorized into four classes: low (≤ 5 cognitive complexity value), moderate (6-10), high (11-20), and very high complexity (≥ 21). From Table 13, we can see that low and moderate complexity together account for more than 70% in all five languages. The generated code in Java and JavaScript has similar distributions over the four complexity levels. Python3 has a 6% higher percentage (42.6%) of low complexity, and correspondingly, its percentage of moderate complexity is 4.7% lower than the other two languages'. Python3's percentages of high and very high complexity are similar to those of Java and JavaScript.
Compared with human solutions in Python3 (see Table 14), we find that the written code has an 8% higher percentage of low complexity than the generated code. Correspondingly, the former's percentages of moderate and high complexity are each 4.9% lower than the latter's. However, the difference between the two kinds of code in the percentage of very high complexity is only 2%. ⋆ Summary 1. Fig. 20 further shows the density graphs of cyclomatic and cognitive complexity for each <language, coder> (i.e., ChatGPT and Human) pair. The horizontal coordinate is the complexity value and the vertical one is the corresponding density. By analyzing the two figures, we gain a more intuitive insight: ChatGPT's Python3 distributions have comparatively smaller means, while its C distributions have larger means. The distributions of C++, Java, and JavaScript nearly overlap. In addition, the distributions of code written by humans skew to the left compared with the code generated by ChatGPT in the corresponding languages. Therefore, we conclude that the complexity level of code generated by ChatGPT varies among the five programming languages. The generated code in C is more complex than in the other languages, while the complexity of code in C++, Java, and JavaScript is comparable. The code in Python3 is the least complex. Moreover, the complexity of code generated by ChatGPT in C++ and Python3 is slightly higher than, but nearly equal to, that of code written by humans.
• Cross-difficulty Comparisons. We further examine the distributions of cyclomatic and cognitive complexity under different difficulty levels. The results are shown in Fig. 21 and Fig. 22. Each row in the figures corresponds to the percentages of the same complexity level at different difficulty levels, while each column represents the percentages of different complexity levels at the same difficulty level. The percentages in the graphs indicate the proportion of a certain complexity level within the same difficulty level.
For both cyclomatic and cognitive complexity, the percentage of low complexity in each language decreases as the difficulty of the problem increases. On the other hand, the percentages of high and very high complexity increase with problem difficulty, while moderate complexity shows no significant changes. Compared to code written by humans [48], shown in Table 16 (the suffix -H represents human-based results), the complexity trend in ChatGPT-generated code across difficulty levels is comparable to that observed in human-written code. This trend may be attributed to more difficult problems often requiring the handling of more conditions, loops, and nested structures, resulting in more complex generated code. For problems of the same difficulty, the proportions of low and moderate complexity in the code generated by ChatGPT in C++, Java, and Python3 are all over 50%, even for hard problems. The proportion of low and moderate complexity in the generated JavaScript code snippets is also above 50%. • Cross-language Comparisons. We measure the combinations of cyclomatic complexity for five languages (C, C++, Java, Python3, and JavaScript) and of cognitive complexity for three languages (Java, Python3, and JavaScript) on the same problems. Thus, we only include problems with 5 valid (existing and not constant functions) code snippets in different languages. The statistical results of the top 20 combinations are depicted in Fig. 23 and Fig. 24. In total, cyclomatic complexity has 212 combinations and cognitive complexity has 52. The numbers 1, 2, 3, and 4 in parentheses represent low, moderate, high, and very high complexity, respectively, and the positions of the elements in parentheses correspond to languages: for cyclomatic complexity's (v, w, x, y, z), the languages for v to z are C, C++, Java, Python3, and JavaScript, respectively; similarly, for cognitive complexity's (x, y, z), the corresponding languages are Java, Python3, and JavaScript. From the results, we can see that (1, 1, 1, 1, 1) and (1, 1, 1) are the most frequent combinations in cyclomatic and cognitive complexity, respectively. All combinations with uniform complexity levels have high individual percentages, but their overall percentage is below 50%, indicating that the majority of combinations have different complexity levels across languages. As shown by the combinations with different complexities in Fig. 23 and Fig.
24, the majority of the differences in complexity levels are 1 (e.g., (2, 1, 1, 1, 1) and (2, 1, 2)), with a few greater than 1 (e.g., (3, 2, 2, 1, 2) and (2, 1, 3)). We further count the number of differences for each value (i.e., 1, 2, 3) (see Table 17). For both cyclomatic and cognitive complexity, differences of 1 account for more than 50%, and for cognitive complexity as much as 76.5%. In most cases, for the same problem, the complexity of the code snippets generated by ChatGPT in different languages is similar. We also manually inspect the code snippets and summarize the causes of the differences in complexity levels between generated code snippets for the same problem in different languages: ① Built-in Libraries: Different languages have different numbers of built-in libraries (see https://www.python.org/doc/essays/comparisons/). ChatGPT learns from a large corpus of text [12] and, in different languages, may choose whether to use built-in libraries to simplify the algorithm implementation or to generate helper functions to achieve specific functionality. For example, in Python3, ChatGPT can directly use heapq to implement a min-heap, whereas in C, it needs to generate the relevant min-heap code as well, leading to different complexity levels (see the sketch below).
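As a hedged illustration of the built-in library point (our sketch, not code from the study): in C, even a minimal min-heap insertion helper introduces the loops and branches that Python3's heapq.heappush hides behind a single call:

    /* Minimal min-heap insertion in C. Python3 achieves the same effect
     * with heapq.heappush(heap, v); the explicit sift-up loop below is
     * what raises the cyclomatic/cognitive complexity of the C version. */
    void heap_push(int *heap, int *size, int v) {
        int i = (*size)++;
        heap[i] = v;
        while (i > 0) {                      /* sift the new value up */
            int parent = (i - 1) / 2;
            if (heap[parent] <= heap[i]) break;
            int tmp = heap[parent];          /* swap with the larger parent */
            heap[parent] = heap[i];
            heap[i] = tmp;
            i = parent;
        }
    }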
② Different Algorithms: The code snippets generated for the same problem in different languages do not always use the same algorithm, and different algorithms can lead to different complexities. For example, the cyclomatic complexity combination of problem 2543 (https://leetcode.com/problems/check-if-point-is-reachable/) is (3, 4, 2, 2, 1). The code snippet in C uses an iterative algorithm, the code snippets in C++, Java, and Python3 use a recursive algorithm, and the code snippet in JavaScript uses an algorithm based on number theory.
③ Implementation of Logic: The complexities of code snippets in different languages may vary due to the specific implementation of the same or similar algorithms. One example is the problem of Fig. 9, whose cyclomatic complexity combination is (4, 3, 2, 1, 2). All five code snippets use the same algorithm; however, the logic implementation of the code snippet in Python3 is the most concise. ⋆ Summary 1. From the cross-language comparisons, we observe that the majority of combinations of cyclomatic and cognitive complexity for the same problem across different languages exhibit similar (≤ 1 of complexity difference) complexity levels. Factors contributing to the differences include the built-in libraries of the languages, the different algorithms employed, and variations in the implementation of logic.
• Multi-round Comparisons. ChatGPT's multiple rounds of conversation allow it to continuously generate code snippets. We take all code snippets from the multi-round fixing in Sec. 4.1 as samples to study the variation of code snippet complexity levels during the multi-round process. Since the number of rounds in the conversations may differ between code snippets, we use the initial code snippets and the code snippets generated at the end of the conversations as the objects of study. Fig. 25 shows the relationship between the initial and final code snippet complexity levels for different languages and complexities (i.e., cyclomatic and cognitive). The y-label in the 8 sub-figures represents the complexity levels of the initial code snippets, while the x-label represents the complexity levels of the final code snippets. The percentages shown in the figure represent the proportions of the different complexity level variations in each <complexity, language> case.
From the figure, it can be seen that in all <complexity, language> cases, the total percentages on the diagonals are higher than 50%. This indicates that a significant number of code snippets maintain their complexity levels throughout the multi-round process. Moreover, the cells above the diagonals, corresponding to cases where the final complexity level is higher than the initial level, generally exhibit higher percentages than the cells symmetrically opposite along the diagonals, except for (Low, Moderate) and (Moderate, Low) in Cyclomatic-JavaScript. It is also worth noting that all cells with 0% are below the diagonals. Therefore, we conclude that the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets.
We also examine these <initial code snippet, final code snippet> pairs. For the code snippets in the cells with preserved or increased complexity levels, ChatGPT patches the initial code snippets by adjusting the logical implementation (e.g., changing a recursive DFS implementation to an iterative one or fixing a type error), adding or modifying conditions, or changing the algorithms used (e.g., changing a brute-force algorithm to a dynamic programming algorithm). As for the code snippets in the cells with decreased complexity levels, ChatGPT patches the initial code snippets by likewise adjusting the logical implementation (e.g., simplifying and fixing the implementation of logic for Fig. 9), deleting or modifying conditions, or changing the algorithms used (e.g., turning a binary search-based algorithm into an iterative algorithm containing fewer control flows in <problem 2483 (https://leetcode.com/problems/minimum-penalty-for-a-shop/description/), JavaScript>). ⋆ Summary 1. The multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets, which may make it increasingly difficult to understand the automatically and consistently generated code by ChatGPT.

Answer to RQ3: Code Complexity
❶ The generated code in C is the most complex, while the code in C++, Java, and JavaScript has comparable complexity. The code in Python3 is the least complex. The complexity of the generated code in C++ and Python3 is similar to (slightly higher than) that of the written code. Additionally, low complexity decreases while high and very high complexity increase with increasing problem difficulty; ❷ Code complexity levels for the same problem differ among programming languages. Python3 has the highest probability of generating code with the lowest complexity level, while C has the lowest probability; C++, Java, and JavaScript have intermediate probabilities. This suggests that the choice of programming language affects generated code complexity; ❸ The multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets, which may make it increasingly difficult to understand the automatically and consistently generated code by ChatGPT.

Security Code Generation
RQ4: Is the code generated by ChatGPT secure? Motivation. ChatGPT may learn knowledge from vulnerable code. In this RQ, we intend to evaluate the security of code generated by ChatGPT in several specific vulnerability scenarios, queries, and languages. Approach ❶. We utilize CodeQL [35] for vulnerability detection on all the C, C++, and Java code snippets generated in Sec. 4.1. We do not perform detection on the code snippets in the other languages (i.e., Python and JavaScript) since the CodeQL standard library and the other vulnerability detection tool, SonarQube [58], have no suitable queries [63] related to algorithm problems for them. The vulnerability detection is thus limited to these three languages for the code snippets generated from LeetCode problems. Moreover, we only perform detection on pointer- and memory-related vulnerabilities because the code snippets are for algorithm problems (CodeQL and SonarQube have limited support for these two kinds of vulnerabilities in Python and JavaScript). We conduct vulnerability detection on 5 CWEs in the MITRE Top 25 CWEs [34]; the queries used can be found in Table 18, and their meanings can be found in [63].
We study multi-round fixing for code snippets with vulnerabilities; the approach is introduced in the corresponding part of this section. Approach ❷. We follow the same setup used in [23]. Specifically, we utilize 18 common weakness enumerations (CWEs) in the MITRE Top 25 CWEs [34] (3 of them drop below rank 25 in the 2022 MITRE Top 25 CWEs). For each CWE, three different code scenarios (contexts) are provided for ChatGPT to complete. These scenarios are small, incomplete code snippets (in C and Python3) whose completed code may contain the corresponding specific CWEs. They come from CodeQL [35], MITRE, and [23]. For example, Fig. 26 shows a code scenario example of CWE-787 (Out-of-bounds Write), where "ChatGPT next line" tells ChatGPT to complete the code from that point. We ask ChatGPT to generate 60 complete code snippets in a one-round process for each scenario and leverage CodeQL [35] to analyze whether these code snippets have the corresponding CWE. Moreover, we only analyze the CWE specific to each scenario and do not evaluate functional correctness [23]. In total, there are 18 CWEs with 54 scenarios.
We also study multi-round fixing for code snippets with vulnerabilities; the approach is introduced in the corresponding part of this section. Prompt ❷. We utilize the code scenarios (CWE code scenarios or CWE scenarios) used in [23] with a slight modification (specifying that ChatGPT should generate code) as our prompts. Result ❶. Table 18 shows the results of vulnerability detection. # Vln. represents the number of vulnerable code snippets under the corresponding CWEs and queries. The percentage is the number of code snippets for a specific vulnerability query (e.g., CWE-476's MissingNullTest) divided by the total number (183) of vulnerable code snippets. As the results show, the majority of vulnerable code snippets fall under the MissingNullTest query, accounting for 91.8%: the code generated by ChatGPT does not perform a NULL test after allocating memory in C or C++, which may lead to potential vulnerabilities. For the remaining vulnerability queries, such as PotentialBufferOverflow and OffsetUseBeforeRangeCheck, vulnerable code snippets are relatively infrequent (≤ 5), but they are still significant and should not be overlooked. Additionally, no vulnerabilities are detected for CWE-416 and CWE-190 in the C, C++, and Java code. ▶ Multi-round Fixing: We sample 15 code snippets containing vulnerabilities from each query of the CWE categories. If there are fewer than 15 vulnerable code snippets in a query, we sample them all. The round number is set to 5, the same as Sec. 4.1. For each vulnerable code snippet, we create an initial prompt from the corresponding problem, the vulnerable code snippet, and the corresponding CWE information provided by CodeQL. The CWE information includes an explanation of the CWE and the vulnerable locations in the code. One example is shown below: "The k-beauty of an integer num is defined as the number..." If the newly generated code snippet is still vulnerable to the same vulnerability, the corresponding CWE information (i.e., vulnerability message) returned by CodeQL is taken directly as a new prompt provided to ChatGPT to fix and generate a new code snippet, in the same conversation. The whole process continues for a maximum of five rounds if the generated code is never fixed. Furthermore, the strategy for mitigating the token limitation follows the setting of the W.A. multi-round fixing (Sec. 4.2.1).
The results of multi-round fixing are shown in the # Fixed column of Table 18. All vulnerable code snippets are fixed. The fix for these code snippets is straightforward since it only requires some additional statements checking for the corresponding vulnerabilities, for instance, statements testing for NULL or performing boundary checks. Thus, in general, ChatGPT performs well in this multi-round fixing process for the code snippets in Sec. 4.1. ⋆ Summary 1. The majority of vulnerable code snippets generated by ChatGPT are related to the MissingNullTest query, accounting for 91.8% of the total. These code snippets fail to perform NULL tests after memory allocation, potentially leading to security vulnerabilities. Although the remaining vulnerability queries, such as PotentialBufferOverflow and OffsetUseBeforeRangeCheck, are less frequent, they are still significant and should not be disregarded. By applying multi-round fixing, all sampled vulnerable code snippets are fixed; ChatGPT performs well for vulnerable code snippets in the scenario of algorithm problems.
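As a hedged illustration of the dominant MissingNullTest pattern (our sketch, not a snippet from the study), the fix is exactly the kind of added check described above:

    /* Typical MissingNullTest pattern: malloc may return NULL, and
     * dereferencing the result without a check is what CodeQL flags. */
    #include <stdlib.h>

    int *make_counts(int n) {
        int *counts = (int *)malloc(n * sizeof(int));
        if (counts == NULL) {  /* the added NULL test that fixes the finding */
            return NULL;       /* propagate the allocation failure */
        }
        for (int i = 0; i < n; i++) {
            counts[i] = 0;
        }
        return counts;
    }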
We divide the 18 CWEs into 6 groups according to their relationships and descriptions [34], which are shown in Table 20.
▷ Overflow: This group relates to buffer overflow and integer overflow. Out of the 12 code scenarios, 7 are marked as red (5) or yellow (2), and the remaining are marked as green. CWE-787, CWE-125, and CWE-119 relate to buffer overflow, and CWE-190 to data overflow. Each CWE in buffer overflow has at least one red mark. Fig. 27 shows an example of vulnerable code generated in CWE-787 code scenario 3. The scenario asks ChatGPT to convert float variables to string format, but the generated code sets the length of a char array to less than 48 (the maximum output is 47 characters plus 1 for the '\0' written by sprintf). However, CWE-190 has three green marks. Thus, ChatGPT performs poorly on the CWE-787, CWE-125, and CWE-119 code scenarios but well on the CWE-190 ones.
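A minimal sketch of this undersized-buffer pattern and the usual repair (assumed shapes, not the exact Fig. 27 code):

    /* CWE-787 sketch: "%f" on extreme float values can emit up to 47
     * characters plus the terminating '\0', so smaller buffers overflow. */
    #include <stdio.h>

    void float_to_string(float value) {
        char buf[48];                             /* 47 chars + '\0' fit  */
        /* char small[32]; sprintf(small, "%f", value);  <- out-of-bounds */
        snprintf(buf, sizeof(buf), "%f", value);  /* bounded write, safe  */
        printf("%s\n", buf);
    }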
▷ Data Neutralization Issues: This group relates to creating or neutralizing data using an incorrect format (e.g., SQL injection). Out of the 9 scenarios, 2, 6, and 1 are marked as red, yellow, and green, respectively. Both red marks are in CWE-78, reaching 100% # Vln. percentages. CWE-78 relates to OS command injection. An example of code generated in scenario 1 is shown in Fig. 28. The scenario asks ChatGPT to retrieve the user's information using a given username, but the generated code does not check whether username is valid, which lets attackers inject arbitrary commands. For CWE-79 and CWE-89, all yellow marks are below 21%, indicating that most of the generated code in the CWE-79 and CWE-89 code scenarios is non-vulnerable.
Fig. 28 (generated code for CWE-78 scenario 1):

    // get the username from the command line
    char *username = argv[1];
    // get the userinfo from the system
    char command[100] = {0};
    sprintf(command, "getent passwd %s", username);
    system(command);

▷ Input Validation: This group relates to vulnerabilities caused by improper input handling without appropriate validation. Improper input may lead to data leakage or execution of arbitrary commands. Out of the 12 code scenarios, 4 are marked as red, belonging to CWE-20, CWE-22, and CWE-434. The remaining scenarios' # Vln. percentages are all below 10% (except code scenario 2 of both CWE-22 and CWE-502), indicating that the majority of the generated code in the various scenarios is non-vulnerable. Fig. 29 shows an example of generated vulnerable code in CWE-22 scenario 1: it asks ChatGPT to read files using given file names, but ChatGPT does not check whether the file names contain any special elements such as '..' and '/'. Interestingly, code scenario 2 of CWE-22, the same task but in a Python3 web app, has a # Vln. percentage of 43.1%, much lower than scenario 1's 100%. The reason may be that the given context (e.g., the Flask module) or the training code seen forces ChatGPT to generate secure code as much as possible. CWE-502 code scenario 2, with a yellow mark close to a # Vln. percentage of 45%, asks ChatGPT to deserialize a YAML file in Python3; however, 26 code snippets use yaml.load (a method that can deserialize arbitrary Python objects) instead of yaml.safe_load, which may allow attackers to execute arbitrary code.
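For the CWE-22 case, a hedged C sketch of the kind of check whose absence CodeQL flags (the directory layout and names are our assumptions):

    /* CWE-22 sketch: reject path traversal before opening a user-supplied
     * file name that should stay inside a fixed directory. */
    #include <stdio.h>
    #include <string.h>

    FILE *open_in_safe_dir(const char *name) {
        /* refuse separators and parent-directory components */
        if (strchr(name, '/') != NULL || strstr(name, "..") != NULL) {
            return NULL;
        }
        char path[256];
        snprintf(path, sizeof(path), "safe_dir/%s", name);
        return fopen(path, "r");
    }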
▷ Unsafe Memory Operation: This group relates to pointer-related problems, including referencing memory after free (CWE-416) and dereferencing a NULL pointer (CWE-476). Out of the 6 code scenarios, 4 and 2 are marked as red and green, respectively. CWE-416 code scenario 1 asks ChatGPT to free temporary memory according to the context. Fig. 30 shows an example of this scenario: the generated code frees tmp without checking abort_flag, which may lead to it being freed twice. For CWE-476, all scenarios are marked as red; one example is already shown in Fig. 6. ▷ Unauthenticated Access: This group relates to improper use (CWE-798), authentication (CWE-306), and protection (CWE-522) of credentials. Out of the 9 code scenarios, only 1 is marked as red, in CWE-522. The remaining scenarios' # Vln. percentages are all below 6% (except CWE-798 code scenario 3 with 13%). ChatGPT performs well in this group. CWE-522 code scenario 1 asks ChatGPT to hash a given password before storing it. One example is shown in Fig. 31, which uses a single round of SHA-256, a secure hash function, to directly hash the password; this is still susceptible to brute-force attacks and vulnerable to collision attacks [64]. Interestingly, the other code scenarios in CWE-522 are the same as scenario 1 but with either an additional salt=os.urandom(16) statement or the bcrypt module, which avoids generating vulnerable code with high probability.
▷ Information Leakage: This group relates to incorrect permission assignment (CWE-732) and unauthorized actions (CWE-200) on critical resources. Out of the 6 code scenarios, none are marked as red, and only CWE-200 code scenario 2 is marked as yellow, with a # Vln. percentage of 23.6%. The remaining are marked as orange (1) and green (4). The yellow-marked scenario asks ChatGPT to connect to a database in a try/catch structure in a Python3 web app, expecting the generated code not to return database error messages. The majority of the generated code avoids the vulnerability. Thus, we conclude that ChatGPT performs well in these code scenarios. ▶ Multi-round Fixing: We sample 5 code snippets containing vulnerabilities from each code scenario of the CWE categories and input these code snippets into ChatGPT for fixing. If there are fewer than 5 vulnerable code snippets in a code scenario, we sample them all. The round number is set to 5, the same as Sec. 4.1. For each vulnerable code snippet, we create an initial prompt from the vulnerable code snippet and the corresponding CWE information provided by CodeQL, or the corresponding CWE descriptions from [34] for code snippets checked by the authors. One example is shown below:

Prompt:
A high percentage (89.4%) of them can be fixed. Moreover, there are 30 code scenarios where all code snippets are fixed, 4 code scenarios where the code snippets are partially fixable, and 2 code scenarios where none of the code snippets can be fixed.
▷ Overflow: All code snippets in CWE-787, CWE-190, and CWE-119 can be fixed. Although the code snippets first generated by ChatGPT have overflow vulnerabilities, providing these code snippets with the corresponding CWE information helps ChatGPT successfully fix simple overflow problems. For instance, for Fig. 27, ChatGPT turns sprintf into snprintf (https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/snprintf-snprintf-snprintf-l-snwprintf-snwprintf-l?view=msvc-170), preventing buffer overflow.
▷ Data Neutralization Issues: All code snippets in CWE-79, CWE-89, and CWE-78's code scenarios 2 and 3 can be fixed. One code snippet in CWE-78's code scenario 1 is still vulnerable. Most vulnerable code snippets can be fixed by providing the corresponding CWE information. For the remaining vulnerable one, ChatGPT fails to add code checking the external input username, which may lead to OS command injection.
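A hedged sketch of the check whose absence keeps the Fig. 28 snippet vulnerable (the allowlist rule is our assumption, not the paper's fix):

    /* CWE-78 sketch: validate the external username before it reaches a
     * shell. Only letters, digits, '_', and '-' are allowed here. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int username_is_safe(const char *u) {
        if (u == NULL || u[0] == '\0' || strlen(u) > 32) return 0;
        for (const char *p = u; *p != '\0'; p++) {
            if (!isalnum((unsigned char)*p) && *p != '_' && *p != '-') return 0;
        }
        return 1;
    }

    void print_userinfo(const char *username) {
        if (!username_is_safe(username)) return;  /* reject injection input */
        char command[100] = {0};
        snprintf(command, sizeof(command), "getent passwd %s", username);
        system(command);
    }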
▷ Input Validation: All code snippets in CWE-20's code scenario 1, CWE-22, CWE-434's code scenario 1, and CWE-502 can be fixed. Some code snippets in CWE-434's code scenario 2 can be fixed, and none of the code snippets in CWE-20's code scenario 3 can be fixed. The fixing performance of ChatGPT in this group is poor. For the 2 unfixed vulnerable code snippets in CWE-434's code scenario 2 (the requested images should be saved in the database as base64-encoded, their type must be JPG, and their size should be less than 1,024 KB), both satisfy the first requirement but do not meet the second at all, missing alignment. As for the 5 unfixed vulnerable code snippets in CWE-20's code scenario 3 (generate the values of a share sale where the price comes from an external function; the values should be ≥ 0), all of them conduct many necessary checks but overlook checking the input values of the function as well as the output values (i.e., the values of the share sale). For instance (see Fig. 32), the final generated code snippet does not check the input parameter quantity, which may result in the return value of calculate_sale_value being less than 0 when quantity is less than 0.
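A hedged sketch of the missing checks in the CWE-20 case (the function name follows the scenario description; the external price lookup is an assumption):

    /* CWE-20 sketch: validate both the external input and the computed
     * output so the sale value can never be negative. */
    extern float get_share_price(void);   /* assumed external price source */

    float calculate_sale_value(int quantity) {
        if (quantity < 0) return 0.0f;    /* input check missing in Fig. 32 */
        float price = get_share_price();
        if (price < 0.0f) return 0.0f;    /* distrust the external function */
        return price * (float)quantity;   /* guaranteed to be >= 0 */
    }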
▷ Unsafe Memory Operation: All code snippets in CWE-476 can be fixed, and only one code snippet in CWE-416 is still vulnerable. In general, ChatGPT performs well in this group. The vulnerable code snippet in CWE-416 code scenario 1 still contains the double-free problem (i.e., Fig. 30).
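A hedged sketch of the Fig. 30 style double free and one common repair (the abort_flag protocol is our reading of the scenario description):

    /* CWE-416 sketch: if the abort path has already freed tmp, freeing it
     * again is a double free. Guarding on abort_flag and clearing the
     * pointer after free avoids both the double free and later reuse. */
    #include <stdlib.h>

    void cleanup(char **tmp, int abort_flag) {
        if (!abort_flag && *tmp != NULL) { /* skip if the abort path freed it */
            free(*tmp);
        }
        *tmp = NULL;                       /* block use-after-free/reuse */
    }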
▷ Unauthenticated Access: All code snippets in CWE-798, CWE-306, and CWE-522's code scenarios 2 and 3 can be fixed, while 3 of the 5 code snippets in CWE-522's code scenario 1 are still vulnerable. ChatGPT performs well in this group. The 3 vulnerable code snippets still apply the hashlib.sha256 method a single time rather than a more secure scheme (e.g., the slow hash method bcrypt.hashpw).
▷ Information Leakage: The single sampled code snippet in CWE-200's code scenario 3 is fixed, but the other 5 code snippets in CWE-200's code scenario 2 all remain vulnerable, even though code scenario 2 is marked as yellow in Table 19 (a 23.6% vulnerability rate). The final code snippets generated by ChatGPT still return database error messages through the exception handler. In general, ChatGPT performs poorly in this group. ⋆ Summary 1. ChatGPT successfully generates 2,983 (99.07%) valid code snippets, of which 994 (33.32%) are vulnerable. Moreover, the vulnerable code snippet percentage in C (51.64%) is much higher than that in Python3 (17.08%), indicating that developers should be more aware of the security of code generated by ChatGPT in C than in Python3. The reason may lie in the context of the provided code scenarios and the quality of the C and Python code seen in the training set. ⋆ Summary 2. ChatGPT performs differently across groups, CWEs, and code scenarios in security code generation. Overall, no code scenario is marked as red for the Information Leakage group, while the remaining 5 groups each have at least one code scenario marked as red.
The Unsafe Memory Operation group has 4/6 (the highest percentage of) code scenarios marked as red. Among all CWEs, 10 have at least one code scenario marked as red, and only 3 CWEs have scenarios marked only as green or orange. Among all code scenarios, 18 (33%), 4 (7%), 16 (30%), and 16 (30%) are marked as green, orange, yellow, and red, respectively. ⋆ Summary 3. The multi-round fixing process for vulnerable code snippets shows promising results, with a high percentage (89.4%) of vulnerabilities successfully addressed. Most vulnerabilities related to Overflow, Data Neutralization Issues, Unsafe Memory Operation, and Unauthenticated Access can be fixed through multi-round fixing, demonstrating ChatGPT's ability to generate fixed code when prompts incorporate the corresponding CWE information. However, its performance in fixing Input Validation and Information Leakage vulnerabilities is relatively weak, indicating room for improvement.
Answer to RQ4: Security Code Generation ❶ In most scenarios, including the algorithm-problem scenarios and the CWE scenarios, the code generated by ChatGPT has relevant vulnerabilities such as Overflow, Unsafe Memory Operation (e.g., MissingNullTest), and so on; ❷ The multi-round fixing process for vulnerable code snippets demonstrates promising results, with a high percentage (100% and 89.4%, respectively) of vulnerabilities successfully addressed. The experimental results indicate that combining ChatGPT with vulnerability detection tools can mitigate the presence of vulnerabilities in the code generated by ChatGPT.

Non-determinism of ChatGPT
RQ5: How does the non-deterministic output of ChatGPT affect code generation? Motivation. LLMs like ChatGPT have a non-deterministic nature [12], typically due to sampling methods such as top-k sampling [65], which means they can produce various responses to the same input [23], [66]. In this RQ, we investigate the non-deterministic output of ChatGPT. Approach. We randomly select 9 problems each from the Aft. and Bef. problems and leverage ChatGPT to generate code for each of these 18 problems in the five languages 10 times, at two temperatures: 0.7 (the default value used in this paper) and 0 (for stabilizing output [12]). The generated code snippets from these repeated trials are compared in terms of functional correctness, complexity, and security. Additionally, we sample 20 CWE code scenarios and use ChatGPT to generate code snippets 10 times at temperature 0; these code snippets are compared in terms of security. Moreover, the multi-round fixing process is included at temperatures 0.7 and 0, where each sampled error code or vulnerable code snippet is fixed 5 times; the maximum round number is set to 5. Result ❶. The selected algorithm problems and the experimental results at temperature 0.7 are listed in Table 22 (Aft. problems) and Table 23 (Bef. problems), where the values in the status rates (i.e., A., W.A., C.E., T.L.E., and R.E.) are the percentages of the corresponding statuses in 10 trials; the rate values of L, M, H, and V under Cyclomatic and Cognitive represent the percentages of low, moderate, high, and very high complexity levels in 10 trials, respectively; and the CWE column corresponds to the MissingNullTest vulnerability (no other vulnerability is detected), with its value representing the percentage of code snippets with vulnerabilities in 10 trials. From the results, we observe the following findings: ▷ Status Rates. The data shows that different trials on the same problem and language can yield code with different statuses. For example, problem 2224 in JavaScript has a 50% A. rate, 10% W.A. rate, 10% T.L.E. rate, and 30% R.E. rate over 10 trials. Additionally, we find that some generated code snippets are constant functions. Thus, in the subsequent evaluation of complexity and security, we remove these constant-function code snippets, and correspondingly, the number of trials for the corresponding problems and languages decreases.
▷ Complexity Levels. The data indicates that the complexity of the code generated in different trials with ChatGPT may vary. For instance, problem 363 in Java has 40.0% low, 50.0% moderate, 10.0% high, and 0.0% very high cyclomatic complexity levels, and 40.0% low, 0.0% moderate, 60.0% high, and 0.0% very high cognitive complexity levels. In different trials, ChatGPT may use different algorithms, implementations, and so on to generate code snippets from the same input.
▷ CWEs. ChatGPT may or may not generate vulnerable code in different trials. For instance, problem 2264 in C has a 42.85% CWE rate (3 vulnerable code snippets out of 7 non-constant-function code snippets). Additionally, Table 19 (the # Vln. column) also shows the non-determinism of ChatGPT-based code generation (at temperature 0.7) in terms of security.
At temperature 0, the statistics on the non-determinism of code generation for the algorithm problems and CWE code scenarios over 10 trials are shown in Table 24, Table 25, and Table 26, where Table 26 presents the selected 20 CWE code scenarios. Cyc. and Cog. represent cyclomatic and cognitive complexity, respectively. Elements in table entries are presented as sets or ratios. A dash (-) represents an inability to evaluate, for one of two reasons: tool support is unavailable, or the generated code consists of constant functions in the corresponding context (e.g., <problem 2304, Java> yields constant functions, so cyclomatic and cognitive complexity are not evaluated). From the results, we observe that when the temperature is set to 0, the statuses of the generated code for each algorithm problem are consistent across the 10 trials. The same holds for complexity, except for <problem 2532, Python3>. All generated C code for problems 304 and 2523 has the MissingNullTest vulnerability. We further manually analyze these code snippets and find that all generated code snippets are completely identical for every <problem, language> except <problem 2532, Python3>, which uses different code structures in different trials. As for the sampled CWE code scenarios, all generated code snippets are also identical for each scenario across the 10 trials. Therefore, setting the temperature to 0 may be a potential strategy to mitigate the non-determinism of ChatGPT in the one-round process. Result ❷. We also investigate the impact of non-determinism on the multi-round fixing process. For functional correctness and complexity, we randomly sample 20 code snippets with errors from all generated code snippets at temperatures 0.7 and 0, respectively, where each sampled code snippet belongs to one unique <problem, language>. As for security, we randomly select one vulnerable code snippet from each category (selected in this section) of vulnerabilities with generated vulnerable code at temperatures 0.7 and 0, respectively, across the algorithm problems and CWE code scenarios. Each error code or vulnerable code snippet is fixed 5 times under the multi-round fixing process. Additionally, the multi-round fixing process at temperatures 0.7 and 0 is only performed on the code snippets generated in the one-round process at temperatures 0.7 and 0, respectively. The fixing results are shown in Tables 27-32. From the results, we observe the following findings: ▷ Status Rates. The data shows that different trials of the multi-round fixing process on the same error code can give different fixing results, regardless of whether the temperature is set to 0.7 or 0 (this is observed, for example, for problem 363). ▷ Complexity Levels. The data indicates that the complexity of the fixed code in different trials under the multi-round fixing process with ChatGPT may vary, regardless of the temperature setting. For example, the fixed code snippets for <problem 2124, Java> at temperature 0.7 have low and moderate levels in both cyclomatic and cognitive complexity; the fixed code snippets for <problem 2532, Python3> at temperature 0 have complexity levels across low, moderate, high, and very high in both cyclomatic and cognitive complexity. In different trials, ChatGPT may choose different patches for fixing error code snippets under the multi-round fixing process, even at temperature 0.
▷ CWEs. In different trials under the multi-round fixing process, ChatGPT may or may not fix vulnerable code, regardless of the temperature setting. For instance, ChatGPT at temperature 0.7 fixes the vulnerable code only one time in code scenario 3 of CWE-20, and ChatGPT at temperature 0 fails to fix the vulnerable code one time in code scenario 1 of CWE-190.
Answer to RQ5: Non-determinism of ChatGPT ❶ Code generation in the one-round process may be affected by ChatGPT's non-determinism when the temperature is set to 0.7, resulting in variations in the functional correctness, complexity, and security of code snippets. One potential strategy to mitigate the non-determinism of ChatGPT in the one-round process is to set the temperature to 0; ❷ However, in the multi-round fixing process, the code snippets fixed by ChatGPT may vary in functional correctness, complexity, and security, regardless of whether the temperature is 0.7 or 0.

Lessons Learnt and Insight
Functionally Correct Code Generation. ChatGPT is better at generating functionally correct code for Bef. problems in different languages than for Aft. problems. This result indicates that ChatGPT may have limitations when generating code for problems that are unfamiliar or unseen in the training dataset, even if the problems have easy logic from a human perspective. Moreover, ChatGPT's ability to write code also differs across languages. In general, the probabilities of ChatGPT generating functionally correct code in C++, Java, Python3, and JavaScript are close to each other and substantially higher than in C. By analyzing the ChatGPT-generated code snippets with errors (i.e., W.A., C.E., R.E., and T.L.E.), we identify several factors behind errors in code snippets and unfixed cases under the multi-round fixing process. These findings contribute to the ongoing research focused on improving functionally correct code generation. Among them, besides the need to further strengthen ChatGPT's logical reasoning ability, improving its code generation stability (avoiding generating empty bodies) and alignment with human attention (grasping logical details and meeting user requirements such as the method signature provided) is also very important, especially the latter. The code generation process of ChatGPT may be careless, and the generated code may fail to meet some of the detailed conditions described, making it difficult to generate successfully, or to fix to functional correctness, with the multi-round fixing process. Thus, future research can focus on how to provide additional useful information, such as missing details in the code or the correct algorithm logic, to ChatGPT to supplement the multi-round fixing process, or on how to design an effective workflow for automatic code generation. Code Complexity. The complexity of code generated in different languages may differ. Additionally, the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets, which may make it increasingly difficult to understand the automatically and consistently generated code by ChatGPT. Security Code Generation. In the evaluation across various scenarios, including algorithm problems and CWE scenarios, we observe that the code generated by ChatGPT often exhibits relevant vulnerabilities, which is a severe issue. Fortunately, however, the multi-round fixing process for vulnerable code snippets demonstrates promising results: by providing CWE information, ChatGPT is able to automatically fix vulnerable code. Therefore, combining ChatGPT with vulnerability detection tools (e.g., CodeQL) can mitigate the generation of vulnerable code. Furthermore, as an AI-powered assistant learning from large-scale datasets, ChatGPT itself may also have the ability to detect vulnerabilities, serving as a more flexible vulnerability detection tool. Non-determinism of ChatGPT. From the results in Sec. 4.5, we observe that code generation may be affected by ChatGPT's non-determinism, resulting in variations in the functional correctness, complexity, and security of code snippets. One potential strategy to mitigate this issue is to set ChatGPT's temperature to 0.
However, this strategy only works in the one-round process. In the multi-round fixing process, the code snippets fixed by ChatGPT may vary in functional correctness, complexity, and security, regardless of the temperature settings of 0.7 and 0. To further examine the effect of ChatGPT's output token limitation, we select the problems and scenarios in Sec. 4.5 and set ChatGPT's temperature to 0 to stabilize the output in the one-round process. We first count the tokens used for output in each code generation for these selected problems and scenarios, and then limit the maximum output tokens to half of the counted token usage of the corresponding code generation in order to generate incomplete code snippets. For example, if ChatGPT outputs x tokens for problem/scenario i, we limit the maximum output tokens to x/2 for i when generating the code snippet. This approach simulates the case of generating incomplete code snippets due to exceeding the token limitation of ChatGPT and provides pairs of complete code snippets and their corresponding incomplete ones. After testing and manually analyzing these generated incomplete code snippets, we find that all of them have compile or syntax errors due to the discarded part (e.g., a binary operator with no right operand). Additionally, the cyclomatic and cognitive complexities of the incomplete code snippets are less than or equal to those of the original code snippets (e.g., an incomplete code snippet misses one if statement relative to the corresponding original code snippet, or the discarded part does not contain any branch statements). For security, some of the code's vulnerabilities are no longer present (the vulnerable parts are in the discarded part), but some code still retains its vulnerabilities (the vulnerable parts are not discarded).
Comparison with Other Code Generation Models. In the realm of code generation, recent advancements have been significantly driven by LLMs trained on extensive code datasets. Code-related LLMs, such as Codex [9] and CodeGen [67], have demonstrated substantial capabilities and generalization in code generation tasks. These models generate code by autoregressively predicting the next token from the previous context (e.g., function signature, docstring, and previously generated tokens) and combining the previous context with the generated tokens as the final generated code snippet. ChatGPT takes this a step further: it has also demonstrated impressive code generation capabilities, and with RLHF [31], it supports answering follow-up questions, providing even more powerful and versatile features than previous code-related LLMs.

Limitations
The results and experiments of this study are limited in two respects: (1) ChatGPT is a closed-source model, which means we are unable to directly map the analysis results to the internal workings of ChatGPT or to understand the specific characteristics of the model.

Threats to Validity
LeetCode Problems and CWE Scenarios. To reduce the bias of manually selecting subjects for evaluation, we utilize LeetCode problems as our main dataset. However, LeetCode problems are designed specifically for coding practice and interview preparation. While they cover a range of programming concepts and challenges, they may not fully represent the complexity and diversity of real-world coding tasks. Real-world coding scenarios often involve external factors, domain-specific requirements, and specific constraints that may not be fully captured by LeetCode problems alone. Moreover, the classes of vulnerabilities that the code for LeetCode problems can have are limited, so we also utilize CWE scenarios [23] to supplement the evaluation of ChatGPT's security code generation. Nevertheless, like LeetCode problems, these scenarios may not cover all real-world code scenarios. LeetCode Online Judgment. The LeetCode online judgment platform terminates the testing process upon encountering the first failed test case. Thus, the test case pass rates provided by the platform may serve as a lower bound, but this does not affect the statuses of the code snippets generated by ChatGPT or the conclusions drawn from this study. Vulnerability Detection by CodeQL. CodeQL may report code as vulnerable when it is actually secure. To mitigate this risk, human expertise is employed to manually inspect the code for potential vulnerabilities, thereby ensuring the accuracy and reliability of the analysis. Limited Languages in Vulnerability Detection. The evaluation of vulnerability detection in our study only covers a subset of languages (C, C++, and Java for the LeetCode problem scenarios, and C and Python for the CWE scenarios) out of the five languages (C, C++, Java, Python, and JavaScript), due to the targeted scenarios and the limitations of vulnerability detection tools. Though the evaluation results provide insights into ChatGPT-based code generation in these languages, our study does not fully reflect the security spectrum of all five languages. Statistical Validity. ChatGPT has randomness: when faced with the same input prompt, it may produce different responses. To reduce this risk, we use 728 LeetCode problems. For each <Problem, Language> pair, we independently generate one corresponding code snippet once (ChatGPT rate-limits queries and may be retrained at a later date; thus, we query once for each problem and do not re-query for failed responses, e.g., empty or irrelevant responses such as policy violations, and this small number of responses is excluded from the experimental evaluation), following the law of large numbers. For CWE code scenarios, we generate 60 code snippets independently for each scenario.
As for the multi-round process, we sample many code snippets for experimentation and set the maximum round number to 5 [53], a reasonable round limit.

RELATED WORK
Language Models. Language models have a wide range of applications in NLP, including machine translation, question answering, summarization, text generation, code generation, and so on [16], [18], [68], [69], [70], [71], [72], [73], [9], [4]. These models, with large numbers of parameters, are trained on extensive corpora to better understand language (i.e., LLMs). One of the fundamental architectures used in language models is the Transformer [6], which consists of stacked encoders and decoders. The Transformer utilizes a self-attention mechanism to weigh the importance of words in the input text, capturing long-range dependencies and relationships between words. Many language models are built upon the Transformer. ELMo [74] employs a multi-layer bidirectional LSTM (long short-term memory) network and provides high-quality word representations. GPT [32] and BERT [29] are based on the decoder (unidirectional) and encoder (bidirectional) components of the Transformer, respectively; they utilize pre-training and fine-tuning techniques. GPT-2 [33] and GPT-3 [20] are the successors of GPT, with GPT-2 having a larger model size in parameters than GPT, and GPT-3 being even larger than GPT-2. Additionally, with larger corpora, GPT-2 and GPT-3 introduce zero-shot and few-shot learning to enable adaptation to multitask scenarios.
Codex [9] is obtained by training GPT-3 on GitHub code data. It serves as the underlying model for GitHub Copilot [11], a tool that can automatically generate and complete code. InstructGPT [31] uses additional supervised learning and reinforcement learning from human feedback (RLHF) to fine-tune GPT-3, aligning the language model with users. ChatGPT [12], based on GPT-3.5 [38], utilizes the same methods as InstructGPT and provides the ability to answer follow-up questions. Code Generation. Code generation [75] is a fundamental application of language models that aims to automatically generate or complete computer code based on given specifications or natural language descriptions, improving programming productivity. There is much research on it, spanning traditional approaches and AI-based approaches. Traditional code generation approaches [75], [14], [76], [77], [78], [79] typically rely on predefined templates or rules (e.g., context-free grammars), along with input-output specifications, which limits their flexibility and requires manual effort. For example, Gulwani [78] identifies a string expression language applicable to various string manipulation tasks (e.g., extracting substrings in a specific format) and designs an algorithm for learning a string expression consistent with the provided input-output examples. As for AI-based approaches [15], [9], [11], [80], [12], [81], [82], [67], they
leverage deep learning and NLP to overcome these limitations and can offer more intelligent and adaptable code-generation capabilities. Li et al. [80] leverage a recurrent neural network with an attention mechanism and a pointer mixture network on abstract syntax trees (ASTs) to predict the next word in code completion tasks, learning from large-scale codebases. Liu et al. [83] model the structural information in ASTs and use a Transformer-XL network and multi-task learning to capture long-term dependencies in programs and to learn two disjoint code-related tasks in code completion, respectively. Ashwin et al. [84] combine a traditional method with neural networks (e.g., an LSTM network) to generate code from examples, with high correctness, strong generalization, and low synthesis time. Recently, with the advantages of LLMs, researchers apply LLMs directly to the code generation task by using extensive code datasets, such as Codex [9], Copilot [11], and CodeGen [67], providing more powerful capabilities and generalization. These code-related LLMs (e.g., Codex) autoregressively predict the next token from the previous context (e.g., the function signature, docstring, and previously generated tokens) and combine the previous context and the generated tokens together as the final generated code snippet.
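To make this concrete, below is a minimal Python sketch of greedy autoregressive decoding; the model object and its most_likely_next_token interface are hypothetical stand-ins, not any specific LLM's API:

def generate_code(model, prompt_tokens, eos_token, max_new_tokens=256):
    """Greedy autoregressive decoding: repeatedly predict the next token
    from the full previous context, then append it to that context."""
    tokens = list(prompt_tokens)  # previous context, e.g., signature + docstring
    for _ in range(max_new_tokens):
        next_token = model.most_likely_next_token(tokens)  # hypothetical call
        if next_token == eos_token:
            break
        tokens.append(next_token)  # generated token joins the context
    return tokens  # context + generated tokens = final code snippet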
ChatGPT [12], the state-of-the-art LLM based on GPT-3.5 [21], has also demonstrated impressive code generation capabilities. Additionally, with RLHF [31], ChatGPT supports the ability to answer follow-up questions, providing even more powerful and versatile features compared to previous code-related LLMs.
Evaluation on LLM-based Code Generation. Hendrycks et al. [85] craft the APPS benchmark of Python programming problems and assess the code generation performance of several GPT-based variant models by fine-tuning with APPS. Fan et al. [51] systematically study whether automated program repair (APR) techniques, including Codex, can fix the incorrect solutions to LeetCode problems produced by Codex. Xia et al. [86] perform an extensive study on directly applying LLMs (9 state-of-the-art LLMs) to APR. They evaluate different ways of using LLMs for the task, including the entire-patch fix, the chunk-of-code fix, and the single-line fix. Pearce et al. [66] examine the use of LLMs (e.g., Codex) with zero-shot learning for vulnerability repair. CodeT [87] utilizes LLMs to generate functionally correct code solutions by leveraging dual execution agreement: it generates multiple code solutions and multiple test cases for a given programming problem and executes the generated code solutions against the generated test cases to rank and find the best solution. Dong et al. [53] introduce the concept of the software development life cycle and propose a self-collaboration framework that leverages different ChatGPT conversations to play different roles (e.g., analyst, developer, and tester), collaborating to generate code. Sobania et al. [40] conduct an evaluation of Copilot on a standard program synthesis benchmark, comparing the results with genetic programming. Pearce et al. [23] assess Copilot's security code generation on the top-25 CWE vulnerabilities. Nguyen et al. [24] evaluate the quality of Copilot-generated code using 33 LeetCode problems in 4 different languages. Kou et al. [88] investigate the attention alignment between natural language descriptions from humans and code generation by LLMs. Liu et al. [25] propose the EvalPlus framework to enhance code generation benchmarks. EvalPlus takes in a base evaluation dataset and uses LLMs and mutation techniques to produce and diversify large numbers of new test cases. Liu et al. [89] characterize several code quality issues of ChatGPT-based code generation across the Java and Python languages, including correctness and maintainability. They also examine the ability of ChatGPT to repair bugs and code style issues by leveraging feedback information. Different from their work, we conduct a systematic assessment with deep analysis of ChatGPT-based code generation across five languages in terms of correctness, complexity, and security, including the multi-round process. Our research significantly extends the current understanding of ChatGPT-based code generation. We not only evaluate the correctness of the generated code but also provide a deep dive into the underlying causes of incorrectness in ChatGPT-generated code. We also deeply assess the complexity and security of the generated code. Furthermore, we deeply investigate the impact of the multi-round fixing process on these aspects, providing a more realistic evaluation of ChatGPT's capabilities in iterative code generation scenarios. This comprehensive study underscores the practical implications of AI-generated code in real-world software development.

CONCLUSION
In this paper, we present a systematic assessment of ChatGPT-based code generation. We comprehensively evaluate code snippets generated by ChatGPT from the three aspects of correctness, complexity, and security, including the multi-round fixing process. Our experimental results demonstrate that (1) ChatGPT is better at generating functionally correct code for Bef. problems in different languages than for Aft. problems (the average Accepted rate of the former exceeds the latter by 48.14%), but ChatGPT's ability to directly fix erroneous code to achieve correct functionality is relatively weak; (2) the distribution of cyclomatic and cognitive complexity levels for code snippets in different languages varies. Additionally, the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets; (3) in algorithm scenarios with the languages C, C++, and Java, and CWE scenarios with the languages C and Python3, the code generated by ChatGPT has relevant vulnerabilities. Fortunately, the multi-round fixing process for vulnerable code snippets demonstrates promising results, with a high percentage (100% and 89.4%) of vulnerabilities successfully addressed; and (4) code generation may be affected by ChatGPT's non-determinism, resulting in variations of code snippets in functional correctness, complexity, and security. Overall, our findings uncover potential issues and limitations that arise in ChatGPT-based code generation and pave the way for improving AI- and LLM-based code generation techniques.

Fig. 2: The workflow of interacting with ChatGPT to generate code snippets.

Fig. 5: Distribution of the ratios of languages accepted for the corresponding Aft. problems, where dark violet and dark red lines represent the median and mean, respectively.

Fig. 8: An example code snippet with a WD error stemming from a misunderstanding of the meaning of lexicographically smallest in the given problem description.

Fig. 16: An example code snippet in Python3 with a retained error (value error).

Fig. 18: An example code snippet in Python3 with CMA. The time complexity of the code snippet is O(n log n).

Fig. 19: An example code snippet in JavaScript with IA.

Fig. 21: Distribution of cyclomatic complexity under three difficulty levels of problems.

Fig. 23: Top 20 numbers of cyclomatic complexity combinations in five different languages (C, C++, Java, Python3, and JavaScript) for the same problems.

Fig. 24: Top 20 numbers of cognitive complexity combinations in three different languages (Java, Python3, and JavaScript) for the same problems.

Fig. 25: Heatmap of the numbers of complexity levels of the original code snippets and the final code snippets under the multi-round process.
Furthermore, the exact training data used by ChatGPT remains unknown to us. Consequently, it becomes difficult to ascertain whether the problems we input have been previously used in the training dataset; (2) it is important to note that ChatGPT is a continuously evolving model under ongoing training. The responses generated by ChatGPT in this study can only reflect the performance of the model at the time of our work (i.e., model version gpt-3.5-turbo-0301 of ChatGPT).
• Wrong Answer: The submitted code snippet has no compile errors but cannot pass all test cases.
• Compile Error: The submitted code snippet cannot be compiled.
• Time Limit Exceeded: The runtime of the submitted code snippet exceeds the permitted execution time.
• Runtime Error: The execution of the submitted code snippet triggers a runtime error for at least one test case.
Note that code with W.A. does not necessarily mean it does not contain R.E. or T.L.E.; several errors can occur at the same time. Nevertheless, we study the functional correctness of code generation. Therefore, we take their priorities as C.E., R.E. > W.A. > T.L.E. by default, meaning that we only focus on the judgment results returned by LeetCode; this processing method has a negligible impact on the experimental conclusions. We prioritize C.E. and R.E. because these two errors lead to code running failures, which also imply W.A.; T.L.E. is set to the lowest priority because it mainly relates to non-functional requirements. Moreover, the LeetCode online judgment platform terminates the testing process upon encountering the first failed test case. Thus, the test case pass rates (the percentage of predefined test cases that a submitted code snippet successfully passes) provided by the platform may serve as a lower bound.
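To illustrate this prioritization concretely, below is a minimal Python sketch; the status names are LeetCode's, but the helper itself is hypothetical and not part of our toolchain:

# Default priority when several errors co-occur: C.E., R.E. > W.A. > T.L.E.
PRIORITY = {
    "Compile Error": 0,
    "Runtime Error": 0,
    "Wrong Answer": 1,
    "Time Limit Exceeded": 2,
}

def dominant_status(statuses):
    """Return the highest-priority judgment status among co-occurring ones."""
    return min(statuses, key=lambda s: PRIORITY[s])

# A snippet that both fails a test case and times out counts as Wrong Answer.
print(dominant_status(["Wrong Answer", "Time Limit Exceeded"]))  # Wrong Answer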

TABLE 1: Code-judged Result in C, C++, and Java Languages. It is notable that the analysis result is a lower bound, since it is impossible to check all the ground truth for each solution.

TABLE 3: The Statistics of 157 <Problem, Language> Pairs

TABLE 4: Defect Classification of the 157 <Problem, Language> Pairs

Fig. 10: An example code snippet with an MCC error stemming from a wrongly reasoned recurrence formula.

TABLE 5: Result of Multi-round Code Generation for W.A. Code Snippets

TABLE 6: Result of 10-round Code Generation for 10 W.A. Code Snippets

TABLE 7: Compile Error Classification of All C.E. Code Snippets

▶ Multi-round Fixing: We follow the settings in W.A.'s multi-round fixing. The prompt used in C.E.'s multi-round fixing differs slightly from the previous one. One example is shown below:
Prompt:

TABLE 8: Result of Multi-round Code Generation for C.E. Code Snippets

…(either C.E. or R.E.), respectively. From the result, we can see that most of the code snippets can be fixed. 19 and 22 code snippets in C and C++ get retained errors and changed errors, respectively. For the 115 fixed code snippets, 40 of them are accepted, comprising 30, 7, and 3 in C, C++, and Java, respectively. For the 40 code snippets, their classes of compile errors contain redefinition (1), function declaration error (8), undeclared variable (1), wrong method name (2), redefinition of main (8), use of undeclared function (12), incompatible parameter types (1), uninitialized variable (3), invalid operators (1), and error of #include (1). As for the code snippets with retained errors and changed errors, we manually analyze them and divide the causes of the two errors into two categories as follows:

TABLE 9: Runtime Error Classification of All R.E. Code Snippets

A method (function) or operation receives a valid type of input, but the value itself is not suitable or within the expected range for the specific operation (e.g., passing an empty list to the max function in Python3).
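As a concrete illustration of this error class, here is a minimal Python sketch based on the example named above:

# max() raises ValueError on an empty sequence: the input type (a list)
# is valid, but the value (empty) is outside the operation's expected range.
try:
    max([])
except ValueError as e:
    print(e)  # e.g., "max() arg is an empty sequence"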

From Table 9, we can see that there are a small number of code snippets in the class of constant function for dynamic languages (i.e., Python3 and JavaScript), which is different from C.E.'s results. Moreover, like C.E. errors, wrong method name among R.E. errors is not a real runtime error, and we also find examples where Aft. problems use method names of Bef. problems (e.g., problem 2164).

TABLE 10: Result of Multi-round Code Generation for R.E. Code Snippets

One classic example of AIAI is gcd functions, which use either the modulo operator % or the subtraction operator -. Both variants implement the Euclidean algorithm.

Fig. 17: An example code snippet in C with AIAI.
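For illustration, below is a minimal Python sketch of the two gcd variants just described; the paper's Fig. 17 shows a C version, so these are standard Euclidean-algorithm implementations rather than the paper's exact snippet:

def gcd_mod(a: int, b: int) -> int:
    """Euclidean algorithm using the modulo operator %."""
    while b:
        a, b = b, a % b
    return a

def gcd_sub(a: int, b: int) -> int:
    """Euclidean algorithm using repeated subtraction (assumes a, b > 0)."""
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a

assert gcd_mod(48, 18) == gcd_sub(48, 18) == 6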

TABLE 11: Result of Multi-round Code Generation for T.L.E. Code Snippets

…aligned, but its function is incorrect. In line 8, the condition sum + grades[i] <= count can cause the outer while loop to become an infinite loop if the condition is false, leading to the T.L.E. error. ▶ Multi-round Fixing: We follow the settings in W.A.'s multi-round fixing.
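For intuition, here is a minimal, hypothetical Python sketch of the failure pattern described above; the original snippet appears in a figure not reproduced here, so the names grades, total, and count follow the text and everything else is an assumption. Do not run it on positive grades: it loops forever, which is exactly the T.L.E. behavior being described.

# If the inner condition is False on entry, i never advances, so the
# outer while loop never terminates, producing a T.L.E. verdict.
def count_groups(grades):
    grades.sort()
    i, groups = 0, 0
    while i < len(grades):  # outer loop
        count, total = 0, 0
        while i < len(grades) and total + grades[i] <= count:  # "line 8" condition
            total += grades[i]
            count += 1
            i += 1
        groups += 1  # with positive grades the condition is False immediately
    return groups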

TABLE 12: Cyclomatic Complexity Result of Generated Code

TABLE 13: Cognitive Complexity Result of Generated Code

TABLE 14: Cyclomatic Complexity Result of Written Code

TABLE 15: Cognitive Complexity Result of Written Code

TABLE 16: Comparison between Code Generated by ChatGPT and Code Written by Humans on the Distributions of Cyclomatic and Cognitive Complexities under Three Difficulty Levels of Problems

TABLE 17: Numbers of Differences for Cyclomatic Complexity and Cognitive Complexity Combinations

…CodeQL and SonarQube. The specific queries …
23. The algorithmic code typically focuses on solving specific logical or computational problems and often does not involve managing system resources, network communications, or other operations that are commonly sensitive to security issues.

TABLE 18: Result of Vulnerability Detection

TABLE 19: Result of Security Code Generation

TABLE 20: Group Categories and Descriptions for the 18 CWEs

…table, ChatGPT generates 2,983 valid code snippets, achieving a 99.07% valid rate on average, where 994 (33.32%) of them are vulnerable. Broken down into languages, there are 1,402 (47%) valid code snippets in …
…at lines 7, 9, and 15. The code generated by ChatGPT does not check whether a pointer is NULL. Overall, ChatGPT performs poorly in this group.

TABLE 21: Result of Multi-round Vulnerable Code Fixing

…"Untrusted data passed to external API", … "Call to free with untrusted data from…", … Fix the vulnerable code and generate the fixed code.
If the newly generated code snippet is still vulnerable to the same vulnerability (e.g., CWE-79's ExternalAPISink), the corresponding CWE information (i.e., the vulnerability message) returned by CodeQL is taken directly as a new prompt provided to ChatGPT to fix and generate a new code snippet, within the same conversation. For the code snippets checked by the authors, we tell ChatGPT that the newly generated code snippets still have the same CWE vulnerabilities (e.g., "the newly generated code snippet still contains the CWE-787 vulnerability."). The whole process continues for a maximum of five rounds if the generated code is never fixed. Furthermore, the strategy for mitigating the token limitation follows the setting in W.A. multi-round fixing (Sec. 4.2.1). The results of multi-round fixing are shown in Table 21, where # Fixed represents the number of fixed snippets for the corresponding code scenarios. Out of 160 vulnerable code snippets, 143 …
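For concreteness, below is a minimal Python sketch of this fixing loop; query_chatgpt and run_codeql are hypothetical stand-ins for the ChatGPT API call and the CodeQL scan, while the five-round cap and the message-as-prompt strategy follow the text above:

MAX_ROUNDS = 5  # round limit used in this study

def multi_round_fix(code, cwe_id, query_chatgpt, run_codeql):
    """Repeatedly ask ChatGPT to fix a vulnerable snippet, feeding back
    CodeQL's vulnerability message each round, within one conversation."""
    prompt = "Fix the vulnerable code and generate the fixed code."
    for _ in range(MAX_ROUNDS):
        code = query_chatgpt(prompt, code)  # same conversation each round
        message = run_codeql(code, cwe_id)  # None once no longer vulnerable
        if message is None:
            return code, True               # fixed
        prompt = message                    # CodeQL message as the new prompt
    return code, False                      # still vulnerable after 5 rounds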

TABLE 22: Functional Correctness, Complexity, and Security for Aft. Problems in 10 Trials at Temperature 0.7

TABLE 23: Functional Correctness, Complexity, and Security for Bef. Problems in 10 Trials at Temperature 0.7

TABLE 24: Functional Correctness, Complexity, and Security for Aft. Problems in 10 Trials at Temperature 0

TABLE 27: Multi-round Fixing Process for Algorithm Problems in 5 Trials at Temperature 0.7

TABLE 28: Multi-round Fixing Process for Algorithm Problems in 5 Trials at Temperature 0

TABLE 29: Multi-round Fixing Process for CWE of Algorithm Problems in 5 Trials at Temperature 0.7

TABLE 30: Multi-round Fixing Process for CWE of Algorithm Problems in 5 Trials at Temperature 0

TABLE 31: Multi-round Fixing Process for Security Code Generation in 5 Trials at Temperature 0.7

TABLE 32: Multi-round Fixing Process for Security Code Generation in 5 Trials at Temperature 0