MoreRSS

site iconHackerNoonModify

We are an open and international community of 45,000+ contributing writers publishing stories and expertise for 4+ million curious and insightful monthly readers.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of HackerNoon

研究人员如何测量、检测和评估人工智能操纵行为

2026-03-01 00:15:26

:::info

Authors:

  1. Enes Altuncu, [email protected] (University of Kent, UK)
  2. Virginia N. L. Franqueira, [email protected] (University of Kent, UK)
  3. Shujun Li, [email protected] (University of Kent, UK)

:::

Abstract

Recent advancements in AI, especially deep learning, have contributed to a significant increase in the creation of new realistic-looking synthetic media (video, image, and audio) and manipulation of existing media, which has led to the creation of the new term “deepfake”. Based on both the research literature and resources in English and in Chinese, this paper gives a comprehensive overview of deepfake, covering multiple important aspects of this emerging concept, including 1) different definitions, 2) commonly used performance metrics and standards, and 3) deepfake-related datasets, challenges, competitions and benchmarks. In addition, the paper also reports a meta-review of 12 selected deepfake-related survey papers published in 2020 and 2021, focusing not only on the mentioned aspects, but also on the analysis of key challenges and recommendations. We believe that this paper is the most comprehensive review of deepfake in terms of aspects covered, and the first one covering both the English and Chinese literature and sources.

Keywords: Deepfake, Survey, Definition, Datasets, Benchmarks, Challenges, Competitions, Standards, Performance Metrics.

\

1       Introduction

Recent advancements in AI and machine learning have increased the capability to produce more realistic media, e.g., video, image, and audio. Especially, state-of-the-art deep learning methods enabled the generation of “deepfakes”, manipulated or synthetic media the realness of which are not easily recognisable by the human eye. Although deepfake is a relatively new phenomenon (having first appeared at the end of 2017), its growth has been remarkable. According to the 2019 and 2020 Deeptrace reports on the state of deepfake [2], the number of deepfake videos in the English-speaking internet grew from 7,964 (December 2018) to 14,678 (July 2019) to 85,047 (December 2020), representing a 968% increase from 2018 to 2020.

In this work, we review existing deepfake-related research ecosystem in terms of various aspects, including performance metrics and standards, datasets, challenges, competitions, and benchmarks. Furthermore, we provide a meta-review of 12 selected deepfake-related survey papers which covers several additional aspects other than the mentioned ones in a systematic manner, such as performance comparison, key challenges, and recommendations.

Despite being a hugely popular term, there is a lack of consensus on the definition of “deepfake” and the boundary between deepfakes and non-deepfakes is not clear cut. For this survey, we adopt a relatively more inclusive approach to cover all forms of manipulated or synthetic media that are considered deepfakes in a broader sense. We also cover closely related topics including biometrics and multimedia forensics, since deepfakes are often used to launch presentation attacks against biometrics-based authentication systems and detection of deepfakes can be considered part of multimedia forensics. A more detailed discussion on different definitions of “deepfake” is given next.

\

1.1       Definitions of the Term Deepfake

As its name implies, the term “deepfake” is derived from the combination of “deep” (referring to deep learning (DL)) and “fake”. It is normally used to refer to manipulation of existing media (image, video and/or audio) or generation of new (synthetic) media using DL-based approaches. The most commonly discussed deepfake data are fake face images, fake speech forgeries, and fake videos that combine both fake images and fake speech forgeries. While having “fake” in the word indicates manipulated or synthesised media, there are plenty of benign applications of the deepfake technology, e.g., for entertainment and creative arts. With this respect, another term “deep synthesis” has been proposed as a more neutral-sounding alternative [60]. This new term, however, has not been widely adopted.

In addition to the lack of a universal definition, as mentioned already, the boundary between deepfakes and non-deep fakes is actually not a clear cut. There are at least two important aspects we should consider, one on detection of and the other on creation of deepfakes.

First, detection of deepfakes often follows very similar approaches to detection of traditional fakes generated without using DL techniques. Advanced detection methods have also started leveraging DL to improve their performance, but they do not necessarily need to know how a target media is created (deep or not). To some extent, one could argue that detecting deepfakes does not involve developing deepfake-specific methods (even though some researchers choose to do so), but a more robust and universal detector that can handle any (deep or not) fake media. This can be seen for two closely related topics: biometrics and multimedia forensics. For biometrics, there is a trend of using deep learning techniques to generate fake biometric signals (e.g., face images and videos) for biometric spoofing or presentation attacks. For multimedia forensics, deepfake-based forgeries have become a new threat to the traditional problem of “forgery detection”. For both topics, detection of biometric spoofing and multimedia forgeries have evolved to consider both deep and non-deep fakes.

Second, one may argue that the word “deep” in “deepfake” does not necessarily refer to the use of “deep learning”, but any “deep” (i.e., sophisticated) technology that creates a very believable fake media. For instance, Brady [9] considered deepfake as audio-visual manipulation using “a spectrum of technical sophistication … and techniques”. They also introduced two new terms, Shallowfake and Cheapfake, referring to “low level manipulation of audio-visual media created with (easily) accessible software [or no software] to speed, slow, restage or re-contextualise content”. This broader understanding of “deepfake” has also been adopted by law makers for new legislations combating malicious deepfakes. For instance, the following two United States acts define “deepfakes” as follows:

  • 2018 Malicious Deep Fake Prohibition Act1:

§1041.(b).(2): “the term ‘deep fake’ means an audiovisual record created or altered in a manner that the record would falsely appear to a reasonable observer to be an authentic record of the actual speech or conduct of an individual.

  • 2019 DEEP FAKES Accountability Act2:

§1041.(n).(3): “The term ‘deep fake’ means any video recording, motion-picture film, sound recording, electronic image, or photograph, or any technological representation of speech or conduct substantially derivative thereof—

(A)  which appears to authentically depict any speech or conduct of a person who did not in fact engage in such speech or conduct; and

(B)  the production of which was substantially dependent upon technical means, rather than the ability of another person to physically or verbally impersonate such person.

As we can see from the above legal definitions of “deepfake”, the use of DL as a technology is not mentioned at all. The focus here is on “authenticity”, “impersonation” and (any) “technical means”.

\

1.2      Scope and Contribution

Based on the above discussion on definitions of deepfake, we can see it is not always straightforward or meaningful to differentiate deepfakes from non-deep fakes. In addition, for our focus on performance evaluation and comparison, the boundary between deepfakes and non-deep fakes is even more blurred. This is because DL is just a special (deeper) form of machine learning (ML), and as a result, DL and non-deep ML methods share many common concepts, metrics and procedures.

Despite the fact that deepfake may be understood in a much broader sense, in this work, we have a sufficiently narrower focus to avoid covering too many topics. We, therefore, decided to define the scope of this survey as follows:

  • For metrics and standards, we chose to include all commonly used ones for evaluating general ML methods and those specifically defined for evaluating deepfake creation or detection methods.
  • For datasets, challenges, competitions and benchmarks, we considered those related to fake media covered in the deepfake-related survey papers and those with an explicit mention of the term “deepfake” or a comparable term.
  • For the meta-review, we considered only survey papers whose authors explicitly referred to the term “deepfakes” in the meta data (title, abstract and keywords).

\

2      Methodology

Research papers covered in this survey (i.e., the deepfake-related survey papers) were identified via systematic searches on the scientific databases, Scopus and China Online Journals (COJ)3. The following search queries were used to perform the searches on Scopus and COJ, respectively:

(deepfake* OR deep-fake* OR “deep fake*”) AND (review OR survey OR overview OR systemati* OR SoK)

(deepfake OR 深度伪造) AND (综述 OR 进展)

The searches returned 41 survey papers in English and 15 survey papers in Chinese. Out of these papers, eight published in English and four published in Chinese were selected for consideration.

Deepfake-related challenges, competitions and benchmarks were identified via multiple sources: the survey papers selected, research papers from the co-authors’ personal collections, Google Web searches, and manual inspection of websites of major AI-related conferences held in 2020 and 2021 (where such challenges and competitions are routinely organised). The inspected conferences include those listed in the ACL (Association for Computational Linguistics) Anthology4, ICCV, CVPR, AAAI, ICML, ICLR, KDD, SIGIR, WWW, and many others. In addition, a comprehensive list of datasets was compiled based on the selected survey papers and the identified challenges, competitions, and benchmarks. Relevant standards were identified mainly via research papers covered in this survey, the co-authors’ personal knowledge, and Google Web searches. For performance metrics, we covered those commonly used based on relevant standards, the survey papers, and the identified challenges, competitions, and benchmarks.

\

3      Deepfake-Related Performance Metrics & Standards

In this survey, we focus on performance evaluation and comparison of deepfake generation and detection methods. The metrics used for such performance evaluations are at the core of our discussions. In this section, we review the performance metrics that are commonly used to evaluate deepfake generation and detection algorithms. Note that all metrics covered in this section are also commonly used for evaluating performance of similar systems that are not for generating or detecting deepfakes. Therefore, this section can be seen as a very brief tutorial on general performance metrics.

In the last subsection, we also briefly discuss how the related performance metrics are covered in formal standards. By “formal standards”, we refer to standards defined following a formal procedure, often by one or more established standardisation bodies such as the International Organization for Standardization (ISO)5 and the International Electrotechnical Commission (IEC)6. Note that we consider a broad range of documents defined to be standards by standardisation bodies, e.g., International Telecommunication Union (ITU)7 recommendations and ISO technical reports (TRs).

\

3.1      The Confusion Matrix

Deepfake detection is primarily a binary classification problem. A binary classifier takes an input that is actually positive or actually negative and outputs a binary value denoting it to be predicted positive or predicted negative. For example, a deepfake detection system will take a suspected image as the input that may be actually fake or actually real and output predicted fake or predicted real.

A fundamental tool used in evaluating a binary classifier is the confusion matrix that summarises the success and failure of the classification model. On one axis are the two actual values and on the other axis are the two predicted values. The classification is successful/correct/true (true positive and true negative) when the actual and the predicted values match. It is failed/incorrect/false (false positive and false negative) when the actual and predicted values do not match. Table 1 shows the confusion matrix for a binary deepfake classifier (detector). The two cells in green, TP (the number of true positives) and TN (the number of true negatives), indicate correct prediction results, and the two cells in red, FN (the number of false negatives) and FP (the number of false positives), indicate two different types of errors when making incorrect prediction results.

\ Table 1: Confusion matrix for a binary classifier for detecting deepfake.

| | fake (predicted) | real (predicted) | |----|----|----| | fake (actual) | TP | FN | | real (actual) | FP | TN |

\ \

3.2      Precision and Recall

Based on the four fundamental values introduced in Section 3.1, i.e., TP, TN, FP and FN, we define two important performance metrics for a binary classifier – precision and recall.

Precision of a binary classifier is defined as the fraction of actually positive samples among all the predicted positives. In the confusion matrix, it is the fraction of true samples in the first column. It can be formally defined as Eq. (1).

When the “natural” ratio between positive and negative samples is significantly different from the test set, it is often useful to adjust the weight of the false positives, which leads to the weighted precision (wP) defined in Eq. (2), where α > 0 is a weight determined by the ratio between the negative and positive samples.

Recall of a binary classifier is the fraction of predicted positive samples among the actually positive samples, as shown in Eq. (3). In the confusion matrix, it is the fraction of true samples in the first row.

Let us consider an example binary classifier that predicts if an image from a database containing both deepfake and real (authentic) images is fake or not. Precision of the classifier is the fraction of correctly classified images among all images classified as deepfake. On the other hand, recall is the fraction of deepfake images identified by the classifier, among all deepfake images in the database.

\

3.3      True and False Positive Rates

Focusing on predicted positive samples, we can also define two metrics: true positive rate (TPR), also called correct detection rate (CDR), as the fraction of the predicted positive samples among the actually positive samples and false positive rate (FPR), also called false alarm rate (FAR), as the fraction of the predicted positive samples among the actually negative samples, as shown in Eqs. (4) and

(5). In the confusion matrix, TPR is the fraction of predicted positive samples in the first row and FPR is the fraction of predicted positive samples in the second row. Note that TPR is basically a different name for recall (Eq. (3)).

\

3.4     True and False Negative Rates

Similar to true and false positive rates, we can define two other rates focusing on negative predicted results: true negative rate (TNR) indicating the fraction of the predicted negative samples among the actually negative samples, and false negative rate (FNR) indicating the fraction of the predicted negative samples among the actually positive samples, as shown in Eqs. (6) and (7).

\

3.5      Sensitivity and Specificity

In some applications of binary classifiers, especially in biology and medicine, the TPR and the TNR are more commonly used, and they are often called sensitivity (TPR) and specificity (TNR). The focus of these two terms is on the two types of correctness of the predicted results. These are less used in deepfake-related research, hence, we will not refer to them in the remainder of this paper.

\

3.6     Equal Error Rate

Focusing on error rates means that we need to consider the FPR and the FNR. These two rates normally conflict with each other so that reducing one rate normally leads to an increase in the other. Therefore, rather than trying to reduce both error rates at the same time, which is normally impossible, the more realistic task in practical applications is to find the right balance so that they are both below an acceptable threshold.

In some applications, such as biometrics, people are particularly interested in establishing the so-called equal error rate (EER) or crossover error rate (CER), the point where the FPR and the FNR are equal. The EER/CER is not necessarily a good metric for some applications, especially when the two types of errors are of different levels of importance, e.g., for detecting critical deepfakes (e.g., fake news that can influence how people cast their votes) we can often tolerate more false positives (false alarms) than false negatives (missed alarms).

\

3.7      Accuracy and F-Score

In addition to the EER/CER, there are also other metrics that try to reflect both types of errors, in order to give a more balanced indication of the overall performance of a binary classifier. The two most commonly used are accuracy and F-score (also called F-measure). Both metrics can be defined based on the four fundamental values (TP, TN, FP, and FN).

Accuracy of a binary classifier is defined as the fraction of correctly predicted samples (true positives and true negatives) among the total number of samples that have been classified, as shown in Eq. (8).

The F-score of a binary classifier is actually a family of metrics. Its general form can be described based on a parameter β as defined in Eq. (9).

The most widely used edition of all F-scores is the so-called F1-score, which is effectively the F-score with β = 1. More precisely, it is defined as shown in Eq. (10).

\

3.8     Receiver Operating Characteristic Curve and Area Under Curve

Receiver operating characteristic (ROC) curves are commonly used to measure the performance of binary classifiers that output a score (or probability) of prediction.

Consider the following. Let S be the set of all test samples and let the output scores f (s) (for all sS) lie in the interval [a, b] on the real line. Let t ∈ [a, b] be a prediction threshold for the model, and assume that the classifiers works as follows for all sS:

\ It is easy to see that, for t = a, all the samples will be classified as positive, leading to FN = TN = 0 so TPR = FPR = 1; while for t = b, all the samples will be classified as negative, leading to FP = TP = 0 so TPR = FPR = 0. For other threshold values between a and b, the values of TPR and FPR will normally be between 0 and 1. By changing t from a to b continuously, we can normally get a continuous curve that describes how the TPR and FPR values change from (0,0) to (1,1) on the 2D plane. This curve is the ROC curve of the binary classifier.

For a random classifier, assuming that f (s) distributes uniformly on [a, b] for the test set, we can mathematically derive its ROC curve being the TPR = FPR line, whose area under the ROC curve (AUC) is 0.5. For a binary classifier that performs better than a random predictor, we can also mathematically prove that its AUC is always higher than 0.5, with 1 being the best possible value. Note that no binary classifier can have an AUC below 0.5, since one can simply flip the prediction result to get a better predictor with an AUC of 1 − AUC. The relationship between the ROC and the AUC is graphically illustrated in Figure 1.

Figure 1: A representative ROC curve showing how TPR and FPR change w.r.t. the (hidden) threshold t. The area under the (ROC) curve (AUC) is shown in grey.

\

3.9     Log Loss

Another widely used performance metric for binary classifiers that can return a probability score for the predicted label is log loss. For a binary classification with a true label y ∈ {0*,* 1} and an estimated probability p = Pr(y = 1), the log loss per sample is the negative log-likelihood of the classifier given the true label, defined as shown in Eq. (12).

Given a testing set with n samples, the log loss score of a binary classifier can be calculated using Eq. (13), where yi is 1 if the i-th sample is true and 0 if false, and yˆi is the predicted probability of yi = 1.

3.10     Extension to Multi-class Classifiers

All metrics that are defined based on the four basic values TP, TN, FP and FN can be easily extended to multi-class classification by considering the prediction to be true or false individually with respect to each class. For example, if the system is classifying animals (cats, dogs, horses, lions, tigers, etc.), then a true positive prediction of an image to be of a cat, would simultaneously be true negative predictions for the remaining classes (dogs, horses, lions, tigers, etc.). If an image of a cat is incorrectly predicted to be that of a dog, it would be a false negative with respect to a cat, a false positive with respect to a dog, and a true negative with respect to all other classes.

\

3.11      Perceptual Quality Assessment (PQA) Metrics

By definition, the main goal of deepfakes is to make it hard or impossible for human consumers (listeners or viewers) to distinguish fake media from real media. Therefore, when evaluating the quality of deepfake media, the quality perceived by human consumers of the media is key. This calls for subjective assessment of the perceptual quality of the deepfake media as the “gold standard”. The most widely used subjective perceptual quality assessment (PQA) metric for audio-visual signals is mean opinion score (MOS), which has been widely used by the signal processing and multimedia communication communities, including digital TV and other multimedia-related consumer applications. As its name implies, MOS is calculated by averaging the subjective scores given by a number of human judges, normally following a numerical scale between 1 and 5 or between 0 and 100. MOS has been used in some deepfake-related challenges (see Section 5.2) and also for evaluating and comparing the quality (realness/naturalness) of deepfake datasets (see Section 4.6).

As a general subjective PQA metric, MOS has been standardised by the ITU8. There are also ITU standards defining more specific subjective Video Quality Assessment (VQA) metrics and the standard procedures one should follow to conduct VQA user studies, e.g., ITU-T Recommendation P.910 “Subjective video quality assessment methods for multimedia applications”9. Note that the ITU standards focus more on traditional perceptual quality, i.e., how good a signal looks or sounds, even if it looks or sounds not real (e.g., too smooth). On the other hand, for deepfakes, the focus is rather different because what matters is the realness and naturalness of the created media, i.e., how real and natural it looks or sounds, even if it is of low quality. To some extent, we can also consider realness and naturalness as a special aspect of perceptual quality.

One major problem of subjective PQA metrics like MOS is the need to recruit human judges and to have a well-controlled physical testing environment and protocol, which are not easy for many applications. To help reduce the efforts and costs of conducting PQA-related user studies, various objective PQA metrics have been proposed, where the term “objective” refers to the fact that such metrics are human-free, i.e., automatically calculated following a computational algorithm or process. Depending on whether a reference exists, such objective PQA metrics can be largely split into three categories: full-reference (FR) metrics (when the original “perfect-quality” signal is available as the reference), reduced-reference (RR) metrics (when some features of the original “perfect-quality” signal are available as the reference), and no-reference (NR) metrics (when the original signal is unavailable or such an original signal does not exist). For deepfakes, normally NR or RR metrics are more meaningful because the “fake” part of the word means that part of the whole data does not exist in the real world, hence a full reference cannot be obtained. RR metrics are still relevant because deepfakes are often produced for a target’s specific attributes (e.g., face and voice), where the reduced reference will be such attributes. NR metrics will be useful to estimate the realness and naturalness of a deepfake, simulating how a human judge would rate it in a controlled subjective PQA user study.

PQA is a very active research area and many PQA metrics have been proposed, some of which have been widely used in real-world products and services, e.g., mean squared error (MSE), peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) for FR PQA of digitalimages and videos defined as in Eqs. (14), (15), and (16), respectively, where X = {xi} n i is the reference (the original signal), Y = {yi} n i is the signal whose visual quality is assessed, n is the number of pixels in X and Y , L is the maximum possible pixel value of X and Y (e.g., 255 for 8-bit gray-scale images), c1 = (k1L) 2 and c2 = (k2L) 2 ) are two stabilising parameters (k1 = 0.01 and k2 = 0.03 by default). For more about PQA metrics for different types of multimedia signals, we refer readers to some relevant surveys [3, 51, 72].

3.12      More about Standards

Many of the basic performance metrics described in this section have been widely used by deepfake researchers as de facto standards, e.g., EER, log loss and MOS have been widely used in deepfake-related challenges (see Section 5). Also, the combination of precision, recall and F1-score has been widely used to assess performance of binary classifiers. While there have been a number of ITU standards on PQA to date, there does not seem to be many standardisation efforts on the performance metrics for evaluation of binary classifiers. This was the case until at least 2017, when ISO and IEC jointly set up the ISO/IEC JTC 1/SC 4210, a standardisation subcommittee (SC) focusing on AI under ISO/IEC JTC 111, the joint technical committee for standardising “information technology”.

One recent effort that ISO/IEC JTC 1/SC 42 made is to produce the ISO/IEC TR 24029-1:2021 “Artificial Intelligence (AI) – Assessment of the robustness of neural networks – Part 1: Overview”12, a technical report (TR) that systematically covers many commonly used performance assessment concepts, methods and metrics. Although the technical report has “neural networks” in its title, most performance assessment concepts, methods and metrics included are common ones for all supervised machine learning models.

In terms of performance metrics, two other ongoing work items of the ISO/IEC JTC 1/SC 42 that deserve attention are as follows:

  • ISO/IEC DTS (Draft Technical Specification) 4213 “Information technology – Artificial Intelligence – Assessment of machine learning classification performance”13
  • ISO/IEC AWI (Approved Work Item) TS (Technical Specifications) 5471 “Artificial intelligence – Quality evaluation guidelines for AI systems”14

While the ISO/IEC JTC 1/SC 42 was created very recently, another standardisation subcommittee under ISO/IEC JTC1 has a much longer history of nearly 20 years: the ISO/IEC JTC 1/SC 3715 that focuses on biometrics-related technology. This standardisation subcommittee is highly relevant for deepfake since deepfake faces can be used to spoof biometrics-based user authentication systems. In this context, the following three standards are of particular relevance:

ISO/IEC 19795-1:2021 “Information technology – Biometric performance testing and reporting – Part 1: Principles and framework”16: This standard covers general metrics about evaluating biometric systems. Two major metrics in this context are false accept rate (FAR) and false reject rate (FRR), which refer to the standard FPR and FNR, respectively. This standard also deprecates the use of single-number metrics including the EER and AUC (which were widely used in biometrics-related research in the past).

ISO/IEC 30107-1:2016 “Information technology – Biometric presentation attack detec-tion – Part 1: Framework”17: This standard defines a general framework about presentation attack detection (PAD) mechanisms, where the term “presentation attack” refers to the “presentation of an artefact or of human characteristics to a biometric capture subsystem in a fashion intended to in-terfere with system policy”. It focuses on biometric recognition systems, where a PAD mechanism is a binary classifier trying to predict presentation attacks (also called attack presentations, e.g., fake faces) as positive and bona fide (real) presentations as negative.

ISO/IEC 30107-3:2017 “Information technology – Biometric presentation attack detection – Part 3: Testing and reporting”18: This standard defines a number of special performance metrics for evaluating PAD mechanisms standardised in the ISO/IEC 30107-1:2016. Three such metrics look at error rates: attack presentation classification error rate (APCER) referring to the standard FPR, normal/bona fide presentation classification error rate (NPCER/BPCER) referring to the standard FNR, and average classification error rate (ACER) that is defined as the average of the APCER and the NPCER/BPCER. Such metrics have been used in biometrics-related challenges such as Face Anti-spoofing (Presentation Attack Detection) Challenges19. When deepfake images or videos are used to spoof a biometric system, such standardised metrics will become relevant.

\

3.13      Discussion: Performance Metrics & Standards

This section provided a comprehensive summary of performance metrics used for evaluating and bench-marking binary classifiers. It is rare that all such metrics are used for a specific application. Instead, one or several are chosen based on specific needs. For a deepfake detection system as a binary classifier, many researchers have chosen to use overall metrics such as accuracy, AUC, EER and log loss, but the combination of precision, recall and F1-score is also common. Some deepfake-related challenges and competitions have introduced their own specific metrics, some of which will be described in Section 5. The use of different performance metrics can make comparison of different reported results more difficult, so we hope the expected new ISO/IEC standard particularly ISO/IEC 4213 will help.

It is worth mentioning that, in addition to evaluating performance of deepfake detectors, the introduced performance metrics for evaluating binary classifiers can also be used to evaluate performance of deepfake generation methods by considering how deepfake detectors fail. For instance, organisers of the Voice Conversion Challenge 2018 and 2020 used this approach to benchmark how well voice conversion (VC) systems can generate high-quality fake speech samples.

Another point we would like to mention is that for deepfake videos there are two levels of performance metrics: those at the frame level (metrics of each frame), and those at the video level (metrics for the whole video). Generally speaking, the latter can be obtained by averaging the former for all frames, potentially following an adaptive weighting scheme, so that more important (key) frames will be counted more.

\

4      Deepfake-Related Datasets

In this section, we cover all deepfake-related datasets we identified from the meta-review of deepfake-related survey papers, deepfake-related challenges, competitions and benchmarks covered, one online collection of deepfake-related datasets on GitHub20, and the co-authors’ personal collections. Table 2 shows basic information about these datasets. We explain them in four categories: deepfake image datasets, deepfake video datasets, deepfake audio/speech datasets, and hybrid deepfake datasets (mainly mixed image and video datasets).

Note that many datasets of real (authentic) media were also used by deepfake researchers for two purposes. First, any detectors would need both fake and real media to demonstrate their performance. Second, real media have also been used to train deepfake generators as the training set. In this section, we include only datasets containing deepfake media, some of which contain both deepfake and real media.

Some datasets, especially those created for deepfake-related challenges and competitions, have separate subsets for training and evaluation (testing) purposes. The split is necessary for such challenges and competitions, but not very useful for people who just want to use such datasets. Therefore, in this section when introducing such datasets we will ignore that level of details and focus on the total number of data including the number of real and fake samples.

Table 2: Deepfake-related datasets

\

4.1      Deepfake Image Datasets

SwapMe and FaceSwap dataset [78]: This dataset contains 4,310 images, including 2,300 real images and 2,010 fake images created using FaceSwap21 and the SwapMe iOS app (now discontinued).

Fake Faces in the Wild (FFW) dataset [32]: This dataset contains 131,500 face images, including 78,500 images extracted from 150 videos in the FaceForensics dataset and 53,000 images extracted from 150 fake videos collected from YouTube.

generated.photos datasets22: This is a number of commercial datasets provided by the Generated Media, Inc., with up to nearly 2.7 million synthetic face images generated by StyleGAN. A free edition with 10,000 128x128 synthetic images is made available for academic research. The website also provides an interactive face generator23 and an API24. The generated.photos datasets have a good diversity: five age groups (infants, children, youth, adults, middle-aged), two genders (male and female), four ethnicities (white, black, Latino, Asian), four eye colours (brown, grey, blue, green), four hair colours (brown, black, blond, gray), three hair length (short, medium, long), facial expressions, three head poses (front facing, left facing, right facing), two emotions (joy and neutral), two face styles (natural, beautified). (According to a number of research papers we read, an earlier 100K-Faces dataset was released by generated.photos for academic research in 2018, which was used by many researchers. This dataset is not currently available any longer.)

MesoNet Deepfake Dataset [1]: This dataset includes 19,457 face images, including 7,948 deepfake images generated from on 175 forged videos collected online and 11,509 real face images collected from various online sources. (Table 2 of the paper shows the dataset size is 19,509, but the dataset downloaded from pCloud contains just 19,457 images.)

100K-Generated-Images [30]: This dataset includes 100,000 synthesised face, bedroom, car and cat images by a GAN generator trained based on real images in the FFHQ25 and LSUN26 datasets (three object types – bedrooms, cars and cats – for the latter). Note that the name “100K-Generated-Images” was not a proper one as the authors [30] just used this to name a sub-folder of their Google Drive shared space, but it was used in one of the survey papers [65].

Ding et al.’s swapped face dataset [17]: This dataset contains 420,053 images of celebrities, including 156,930 real ones downloaded using Google Image API and 263,123 fake face-swapped ones created using two different methods (Nirkin’s method and Auto-Encoder-GAN)

iFakeFaceDB [48]: This dataset includes 87,000 224x224 face images, generated by processing some StyleGAN-generated synthetic images using the GAN-fingerprint Removal approach (GANprintR) proposed by Neves et al.. It is the replaced version of the FSRemovalDB dataset, which contains 150,000 face images generated using an earlier version of GANprintR.

Faces-HQ [21]: This dataset includes 40,000 images, half real and half deepfake. The images were collected from four sources: the CelebA-HQ dataset27, the Flickr-Faces-HQ dataset28, the 100K-Faces dataset29 (not available any longer, see the description of generated.photos datasets), and thisperson-doesnotexist.com.

CelebA-Spoof [75]: This dataset includes 625,537 synthesised face images of 10,177 celebrities, with 43 rich attributes on face, illumination, environment and spoof types. The real images were selected from the CelebA dataset30. The 43 attributes include 40 for real images, covering all facial components and accessories (e.g., skin, nose, eyes, eyebrows, lip, hair, hat, eyeglass), and 3 for fake images, covering spoof types, environments and illumination conditions.

Diverse Fake Face Dataset (DFFD) [11]: This dataset contains 299,039 images, including 58,703 real images sampled from three datasets (FFHQ31, CelebA32 and FaceForensics++33) and 240,336 fake ones in four main facial manipulation types (identity swap, expression swap, attribute manipulation, and entire synthesis). The images cover two genders (male and female), a wide age groups (the majority between 21 and 50 years old), and both low- and high-quality levels.

\n

4.2     Deepfake Video Datasets

DeepfakeTIMIT [35]: This dataset contains 620 deepfake face videos, generated by face swapping without manipulation of audio, covering 32 subjects and two quality levels (high and low).

FaceForensics (FF) [55]: This dataset contains 1,004 face videos with over 500,000 frames, covering various quality levels and two types of facial manipulation. This dataset is now replaced by the larger FaceForensics++ dataset (see below).

FaceForensics++ (FF++) [56]: This dataset contains 5,000 face videos with over 1.8 million manipulated frames, including 1,000 real videos (with 509,914 frames) downloaded from YouTube, and 4,000 fake videos created using four face manipulation methods (Deepfakes, Face2Face, FaceSwap and NeuralTextures). The videos cover two genders (male and female), and three quality levels (VGA/480p, HD/720p, and FHD/1080p).

UADFV dataset [39]: This dataset contains 98 face videos, half (49) are real ones downloaded from Youtube, and the other half are fake ones generated using the FakeApp mobile application (which is now discontinued). The video dataset was created to used to demonstrate a deepfake video detection method based on detection of eye blinking behaviours, so all videos contain at least one eye-blinking event. All fake videos were created by swapping the original face in each of the real videos with the face of the actor Nicolas Cage34, thus, only one subject is represented.

Deep Fakes Dataset [10]: This dataset contains 142 “in the wild” deepfake portrait videos, collected from a range of online sources including news articles, online forums, mobile apps, and research presentations. The videos are diverse, covering the source generative model, resolution, compression, illumination, aspect-ratio, frame rate, motion, pose, cosmetics, occlusion, content, and context.

DFDC (Deepfake Detection Challenge) preview dataset [18]: This dataset contains 5,244 face videos of 66 subjects with both face and voice manipulation. It was released as a preview of the full dataset of the 2020 Deepfake Detection Challenge (DFDC, see below).

Celeb-DF v135: This dataset contains 1,203 face videos of celebrities, including 408 real videos collected from YouTube with subjects of different ages, ethic groups and genders, and 795 deepfake videos synthesised from these real videos.

Celeb-DF v2 [40]: This dataset contains 6,229 face videos of celebrities, including 590 real videos collected from YouTube with subjects of different ages, ethic groups and genders, and 5,639 deepfake videos synthesised from these real videos.

DeepFake Detection (DFD) Dataset [20]: This dataset contains 3,363 face videos, covering 28 subjects, gender, and skin colour. It was created as a joint effort between two units of Google, Inc.: Google AI36 and JigSaw37.

DeeperForensics-1.0 [27]: This dataset contains 60,000 indoor face videos (with 17.6 million frames) generated by face swapping, covering 100 subjects, four skin tones (white, black, yellow, brown), two gen-ders (male and female), different age groups (20-45), 26 nationalities, 7 different angles, 8 face expressions, and different head poses.

DFDC (Deepfake Detection Challenge) full dataset [18]: This dataset contains 128,154 face videos of 960 subjects, including 23,654 real videos from 3,426 paid actors and 104,500 deepfake videos created using eight different methods (DF-128, DF-256, MM/NN face swap, NTH, FSGAN, StyleGAN, refinement, and audio swap).

FFIW10K (Face Forensics in the Wild) dataset [79]: This dataset contains 10,000 high-quality forgery videos, with video- and face-level annotations. The dataset focuses on a more challenging case for forgery detection: each video involves one to 15 individuals, but only some (a minority of) faces are manipulated.

Korean DeepFake Detection Dataset (KoDF) [36]: This dataset contains 37,942 videos of paid subjects (395 Koreans and 8 Southeastern Asians), including 62,166 real videos and 175,776 fake ones created using six methods – FaceSwap, DeepFaceLab, FSGAN, First Order Motion Model (FOMM), Audio-driven Talking Face HeadPose (ATFHP) and Wav2Lip. The videos cover a balanced gender ratio and a wide range of age groups.

VideoForensicsHQ [23]: This dataset contains 1,737 videos with 1,666,816 frames, including 1,339,843 real frames and 326,973 fake frames generated using the Deep Video Portraits (DVP) [34] method. The original videos were obtained from three sources: the dataset used in [33], the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [42], and YouTube. Most videos have a resolution of 1280×720.

WildDeepfake [81]: This dataset contains 7,314 face sequences extracted from 707 deepfake videos that were collected completely from the Internet. It covers diverse scenes, multiple persons in each scene and rich facial expressions. Different from other deepfake video datasets, WildDeepfake contains only face sequences not the full videos. This makes the dataset more like between an image dataset and a video one. We decided to keep it in the video category since the selection process was still more video-focused.

\

4.3     Deepfake Audio/Speech Datasets

Voice conversion (VC) is a technology that can be used to modify an audio and speech sample so that it appears as if spoken by a different (target) person than the original (source) speaker. Obviously, it can be used to generate deepfake audio/speech samples. The biennial Voice Conversion Challenge38 that started in 2016 is a major challenge series on VC. Datasets released from this challenge series are very different from other deepfake datasets: the deepfake data is not included in the original dataset created by the organisers of each challenge, but in the participant submissions (which are retargeted/fake utterances produced by VC systems built by participants). The challenge datasets also include the evaluation (listening-based) results of all submissions. Some fake utterances may be produced by DL-based VC systems, so we consider all datasets from this challenge series relevant for our purpose of this survey.

Voice Conversion Challenge 2016 database [62]: The original dataset created by the challenge organisers was derived from the DAPS (Device and Produced Speech) Dataset [47]. It contains 216 utterances (162 for training and 54 for testing) per speaker from 10 speakers. Participating teams (17) developed their own VC systems for all 25 source-target speaker pairs, and then submitted generated utterances for evaluation. At least six participating teams used DL-related techniques (LSTM, DNN) in their VC systems (see Table 2 of the result analysis paper39), so the submitted utterances can certainly be considered deepfakes.

Voice Conversion Challenge 2018 database [44]: The original dataset created by the challenge organisers was also based on the DAPS dataset. It contains 116 utterances (81 for training and 35 for testing) per speaker from 12 speakers in two different tasks (called Hub and Spoke). Participating teams (23 in total, all for Hub and 11 for Spoke) developed their own VC systems for all 16 source-target speaker pairs, and then submitted generated utterances for evaluation. Comparing with the 2016 challenge, more participating teams used DL-related techniques (e.g., WaveNet, LSTM, DNN, CycleGAN, DRM – deep relational models, and ARBM – adaptive restricted Boltzmann machines) in their VC systems.

Voice Conversion Challenge 2020 database [70]: This dataset is based on the Effective Multilingual Interaction in Mobile Environments (EMIME) dataset40, a bilingual (Finnish/English, German/English, and Mandarin/English) database. It contains 145 utterances (120 for training and 25 for testing) per speaker from 14 speakers for two different tasks (with 4 × 4 and 4 × 6 source-target speaker pairs, respectively). Participating teams (33 in total, out of which 31 for Task 1 and 28 for Task 2) developed their own VC systems for all source-target speaker pairs, and then submitted generated utterances for evaluation. Comparing with the 2018 challenge, DL-based VC systems were overwhelmingly used by almost all participating teams (WaveNet and WaveGAN among the most used DL-based building blocks).

A major set of deepfake speech datasets were created for the ASVspoof (Automatic Speaker Verification Spoofing and Countermeasures) Challenge41 (2015-2021, held biannually). The datasets for the 2019 and 2021 contain speech data that can be considered deepfakes.

ASVspoof 2019 Challenge database [67]: This dataset is based on the Voice Cloning Toolkit (VCTK) corpus42, a multi-speaker English speech database captured from 107 speakers (46 males and 61 females). Two attack scenarios were considered: logical access (LA) involving spoofed (synthetic or converted) speech, and physical access (PA) involving replay attacks of previously recorded bona fide recordings). For our purpose in this survey, the LA scenario is more relevant. The LA part of the dataset includes 12,483 bona fide (real) utterances and 108,978 spoofed utterances. Some of the spoofed speech data for the LA scenario were produced using a generative model involving DL-based techniques such as long short-term memory (LSTM)43, WaveNet [50], WaveRNN [28], WaveCycleGAN2 [58]. Note that the challenge organisers did not use the term “deepfake” explicitly, despite the fact that the DL-generated spoofed speech data can be considered as deepfakes.

ASVspoof 2021 Challenge – Logical Access Database [14]: This dataset contains bona fide and spoofed speech data for the logical access (LA) task. The challenge is still ongoing and we did not find a detailed paper on the dataset, so cannot include more details other than its size (7.8 GB after compression). Although we did not see details of the generative algorithms used to produce spoofed speech data, we believe similar DL-based algorithms were used like for the 2019 challenge.

ASVspoof 2021 Challenge – Speech Deepfake Database [15]: In 2021, the challenge included an explicitly defined track on deepfake, but the task description suggests that the organisers of the challenge considered a broader definition of the term “deepfake” by looking at spoofing human listeners rather than ASV (Automatic Speaker Verification) systems. The size of the dataset is 34.5 GB after compression.

Possibly because of the long history and wide participation of the community in the ASVspoof challenges for creating the dedicated datasets, there are very few other deepfake audio/speech datasets. One such dataset was created by a group of researcher from Baidu Research [5]. This dataset was created to demonstrate a proposed voice cloning method. It is relatively small, and contains 134 utterances, including 10 real ones, 120 cloned ones, and 4 manipulated ones. Another dataset was created by Google AI and Google News Initiative44, but it was made part of the ASVspoof 2019 dataset. This dataset contains thousands of phrases spoken by 68 synthetic “voices” covering a variety of regional accents.

\

4.4     Hybrid Deepfake Datasets

NIST OpenMFC (Open Media Forensics Challenge) Datasets45: These datasets were created by the DARPA Media Forensics (MediFor) Program46 for the 2020 OpenMFC47. There are two GAN-generated deepfake datasets, one with more than 1,000 deepfake images and the other with over 100 deepfake videos. The datasets were made available to registered participants of the competition only.

ForgeryNet [25]: This dataset is named as “a versatile benchmark for comprehensive forgery analysis”. It contains 2,896,062 images and 221,247 videos, including 1,457,861 fake images and 121,617 fake videos. The videos and images cover seven image-level and eight video-level manipulation approaches, 36 different types of perturbations and more mixed perturbations, and a large number of annotation labels (6.3 million classification labels, 2.9 million manipulated area annotations and 221,247 temporal forgery segment labels). The dataset is being used for supporting the Face Forgery Analysis Challenge 202148 at the SenseHuman 2021 (3rd Workshop on Sensing, Understanding and Synthesizing Humans)49, co-located at the ICCV 2021 conference50.

\

4.5      A Deepfake Dataset Generator

DatasetGAN [74]: This is not actually a dataset per se, but a system for producing large datasets more automatically, including generating deepfake datasets. One may argue the automatically generated datasets are fake since they are not produced from real-world scenes.

\

4.6     Subjective Quality of Deepfakes in Different Databases

As mentioned in Section 4.7, subjective quality evaluation is necessary to evaluate the realness, realisticness, and naturalness of deepfake media. While there has been very limited work on this topic, in 2020, Jiang et al. [27] conducted a user study on realness of deepfake videos. They recruited 100 professional participants (most of whom are computer vision researchers), who were asked to evaluate the realness of 30 randomly selected videos from 7 deepfake video datasets (DeeperForensics-1.0, UADFV, DeepFake-TIMIT, Celeb-DF, FaceForensics++, Deep Fake Detection, and DFDC). Participants were asked to respond to the statement “The video clip looks real.” and gave scores following a five-point Likert scale (1 – clearly disagree, 2 – weakly disagree, 3 – borderline, 4 – weakly agree, 5 – clearly agree).

Table 3 shows the results. Interestingly, we can see a huge difference between the realness levels of different datasets. What is probably quite surprising is that FaceForensics++, one of the most widely used deepfake datasets, has a very low MOS score and less than 9% of participants considered the 30 selected videos as real.

Table 3: Human-judged subjective quality (realness) of deepfake videos in 7 datasets. The MOS scores were not reported by Jiang et al., but calculated by us based on the raw data shown in Table 3 of [27].

\

4.7      Discussion: Datasets

Among all deepfake image and video datasets, a significant majority are about face images and videos. This is not surprising since face swapping, face attribution manipulation, and fully synthesised face images are among the hottest topics within deepfake research and real-world applications. We hope more non-face deepfake image and video datasets can be produced to support a broader range of research activities on deepfake.

The subjective quality results shown in Table 3 indicate that it is important to check realness of deep-fake media to support any performance evaluation or comparison. To ensure that the quality evaluation of datasets is fair, transparent and reliable, standard procedures need defining and a common pool of qualified human experts should be used.

Many authors of deepfake-related datasets attempted to classify such datasets into different generations. Chronologically speaking, we could broadly split such datasets into two generations: before 2019 and since 2019. Typically, datasets created before 2019 are relatively less advanced and smaller, while those created after 2019 tend to be larger, more diverse (i.e., covering more attributes), and of higher quality (i.e., produced by more advanced generative models). This can also be seen from the data in Table 3, in which the top two datasets (DeeperForensics-1 and Celeb-DF) fall within the new generation (2020), while others belong to the old generation. In addition to the two generations, a newer generation has also emerged in 2021: a number of very recent datasets started focusing on more realistic deepfakes (i.e., in the wild) or more specified areas of deepfakes (e.g., FFIW10K focusing on multiple faces in the same video, and KoDF focusing on Korean faces). This trend shows that the deepfake research community has grown significantly in the past few years so that narrower topics have also started gaining attention and interest from some researchers.

\

5      Deepfake-Related Challenges, Competitions & Benchmarks

This section reviews initiatives aiming to advance the state-of-the-art of detection and generation of synthetic or manipulated media (such as video, image and audio) via competitions or challenges open to the public, and ongoing benchmarks tackling specific problems.

\

5.1       Detection of Manipulated Media

The Deepfake Detection Challenge (DFDC)51 was an initiative promoted by an AI and Media Steering Committee52, including BBC, Facebook, Amazon, Microsoft and New York Times, and some universities around the world including the University of Oxford. The competition remained open from 5 September 2019 till 31 March 2020, and involved 3 stages. At first, the DFDC preview dataset was released. At a later stage, the DFDC full dataset was also made available to the 2,114 participants of the competition incorporating face and audio swap techniques for generation of deepfake content. At the final stage, the submitted models were evaluated using a test dataset (referred to as the “black box dataset”) of 10,000 videos which included in-the-wild deepfake videos. The best performance on the black box dataset had an accuracy of 65.18%, according to the released results [22]. Submissions were ranked53 according to the overall log loss score, as defined in Eq. (13). All top five ranked models (the winner had the lowest overall log loss) are available on GitHub. Results indicate how challenging the detection of deepfake is since the best accuracy was low and “many submissions were simply random”, according to Dolhansky et al. [19]. Figure 2 shows a screenshot of the leaderboard with the five finalists. The first top ranked model used MTCNN (Multi-tasked Cascaded Convolutional Network), the second used WS-DAN (Weakly Supervised Data Augumentation Network), and the third used the EfficientNetB7 architecture. Meta compiling the common themes observed in the winning models, they were: clever augmentations, architectures, and absence of forensics methods. Moving forward, they called for “solutions that go beyond analysing images and video. Considering context, provenance, and other signals may be the way to improve deepfake detection models”.

Figure 2: Screenshot of leaderboard with top five finalists of the DFDC competition.

\ The Automatic Speaker Verification Spoofing And Countermeasures Challenge Workshop (ASVspoof)54 has been running biennially since 2015. This competition is organised by an international consortium that includes Inria and EURECOM (France), University of Eastern Finland, National Institute of Informatics (Japan), and Institute for Infocomm Research (Singapore). This year the ASVspoof challenge includes, for the first time, a sub-challenge focused on Speech DeepFake where the envisioned use case is an adversary trying to fool a human listener. The metric used for evaluating performance of submitted solutions (i.e., classifiers) is EER. Four baseline solutions55 (also called “countermeasures”), each using a different technique, were made available to participants with their corresponding EER metric values. The ASVspoof 2021 Speech Deepfake Database containing audio recordings with original and spoofed utterances has also been made available. The competition involves three phases56: a progress phase, an evaluation phase and a post-evaluation phase; it is unclear how teams move from one phase to the next. More information about the 2021 competition is available in the published evaluation plan [13]. The organisers of the competition noted that they opted for the EER as the performance evaluation met-ric for countermeasures submitted to the speech deepfake task for legacy reasons. They acknowledged, however, that “EER reporting is deprecated ” by the ISO/IEC 19795-1:202157 standard. Despite the fact that only the 2021 ASVspoof competition contained a track explicitly related to deepfake, some data in the ASVspoof 2019 dataset (Logical Access task) used for the 2019 competition was generated using DL-based algorithms as mentioned in Section 4. We expect that this also holds for the ASVspoof 2021 dataset (Logical Access task). The ASVspoof 2019 competition used the EER as secondary metric; the primary performance metric used was the tandem detection cost function (t-DCF) [63]. According to its evaluation plan [69], t-DCF assesses the performance of the whole tandem system whereby “a CM [countermeasure] serves as a ‘gate’ to determine whether a given speech input originates from a bona fide (genuine) user, before passing it the main biometric verifier (the ASV system)”. It is calculated according to Eq. (17), where P cm (s) and P cm(s) are, respectively, “the miss rate and the false alarm rate of the CM system at threshold s”.

For further information about Eq. (17), including constants C1 and C2, please refer to the ASVspoof 2019 evaluation plan [69].

An implementation of the t-DCF metric has been made available by the ASVspoof 2019’s organisers in Python58 and Matlab59 formats.

The Face Anti-spoofing (Presentation Attack Detection) Challenge60 started in 2019. Its first two editions were held at the 2019 and 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), respectively. Its third edition was moved to be co-located with the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021). This competition series was organised by a group of researchers from academia and industry in China, Mexico, Spain, Finland and the US. The 2021 competition was focused on 3D high-fidelity mask attacks, and followed a 2-phased61 process. The first phase is the “development phase”; it started in April 2021 when the CASIA-SURF HiFiMask dataset62 was released to participants. The second phase is the “final ranking phase” (June 2021), when the competition ended. The competition adopted the following performance metrics for evaluation63 of the solutions submitted: attack presentation classification error rate (APCER), normal/bona fide presentation classification error rate (NPCER/BPCER), and average classification error rate (ACER), in accordance with the ISO/IEC 30107-3:201764 standard. Figure 3 provides the leaderboard for the top three solutions.

Figure 3: Screenshot of leaderboard with top three finalists of the Face Anti-spoofing Challenge 2021 competition.

\ The FaceForensics Benchmark65 is an ongoing automated benchmark for detection of face manipulation. The organisers of the benchmark made the FaceForensics++ dataset available for training. Manipulated videos (4,000 in total) were created using four techniques, i.e., two computer graphics-based approaches (Face2Face and FaceSwap) and two learning-based approaches (DeepFakes and Neural Textures). The deepfakes videos were generated using a slightly modified version of FaceSwap66, and the Neural Textures videos were created using the approach proposed by Thies et al. [61]. The benchmark test dataset is created from the collection of 1,000 images randomly selected from either the manipulation methods or the original videos [56]. Participants have to submit results to the benchmark, rather then code like other competitions; this is illustrated in Figure 4a. The outcome of a submission is illustrated in Figure 4b, where the scores are a measure of accuracy (Eq. (8)).

Figure 4: Illustration of the FaceForensics Benchmark in terms of submission and result.

\ The Open Media Forensics Challenge (OpenMFC, formerly DARPA MFC)67 is an annual image and video forensics evaluation aiming to facilitate development of multimedia manipulation detection systems. It has been organised annually68 starting from 2017 under the name of DARPA MFC. In 2020, the National Institute of Standards and Technology (NIST) initiated the OpenMFC as a new evaluation platform, based on their previous experiences with the DARPA MFC series, to make the participation more convenient for all researchers. In OpenMFC 2020, two deepfake-related tasks were included for the first time: Image GAN Manipulation Detection (IGMD) and Video GAN Manipulation Detection (VGMD). The organisers provided an image evaluation dataset for the IGMD task, containing 1,000 images from over 200 image journals69, and a video evaluation dataset for the VGMD task, including over 100 test videos. Furthermore, they provided the datasets70 used in the previous MFC challenges as development datasets. The challenge is composed of two main phases for development and evaluation, respectively, and a pre-challenge phase for quality control testing. For evaluation of submissions, AUC-ROC is used as the primary metric. Furthermore, CDR@FAR, where CDR refers to correct detection rate or TPR (Eq. (4)) and FAR refers to false alarm rate or FPR (Eq. (5)), is also used as a metric [49]. The DeeperForensics Challenge 202071 is a deepfake face detection challenge held at the 2020 ECCV

SenseHuman Workshop72. The challenge used the DeeperForensics1.0 dataset.

The organisers provided a hidden test dataset to better simulate real-world scenarios. The challenge involved two phases: the “development phase” that started in August 2020 allowing 100 successful sub-missions, and the “final test phase” that started in October 2020 allowing 2 successful submissions until the end of the month. The submissions were evaluated using the binary cross-entropy loss (BCELoss) metric, calculated according to Eq. (18), where N is the number of videos in the hidden test set, yi is the ground truth label of video i (fake:1, real:0), and p(yi) is the predicted probability that video i is fake.

Results73 of the competition were discussed by Jiang et al. [26]. The top solution used three models, i.e., EfficientNet-B0, EfficientNet-B1 and EfficientNet-B2, for classification. The second top used EfficientNet-B5 for both an image-based model and a video-based model. The third ranked solution used a 3D convolutional neural network (3DCNN).

Figure 5: Final results for the 2020 CelebA-Spoof Face Anti-Spoofing Challenge.

\ The Face Forgery Analysis Challenge 202174 is a competition hosted at the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021). It is organised by researchers from a number of organisations in China including universities and SenseTime Research (the research arm of SenseTime75, one of the major AI “unicorns” in China). The challenge aims to advance the state-of-the-art in detection of photo-realistic manipulation of images and videos. Participants are able to use a large annotated face dataset (i.e., the ForgeryNet dataset) that was obtained by applying a number of techniques for manipulation (15) and perturbation (36) to train their solutions. The phases comprise of Forgery Image Analysis, Forgery Video Analysis, Forgery Video Temporal Localization phases, and the final phase (i.e., “private test”) where participants’ models will be tested against an unseen dataset. The following metrics will be used [25]: AUC, average precision (AP) at some “temporal Intersection over Union” (AP@tIoU) compared to a threshold t ∈ [0*.5,* 0*.*95], and average recall (AR) at K (AR@K) where K is the top K labels returned for multi-class classifiers.

The 2020 CelebA-Spoof Face Anti-Spoofing Challenge76 was hosted at the 16th European Conference on Computer Vision (ECCV 2020). The challenge ran between August and October 2020, and aimed to advance the state-of-the-art in detecting “whether a presented face is live or spoof ” [76]. The organisers made the face CelebA-Spoof dataset available for the competition containing rich annotation across a range of attributes. The competition only had one phase where participants submitted their solutions to be evaluated against a test dataset; the spoof class was considered as “positive” and the live class as “negative”. Metric TPR@FPR was used and collected at three points where the TPR when FPR = 104 determined the final ranking. The top three finalists (see Figure 5) used deep learning models ResNet, EfficientNet-B7, and a novel architecture combining Central Difference Convolutional Networks (CDCN) and Dual Attention Network (DAN). The two top ranked solutions used different strategies to boost their models’ performance: a heuristic voting scheme was used by the top-ranked solution, and a weight-after-sorting strategy was used by the second ranked solution.

The 2021 CSIG Challenge77 is the second edition of a challenge organised by the China Society of Image and Graphics78. The 2021 challenge has the Fake Media Forensic Challenge79 as its 6th track, co-organised by CSIG’s Digital Media Forensics and Security Technical Committee80 and Institute of Information Engineering, Chinese Academy of Sciences81. This track has two tasks, one on deepfake video detection, and the other on deepfake audio/speech detection. For the deepfake video detection task, the dataset used contains a public training set with 10,000 sound-free face videos (including 4,000 fake videos), a public test set with 20,000 face videos (the percentage of deepfake videos is unknown to participants), and a private test set that will be determined and used at the final session for selecting the winners. All videos contain faces of Eastern Asian people, and cover a wide range of parameters such as multiple resolutions and encoding quality factors, the use of blurring or sharpening filters, and added noise. Deepfake videos were created using public tools including DeepFaceLab [53], Faceswap82, Faceswap-GAN, Recycle-GAN [6] and ALAE (Adversarial Latent Autoencoders) [54]. For the deepfake audio/speech detection task, the dataset used contains a public training set with 10,000 speech samples (including 6,000 fake ones), a public test set with 20,000 face videos (the percentage of deepfake videos is unknown to participants), and a private test set for the final session (the same as the deepfake video detection task). The tools used for generating the fake speech samples include TTS (text-to-speech) voice synthesis tools and VC (voice conversion) tools. The main TTS tools used include open-source tools such as DeepVoice, TensorFlowTTS83 and GAN-TTS [8] and commercial software tools such as those from iFlytek84 and IBM. The main VC tools used include Adaptive-VC and CycleGAN-VC [29]. For both deepfake detection tasks, the performance metric used is log loss.

2020 China Artificial Intelligence85 was the second edition of a Chinese AI competition open for the general public to participate, organised by the municipal government of the City of Xiamen in China. In 2020, it had two sub-competitions, Multimedia Information Recognition Technology Competition86 and Language and Knowledge Technology Competition87. The Multimedia Information Recognition Technology Competition included two tasks on deepfakes: one on deepfake video detection88 and one on deepfake audio/speech detection89. The deepfake video detection task used 3,000 videos, and log loss was used as the sole performance metric. The deepfake audio/speech detection task used 20,000 audio samples (mostly in Chinese, and the remaining in English), and EER was used as the sole performance metric. For both tasks, the ratio between real and deepfake samples was 1:1. We did not find where to download the datasets used for the tasks nor a more detailed technical description of the datasets. For the deepfake video detection tasks, the top two winning teams (with an A prize) were from Netease (Hangzhou) Network Co., Ltd. and Beijing RealAI Technology Co., Ltd., followed by three other teams winning a B prize: Xiamen Fuyun Information Technology Co., Ltd.; Institute of Computing Technology, Chinese Academy of Sciences; and Wuhan Daqian Information Technology Co., Ltd. For the deepfake audio/speech task, there was no team winning an A prize, but one team winning a B prize: SpeakIn Technologies Co., Ltd. The final results of some teams were published, but some teams were allowed to hide their results. We did not find a detailed technical report summarising the results and explaining the work of the winning teams.

One of the B-prize winning team is from Beijing RealAI Technology Co., Ltd., a Chinese company active in deepfake-related R&D.

\

5.2      Generation of Manipulated Media

The Voice Conversion Challenge90 is a biennial competition that has been running since 2016. The challenge and the corresponding workshop, hosted at the INTERSPEECH conference91, is supported by the SynSig (Speech Synthesis Special Interest Group)92 of the International Speech Communication Association (ISCA)93. Its aim is to promote progress in voice conversion (VC) technology that can be applied to a number of positive and negative use cases, such as spoofing voice biometric systems. The 2020 challenge focused on speaker conversion, a sub-problem of VC, and included two tasks. For the first task “intra-lingual semi-parallel voice conversion”, participants had to develop 16 VC systems (speaker-pair combinations) including male and female speakers and English sentences, using the provided Voice Conversion Challenge 2020 database v1.0 for training (refer to Section 4). For the second task “cross-lingual voice conversion”, participants had to develop 24 VC systems, also including male and female speakers, but uttering sentences in three languages (Finnish, German and Mandarin), based on the provided training dataset. Figure 6 illustrates the process of training and generation of VC systems.

Submissions were evaluated for “perceived naturalness and similarity through listening tests94. As such, the organisers used subjective evaluation [70] and recruited both native and non-native English speakers (i.e., Japanese native speakers) via crowd-sourcing for the listening tests. Naturalness (answering the question “How natural does the converted voice sound? ”) was measured using the metric MOS (covered in Section 4.6), and similarity (answering the question “how similar the converted voice sound comparing source and target speakers? ”) was measured in terms of speaker recognition as “same” or “different”, as elaborated by Wester et al. [68]. Tests also focused on the effects of language differences on the performance of VC systems submitted to the competition. The most popular CNN/RNN/GANbased VC systems submitted used WaveNet, WaveRNN, and Parallel WaveGAN. Results indicated that, in terms of similarity, the best performing VC systems were as good as natural speech but none reached human-level naturalness for task 1; scores were lower for task 2 which was more complex [70]. The organisers of the 2020 competition also used objective evaluation [12]. The metrics used for evaluation of speaker similarity were: equal error rate (EER), false acceptance rate of target (P tar fa ), miss rate of source (P src miss), and cosine similarity of speaker embedding vectors (cos-sim) according to Eq. (19) where A is the speaker embedding vectors for the converter audio and B is the speaker embedding vectors for the original audio. The performance of the VC systems as a spoof countermeasure was also evaluated using EER, while to evaluate the quality of the subjective MOS obtained via listening tests, a DL-based model to predict MOS, called MOSNet [43], was used. Lastly, to evaluate intelligibility of the converted transcribed speech, in comparison with the original transcribed speech, the word error rate (WER) [4] was used. WER is calculated according to Eq. (20) where I refers to insertions, D refers to deletions, S refers to substitutions, and N refers to the total number of words in the original transcript.

Figure 6: Illustration of tasks for the Voice Conversion Challenge 2020, extracted from [70].

\

The Deepfake Africa Challenge (2021)95 is a new initiative of the AI Africa Expo, in partnership with a film and media production company (Wesgro) and the African Data Science competition platform Zindi. Its aim is “to create convincing deepfakes to highlight the power of this synthetic media, illustrating its creative potential for exploitation for both positive and negative outcomes and focusing debate about its ethical use / misuse in an African context ”. Eligible participants were required to be citizens and residents of the African continent. Submissions, accepted up to end of July 2021, can be either video or audio. Evaluation of submissions is defined in terms of artistic creativity, relevance of challenge topic, and innovation in the process of generation as long as participants use tools and packages publicly available. The top three finalists will receive a prize, present their work at the Expo, and will have to grant copyrights to Zindi. Unlike the other competitions reviewed in this section, which were focused on advancing the state-of-the-art in detection of synthetic or manipulated media, this competition focused on the generation of deepfake which seems more humanities-centred. This is a trend observed in arts [31] and culture [57].5.3      Generation and Detection of Manipulated Media

The DeepFake Game Competition (DFGC)96 is in its first edition, hosted at the 2021 International Joint Conference on Biometrics (IJCB 2021). Its organisers are mainly from the Institute of Automation Chinese Academy of Sciences (CASIA). The idea of the competition was to promote an adversarial game between agents pushing for advances in both deepfake creation and detection. In order to achieve this, a 6-stage protocol was designed interleaving three creation phase (C-phase) and detection phase (D-phase), typically one week apart; submissions closed in April 2021. Both C-phases and D-phases were bound to the Celeb-DF (v2) dataset [40], containing 6,229 videos (590 real/original videos and 5,639 fake/manipulated videos), for training purposes. As such, submissions to a C-phase would consist of datasets extracted from Celeb-DF (v2) which included novel face-swap approaches to obtain evaluation results. Submissions to a D-phase would consist of detection models/codes to obtain evaluation results. The models submitted for a D-phase were evaluated against the datasets submitted for the previous C-phase [52]. The metrics used for evaluation97 were: a detection score, used for evaluation of a D-phase, and a creation score, used for evaluation of a C-phase. The top three finalists for the detection phase employed CNN-based classifiers EfficientNet-B3, Efficientnet-B0 and EfficientNetV2.

The Detection Score (DS) metric captures the models’ ability to correctly classify fake images submitted to the previous C-phase against a set of real images in the CelebDF test dataset. It is calculated using Eq. (21), where NC is the number of valid submissions of created synthesis test sets in the last C-phase.

The Creation Score (CS) metric used to evaluate creation models submitted to this challenge is calculated by Eq. (22), where ND is the number of valid submissions of detection methods in the last D-phase, the noise score (Snoise) penalises noisy images, the other three parts of the equation relate to the following98: “ID level similarity to the donor ID, image level similarity to the target frame, and the deception ability against detection models. ID level similarity is scored by a face recognition model using dot product of two ID features (fake face ID and donor ID). The image level similarity is scored by SSIM [Structural Similarity Index] to make sure the face-swapped image is similar to the corresponding target image in content and quality ”.

Peng et al. [52] observed a commonality between the three winning teams for the creation task, i.e., the use of the FaceShifter [37] framework for face swapping. They highlighted two overall reflections about the competition: (1) the limited diversity of the deepfake datasets submitted and the use of repetitive methods to generate them, and (2) the limited size of the Celeb-DF (v2) dataset itself flagging the need for a larger dataset for next year’s competition. The organisers of the competition also applied the top two detection models to unseen datasets (DFDC and FaceForensics++) and noticed that they do not generalise well.

\

6      A Meta-Review of Deepfake-Related Surveys

This section presents a meta-review of 12 selected deepfake-related survey papers, including eight published in English [16, 45, 46, 6466, 71, 73] and four published in Chinese [7, 38, 41, 59]. It covers the following aspects in a systematic manner: definitions and scope, performance metrics, datasets, challenges/competitions/benchmarks, performance comparison, key challenges and recommendations.

The meta-review aims at drawing some high-level insights for monitoring future development of deepfake-related technologies and their applications.

\

6.1      Definitions and Scope

As we discussed in Section 1.1, among researchers, practitioners and law makers there is no universally accepted definition of “deepfake” as a term. This is also reflected in how the authors of the 12 survey papers considered this aspect. Most authors talked about the history of deepfakes and pointed out that the term reflects the combination of “deep learning” and “fake”, but some used a broader definition, e.g., Lyu [45] defined deepfake as “high quality fake videos and audios generated by AI algorithms”. Some authors also referred to deepfake-related legislations, but none of them pointed out that the definitions in some such legislations are completely different from the more technical definitions involving the use of deep learning. No authors discussed the blurred boundary between deepfakes and non-deepfakes, although some surveys actually cover both, e.g., Tao et al. [59] focused on speech forgery and did not explicitly highlight “deepfake”.

In terms of the scope, while some authors (correctly) considered all types of media that can be produced by deepfake-related techniques [38, 41, 45, 65], some considered only a narrow scope, e.g., authors of [7, 64, 71, 73] considered only videos, and only authors of [16, 66] have considered images and videos. Another phenomenon we observed is that many authors focused more on face images and videos, and authors of three surveys [16, 64, 71] even limited the definition of “deepfake” to such a narrow scope:

  • Deshmukh and Wankhade [16] defined it as “a technology which creates fake images or videos of targeted humans by swapping their faces [by] another character saying or doing things that are not absolutely done by them and humans start believing in such fake as it is not always recognisable with the everyday human eye”;
  • Younus and Hasan [71] considered deepfake as a technique allowing “any computer user to exchange the face of one person with another digitally in any video”; and
  • Tolosana et al. [64] defined it as “a deep learning based technique able to create fake videos by swapping the face of a person by the face of another person”.

Such unnecessarily narrow definitions and scopes can lead to confusion and do not help exchanges between researchers and practitioners working on different types of deepfakes.

We call on more researchers to accept a broader definition of “deepfake” so that highly realistic/natural media of any kind generated by a sophisticated automated method (often AI-based) is considered deepfake. Here, we provide two examples of such a broader definition: the image2image (or pixel2pixel) technique [80] that allows the production of deepfake images and videos of any objects (e.g., the “horse2zebra” deepfake image shown in Figure 7), and the the so-called “deepfake geography [77]”, where AI-based techniques are used to generate realistic-looking satellite images.

Figure 7: An image of a horse (left) and a deepfake image generated using the image2image technique proposed in [78] (right).

\ Another important fact missed or not sufficiently discussed by authors of all the 12 surveys is that deepfake techniques can be used for positive applications, e.g., creative arts, entertainment and protecting online users’ privacy. We call for more researchers and practitioners to follow the proposal in the 2020 Tencent AI White Paper [60] to start using the more neutral-sounding term “deep synthesis”. Accordingly,we can use different words for different types of data generated using “deep synthesis” techniques, e.g., “deep art”, “deep animation”, “deep music”, and “deepfake”. While authors of the 12 survey papers did not recognise the positive applications of “deepfake” technologies, some other researchers did, e.g., organisers of the Voice Conversion Challenge 202099 who said the VC technology (for speech deepfake) “is useful in many applications, such as customizing audio book and avatar voices, dubbing, movie industry, teleconferencing, singing voice modification, voice restoration after surgery, and cloning of voices of historical persons”.

\

6.2     Performance Metrics

Surprisingly, none of the 12 surveys have covered performance metrics explicitly. Some directly used performance metrics to explain and compare performance of covered deepfake generation and detection methods. The most used performance metrics include accuracy, ERR, and AUC. This may be explained by the page constraints of such survey papers, which did not allow the authors to extend their coverage significantly to cover performance metrics systematically. The subjective quality of deepfakes is an area least covered by the surveys, which seems related to an unbalanced coverage on deepfake generation and deepfake detection in terms of performance evaluation and comparison (the former much less than the latter).

\

6.3     Datasets

Many of the 12 survey papers list a number of deepfake-related datasets, but none of them have coverage as complete as ours shown in Section 4. For instance, none of the surveys have covered the Voice Conversion Challenge 2016/2018/2020 datasets and the ASVspoof 2019/2021 datasets are covered briefly only in two surveys [38, 59]. In addition, more recent deepfake datasets especially those released in 2021 are also not covered by any of the surveys. We believe that our Section 4 is the most comprehensive review of deepfake-related datasets so far.

Some survey papers include datasets that are likely deepfakes, e.g., Verdoliva [66] covered many general fake image datasets where the manipulated images were not generated by deep learning or even AI-based methods, and some surveys (e.g., [38]) mentioned ASVspoof 2015 datasets but we did not see the use of deep learning for generating data used in the dataset.

\

6.4     Challenges, Competitions and Benchmarks

Many surveys cover deepfake-related challenges, competitions and benchmarks. The coverage is, however, mostly limited, and some challenges (e.g., the Voice Conversion Challenge 2016/2018/2020 and the two Chinese challenges we covered in Section 5) are not covered by any of the surveys. The level of detail of challenges, competitions and benchmarks is also normally limited, compared with what we chose to include in Section 5. Similar to the datasets we covered in Section 4, we believe that our coverage of deepfake-related challenges, competitions and benchmarks in Section 5 is also the most comprehensive so far.

\

6.5      Performance Comparison

Most surveys have a good coverage of related methods for deepfake generation and detection, but only some explicitly covered performance comparison between different methods [38, 46, 64].

Among all the survey papers, Li et al. [38] conducted the most comprehensive study on performance of different deepfake detection methods. In addition to showing the performance metrics of a number of deepfake detection methods in Table 3 of [38], they also looked at general characteristics and issues of different types of deepfake detection methods, as shown in Table 4. Furthermore, they also looked at research on robustness of deepfake detection methods against adversarial samples, referring to some work that showed a lack of such robustness.

Due to quality issues of many deepfake-related datasets (discussed in Section 4.6), we need to treat any performance metrics and comparison of different detection methods with caution. Without testing all methods on a sufficiently large, diverse and high-quality deepfake dataset, the performance comparison results can be misleading. This highlights the importance of having more challenges, competitions and benchmarks to encourage performance comparison on standard datasets and using consistent performance metrics.

Table 4: Comparison of different deepfake detection methods as shown in Table 4 of [38].

\

6.6     Challenges and Recommendations

The authors of some surveys identified some key challenges and future research directions for the deepfake community.

Not surprisingly, how to develop more robust, scalable, generalisable and explainable deepfake detection methods is one of the most discussed key challenges and also a major future research direction [7, 16, 38, 41, 45, 59, 65, 66, 71]. Considering the arms race between deepfake generation and detection, this research direction will likely remain the hottest topic in deepfake research.

A couple of surveys [38, 66] mentioned fusion as a key future research direction, where “fusion” refers to combining different methods (e.g., combining multiple detectors of different types) and data sources (e.g., jointly considering audio-visual analysis) to achieve better performance for deepfake detection. Lyu [45] suggested that, for detection of deepfake videos, we need to consider video-level detection more, which can be considered fusion of detection results of all video frames.

The authors of three surveys, Lyu [45] , Deshmukh and Wankhade [16] and Younus and Hasan [71], argued that better (higher-quality, more up-to-date, and more standard) deepfake datasets are needed to develop more effective deepfake detection methods. Lyu [45] also suggested that we need to consider social media laundering effects in training data and improve the evaluation of datasets. We agree with them on these points.

Tao et al. [59] suggested that low-cost deepfake generation/detection should be considered as a future research direction. This is a valid recommendation since lightweight methods will allow less powerful computing devices (e.g., IoT devices) to benefit from such technologies.

Two Chinese surveys [38, 41] also mentioned the need to have new deepfake-related legislations combating malicious use of deepfakes and the need to train end users such as journalists. This is likely an area where interdisciplinary research can grow.

There are also other ad-hoc recommendations given by the authors of some surveys. For example, Lyu [45] argued that deepfake detection should be considered a (more complicated) multi-class, multi-label and local detection problem. Tolosana et al. [64] discussed specific research directions for different deep-fake generation methods (face synthesis, identity swap, attribute manipulation, and expression swap). Liang et al. [41] and Li et al. [38] recommended more active defence mechanisms such as using digital watermarking and blockchain technologies to build trustworthy media frameworks against deepfakes.

\

7       Conclusion

The rapid growth in the capability to manipulate media or create synthetic media which look realistic and natural paved the way for deepfakes. At first, this paper adopted a critical approach to look at different definitions of the term “deepfake”. In that regard, we point out the different contradicting definitions and call for the wider community to consider how to define a new term that has a more consistent scope and meaning. For instance, replacing “deepfake” by “deep synthesis” can be more inclusive by embracing positive applications of deepfake techniques, e.g., in entertainment and for simulation purposes.

This paper provided a comprehensive overview of multiple aspects of the deepfake ecosystem drawing from the research literature and other online sources published in two languages: English and Chinese. It covers commonly used performance metrics and standards, related datasets, challenges, competitions and benchmarks. It also presents a meta-review of 12 selected deepfake-related survey papers published in 2020 and 2021, covering not only the above mentioned aspects, but also highlighting key challenges and recommendations.

\

References

[1]   Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. 2018. MesoNet: A Compact Facial Video Forgery Detection Network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security. IEEE, 1–7. https://doi.org/10.1109/WIFS.2018.8630 761

[2]   Henry Ajder, Giorgio Patrini, Francesco Cavalli, and Laurence Cullen. 2019. The State of Deepfakes: Landscape, Threats, and Impact. Deeptrace. , 27 pages.  https://sensity.ai/reports/

[3]   Zahid Akhtar and Tiago H. Falk. 2017. Audio-Visual Multimedia Quality Assessment: A Comprehensive Survey. IEEE Access 5 (2017), 21090–21117. https://doi.org/10.1109/ACCESS.2017.2750918

[4]   Ahmed Ali and Steve Renals. 2018. Word Error Rate Estimation for Speech Recognition: e-WER. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 20–24. https://doi.org/10.18653/v1/P18-2004

[5]   Sercan O¨ . Arık, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou. 2018. Neural Voice Cloning with a Few Samples. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates Inc., 10040–10050. https://papers.nips.cc/paper/201 8/hash/4559912e7a94a9c32b09d894f2bc3c82-Abstract.html

[6]   Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. 2018. Recycle-GAN: Unsupervised Video Retargeting. In Proceedings of the 2018 European Conference on Computer Vision. Springer, 17 pages.  https://doi.org/10.1007/978-3-030-01228-1 8

[7]   Yu-xuan Bao, Tian-liang Lu, and Yan-hui Du. 2020. Overview of Deepfake Video Detection Technology. Computer Science 47, 9 (2020), 283–292. https://doi.org/10.11896/jsjkx.200400130

[8]   Mikol-aj Bin´kowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan. 2019. High Fidelity Speech Synthesis with Adversarial Networks.  https://doi.org/10.48550/ARXIV.1909.11646

[9]   ![](file:///C:/Users/user/AppData/Local/Temp/msohtmlclip1/01/clip_image050.gif)Madeline Brady. 2020. Deepfakes: A New Desinformation Threat? Report by the Democracy Reporting International. , 9 pages. https://democracy-reporting.org/dri publications/de epfakes-a-new-disinformation-threat/

[10]   Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. 2020. FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020), 17 pages. https://doi.org/10.1109/TPAMI.2020.3009287

[11]   Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K. Jain. 2020. On the Detection of Digital Face Manipulation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 10 pages. https://doi.org/10.1109/CVPR42600.2020.00582

[12]  Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhen-Hua Ling, Junichi Yamagishi, Zhao Yi, Xiaohai Tian, and Tomoki Toda. 2020. Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions. In Proceedings of the Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. International Speech Communication Association, 99–120. https://doi.org/10.21437/VCC BC.2020-15

[13] H´ector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, and Junichi Yamagishi. 2021. ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan.  https://www.asvspoof.org/asvspoof2021/asvspoof2021 evaluation plan.pdf

[14]   H´ector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, and Junichi Yamagishi. 2021.

ASVspoof 2021 Challenge - Logical Access Database.   https://doi.org/10.5281/zenodo.4837263

[15]   H´ector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, and Junichi Yamagishi. 2021. ASVspoof 2021 Challenge - Speech Deepfake Database.  https://doi.org/10.5281/zenodo.4835108

[16]   ![](file:///C:/Users/user/AppData/Local/Temp/msohtmlclip1/01/clip_image051.gif)Anushree Deshmukh and Sunil B. Wankhade. 2021. Deepfake Detection Approaches Using Deep Learning: A Systematic Review. In Intelligent Computing and Networking: Proceedings of IC-ICN 2020 (Lecture Notes in Networks and Systems, Vol. 146). Springer, 293–302. https://doi.org/ 10.1007/978-981-15-7421-4 27

[17]   Xinyi Ding, Zohreh Raziei, Eric C. Larson, Eli V. Olinick, Paul Krueger, and Michael Hahsler. 2020. Swapped Face Detection using Deep Learning and Subjective Assessment. EURASIP Journal on Information Security 2020, 1 (2020), 1–12. https://doi.org/10.1186/s13635-020-00109-8

[18]   Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. 2020. The DeepFake Detection Challenge (DFDC) Dataset.  https://doi.org/10.48550/ARXIV.2006.07397

[19]   Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. 2020. The DeepFake Detection Challenge (DFDC) Dataset. arXiv:2006.07397. https://arxiv.org/abs/2006.07397

[20]   Nick Dufour and Andrew Gully. 2019. Contributing Data to Deepfake Detection Research. Google AI Blog.   https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detectio n.html

[21]   Ricard Durall, Margret Keuper, Franz-Josef Pfreundt, and Janis Keuper. 2019. Unmasking Deep-Fakes with Simple Features.   https://doi.org/10.48550/ARXIV.1911.00686

[22]   Cristian Canton Ferrer, Brian Dolhansky, Ben Pflaum, Joanna Bitton, Jacqueline Pan, and Jikuo Lu. 2020. Deepfake Detection Challenge Results: An Open Initiative to Advance AI. Meta AI Blog. https://ai.facebook.com/blog/deepfake-detection-challenge-results-an-open-initia     tive-to-advance-ai/

[23]   Gereon Fox, Wentao Liu, Hyeongwoo Kim, Hans-Peter Seidel, Mohamed Elgharib, and Christian Theobalt. 2021. Videoforensicshq: Detecting High-Quality Manipulated Face Videos. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo. IEEE, 1–6. https://doi.or g/10.1109/ICME51207.2021.9428101

[24]   Haiying Guan, Andrew Delgado, Yooyoung Lee, Amy N. Yates, Daniel Zhou, Timothee Kheyrkhah, and Jon Fiscus. 2021. User Guide for NIST Media Forensic Challenge (MFC) Datasets. https://doi.org/10.6028/NIST.IR.8377

[25]   Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guojun Yin, Luchuan Song, Lu Sheng, Jing Shao, and Ziwei Liu. 2021. ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE,  4360–4369.    https://doi.org/10.1109/CVPR46437.2021.00434

[26]   Liming Jiang, Zhengkui Guo, Wayne Wu, Zhaoyang Liu, Ziwei Liu, Chen Change Loy, Shuo Yang, Yuanjun Xiong, Wei Xia, Baoying Chen, Peiyu Zhuang, Sili Li, Shen Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, Liujuan Cao, Rongrong Ji, Changlei Lu, and Ganchao Tan. 2021. DeeperForensics Challenge 2020 on Real-World Face Forgery Detection: Methods and Results. arXiv:2102.09471. https://arxiv.org/pdf/2102.09471.pdf

[27]   Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. 2020. DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2886–2895. https://doi.org/ 10.1109/CVPR42600.2020.00296

[28]   Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. https://doi.org/10.48550/ARXIV.1802.08435

[29]   Takuhiro Kaneko and Hirokazu Kameoka. 2017. Parallel-Data-Free Voice Conversion Using Cycle-Consistent  Adversarial  Networks.   https://doi.org/10.48550/ARXIV.1711.11293

[30]   Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-based Generator Architecture for Generative Adversarial Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 4401–4410. https://doi.org/10.1109/CVPR.2019.00453

[31]   Ross Kelly. 2021. Digital Art Exhibition to Showcase Creative Potential of AI. DIGIT News. https://www.digit.fyi/digital-art-exhibition-to-showcase-creative-potential-of-ai/

[32]   Ali Khodabakhsh, Raghavendra Ramachandra, Kiran Raja, Pankaj Wasnik, and Christoph Busch. 2018. Fake Face Detection Methods: Can They Be Generalized?. In Proceedings of the 2018 International Conference of the Biometrics Special Interest Group. IEEE, 1–6. https://doi.org/10.2 3919/BIOSIG.2018.8553251

[33]   Hyeongwoo Kim, Mohamed Elgharib, Hans-Peter Zoll¨ofer, Michael Seidel, Thabo Beeler, Christian Richardt, and Christian Theobalt. 2019. Neural Style-Preserving Visual Dubbing. ACM Transactions on Graphics 38, 6, Article 178 (2019), 13 pages. https://doi.org/10.1145/3355089.3356500

[34]   Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick P´erez, Christian Richardt, Michael Zollh¨ofer, and Christian Theobalt. 2018. Deep Video Portraits. ACM Transactions on Graphics 37, 4, Article 163 (2018), 14 pages. https://doi.org/ 10.1145/3197517.3201283

[35]   Pavel Korshunov and S´ebastien Marcel. 2019. Vulnerability Assessment and Detection of Deepfake Videos. In Proceedings of the 2019 International Conference on Biometrics. IEEE, 1–6. https://doi.org/10.1109/ICB45273.2019.8987375

[36]   Patrick Kwon, Jaeseong You, Gyuhyeon Nam, Sungwoo Park, and Gyeongsu Chae. 2021. KoDF: A Large-scale Korean DeepFake Detection Dataset. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. IEEE, 10724–10733. https://doi.org/10.1109/ICCV 48922.2021.01057

[37]   Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. 2020. FaceShifter: Towards High Fidelity And Occlusion Aware Face Swapping. arXiv:1912.13457. https://arxiv.org/abs/1912.13457

[38]   Xurong Li, Shouling Ji, Chunming Wu, Zhenguang Liu, Shuiguang Deng, Peng Cheng, Min Yang, and Xiangwei Kong. 2021. Survey on Deepfakes and Detection Techniques. Journal of Software 32, 2 (2021), 496–518. http://www.jos.org.cn/1000-9825/6140.htm

[39]   Yuezun Li, Ming-Ching Chang, and Siwei Lyu. 2018. In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security. IEEE, 1–7. https://doi.org/10.1109/WIFS.2018.8630787

[40]   Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. 2020. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 3204–3213. https://doi.org/10.1109/CVPR42600. 2020.00327

[41]   Ruigang Liang, Peizhuo Lv, Yue Zhao, Peng Chen, Hao Xing, Yingjun Zhang, Jizhong Han, Ran He, Xianfeng Zhao, Ming Li, and Kai Chen. 2020. A Survey of Audiovisual Deepfake Detection Techniques. Journal of Cyber Security 5, 2 (2020), 1–17. http://jcs.iie.ac.cn/xxaqxb/ch/re ader/view abstract.aspx?file no=20200202&flag=1

[42]   Steven R. Livingstone and Frank A. Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PloS one 13, 5 (2018), 35 pages.

[43]   Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. 2021. MOSNet: Deep Learning based Objective Assessment for Voice Conversion. arXiv:1904.08352. https://arxiv.org/pdf/1904.08352.pdf

[44]   Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling. 2018. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. In Proceedings of the Odyssey 2018 The Speaker and Language Recognition Workshop. International Speech Communication Association, 195–202. https://doi.org/10.21437/Odyssey.2018-28

[45]   Siwei Lyu. 2020. Deepfake Detection: Current Challenges and Next Steps. In Proceedings of the 2020 IEEE International Conference on Multimedia Expo Workshops. IEEE, 6 pages. https://doi.org/10.1109/ICMEW46912.2020.9105991

[46]   Yisroel Mirsky and Wenke Lee. 2021. The Creation and Detection of Deepfakes: A Survey. ACM Computing Survey 54, 1, Article 7 (2021), 41 pages. https://doi.org/10.1145/3425780

[47]   Gautham J. Mysore. 2015. Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges. IEEE Signal Processing Letters 22, 8 (2015), 1006–1010. https://doi.org/10.1109/LSP.2014.2379648

[48]   Jo˜ao C. Neves, Ruben Tolosana, Ruben Vera-Rodriguez, Vasco Lopes, Hugo Proen¸ca, and Julian Fierrez. 2020. GANprintR: Improved Fakes and Evaluation of the State of the Art in Face Manipulation Detection. IEEE Journal of Selected Topics in Signal Processing 14, 5 (2020), 1038–1048. https://doi.org/10.1109/JSTSP.2020.3007250

[49]   NIST Media Forensics Challenge Team. 2021. Open Media Forensics Challenge 2020-2021 Evaluation Plan.   https://mig.nist.gov/MFC/Web/EvalPlan2020/OpenMFC2020EvaluationPlan.pdf

[50]   Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio.  https://doi.org/10.48550/ARXIV.1609.03499

[51]   Debajyoti Pal and Tuul Triyason. 2018. A Survey of Standardized Approaches towards the Quality of Experience Evaluation for Video Services: An ITU Perspective. International Journal of Digital Multimedia Broadcasting 2018, Article 1391724 (2018), 25 pages. https://doi.org/10.1155/20 18/1391724

[52]   Bo Peng, Hongxing Fan, Wei Wang, Jing Dong, Yuezun Li, Siwei Lyu, Qi Li, Zhenan Sun, Han Chen, Baoying Chen, Yanjie Hu, Shenghai Luo, Junrui Huang, Yutong Yao, Boyuan Liu, Hefei Ling, Guosheng Zhang, Zhiliang Xu, Changtao Miao, Changlei Lu, Shan He, Xiaoyan Wu, and Wanyi Zhuang. 2021. DFGC 2021: A DeepFake Game Competition. arXiv:2106.01217. https:

//arxiv.org/abs/2106.01217

[53]   Ivan Perov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Um´e, Mr. Dpfks, Carl Shift Facenheim, Luis RP, Jian Jiang, Sheng Zhang, Pingyu Wu, Bo Zhou, and Weiming Zhang. 2020. DeepFaceLab: Integrated, Flexible and Extensible Face-swapping Framework. https://doi.org/10.48550/ARXIV.2005.05535

[54]   Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. 2020. Adversarial Latent Au-toencoders. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 10 pages. https://doi.org/10.1109/CVPR42600.2020.01411

[55]   Andreas R¨ossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2018. FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces. https://doi.org/10.48550/ARXIV.1803.09179

[56]   Andreas R¨ossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proceedings of the 2019 International Conference on Computer Vision. IEEE, 1–11. https://doi.org/10.110 9/ICCV.2019.00009

[57]   Anindya Sen. 2021. Art and Artificial Intelligence: How ‘Deepfakes’ Can Help Create Authentic Museum Experiences! Medium Blog. https://medium.com/art-world-zen/art-and-artificia l-intelligence-how-deepfakes-can-help-create-authentic-museum-experiences-65d8aa 7da29c

[58]   Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo. 2019. WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation. https://doi.org/10.48550/A RXIV.1904.02892

[59]   Jianhua Tao, Ruibo Fu, Jiangyan Yi, Chenglong Wang, and Tao Wang. 2020. Development and Challenge of Speech Forgery and Detection. Journal of Cyber Security 5, 2 (2020), 28–38. http://jcs.iie.ac.cn/xxaqxb/ch/reader/view abstract.aspx?file no=20200204&flag=1

[60]   Tencent. 2020. Artificial Intelligence White Paper. https://tech.sina.com.cn/roll/2020-07-14/doc-iivhvpwx5201226.shtml

[61]   Justus Thies, Michael Zollh¨ofe, and Matthias Niessner. 2019. Deferred Neural Rendering: Image Synthesis using Neural Textures. ACM Transactions on Graphics 38, Article 66 (2019), 12 pages. Issue  4.   https://doi.org/10.1145/3306346.3323035

[62]   Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio, Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi. 2016. The Voice Conversion Challenge 2016. In Proceedings of Interspeech 2016. International Speech Communication Association, 1632–1636. https://doi.org/10.21437/Interspeech.2016-1066

[63]   Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Hector Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee. 2019. ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection. arXiv:1904.05441. https://arxiv.org/ pdf/1904.05441.pdf

[64]   Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. Deepfakes and beyond: A Survey of face manipulation and fake detection. Information Fusion 64 (2020), 131–148.  https://doi.org/10.1016/j.inffus.2020.06.014

[65]   Xin Tong, Luona Wang, Xiaoqin Pan, and Jingya Wang. 2020. An Overview of Deepfake: The Sword of Damocles in AI. In Proceedings of the 2020 International Conference on Computer Vision, Image and Deep Learning. IEEE, 265–273. https://doi.org/10.1109/CVIDL51233.2020.00-88

[66]   Luisa Verdoliva. 2020. Media Forensics and DeepFakes: An Overview. IEEE Journal of Selected Topics in Signal Processing 14, 5 (2020), 910–932. https://doi.org/10.1109/JSTSP.2020.3002101

[67]   Xin Wang, Junichi Yamagishi, Massimiliano Todisco, H´ector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, S´ebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Fran¸cois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, and Zhen-Hua Ling. 2020. ASVspoof 2019: A Large-scale Public Database of Synthesized, Converted and Replayed Speech. Computer Speech & Language 64 (2020), 27 pages.  https://doi.org/10.1016/j.csl.2020.101114

[68]   Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi. 2016. Analysis of the Voice Conversion Challenge 2016 Evaluation Results. In Proceedings of the Interspeech 2016 Conference. International Speech Communication Association, 1637–1641. https://doi.org/10.21437/Interspeech.201 6-1331

[69]  Junichi Yamagishi, Massimiliano Todisco, Md Sahidullah, H´ector Delgado, Xin Wang, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Ville Vestman, and Andreas Nautsch. 2019. ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan. https://www.asvspoof.org/asvspoof2019/asvspoof2019   evaluation   plan.pdf

[70]  Zhao Yi, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhen-Hua Ling, and Tomoki Toda. 2020. Voice Conversion Challenge 2020 – Intra-lingual Semi-parallel and Cross-lingual Voice Conversion –. In Proceedings of the Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. International Speech Communication Association, 80–98. https://doi.org/10.21437/VCC BC.2020-14

[71]   Mohammed A. Younus and Taha M. Hasan. 2020. Abbreviated View of Deepfake Videos Detection Techniques. In Proceedings of the 2020 6th International Engineering Conference. IEEE, 115–120. https://doi.org/10.1109/IEC49899.2020.9122916

[72]   Guangtao Zhai and Xiongkuo Min. 2020. Perceptual Image Quality Assessment: A Survey. Science China Information Sciences 63, Article 211301 (2020), 52 pages. https://doi.org/10.1007/s11432-019-2757-1

[73]   Teng Zhang, Lirui Deng, Liang Zhang, and Xianglei Dang. 2020. Deep Learning in Face Synthesis: A Survey on Deepfakes. In Proceedings of the 2020 IEEE 3rd International Conference on Computer and Communication Engineering Technology. IEEE, 67–70. https://doi.org/10.1109/CCET5090 1.2020.9213159

[74]   Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, and Sanja Fidler. 2021. DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 10140–10150. https://doi.org/10.1109/CVPR46437.2021.01001

[75]   ![](file:///C:/Users/user/AppData/Local/Temp/msohtmlclip1/01/clip_image051.gif)Yuanhan Zhang, ZhenFei Yin, Yidong Li, Guojun Yin, Junjie Yan, Jing Shao, and Ziwei Liu. 2020. CelebA-Spoof: Large-Scale Face Anti-spoofing Dataset with Rich Annotations. In Proceedings of the 2020 European Conference on Computer Vision. Springer, 70–85. https://doi.org/10.1007/97 8-3-030-58610-2 5

[76]   Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu, Shuo Yang, Yuanjun Xiong, Wei Xia, Yan Xu, Man Luo, Jian Liu, Jianshu Li, Zhijun Chen, Mingyu Guo, Hui Li, Junfu Liu, Pengfei Gao, Tianqi Hong, Hao Han, Shijie Liu, Xinhua Chen, Di Qiu, Cheng Zhen, Dashuang Liang, Yufeng Jin, and Zhanlong Hao. 2021. CelebA-Spoof Challenge 2020 on Face Anti-Spoofing: Methods and Results. arXiv:2102.12642. https://arxiv.org/pdf/2102.12642.pdf

[77]   Bo Zhao, Shaozeng Zhang, Chunxue Xu, Yifan Sun, and Chengbin Deng. 2021. Deep Fake Ge-ography? When Geospatial Data Encounter Artificial Intelligence. Cartography and Geographic Information Science 48, 4 (2021), 338–352. https://doi.org/10.1080/15230406.2021.1910075

[78]   Peng Zhou, Xintong Han, Vlad I. Morariu, and Larry S. Davis. 2017. Two-Stream Neural Networks for Tampered Face Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 1831–1839. https://doi.org/10.1109/CVPRW.2017.229

[79]   Tianfei Zhou, Wenguan Wang, Zhiyuan Liang, and Jianbing Shen. 2021. Face Forensics in the Wild. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE,  5774–5784.    https://doi.org/10.1109/CVPR46437.2021.00572

[80]   Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision. IEEE, 2242–2251. https://doi.org/10.1109/ICCV.2 017.244

[81]   Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. 2020. WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection. In Proceedings of the 2020 28th ACM International Conference on Multimedia. ACM, 2382–2390. https://doi.org/10.1145/339417 1.3413769

\

:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

\

HackerNoon通讯:为何游戏生产系统中不存在“微小改动”(2026年2月28日)

2026-03-01 00:02:45

How are you, hacker?


🪐 What’s happening in tech today, February 28, 2026?


The HackerNoon Newsletter brings the HackerNoon homepage straight to your inbox. On this day, we present you with these top quality stories. From Why “Small Changes” Don’t Exist in Production Game Systems to How to Navigate Identity, Direction, Story, and Sovereignty in the Age of AI, let’s dive right in.

Why “Small Changes” Don’t Exist in Production Game Systems


By @ktdevjournal [ 5 Min read ] It doesn’t matter if you build games or a banking app - you don’t just have a pile of features and assets. You have an ecosystem for each bit of work Read More.

How to Navigate Identity, Direction, Story, and Sovereignty in the Age of AI


By @Lima_Writes [ 9 Min read ] When language comes back at you fast, coherent, and emotionally attuned, it feels like truth. Especially when you’re tired. Or lonely. Read More.


🧑‍💻 What happened in your world this week?

It's been said that writing can help consolidate technical knowledge, establish credibility, and contribute to emerging community standards. Feeling stuck? We got you covered ⬇️⬇️⬇️


ANSWER THESE GREATEST INTERVIEW QUESTIONS OF ALL TIME


We hope you enjoy this worth of free reading material. Feel free to forward this email to a nerdy friend who'll love you for it.See you on Planet Internet! With love, The HackerNoon Team ✌️


Go 1.22:循环作用域的变更

2026-03-01 00:00:21

Go 1.21 includes a preview of a change to for loop scoping that we plan to ship in Go 1.22, removing one of the most common Go mistakes.

The Problem

If you’ve written any amount of Go code, you’ve probably made the mistake of keeping a reference to a loop variable past the end of its iteration, at which point it takes on a new value that you didn’t want. For example, consider this program:

func main() {
    done := make(chan bool)

    values := []string{"a", "b", "c"}
    for _, v := range values {
        go func() {
            fmt.Println(v)
            done <- true
        }()
    }

    // wait for all goroutines to complete before exiting
    for _ = range values {
        <-done
    }
}

\ The three created goroutines are all printing the same variable v, so they usually print “c”, “c”, “c”, instead of printing “a”, “b”, and “c” in some order.

\ The Go FAQ entry “What happens with closures running as goroutines?”, gives this example and remarks “Some confusion may arise when using closures with concurrency.”

\ Although concurrency is often involved, it need not be. This example has the same problem but no goroutines:

func main() {
    var prints []func()
    for i := 1; i <= 3; i++ {
        prints = append(prints, func() { fmt.Println(i) })
    }
    for _, print := range prints {
        print()
    }
}

\ This kind of mistake has caused production problems at many companies, including a publicly documented issue at Lets Encrypt. In that instance, the accidental capture of the loop variable was spread across multiple functions and much more difficult to notice:

// authz2ModelMapToPB converts a mapping of domain name to authz2Models into a
// protobuf authorizations map
func authz2ModelMapToPB(m map[string]authz2Model) (*sapb.Authorizations, error) {
    resp := &sapb.Authorizations{}
    for k, v := range m {
        // Make a copy of k because it will be reassigned with each loop.
        kCopy := k
        authzPB, err := modelToAuthzPB(&v)
        if err != nil {
            return nil, err
        }
        resp.Authz = append(resp.Authz, &sapb.Authorizations_MapElement{
            Domain: &kCopy,
            Authz: authzPB,
        })
    }
    return resp, nil
}

\ The author of this code clearly understood the general problem, because they made a copy of k, but it turns out modelToAuthzPB used pointers to fields in v when constructing its result, so the loop also needed to make a copy of v.

\ Tools have been written to identify these mistakes, but it is hard to analyze whether references to a variable outlive its iteration or not. These tools must choose between false negatives and false positives. The loopclosure analyzer used by go vet and gopls opts for false negatives, only reporting when it is sure there is a problem but missing others. Other checkers opt for false positives, accusing correct code of being incorrect. We ran an analysis of commits adding x := x lines in open-source Go code, expecting to find bug fixes. Instead we found many unnecessary lines being added, suggesting instead that popular checkers have significant false positive rates, but developers add the lines anyway to keep the checkers happy.

\ One pair of examples we found was particularly illuminating:

This diff was in one program:

     for _, informer := range c.informerMap {
+        informer := informer
         go informer.Run(stopCh)
     }

\ And this diff was in another program:

     for _, a := range alarms {
+        a := a
         go a.Monitor(b)
     }

\ One of these two diffs is a bug fix; the other is an unnecessary change. You can’t tell which is which unless you know more about the types and functions involved.

The Fix

For Go 1.22, we plan to change for loops to make these variables have per-iteration scope instead of per-loop scope. This change will fix the examples above, so that they are no longer buggy Go programs; it will end the production problems caused by such mistakes; and it will remove the need for imprecise tools that prompt users to make unnecessary changes to their code.

\ To ensure backwards compatibility with existing code, the new semantics will only apply in packages contained in modules that declare go 1.22 or later in their go.mod files. This per-module decision provides developer control of a gradual update to the new semantics throughout a codebase. It is also possible to use //go:build lines to control the decision on a per-file basis.

\ Old code will continue to mean exactly what it means today: the fix only applies to new or updated code. This will give developers control over when the semantics change in a particular package. As a consequence of our forward compatibility work, Go 1.21 will not attempt to compile code that declares go 1.22 or later. We included a special case with the same effect in the point releases Go 1.20.8 and Go 1.19.13, so when Go 1.22 is released, code written depending on the new semantics will never be compiled with the old semantics, unless people are using very old, unsupported Go versions.

Previewing The Fix

Go 1.21 includes a preview of the scoping change. If you compile your code with GOEXPERIMENT=loopvar set in your environment, then the new semantics are applied to all loops (ignoring the go.mod go lines). For example, to check whether your tests still pass with the new loop semantics applied to your package and all your dependencies:

GOEXPERIMENT=loopvar go test

\ We patched our internal Go toolchain at Google to force this mode during all builds at the start of May 2023, and in the past four months we have had zero reports of any problems in production code.

\ You can also try test programs to better understand the semantics on the Go playground by including a // GOEXPERIMENT=loopvar comment at the top of the program, like in this program. (This comment only applies in the Go playground.)

Fixing Buggy Tests

Although we’ve had no production problems, to prepare for that switch, we did have to correct many buggy tests that were not testing what they thought they were, like this:

func TestAllEvenBuggy(t *testing.T) {
    testCases := []int{1, 2, 4, 6}
    for _, v := range testCases {
        t.Run("sub", func(t *testing.T) {
            t.Parallel()
            if v&1 != 0 {
                t.Fatal("odd v", v)
            }
        })
    }
}

\ In Go 1.21, this test passes because t.Parallel blocks each subtest until the entire loop has finished and then runs all the subtests in parallel. When the loop has finished, v is always 6, so the subtests all check that 6 is even, so the test passes. Of course, this test really should fail, because 1 is not even. Fixing for loops exposes this kind of buggy test.

\ To help prepare for this kind of discovery, we improved the precision of the loopclosure analyzer in Go 1.21 so that it can identify and report this problem. You can see the report in this program on the Go playground. If go vet is reporting this kind of problem in your own tests, fixing them will prepare you better for Go 1.22.

\ If you run into other problems, the FAQ has links to examples and details about using a tool we’ve written to identify which specific loop is causing a test failure when the new semantics are applied.

More Information

For more information about the change, see the design document and the FAQ.


David Chase and Russ Cox

\ This article is available on The Go Blog under a CC BY 4.0 DEED license.

\ Photo by Jp Valery on Unsplash

\

为何游戏系统中不存在“微小改动”

2026-03-01 00:00:03

It’s just a small change!

\ How often do we hear that we need to fix something? We need to add a small feature. We need to tweak something. Code-wise or publishing, just realized they need this for retention, or maybe an analyst brought the newest data, so now we have to add just a few lines to the code. They don’t affect performance or any other departments, I promise. And it’s just like 3 minutes of coder work - why not? Fast forward: they broke the “Buy” button on the front page of the store on release.

\ Why does this always happen with small changes? Well, if we think about it, we don’t usually think about it. Let me explain:

Designers think in features and user experience. \n Engineers think in whole systems. \n Producers think in tasks. \n Stakeholders think in business outcomes.

\ And one small change is always perceived as something isolated and usually without everyone’s awareness. So, it is basically a cognitive shortcut. And that happens not because everyone is wrong or unprofessional. It’s because modern production systems are highly interconnected, so it’s impossible to know what could potentially be affected by anything - especially if you haven’t worked on this project for 15 years.

Perception of a small change

What is modern production? I’m glad you asked!

\ It doesn’t matter if you build games or a banking app - you don’t just have a pile of features and assets. You have an ecosystem for each bit of work: Art, Code, Design, UI, Marketing, Publishing (maybe even Project Management - wow, you are a rich developer), etc. And each one of them has its own infrastructure, pipelines, workflows, and shared assets. To simplify, it can be shared data schemas, builds, automation processes, UI bindings, and many other things.

\ What’s wrong if I just make a small color change to one of the icons? Well, that means you spend 3 seconds changing a color code. Then you have to assemble a build. Then QA has to check your small change to confirm that you indeed changed the color. Then you have to assemble the build again, which should be in a queue with other builds in the waiting list.

\ Then we have to update the server with your changes - oh wait, did you tell anyone about that? No? Oh, that’s great, because you just submitted your changes during the commit freeze, and now deployment engineers have to fix the CI/CD pipeline, and we have to postpone the release for 4 days because it’s Friday.

\ And by the way - we have to communicate that to users because they were waiting for this new version, and some of them decided not to wait that long and removed your app. Whoops, that’s awkward. Sorry to hear that.

\ Production Ecosystem

That’s alright, I’m here to help you! Let me introduce you to Change Propagation Surface (CPS) - the number of systems, pipelines, assets, and workflows that a change must pass through before it reaches the player.

\ Your change should not be estimated by its task size, like “1 hour of work.” Your change equals CPS × Coupling Density (the amount of work other departments need to do in order for this change to pass).

\ Think about it this way:

  • One small UI tweak touches no shared data - low CPS.
  • A gameplay rule change touching code, balance, design, analytics, player experience - high CPS.

\ Let’s go back to the situation where you want to change the color of the icon. Those 3 seconds of work would affect UI, builds, player perception, experience, and design. It might also affect color coding for accessibility rules, plus build assembling, and finally server updates. It’s high CPS - of course, if you didn’t sneak that change in without everyone’s awareness (I see that - drop it!).

\ The same goes for asset swaps or changing a stat value: it affects memory, AI tuning, destruction logic, etc. Don’t do that unless someone from senior leadership said it’s low CPS - then just do it and see how it goes.

\ You can apply this approach basically anywhere in production because it is not an abstract thing at all and can be estimated.

\ For example:

  • UI
  • Backend
  • Client code
  • Game design
  • Economy
  • Analytics
  • QA
  • Build/CI/CD
  • Marketing
  • Customer support
  • Live operations

\ Each of these items counts as a plus 1 CPS factor. Subsequently, the more of the same “items” you touch, the higher the CPS number you will get. And with that information, you can create a small estimation matrix like:

CPS 1-2 - Local change \n CPS 3-5 - Cross-functional change \n CPS 6+ - Systemic change

\ One more time, the formula is: Impact = CPS × Coupling Density. Easy!

CPS explanation

Let’s see how it works in a real-life example:

\ So your developer went on holiday and completed a math course on LinkedIn. And when he came back, he said that there is a more efficient way of calculating EXP. This change is “one line of code.” Okay, but after reading this article, you already know how it works in reality and that it touches multiple things:

  • Game balance
  • Player progression pacing
  • Economy inflation
  • Battle pass progression
  • Retention metrics
  • Analytics dashboards
  • Marketing promises
  • Customer support load

\ That means CPS is more than 7. So now you see that even though the code diff is tiny, the propagation surface is systemic and has a massive potential outcome. In other words, if XP progression speeds up things like economy, availability of the content, battle pass value, retention curves, etc., you should know that even if the implementation takes about 10 minutes, the ripple effect can take weeks of work.

\ Why does live service production make it worse? Because it is amplified by content being reused across multiple features, by telemetry and economy being tightly coupled, and by systems being persistent and often requiring backward compatibility.

\ So, the real cost you pay for propagation lies in prolonged timelines, hidden rework, cross-team friction, technical debt, burnout, and eventually, people resigning directly or indirectly.

\ Instead of thinking, “Oh, this is a small change,” we should probably think, “What systems does this change touch?” Think about this as infrastructure, not a feature, and always try to bring that to cross-team awareness. And if you are capable enough, try to estimate the surface area, not just this exact small change.

\ **The whole point of my way-too-long introduction is that there is no such thing as a small change in production systems. There are only changes in misunderstood affected areas. And the more senior you become, the more your vision shifts toward understanding how this change will travel instead of trying to avoid the change altogether.

认识M6:这款能大规模理解文本与图像的中国人工智能

2026-02-28 23:28:23

\

:::info

Authors:

  1. Junyang Lin, [email protected] (Alibaba Group, China)
  2. Rui Men, [email protected] (Alibaba Group, China)
  3. An Yang, [email protected] (Alibaba Group, China)
  4. Chang Zhou, [email protected] (Alibaba Group, China)
  5. Ming Ding, [email protected] (Tsinghua University, China)
  6. Yichang Zhang, [email protected] (Alibaba Group, China)
  7. Peng Wang, [email protected] (Alibaba Group, China)
  8. Ang Wang, [email protected] (Alibaba Group, China)
  9. Le Jiang, [email protected] (Alibaba Group, China)
  10. Xianyan Jia, [email protected] (Alibaba Group, China)
  11. Jie Zhang, [email protected] (Alibaba Group, China)
  12. Jianwei Zhang, [email protected] (Alibaba Group, China)
  13. Xu Zou, [email protected] (Tsinghua University, China)
  14. Zhikang Li, [email protected] (Alibaba Group, China)
  15. Xiaodong Deng, [email protected] (Alibaba Group, China)
  16. Jie Liu, [email protected] (Alibaba Group, China)
  17. Jinbao Xue, [email protected] (Alibaba Group, China)
  18. Huiling Zhou, [email protected] (Alibaba Group, China)
  19. Jianxin Ma, [email protected] (Alibaba Group, China)
  20. Jin Yu, [email protected] (Alibaba Group, China)
  21. Yong Li, [email protected] (Alibaba Group, China)
  22. Wei Lin, [email protected] (Alibaba Group, China)
  23. Jingren Zhou, [email protected] (Alibaba Group, China)
  24. Jie Tang, [email protected] (Tsinghua University, China)
  25. Hongxia Yang, [email protected] (Alibaba Group, China)

:::

\n

ABSTRACT

In this work, we construct the largest dataset for multimodal pre-training in Chinese, which consists of over 1.9TB images and 292GB texts that cover a wide range of domains. We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer, for unified pretraining on the data of single modality and multiple modalities. We scale the model size up to 10 billion and 100 billion parameters, and build the largest pretrained model in Chinese. We apply the model to a series of downstream applications, and demonstrate its outstanding performance in comparison with strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation, and show that the finetuned M6 can create high-quality images with high resolution and abundant details.

KEYWORDS

Multimodal Pretraining; Multitask; Text-to-Image Generation

\

1      INTRODUCTION

Pretraining has become a focus in the research in natural language processing (NLP) [1, 2, 7, 16, 18, 19, 27, 31, 37, 44, 49]. The recent GPT-3 with over 175 billion parameters demonstrates that large models trained on big data have extremely large capacity and it can outperform the state-of-the-arts in downstream tasks especially in the zero-shot setting. Also, the rapid development of pretraining in NLP sparkles cross-modal pretraining. A number of studies [4, 11, 17, 22, 24, 25, 28, 29, 38, 51] have created new state-of-the-art performances for various cross-modal downstream tasks.

A pity is that most recent studies focus on the pretraining on English data. There are lack of both large-scale datasets in Chinese and large-scale models pretrained on the data of Chinese. Therefore, in this work, we develop a large-scale dataset M6-Corpus, which consists of over 1.9TB images and 292GB texts. To the best of our knowledge, this is the largest dataset in Chinese for pretraining in both multimodality and natural language. The dataset collected from the webpages consists of different types of data and covers a large scale of domains, including encyclopedia, question answering, forum discussion, product description, etc. Also, we design sophisticated cleaning procedures to ensure that the data are of high quality.

Furthermore, in order to sufficiently leverage such a large amount of high-quality data, we propose to build an extremely large model that can process data of multiple modalities and adapt to different types of downstream tasks. Thus we propose a novel model called M6, referring to MultiModality-to-MultiModality Multitask Mega-transformer. The model is based on the transformer, and it is pretrained with multiple tasks. Pretraining endows the model with the capability of single-modality and multimodality understanding and generation. Based on the architecture of M6, we build M6-10B and M6-100B, which are scaled up to 10 billion and 100 billion pa-rameters respectively. To be more specific, M6-100B is the recent largest model pretrained on Chinese data. We apply the model to a series of downstream applications, including product description generation, visual question answering, community question answering, Chinese poem generation, etc., and our experimental results show that M6 outperforms a series of strong baselines.

Another contribution of this work is that we first incorporate pretraining with text-to-image generation. Following Ramesh et al. [32], we leverage a two-stage framework for image generation. To be more specific, we apply a trained vector-quantized generative adversarial network to representing images with discrete image codes, and we then use the pretrained M6 to learn the relations between texts and codes. Such learning can bridge the two modalities and enables controllable text-to-image generation.

To summarize, the contributions of M6 are as follows:

  • We collect and build the largest Chinese multi-modal pre-training data in industry, which includes 300GB texts and 2TB images.
  • We propose M6 for multimodal pretraining in Chinese, and we scale the model size to up to 10 and 100 billion parameters. Both M6-10B and M6-100B are the recent largest multimodal pretrained model.
  • M6 is versatile and exceeds strong baselines by 11.8% in VQA, 18.4 in image captioning, and 10.3% in image-text matching. Furthermore M6 is able to generate high-quality images.
  • With carefully designed large-scale distributed training optimizations, M6 has obvious advantages in training speed and greatly reduces training costs, creating the possibility for more widespread use of multi-modal pretraining.

2      DATASET

We collect and develop the largest multi-modality and text dataset in Chinese for now, which is one of the key contributions of this paper. In this section, we first identify the limitations of existing datasets and then describe the construction and preprocessing procedure of our proposed dataset.

2.1      Existing Datasets

The construction of large-scale corpus with high quality and do-main coverage is crucial to Chinese pretraining. In early previous works, the Chinese Wikipedia1 is one of the most frequently used datasets to train Chinese language models. It contains 1.6GB texts (around 0.4B tokens) covering around 1M encyclopedia entries. Another corpus with a comparable size is the THUCTC[39] dataset, which includes 740K news articles. However, with the rapidly increasing capacity of recent language models, the scale of these existing datasets is clearly insufficient. Recently, Cui et al. [5] employ unreleased extended data that are 10 times larger than the CN-Wikipedia to pretrain their Chinese language model. Xu et al.

[47] released a 100GB corpus named CLUECorpus2020, which is retried from the multilingual Common Crawl dataset. However, the scale of the datasets is still insufficient to facilitate super large-scale pretraining compared with existing English pretrained models. For example, GPT-3 contains 175B parameters and is trained on 570GB texts. Meanwhile, the dataset should contain image-text pairs rather than plain texts for multi-modal pretraining.

2.2      Standards for a High-quality Dataset

To perform large-scale multi-modal pretraining and learn complex world knowledge in Chinese, the dataset is highly required to provide both plain texts and image-text pairs on super large scale, covering a wide range of domains. In order to perform large-scale multi-modal pretraining in Chinese, we focus on the construction of large-scale datasets in Chinese. Specifically, while we unify our pretraining for both natural language and multimodalities, we construct large datasets of both plain texts and image-text pairs. We are interested in obtaining large-scale data that covers a wide range of domains, so that it is possible for the model to learn the complex world knowledge of different fields. Also, we aim to collect data of multiple modalities for the cross-modal pretraining. This raises the difficulty for the construction of a large-scale dataset as the data for multimodal pretraining are usually image-text pairs, where in each pair the text provides a detailed description of a fraction of the image.

Though there are a tremendous amount of text resources and images on the world wide web, the corpus for multimodal pretraining is assumed to be better when satisfying the following properties:

(1). the sentences should be fluent natural language within a normal length, and should not contain meaningless tokens, such as markups, duplicate punctuation marks, random combinations of characters, etc.; (2). the images should be natural and realistic, and the resolutions of the images need to be identifiable by humans; (3). both the texts and images should not contain illegal content, such as pornography, violence, etc.; (4). the images and texts should be semantically relevant; (5). the datasets should cover a wide range of fields, say sports, politics, science, etc., and therefore it can endow the model with sufficient world knowledge.

\

2.3      Dataset Construction

Based on the requirements above, we collect data of both plain texts and image-text pairs. There are different types of data, including encyclopedia, crawled webpage, community question answering, forum, product description, etc. We present the details in Table 3. The collected corpus consists of bothag plain-texts and image-text pairs, which is compatible with the designed text-only and multi-modal pretraining tasks. Also, the data has a large coverage over domains, such as science, entertainment, sports, politics, common-sense of life, etc. We have also compared some characteristics of our corpus with existing datasets used for Chinese pretraining in Table 2. The size of our dataset is much larger than the previous ones. To our knowledge, this is the first large-scale, multimodal and multidomain corpus for Chinese pretraining.

We implement sophisticated preprocessing to obtain clean data. For text data, we first remove HTML markups and duplicate punctuation marks, and we only reserve characters and punctuation marks that are in Chinese and English. We remove the topics that are shorter than 5 characters and contents shorter than 15 characters. We further apply in-house spam detection to remove sentences that contain words related to certain political issues, pornography, or words in the list of dirty, naughty, and other bad words. In order to preserve the linguistic acceptance of the texts, we implement a language model to evaluate their perplexities, and sentences with high perplexities are discarded. Only images with at least 5000 pixels are reserved for pretraining. A sequence of classifiers and heuristic rules are applied to filter out images containing illegal content. We also use a pretrained image scorer to evaluate the qual-ities of images. For images and texts in crawled webpages, we only consider images and their surrounding text as relevant image-text pairs. Other sentences in the webpages are discarded.

\

3      M6 FRAMEWORK

Multimodal pretraining leverages both the power of self-attention-based transformer architecture and pretraining on large-scale data. We endeavor to endow the model with strong capability of cross-modal understanding and generation. In this section, we describe the details of our proposed pretrained model M6, which refers to Multi-Modality-to-Multi-Modality Multitask Mega-transformer.

#Table 1: Statistics of our pretraining dataset. We demonstrate the sources of our data, and we calculate the number of images, tokens, and passages, the average length, as well as the size of images and texts.

#Figure 1: Examples of the multimodal data of M6-Corpus. We demonstrate three cases that belong to different categories, including encyclopedia, crawled webpages, and product description.

Table 2: Comparison with the existing large-scale Chinese corpora for pretraining. Our dataset is the largest dataset for Chinese pretraining. The size of texts is larger than that of the existing datasets, and the size of images is even larger than that of ImageNet.

3.1      Å   Visual and Linguistic Inputs

The mainstream multimodal pretraining methods transform images to feature sequences via object detection. However, the performance of the object detectors as well as the expressivity of their backbones strongly impact the final performance of the pretrained models in the downstream tasks. We observe that a large proportion of the images contain only a few objects. Take the images of the data of e-commerce as an example. We randomly sample 1M images and perform object detection on the images. The results show that over 90% of the images contain fewer than 5 objects. Also, the objects have high overlapping with each other. To alleviate such influence, we turn to a simple but effective solution following Gao et al. [\[12\]](#bookmark28) and Dosovitskiy et al. [\[8\]](#bookmark24). In general, we split an image into patches and extract features of the 2D patches with a trained feature extractor, say ResNet-50. Then we line up the representations to a sequence by their positions. The processing of the input word sequence is much simpler. We follow the similar preprocessing procedures in the previous work [4, 11, 24]. We apply WordPiece [34, 45] and masking to the word sequence and embed them with an embedding layer, following BERT [6].

Figure 2: Examples of the plain text data of M6-Corpus. We demonstrate three cases that belong to different categories, including encyclopedia, community QA, forum discussion, and common crawl.

Table 3: Statistics of the pretraining dataset. We demonstrate the sources of our data, and we calculate the number of images, tokens, and passages, as well as the size of images and texts.

\

3.2      Unified Encoder-Decoder

We integrate the image embeddings 𝑒𝑖 and the word embeddings 𝑒𝑡 into the cross-modal embedding sequence 𝑒 = {𝑒𝑖, 𝑒𝑡 }. We send the sequence to the transformer backbone for high-level feature extraction. To differ their representations, we add corresponding segment embeddings for different modalities. Specifically, we leverage the

self-attention-based transformer blocks for our unified cross-modal representation learning. To be more specific, the building block is identical to that of BERT or GPT, which consists of self attention and point-wise feed-forward network (FFN). On top of the transformer backbone, we add an output layer for word prediction, and thus we tie its weights to those of the embedding layer.

In the unified framework, we use different masking strategies to enable encoding and decoding. The input is segmented into three parts, including visual inputs, masked linguistic inputs, and complete linguistic inputs. We apply bidirectional masking to both the visual inputs and masked linguistic inputs, and we apply causal masking to the complete linguistic inputs. Thus the model is allowed to encode and decode in the same framework.

\n

3.3      Pretraining Methods

We pretrain the model with the multitask setup, including text-to-text transfer, image-to-text transfer, and multimodality-to-text transfer. Thus the model can process information of different modalities and perform both single-modal and cross-modal understanding and generation.

Text-to-text Transfer As demonstrated in Figure 3, the model learns to perform text denoising and language modeling in the setting of text-to-text transfer. In text denoising, we mask the input text by a proportion, which is 15% in practice following BERT [6]. Specifically, we mask a continuous span of text with a single mask, and the model should learn to decode the whole sequence. This encourages the model to learn both recovering and length predict-ing. Besides, in order to improve the model ability in generation, we add a setup of language modeling, where the encoder receives no inputs and the decoder learns to generate words based on the previous context.

#Figure 3: An overview of the pretraining tasks for M6. The design of masking strategies allows the learning of different tasks under the same framework. M6 is pretrained with image-based text denoising, image captioning, text denoising, and language modeling.

Table 4: Model sizes of M6. 𝑛𝑙𝑎𝑦𝑒𝑟𝑠 is the number of transformer layers. 𝑑𝑚𝑜𝑑𝑒𝑙 is the dimension of hidden states in each layer. 𝑛ℎ𝑒𝑎𝑑𝑠 is the number of attention heads in each layer. 𝑛𝑒𝑥𝑝𝑒𝑟𝑡𝑠 is the number of experts. The M6-100B model employs multiple experts to scale up parameters to 100 billion. 𝑛𝑝𝑎𝑟𝑎𝑚 is the number of all parameters.

\ Image-to-text transfer Image-to-text transfer is similar to image captioning, where the model receives the visual information as the input, and learns to generate a corresponding description. In this setting, we add the aforementioned patch feature sequence to the input and leave the masked input blank. The model encodes the patch features, and decodes the corresponding text.

Multimodality-to-text transfer Based on the setup of image-to-text transfer, we additionally add masked linguistic inputs, and thus the model should learn to generate the target text based on both the visual information and the noised linguistic information. This task allows the model to adapt to the downstream tasks with both visual and linguistic inputs.

3.4      Scaling up to 10 and 100 Billion Parameters

We scale up the model size to 10 billion parameters and 100 billion parameters, which are named M6-10B and M6-100B. The increase in model size provides a much larger capacity for the model that it can learn knowledge from more data. For the construction of M6-10B, we simply scale up the model by hyperparameter tuning.

To be more specific, we increase the size of hidden states and the number of layers. To better leverage GPU memory, we apply mixed-precision training and activation checkpointing to save memory. Still, the model cannot be fit into one single GPU, and thus we use model parallelism to split the feed-forward networks and attention heads to multiple GPUs following the implementation of Megatron-LM [36].

However, directly scaling up to M6-100B is much more difficult as there are more challenges for the computation resources. Alternatively, inspired by the recent progress in sparse activations [10, 20, 35], we combine Mixture-of-Experts (MoE) with M6 to build the version of 100 billion parameters. Note that the original MoE requires mesh-tensorflow as well as TPUs. This sets limits for a number of researchers without such resources. Thus we implement the M6-100B with MoE with our in-house framework Whale [43] to perform model parallelism with GPUs. We demonstrate the key statistics of the models of different scales in Table 4.

Specifically, different from the conventional FFN layer, the MoE layer is a parallel combination of multiple FFN layers, each of which acts as an expert. This is also called expert parallelism. The model first learns a sparse gating network to route the tokens to specific experts. Thus each token is only sent to a small set of experts and the computation can be much less compared with that in dense models. This kind of model is highly efficient as it realizes data parallelism and expert parallelism across workers. The computation of MoE layer for a specific token 𝑥 can be described as below:

\

where 𝑔(·) refers to the sparse gating function, and T refers to the indices of top-𝑘 values of 𝑔(·). The output of MoE is a linear combination of the computation of selected expert FFNs 𝑓 (·).

In expert parallelism, the parameters of experts do not share across workers, while those of other parts are identical across workers. Therefore, it is necessary to perform all-to-all communication across workers at the MoE layers in order to dispatch tokens to selected experts and combine them to their original experts. While Lepikhin et al. [20] and Fedus et al. [10] implement the MoE on TPUs with one expert in each MoE layer on a TPU, we implement our model on Nvidia GPUs where there are several experts in each MoE layer on a GPU so as to fully utilize the memory. As all-to-all communication takes up a large amount of time, the optimization to improve efficiency is highly significant. We implement a series of optimization, including half-precision communication. A key problem is load balancing, which denotes that tokens can gather to only a few experts due to dynamic routing. Following Fedus et al. [10], we apply expert capacity, which refers to the number of tokens for an expert (𝐶 = 𝑁 - 𝑐/m, where 𝐶 refers to expert capacity, 𝑁 refers to the number of tokens in a batch, 𝑐 refers to capacity factor (which is a hyperparameter usually larger than 1.0) and 𝑚 refers to the number of experts), to alleviate this problem. Tokens out of the capacity of an expert are dropped from the computation and they are sent to next layers through residual connections. We find that the overloading problem can be severe, and this issue can be a significant one in the future research of expert models.

Besides the optimization in all-to-all communication, we com-pare the top-2 gating and top-1 gating and find that they can achieve similar model performance in perplexity, while the latter converges slightly slower. The effectiveness of top-1 gating enables faster computation. Besides, we also apply methods of memory optimization for higher efficiency. We find that gradient clipping globally can increase costs on all-to-all communication as it computes norms across all experts, and thus we apply local clipping for memory saving. We implement M6-100B with around 100 billion parameters on 128 Nvidia A100s and the speed of pretraining achieves 1440 samples/s (for samples of the sequence length of 272).

We demonstrate that using MoE structure for model size scaling is effective and it can achieve similar performance to that of M6-10B, the largest dense model, within 2-3 times shorter time. The negative log perplexity of M6-100B reaches −2.297, in comparison with M6-10B that reaches −2.253 but with twice of time.2 This shows that the MoE-based M6 model has advantages on the time basis compared with dense models with many more FLOPs.

\

4      APPLICATIONS

4.1      Text-to-Image Generation

Text-to-image generation has been an open problem for a long time. Previous studies mainly focused on generation on a limited domain, among which Generative Adversarial Nets (GANs) [14, 48] are dominated methods. Following Ramesh et al. [32], we leverage a two-stage framework for text-to-image generation, including discrete representation learning and language modeling.

#Table 5: Results on the FMIQA dataset. We report both the overall accuracy and the accuracy on specific question types.

\ In the first stage, we focus on transforming images into sequences of discrete codes. There are a number of alternatives for discrete code generation, including VQVAE [41] and VQGAN [9]. In the second stage, it is necessary to build a language model to learn to generate text and code sequence. In the finetuning, we add code embedding and output layers to the pretrained M6. We concat the word sequence and the aforementioned generated code sequence as the input, and we set the objective of autoregressive language modeling for the training. At the stage of inference, we input the text sequence, and the model generates codes autoregressively with top-k sampling. The last step is to transform the code sequence to an image with the generator from the first stage.

We construct a dataset for text-to-image generation in E-commerce. Specifically, we collect over 50 million product titles and images from the mobile Taobao. We apply a series of processing methods on the images to filter the unqualified. We filter the images with complex background features (characters, patterns, etc.) with the in-house white-background image detector and OCR model. We then filter the images with over 3 objects with our in-house object detector based on Faster R-CNN [33]. We finally obtain 1.8m high-quality product image-text pairs for finetuning. Compared with the images in the general domains, our collected data have the following features. The image and text are highly correlated as the text describes key features of the product, and there is no complex background in the images, which is easier to learn compared with the images in the public datasets such as MSCOCO [26].

We demonstrate two examples in Figure 4 and Figure 5. It can be found that the generated images have high quality and the generated objects resemble the real ones. Furthermore, in Figure 6 , we find that the model is able to imagine items according to the query military style camouflage high heels(军旅风迷彩高跟鞋), which do not exist in the real world. The imagination ability provides room for creative design in real-world industrial scenarios, such as clothing design, shoe design, etc.

We also finetune M6 under our proposed framework on another dataset which contains 3 million images crawled from the Internet, which cover more general domains. And we find that the model can adapt to different domains. As shown in Figure 7, the model is able to generate clip arts of robots . This reveals the versatility of the framework in text-to-image generation.

4.2      Visual Question Answering

We demonstrate our experimental results on a visual question answering dataset, and we illustrate how we directly apply the pre-trained M6 to the VQA application.

Figure 4: Generated images for sheep wool business casual suit (绵羊毛商务休闲西服套装).

Figure 5: Generated images for shock absorption and breathable running shoes (减震透气跑鞋).

\ We leverage the FMIQA dataset [13] as the Chinese visual QA benchmark, which requires the model to generate the answer given an image and a question. We implement a transformer-based model as our baseline. For the evaluation, we split the test set manually by random sampling 200 from the dataset as there is no official release of the test set, and we evaluate the overall accuracy by human evaluation. The results are demonstrated in Table 5. The pretrained M6-base outperforms the baseline by a large margin (+6.2%), which indicates the effectiveness of multimodal pretraining. Scaling up the model to M6-10B further brings 5.2% improvement.

Furthermore, we show that simply finetuning on such a small VQA dataset may limit the potential of M6. Therefore, we directly leverage M6 for the VQA application. We find that the model is able to recognize general features and provide more related knowledge based on its understanding. Though the model pretrained on pseudo-parallel image-text pairs cannot directly answer questions about detailed features, such as color, number, etc., it is able to answer questions related to background knowledge. We demonstrate some examples in Figure 8.

\

4.3      Image Captioning

Image captioning requires the model to generate a caption that describes the given image, which examines the model ability of cross-modal generation. We construct a dataset (named E-Commerce IC) containing pairs of product descriptions and product images from Taobao. Since too long or too short descriptions may be noisy, we discard pairs with a description longer than 100 words or less than 10 words. To avoid dirty generations, we further use an in-house tool to filter descriptions that may contain dirty words (i.e., pornographic or violent words). Finally, E-Commerce IC contains about 260k text-image pairs. We finetune the model with the image-to-text transfer task on E-Commerce IC.

#Figure 6: Generated images for military style camouflage high heels (军旅风迷彩高跟鞋).

Figure 7: Generated images for a clip art of robots (机器人矢量插图)).

\ We compare our model with a baseline of transformer in the human evaluation. We ask several annotators with the linguistic background to evaluate from three perspectives: grammar (whether a text is fluent without grammatical error), correctness (whether a text is faithful to the image), richness (whether a text is informative and attractive). During the evaluation, we randomly sample 100 images from the test set. For each image, an annotator is asked to score the text generated by different models. The scores are within the range of [0, 5].

The results in Table 6 show that M6-base outperforms the baseline in all of the metrics. We find that all models achieve high scores in grammar. However, in both correctness and richness, M6-base outperforms the baseline model by a large margin (+18.2% and +14.4%), indicating that multimodal pretraining helps to generate more faithful, informative and attractive texts. Scaling up the model to M6-10B further improves the correctness and richness (about 14.7% and 7.0%). Figure 9 illustrates two examples of image caption.

#Table 6: Results on the E-Commerce IC dataset.

Figure 8: Several examples of general visual question answering without finetuning. We turn the origin questions to the de-signed pattern , with typical tokens such as “?” and “Answer:”. The pretrained model can recognize the question and provide the answer as well as some further description.

\n

4.4      Question Answering

To demonstrate the potential availability in the applications of intelligent chatbots, we further employ the M6 model to generate long answers in the style of forum discussion. Human-generated questions are collected from various Chinese forums, which are input to the model to generate the answer. At the stage of inference, we append a question mark and a token “Answer:” in the prompt, which better triggers the model to generate an answer. To facilitate the generation of longer and more informative texts, we pick more complex questions.

Figure 10 demonstrates an example of general question answer-ing. The model can illustrate a man’s own experiences that are related to the question and also point out the answer at the end. This generated text confused human annotators and passed the Turing Test. It shows that the model can not only answer general questions but also generate long fluency text.

\ \

4.5      Poem Generation

We apply the pretrained model to Chinese poem generation. The model is able to generate genres with format constraints.

#Figure 9: Two examples of image caption. We only use the image feature as input and provide no prompt.

\ #Figure 10: One example of general question answering. The prompt which includes the question successfully triggers the model to generate long texts in the style of forum discussion.

\ Ancient Chinese poetry has various specific formats. We adopt the simplest constraints that

  • The poem shall be consisted of at least 4 lines.
  • The total number of lines shall be even.
  • Each line must have exactly 5 or 7 words.
  • All lines shall have the same number of words.

Text generation under format constraint is done in a search framework that we generate short sentences ending with punctuation until the number of words meets the constraint. We repeat this process until the model generates an "" token, or the number of lines exceeds a limit of 16. Figure 11 illustrates an example of a generated poem.

\

4.6      Image-Text Matching

We evaluate the model’s ability in cross-modal retrieval. Specifically, we construct a dataset (named E-Commerce ITM) containing pairs of texts and images from the mobile Taobao. Each pair belongs to a single item. we collect 235K products in the clothing industry from Taobao. For each product, aside from the product image, we obtain a query by rewriting the product title. Specifically, we conduct named entity recognition on the title using an in-house tool, which extracts the terms describing the style, color, category and texture of the product.

#Figure 11: One example of a generated poem, the prompt and the constraint mask work together to generate a poem based on the given title.

Table 7: Results on the E-Commerce ITM dataset. We report the accuracy on the test set.

\ These terms are then concatenated into a natural language query, which is used in image-text matching. The length of each query is between 6 to 12 words. The pairs of the query and corresponding product image are labeled as positive samples. The negative samples are constructed by randomly substituting the query in the original pairs.

We require the model to perform binary classification to discriminate positive and negative samples. We compare our model with InterBert [25], which is also a Chinese multi-modal pretrained model effective in cross-modal classification downstream tasks. The InterBert utilizes object-based features and has been pretrained on Taobao product image-text data as well.

The results are shown in Table 7. It should be noted that the InterBert and M6-base are both implemented with transformer-based architecture and have similar model scales. However, M6-base still outperforms InterBert by 10.3%. In experiments, we find the product images generally contain relatively fewer detected objects, which may harm the performance on this task. In contrast, M6 avoids this problem by employing the patch features and achieves much better performance.

\n

5      RELATED WORK

The tremendous success of NLP pretraining, including BERT [6], GPT [2, 30, 31], and also some other related studies [1, 7, 19, 27, 49], inspires the research in cross-modal representation learning. Also, recent studies show that the ubiquitous Transformer architecture [42] can be extended to different fields, including computer vision [3, 8]. Therefore, the simplest solution to incorporate recent pretraining methods and cross-modal representation learning is the extension of BERT. From the perspective of architecture, there are mainly two types, including single-stream model and dual stream model. Specifically, single-stream model is simple and it gradually becomes the mainstream architecture. These models mostly differ in their designs of pretraining tasks or the construction of input im-age features. Basically, they are mainly pretrained masked language modeling, masked object classification, and image-text matching. VisualBERT [23] and Unicoder-VL [22] simply use BERT and are pretrained with the aforementioned tasks. UNITER [4] pretrains the model with an additional task of word-region alignment. Oscar [24] enhances the alignment between objects and their corresponding words or phrases. VILLA [11] further improves model performance by adding their proposed adversarial learning methods to pretraining and finetuning. Except for pretraining tasks, some studies focus on the features of images. Most pretraining methods for multimodal representation learning utilize the features generated by a trained object detector, say Faster R-CNN [33]. PixelBERT [17] accepts raw images as input and extract their latent representations with a learnable ResNet [15] or ResNext [46]. FashionBERT [12] splits the images into patches with a trained ResNet without co-training. Besides single-stream models, dual-stream models also can achieve outstanding performance, such as VilBERT [28], LXMERT [40] and InterBERT [25]. ViLBERT-MT [29] enhances model performance with multi-task finetuning. ERNIE-ViL [50] enhances the model with the application of scene graph information. In spite of these successful cases, it still requires further researches to unmask the success of multimodal pretraining.

\

6      CONCLUSIONS

In this work, we propose the largest dataset M6-Corpus for pre-training in Chinese, which consists of over 1.9TB images and 292GB texts. The dataset has large coverage over domains, including encyclopedia, question answering, forum discussion, common crawl, etc. We propose a method called M6 that is able to process information of multiple modalities and perform both single-modal and cross-modal understanding and generation. The model is scaled to large model with 10B and 100B parameters with sophisticated deployment, and both models are the largest multimodal pretrained models. We apply the model to a series of downstream applications, showing its versatility. More specifically, we design a downstream task of text-guided image generation, and the finetuned M6 can reach superior performance by producing images of high quality.

In the future, we will continue the pretraining of extremely large models by increasing the scale of data and models to explore the limit of performance, and we also endeavor to search for more downstream applications for further generalization.

\n

REFERENCES

[1]   Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. 2020. Unilmv2: Pseudo-masked language models for unified language model pre-training. In International Conference on Machine Learning. PMLR, 642–652.

[2]   Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,

Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).

[3]   Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213–229.

[4]   Y en-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In ECCV 2020. 104–120.

[5]   Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for chinese natural language processing. arXiv preprint arXiv:2004.13922 (2020).

[6]   Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT 2019. 4171–4186.

[7]   Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In NeurIPS 2019. 13042–13054.

[8]   Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi-aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).

[9]   Patrick Esser, Robin Rombach, and Björn Ommer. 2020. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841 [cs.CV]

[10]   William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. CoRR abs/2101.03961 (2021). arXiv:2101.03961 https://arxiv.org/abs/2101.03961

[11]   Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. In NeurIPS 2020.

[12]   Dehong Gao, Linbo Jin, Ben Chen, Minghui Qiu, Peng Li, Yi Wei, Yi Hu, and Hao Wang. 2020. Fashionbert: Text and image matching with adaptive loss for cross-modal retrieval. In SIGIR 2020. 2251–2260.

[13]   Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? dataset and methods for multilingual image question answering. arXiv preprint arXiv:1505.05612 (2015).

[14]   Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks. arXiv preprint arXiv:1406.2661 (2014).

[15]   Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR 2016. 770–778.

[16]   Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. De-berta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654 (2020).

[17]   Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020).

[18]   Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, and Shuicheng Yan. 2020. Convbert: Improving bert with span-based dynamic convolution. arXiv preprint arXiv:2008.02496 (2020).

[19]   Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. CoRR abs/1909.11942 (2019).

[20]   Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020).

[21]   Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettle-moyer. 2021. BASE Layers: Simplifying Training of Large, Sparse Models. CoRR abs/2103.16716 (2021).

[22]   Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. CoRR abs/1908.06066 (2019).

[23]   Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR abs/1908.03557 (2019).

[24]   Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. CoRR abs/2004.06165 (2020).

[25]   Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, and Hongxia Yang. 2020. Interbert: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198 (2020).

[26]   Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV 2014. 740–755.

[27]   Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019).

[28]   Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS 2019. 13–23.

[29]   Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2019. 12-in-1: Multi-Task Vision and Language Representation Learning. CoRR abs/1912.02315 (2019).

[30]   Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/ languageunsupervised/language understanding paper. pdf (2018).

[31]   Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. [n.d.]. Language models are unsupervised multitask learners. ([n. d.]).

[32]   Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. arXiv:2102.12092 [cs.CV]

[33]   Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS 2015. 91–99.

[34]   Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL 2016.

[35]   Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).

[36]   Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053 (2019).

[37]   Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In ICML 2019. 5926–5936.

[38]   Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR 2020.

[39]   Maosong Sun, Jingyang Li, Zhipeng Guo, Z Yu, Y Zheng, X Si, and Z Liu. 2016. Thuctc: an efficient chinese text classifier. GitHub Repository (2016).

[40]   Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP-IJCNLP 2019. 5099–5110.

[41]   Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural Discrete Representation Learning. In NIPS.

[42]   Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS 2017. 5998–6008.

[43]   Ang Wang, Xianyan Jia, Le Jiang, Jie Zhang, Yong Li, and Wei Lin. 2020. Whale: A Unified Distributed Training Framework. arXiv preprint arXiv:2011.09208 (2020).

[44]   Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Jiangnan Xia, Liwei Peng, and Luo Si. 2019. Structbert: Incorporating language structures into pre-training for deep language understanding. arXiv preprint arXiv:1908.04577 (2019).

[45]   Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).

[46]   Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In CVPR 2017. 1492–1500.

[47]   Liang Xu, Xuanwei Zhang, and Qianqian Dong. 2020. CLUECorpus2020: A Large-scale Chinese Corpus for Pre-trainingLanguage Model. arXiv preprint arXiv:2003.01355 (2020).

[48]   Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1316–1324.

[49]   Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS 2019. 5754–5764.

[50]   Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie-vil: Knowledge enhanced vision-language representations through scene graph. arXiv preprint arXiv:2006.16934 (2020).

[51]   Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI 2020. 13041–13049.

\

:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

\

阿里巴巴的Qwen:挑战硅谷的中国AI模型

2026-02-28 23:08:19

:::info

Authors:

  1. Jinze Bai
  2. Shuai Bai
  3. Yunfei Chu
  4. Zeyu Cui
  5. Kai Dang
  6. Xiaodong Deng
  7. Yang Fan
  8. Wenbin Ge
  9. Yu Han
  10. Fei Huang
  11. Binyuan Hui
  12. Luo Ji
  13. Mei Li
  14. Junyang Lin
  15. Runji Lin
  16. Dayiheng Liu
  17. Gao Liu
  18. Chengqiang Lu
  19. Keming Lu
  20. Jianxin Ma
  21. Rui Men
  22. Xingzhang Ren
  23. Xuancheng Ren
  24. Chuanqi Tan
  25. Sinan Tan
  26. Jianhong Tu
  27. Peng Wang
  28. Shijie Wang
  29. Wei Wang
  30. Shengguang Wu
  31. Benfeng Xu
  32. Jin Xu
  33. An Yang
  34. Hao Yang
  35. Jian Yang
  36. Shusheng Yang
  37. Yang Yao
  38. Bowen Yu
  39. Hongyi Yuan
  40. Zheng Yuan
  41. Jianwei Zhang
  42. Xingxuan Zhang
  43. Yichang Zhang
  44. Zhenru Zhang
  45. Chang Zhou
  46. Jingren Zhou
  47. Xiaohuan Zhou
  48. Tianhang Zhu

\ Qwen Team, Alibaba Group

:::

Abstract

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce QWEN1, the first installment of our large language model series. QWEN is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes QWEN, the base pretrained language models, and QWEN-CHAT, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, CODE-QWEN and CODE-QWEN-CHAT, as well as mathematics-focused models, MATH-QWEN-CHAT, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models. \n

1         Introduction

Large language models (LLMs) (Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2020; Brown et al., 2020; OpenAI, 2023; Chowdhery et al., 2022; Anil et al., 2023; Thoppilan et al., 2022; Touvron et al., 2023a;b) have revolutionized the field of artificial intelligence (AI) by providing a powerful foundation for complex reasoning and problem-solving tasks. These models have the ability to compress vast knowledge into neural networks, making them incredibly versatile agents. With a chat interface, LLMs can perform tasks that were previously thought to be the exclusive domain of humans, especially those involving creativity and expertise (OpenAI, 2022; Ouyang et al., 2022; Anil et al., 2023; Google, 2023; Anthropic, 2023a;b). They can engage in natural language conversations with humans, answering questions, providing information, and even generating creative content such as stories, poems, and music. This has led to the development of a wide range of applications, from chatbots and virtual assistants to language translation and summarization tools.

LLMs are not just limited to language tasks. They can also function as a generalist agent (Reed et al., 2022; Bai et al., 2022a; Wang et al., 2023a; AutoGPT, 2023; Hong et al., 2023), collaborating with external systems, tools, and models to achieve the objectives set by humans. For example, LLMs can understand multimodal instructions (OpenAI, 2023; Bai et al., 2023; Liu et al., 2023a; Ye et al., 2023; Dai et al., 2023; Peng et al., 2023b), execute code (Chen et al., 2021; Zheng et al., 2023; Li et al., 2023d), use tools (Schick et al., 2023; LangChain, Inc., 2023; AutoGPT, 2023), and more. This opens up a whole new world of possibilities for AI applications, from autonomous vehicles and robotics to healthcare and finance. As these models continue to evolve and improve, we can expect to see even more innovative and exciting applications in the years to come. Whether it’s helping us solve complex problems, creating new forms of entertainment, or transforming the way we live and work, LLMs are poised to play a central role in shaping the future of AI.

Figure 1: Model Lineage of the Qwen Series. We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens. We then use SFT and RLHF to align QWEN to human preference and thus we have QWEN-CHAT and specifically its improved version QWEN-CHAT-RLHF. Additionally, we also develop specialized models for coding and mathematics, such as CODE-QWEN, CODE-QWEN-CHAT, and MATH-QWEN-CHAT based on QWEN with similar techniques. Note that we previously released the multimodal LLM, QWEN-VL and QWEN-VL-CHAT (Bai et al., 2023), which are also based on our QWEN base models.

\ Despite their impressive capabilities, LLMs are often criticized for their lack of reproducibility, steerability, and accessibility to service providers. In this work, we are pleased to present and release the initial version of our LLM series, QWEN. QWEN is a moniker that derives from the Chinese phrase Qianwen, which translates to “thousands of prompts” and conveys the notion of embracing a wide range of inquiries. QWEN is a comprehensive language model series that encompasses distinct models with varying parameter counts. The model series include the base pretrained language models, chat models finetuned with human alignment techniques, i.e., supervised finetuning (SFT), reinforcement learning with human feedback (RLHF), etc., as well as specialized models in coding and math. The details are outlined below:

\n

1.   The base language models, namely QWEN, have undergone extensive training using up to 3 trillion tokens of diverse texts and codes, encompassing a wide range of areas. These models have consistently demonstrated superior performance across a multitude of downstream tasks, even when compared to their more significantly larger counterparts.

2.   The QWEN-CHAT models have been carefully finetuned on a curated dataset relevant to task performing, chat, tool use, agent, safety, etc. The benchmark evaluation demonstrates that the SFT models can achieve superior performance. Furthermore, we have trained reward models to mimic human preference and applied them in RLHF for chat models that can produce responses preferred by humans. Through the human evaluation of a challenging test, we find that QWEN-CHAT models trained with RLHF are highly competitive, still falling behind GPT-4 on our benchmark.

3.    In addition, we present specialized models called CODE-QWEN, which includes CODE-QWEN-7B and CODE-QWEN-14B, as well as their chat models, CODE-QWEN-14B-CHAT and CODE-QWEN-7B-CHAT. Specifically, CODE-QWEN has been pre-trained on extensive datasets of code and further fine-tuned to handle conversations related to code generation, debugging, and interpretation. The results of experiments conducted on benchmark datasets, such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and HumanEvalPack (Muennighoff et al., 2023), demonstrate the high level of proficiency of CODE-QWEN in code understanding and generation.

4.   This research additionally introduces MATH-QWEN-CHAT specifically designed to tackle mathematical problems. Our results show that both MATH-QWEN-7B-CHAT and MATH-QWEN-14B-CHAT outperform open-sourced models in the same sizes with large margins and are approaching GPT-3.5 on math-related benchmark datasets such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021).

5.    Besides, we have open-sourced QWEN-VL and QWEN-VL-CHAT, which have the versatile ability to comprehend visual and language instructions. These models outperform the current open-source vision-language models across various evaluation benchmarks and support text recognition and visual grounding in both Chinese and English languages. Moreover, these models enable multi-image conversations and storytelling. Further details can be found in Bai et al. (2023).

\ Now, we officially open-source the 14B-parameter and 7B-parameter base pretrained models QWEN and aligned chat models QWEN-CHAT2. This release aims at providing more comprehensive and powerful LLMs at developer- or application-friendly scales.

The structure of this report is as follows: Section 2 describes our approach to pretraining and results of QWEN. Section 3 covers our methodology for alignment and reports the results of both automatic evaluation and human evaluation. Additionally, this section describes details about our efforts in building chat models capable of tool use, code interpreter, and agent. In Sections 4 and 5, we delve into specialized models of coding and math and their performance. Section 6 provides an overview of relevant related work, and Section 7 concludes this paper and points out our future work.

\

2         Pretraining

The pretraining stage involves learning vast amount of data to acquire a comprehensive understanding of the world and its various complexities. This includes not only basic language capabilities but also advanced skills such as arithmetic, coding, and logical reasoning. In this section, we introduce the data, the model design and scaling, as well as the comprehensive evaluation results on benchmark datasets.

\

2.1         DATA

The size of data has proven to be a crucial factor in developing a robust large language model, as highlighted in previous research (Hoffmann et al., 2022; Touvron et al., 2023b). To create an effective pretraining dataset, it is essential to ensure that the data are diverse and cover a wide range of types, domains, and tasks. Our dataset is designed to meet these requirements and includes public web documents, encyclopedia, books, codes, etc. Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese.

Figure 2: Performance of GPT-4, GPT-3.5, the previous 13B SOTA, as well as QWEN-14B. We demonstrate the results on 12 datasets covering multiple domains, including language understanding, knowledge, reasoning, etc. QWEN significantly outperforms the previous SOTA of similar model sizes, but still lag behind both GPT-3.5 and GPT-4.

\ To ensure the quality of our pretraining data, we have developed a comprehensive data preprocessing procedure. For public web data, we extract text from HTML and use language identification tools to determine the language. To increase the diversity of our data, we employ deduplication techniques, including exact-match deduplication after normalization and fuzzy deduplication using MinHash and LSH algorithms. To filter out low-quality data, we employ a combination of rule-based and machine-learning-based methods. Specifically, we use multiple models to score the content, including language models, text-quality scoring models, and models for identifying potentially offensive or inappropriate content. We also manually sample texts from various sources and review them to ensure their quality. To further enhance the quality of our data, we selectively up-sample data from certain sources, to ensure that our models are trained on a diverse range of high-quality content. In recent studies (Zeng et al., 2022; Aribandi et al., 2021; Raffel et al., 2020), it has been demonstrated that pretraining language models with multi-task instructions can enhance their zero-shot and few-shot performance. To further enhance the performance of our model, we have incorporated high-quality instruction data into our pretraining process. To safeguard the integrity of our benchmark assessment, we have adopted a similar approach as Brown et al. (2020) and meticulously eliminated any instruction samples that exhibit a 13-gram overlap with any data present in the test sets utilized in our evaluation.

Figure 3: Encoding compression rates of different models. We randomly selected 1 million document corpora of each language to test and compare the encoding compression rates of different models (with XLM-R (Conneau et al., 2019), which supports 100 languages, as the base value 1, not shown in the figure). As can be seen, while ensuring the efficient decoding of Chinese, English, and code, QWEN also achieves a high compression rate for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, fr, etc.), equipping the model with strong scalability as well as high training and inference efficiency in these languages.

\ Given the large number of downstream tasks, it is not feasible to repeat this filtering process for all tasks. Instead, we have made sure that the instruction data for the reported tasks have undergone our filtering process to ensure their accuracy and reliability. Finally, we have built a dataset of up to 3 trillion tokens.

\

2.2         Tokenization

The design of vocabulary significantly impacts the training efficiency and the downstream task performance. In this study, we utilize byte pair encoding (BPE) as our tokenization method, following GPT-3.5 and GPT-4. We start with the open-source fast BPE tokenizer, tiktoken (Jain, 2022), and select the vocabulary cl100k base as our starting point. To enhance the performance of our model on multilingual downstream tasks, particularly in Chinese, we augment the vocabulary with commonly used Chinese characters and words, as well as those in other languages. Also, following Touvron et al. (2023a;b), we have split numbers into single digits. The final vocabulary size is approximately 152K.

The performance of the QWEN tokenizer in terms of compression is depicted in Figure 3. In this comparison, we have evaluated QWEN against several other tokenizers, including XLM-R (Conneau et al., 2019), LLaMA (Touvron et al., 2023a), Baichuan (Inc., 2023a), and InternLM (InternLM Team, 2023). Our findings reveal that QWEN achieves higher compression efficiency than its competitors in most languages. This implies that the cost of serving can be significantly reduced since a smaller number of tokens from QWEN can convey more information than its competitors. Furthermore, we have conducted preliminary experiments to ensure that scaling the vocabulary size of QWEN does not negatively impact the downstream performance of the pretrained model. Despite the increase in vocabulary size, our experiments have shown that QWEN maintains its performance levels in downstream evaluation.

\

2.3         Architecture

QWEN is designed using a modified version of the Transformer architecture. Specifically, we have adopted the recent open-source approach of training large language models, LLaMA (Touvron et al., 2023a), which is widely regarded as the top open-source LLM. Our modifications to the architecture include:

#Table 1: Model sizes, architectures, and optimization hyper-parameters.

\

  • Embedding and output projection. Based on preliminary experimental findings, we have opted for the untied embedding approach instead of tying the weights of input embedding and output projection. This decision was made in order to achieve better performance with the price of memory costs.
  • Positional embedding. We have chosen RoPE (Rotary Positional Embedding) (Su et al., 2021) as our preferred option for incorporating positional information into our model. RoPE has been widely adopted and has demonstrated success in contemporary large language models, notably PaLM (Chowdhery et al., 2022; Anil et al., 2023) and LLaMA (Touvron et al., 2023a;b). In particular, we have opted to use FP32 precision for the inverse frequency matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve higher accuracy.
  • Bias. For most layers, we remove biases following Chowdhery et al. (2022), but we add biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su, 2023b).
  • Pre-Norm & RMSNorm. In modern Transformer models, pre-normalization is the most widely used approach, which has been shown to improve training stability compared to post-normalization. Recent research has suggested alternative methods for better training stability, which we plan to explore in future versions of our model. Additionally, we have replaced the traditional layer normalization technique described in (Ba et al., 2016) with RMSNorm (Jiang et al., 2023). This change has resulted in equivalent performance while also improving efficiency.
  • Activation function. We have selected SwiGLU (Shazeer, 2020) as our activation function, a combination of Swish (Ramachandran et al., 2017) and Gated Linear Unit (Dauphin et al., 2017). Our initial experiments have shown that activation functions based on GLU generally outperform other baseline options, such as GeLU (Hendrycks & Gimpel, 2016). As is common practice in previous research, we have reduced the dimension of the feed-forward network (FFN) from 4 times the hidden size to 8/3 of the hidden size.

\

2.4         Training

To train QWEN, we follow the standard approach of autoregressive language modeling, as described in Radford et al. (2018). This involves training the model to predict the next token based on the context provided by the previous tokens. We train models with context lengths of 2048. To create batches of data, we shuffle and merge the documents, and then truncate them to the specified context lengths. To improve computational efficiency and reduce memory usage, we employ Flash Attention in the attention modules (Dao et al., 2022). We adopt the standard optimizer AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) for pretraining optimization. We set the hyperparameters β1 = 0*.*9, β2 = 0*.*95, and ϵ = 10−8. We use a cosine learning rate schedule with a specified peak learning rate for each model size. The learning rate is decayed to a minimum learning rate of 10% of the peak learning rate. All the models are trained with BFloat16 mixed precision for training stability.

\

2.5         Context Length Extension

Transformer models have a significant limitation in terms of the context length for their attention mechanism. As the context length increases, the quadratic-complexity computation leads to a drastic increase in both computation and memory costs. In this work, we have implemented simple training-free techniques that are solely applied during inference to extend the context length of the model. One of the key techniques we have used is NTK-aware interpolation (bloc97, 2023).

Table 2: Overall performance on widely-used benchmarks compared to open-source base models. Our largest QWEN model with 14 billion parameters outperforms previous 13B SoTA models on all datasets.

\ Unlike position interpolation (PI) (Chen et al., 2023a) which scales each dimension of RoPE equally, NTK-aware interpolation adjusts the base of RoPE to prevent the loss of high-frequency information in a training-free manner. To further improve performance, we have also implemented a trivial extension called dynamic NTK-aware interpolation, which is later formally discussed in (Peng et al., 2023a). It dynamically changes the scale by chunks, avoiding severe performance degradation. These techniques allow us to effectively extend the context length of Transformer models without compromising their computational efficiency or accuracy.

QWEN additionally incorporates two attention mechanisms: LogN-Scaling (Chiang & Cholak, 2022; Su, 2023a) and window attention (Beltagy et al., 2020). LogN-Scaling rescales the dot product of the query and value by a factor that depends on the ratio of the context length to the training length, ensuring that the entropy of the attention value remains stable as the context length grows. Window attention restricts the attention to a limited context window, preventing the model from attending to tokens that are too far away.

We also observed that the long-context modeling ability of our model varies across layers, with lower layers being more sensitive in context length extension compared to the higher layers. To leverage this observation, we assign different window sizes to each layer, using shorter windows for lower layers and longer windows for higher layers.

\

2.6         Experimental Results

To evaluate the zero-shot and few-shot learning capabilities of our models, we conduct a thorough benchmark assessment using a series of datasets. We compare QWEN with the most recent open-source base models, including LLaMA (Touvron et al., 2023a), LLAMA 2 (Touvron et al., 2023b), MPT (Mosaic ML, 2023), Falcon (Almazrouei et al., 2023), Baichuan2 (Yang et al., 2023), ChatGLM2 (ChatGLM2 Team, 2023), InternLM (InternLM Team, 2023), XVERSE (Inc., 2023b), and StableBeluga2 (Stability AI, 2023). Our evaluation covers a total of 7 popular benchmarks, which are MMLU (5-shot) (Hendrycks et al., 2020), C-Eval (5-shot) (Huang et al., 2023), GSM8K (8-shot) (Cobbe et al., 2021), MATH (4-shot) (Hendrycks et al., 2021), HumanEval (0-shot) (Chen et al., 2021), MBPP (0-shot) (Austin et al., 2021), and BBH (Big Bench Hard) (3 shot) (Suzgun et al., 2022). We aim to provide a comprehensive summary of the overall performance of our models across these benchmarks.

Table 3: Results of QWEN on long-context inference using various techniques. Our experimental findings reveal that the application of our crucial techniques enables the model to consistently achieve low perplexity as the context length increases. This suggests that these techniques play a significant role in enhancing the model’s ability to comprehend and generate lengthy texts.

\ In this evaluation, we focus on the base language models without alignment and collect the baselines’ best scores from their official results and OpenCompass (OpenCompass Team, 2023). The results are presented in Table 2.

Our experimental results demonstrate that the three QWEN models exhibit exceptional performance across all downstream tasks. It is worth noting that even the larger models, such as LLaMA2-70B, are outperformed by QWEN-14B in 3 tasks. QWEN-7B also performs admirably, surpassing LLaMA2-13B and achieving comparable results to Baichuan2-13B. Notably, despite having a relatively small number of parameters, QWEN-1.8B is capable of competitive performance on certain tasks and even outperforms larger models in some instances. The findings highlight the impressive capabilities of the QWEN models, particularly QWEN-14B, and suggest that smaller models, such as QWEN-1.8B, can still achieve strong performance in certain applications.

To evaluate the effectiveness of context length extension, Table 3 presents the test results on arXiv3 in terms of perplexity (PPL). These results demonstrate that by combining NTK-aware interpolation, LogN-Scaling, and layer-wise window assignment, we can effectively maintain the performance of our models in the context of over 8192 tokens.

\

3         Alignment

Pretrained large language models have been found to be not aligned with human behavior, making them unsuitable for serving as AI assistants in most cases. Recent research has shown that the use of alignment techniques, such as supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF), can significantly improve the ability of language models to engage in natural conversation. In this section, we will delve into the details of how QWEN models have been trained using SFT and RLHF, and evaluate their performance in the context of chat-based assistance.

\

3.1         Supervised Finetuning

To gain an understanding of human behavior, the initial step is to carry out SFT, which finetunes a pretrained LLM on chat-style data, including both queries and responses. In the following sections, we will delve into the details of data construction and training methods.

\

3.1.1          DATA

To enhance the capabilities of our supervised finetuning datasets, we have annotated conversations in multiple styles. While conventional datasets (Wei et al., 2022a) contain a vast amount of data prompted with questions, instructions, and answers in natural language, our approach takes it a step further by annotating human-style conversations. This practice, inspired by Ouyang et al. (2022), aims at improving the model’s helpfulness by focusing on natural language generation for diverse tasks. To ensure the model’s ability to generalize to a wide range of scenarios, we specifically excluded data formatted in prompt templates that could potentially limit its capabilities. Furthermore, we have prioritized the safety of the language model by annotating data related to safety concerns such as violence, bias, and pornography.

In addition to data quality, we have observed that the training method can significantly impact the final performance of the model. To achieve this, we utilized the ChatML-style format (OpenAI, 2022), which is a versatile meta language capable of describing both the metadata (such as roles) and the content of a turn. This format enables the model to effectively distinguish between various types of information, including system setup, user inputs, and assistant outputs, among others. By leveraging this approach, we can enhance the model’s ability to accurately process and analyze complex conversational data.

\

3.1.2          Training

Consistent with pretraining, we also apply next-token prediction as the training task for SFT. We apply the loss masks for the system and user inputs. More details are demonstrated in Section A.1.1.

The model’s training process utilizes the AdamW optimizer, with the following hyperparameters: β1 set to 0*.9, β2 set to 0.95, and ϵ set to 10−8. The sequence length is limited to 2048, and the batch size is 128. The model undergoes a total of 4000 steps, with the learning rate gradually increased over the first 1430 steps, reaching a peak of 2 × 10−6. To prevent overfitting, weight decay is applied with a value of 0.1, dropout is set to 0.1, and gradient clipping is enforced with a limit of 1.*0.

\

3.2         Reinforcement Learning from Human Feedback

While SFT has proven to be effective, we acknowledge that its generalization and creativity capa-bilities may be limited, and it is prone to overfitting. To address this issue, we have implemented Reinforcement Learning from Human Feedback (RLHF) to further align SFT models with human preferences, following the approaches of Ouyang et al. (2022); Christiano et al. (2017). This process involves training a reward model and using Proximal Policy Optimization (PPO) (Schulman et al., 2017) to conduct policy training.

\

3.2.1          Reward Model

To create a successful reward model, like building a large language model (LLM), it is crucial to first undergo pretraining and then finetuning. This pretraining process, also known as preference model pretraining (PMP) (Bai et al., 2022b), necessitates a vast dataset of comparison data. This dataset consists of sample pairs, each containing two distinct responses for a single query and their corresponding preferences. Similarly, finetuning is also conducted on this type of comparison data, but with a higher quality due to the presence of quality annotations.

During the fine-tuning phase, we gather a variety of prompts and adjust the reward model based on human feedback for responses from the QWEN models. To ensure the diversity and complexity of user prompts are properly taken into account, we have created a classification system with around 6600 detailed tags and implemented a balanced sampling algorithm that considers both diversity and complexity when selecting prompts for annotation by the reward model (Lu et al., 2023). To generate a wide range of responses, we have utilized QWEN models of different sizes and sampling strategies, as diverse responses can help reduce annotation difficulties and enhance the performance of the reward model. These responses are then evaluated by annotators following a standard annotation guideline, and comparison pairs are formed based on their scores.

In creating the reward model, we utilize the same-sized pre-trained language model QWEN to initiate the process. It is important to mention that we have incorporated a pooling layer into the original QWEN model to extract the reward for a sentence based on a specific end token.

Table 4: Test Accuracy of QWEN preference model pretraining (PMP) and reward model (RM) on diverse human preference benchmark datasets.

\n The learning rate for this process has been set to a constant value of 3 × 10−6, and the batch size is 64. Additionally, the sequence length is set to 2048, and the training process lasts for a single epoch.

We adopted the accuracy on the test dataset as an important but not exclusive evaluation metric for the reward model. In Table 4, we report the test pairwise accuracy of PMP and reward models on diverse human preference benchmark datasets (Bai et al., 2022b; Stiennon et al., 2020; Ethayarajh et al., 2022; Lightman et al., 2023). Specifically, QWEN Helpful-base and QWEN Helpful-online are our proprietary datasets. The responses in QWEN Helpful-base are generated from QWEN without RLHF, whereas QWEN Helpful-online includes responses from QWEN with RLHF. The results show that the PMP model demonstrates high generalization capabilities on out-of-distribution data, and the reward model demonstrates significant improvement on our QWEN reward datasets.

\

3.2.2          Reinforcement Learning

Our Proximal Policy Optimization (PPO) process involves four models: the policy model, value model, reference model, and reward model. Before starting the PPO procedure, we pause the policy model’s updates and focus solely on updating the value model for 50 steps. This approach ensures that the value model can adapt to different reward models effectively.

During the PPO operation, we use a strategy of sampling two responses for each query simultaneously. This strategy has proven to be more effective based on our internal benchmarking evaluations. We set the KL divergence coefficient to 0*.*04 and normalize the reward based on the running mean.

The policy and value models have learning rates of 1 × 10−6 and 5 × 10−6, respectively. To enhance training stability, we utilize value loss clipping with a clip value of 0*.15. For inference, the policy top-p is set to 0.9. Our findings indicate that although the entropy is slightly lower than when top-p is set to 1.*0, there is a faster increase in reward, ultimately resulting in consistently higher evaluation rewards under similar conditions.

Additionally, we have implemented a pretrained gradient to mitigate the alignment tax. Empirical findings indicate that, with this specific reward model, the KL penalty is adequately robust to counteract the alignment tax in benchmarks that are not strictly code or math in nature, such as those that test common sense knowledge and reading comprehension. It is imperative to utilize a significantly larger volume of the pretrained data in comparison to the PPO data to ensure the effectiveness of the pretrained gradient. Additionally, our empirical study suggests that an overly large value for this coefficient can considerably impede the alignment to the reward model, eventually compromising the ultimate alignment, while an overly small value would only have a marginal effect on alignment tax reduction.

\

3.3         Automatic and Human Evaluation of Aligned Models

To showcase the effectiveness of our aligned models, we conduct a comparison with other aligned models on well-established benchmarks, including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and BBH (Suzgun et al., 2022). Besides the widely used few-shot setting, we test our aligned models in the zero-shot setting to demonstrate how well the models follow instructions. The prompt in a zero-shot setting consists of an instruction and a question without any previous examples in the context. The results of the baselines are collected from their official reports and OpenCompass (OpenCompass Team, 2023).

The results in Table 5 demonstrate the effectiveness of our aligned models in understanding human instructions and generating appropriate responses. QWEN-14B-Chat outperforms all other models except ChatGPT (OpenAI, 2022) and LLAMA 2-CHAT-70B (Touvron et al., 2023b) in all datasets, including MMLU (Hendrycks et al., 2020), C-Eval (Huang et al., 2023), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and BBH (Suzgun et al., 2022).

Table 5: Performance of aligned models on widely-used benchmarks. We report both zero-shot and few-shot performance of the models.

\n In particular, QWEN’s performance in HumanEval, which measures the quality of generated codes, is significantly higher than that of other open-source models.

Moreover, QWEN’s performance is consistently better than that of open-source models of similar size, such as LLaMA2 (Touvron et al., 2023b), ChatGLM2 (ChatGLM2 Team, 2023), InternLM (InternLM Team, 2023), and Baichuan2 (Yang et al., 2023). This suggests that our alignment approach, which involves fine-tuning the model on a large dataset of human conversations, has been effective in improving the model’s ability to understand and generate human-like language./im

Despite this, we have reservations about the ability of traditional benchmark evaluation to accurately measure the performance and potential of chat models trained with alignment techniques in today’s landscape. The results mentioned earlier provide some evidence of our competitive standing, but we believe that it is crucial to develop new evaluation methods specifically tailored to aligned models.

We believe that human evaluation is crucial, which is why we have created a carefully curated dataset for this purpose. Our process involved collecting 300 instructions in Chinese that covered a wide range of topics, including knowledge, language understanding, creative writing, coding, and mathematics. To evaluate the performance of different models, we chose the SFT version of QWEN-CHAT-7B and the SFT and RLHF versions of QWEN-CHAT-14B, and added two strong baselines, GPT-3.5 and GPT-44, for comparison. For each instruction, we asked three annotators to rank the model responses by the overall score of helpfulness, informativeness, validity, and other relevant factors. Our dataset and evaluation methodology provides a comprehensive and rigorous assessment of the capabilities of different language models in various domains.

Figure 4 illustrates the win rates of the various models. For each model, we report the percentage of wins, ties, and losses against GPT-3.5, with the segments of each bar from bottom to top representing these statistics. The experimental results clearly demonstrate that the RLHF model outperforms the SFT models by significant margins, indicating that RLHF can encourage the model to generate responses that are more preferred by humans. In terms of overall performance, we find that the RLHF model significantly outperforms the SFT models, falling behind GPT-4. This indicates the effectiveness of RLHF for aligning to human preference. To provide a more comprehensive understanding of the models’ performance, we include a case study with examples from different models in Appendix A.2.2. Nonetheless, it remains difficult to accurately capture the gap between our models and the proprietary models. As such, a more extensive and rigorous assessment is required for the chat models.

Figure 4: Results of the human evaluation for chat models. We compare Qwen-7B (SFT), Qwen-14B (SFT), Qwen-14B (RLHF), as well as GPT-4 against GPT-3.5. Each bar segment represents the percentage of wins, ties, and losses, from bottom to top. On average, the RLHF model outperforms the SFT model. The dataset consists of 300 Chinese instructions.

\

3.4         Tool Use, Code Interpreter, and Agent

Table 6: Performance of QWEN on the in-house Chinese benchmark that evaluates its ability to use unseen tools via ReAct prompting.

The QWEN models, which are designed to be versatile, have the remarkable ability to assist with (semi-)automating daily tasks by leveraging their skills in tool-use and planning. As such, they can serve as agents or copilots to help streamline various tasks. We explore QWEN’s proficiency in the following areas:

•     Utilizing unseen tools through ReAct prompting (Yao et al., 2022) (see Table 6).

•     Using a Python code interpreter to enhance math reasoning, data analysis, and more (see Table 7 and Table 8).

•     Functioning as an agent that accesses Hugging Face’s extensive collection of multimodal models while engaging with humans (see Table 9).

Table 7: The proportion of code generated by QWEN that is executable on the in-house evaluation benchmark for Code Interpreter. This benchmark examines QWEN’s coding proficiency in math problem solving, data visualization, and general purposes. CODE LLAMA underperforms on visualization tasks because it hallucinates non-existent columns solely based on CSV file names (see Figure 5).

Table 8: Correctness of the final response on the in-house evaluation benchmark for Code Interpreter. Visualization-Hard tasks involve planning multiple steps, while Visualization-Easy tasks do not. Visualization-All measures both types of tasks. CODE LLAMA excels in performing VisualizationEasy tasks but tends to underperform in Visualization-Hard tasks, due to its inclination to hallucinate non-existent columns based on the name of a CSV file (see Figure 5).

\ Table 9: Results of QWEN-Chat on the Hugging Face Agent benchmark.

\ To enhance QWEN’s capabilities as an agent or copilot, we employ the self-instruct (Wang et al., 2023c) strategy for SFT. Specifically, we utilize the in-context learning capability of QWEN for self-instruction. By providing a few examples, we can prompt QWEN to generate more relevant queries and generate outputs that follow a specific format, such as ReAct (Yao et al., 2022). We then apply rules and involve human annotators to filter out any noisy samples. Afterwards, the samples are incorporated into QWEN’s training data, resulting in an updated version of QWEN that is more dependable for self-instruction. We iterate through this process multiple times until we gather an ample number of samples that possess both exceptional quality and a wide range of diversity. As a result, our final collection consists of around 2000 high-quality samples.

During the finetuning process, we mix these high-quality samples with all the other general-purpose SFT samples, rather than introducing an additional training stage. By doing so, we are able to retain essential general-purpose capabilities that are also pertinent for constructing agent applications.

\ Using Tools via ReAct Prompting We have created and made publicly available a benchmark for evaluating QWEN’s ability to call plugins, tools, functions, or APIs using ReAct Prompting (see Qwen Team, Alibaba Group, 2023b). To ensure fair evaluation, we have excluded any plugins that were included in QWEN’s training set from the evaluation set. The benchmark assesses the model’s accuracy in selecting the correct plugin from a pool of up to five candidates, as well as the plausibility of the parameters passed into the plugin and the frequency of false positives. In this evaluation, a false positive occurs when the model incorrectly invokes a plugin in response to a query, despite not being required to do so.

The results presented in Table 6 demonstrate that QWEN consistently achieves higher accuracy in identifying the relevance of a query to the available tools as the model size increases. However, the table also highlights that beyond a certain point, there is little improvement in performance when it comes to selecting the appropriate tool and providing relevant arguments. This suggests that the current preliminary benchmark may be relatively easy and may require further enhancement in future iterations. It is worth noting that GPT-3.5 stands out as an exception, displaying suboptimal performance on this particular benchmark. This could potentially be attributed to the fact that the benchmark primarily focuses on the Chinese language, which may not align well with GPT-3.5’s capabilities. Additionally, we observe that GPT-3.5 tends to attempt to use at least one tool, even if the query cannot be effectively addressed by the provided tools.

\ Using Code Interpreter for Math Reasoning and Data Analysis The Python code interpreter is widely regarded as a powerful tool for augmenting the capabilities of an LLM agent. It is worth investigating whether QWEN can harness the full potential of this interpreter to enhance its performance in diverse domains, such as mathematical reasoning and data analysis. To facilitate this exploration, we have developed and made publicly available a benchmark that is specifically tailored for this purpose (see Qwen Team, Alibaba Group, 2023a).

The benchmark encompasses three primary categories of tasks: math problem-solving, data visualization, and other general-purpose tasks like file post-processing and web crawling. Within the visualization tasks, we differentiate between two levels of difficulty. The easier level can be achieved by simply writing and executing a single code snippet without the need for advanced planning skills. However, the more challenging level requires strategic planning and executing multiple code snippets in a sequential manner. This is because the subsequent code must be written based on the output of the previous code. For example, an agent may need to examine the structure of a CSV file using one code snippet before proceeding to write and execute additional code to create a plot.

Regarding evaluation metrics, we consider both the executability and correctness of the generated code. To elaborate on the correctness metrics, for math problems, we measure accuracy by verifying if the ground truth numerical answer is present in both the code execution result and the final response. When it comes to data visualization, we assess accuracy by utilizing QWEN-VL (Bai et al., 2023), a powerful multimodal language model. QWEN-VL is capable of answering text questions paired with images, and we rely on it to confirm whether the image generated by the code fulfills the user’s request.

The results regarding executability and correctness are presented in Table 7 and Table 8, respectively. It is evident that CODE LLAMA generally outperforms LLAMA 2, its generalist counterpart, which is not surprising since this benchmark specifically requires coding skills. However, it is worth noting that specialist models that are optimized for code synthesis do not necessarily outperform generalist models. This is due to the fact that this benchmark encompasses various skills beyond coding, such as abstracting math problems into equations, understanding language-specified constraints, and responding in the specified format such as ReAct. Notably, QWEN-7B-CHAT and QWEN-14B-CHAT surpass all other open-source alternatives of similar scale significantly, despite being generalist models.

\ Serving as a Hugging Face Agent Hugging Face provides a framework called the Hugging Face Agent or Transformers Agent (Hugging Face, 2023), which empowers LLM agents with a curated set of multimodal tools, including speech recognition and image synthesis. This framework allows an LLM agent to interact with humans, interpret natural language commands, and employ the provided tools as needed.

To evaluate QWEN’s effectiveness as a Hugging Face agent, we utilized the evaluation benchmarks offered by Hugging Face. The results are presented in Table 9. The evaluation results reveal that QWEN performs quite well in comparison to other open-source alternatives, only slightly behind the proprietary GPT-4, demonstrating QWEN’s competitive capabilities.

\

4         Code-Qwen: Specialized Model for Coding

Training on domain-specific data has been shown to be highly effective, particularly in the case of code pretraining and finetuning. A language model that has been reinforced with training on code data can serve as a valuable tool for coding, debugging, and interpretation, among other tasks. In this work, we have developed a series of generalist models using pretraining and alignment techniques. Building on this foundation, we have created domain-specific models for coding by leveraging the base language models of QWEN, including continued pretrained model, CODE-QWEN and supervised finetuned model, CODE-QWEN-CHAT. Both models have 14 billion and 7 billion parameters versions.

\

4.1         Code Pretraining

We believe that relying solely on code data for pretraining can result in a significant loss of the ability to function as a versatile assistant. Unlike previous approaches that focused solely on pretraining on code data (Li et al., 2022; 2023d), we take a different approach (Rozie`re et al., 2023) by starting with our base models QWEN trained on a combination of text and code data, and then continuing to pretrain on the code data. We continue to pretrain the models on a total of around 90 billion tokens. During the pre-training phase, we initialize the model using the base language models QWEN. Many applications that rely on specialized models for coding may encounter lengthy contextual scenarios, such as tool usage and code interpretation, as mentioned in Section 3.4. To address this issue, we train our models with context lengths of up to 8192. Similar to base model training in Section 2.4, we employ Flash Attention (Dao et al., 2022) in the attention modules, and adopt the standard optimizer AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017), setting β1 = 0*.9, β2 = 0.95, and ϵ = 10−8. We set the learning rate as 6.0 × 10−5 for CODE-QWEN-14B and 3.*0 × 10−5 for CODE-QWEN-7B, with 3% warm up iterations and no learning rate decays.

\

4.2         Code  Supervised  Fine-Tuning

After conducting a series of empirical experiments, we have determined that the multi-stage SFT strategy yields the best performance compared to other methods. In the supervised fine-tuning stage, the model CODE-QWEN-CHAT initialized by the code foundation model CODE-QWEN are optimized by the AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) optimizer (β1 = 0*.9, β2 = 0.*95, ϵ = 10−8) with a learning rate of 2*.0 × 10−6 and 1.*0 × 10−5 for the 14B and 7B model respectively. The learning rate increases to the peaking value with the cosine learning rate schedule (3% warm-up steps) and then remains constant.

\

4.3         Evaluation

Our CODE-QWEN models have been compared with both proprietary and open-source language models, as shown in Tables 10 and 11. These tables present the results of our evaluation on the test sets of Humaneval (Chen et al., 2021), MBPP (Austin et al., 2021), and the multi-lingual code generation benchmark HUMANEVALPACK (Muennighoff et al., 2023). The comparison is based on the pass@1 performance of the models on these benchmark datasets. The results of this comparison are clearly demonstrated in Tables 10 and 11.

Our analysis reveals that specialized models, specifically CODE-QWEN and CODE-QWEN-CHAT, sig-nificantly outperform previous baselines with similar parameter counts, such as OCTOGEEX (Muen-nighoff et al., 2023), InstructCodeT5+ (Wang et al., 2023d), and CodeGeeX2 (Zheng et al., 2023). In fact, these models even rival the performance of larger models like Starcoder (Li et al., 2023d).

When compared to some of the extremely large-scale closed-source models, CODE-QWEN and CODE-QWEN-CHAT demonstrate clear advantages in terms of pass@1. However, it is important to note that these models fall behind the state-of-the-art methods, such as GPT-4, in general. Nonetheless, with the continued scaling of both model size and data size, we believe that this gap can be narrowed in the near future.

It is crucial to emphasize that the evaluations mentioned previously are insufficient for grasping the full extent of the strengths and weaknesses of the models. In our opinion, it is necessary to develop more rigorous tests to enable us to accurately assess our relative performance in comparison to GPT-4.

\

5         Math-Qwen: Specialized Model for Mathematics Reasoning

We have created a mathematics-specialized model series called MATH-QWEN-CHAT, which is built on top of the QWEN pretrained language models. Specifically, we have developed assistant models that are specifically designed to excel in arithmetic and mathematics and are aligned with human behavior. We are releasing two versions of this model series, MATH-QWEN-14B-CHAT and MATH-QWEN-7B-CHAT, which have 14 billion and 7 billion parameters, respectively.

\

5.1         Training

We carry out math SFT on our augmented math instructional dataset for mathematics reasoning, and therefore we obtain the chat model, MATH-QWEN-CHAT, directly. Owing to shorter average lengths of the math SFT data, we use a sequence length of 1024 for faster training. Most user inputs in the math SFT dataset are examination questions, and it is easy for the model to predict the input format and it is meaningless for the model to predict the input condition and numbers which could be random.

Table 10: Results of pass@1 (%) on HumanEval and MBPP. Most scores are retrieved from the papers of StarCoder (Li et al., 2023d), CodeT5+ (Wang et al., 2023d), WizardCoder (Luo et al., 2023b) and CODE LLAMA (Rozie`re et al., 2023).

Table 11: Zero-shot pass@1 (%) performance on the HUMANEVALPACK (synthesize) benchmark. The baseline results are partly from OCTOPACK (Muennighoff et al., 2023).

\ Table 12: Results of models on mathematical reasoning. We report the accuracy of QWEN for all benchmarks using greedy decoding. For MATH, we are reporting QWEN’s performances on the test set from Lightman et al. (2023).

\ Thus, we mask the inputs of the system and user to avoid loss computation on them and find masking them accelerates the convergence during our preliminary experiments. For optimization, we use the AdamW optimizer with the same hyperparameters of SFT except that we use a peak learning rate of 2 × 10−5 and a training step of 50 000.

\

5.2         Evaluation

We evaluate models on the test sets of GSM8K (Grade school math) (Cobbe et al., 2021), MATH (Challenging competition math problems) (Hendrycks et al., 2021), Math401 (Arithmetic ability) (Yuan et al., 2023b), and Math23K (Chinese grade school math) (Wang et al., 2017). We compare MATH-QWEN-CHAT with proprietary models ChatGPT and Minerva (Lewkowycz et al., 2022) and open-sourced math-specialized model RFT (Yuan et al., 2023a), WizardMath (Luo et al., 2023a), and GAIRMath-Abel (Chern et al., 2023a) in Table 12. MATH-QWEN-CHAT models show better math reasoning and arithmetic abilities compared to open-sourced models and QWEN-CHAT models of similar sizes. Compared to proprietary models, MATH-QWEN-7B-CHAT outperforms Minerva-8B in MATH. MATH-QWEN-14B-CHAT is chasing Minerva-62B and GPT-3.5 in GSM8K and MATH and delivers better performance on arithmetic ability and Chinese math problems.

\

6         Related Work

6.1         Large Language Models

The excitement of LLM began with the introduction of the Transformer architecture (Vaswani et al., 2017), which was then applied to pretraining large-scale data by researchers such as Radford et al. (2018); Devlin et al. (2018); Liu et al. (2019). These efforts led to significant success in transfer learning, with model sizes growing from 100 million to over 10 billion parameters (Raffel et al., 2020; Shoeybi et al., 2019).

In 2020, the release of GPT-3, a massive language model that is 10 times larger than T5, demonstrated the incredible potential of few-shot and zero-shot learning through prompt engineering and in-context learning, and later chain-of-thought prompting (Wei et al., 2022c). This success has led to a number of studies exploring the possibilities of further scaling these models (Scao et al., 2022; Zhang et al., 2022; Du et al., 2021; Zeng et al., 2022; Lepikhin et al., 2020; Fedus et al., 2022; Du et al., 2022; Black et al., 2022; Rae et al., 2021; Hoffmann et al., 2022; Chowdhery et al., 2022; Thoppilan et al., 2022). As a result, the community has come to view these large language models as essential foundations for downstream models (Bommasani et al., 2021).

The birth of ChatGPT (OpenAI, 2022) and the subsequent launch of GPT-4 (OpenAI, 2023) marked two historic moments in the field of artificial intelligence, demonstrating that large language models (LLMs) can serve as effective AI assistants capable of communicating with humans. These events have sparked interests among researchers and developers in building language models that are aligned with human values and potentially even capable of achieving artificial general intelligence (AGI) (Anil et al., 2023; Anthropic, 2023a;b).

One notable development in this area is the emergence of open-source LLMs, specifically LLaMA (Touvron et al., 2023a) and LLAMA 2 (Touvron et al., 2023b), which have been recognized as the most powerful open-source language models ever created. This has led to a surge of activity in the open-source community (Wolf et al., 2019), with a series of large language models being developed collaboratively to build upon this progress (Mosaic ML, 2023; Almazrouei et al., 2023; ChatGLM2 Team, 2023; Yang et al., 2023; InternLM Team, 2023).

\

6.2         Alignment

The community was impressed by the surprising effectiveness of alignment on LLMs. Previously, LLMs without alignment often struggle with issues such as repetitive generation, hallucination, and deviation from human preferences. Since 2021, researchers have been diligently working on developing methods to enhance the performance of LLMs in downstream tasks (Wei et al., 2022a; Sanh et al., 2021; Longpre et al., 2023; Chung et al., 2022; Muennighoff et al., 2022). Furthermore, researchers have been actively exploring ways to align LLMs with human instructions (Ouyang et al., 2022; Askell et al., 2021; Bai et al., 2022b;c). One major challenge in alignment research is the difficulty of collecting data. While OpenAI has utilized its platform to gather human prompts or instructions, it is not feasible for others to collect such data.

However, there has been some progress in this area, such as the self-instruct approach proposed in Wang et al. (2023c). This innovative work offers a potential solution to the data collection problem in alignment research. As a result, there has been a surge in open-source chat data, including Alpaca (Taori et al., 2023), MOSS (Sun et al., 2023a), Dolly (Conover et al., 2023), Evol-Instruct (Xu et al., 2023b), and others (Sun et al., 2023b; Xu et al., 2023a;c; Chen et al., 2023c; Ding et al., 2023; Ji et al., 2023; Yang, 2023). Similarly, there has been an increase in open-source chat models, such as Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), Guanaco (Dettmers et al., 2023), MOSS (Sun et al., 2023a), WizardLM (Xu et al., 2023b), and others (Xu et al., 2023c; Chen et al., 2023c; Ding et al., 2023; Wang et al., 2023b).

To train an effective chat model, available solutions are mostly based on SFT and RLHF (Ouyang et al., 2022). While SFT is similar to pretraining, it focuses on instruction following using the aforementioned data. However, for many developers, the limited memory capacity is a major obstacle to further research in SFT. As a result, parameter-efficient tuning methods, such as LoRA (Hu et al., 2021) and Q-LoRA (Dettmers et al., 2023), have gained popularity in the community. LoRA tunes only low-rank adapters, while Q-LoRA builds on LoRA and utilizes 4-bit quantized LLMs and paged attention (Dettmers et al., 2022; Frantar et al., 2022; Kwon et al., 2023). In terms of RLHF, recent methods such as PPO (Schulman et al., 2017; Touvron et al., 2023b) have been adopted, but there are also alternative techniques aimed at addressing the complexity of optimization, such as RRHF (Yuan et al., 2023c), DPO (Rafailov et al., 2023), and PRO (Song et al., 2023). Despite the ongoing debate about the effectiveness of RLHF, more evidence is needed to understand how it enhances the intelligence of LLMs and what potential drawbacks it may have.

\

6.3         Tool Use and Agents

LLM’s planning function allows for the invocation of tools, such as APIs or agent capabilities, through in-context learning, as demonstrated by Schick et al. (2023). Yao et al. (2022) introduced ReAct, a generation format that enables the model to generate thoughts on which tool to use, accept input from API observations, and generate a response. GPT-3.5 and GPT-4, when prompted with few shots, have shown consistent and impressive performance. In addition to tool usage, LLMs can utilize external memory sources like knowledge bases (Hu et al., 2023; Zhong et al., 2023b) or search engines (Nakano et al., 2021; Liu et al., 2023b) to generate more accurate and informative answers. This has led to the popularity of frameworks like LangChain (LangChain, Inc., 2023). The research on LLMs for tool use has also sparked interest in building agents with LLM capabilities, such as agents that can call different AI models (Shen et al., 2023; Li et al., 2023a), embodied lifelong learning or multimodal agents (Wang et al., 2023a; Driess et al., 2023), and multiple agents interacting with each other and even building a micro-society (Chen et al., 2023b; Li et al., 2023b; Xu et al., 2023d; Hong et al., 2023).

\

6.4         LLM for Coding

Previous research has demonstrated that LLMs possess remarkable capabilities in code understanding and generation, particularly those with massive numbers of parameters (Chowdhery et al., 2022; Anil et al., 2023; Rae et al., 2021; Hoffmann et al., 2022). Moreover, several LLMs have been pre-trained, continued pre-trained, or fine-tuned on coding-related data, which has resulted in significantly improved performance compared to general-purpose LLMs. These models include Codex Chen et al. (2021), AlphaCode (Li et al., 2022), SantaCoder (Allal et al., 2023), Starcoder-Base (Li et al., 2023d), InCoder (Fried et al., 2022), CodeT5 (Wang et al., 2021), CodeGeeX (Zheng et al., 2023), and CODE LLAMA (Rozie`re et al., 2023). In addition to these models, recent studies have focused on developing specialized alignment techniques for coding, such as Code Llama-Instruct (Rozie`re et al., 2023) and StarCoder (Li et al., 2023d). These models can assist developers in various code-related tasks, including code generation (Chen et al., 2021; Austin et al., 2021), code completion (Zhang et al., 2023a), code translation (Szafraniec et al., 2023), bug fixing (Muennighoff et al., 2023), code refinement (Liu et al., 2023c), and code question answering (Liu & Wan, 2021). In a word, LLMs have the potential to revolutionize the field of coding by providing developers with powerful tools for code comprehension, generation, and related tasks.

\

6.5         LLM for Mathematics

LLMs with a certain model scale have been found to possess the ability to perform mathematical reasoning (Wei et al., 2022b; Suzgun et al., 2022). In order to encourage LLMs to achieve better performance on math-related tasks, researchers have employed techniques such as chain-of-thought prompting (Wei et al., 2022c) and scratchpad (Nye et al., 2021), which have shown promising results. Additionally, self-consistency (Wang et al., 2022) and least-to-most prompting (Zhou et al., 2022) have further improved the performance of these models on these tasks. However, prompt engineering is a time-consuming process that requires a lot of trial and error, and it is still difficult for LLMs to consistently perform well or achieve satisfactory results in solving mathematical problems. Moreover, simply scaling the data and model size is not an efficient way to improve a model’s mathematical reasoning abilities. Instead, pretraining on math-related corpora has been shown to consistently enhance these capabilities (Hendrycks et al., 2021; Lewkowycz et al., 2022; Taylor et al., 2022; Lightman et al., 2023). Additionally, fine-tuning on math-related instruction-following datasets (Si et al., 2023; Yuan et al., 2023a; Luo et al., 2023a; Yue et al., 2023; Chern et al., 2023a; Yu et al., 2023), has also been effective and more cost-effective than math-specific pretraining. Despite their limitations in terms of accuracy, LLMs still have significant potential to assist users with practical mathematical problems. There is ample scope for further development in this area.

\

7         Conclusion

In this report, we present the QWEN series of large language models, which showcase the latest advancements in natural language processing. With 14B, 7B, and 1.8B parameters, these models have been pre-trained on massive amounts of data, including trillions of tokens, and fine-tuned using cutting-edge techniques such as SFT and RLHF. Additionally, the QWEN series includes specialized models for coding and mathematics, such as CODE-QWEN, CODE-QWEN-CHAT, and MATH-QWEN-CHAT, which have been trained on domain-specific data to excel in their respective fields. Our results demonstrate that the QWEN series is competitive with existing open-source models and even matches the performance of some proprietary models on comprehensive benchmarks and human evaluation.

We believe that the open access of QWEN will foster collaboration and innovation within the community, enabling researchers and developers to build upon our work and push the boundaries of what is possible with language models. By providing these models to the public, we hope to inspire new research and applications that will further advance the field and contribute to our understanding of the variables and techniques introduced in realistic settings. In a nutshell, the QWEN series represents a major milestone in our development of large language models, and we are excited to see how it will be used to drive progress and innovation in the years to come.

\n

References

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. SantaCoder: Don’t reach for the stars! arXiv preprint arXiv:2301.03988, 2023.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Co-jocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. Falcon-40B: An open large language model with state-of-the-art performance, 2023.

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

Anthropic. Introducing Claude, 2023a. URL https://www.anthropic.com/index/ introducing-claude.nthropic. Claude 2. Technical report, Anthropic, 2023b. URL https://www-files. anthropic.com/production/images/Model-Card-Claude-2.pdf.

Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. ExT5: Towards extreme multi-task scaling for transfer learning. arXiv preprint arXiv:2111.10952, 2021.

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

AutoGPT. AutoGPT: The heart of the open-source agent ecosystem, 2023. URL https:// github.com/Significant-Gravitas/Auto-GPT.

Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. URL http://arxiv.org/abs/1607.06450.

Jinze Bai, Rui Men, Hao Yang, Xuancheng Ren, Kai Dang, Yichang Zhang, Xiaohuan Zhou, Peng Wang, Sinan Tan, An Yang andf Zeyu Cui, Yu Han, Shuai Bai, Wenbin Ge, Jianxin Ma, Junyang Lin, Jingren Zhou, and Chang Zhou. OFASys: A multi-modal multi-task learning system for building generalist models. CoRR, abs/2212.04408, 2022a. doi: 10.48550/arXiv.2212.04408. URL https://doi.org/10.48550/arXiv.2212.04408.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. CoRR, abs/2308.12966, 2023. doi: 10.48550/arXiv.2308.12966. URL https://doi.org/10.48550/arXiv.2308.12966.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022b.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022c.

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Con-ference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 7432–7439. AAAI Press, 2020. doi:

10.1609/aaai.v34i05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.

bloc97.       NTK-aware scaled RoPE allows LLaMA models to have extended (8k+) con-text size without any fine-tuning and minimal perplexity degradation., 2023.                  URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware scaledropeallowsllamamodelsto_have/.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportuni-ties and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

ChatGLM2 Team. ChatGLM2-6B: An open bilingual chat LLM, 2023. URL https://github. com/THUDM/ChatGLM2-6B.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde´ de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv. org/abs/2107.03374.

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023a.

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848, 2023b.

Zhihong Chen, Feng Jiang, Junying Chen, Tiannan Wang, Fei Yu, Guiming Chen, Hongbo Zhang, Juhao Liang, Chen Zhang, Zhiyi Zhang, et al. Phoenix: Democratizing ChatGPT across languages. arXiv preprint arXiv:2304.10453, 2023c.

Ethan Chern, Haoyang Zou, Xuefeng Li, Jiewen Hu, Kehua Feng, Junlong Li, and Pengfei Liu. Generative ai for math: Abel. https://github.com/GAIR-NLP/abel, 2023a.

I Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, et al. Factool: Factuality detection in generative ai–a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528, 2023b.

David Chiang and Peter Cholak. Overcoming a theoretical limitation of self-attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7654–7664, 2022.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neu-ral Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 4299–4307, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/ d5e2c0adad503c91f91df240d0cd4e49-Abstract.html.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North Amer-ican Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 2924–2936. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1300. URL https://doi.org/10.18653/v1/n19-1300.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Fran-cisco Guzma´n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023. URL https://www.databricks.com/blog/2023/04/ 12/dolly-first-open-commercially-viable-instruction-tuned-llm.

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng, Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re´.  FlashAt-tention: Fast and memory-efficient exact attention with io-awareness.  In NeurIPS, 2022.          URL http://papers.nips.cc/paper_files/paper/2022/hash/ 67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html.

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International conference on machine learning, pp. 933–941. PMLR, 2017.

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547–5569. PMLR, 2022.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.

Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset difficulty with V-usable information. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5988–6008. PMLR, 17–23 Jul 2022.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1): 5232–5270, 2022.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida I. Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen tau Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. ArXiv, abs/2204.05999, 2022.

Google. An important next step on our AI journey, 2023. URL https://blog.google/ technology/ai/bard-google-ai-search-updates/.

Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606. 08415.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.

Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, and Hang Zhao. Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901, 2023.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Ku¨bler, and Lawrence S. Moss. OCNLI: original chinese natural language inference. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pp. 3512–3526. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.findings-emnlp.314. URL https://doi.org/10.18653/v1/2020.findings-emnlp.314.

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.

Hugging Face. Transformers agents, 2023. URL https://huggingface.co/docs/ transformers/transformers_agents.

Baichuan Inc. Baichuan-7B: A large-scale 7B pretraining language model developed by BaiChuan-Inc, 2023a. URL https://github.com/baichuan-inc/Baichuan-7B.

XVERSE Technology Inc.  XVERSE-13B: A multilingual large language model devel-oped by XVERSE Technology Inc., 2023b. URL https://github.com/xverse-ai/ XVERSE-13B.

InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities, 2023. URL https://github.com/InternLM/InternLM.

Shantanu Jain. tiktoken: A fast BPE tokeniser for use with OpenAI’s models, 2022. URL https://github.com/openai/tiktoken/.

Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Lei Zhang, Baochang Ma, and Xiangang Li. Exploring the impact of instruction data scaling on large language models: An empirical study on real-world use cases. arXiv preprint arXiv:2303.14742, 2023.

Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. Pre-RMSNorm and Pre-CRMSNorm transformers: Equivalent and efficient pre-LN transformers. CoRR, abs/2305.14858, 2023. doi: 10.48550/arXiv.2305.14858. URL https://doi.org/10.48550/arXiv.2305.14858.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452–466, 2019. doi: 10.1162/tacl*\* a*\* 00276. URL https://doi.org/10. 1162/tacla00276.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

LangChain, Inc. LangChain: Building applications with LLMs through composability, 2023. URL https://python.langchain.com/.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022.

Chenliang Li, Hehong Chen, Ming Yan, Weizhou Shen, Haiyang Xu, Zhikai Wu, Zhicheng Zhang, Wenmeng Zhou, Yingda Chen, Chen Cheng, et al. ModelScope-Agent: Building your customizable agent system with open-source large language models. arXiv preprint arXiv:2309.00986, 2023a.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for “mind” exploration of large scale language model society. arXiv preprint arXiv:2303.17760, 2023b.

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023c.

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, Joa˜o Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Moustafa-Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Mun˜oz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. StarCoder: May the source be with you! CoRR, abs/2305.06161, 2023d. doi: 10.48550/arXiv.2305.06161. URL https://doi.org/10.48550/arXiv.2305.06161.

Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Re´mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with AlphaCode. CoRR, abs/2203.07814, 2022.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.

Chenxiao Liu and Xiaojun Wan. CodeQA: A question answering dataset for source code com-prehension. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pp. 2618–2632. Associa-tion for Computational Linguistics, 2021. doi: 10.18653/v1/2021.findings-emnlp.223. URL https://doi.org/10.18653/v1/2021.findings-emnlp.223.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.

Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, and Jie Tang. WebGLM: Towards an efficient web-enhanced question answering system with human preferences. arXiv preprint arXiv:2306.07906, 2023b.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach Dinh Le, and David Lo. Refining ChatGPT-generated code: Characterizing and mitigating code quality issues. CoRR, abs/2307.12596, 2023c. doi: 10.48550/arXiv.2307.12596. URL https://doi.org/10.48550/arXiv.2307.12596.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The Flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. CoRR, abs/2308.07074, 2023. doi: 10.48550/arXiv.2308.07074. URL https://doi. org/10.48550/arXiv.2308.07074.

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023a.

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023b.

Mosaic ML. MPT-30B: Raising the bar for open-source foundation models, 2023. URL https://www.mosaicml.com/blog/mpt-30b.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786, 2022.

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. OctoPack: Instruction tuning code large language models. CoRR, abs/2308.07124, 2023.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

Maxwell Nye, Anders Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. ArXiv, abs/2112.00114, 2021.

OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.

OpenAI. ChatML, 2022. URL https://github.com/openai/openaipython/blob/e389823ba013a24b4c32ce38fa0bd87e6bccae94/chatml.md.

OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023.

OpenCompass Team. OpenCompass: A universal evaluation platform for foundation models, 2023. URL https://opencompass.org.cn/leaderboard-llm.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/ b1efde53be364a73914f58805a001731-Abstract-Conference.html.

Denis Paperno, Germa´n Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Ferna´ndez. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016. doi: 10.18653/v1/ p16-1144. URL https://doi.org/10.18653/v1/p16-1144.

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023a.

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023b.

Qwen Team, Alibaba Group. Evaluation benchmark for code intepreter, 2023a. URL https://github.com/QwenLM/Qwen-Agent/tree/main/benchmark.

Qwen Team, Alibaba Group. Evaluation benchmark for tool usage through ReAct prompting, 2023b. URL https://github.com/QwenLM/Qwen-7B/tree/main/eval.

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Scott E. Reed, Konrad Zolna, Emilio Parisotto, Sergio Go´mez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=1ikK0kHjvj.

Baptiste Rozie`re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Je´re´my Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. SocialIQA: Com-monsense reasoning about social interactions. CoRR, abs/1904.09728, 2019. URL http:

//arxiv.org/abs/1904.09728.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic´, Daniel Hesslow, Roman Castagne´, Alexandra Sasha Luccioni, Franc¸ois Yvon, Matthias Galle´, et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Timo Schick, Jane Dwivedi-Yu, Roberto Dess`ı, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hug-gingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv preprint arXiv:2303.17580, 2023.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan-zaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Qingyi Si, Tong Wang, Naibin Gu, Rui Liu, and Zheng Lin. Alpaca-CoT: An instruction-tuning platform with unified interface of instruction collection, parameter-efficient methods, and large language models, 2023. URL https://github.com/PhoebusSi/alpaca-CoT.

Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492, 2023.

Stability AI. StableBeluga2, 2023. URL https://huggingface.co/stabilityai/ StableBeluga2.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

Jianlin Su. Improving transformer: Length extrapolation ability and position robustness, 2023a. URL

https://spaces.ac.cn/archives/9444.

Jianlin Su. The magical effect of the Bias term: RoPE + Bias = better length extrapolation, 2023b. URL https://spaces.ac.cn/archives/9577.

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Hang Yan, Xiangyang Liu, Yunfan Shao, Qiong Tang, Xingjian Zhao, Ke Chen, Yining Zheng, Zhejian Zhou, Ruixiao Li, Jun Zhan, Yunhua Zhou, Linyang Li, Xiaogui Yang, Lingling Wu, Zhangyue Yin, Xuanjing Huang, and Xipeng Qiu. MOSS: Training conversational language models from synthetic data, 2023a.

Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023b.

Mirac Suzgun, Nathan Scales, Nathanael Scha¨rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

Marc Szafraniec, Baptiste Rozie`re, Hugh Leather, Patrick Labatut, Franc¸ois Charton, and Gabriel Synnaeve. Code translation with compiler representations. In The Eleventh International Confer-ence on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=XomEU3eNeSQ.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4149–4158. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1421. URL https://doi.org/10.18653/v1/n19-1421.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model, 2023. URL https://github.com/tatsu-lab/stanford_alpaca.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science, 2022.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Ol-son, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Agu¨era y Arcas, Claire Cui, Marian Croak, Ed H. Chi, and Quoc Le. LaMDA: Language models for dialog applications. CoRR, abs/2201.08239, 2022. URL https://arxiv.org/abs/2201.08239.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothe´e Lacroix, Baptiste Rozie`re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aure´lien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/ 10.48550/arXiv.2307.09288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Huai hsin Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. ArXiv, abs/2203.11171, 2022.

Yan Wang, Xiaojiang Liu, and Shuming Shi. Deep neural solver for math word problems. In Conference on Empirical Methods in Natural Language Processing, 2017. URL https://api. semanticscholar.org/CorpusID:910689.

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? Exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023b.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 13484–13508. Association for Computational Linguistics, 2023c. doi: 10.18653/v1/2023.acl-long.754. URL https://doi.org/10.18653/v1/ 2023.acl-long.754.

Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859, 2021.

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. CodeT5+: Open code large language models for code understanding and generation. CoRR, abs/2305.07922, 2023d. doi: 10.48550/arXiv.2305.07922. URL https://doi.org/10. 48550/arXiv.2305.07922.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a. URL https://openreview.net/forum?id= gEZrGCozdqR.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed Huai hsin Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022, 2022b. URL https://api.semanticscholar.org/ CorpusID:249674500.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022c.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Re´mi Louf, Morgan Funtowicz, et al. HuggingFace’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, and Zhendong Mao. ExpertPrompting: Instructing large language models to be distinguished experts. arXiv preprint arXiv:2305.14688, 2023a.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023b.

Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196, 2023c.

Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. Exploring large language models for communication games: An empirical study on werewolf. arXiv preprint arXiv:2309.04658, 2023d.

Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. Baichuan 2: Open large-scale language models. Technical report, Baichuan Inc., 2023. URL https://cdn.baichuan-ai.com/paper/Baichuan2-technical-report. pdf.

Jianxin Yang. Firefly. https://github.com/yangjianxin1/Firefly, 2023.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023.

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023a.

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015, 2023b.

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. RRHF: Rank responses to align language models with human feedback without tears, 2023c.

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David R. Traum, and Llu´ıs Ma`rquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 4791–4800. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. CoRR, abs/2303.12570, 2023a. doi: 10.48550/arXiv.2303.12570. URL https://doi.org/ 10.48550/arXiv.2303.12570.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on GAOKAO benchmark. CoRR, abs/2305.12474, 2023b. doi: 10.48550/arXiv.2305.12474. URL https://doi.org/10.48550/arXiv. 2305.12474.

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on humaneval-x. CoRR, abs/2303.17568, 2023. doi: 10.48550/arXiv.2303.17568. URL https://doi.org/10.48550/arXiv.2303.17568.

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023a. doi: 10.48550/arXiv.2304.06364. URL https://doi.org/ 10.48550/arXiv.2304.06364.

Wanjun Zhong, Lianghong Guo, Qiqi Gao, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. arXiv preprint arXiv:2305.10250, 2023b.

Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Huai hsin Chi. Least-to-most prompting enables complex reasoning in large language models. ArXiv, abs/2205.10625, 2022.

\n

A       Appendix

A.1          More Training Details

A.1.1           Data Format for Qwen-Chat

Different from conventional pretraining based on autoregressive next-token prediction, despite using a similar training task, there should be a specially design data format for SFT and RLHF to build a conversational AI assistant model. Common formats include “human-assistant” and ChatML formats. As to our knowledge, one of the earliest examples of the human-assistant format comes from Anthropic (Bai et al., 2022b), which adds a special phrase “\n\nhuman: ” in front of the user input and “\n\nassistant: ” in front of the assistant response. It is easy for the base language model to transfer to the pattern of conversational AI. However, as the specific phrases are common words, it might be hard for the model to disambiguate from these words in other contexts.

Instead, we turned to the ChatML format proposed by OpenAI.5 This format allows the use of special tokens, i.e., “<im_start>” and “<im_end>”, that do not appear in pretraining, and thus resolve the aforementioned problem. We demonstrate an example of the format below.

\

A.2          Evaluation

A.2.1           Automatic Evaluation

To provide a whole picture of the performance of our model series QWEN, here in this section we illustrate the detailed performance of our models as well as the baselines in the comprehensive benchmark evaluation proposed by OpenCompass Team (2023). We report the results in multiple tables based on the officially provided categories, including examination, language, knowledge, understanding, and reasoning. In terms of the performance of the baseline models, we report the higher results between the reported ones and those on the leaderboard.

\ Examination Here we evaluate the models on a series of datasets relevant to the examination. The datasets include:

  • MMLU (Hendrycks et al., 2020) Massive Multi-task Language Understanding is designed for measuring language understanding capabilities. We report 5-shot results.

  • C-Eval (Huang et al., 2023) C-Eval is a Chinese evaluation dataset spanning 52 diverse disciplines. We report 5-shot results.

  • CMMLU (Li et al., 2023c) CMMLU is designed for assessing language understanding capabilities in Chinese. We report 5-shot results.

  • AGIEval (Zhong et al., 2023a) This is a benchmark consisting of human-centric examina-tions, including college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We report zero-shot results.

  • Gaokao-Bench (Zhang et al., 2023b) This is a benchmark with Gaokao (Chinese college-entrance examination) questions. We report zero-shot results.

  • ARC (Clark et al., 2018) ARC is a dataset consisting of grade-school level, multiple-choice science questions. It includes an easy set and a challenge set, which are referred by ARC-e and ARC-c. We report zero-shot results.

    Table 13: Results on MMLU. All are tested with five-shot accuracy. We provide the reported results of the other models for comparison.

    Table 14: Leaderboard results of C-Eval. We include the results of both proprietary models and open-source models. Note that there are a number of models on the leaderboard with very few details, in terms of proprietary models, we only report the results of GPT-3.5, GPT-4, InternLM and ChatGLM2.

    \n In terms of MMLU, we report the detailed results in Table 13. In terms of C-Eval, we report the results in Table 14. For the rest of the datasets, we report the results in Table 15. Note that AGIEval includes the parts of Chinese and English, while LLAMA 2 only reported the results in the English part, so we use the results on OpenCompass.

    Table 15: Results on the other datasets of examination. Specifically, we report the results on CMMLU, AGIEval, ARC-e, and ARC-c.

\ Additionally, while CMMLU, AGIEval, and Gaokao-Bench are related to Chinese, and MPT, Falcon, and the LLaMA series were not optimized for Chinese, these models achieved low performance on the datasets.

\ Knowledge and Understanding Here we evaluate the models on a series of datasets relevant to knowledge and natural language understanding. The datasets include

  • BoolQ (Clark et al., 2019) This is a QA dataset, where the questions are about passages of Wikipedia, and the model should answer yes or no to the given possible answer. We report zero-shot results.
  • CommonsenseQA (Talmor et al., 2019) This is a dataset of multiple-choice question answering that asseses the understanding of commonsense knowledge. We report 8-shot results.
  • NaturalQuestions (Kwiatkowski et al., 2019) It is a dataset of QA where the questions are from users and the answers are verified by experts. We report zero-shot results.
  • LAMBADA (Paperno et al., 2016) This is dataset to evaluate language understanding by word prediction. It consists of passages related to human subjects. We report zero-shot results.

We report the results in Table 16.

Reasoning We report the evaluation results on the datasets concerning reasoning, focusing on natural language reasoning. For the others, such as mathematics and coding, as we have illustrated detailed results, here we do not report those results repeatedly. The datasets for evaluation include:

  • HellaSwag (Zellers et al., 2019) This is a commonsense natural language inference (NLI) dataset, where the questions are easy for humans but struggling for previous language models. We report zero-shot results.

  • PIQA (Bisk et al., 2020) This is an NLI dataset assessing the physical knowledge. We report zero-shot results.

    Table 16: Results on the datasets concerning knowledge and understanding. Specifically, we report the results on BoolQ, CommonsenseQA, NaturalQuestions, and LAMBADA.

    Table 17: Results on the datasets related to natural language reasoning. Specifically, we report the results on HellaSwag, PIQA, SIQA, and OCNLI.

\

  • SIQA (Sap et al., 2019) This is an NLI dataset evaluating social commonsense intelligence. We report zero-shot results.
  • OCNLI (Hu et al., 2020) This is an NLI dataset focusing on Chinese. We report zero-shot results.

We report the results in Table 17.

\

A.2.2           Human Evaluation

In this section, we demonstrate the cases of human analysis. In our self-constructed evaluation dataset, the instructions are either manually written data or manual revised from public datasets, such as CLiB6, C-Eval (Huang et al., 2023), FacTool (Chern et al., 2023b), LeetCode7), etc.

In terms of each case, we demonstrate the responses and Elo ratings8 of all models for comparison. Specifically, as the data in our human evaluation are in Chinese, we also provide their translations in English.

\

A.3          Analysis of Code Interpreter

Here we provide a case of comparison between CODE LLAMA and QWEN-CHAT. This case demonstrates the advantages of QWEN-CHAT in processing tabular data and performing complex tasks.

Figure 5: Example showcasing QWEN-CHAT’s ability in using a code interpreter via ReAct prompting. The ReAct instruction is omitted for clarity. QWEN creates a two-step plan and first investigates the columns present in the CSV file before proceeding to draw the plot, as shown in the top-left figure. CODE LLAMA, however, attempts to draw the plot based on non-existent columns in its initial attempt, as seen in the bottom figure. CODE LLAMA can only reliably perform the task if the columns are provided in the user query, as shown in the top-right figure.

\

:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

\