2025-05-20 18:00:03
Chameleon builds upon the lineage of works exploring token-based approaches for multimodal learning. The idea of using discrete tokens to represent continuous modalities like images was first explored in works like BEiT (Bao et al., 2021), which proposed a self-supervised vision representation learning method based on tokenized image patches. Aghajanyan et al. (2022) extended this idea to learning from mixed-modal documents through interleaved image and text tokens, allowing for joint reasoning over both modalities within a unified architecture. CM3Leon (Yu et al., 2023) further scaled up this approach to autoregressive text-to-image generation, building on the initial proposal of token-based image generation in DALL-E (Ramesh et al., 2021).
\ As a fully token-based early-fusion model, Chameleon differs from late-fusion approaches like Flamingo (Alayrac et al., 2022) which encode images and text separately before combining them at a later stage. Other models like LLaVA (Liu et al., 2023a), IDEFICS (Laurençon et al., 2023), and VisualGPT (Chen et al., 2022) also maintain separate image and text encoders. In contrast, Chameleon’s unified token space allows it to seamlessly reason over and generate interleaved image and text sequences, without the need for modality-specific components. This early-fusion approach, however, comes with significant challenges in terms of representation learning and alignment, as discussed in Baltrušaitis et al. (2018).
\ The most similar model to Chameleon is Gemini (Gemini et al., 2023), which also uses an early-fusion token-based approach. However, a key difference is that Gemini uses separate image decoders, whereas Chameleon is an end-to-end dense model without any routing components. This makes Chameleon a more general-purpose model for both multimodal understanding and generation tasks, similar in spirit to the Perceiver (Jaegle et al., 2021) architecture which also aims for a unified model across modalities and tasks.
\ In summary, Chameleon builds on a rich history of work in multimodal learning and token-based architectures, while pushing the boundaries in terms of model scale and architecture design. By demonstrating strong performance across a wide range of vision-language tasks and enabling new capabilities in mixed-modal reasoning and generation, Chameleon represents a significant step towards realizing the vision of general-purpose multimodal foundation models.
\
:::info Author:
(1) Chameleon Team, FAIR at Meta.
:::
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::
\
2025-05-20 17:00:03
We next evaluate Chameleon on tasks that require text generation conditioned on an image, specifically image captioning and visual question answering, and present results for Chameleon-34B in Table 7. Alongside our pre-trained model, we also report results for a model fine-tuned jointly on all tasks (Chameleon-34B-MultiTask), as well as models fine-tuned exclusively on the specific evaluation tasks (Chameleon-34B-SFT).
\ We evaluate against available open-source late-fusion models, specifically Flamingo 80B (Alayrac et al., 2022), IDEFICS 80B (Laurençon et al., 2023), and LLaVA-1.5 (Liu et al., 2023a), as well as recent closed-source models such as Gemini (Gemini et al., 2023) and GPT-4V (OpenAI, 2023). We note that we did not take any special care when formatting the pre-training data to ensure that 0-shot inference can be done effectively; this was a deliberate choice to maintain the fidelity of the pre-training data. We therefore augment the input images or questions with the published prompts used by other models.
\ • Image Captioning: For image captioning evaluations we report CIDEr (Vedantam et al., 2015) scores on the Karpathy test split of MS-COCO (Lin et al., 2014) and the Karpathy test split of Flickr30k (Plummer et al., 2015), using the pycocoevalcap (Chen et al., 2020) package; a minimal CIDEr scoring sketch appears at the end of this subsection. For Chameleon models, we restrict captions to 30 tokens. We evaluated GPT-4V and Gemini models using several prompts and generation lengths via their APIs and report the best performance we were able to achieve.
\ In the open-source pre-trained category, Chameleon-34B (2-shot) outperforms the 32-shot results of the larger 80B Flamingo and IDEFICS models on COCO, while matching their performance on Flickr30k. Among fine-tuned/closed-source models, both the multi-task and SFT variants of Chameleon-34B outperform all other models on COCO; on Flickr30k, the SFT model leads, with the multi-task model a close second.
\ • Visual Question Answering: For visual question answering (VQA) we report performance on the test-dev split of VQA-v2 (Goyal et al., 2017). The pre-trained Chameleon-34B model with 2 shots matches the 32-shot performance of the larger Flamingo and IDEFICS models, while among fine-tuned/closed models, Chameleon-34B-Multitask approaches the performance of IDEFICS-80B-Instruct and Gemini Pro but trails larger models such as Flamingo-80B-FT, GPT-4V, and Gemini Ultra. LLaVA-1.5 outperforms Chameleon-34B on VQA-v2, potentially owing to its additional fine-tuning on conversations from GPT-4, ShareGPT (ShareGPT, 2023), GQA (Hudson and Manning, 2019), and region-level VQA datasets, but trails significantly behind on the other tasks.
\ In general, we find Chameleon to be fairly competitive on both image captioning and VQA tasks. It rivals other models while using far fewer in-context examples and smaller model sizes, in both the pre-trained and fine-tuned evaluations.
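\ To make the captioning metric above concrete, here is a minimal sketch of CIDEr scoring with the pycocoevalcap package; the image IDs and captions are illustrative toy data, real evaluations first run the package's PTB tokenizer over references and candidates, and CIDEr's corpus statistics make scores on such tiny examples unrepresentative.

```python
# Hedged sketch: CIDEr scoring with pycocoevalcap (illustrative toy data, not the paper's pipeline).
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of captions; each candidate list holds a single caption.
# Captions are assumed to be lower-cased and whitespace-tokenized already.
references = {
    "img1": ["a dog runs across the park", "a brown dog running on grass"],
    "img2": ["two people ride bicycles down the street"],
}
candidates = {
    "img1": ["a dog running through a park"],
    "img2": ["people riding bikes on a street"],
}

scorer = Cider()
corpus_score, per_image_scores = scorer.compute_score(references, candidates)

print(f"CIDEr (corpus): {corpus_score:.3f}")
# Per-image scores follow the iteration order of the reference dict's keys.
for img_id, score in zip(references.keys(), per_image_scores):
    print(img_id, round(float(score), 3))
```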
\
:::info Author:
(1) Chameleon Team, FAIR at Meta.
:::
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::
\
2025-05-20 16:00:03
Given the general capabilities of Chameleon, there is no single model that we can directly evaluate against; we therefore evaluate against the best models in each capability category.
We evaluate the general text-only capabilities of our pre-trained (not SFT'd) model against other state-of-the-art text-only large language models, following the evaluation protocol outlined by Touvron et al. (2023). Specifically, we evaluate all models using an in-house evaluation platform on commonsense reasoning, reading comprehension, math problems, and world knowledge. We report our results in Table 6.
\
\ • Commonsense Reasoning and Reading Comprehension: We report 0-shot performance on the following benchmarks that measure commonsense reasoning and reading comprehension capabilities: PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-Easy (Clark et al., 2018), ARC-Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and BoolQ (Clark et al., 2019). We score the prompt with each candidate answer and take the highest-scoring candidate as the prediction when computing accuracy (a minimal scoring sketch appears after this list). With a few exceptions, baseline model performances are taken directly from their reported sources. We observe that Chameleon-7B and Chameleon-34B are competitive with the corresponding Llama-2 models, with Chameleon-34B even outperforming Llama-2 70B on 5/8 tasks and performing on par with Mixtral 8x7B.
\ • Math and World Knowledge: We report 8-shot performance on GSM8K (Cobbe et al., 2021), a benchmark of grade-school math word problems, and 4-shot performance on the MATH (Hendrycks et al., 2021) benchmark. For both, we report maj@N exact-match accuracy, obtained by sampling N generations from the model (greedy decoding for N=1) and choosing the final answer by majority vote (a minimal voting sketch appears at the end of this subsection). Despite training for additional modalities, both Chameleon models demonstrate strong math capabilities. On GSM8K, Chameleon-7B outperforms the corresponding Llama-2 models, with performance comparable to Mistral 7B (50.9 vs. 52.1 maj@8). Furthermore, Chameleon-34B outperforms Llama-2 70B on maj@1 (61.4 vs. 56.8) and Mixtral 8x7B on maj@32 (77.0 vs. 75.1). Similarly, on MATH, Chameleon-7B outperforms Llama-2 and matches Mistral 7B on maj@4, while Chameleon-34B outperforms Llama-2 70B, approaching the performance of Mixtral 8x7B on maj@4 (24.7 vs. 28.4).
\ We also report performance on MMLU (Hendrycks et al., 2020), which measures world/in-domain knowledge and problem-solving abilities using 57 subjects, including elementary mathematics, US history, computer science, and law. Both Chameleon models outperform their Llama-2 counterparts with Chameleon-34B approaching the performance of Mixtral 8x7B/Gemini-Pro (65.8 vs 70.6/71.8).
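\ For the likelihood-scoring protocol in the first bullet above, the following is a minimal sketch of ranking answer candidates by their summed token log-probabilities under a generic Hugging Face causal LM; the model name, prompt, and candidates are illustrative placeholders, and this sketches the general technique rather than the in-house evaluation platform used in the paper.

```python
# Hedged sketch: rank multiple-choice candidates by log-likelihood under a causal LM.
# Model name, prompt, and candidates are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def candidate_logprob(prompt: str, candidate: str) -> float:
    """Sum of log-probabilities of the candidate tokens, conditioned on the prompt."""
    # Assumes tokenizing prompt and prompt+candidate yields the same prompt prefix tokens.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)      # predictions for positions 1..T-1
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()          # keep only candidate positions

prompt = "Question: Which is heavier, a kilogram of iron or a kilogram of feathers?\nAnswer:"
candidates = [" They weigh the same.", " The iron.", " The feathers."]
prediction = max(candidates, key=lambda c: candidate_logprob(prompt, c))
print(prediction)
```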
\ Overall, Chameleon outperforms Llama-2 across the board, with performance approaching Mistral 7B/Mixtral 8x7B (Jiang et al., 2023, 2024) on some tasks. These gains are likely due to multiple factors. First, we perform two epochs over the Llama-2 pre-training data and, in general, use more compute for pre-training. Second, including code data significantly improves performance on text-only reasoning tasks. Lastly, having higher-quality data in the last 20% of pre-training significantly improves performance.
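\ As a companion to the maj@N protocol described above, here is a minimal sketch of exact-match majority voting, assuming a simple regex suffices to pull a final numeric answer out of each generation (the answer-extraction logic actually used for GSM8K/MATH is more involved).

```python
# Hedged sketch: maj@N exact-match accuracy via majority voting over sampled generations.
import re
from collections import Counter

def extract_answer(generation: str) -> str | None:
    # Illustrative heuristic: take the last number in the generation as the final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return numbers[-1] if numbers else None

def maj_at_n(generations: list[str]) -> str | None:
    # Majority vote over the extracted answers of N sampled generations.
    votes = [a for a in map(extract_answer, generations) if a is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

def maj_at_n_accuracy(samples: list[tuple[list[str], str]]) -> float:
    # samples: list of (N generations, gold answer) pairs.
    correct = sum(maj_at_n(gens) == gold for gens, gold in samples)
    return correct / len(samples)

# Toy usage: one problem, N=3 sampled generations, gold answer "42".
print(maj_at_n_accuracy([(["... so the answer is 42", "I get 41", "thus 42"], "42")]))
```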
\
:::info Author:
(1) Chameleon Team, FAIR at Meta.
:::
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::
\
2025-05-20 14:10:51
How are you, hacker?
🪐 Want to know what's trending right now? The Techbeat by HackerNoon has got you covered with fresh content from our trending stories of the day!
## Boost Team Wellness With UzairaAdvisory and Save 10% on All Our Business, Marketing & Tech Services
By @uzairamemon [ 6 Min read ]
Book Your Preventive Healthcare Seminar with UzairaAdvisory—Elevate Team Wellness and Unlock 10% OFF Our High-Impact Business, Marketing & Technology Solutions! Read More.
By @riteshmodi [ 10 Min read ] Discover how Model Context Protocol works, why it matters, and how it's transforming AI from isolated chatbots into assistants that can access your data. Read More.
By @drewchapin [ 3 Min read ] It's now possible to have an entire AI product team, which empowers solo founders to build faster, reduce cognitive load, and scale smarter with automation. Read More.
By @ninjatechai [ 6 Min read ] Ninja is proving 2025 is the year of AI agents. Outpacing OpenAI, Google, and others in tackling hallucinations, millions rely on it for coding, writing & more. Read More.
By @seedfunding [ 2 Min read ] Tesseral raises $3.3M to simplify enterprise-grade authentication for B2B devs with secure, open source infrastructure built for speed and scale. Read More.
By @mprichard [ 3 Min read ] IPinfo launches IPinfo Lite: a free, unlimited IP geolocation and ASN API with daily updates, commercial use rights, and accurate global coverage. Read More.
By @zacamos [ 5 Min read ] As AI use spreads, so does a new psychological phenomenon: AI-induced delusion. Here's a look at the darker impacts of AI chatbots. Read More.
By @thisweekinjavascript [ 3 Min read ] Chrome's V8 team just dropped a game-changing feature that makes JavaScript blazingly fast! Read More.
By @rezmoss [ 7 Min read ] This article will show you how to create a simple HTTP load balancer in Go, using only the standard library. Read More.
By @thisweekinjavascript [ 3 Min read ] The latest release transforms how the AI assistant understands your coding style, making it feel like it's truly part of your development team. Read More.
By @paoloap [ 12 Min read ] Discover 10 unusual algorithms that defied logic, broke rules, and transformed the way we think about technology and problem-solving. Read More.
By @hacker-jbemyqj [ 4 Min read ] The 2025 job market isn't just tough—it's officially broken. Read More.
By @ronnie_huss [ 3 Min read ] Ronnie Huss breaks down how an attacker exploited Mobius to mint 9 quadrillion tokens and steal $2M, exposing deeper flaws in Web3’s contract culture. Read More.
By @ujwalarklagud [ 4 Min read ] Transform your path to exit. A founder shares how 'Deep Dialogue' reveals unspoken industry truths, unlocking resilient growth and unconventional success. Read More.
By @torram [ 2 Min read ] Discover how Torram is building a Bitcoin-native, Proof-of-Stake network that brings fast, secure, and decentralized finality to the Bitcoin blockchain. Read More.
By @gmmishra [ 6 Min read ] The technology sector is experiencing a profound evolution, propelled by the emergence of generative artificial intelligence (GenAI) and "The Great Flattening" Read More.
By @hackmarketing [ 3 Min read ] Reach 400K+ tech readers with HackerNoon newsletter ads. High engagement, low CPC, and 500-1000 leads per issue. Learn more here. Read More.
By @editingprotocol [ 3 Min read ] Want to get your articles translated into 77 different languages? HackerNoon has got you covered. Read More.
By @duplication [ 7 Min read ] Explore how big data, GIS, and 3D mapping tools like Cesium and GeoPandas power the development of spatial digital twins in smart cities. Read More.
By @regularization [ 4 Min read ] Chameleon is a powerful AI model that understands and generates images and text together, outperforming major models like GPT-4V and Gemini-Pro. Read More.
🧑💻 What happened in your world this week? It's been said that writing can help consolidate technical knowledge, establish credibility, and contribute to emerging community standards. Feeling stuck? We got you covered.
We hope you enjoy this wealth of free reading material. Feel free to forward this email to a nerdy friend who'll love you for it.
See you on Planet Internet! With love,
The HackerNoon Team ✌️
2025-05-20 11:00:02
In the sections below, we discuss a set of modifications to the baseline model presented above, which form the main contributions of this work. In particular, we show that row (library) normalizing the data, using an appropriate likelihood, incorporating batch and cell-cycle information via the SE-ARD+Linear and PerSE-ARD+Linear kernels (Section 2.2), and implementing a modified encoder significantly improve the BGPLVM's performance. We present a schematic of the modified BGPLVM in Figure 1.
Raw scRNA-seq data are discrete and must be pre-processed to better align with the Gaussian likelihood in the probabilistic model of the baseline discussed above, which we call OBGPLVM (short for Original Bayesian GPLVM). However, the assumption that this pre-processed data are normally distributed is not necessarily justified. Instead of adjusting the data to fit our model, we aim to better adapt our likelihood to the data. In particular, we only normalize the total counts per cell (i.e. library size) to account for technical factors (Lun et al., 2016) and adopt a negative binomial likelihood like that in scVI (detailed in Appendix A.1).
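\ As a rough illustration of these two choices, the sketch below library-size normalizes a raw count matrix and evaluates a generic negative binomial log-likelihood parameterized by a mean and an inverse-dispersion; this is not the ApproxPoisson likelihood defined in Appendix A.1 of the paper, and all names and values are illustrative.

```python
# Hedged sketch: library-size normalization and a generic negative binomial log-likelihood.
# This is not the paper's ApproxPoisson likelihood; parameter names are illustrative.
import numpy as np
from scipy.special import gammaln

def library_size_normalize(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each cell (row) of a cells x genes count matrix to a common total count."""
    library_sizes = counts.sum(axis=1, keepdims=True)
    return counts * (target_sum / np.maximum(library_sizes, 1.0))

def nb_log_likelihood(x: np.ndarray, mu: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Element-wise log NB(x | mean=mu, inverse-dispersion=theta), scVI-style parameterization."""
    return (
        gammaln(x + theta) - gammaln(theta) - gammaln(x + 1.0)
        + theta * np.log(theta / (theta + mu))
        + x * np.log(mu / (theta + mu))
    )

# Toy usage: 3 cells x 4 genes of raw counts.
counts = np.array([[0, 3, 1, 10], [2, 0, 0, 5], [1, 1, 4, 0]], dtype=float)
normalized = library_size_normalize(counts)
ll = nb_log_likelihood(counts, mu=counts.mean(axis=0) + 1e-6, theta=np.full(4, 2.0))
print(normalized.round(1))
print(ll.sum())
```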
\
\ In our initial experiments, we found that the more complex the likelihood function was (in terms of parameters to be learned), the worse the resulting BGPLVM-learned latent space was. While one may expect the more complex and expressive likelihoods to perform better, this opposite trend may be because the model is non-identifiable. That is, especially since the loss function does not explicitly optimize for latent space representations, the extra parameters may overfit and cause the model to fail to learn important biological signals. One such ablation study is presented in Appendix B.3.2. Due to this observation, we focus on the simplest (and best performing) negative binomial based likelihood, ApproxPoisson.
\
:::info This paper is available on arxiv under CC BY-SA 4.0 DEED license.
:::
:::info Authors:
(1) Sarah Zhao, Department of Statistics, Stanford University, ([email protected]);
(2) Aditya Ravuri, Department of Computer Science, University of Cambridge ([email protected]);
(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard ([email protected]);
(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge ([email protected]).
:::
\
2025-05-20 10:30:03
A key benefit of using GPLVMs is that we can encode prior information into the generative model, especially through the kernel design, allowing for more interpretable latent spaces and less training data. Here, we highlight kernels tailored to scRNA-seq data that correct for batch and cell-cycle nuisance factors as introduced by Lalchand et al. (2022a).
\ Batch correction kernel formulation. In order to correct for confounding batch effects through the GP formulation, Lalchand et al. (2022a) proposed the following kernel structure with an additive linear kernel term to capture random effects:
\
\ Cell-cycle phase kernel. When certain genes strongly reflect cell-cycle phase effects, obscuring key biological factors, a kernel designed to explicitly model a cell-cycle latent variable can effectively mitigate these effects. This motivates adding a periodic kernel to the above kernel formulation. In particular, we specify the first latent dimension as a proxy for cell-cycle information and model our kernel as:
\
\
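\ The displayed kernel equations are not reproduced above. As a rough numpy illustration of the additive SE-ARD + Linear (batch) + periodic (cell-cycle) structure described in the text, with the composition and all hyperparameters treated as assumptions rather than the paper's exact formulation:

```python
# Hedged sketch: additive SE-ARD + Linear(batch) + Periodic(cell-cycle) kernel structure.
# Hyperparameters and the exact composition are illustrative, not the paper's formulation.
import numpy as np

def se_ard(X1, X2, lengthscales, variance=1.0):
    """Squared-exponential kernel with one lengthscale per latent dimension (ARD)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def linear_batch(B1, B2, variance=1.0):
    """Linear kernel on one-hot batch indicators, capturing additive batch (random) effects."""
    return variance * B1 @ B2.T

def periodic(x1, x2, period=1.0, lengthscale=1.0, variance=1.0):
    """Periodic kernel on the latent dimension designated as the cell-cycle proxy."""
    dist = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / lengthscale ** 2)

def perse_ard_plus_linear(X1, B1, X2, B2, lengthscales):
    """Periodic on latent dim 0 + SE-ARD on remaining dims + Linear on batch one-hots."""
    return (
        periodic(X1[:, 0], X2[:, 0])
        + se_ard(X1[:, 1:], X2[:, 1:], lengthscales)
        + linear_batch(B1, B2)
    )

# Toy usage: 5 cells, 3 latent dims (dim 0 = cell-cycle proxy), 2 batches.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
B = np.eye(2)[rng.integers(0, 2, size=5)]          # one-hot batch indicators
K = perse_ard_plus_linear(X, B, X, B, lengthscales=np.ones(2))
print(K.shape, np.allclose(K, K.T))
```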
:::info This paper is available on arxiv under CC BY-SA 4.0 DEED license.
:::
:::info Authors:
(1) Sarah Zhao, Department of Statistics, Stanford University, ([email protected]);
(2) Aditya Ravuri, Department of Computer Science, University of Cambridge ([email protected]);
(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard ([email protected]);
(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge ([email protected]).
:::
\