
We are an open and international community of 45,000+ contributing writers publishing stories and expertise for 4+ million curious and insightful monthly readers.

RSS preview of Blog of HackerNoon

How Chameleon Advances Multimodal AI with Unified Tokens

2025-05-20 18:00:03

Table of Links

Abstract and 1 Introduction

2 Pre-Training

2.1 Tokenization

2.2 Pre-Training Data

2.3 Stability

2.4 Inference

3 Alignment and 3.1 Data

3.2 Fine-Tuning Strategy

4 Human Evaluations and Safety Testing, and 4.1 Prompts for Evaluation

4.2 Baselines and Evaluations

4.3 Inter-annotator Agreement

4.4 Safety Testing

4.5 Discussion

5 Benchmark Evaluations and 5.1 Text

5.2 Image-To-Text

6 Related Work

7 Conclusion, Acknowledgements, Contributors, and References

Appendix

A. Samples

B. Additional Information of Human Evaluations

6 Related Work

Chameleon builds upon the lineage of works exploring token-based approaches for multimodal learning. The idea of using discrete tokens to represent continuous modalities like images was first explored in works like BEiT (Bao et al., 2021), which proposed a self-supervised vision representation learning method based on tokenized image patches. Aghajanyan et al. (2022) extended this idea to learning from mixed-modal documents through interleaved image and text tokens, allowing for joint reasoning over both modalities within a unified architecture. CM3Leon (Yu et al., 2023) further scaled up this approach to autoregressive text-to-image generation, building on the initial proposal of token-based image generation in DALL-E (Ramesh et al., 2021).

As a fully token-based early-fusion model, Chameleon differs from late-fusion approaches like Flamingo (Alayrac et al., 2022), which encode images and text separately before combining them at a later stage. Other models, such as LLaVA (Liu et al., 2023a), IDEFICS (Laurençon et al., 2023), and VisualGPT (Chen et al., 2022), also maintain separate image and text encoders. In contrast, Chameleon’s unified token space allows it to seamlessly reason over and generate interleaved image and text sequences, without the need for modality-specific components. This early-fusion approach, however, comes with significant challenges in terms of representation learning and alignment, as discussed in Baltrušaitis et al. (2018).

The most similar model to Chameleon is Gemini (Gemini et al., 2023), which also uses an early-fusion token-based approach. However, a key difference is that Gemini uses separate image decoders, whereas Chameleon is an end-to-end dense model without any routing components. This makes Chameleon a more general-purpose model for both multimodal understanding and generation tasks, similar in spirit to the Perceiver (Jaegle et al., 2021) architecture which also aims for a unified model across modalities and tasks.

In summary, Chameleon builds on a rich history of work in multimodal learning and token-based architectures, while pushing the boundaries in terms of model scale and architecture design. By demonstrating strong performance across a wide range of vision-language tasks and enabling new capabilities in mixed-modal reasoning and generation, Chameleon represents a significant step towards realizing the vision of general-purpose multimodal foundation models.


:::info Author:

(1) Chameleon Team, FAIR at Meta.

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::


Comparing Chameleon AI to Leading Image-to-Text Models

2025-05-20 17:00:03

Table of Links

Abstract and 1 Introduction

2 Pre-Training

2.1 Tokenization

2.2 Pre-Training Data

2.3 Stability

2.4 Inference

3 Alignment and 3.1 Data

3.2 Fine-Tuning Strategy

4 Human Evaluations and Safety Testing, and 4.1 Prompts for Evaluation

4.2 Baselines and Evaluations

4.3 Inter-annotator Agreement

4.4 Safety Testing

4.5 Discussion

5 Benchmark Evaluations and 5.1 Text

5.2 Image-To-Text

6 Related Work

7 Conclusion, Acknowledgements, Contributors, and References

Appendix

A. Samples

B. Additional Information of Human Evaluations

5.2 Image-To-Text

We next evaluate Chameleon on the set of tasks that require text generation conditioned on an image, specifically image captioning and visual question-answering, and present results for Chameleon-34B in Table 7. Alongside our pre-trained model, we also report results for a model fine-tuned on all tasks jointly (Chameleon-34B-MultiTask), as well as for models fine-tuned exclusively on the specific evaluation tasks (Chameleon-34B-SFT).

We evaluate against available open-source late-fusion models, specifically Flamingo 80B (Alayrac et al., 2022), IDEFICS 80B (Laurençon et al., 2023), and Llava-1.5 (Liu et al., 2023a), as well as recent closed-source models such as Gemini (Gemini et al., 2023) and GPT-4V (OpenAI, 2023). We note that we did not take any special care when formatting the pre-training data to ensure that 0-shot inference can be done effectively; this was deliberate, to maintain the fidelity of the pre-training data. We therefore augment the input images or questions with the published prompts used by other models.

• Image Captioning: For image captioning evaluations we report CIDEr (Vedantam et al., 2015) scores on the Karpathy test split of MS-COCO (Lin et al., 2014) and the Karpathy test split of Flickr30k (Plummer et al., 2015), using the pycocoevalcap (Chen et al., 2020) package (see the scoring sketch below). For Chameleon models, we restrict captions to 30 tokens. We evaluated GPT-4V and Gemini models using several prompts and generation lengths via their APIs and report the best performance that we were able to achieve.

In the open-source pre-trained category, Chameleon-34B with 2 shots outperforms the larger 80B Flamingo and IDEFICS models with 32 shots on COCO, while matching their performance on Flickr30k. Among fine-tuned/closed-source models, both the multi-task and SFT variants of Chameleon-34B outperform all other models on COCO, while on Flickr30k the SFT model outperforms all other models, with the multi-task model a close competitor.
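As a rough illustration of the captioning metric above, the following is a minimal sketch of corpus-level CIDEr scoring with the pycocoevalcap package; the caption dictionaries are illustrative, and the official pipeline additionally runs PTBTokenizer before scoring.

```python
from pycocoevalcap.cider.cider import Cider

def cider_score(references, candidates):
    # references: {image_id: [reference caption strings]}
    # candidates: {image_id: [one generated caption string]}
    # Captions here are assumed to be already whitespace-tokenized.
    gts = {k: [c.lower() for c in v] for k, v in references.items()}
    res = {k: [c.lower() for c in v] for k, v in candidates.items()}
    corpus_score, per_image_scores = Cider().compute_score(gts, res)
    return corpus_score, per_image_scores

# Toy usage with made-up captions (not from the paper's evaluation data).
score, _ = cider_score(
    {"img1": ["a dog runs on the beach", "a dog running along the shore"]},
    {"img1": ["a dog is running on the beach"]},
)
print(f"CIDEr: {score:.3f}")
```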

• Visual Question Answering: For visual question answering (VQA) we report performance on the test-dev split of VQA-v2 (Goyal et al., 2017). On VQA-v2, the pre-trained Chameleon-34B model with 2 shots matches the 32-shot performance of the larger Flamingo and IDEFICS models, while among fine-tuned/closed models, Chameleon-34B-Multitask approaches the performance of IDEFICS-80B-Instruct and Gemini Pro, but trails larger models such as Flamingo-80B-FT, GPT-4V, and Gemini Ultra. Llava-1.5 outperforms Chameleon-34B on VQA-v2, potentially owing to its additional fine-tuning on conversations from GPT-4, ShareGPT (ShareGPT, 2023), GQA (Hudson and Manning, 2019), and region-level VQA datasets, but it trails significantly behind on the other tasks.

Table 7: Model Performances on Image-to-Text Capabilities. ∗ Evaluated using API.
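The VQA-v2 numbers above use the benchmark's standard consensus accuracy, which the text does not spell out; a minimal sketch of that metric (answer normalization and the official annotator-subset averaging are omitted) is:

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    # VQA-v2 treats an answer as fully correct when at least 3 of the 10
    # annotators gave it; partial credit is given below that threshold.
    matches = sum(answer == predicted for answer in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("yes", ["yes"] * 7 + ["no"] * 3))  # 1.0
print(vqa_accuracy("no", ["yes"] * 8 + ["no"] * 2))   # ~0.67
```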

In general, we find Chameleon to be quite competitive on both image captioning and VQA tasks. It rivals other models while using far fewer in-context training examples and smaller model sizes, in both the pre-trained and fine-tuned evaluations.


:::info Author:

(1) Chameleon Team, FAIR at Meta.

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::


Chameleon AI Shows Competitive Edge Over LLaMa-2 and Other Models

2025-05-20 16:00:03

Table of Links

Abstract and 1 Introduction

2 Pre-Training

2.1 Tokenization

2.2 Pre-Training Data

2.3 Stability

2.4 Inference

3 Alignment and 3.1 Data

3.2 Fine-Tuning Strategy

4 Human Evaluations and Safety Testing, and 4.1 Prompts for Evaluation

4.2 Baselines and Evaluations

4.3 Inter-annotator Agreement

4.4 Safety Testing

4.5 Discussion

5 Benchmark Evaluations and 5.1 Text

5.2 Image-To-Text

6 Related Work

7 Conclusion, Acknowledgements, Contributors, and References

Appendix

A. Samples

B. Additional Information of Human Evaluations

5 Benchmark Evaluations

Given the general capabilities of Chameleon, there is no single model we can directly evaluate against; we therefore evaluate, to the extent possible, against the best models in each category.

5.1 Text

We evaluate the general text-only capabilities of our pre-trained (not SFT’d) model against other state-of-the-art text-only large language models. We follow the evaluation protocol outlined by Touvron et al. (2023). Specifically, we evaluate all models using an in-house evaluation platform in the areas of commonsense reasoning, reading comprehension, math problems, and world knowledge. We report our results in Table 6.

Table 6: Comparison of overall performance on collective academic benchmarks against open-source foundational models. ∗ Evaluated using our framework / using API. For GSM8K/MATH, we report maj@1 unless mentioned otherwise.

• Commonsense Reasoning and Reading Comprehension: We report 0-shot performance on the following benchmarks that measure commonsense reasoning and reading comprehension capabilities: PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-Easy (Clark et al., 2018), ARC-Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and BoolQ (Clark et al., 2019). We score the prompt with each candidate answer and compute accuracy using the candidate with the highest score (sketched below). Unless noted otherwise, baseline model performances are taken directly from their reported sources. We observe that Chameleon-7B and Chameleon-34B are competitive with the corresponding Llama-2 models, with Chameleon-34B even outperforming Llama-2 70B on 5/8 tasks and performing on par with Mixtral 8x7B.
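A minimal sketch of this multiple-choice scoring protocol follows; the sequence_log_prob helper is a hypothetical stand-in for the model-scoring call of whatever evaluation harness is used, and normalization choices (e.g., per-token or per-character) vary by benchmark.

```python
from typing import Callable, Sequence

def pick_answer(prompt: str,
                candidates: Sequence[str],
                sequence_log_prob: Callable[[str, str], float]) -> int:
    # Score the prompt with each candidate continuation and return the index
    # of the highest-scoring candidate.
    scores = [sequence_log_prob(prompt, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

def accuracy(dataset, sequence_log_prob) -> float:
    # dataset: iterable of (prompt, candidates, gold_index) triples.
    examples = list(dataset)
    correct = sum(pick_answer(p, c, sequence_log_prob) == gold
                  for p, c, gold in examples)
    return correct / len(examples)
```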

• Math and World Knowledge: We report 8-shot performance on GSM8K (Cobbe et al., 2021), a benchmark of grade-school math word problems, and 4-shot performance on the MATH (Hendrycks et al., 2021) benchmark. We report maj@N exact match accuracy for both benchmarks by sampling N generations from the model (greedy sampling for N=1) and choosing the answer via majority voting (sketched below). Despite training on additional modalities, both Chameleon models demonstrate strong math capabilities. On GSM8K, Chameleon-7B outperforms the corresponding Llama-2 models, with performance comparable to Mistral 7B (50.9 vs 52.1 maj@8). Furthermore, Chameleon-34B outperforms Llama2-70B on maj@1 (61.4 vs 56.8) and Mixtral 8x7B on maj@32 (77.0 vs 75.1). Similarly, on MATH, Chameleon-7B outperforms Llama-2 and matches Mistral 7B on maj@4, while Chameleon-34B outperforms Llama2-70B, approaching the performance of Mixtral 8x7B on maj@4 (24.7 vs 28.4).
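The maj@N metric mentioned above can be sketched as follows; sample_answer is a hypothetical stand-in for sampling a generation from the model and extracting its final answer.

```python
from collections import Counter

def maj_at_n(sample_answer, question: str, reference: str, n: int) -> bool:
    # maj@N: sample N answers (greedy decoding when N == 1), take the most
    # common final answer, and check exact match against the reference.
    answers = [sample_answer(question) for _ in range(n)]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return majority_answer == reference
```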

We also report performance on MMLU (Hendrycks et al., 2020), which measures world/in-domain knowledge and problem-solving abilities using 57 subjects, including elementary mathematics, US history, computer science, and law. Both Chameleon models outperform their Llama-2 counterparts, with Chameleon-34B approaching the performance of Mixtral 8x7B/Gemini-Pro (65.8 vs 70.6/71.8).

Overall, Chameleon outperforms Llama-2 across the board, with performance approaching Mistral 7B/Mixtral 8x7B (Jiang et al., 2023, 2024) on some tasks. These gains are likely due to multiple factors. First, we do two epochs over the Llama-2 pre-training data and, in general, use more compute for pre-training. Second, including code data significantly improves performance on text-only reasoning tasks. Lastly, having higher-quality data in the last 20% of pre-training significantly improves performance.


:::info Author:

(1) Chameleon Team, FAIR at Meta.

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::


The TechBeat: IPinfo Launches IPinfo Lite: Unlimited Country-Level Geolocation API & Database Download (5/20/2025)

2025-05-20 14:10:51

How are you, hacker? 🪐 Want to know what's trending right now? The Techbeat by HackerNoon has got you covered with fresh content from our trending stories of the day! Set email preference here.

Boost Team Wellness With UzairaAdvisory and Save 10% on All Our Business, Marketing & Tech Services

By @uzairamemon [ 6 Min read ] Book Your Preventive Healthcare Seminar with UzairaAdvisory—Elevate Team Wellness and Unlock 10% OFF Our High-Impact Business, Marketing & Technology Solutions! Read More.

Model Context Protocol Is the Kind of AI Future All Of Us Should Want to See

By @riteshmodi [ 10 Min read ] Discover how Model Context Protocol works, why it matters, and how it's transforming AI from isolated chatbots into assistants that can access your data. Read More.

Why I’m Jealous of Today’s Builders

By @drewchapin [ 3 Min read ] It's now possible to have an entire AI product team, which empowers solo founders to build faster, reduce cognitive load, and scale smarter with automation. Read More.

Ninja Deep Research: The AI Agent Everyone Can Actually Start Using Now

By @ninjatechai [ 6 Min read ] Ninja is proving 2025 is the year of AI agents. Outpacing OpenAI, Google, and others in tackling hallucinations, millions rely on it for coding, writing & more. Read More.

Tesseral Raises $3.3M Seed to Bring Open Source Auth to B2B Software

By @seedfunding [ 2 Min read ] Tesseral raises $3.3M to simplify enterprise-grade authentication for B2B devs with secure, open source infrastructure built for speed and scale. Read More.

IPinfo Launches IPinfo Lite: Unlimited Country-Level Geolocation API & Database Download

By @mprichard [ 3 Min read ] IPinfo launches IPinfo Lite: a free, unlimited IP geolocation and ASN API with daily updates, commercial use rights, and accurate global coverage. Read More.

Is AI Making People Delusional?

By @zacamos [ 5 Min read ] As AI use spreads, so does a new psychological phenomenon: AI-induced delusion. Here's a look at the darker impacts of AI chatbots. Read More.

JavaScript Just Became 10X faster Thanks to a New Game-changing Feature From Chrome

By @thisweekinjavascript [ 3 Min read ] Chrome's V8 team just dropped a game-changing feature that makes JavaScript blazingly fast! Read More.

This 150-Line Go Script Is Actually a Full-On Load Balancer

By @rezmoss [ 7 Min read ] This article will show you how to create a simple HTTP load balancer in Go, using only the standard library. Read More.

Visual Studio Code 1.100: AI Gets Personal!

By @thisweekinjavascript [ 3 Min read ] The latest release transforms how the AI assistant understands your coding style, making it feel like it's truly part of your development team. Read More.

The 10 Weirdest, Most Brilliant Algorithms Ever Devised and What They Actually Do

By @paoloap [ 12 Min read ] Discover 10 unusual algorithms that defied logic, broke rules, and transformed the way we think about technology and problem-solving. Read More.

The 2025 Job Market Reality Check: Why Old-School Job Search Tactics Are Dead

By @hacker-jbemyqj [ 4 Min read ] The 2025 job market isn't just tough—it's officially broken. Read More.

9 Quadrillion Reasons Web3 Still Isn’t Ready

By @ronnie_huss [ 3 Min read ] Ronnie Huss breaks down how an attacker exploited Mobius to mint 9 quadrillion tokens and steal $2M, exposing deeper flaws in Web3’s contract culture. Read More.

The Startup Playbook Is a Lie. Ask Better Questions.

By @ujwalarklagud [ 4 Min read ] Transform your path to exit. A founder shares how 'Deep Dialogue' reveals unspoken industry truths, unlocking resilient growth and unconventional success. Read More.

Decentralization by Design: How Torram Aligns with Bitcoin’s Core Ethos

By @torram [ 2 Min read ] Discover how Torram is building a Bitcoin-native, Proof-of-Stake network that brings fast, secure, and decentralized finality to the Bitcoin blockchain. Read More.

Advancing Your Software Engineering Career in 2025

By @gmmishra [ 6 Min read ] The technology sector is experiencing a profound evolution, propelled by the emergence of generative artificial intelligence (GenAI) and "The Great Flattening" Read More.

Reach 400K+ Tech Readers with HackerNoon Newsletter Ads

By @hackmarketing [ 3 Min read ] Reach 400K+ tech readers with HackerNoon newsletter ads. High engagement, low CPC, and 500-1000 leads per issue. Learn more here. Read More.

How to Become Mr. Worldwide and Get Your Articles Translated to 77 Different Languages

By @editingprotocol [ 3 Min read ] Want to get your articles translated into 77 different languages? HackerNoon has got you covered. Read More.

How Digital Twins Use Big Data to Mirror the Real World

By @duplication [ 7 Min read ] Explore how big data, GIS, and 3D mapping tools like Cesium and GeoPandas power the development of spatial digital twins in smart cities. Read More.

This AI Model Doesn’t See the Line Between Text and Images

By @regularization [ 4 Min read ] Chameleon is a powerful AI model that understands and generates images and text together, outperforming major models like GPT-4V and Gemini-Pro. Read More.

🧑‍💻 What happened in your world this week? It's been said that writing can help consolidate technical knowledge, establish credibility, and contribute to emerging community standards. Feeling stuck? We got you covered ⬇️⬇️⬇️ ANSWER THESE GREATEST INTERVIEW QUESTIONS OF ALL TIME

We hope you enjoy this week's worth of free reading material. Feel free to forward this email to a nerdy friend who'll love you for it. See you on Planet Internet! With love, The HackerNoon Team ✌️

Improved BGPLVM for scRNA-seq: Pre-Processing and Likelihood

2025-05-20 11:00:02

Table of Links

Abstract and 1. Introduction

2. Background

2.1 Amortized Stochastic Variational Bayesian GPLVM

2.2 Encoding Domain Knowledge through Kernels

3. Our Model and 3.1 Pre-Processing and Likelihood

3.2 Encoder

4. Results and Discussion and 4.1 Each Component is Crucial to Modified Model Performance

4.2 Modified Model achieves Significant Improvements over Standard Bayesian GPLVM and is Comparable to SCVI

4.3 Consistency of Latent Space with Biological Factors

5. Conclusion, Acknowledgement, and References

A. Baseline Models

B. Experiment Details

C. Latent Space Metrics

D. Detailed Metrics

3 OUR MODEL

In the sections below, we discuss a set of modifications to the baseline model presented above, which form the main contributions of this work. In particular, we show that row (library) normalizing the data, using an appropriate likelihood, incorporating batch and cell-cycle information via SE-ARD+Linear and PerSE-ARD+Linear kernels (Section 2.2), and implementing a modified encoder significantly improve the BGPLVM’s performance. We present the schematic of the modified BGPLVM in Figure 1.

3.1 PRE-PROCESSING AND LIKELIHOOD

Raw scRNA-seq data are discrete counts and must be pre-processed to better align with the Gaussian likelihood in the probabilistic model of the baseline discussed above (which we call OBGPLVM, short for Original Bayesian GPLVM). However, the assumption that these pre-processed data are normally distributed is not necessarily justified. Instead of adjusting the data to fit our model, we aim to better adapt our likelihood to the data. In particular, we only normalize the total counts per cell (i.e., library size) to account for technical factors (Lun et al., 2016) and adopt a negative binomial likelihood like that in scVI (detailed in Appendix A.1).
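As a minimal sketch of the per-cell library-size normalization described above (the median-library target and the raw-count input format are assumptions here, not specifics from the paper):

```python
import numpy as np

def library_size_normalize(counts: np.ndarray) -> np.ndarray:
    # counts: (n_cells, n_genes) raw UMI counts.
    # Rescale each cell so its total count matches the median library size;
    # the data stay on the count scale (no log transform), consistent with a
    # negative-binomial-style likelihood downstream.
    library_sizes = counts.sum(axis=1, keepdims=True)
    target = np.median(library_sizes)
    return counts / library_sizes * target

rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=(5, 10))
x_norm = library_size_normalize(x)
print(x_norm.sum(axis=1))  # every cell now has the same (median) library size
```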


In our initial experiments, we found that the more complex the likelihood function was (in terms of parameters to be learned), the worse the resulting BGPLVM-learned latent space became. While one might expect more complex and expressive likelihoods to perform better, this opposite trend may be because the model is non-identifiable. That is, especially since the loss function does not explicitly optimize for latent space representations, the extra parameters may overfit and cause the model to fail to learn important biological signals. One such ablation study is presented in Appendix B.3.2. Due to this observation, we focus on the simplest (and best-performing) negative-binomial-based likelihood, ApproxPoisson.


:::info This paper is available on arxiv under CC BY-SA 4.0 DEED license.

:::


:::info Authors:

(1) Sarah Zhao, Department of Statistics, Stanford University, ([email protected]);

(2) Aditya Ravuri, Department of Computer Science, University of Cambridge ([email protected]);

(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard ([email protected]);

(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge ([email protected]).

:::


Encoding Biological Knowledge in GPLVM Kernels for scRNA-seq

2025-05-20 10:30:03

Table of Links

Abstract and 1. Introduction

2. Background

2.1 Amortized Stochastic Variational Bayesian GPLVM

2.2 Encoding Domain Knowledge through Kernels

3. Our Model and 3.1 Pre-Processing and Likelihood

3.2 Encoder

4. Results and Discussion and 4.1 Each Component is Crucial to Modified Model Performance

4.2 Modified Model achieves Significant Improvements over Standard Bayesian GPLVM and is Comparable to SCVI

4.3 Consistency of Latent Space with Biological Factors

5. Conclusion, Acknowledgement, and References

A. Baseline Models

B. Experiment Details

C. Latent Space Metrics

D. Detailed Metrics

2.2 ENCODING DOMAIN KNOWLEDGE THROUGH KERNELS

A key benefit of using GPLVMs is that we can encode prior information into the generative model, especially through the kernel design, allowing for more interpretable latent spaces while requiring less training data. Here, we highlight kernels tailored to scRNA-seq data that correct for batch and cell-cycle nuisance factors, as introduced by Lalchand et al. (2022a).

Batch correction kernel formulation. In order to correct for confounding batch effects through the GP formulation, Lalchand et al. (2022a) proposed the following kernel structure with an additive linear kernel term to capture random effects:
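A hedged sketch of one plausible form of this SE-ARD+Linear structure (the notation, including the one-hot batch covariate b_n, is ours rather than the paper's):

```latex
k\big((x_n, b_n), (x_m, b_m)\big)
  = \sigma_f^2 \exp\!\left(-\frac{1}{2}\sum_{q=1}^{Q}\frac{(x_{n,q} - x_{m,q})^2}{\ell_q^2}\right)
  + \sigma_b^2\, b_n^{\top} b_m
```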


Cell-cycle phase kernel. When certain genes strongly reflect cell-cycle phase effects, obscuring key biological factors, a kernel designed to explicitly address a cell-cycle latent variable can effectively mitigate these effects. This motivates adding a periodic kernel to the above kernel formulation. In particular, we specify the first latent dimension as a proxy for cell-cycle information and model our kernel as:
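Again only as a hedged sketch (the exact composition is defined in the original paper), one additive reading with the periodic term acting on the designated first latent dimension and SE-ARD on the remaining dimensions is:

```latex
k\big((x_n, b_n), (x_m, b_m)\big)
  = \sigma_p^2 \exp\!\left(-\frac{2\sin^2\!\big(\pi\,\lvert x_{n,1} - x_{m,1}\rvert / p\big)}{\ell_p^2}\right)
  + k_{\mathrm{SE\text{-}ARD}}\big(x_{n,2:Q},\, x_{m,2:Q}\big)
  + \sigma_b^2\, b_n^{\top} b_m
```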


:::info This paper is available on arxiv under CC BY-SA 4.0 DEED license.

:::

:::info Authors:

(1) Sarah Zhao, Department of Statistics, Stanford University, ([email protected]);

(2) Aditya Ravuri, Department of Computer Science, University of Cambridge ([email protected]);

(3) Vidhi Lalchand, Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard ([email protected]);

(4) Neil D. Lawrence, Department of Computer Science, University of Cambridge ([email protected]).

:::
