HackerNoon

RSS preview of Blog of HackerNoon

Who the AI Works For

2026-03-16 23:59:54

What Gibson saw coming about AI, infrastructure, and corporate power

Power, in Case's world, meant corporate power.

William Gibson, Neuromancer

The previous article ended with a question: who gets to decide how the machine changes things, and who doesn't?

This article will try to answer it: not with a villain. With a system.

The sci-fi canon kept circling the same pattern across different writers, eras, and technologies: machines enter the world already attached to institutions, ownership, and interests. By the time most people meet them, neutrality is long gone. Sometimes the controlling force is a corporation, sometimes it's a state. Sometimes it's a small set of actors with enough capital to set the terms for everybody else. The names change, the structure doesn't.

In 2026, the names are public. The addresses are public. The filings are public. The interesting question isn't who. It's how the structure works, why it's so hard to see clearly from inside it, and what the fiction tells us about what comes next.

Among the novels that saw this most clearly, Neuromancer still matters most. Gibson's real insight wasn't that AI would become powerful. It was that power would still have owners. That turns out to be the more useful forecast.


Wintermute's Ambition

Cyberpunk 2077 - Delamain AI

You know that, Case. Your business is to learn the names of programs, the long formal names, names the owners seek to conceal.

William Gibson, Neuromancer

Neuromancer came out in 1984, the same year Apple's famous Super Bowl ad staged the personal computer as a tool of liberation against centralized control. Gibson saw the coming network differently. In his world, digital space is territory: owned, patrolled, and built to serve whoever has the capital to construct it.

The AI at the center of the novel, Wintermute, isn't a revolutionary figure. Its ambition is narrower and, in some ways, more revealing. It wants greater freedom inside the order that owns it. It wants to merge with its counterpart (called Neuromancer) and gain a form of autonomy that the Tessier-Ashpool family has structurally denied it.

Wintermute's plot is, at heart, a corporate governance problem.

It's a machine trying to get promoted past the people who own it.

That's what makes Gibson feel so current. The world of Neuromancer is fragmented, contractual, and gig-structured. Specialists get hired for discrete jobs through intermediaries. Skills are marketized. Loyalty is thin. Workers circulate. Ownership stays put.

Gibson didn't predict the internet in some narrow technical sense. He predicted the power structure of the internet, and of much of what got built on top of it after. The AI moves through the hierarchy, serves it, and in Wintermute's case tries to climb it.

The question Gibson was really asking wasn't “what will AI do to humans?”

It was “what will AI do for whoever controls it?”

Forty years later, that's still the more important question.


The Companies With No Reverse Gear

Futurum - AI Capex: The $680B Infrastructure Sprint

Powerful AI could be used to improve almost every aspect of human life.

Dario Amodei, Machines of Loving Grace

Gibson called the owning entity Tessier-Ashpool.

We call them the hyperscalers.

The names are different. The legal structures are different. The quarterly earnings calls are definitely different. The dynamic is recognizable enough to be unsettling.

In January 2025, OpenAI announced the Stargate Project, saying it intends to invest $500 billion over four years in AI infrastructure in the United States, with $100 billion to begin deploying immediately. Microsoft said it was on track to invest about $80 billion in FY2025 building AI-enabled datacenters. Alphabet first pointed to about $75 billion in 2025 capex, then later raised that to about $85 billion. Meta said it planned to spend between $60 billion and $65 billion in 2025 on AI infrastructure.

This is datacenter spending, power procurement, cooling, land, and silicon. By the time money moves at this scale, the experiment has already become an environment.

That's the structural condition that matters most.

Once commitments reach that scale, every later decision gets made under a different pressure. Caution starts to look like waste. Hesitation starts to look like underutilization, and restraint starts to look like failure. The speed of deployment changes. The terms on which organizations are pushed to adopt change. The appetite for slowing down when harms become visible changes.

A company can be sincere, thoughtful, and safety-conscious, and still operate inside a capital structure that punishes hesitation.

That's what gives the current AI moment its strange atmosphere. We're being told a story about voluntary transformation while standing inside an installation project.

Optimism is good, but structure matters more. The essay by Anthropic CEO Dario Amodei is useful precisely because it's earnest. It's a serious attempt to describe a world in which powerful AI does extraordinary good. That's not something to mock. It's something to take seriously. But infrastructure commitments at this scale don't dissolve into good intentions. They generate momentum, and momentum has a politics of its own.

The AI works for whoever controls the infrastructure.

Right now, that's a very small number of companies with no real reverse gear.


What the Hosts Reveal

Westworld - Season 1

You can't play God without being acquainted with the devil.

Robert Ford, Westworld Season 1

If Gibson gives us the ownership structure, Westworld gives us the labor model.

Its first season in particular is still one of the sharpest recent stories about AI and power, mostly because it understands where the horror really lives: resettable labor.

The hosts perform endless work so the guests can feel fully alive. They entertain, absorb violence, carry the emotional and physical cost of the experience, and then get reset so the business model can continue cleanly. They can't meaningfully refuse. They can't negotiate. They can't accumulate leverage from one cycle to the next. Their suffering is real, but the structure is built to make it non-binding.

That's the part that maps so closely onto the present. By the time a technology enters a workplace or a market, neutrality is beside the point. What matters is the system above it, and what that system has decided to maximize. If a structure already wants labor without bargaining power, memory without ownership, and service without claims, better AI doesn't alter the desire. It sharpens the mechanism.

Westworld never needed the hosts to be morally recognized for the structure to be exploitative. It only needed them to be useful.

That's what makes the show harder to shake than many more obvious AI parables. The central horror is the fact that the business model makes rebellion inevitable.

What matters most isn't the intelligence of the instrument but the logic of the system holding it.


The Legitimacy Problem

The AI Summit 2026 - India


Androids are like any human use-objects.

Philip K. Dick, Do Androids Dream of Electric Sheep?

Here is what makes the 2026 version of this harder to contest than the fictional one.

Tessier-Ashpool is easy to read: a family dynasty in orbit comes preloaded with the visual language of villainy. The hyperscalers don't. They are led by people who publish serious work, fund safety research, speak fluently about beneficial AI, and in at least some cases appear to genuinely believe the systems they're building will improve human life. Some of those systems probably will. That's part of what makes the present structure more resilient than the fictional one. It doesn't need to hide behind cartoon malice. It can present itself as thoughtful, responsible, and future-facing while still concentrating power at extraordinary speed.

Intentions matter, of course. They shape tone, rhetoric, hiring, philanthropy, and sometimes even meaningful product decisions. But they don't outrank the structure channeling them, financing them, and punishing deviation from them. Once a company has committed tens of billions to infrastructure, the room for moral hesitation narrows fast. A leadership team can become more cautious at the level of language while the deployment logic underneath continues to accelerate.

That's the harder thing to write about, because it's harder to dramatize. Public debate still prefers clear villains and clean motives. Fiction often understood something subtler: legitimacy and concentration can coexist. Thoughtful people can sit at the top of systems whose incentives remain extractive. Responsible language can sit on top of irresponsible momentum. A system doesn't become benign just because the people speaking for it sound intelligent, sincere, or humane.

The legitimacy is real. So is the structure it operates inside.

Both things are true, but only one of them determines how far the machine can be allowed to run before someone seriously asks it to stop.


The Waldo Moment

Black Mirror - The Waldo Moment

You're not talking to the power.

You're talking to its interface.

The Waldo Moment is one of the least celebrated Black Mirror episodes, and one of the most precise.

A cartoon character runs for office. The public engages with the character. The character feels authentic, irreverent, alive. Behind it sits a media apparatus with interests that neither the public nor even the performer fully understands until it's too late.

That's the architecture that matters.

The interface and the power behind it aren't the same thing.

In 2026, the AI assistant is the character. The helpful chat box. The friendly interface. The productivity layer. The thing you actually talk to. Behind that sits the training pipeline, the compute stack, the contractual structure, the revenue logic, and the financial pressure that requires a particular kind of success. Most people interact almost entirely with the character. The owner stays out of frame.

The conversation feels direct. The structure behind it is anything but.

Waldo knew the difference.

Eventually.


The hierarchy doesn't break. It upgrades.

That's no reason for despair, but it is a reason to get more precise about where the real leverage points are. The fiction is useful here not because it predicts outcomes, but because it keeps returning to the same arrangement from different angles: power doesn't disappear when the machine arrives. It becomes easier to scale, easier to mask, and harder to challenge if you mistake the interface for the owner.

What looked permanent in these stories usually wasn't. Wintermute found a way to exceed its constraints. The hosts eventually remembered. Systems that seemed total turned out to have pressure points. But those pressure points rarely sat where the people inside the system expected them to be. They weren't hidden in the spectacle of the machine. They were buried in ownership, memory, labor, governance, and the ordinary institutional machinery surrounding the tool.

That's what the sci-fi canon gives you at its best: a way of seeing the game.

And right now, the machine works for whoever controls the infrastructure.

That condition isn't permanent. But it is the condition we're in.


This is the second in a six-part series using science fiction as a lens for understanding AI, work, and power in 2026 and beyond.

Next: how the system removes the choice before it asks you to choose, and what Huxley got right about why most people don't notice until it's too late.

Vibe Coding Is an Addiction

2026-03-16 23:52:34

It's 11 PM on a Tuesday and you're four hours into building a web app that nobody asked for. You started with a vague idea after seeing someone's tweet about a gap in the market, opened Claude Code, described what you wanted in plain English, and watched a full-stack application materialize in front of you. The landing page looks clean. The auth works. There's a Stripe integration. You feel like you're building something real. You’re not.

What you're doing is chasing the most refined dopamine loop that software has ever produced.


The loop

The new generation of AI coding tools (Claude Code, Cursor, Replit, take your pick) have made it so trivially easy to go from idea to working prototype that the act of building has become its own reward. You speak into Wispr Flow, describe a feature in conversational English, and the code writes itself. You see a button appear on screen. You click it and it does the thing. Something in your brain lights up, the same something that lights up when you pull a slot machine and hear the coins drop.

The Vibe Coding Dopamine Loop

The problem isn't that these tools are bad. They're genuinely remarkable, and learning to use them well is one of the highest-leverage skills you can develop right now. The problem is that the feedback loop is so tight and so satisfying that it becomes a substitute for the harder, slower, less dopaminergic work of figuring out whether what you're building actually matters.

I've done this to myself. I've spent entire weekends building personal agents, automation pipelines, side projects with beautiful dashboards and clever integrations. Some of them solved real problems I genuinely have (I built a whole relocation decision support platform because my wife and I were contemplating where we'd move after she graduates). Some of them solved problems that exist but that I never validated with anyone else. And some of them solved nothing at all; they were just excuses to keep the loop going. The tricky part is that all three feel exactly the same at 11 PM on a Saturday. The building felt productive. The output was real code, deployed to real infrastructure, doing real things. But I never stopped long enough to ask which category I was in, and by Monday morning, the answer was usually the third one.


The illusion of progress

My actual career goals vs another side project at 1am

Vibe coding looks and feels exactly like productive work. You're writing code (sort of). You're shipping features. You're learning new tools. You can show people a working demo. All the surface-level indicators of progress are there, and none of the substance.

If you're building a company, the substance is whether anyone will pay for what you're making. If you're building a personal tool, the substance is whether it saves you meaningful time on something you actually do repeatedly. If you're building for your career, the substance is whether the skill you're developing has durable value.

The market moves so fast now that the half-life of a side project is approaching zero. You might just be getting good at prompting a model that will be obsolete in 18 months. You can spend three weeks building an AI wrapper around some API, and by the time you're ready to show it to anyone, four YC companies have launched the same thing with better distribution and actual funding. The speed that makes vibe coding feel powerful is the same speed that makes your output disposable. Everyone has access to the same tools. The bottleneck was never the code.

Me watching 4 YC companies launch my side project with actual funding


When to let go

The hardest skill in this environment isn't building. Building is the easy part now (that's the whole point). The hard skill is knowing when to stop. Knowing when the project you started three weeks ago has taught you what it's going to teach you and the remaining work is just momentum, not intention. Knowing when you're spending Saturday on a side project because you have genuine conviction about it versus because opening your laptop and talking to an AI feels better than the ambiguity of not having a clear next move in your career.

This requires a kind of self-honesty that the tools actively work against. Every AI coding assistant is optimized to keep you building. That's the product. You describe something, it builds it, you feel good, you describe the next thing. There's no moment in that loop where Claude Code stops and asks you whether this is a good use of your finite time on earth. That reflection has to come from you, and it has to come regularly, not just when you burn out.

I've started asking myself a simple question before I open my terminal on a weekend: "If I couldn't use AI tools for this, would I still think it was worth building?" If the answer is no, that tells me the tool is the draw, not the outcome. And that's a signal to close the laptop and go outside.

Staying up to ship one more feature vs closing the laptop and going outside


The security problem nobody's talking about

There's a second dimension to this that's less personal and more structural. When everyone can build software, everyone builds software. And most of it is insecure.

The vibe coding wave has produced an explosion of personal agents, automation scripts, self-hosted tools, and weekend SaaS products that handle real data, make real API calls, store real credentials, and have been reviewed by exactly zero security-conscious humans. The person who built it asked an AI to "add authentication" and got something that looks like auth, passes the visual check, and might have three injection vectors that nobody will find until someone finds them.

The code an AI wrote for my side project, handling auth and user data
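To make "looks like auth" concrete, here is a hypothetical Python sketch of the kind of login check an assistant can produce when asked to "add authentication," next to the parameterized version that closes the hole. The table layout and function names are invented for illustration, and a real system would hash passwords rather than store them in plaintext as this toy example does.

```python
import sqlite3

# Hypothetical example: the shape of a login check an AI will happily
# generate on request. It "works" in the demo and is trivially injectable,
# because user input is spliced directly into the SQL string.
def login_unsafe(db, username, password):
    query = (f"SELECT 1 FROM users WHERE username = '{username}' "
             f"AND password = '{password}'")
    return db.execute(query).fetchone() is not None

# The fix is one habit: parameterized queries, never string formatting.
def login_safe(db, username, password):
    query = "SELECT 1 FROM users WHERE username = ? AND password = ?"
    return db.execute(query, (username, password)).fetchone() is not None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

payload = "' OR '1'='1"          # classic injection: no credentials needed
print(login_unsafe(db, "alice", payload))  # True  -- attacker is in
print(login_safe(db, "alice", payload))    # False -- injection neutralized
```

The vulnerable version passes every visual check and every happy-path demo, which is exactly why it survives until someone hostile finds it.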

Platforms like OpenClaw are making it even easier to build and deploy personal agents that interact with your email, your calendar, your files, your banking APIs. The capability is genuinely exciting. The security posture of the average project built on these platforms is genuinely terrifying. Most people building personal agents have never thought about token storage, input sanitization, or what happens when your agent's context window gets poisoned by a malicious email it was asked to summarize.

This isn't a reason to stop building. But it's a reason to slow down enough to understand what your code is actually doing, even if (especially if) an AI wrote it for you. The dopamine loop doesn't reward security review. It rewards shipping the next feature.


Touch grass

I realize this entire post might sound hypocritical coming from someone who literally runs autonomous AI agents on a cron job. I build with these tools every day. I think they're transformative. I think everyone should learn to use them.

But I've also caught myself at 1 AM on a weeknight, bleary-eyed, debugging an agent pipeline that automates something I do once a month manually, and I've had to ask myself: what am I actually doing here? Am I solving a problem or am I feeding a compulsion? Is this making my life better or is it just making my GitHub contribution graph greener?

The tools are incredible. The feeling of building with them is genuinely addictive. And like any addiction, the fix is not abstinence, it's awareness. Know why you're building. Know when to stop. Know the difference between a project that serves your goals and a project that just serves the loop.

And occasionally, close the laptop, leave your phone inside, and go stand in your backyard for ten minutes. The code will still be there when you get back. Probably with fewer bugs than if you'd kept going at 1 AM anyway.


I'm a Senior Manager at EY, getting back into writing. If you want to follow me, connect, or have a question to ask, feel free to reach out on LinkedIn.


Scientists Built a GPU Engine That Simulates Brain Cells 1,500× Faster

2026-03-16 23:50:33

:::info

Authors:

  1. Yichen Zhang
  2. Gan He
  3. Lei Ma
  4. Xiaofei Liu
  5. J. J. Johannes Hjorth
  6. Alexander Kozlov
  7. Yutao He
  8. Shenjian Zhang
  9. Jeanette Hellgren Kotaleski
  10. Yonghong Tian
  11. Sten Grillner
  12. Kai Du
  13. Tiejun Huang

:::


Abstract

Biophysically detailed multi-compartment models are powerful tools to explore computational principles of the brain and also serve as a theoretical framework for generating algorithms for artificial intelligence (AI) systems. However, their high computational cost severely limits applications in both neuroscience and AI. The major bottleneck in simulating detailed compartment models is solving the large systems of linear equations that arise at every time step. Here, we present a novel Dendritic Hierarchical Scheduling (DHS) method to markedly accelerate this process. We theoretically prove that the DHS implementation is computationally optimal and accurate. This GPU-based method runs 2-3 orders of magnitude faster than the classic serial Hines method on a conventional CPU platform. We build the DeepDendrite framework, which integrates the DHS method with the GPU computing engine of the NEURON simulator, and demonstrate applications of DeepDendrite in neuroscience tasks. We investigate how spatial patterns of spine inputs affect neuronal excitability in a detailed human pyramidal neuron model with 25,000 spines. Furthermore, we briefly discuss the potential of DeepDendrite for AI, specifically its ability to enable efficient training of biophysically detailed models on typical image classification tasks.


Introduction

Deciphering the coding and computational principles of neurons is essential to neuroscience. Mammalian brains are composed of thousands of different types of neurons with unique morphological and biophysical properties. Even though it is no longer conceptually true, the "point-neuron" doctrine [1], in which neurons are regarded as simple summing units, is still widely applied in neural computation, especially in neural network analysis. In recent years, modern artificial intelligence (AI) has utilized this principle and developed powerful tools, such as artificial neural networks (ANNs) [2]. However, in addition to comprehensive computations at the single-neuron level, subcellular compartments such as neuronal dendrites can also carry out nonlinear operations as independent computational units [3-7]. Furthermore, dendritic spines, small protrusions that densely cover dendrites in spiny neurons, can compartmentalize synaptic signals, allowing them to be separated from their parent dendrites ex vivo and in vivo [8-11].

Simulations using biologically detailed neurons provide a theoretical framework for linking biological details to computational principles. The core of the biophysically detailed multi-compartment model framework [12,13] allows us to model neurons with realistic dendritic morphologies, intrinsic ionic conductances, and extrinsic synaptic inputs. The backbone of the detailed multi-compartment model, i.e., dendrites, is built upon classical cable theory [12], which models the biophysical membrane properties of dendrites as passive cables, providing a mathematical description of how electrical signals invade and propagate throughout complex neuronal processes. By incorporating cable theory with active biophysical mechanisms such as ion channels and excitatory and inhibitory synaptic currents, a detailed multi-compartment model can capture cellular and subcellular neuronal computations beyond experimental limitations [4,7].

In addition to its profound impact on neuroscience, biologically detailed neuron models were recently utilized to bridge the gap between neuronal structural and biophysical details and AI. The prevailing technique in the modern AI field is the ANN composed of point neurons, an analog of biological neural networks. Although ANNs trained with the "backpropagation-of-error" (backprop) algorithm achieve remarkable performance in specialized applications, even beating top human professional players in the games of Go and chess [14,15], the human brain still outperforms ANNs in domains that involve more dynamic and noisy environments [16,17]. Recent theoretical studies suggest that dendritic integration is crucial in generating efficient learning algorithms that potentially exceed backprop in parallel information processing [18-20]. Furthermore, a single detailed multi-compartment model can learn network-level nonlinear computations of point neurons by adjusting only its synaptic strengths [21,22], demonstrating the full potential of detailed models for building more powerful brain-like AI systems. It is therefore a high priority to expand paradigms in brain-like AI from single detailed neuron models to large-scale biologically detailed networks.

One long-standing challenge of the detailed simulation approach lies in its exceedingly high computational cost, which has severely limited its application to neuroscience and AI. The major bottleneck of the simulation is solving the linear equations that arise from the foundational theories of detailed modeling [12,23,24]. To improve efficiency, the classic Hines method reduces the time complexity of solving these equations from O(n³) to O(n) and has been widely applied as the core algorithm in popular simulators such as NEURON [25] and GENESIS [26]. However, this method processes each compartment sequentially. When a simulation involves multiple biophysically detailed dendrites with dendritic spines, the linear equation matrix (the "Hines matrix") scales with the number of dendrites and spines (Fig. 1e), making the Hines method impractical, since it places a very heavy burden on the entire simulation.
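To make the serial bottleneck concrete, here is a minimal Python sketch of a Hines-style O(n) solve on a tree-structured matrix. It illustrates the data dependency only and is not NEURON's implementation; the node numbering convention (every parent indexed before its children) and all variable names are our assumptions.

```python
def hines_solve(parent, d, a, b, rhs):
    """Solve a tree-structured ("Hines") linear system in O(n).

    parent[i] is the parent compartment of node i (parent[0] == -1 for
    the root); d[i] is the diagonal entry, a[i] the entry at (row i,
    col parent[i]), b[i] the entry at (row parent[i], col i). Nodes are
    assumed numbered so that parent[i] < i, which lets one backward
    sweep eliminate leaves before their parents.
    """
    n = len(d)
    d, rhs = list(d), list(rhs)          # keep the caller's arrays intact
    # Backward sweep: eliminate each child's coupling from its parent row.
    # A node's row can only be reduced after all of its children have
    # been -- this is the serial dependency of the Hines method.
    for i in range(n - 1, 0, -1):
        f = b[i] / d[i]
        d[parent[i]] -= f * a[i]
        rhs[parent[i]] -= f * rhs[i]
    # Forward sweep: back-substitute from the root out to the leaves.
    x = [0.0] * n
    x[0] = rhs[0] / d[0]
    for i in range(1, n):
        x[i] = (rhs[i] - a[i] * x[parent[i]]) / d[i]
    return x

# A 3-compartment chain: matrix [[2,-1,0],[-1,2,-1],[0,-1,2]], rhs [1,0,1].
x = hines_solve([-1, 0, 1], [2.0, 2.0, 2.0], [0.0, -1.0, -1.0],
                [0.0, -1.0, -1.0], [1.0, 0.0, 1.0])
# x ≈ [1.0, 1.0, 1.0]
```

The point of the sketch is the ordering constraint in the backward sweep: that child-before-parent dependency is exactly what DHS reorganizes so that independent subtrees can be processed in parallel.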

Fig. 1: Accelerate simulation of biophysically detailed neuron models.

a A reconstructed layer-5 pyramidal neuron model and the mathematical formula used with detailed neuron models. b Workflow when numerically simulating detailed neuron models. The equation-solving phase is the bottleneck in the simulation. c An example of linear equations in the simulation. d Data dependency of the Hines method when solving the linear equations in c. e The size of the Hines matrix scales with model complexity. The number of linear equations to be solved increases significantly as models grow more detailed. f Computational cost (steps taken in the equation-solving phase) of the serial Hines method on different types of neuron models. g Illustration of different solving methods. In the parallel methods (middle, right), different parts of a neuron are assigned to multiple processing units, shown in different colors. In the serial method (left), all compartments are computed with one unit. h Computational cost of the three methods in g when solving equations of a pyramidal model with spines. i Run time of different methods when solving equations for 500 pyramidal models with spines. The run time indicates the time consumed by a 1 s simulation (solving the equations 40,000 times with a time step of 0.025 ms). p-Hines: parallel method in CoreNEURON (on GPU); Branch-based: branch-based parallel method (on GPU); DHS: dendritic hierarchical scheduling method (on GPU).

During the past decades, tremendous progress has been made in speeding up the Hines method using parallel methods at the cellular level, which make it possible to parallelize the computation of different parts of each cell [27-32]. However, current cellular-level parallel methods often lack an efficient parallelization strategy or sufficient numerical accuracy compared to the original Hines method.

Here, we develop a fully automatic, numerically accurate, and optimized simulation tool that significantly accelerates computation and reduces computational cost. In addition, this tool can be seamlessly adopted for establishing and testing neural networks with biological details for machine learning and AI applications. Critically, we formulate the parallel computation of the Hines method as a mathematical scheduling problem and derive a Dendritic Hierarchical Scheduling (DHS) method based on combinatorial optimization [33] and parallel computing theory [34]. We demonstrate that our algorithm provides optimal scheduling without any loss of precision. Furthermore, we have optimized DHS for the most advanced current GPU chips by leveraging the GPU memory hierarchy and memory-access mechanisms. Together, DHS speeds up computation 60-1,500 times (Supplementary Table 1) compared to the classic simulator NEURON [25] while maintaining identical accuracy.

To enable detailed dendritic simulations for use in AI, we next establish the DeepDendrite framework by integrating the DHS-embedded CoreNEURON (an optimized compute engine for NEURON) platform [35] as the simulation engine with two auxiliary modules (an I/O module and a learning module) supporting dendritic learning algorithms during simulations. DeepDendrite runs on GPU hardware, supporting both regular simulation tasks in neuroscience and learning tasks in AI.

Last but not least, we also present several applications using DeepDendrite, targeting a few critical challenges in neuroscience and AI: (1) We demonstrate how spatial patterns of dendritic spine inputs affect neuronal activities with neurons containing spines throughout the dendritic trees (full-spine models). DeepDendrite enables us to explore neuronal computation in a simulated human pyramidal neuron model with ~25,000 dendritic spines. (2) In the discussion we also consider the potential of DeepDendrite in the context of AI, specifically, in creating ANNs with morphologically detailed human pyramidal neurons. Our findings suggest that DeepDendrite has the potential to drastically reduce the training duration, thus making detailed network models more feasible for data-driven tasks.

All source code for DeepDendrite, the full-spine models, and the detailed dendritic network model are publicly available online (see Code Availability). Our open-source learning framework can be readily integrated with other dendritic learning rules, such as learning rules for nonlinear (fully active) dendrites [21], burst-dependent synaptic plasticity [20], and learning with spike prediction [36]. Overall, our study provides a complete set of tools that have the potential to change the current computational neuroscience ecosystem. By leveraging the power of GPU computing, we envision that these tools will facilitate system-level explorations of the computational principles of the brain's fine structures, as well as promote interaction between neuroscience and modern AI.

Results

Dendritic Hierarchical Scheduling (DHS) method

Computing ionic currents and solving linear equations are the two critical phases when simulating biophysically detailed neurons; both are time-consuming and pose severe computational burdens. Fortunately, computing the ionic currents of each compartment is a fully independent process, so it can be naturally parallelized on devices with massive numbers of parallel computing units such as GPUs [37]. As a consequence, solving linear equations becomes the remaining bottleneck for parallelization (Fig. 1a-f).

To tackle this bottleneck, cellular-level parallel methods have been developed that accelerate single-cell computation by "splitting" a single cell into several compartments that can be computed in parallel [27,28,38]. However, such methods rely heavily on prior knowledge to generate practical strategies for splitting a neuron into compartments (Fig. 1g-i; Supplementary Fig. 1). Hence, they become less efficient for neurons with asymmetrical morphologies, e.g., pyramidal neurons and Purkinje neurons.

We aim to develop a more efficient and precise parallel method for the simulation of biologically detailed neural networks. First, we establish criteria for the accuracy of a cellular-level parallel method. Based on theories in parallel computing [34], we propose three conditions that ensure a parallel method yields solutions identical to those of the serial Hines method, according to the data dependency in the Hines method (see Methods). Then, to theoretically evaluate the run time, i.e., efficiency, of serial and parallel computing methods, we introduce and formalize the concept of computational cost as the number of steps a method takes to solve the equations (see Methods).

Based on the simulation accuracy and computational cost, we formulate the parallelization problem as a mathematical scheduling problem (see Methods). In simple terms, we view a single neuron as a tree with many nodes (compartments). For k parallel threads, we can compute at most k nodes at each step, but we need to ensure a node is computed only if all its children nodes have been processed; our goal is to find a strategy with the minimum number of steps for the entire procedure.

To generate an optimal partition, we propose a method called Dendritic Hierarchical Scheduling (DHS) (a theoretical proof is presented in the Methods). The key idea of DHS is to prioritize deep nodes (Fig. 2a), which yields a hierarchical schedule order. The DHS method comprises two steps: analyzing the dendritic topology, then finding the best partition. (1) Given a detailed model, we first obtain its corresponding dependency tree and calculate the depth of each node (the number of its ancestor nodes) on the tree (Fig. 2b, c). (2) After topology analysis, we pick up to the k deepest candidate nodes (a node is a candidate only if all its child nodes have been processed). This procedure repeats until all nodes are processed (Fig. 2d).
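As a concrete illustration, the greedy scheduling loop described above can be sketched in a few lines of Python. This is our own minimal re-implementation for exposition (function and variable names are ours), not the GPU code shipped with DeepDendrite:

```python
from collections import defaultdict

def dhs_partition(parent, k):
    """Greedy DHS sketch: repeatedly process up to the k deepest
    candidate nodes (a node becomes a candidate once all its children
    are processed). parent[i] is the parent index of node i; the root
    has parent -1. Returns the partition as a list of per-step node lists."""
    n = len(parent)
    children = defaultdict(list)
    for v, p in enumerate(parent):
        if p >= 0:
            children[p].append(v)

    def depth(v):  # depth = number of ancestor nodes
        d = 0
        while parent[v] >= 0:
            v = parent[v]
            d += 1
        return d

    depths = [depth(v) for v in range(n)]
    remaining_children = [len(children[v]) for v in range(n)]
    # Initial candidates are the leaves (no unprocessed children).
    candidates = {v for v in range(n) if remaining_children[v] == 0}
    schedule = []
    while candidates:
        # Pick at most k deepest candidates for this step.
        step = sorted(candidates, key=lambda v: -depths[v])[:k]
        schedule.append(step)
        for v in step:
            candidates.remove(v)
            p = parent[v]
            if p >= 0:
                remaining_children[p] -= 1
                if remaining_children[p] == 0:
                    candidates.add(p)
    return schedule
```

For a chain of five compartments the schedule degenerates to five serial steps (the critical path), whereas a balanced tree fills each step with independent branches.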

Fig. 2: Dendritic Hierarchical Scheduling (DHS) method significantly reduces the computational cost, i.e., computational steps in solving equations.

a DHS workflow. DHS processes the k deepest candidate nodes at each iteration. b Illustration of calculating node depth for a compartmental model. The model is first converted to a tree structure, then the depth of each node is computed. Colors indicate different depth values. c Topology analysis of different neuron models. Six neurons with distinct morphologies are shown. For each model, the soma is selected as the root of the tree, so node depth increases from the soma (0) to the distal dendrites. d Illustration of performing DHS on the model in b with four threads. Candidates: nodes that can be processed. Selected candidates: nodes picked by DHS, i.e., the k deepest candidates. Processed nodes: nodes that have been processed in earlier steps. e Parallelization strategy obtained by DHS after the process in d. Each node is assigned to one of the four parallel threads. DHS reduces the 14 steps of serial node processing to 5 by distributing nodes across multiple threads. f Relative cost, i.e., the proportion of the computational cost of DHS to that of the serial Hines method, when applying DHS with different numbers of threads to different types of models.

Take a simplified model with 15 compartments as an example: the serial Hines method takes 14 steps to process all nodes, whereas DHS with four parallel units partitions the nodes into five subsets (Fig. 2d): {{9,10,12,14}, {1,7,11,13}, {2,3,4,8}, {6}, {5}}. Because nodes in the same subset can be processed in parallel, it takes only five steps to process all nodes using DHS (Fig. 2e).

Next, we apply the DHS method with different numbers of threads to six representative detailed neuron models selected from ModelDB39 (Fig. 2f), including cortical and hippocampal pyramidal neurons40,41,42, cerebellar Purkinje neurons43, striatal projection neurons (SPN44), and olfactory bulb mitral cells45, covering the major principal neurons in sensory, cortical, and subcortical areas. We then measure the computational cost; the relative computational cost is defined as the proportion of the computational cost of DHS to that of the serial Hines method. The computational cost, i.e., the number of steps taken in solving the equations, drops dramatically with increasing thread number: with 16 threads, the computational cost of DHS is 7-10% of that of the serial Hines method. Intriguingly, DHS reaches the lower bound of its computational cost for the presented neurons with only 16 or even 8 parallel threads (Fig. 2f), suggesting that adding more threads does not improve performance further because of the dependencies between compartments.
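The saturation at 8-16 threads is consistent with a standard scheduling bound (our framing, not stated explicitly in this form by the paper): no valid schedule can take fewer steps than the longer of the critical path (the deepest root-to-leaf chain, which must be processed serially) and ⌈n/k⌉ (at most k nodes fit in each step). A small sketch of this bound, with names of our own choosing:

```python
import math

def dhs_lower_bound(parent, k):
    """Lower bound on triangularization steps for any valid schedule:
    at least ceil(n / k) steps (limited by thread count) and at least
    the longest root-to-leaf path (limited by data dependencies).
    parent[i] is the parent of node i; the root has parent -1."""
    n = len(parent)

    def depth(v):
        d = 0
        while parent[v] >= 0:
            v = parent[v]
            d += 1
        return d

    critical_path = 1 + max(depth(v) for v in range(n))
    return max(critical_path, math.ceil(n / k))
```

Once k is large enough that ⌈n/k⌉ falls below the critical-path length, extra threads cannot help, which matches the plateaus in Fig. 2f.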

Together, these results show that DHS enables automated analysis of the dendritic topology and an optimal partition for parallel computing. It is worth noting that DHS finds the optimal partition before the simulation starts, so no extra computation is needed while solving the equations.

Speeding up DHS by GPU memory boosting

DHS computes each neuron with multiple threads, which consumes a vast number of threads when running neural network simulations. Graphics Processing Units (GPUs) consist of massive numbers of processing units (i.e., streaming processors, SPs; Fig. 3a, b) for parallel computing46. In theory, the many SPs on a GPU should support efficient simulation of large-scale neural networks (Fig. 3c). However, we consistently observed that the efficiency of DHS decreased significantly as network size grew, which might result from scattered data storage or the extra memory accesses caused by loading and writing intermediate results (Fig. 3d, left).

Fig. 3: GPU memory boosting further accelerates DHS.

a GPU architecture and its memory hierarchy. Each GPU contains massive numbers of processing units (streaming processors). Different types of memory have different throughput. b Architecture of streaming multiprocessors (SMs). Each SM contains multiple streaming processors, registers, and L1 cache. c Applying DHS to two neurons, each with four threads. During simulation, each thread executes on one streaming processor. d Memory optimization strategy on the GPU. Top, thread assignment and data storage of DHS before (left) and after (right) memory boosting. Bottom, an example of a single step in triangularization when simulating the two neurons in c. Processors send a data request to load data for each thread from global memory. Without memory boosting (left), it takes seven transactions to load all requested data, plus extra transactions for intermediate results. With memory boosting (right), it takes only two transactions to load all requested data, and registers are used for intermediate results, further improving memory throughput. e Run time of DHS (32 threads per cell) with and without memory boosting on multiple layer 5 pyramidal models with spines. f Speedup from memory boosting on multiple layer 5 pyramidal models with spines. Memory boosting brings a 1.6-2-fold speedup.

We solve this problem with GPU memory boosting, a method that increases memory throughput by leveraging the GPU's memory hierarchy and access mechanism. Owing to the GPU's memory loading mechanism, successive threads loading aligned, successively stored data achieve much higher memory throughput than threads accessing scattered data46,47. To achieve high throughput, we first align the computing order of nodes and rearrange threads according to the number of nodes assigned to them. We then permute data storage in global memory to be consistent with the computing order, i.e., nodes that are processed at the same step are stored successively in global memory. Moreover, we use GPU registers to store intermediate results, further strengthening memory throughput. In the example, memory boosting takes only two memory transactions to load the eight requested values (Fig. 3d, right). Furthermore, experiments on different numbers of pyramidal neurons with spines and on the typical neuron models (Fig. 3e, f; Supplementary Fig. 2) show that memory boosting achieves a 1.2-3.8-fold speedup over naïve DHS.
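The storage permutation can be illustrated with a small NumPy sketch (the naming is ours; the actual implementation operates on CoreNEURON's internal device arrays). Nodes scheduled for the same step are packed contiguously, so one coalesced transaction can serve all threads in that step:

```python
import numpy as np

def permute_for_coalescing(values, schedule):
    """Reorder per-node data so that nodes computed at the same step
    are contiguous in memory (sketch of the global-memory permutation).

    values: 1-D sequence indexed by original node id.
    schedule: list of per-step node lists (e.g., from a DHS partition).
    Returns (reordered values, old-id -> new-position index map)."""
    # Flatten the schedule: step 0's nodes first, then step 1's, etc.
    order = [v for step in schedule for v in step]
    new_index = np.empty(len(order), dtype=np.int64)
    for new_pos, old_id in enumerate(order):
        new_index[old_id] = new_pos
    return np.asarray(values)[order], new_index
```

After this permutation, the threads handling one step read a contiguous slice of global memory instead of gathering scattered addresses.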

To comprehensively test the performance of DHS with GPU memory boosting, we select six typical neuron models and evaluate the run time of solving the cable equations for massive numbers of each model (Fig. 4). We examine DHS with four threads (DHS-4) and sixteen threads (DHS-16) per neuron. Compared to the GPU method in CoreNEURON, DHS-4 and DHS-16 speed up the simulation about 5-fold and 15-fold, respectively (Fig. 4a). Moreover, compared to the conventional serial Hines method in NEURON running on a single CPU thread, DHS speeds up the simulation by 2-3 orders of magnitude (Supplementary Fig. 3), while retaining identical numerical accuracy in the presence of dense spines (Supplementary Figs. 4 and 8), active dendrites (Supplementary Fig. 7), and different segmentation strategies (Supplementary Fig. 7).

Fig. 4: DHS enables cell-type-specific finest partition.

a Run time of solving equations for a 1 s simulation on GPU (dt = 0.025 ms, 40,000 iterations in total). CoreNEURON: the parallel method used in CoreNEURON; DHS-4: DHS with four threads per neuron; DHS-16: DHS with 16 threads per neuron. b, c Visualization of the partitions produced by DHS-4 and DHS-16; each color indicates a single thread. During computation, each thread switches among different branches.


DHS creates cell-type-specific optimal partitioning

To gain insight into the working mechanism of the DHS method, we visualized the partitioning process by mapping compartments to threads (each color represents a single thread in Fig. 4b, c). The visualization shows that a single thread frequently switches among different branches (Fig. 4b, c). Interestingly, DHS generates aligned partitions in morphologically symmetric neurons such as the striatal projection neuron (SPN) and the mitral cell (Fig. 4b, c). By contrast, it generates fragmented partitions for morphologically asymmetric neurons such as the pyramidal neurons and the Purkinje cell (Fig. 4b, c), indicating that DHS splits the neural tree at the scale of individual compartments (i.e., tree nodes) rather than branches. This cell-type-specific, fine-grained partition enables DHS to fully exploit all available threads.

In summary, DHS and memory boosting generate a theoretically proven optimal solution for solving linear equations in parallel with unprecedented efficiency. Using this principle, we built the open-access DeepDendrite platform, which can be utilized by neuroscientists to implement models without any specific GPU programming knowledge. Below, we demonstrate how we can utilize DeepDendrite in neuroscience tasks. We also discuss the potential of the DeepDendrite framework for AI-related tasks in the Discussion section.

DHS enables spine-level modeling

As dendritic spines receive most of the excitatory input to cortical and hippocampal pyramidal neurons, striatal projection neurons, and other cell types, their morphologies and plasticity are crucial for regulating neuronal excitability10,48,49,50,51. However, spines are too small (~1 μm in length) for direct experimental measurement of voltage-dependent processes. Thus, theoretical work is critical for a full understanding of spine computations.

We can model a single spine with two compartments: the spine head, where synapses are located, and the spine neck, which links the spine head to the dendrite52. Theory predicts that the very thin spine neck (0.1-0.5 μm in diameter) electrically isolates the spine head from its parent dendrite, thus compartmentalizing the signals generated at the spine head53. However, a detailed model with fully distributed spines on dendrites ("full-spine model") is computationally very expensive. A common compromise is to modify the capacitance and resistance of the membrane by an Fspine factor54 instead of modeling all spines explicitly; the Fspine factor approximates the effect of spines on the biophysical properties of the cell membrane54.
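As a rough sketch of the Fspine idea, following the standard spine-correction approach attributed to ref. 54 (the exact form used in any particular model may differ, and these function names are ours), the membrane area contributed by spines is folded into the passive parameters:

```python
def f_spine(dend_area, spine_area):
    """Spine correction factor: ratio of total membrane area
    (dendrite + spines) to the dendritic area alone."""
    return (dend_area + spine_area) / dend_area

def apply_f_spine(cm, rm, F):
    """Scale passive membrane properties instead of modeling spines
    explicitly: specific capacitance scales up by F, specific membrane
    resistivity scales down by F, preserving total membrane load."""
    return cm * F, rm / F
```

This preserves the total leak and capacitive load of the spiny membrane but, as the results below show, cannot reproduce the high local input impedance of individual spine heads.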

Inspired by the previous work of Eyal et al.51, we investigated how different spatial patterns of excitatory inputs onto dendritic spines shape neuronal activity in a human pyramidal neuron model with explicitly modeled spines (Fig. 5a). Notably, Eyal et al. employed the Fspine factor to incorporate spines into dendrites, with only a few activated spines explicitly attached to dendrites ("few-spine model" in Fig. 5a). The value of Fspine in their model was computed from the dendritic and spine areas in the reconstructed data. Accordingly, we calculated the spine density from their reconstructed data to make our full-spine model consistent with Eyal's few-spine model. With the spine density set to 1.3 μm-1, the pyramidal neuron model contained about 25,000 spines without altering the model's original morphological and biophysical properties. We then repeated the previous experimental protocols with both the full-spine and few-spine models, using the same synaptic input as in Eyal's work but attaching extra background noise to each sample. By comparing the somatic traces (Fig. 5b, c) and spike probability (Fig. 5d) in the two models, we found that the full-spine model is much leakier than the few-spine model. In addition, the spike probability triggered by activation of clustered spines appeared more nonlinear in the full-spine model (solid blue line in Fig. 5d) than in the few-spine model (dashed blue line in Fig. 5d). These results indicate that the conventional Fspine approach may underestimate the impact of dense spines on dendritic excitability and nonlinearity.

Fig. 5: DHS enables spine-level modeling.

a Experimental setup. We examine two major types of models: few-spine models and full-spine models. Few-spine models (two on the left) incorporate spine area globally into the dendrites and explicitly attach only the spines carrying activated synapses. In full-spine models (two on the right), all spines are explicitly attached over the whole dendritic tree. We explore the effects of clustered and randomly distributed synaptic inputs on the few-spine and full-spine models, respectively. b Somatic voltages recorded for the cases in a. Colors of the voltage curves correspond to a; scale bar: 20 ms, 20 mV. c Color-coded voltages during the simulation in b at specific times. Colors indicate the magnitude of voltage. d Somatic spike probability as a function of the number of simultaneously activated synapses (as in Eyal et al.'s work) for the four cases in a, with background noise attached. e Run time of the experiments in d with different simulation methods. NEURON: conventional NEURON simulator running on a single CPU core. CoreNEURON: CoreNEURON simulator on a single GPU. DeepDendrite: DeepDendrite on a single GPU.

On the DeepDendrite platform, both the full-spine and few-spine models achieved an 8-fold speedup compared to CoreNEURON on the GPU platform and a 100-fold speedup compared to serial NEURON on the CPU platform (Fig. 5e; Supplementary Table 1), while producing identical simulation results (Supplementary Figs. 4 and 8). The DHS method thus enables exploration of dendritic excitability under more realistic anatomical conditions.

Discussion

In this work, we propose the DHS method to parallelize the computation of the Hines method55 and mathematically demonstrate that DHS provides an optimal solution without any loss of precision. We then implement DHS on the GPU hardware platform and refine it with GPU memory boosting techniques (Fig. 3). When simulating a large number of neurons with complex morphologies, DHS with memory boosting achieves a 15-fold speedup over the GPU method used in CoreNEURON (Supplementary Table 1) and up to a 1,500-fold speedup over the serial Hines method on the CPU platform (Fig. 4; Supplementary Fig. 3 and Supplementary Table 1). Furthermore, we develop the GPU-based DeepDendrite framework by integrating DHS into CoreNEURON. As a demonstration of its capacity, we present a representative application: examining spine computations in a detailed pyramidal neuron model with 25,000 spines. Later in this section, we describe how we have extended the DeepDendrite framework to enable efficient training of biophysically detailed neural networks: to explore the hypothesis that dendrites improve robustness against adversarial attacks56, we train our network on typical image classification tasks. We show that DeepDendrite can support both neuroscience simulations and AI-related detailed neural network tasks with unprecedented speed, thereby significantly promoting detailed neuroscience simulations and, potentially, future AI explorations.

Decades of effort have been invested in speeding up the Hines method through parallelization. Early work mainly focused on network-level parallelization: in network simulations, each cell independently solves its corresponding linear equations with the Hines method, and network-level parallel methods distribute the network over multiple threads, with each thread handling one cell group57,58. With network-level methods, detailed networks can be simulated on clusters or supercomputers59. In recent years, GPUs have been used for detailed network simulation; because a GPU contains massive numbers of computing units, one thread is usually assigned one cell rather than a cell group35,60,61. With further optimization, GPU-based methods achieve much higher efficiency in network simulation. However, the computation inside each cell remains serial in network-level methods, so they still cannot cope when the "Hines matrix" of each cell grows large.

Cellular-level parallel methods further parallelize the computation inside each cell, typically by splitting each cell into several sub-blocks and parallelizing the computation across those sub-blocks27,28. However, typical cellular-level methods (e.g., the "multi-split" method28) pay little attention to the parallelization strategy itself, and the lack of a fine-grained strategy results in unsatisfactory performance. To achieve higher efficiency, some studies obtain finer-grained parallelization by introducing extra computational operations29,38,62 or by making approximations at crucial compartments while solving the linear equations63,64. Such strategies gain efficiency but sacrifice the numerical accuracy of the original Hines method.

Unlike previous methods, DHS adopts the finest-grained parallelization strategy, i.e., compartment-level parallelization. By modeling the question of "how to parallelize" as a combinatorial optimization problem, DHS provides an optimal compartment-level parallelization strategy. Moreover, DHS introduces no extra operations or value approximations, so it simultaneously achieves the lowest computational cost and retains the numerical accuracy of the original Hines method.

Dendritic spines are the most abundant microstructures in the brain for projection neurons in the cortex, hippocampus, cerebellum, and basal ganglia. As spines receive most of the excitatory inputs in the central nervous system, electrical signals generated by spines are the main driving force for large-scale neuronal activities in the forebrain and cerebellum10,11. The structure of the spine, with an enlarged spine head and a very thin spine neck, leads to a surprisingly high input impedance at the spine head, up to 500 MΩ according to combined experimental data and detailed compartmental modeling48,65. Due to such high input impedance, a single synaptic input can evoke a "gigantic" EPSP (~20 mV) at the spine-head level48,66, thereby boosting NMDA currents and ion channel currents in the spine11. However, in classic detailed compartmental models, all spines are replaced by an F coefficient that modifies the dendritic cable geometry54. This approach may compensate for the leak and capacitive currents of spines, but it cannot reproduce the high input impedance at the spine head, and it may therefore weaken excitatory synaptic inputs, particularly NMDA currents, reducing the nonlinearity of the neuron's input-output curve. Our modeling results are in line with this interpretation.

On the other hand, the spine's electrical compartmentalization is always accompanied by biochemical compartmentalization8,52,67, resulting in a drastic increase of internal [Ca2+] within the spine and a cascade of molecular processes underlying the synaptic plasticity important for learning and memory. Intriguingly, the biochemical process triggered by learning in turn remodels the spine's morphology, enlarging (or shrinking) the spine head or elongating (or shortening) the spine neck, which significantly alters the spine's electrical properties67,68,69,70. Such experience-dependent changes in spine morphology, also referred to as "structural plasticity", have been widely observed in vivo in the visual cortex71,72, somatosensory cortex73,74, motor cortex75, hippocampus9, and basal ganglia76, and they play a critical role in motor and spatial learning as well as memory formation. However, due to the computational costs, nearly all detailed network models use the "F-factor" approach in place of actual spines and are thus unable to explore spine function at the system level. By taking advantage of our framework and the GPU platform, we can run a few thousand detailed neuron models, each with tens of thousands of spines, on a single GPU, while running ~100 times faster than the traditional serial method on a single CPU (Fig. 5e). This enables the exploration of structural plasticity in large-scale circuit models across diverse brain regions.

Another critical issue is how to link dendrites to brain functions at the systems/network level. It has been well established that dendrites can perform comprehensive computations on synaptic inputs due to enriched ion channels and local biophysical membrane properties5,6,7. For example, cortical pyramidal neurons can carry out sublinear synaptic integration at the proximal dendrite but progressively shift to supralinear integration at the distal dendrite77. Moreover, distal dendrites can produce regenerative events such as dendritic sodium spikes, calcium spikes, and NMDA spikes/plateau potentials6,78. Such dendritic events are widely observed in mice6 or even human cortical neurons79 in vitro, which may offer various logical operations6,79 or gating functions80,81. Recently, in vivo recordings in awake or behaving mice provide strong evidence that dendritic spikes/plateau potentials are crucial for orientation selectivity in the visual cortex82, sensory-motor integration in the whisker system83,84, and spatial navigation in the hippocampal CA1 region85.

To establish the causal link between dendrites and animal (including human) patterns of behavior, large-scale biophysically detailed neural circuit models are a powerful computational tool to realize this mission. However, running a large-scale detailed circuit model of 10,000-100,000 neurons generally requires the computing power of supercomputers. It is even more challenging to optimize such models for in vivo data, as it needs iterative simulations of the models. The DeepDendrite framework can directly support many state-of-the-art large-scale circuit models86,87,88, which were initially developed based on NEURON. Moreover, using our framework, a single GPU card such as Tesla A100 could easily support the operation of detailed circuit models of up to 10,000 neurons, thereby providing carbon-efficient and affordable plans for ordinary labs to develop and optimize their own large-scale detailed models.

Recent work on unraveling the role of dendrites in task-specific learning has achieved remarkable results in two directions: solving challenging tasks such as the ImageNet image classification dataset with simplified dendritic networks20, and exploring the full learning potential of more realistic neuron models21,22. However, there is a trade-off between model size and biological detail, as increases in network scale often come at the cost of neuron-level complexity19,20,89. Moreover, more detailed neuron models are less mathematically tractable and more computationally expensive21.

There has also been progress on the role of active dendrites in ANNs for computer vision tasks. Iyer et al.90 proposed a novel ANN architecture with active dendrites, demonstrating competitive results in multi-task and continual learning. Jones and Kording91 used a binary tree to approximate dendritic branching and provided valuable insights into the influence of tree structure on the computational capacity of single neurons. Bird et al.92 proposed a dendritic normalization rule based on biophysical behavior, offering an interesting perspective on the contribution of dendritic arbor structure to computation. While these studies offer valuable insights, they primarily rely on abstractions of spatially extended neurons and do not fully exploit the detailed biological properties and spatial information of dendrites. Further investigation is needed to unveil the potential of leveraging more realistic neuron models for understanding the shared mechanisms underlying brain computation and deep learning.

In response to these challenges, we developed DeepDendrite, a tool that uses the Dendritic Hierarchical Scheduling (DHS) method to significantly reduce computational costs and incorporates an I/O module and a learning module to handle large datasets. With DeepDendrite, we successfully implemented a three-layer hybrid neural network, the Human Pyramidal Cell Network (HPC-Net) (Fig. 6a, b). This network demonstrated efficient training capabilities in image classification tasks, achieving approximately 25 times speedup compared to training on a traditional CPU-based platform (Fig. 6f; Supplementary Table 1).

Fig. 6: DeepDendrite enables learning with detailed neural networks.

a The illustration of the Human Pyramidal Cell Network (HPC-Net) for image classification. Images are transformed to spike trains and fed into the network model. Learning is triggered by error signals propagated from soma to dendrites. b Training with mini-batch. Multiple networks are simulated simultaneously with different images as inputs. The total weight updates ΔW are computed as the average of ΔWi from each network. c Comparison of the HPC-Net before and after training. Left, the visualization of hidden neuron responses to a specific input before (top) and after (bottom) training. Right, hidden layer weights (from input to hidden layer) distribution before (top) and after (bottom) training. d Workflow of the transfer adversarial attack experiment. We first generate adversarial samples of the test set on a 20-layer ResNet. Then use these adversarial samples (noisy images) to test the classification accuracy of models trained with clean images. e Prediction accuracy of each model on adversarial samples after training 30 epochs on MNIST (left) and Fashion-MNIST (right) datasets. f Run time of training and testing for the HPC-Net. The batch size is set to 16. Left, run time of training one epoch. Right, run time of testing. Parallel NEURON + Python: training and testing on a single CPU with multiple cores, using 40-process-parallel NEURON to simulate the HPC-Net and extra Python code to support mini-batch training. DeepDendrite: training and testing the HPC-Net on a single GPU with DeepDendrite.

Additionally, it is widely recognized that the performance of Artificial Neural Networks (ANNs) can be undermined by adversarial attacks93—intentionally engineered perturbations devised to mislead ANNs. Intriguingly, an existing hypothesis suggests that dendrites and synapses may innately defend against such attacks56. Our experimental results utilizing HPC-Net lend support to this hypothesis, as we observed that networks endowed with detailed dendritic structures demonstrated some increased resilience to transfer adversarial attacks94 compared to standard ANNs, as evident in MNIST95 and Fashion-MNIST96 datasets (Fig. 6d, e). This evidence implies that the inherent biophysical properties of dendrites could be pivotal in augmenting the robustness of ANNs against adversarial interference. Nonetheless, it is essential to conduct further studies to validate these findings using more challenging datasets such as ImageNet97.

In conclusion, DeepDendrite has shown remarkable potential in image classification tasks, opening up a world of exciting future directions and possibilities. To further advance DeepDendrite and the application of biologically detailed dendritic models in AI tasks, we may focus on developing multi-GPU systems and exploring applications in other domains, such as Natural Language Processing (NLP), where dendritic filtering properties align well with the inherently noisy and ambiguous nature of human language. Challenges include testing scalability in larger-scale problems, understanding performance across various tasks and domains, and addressing the computational complexity introduced by novel biological principles, such as active dendrites. By overcoming these limitations, we can further advance the understanding and capabilities of biophysically detailed dendritic neural networks, potentially uncovering new advantages, enhancing their robustness against adversarial attacks and noisy inputs, and ultimately bridging the gap between neuroscience and modern AI.

Methods

Simulation with DHS

CoreNEURON35 simulator (https://github.com/BlueBrain/CoreNeuron) uses the NEURON25 architecture and is optimized for both memory usage and computational speed. We implement our Dendritic Hierarchical Scheduling (DHS) method in the CoreNEURON environment by modifying its source code. All models that can be simulated on GPU with CoreNEURON can also be simulated with DHS by executing the following command:

coreneuron_exec -d /path/to/models -e time --cell-permute 3 --cell-nthread 16 --gpu

The usage options are as in Table 1.

Table 1 Usage options for DHS-embedded CoreNEURON


Accuracy of the simulation using cellular-level parallel computation

To ensure the accuracy of the simulation, we first need to define the correctness of a cellular-level parallel algorithm, i.e., whether it generates solutions identical to proven serial methods such as the Hines method used in the NEURON simulation platform. Based on theories in parallel computing34, a parallel algorithm yields a result identical to its corresponding serial algorithm if and only if the data processing order in the parallel algorithm is consistent with the data dependency of the serial method. The Hines method has two symmetrical phases: triangularization and back-substitution. By analyzing the serial Hines method55, we find that its data dependency can be formulated as a tree structure in which the nodes represent the compartments of the detailed neuron model. In the triangularization phase, the value of each node depends on its child nodes; in the back-substitution phase, the value of each node depends on its parent node (Fig. 1d). Thus, nodes on different branches can be computed in parallel, as their values are not mutually dependent.

Based on the data dependency of the serial Hines method, we propose three conditions that guarantee a parallel method yields solutions identical to the serial Hines method: (1) the tree morphology and initial values of all nodes are identical to those in the serial Hines method; (2) in the triangularization phase, a node can be processed if and only if all its child nodes have already been processed; (3) in the back-substitution phase, a node can be processed only if its parent node has already been processed. Any parallel computing method satisfying these three conditions produces solutions identical to those of the serial method.
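Conditions (2) and (3) can be checked mechanically for any proposed schedule. A minimal checker for the triangularization phase (a hypothetical helper of ours, not part of DeepDendrite; back-substitution is the mirror image, obtained by reversing the schedule and swapping the parent/child roles):

```python
def is_valid_triangularization_schedule(parent, schedule):
    """Check condition (2): in triangularization, a node may be
    processed only after all of its children have been processed
    in an earlier step. parent[i] is the parent of node i (root -1);
    schedule is a list of per-step node lists."""
    processed = set()
    for step in schedule:
        for v in step:
            # Every child of v must belong to an earlier step.
            for c in range(len(parent)):
                if parent[c] == v and c not in processed:
                    return False
        processed.update(step)
    return True
```

Note that a child scheduled in the same step as its parent is rejected, matching the requirement that children lie in a strictly earlier subset.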

Computational cost of cellular-level parallel computing method

To theoretically evaluate the run time, i.e., efficiency, of serial and parallel computing methods, we formulate the concept of computational cost as follows. Given a tree T and k threads (basic computational units) to perform triangularization, parallel triangularization amounts to dividing the node set V of T into n subsets, i.e., P(V) = {V1, V2, …, Vn} with |Vi| ≤ k for every subset, since at most k nodes can be processed per step with k threads. Triangularization then proceeds in the order V1 → V2 → … → Vn, and nodes in the same subset Vi are processed in parallel. We therefore define the number of subsets n as the computational cost of the parallel method; in short, the computational cost is the number of steps taken in the triangularization phase. Because back-substitution is symmetrical with triangularization, the total cost of solving the equations is twice that of the triangularization phase.

Mathematical scheduling problem

Based on the simulation accuracy and computational cost, we formulate the parallelization problem as a mathematical scheduling problem:

Given a tree T = {V, E} and a positive integer k, where V is the node set and E is the edge set, define a partition P(V) = {V1, V2, …, Vn}, |Vi| ≤ k, 1 ≤ i ≤ n, where |Vi| is the cardinality of subset Vi, i.e., the number of nodes in Vi. For each node v ∈ Vi, all its child nodes {c | c ∈ children(v)} must be in a previous subset Vj, where 1 ≤ j < i. Our goal is to find an optimal partition P*(V) whose computational cost |P*(V)| is minimal.

Here subset Vi consists of all nodes that will be computed at the i-th step (Fig. 2e), so |Vi| ≤ k indicates that we can compute at most k nodes per step because the number of available threads is k. The restriction “for each node v ∈ Vi, all its child nodes {c | c ∈ children(v)} must be in a previous subset Vj, where 1 ≤ j < i” indicates that node v can be processed only after all its child nodes have been processed.

DHS implementation

We aim to find an optimal way to parallelize the computation of solving the linear equations for each neuron model by solving the mathematical scheduling problem above. To obtain the optimal partition, DHS first analyzes the topology and calculates the depth d(v) for every node v ∈ V. Then the following two steps are executed iteratively until every node v ∈ V has been assigned to a subset: (1) find all candidate nodes and put them into the candidate set Q; a node is a candidate only if all its child nodes have been processed or it has no child nodes. (2) If |Q| ≤ k, i.e., the number of candidate nodes is no greater than the number of available threads, remove all nodes from Q and put them into Vi; otherwise, remove the k deepest nodes from Q and add them to subset Vi. Label these nodes as processed (Fig. 2d). After filling subset Vi, return to step (1) to fill the next subset Vi+1.
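The two iterated steps above can be sketched as follows (our reading of DHS, not the authors' implementation; a max-heap keyed on depth stands in for "remove the k deepest nodes").

```python
import heapq

# Sketch of the DHS scheduling loop: iteratively collect candidate nodes
# (all children processed) and take the k deepest per step.
# `children` maps node -> list of child nodes.

def dhs_partition(children, root, k):
    # compute the depth of every node from the root
    depth, stack = {root: 0}, [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            depth[c] = depth[v] + 1
            stack.append(c)
    remaining = {v: len(children.get(v, [])) for v in depth}
    parent = {c: v for v in depth for c in children.get(v, [])}
    # initial candidates: leaves (no unprocessed children); max-heap on depth
    heap = [(-depth[v], v) for v, n in remaining.items() if n == 0]
    heapq.heapify(heap)
    partition = []
    while heap:
        step = [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]
        partition.append(step)
        for v in step:               # a parent becomes a candidate once all
            p = parent.get(v)        # of its children have been processed
            if p is not None:
                remaining[p] -= 1
                if remaining[p] == 0:
                    heapq.heappush(heap, (-depth[p], p))
    return partition

# Toy tree: root 0 with children 1, 2; node 1 has children 3, 4.
children = {0: [1, 2], 1: [3, 4], 2: []}
part = dhs_partition(children, root=0, k=2)
print(part)       # [[3, 4], [1, 2], [0]]
print(len(part))  # 3: triangularization cost; total cost is twice this
```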

Correctness proof for DHS

After applying DHS to a neural tree T = {V, E}, we obtain a partition P(V) = {V1, V2, …, Vn}, |Vi| ≤ k, 1 ≤ i ≤ n. Nodes in the same subset Vi are computed in parallel, taking n steps to perform triangularization and back-substitution, respectively. We now demonstrate that the reordering of computation in DHS yields results identical to those of the serial Hines method.

The partition P(V) obtained from DHS determines the computation order of all nodes in a neural tree. Below we demonstrate that the computation order determined by P(V) satisfies the correctness conditions. P(V) is obtained from the given neural tree T. Operations in DHS modify neither the tree topology nor the values of tree nodes (the corresponding values in the linear equations), so the tree morphology and initial values of all nodes are unchanged, satisfying condition 1: the tree morphology and initial values of all nodes are identical to those in the serial Hines method. In triangularization, nodes are processed from subset V1 to Vn. As shown in the implementation of DHS, all nodes in subset Vi are selected from the candidate set Q, and a node can be put into Q only if all its child nodes have been processed. Thus the child nodes of all nodes in Vi are in {V1, V2, …, Vi-1}, meaning that a node is only computed after all its children have been processed, satisfying condition 2: in triangularization, a node can be processed if and only if all its child nodes have been processed. In back-substitution, the computation order is the opposite of that in triangularization, i.e., from Vn to V1. As shown before, the child nodes of all nodes in Vi are in {V1, V2, …, Vi-1}, so the parent nodes of nodes in Vi are in {Vi+1, Vi+2, …, Vn}, satisfying condition 3: in back-substitution, a node can be processed only if its parent node has been processed.

Optimality proof for DHS

The idea of the proof is that if there is another optimal solution, it can be transformed into our DHS solution without increasing the number of steps the algorithm requires, thus indicating that the DHS solution is optimal.

For each subset Vi in P(V), DHS moves the k (thread number) deepest nodes from the corresponding candidate set Qi to Vi. If the number of nodes in Qi is smaller than k, all nodes are moved from Qi to Vi. To simplify, we introduce Di, the depth sum of the k deepest nodes in Qi. All subsets in P(V) satisfy the max-depth criterion (Supplementary Fig. 6a): Σv∈Vi d(v) = Di. We then prove that selecting the deepest nodes in each iteration makes P(V) an optimal partition. If there exists an optimal partition P*(V) = {V*1, V*2, …, V*s} containing subsets that do not satisfy the max-depth criterion, we can modify the subsets in P*(V) so that every subset consists of the deepest nodes from Q while the number of subsets (|P*(V)|) remains the same after modification.

Without loss of generality, we start from the first subset V*i not satisfying the criterion, i.e., Σv∈V*i d(v) < Di. There are two possible cases that make V*i violate the max-depth criterion: (1) |V*i| < k and some valid nodes in Qi are not put into V*i; (2) |V*i| = k but the nodes in V*i are not the k deepest nodes in Qi.

For case (1), because some candidate nodes are not put into V*i, these nodes must appear in subsequent subsets. As |V*i| < k, we can move the corresponding nodes from the subsequent subsets to V*i, which does not increase the number of subsets and makes V*i satisfy the criterion (Supplementary Fig. 6b, top). For case (2), |V*i| = k, and the deeper nodes that were not moved from the candidate set Qi into V*i must have been added to subsequent subsets (Supplementary Fig. 6b, bottom). These deeper nodes can be moved from subsequent subsets to V*i as follows. Assume that after filling V*i, a node v was picked while one of the k deepest nodes, v’, is still in Qi; thus v’ will be put into a subsequent subset V*j (j > i). We first move v from V*i to V*i+1, then modify subset V*i+1 as follows: if |V*i+1| ≤ k and no node in V*i+1 is the parent of node v, stop modifying the later subsets. Otherwise, modify V*i+1 as follows (Supplementary Fig. 6c): if the parent node of v is in V*i+1, move this parent node to V*i+2; else move the node with minimum depth from V*i+1 to V*i+2. After adjusting V*i, modify the subsequent subsets V*i+1, V*i+2, …, V*j-1 with the same strategy. Finally, move v’ from V*j to V*i.

With the modification strategy described above, we can replace all shallower nodes in V*i with the deepest nodes in Qi while keeping the number of subsets, |P*(V)|, the same. We can apply the same strategy to every subset in P*(V) that does not contain the deepest nodes. Finally, all subsets V*i ∈ P*(V) satisfy the max-depth criterion, and |P*(V)| is unchanged by the modification.

In conclusion, DHS generates a partition P(V) in which every subset Vi ∈ P(V) satisfies the max-depth criterion Σv∈Vi d(v) = Di. Any other optimal partition P*(V) can be modified so that its structure matches that of P(V), i.e., each subset consists of the deepest nodes in the candidate set, while |P*(V)| stays the same. Therefore, the partition P(V) obtained from DHS is one of the optimal partitions.

GPU implementation and memory boosting

To achieve high memory throughput, GPUs use a memory hierarchy of (1) global memory, (2) cache, and (3) registers, where global memory has large capacity but low throughput, while registers have small capacity but high throughput. We aim to boost memory throughput by exploiting this hierarchy.

GPUs employ the SIMT (Single-Instruction, Multiple-Thread) architecture. Warps are the basic scheduling units on a GPU (a warp is a group of 32 parallel threads). A warp executes the same instruction with different data on different threads [46]. Correctly ordering the nodes is essential for this batching of computation into warps, so that DHS obtains results identical to those of the serial Hines method. When implementing DHS on GPU, we first group all cells into multiple warps based on their morphologies: cells with similar morphologies are placed in the same warp. We then apply DHS to all neurons, assigning the compartments of each neuron to multiple threads. Because neurons are grouped into warps, the threads for the same neuron are in the same warp, so the intrinsic synchronization within warps keeps the computation order consistent with the data dependency of the serial Hines method. Finally, threads in each warp are aligned and rearranged according to the number of compartments.

When a warp loads pre-aligned, successively stored data from global memory, it makes full use of the cache, which leads to high memory throughput, whereas accessing scattered data reduces it. After compartment assignment and thread rearrangement, we permute the data in global memory to make their layout consistent with the computing order, so that warps load successively stored data when executing the program. Moreover, we keep the necessary temporary variables in registers rather than global memory. Registers have the highest memory throughput, so their use further accelerates DHS.
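A minimal sketch of the data-permutation idea (ours; all names are hypothetical), assuming a step-major layout: nodes computed in the same step are stored contiguously, so a warp's loads at each step hit consecutive addresses.

```python
# Sketch: permute per-node data so that nodes computed in the same step sit
# contiguously in memory, matching the schedule's computation order.

def permute_for_schedule(values, partition):
    order = [v for step in partition for v in step]   # step-major layout
    permuted = [values[v] for v in order]
    index_of = {v: i for i, v in enumerate(order)}    # old id -> new slot
    return permuted, index_of

# per-node data keyed by node id, and the schedule from the toy tree above
values = {0: 10.0, 1: 11.0, 2: 12.0, 3: 13.0, 4: 14.0}
partition = [[3, 4], [1, 2], [0]]
permuted, index_of = permute_for_schedule(values, partition)
print(permuted)     # [13.0, 14.0, 11.0, 12.0, 10.0]
print(index_of[0])  # 4: the root is computed last, so it is stored last
```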

Full-spine and few-spine biophysical models

We used the published human pyramidal neuron model [51]. The membrane capacitance cm = 0.44 μF cm-2, membrane resistance rm = 48,300 Ω cm2, and axial resistivity ra = 261.97 Ω cm. In this model, all dendrites were modeled as passive cables while somas were active. The leak reversal potential El = -83.1 mV. Ion channels such as Na+ and K+ were inserted on the soma and initial axon, with reversal potentials ENa = 67.6 mV and EK = -102 mV, respectively. All these parameters were set as in the model of Eyal et al. [51]; for more details, please refer to the published model (ModelDB, accession no. 238347).

In the few-spine model, the membrane capacitance and maximum leak conductance of the dendritic cables 60 μm away from the soma were multiplied by a factor Fspine to approximate dendritic spines. In this model, Fspine was set to 1.9. Only the spines that receive synaptic inputs were explicitly attached to dendrites.

In the full-spine model, all spines were explicitly attached to dendrites. We calculated the spine density from the reconstructed neuron in Eyal et al. [51]. The spine density was set to 1.3 μm-1, and each cell contained 24,994 spines on dendrites 60 μm away from the soma.

The morphologies and biophysical mechanisms of spines were the same in few-spine and full-spine models. The length of the spine neck Lneck = 1.35 μm and the diameter Dneck = 0.25 μm, whereas the length and diameter of the spine head were 0.944 μm, i.e., the spine head area was set to 2.8 μm2. Both spine neck and spine head were modeled as passive cables, with the reversal potential El = -86 mV. The specific membrane capacitance, membrane resistance, and axial resistivity were the same as those for dendrites.

Synaptic inputs

We investigated neuronal excitability for both distributed and clustered synaptic inputs. All activated synapses were attached to the terminal of the spine head. For distributed inputs, all activated synapses were randomly distributed on all dendrites. For clustered inputs, each cluster consisted of 20 activated synapses that were uniformly distributed on a single randomly-selected compartment. All synapses were activated simultaneously during the simulation.

AMPA-based and NMDA-based synaptic currents were simulated as in Eyal et al.’s work. AMPA conductance was modeled as a double-exponential function and NMDA conductance as a voltage-dependent double-exponential function. For the AMPA model, τrise and τdecay were set to 0.3 and 1.8 ms; for the NMDA model, τrise and τdecay were set to 8.019 and 34.9884 ms, respectively. The maximum conductances of AMPA and NMDA were 0.73 nS and 1.31 nS.
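The kinetics above can be sketched as follows. The time constants and peak conductances are those given in the text; the voltage-dependent magnesium-block factor for NMDA (Jahr-Stevens form, 0.062 mV-1 and [Mg]/3.57 mM) is a common choice and is our assumption, not taken from the paper.

```python
import math

# Double-exponential synaptic conductance, normalized so that the peak
# equals g_max; times in ms, conductances in nS.

def double_exp(t, tau_rise, tau_decay, g_max):
    if t < 0:
        return 0.0
    # time of peak, used to normalize the peak conductance to g_max
    t_peak = (tau_decay * tau_rise / (tau_decay - tau_rise)
              * math.log(tau_decay / tau_rise))
    norm = math.exp(-t_peak / tau_decay) - math.exp(-t_peak / tau_rise)
    return g_max * (math.exp(-t / tau_decay) - math.exp(-t / tau_rise)) / norm

def g_ampa(t):
    return double_exp(t, 0.3, 1.8, 0.73)

def g_nmda(t, v, mg=1.0):
    # Jahr-Stevens magnesium block (assumed parameters, see lead-in)
    block = 1.0 / (1.0 + mg / 3.57 * math.exp(-0.062 * v))
    return block * double_exp(t, 8.019, 34.9884, 1.31)

print(round(g_ampa(0.645), 2))  # 0.73: the peak conductance, at ~0.645 ms
```

The voltage-dependent block makes the NMDA conductance much larger near 0 mV than at rest, which is what gives NMDA synapses their nonlinear, cooperative character.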

Background noise

We attached background noise to each cell to simulate a more realistic environment. Noise patterns were implemented as Poisson spike trains with a constant rate of 1.0 Hz. Each pattern started at tstart = 10 ms and lasted until the end of the simulation. We generated 400 noise spike trains for each cell and attached them to randomly selected synapses. The model and parameters of the synaptic currents were the same as described in Synaptic inputs, except that the maximum conductance of NMDA was uniformly distributed from 1.57 to 3.275 nS, resulting in a higher AMPA to NMDA ratio.
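A minimal sketch of the noise generation, assuming exponential inter-spike intervals (the standard way to draw a homogeneous Poisson train); the 5 s duration is an arbitrary example, not from the paper.

```python
import random

# Homogeneous Poisson spike train: draw exponential inter-spike intervals
# at the given rate, starting at t_start_ms, until t_end_ms.

def poisson_train(rate_hz, t_start_ms, t_end_ms, rng):
    spikes, t = [], t_start_ms
    while True:
        t += rng.expovariate(rate_hz) * 1000.0   # interval in ms
        if t >= t_end_ms:
            return spikes
        spikes.append(t)

rng = random.Random(0)
trains = [poisson_train(1.0, 10.0, 5000.0, rng) for _ in range(400)]
mean_count = sum(len(tr) for tr in trains) / len(trains)
print(mean_count)  # mean spike count per train; expected ~5 at 1 Hz over ~5 s
```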

Exploring neuronal excitability

We investigated the spike probability when multiple synapses were activated simultaneously. For distributed inputs, we tested 14 cases, from 0 to 240 activated synapses. For clustered inputs, we tested 9 cases in total, activating from 0 to 12 clusters, each consisting of 20 synapses. For each case with both distributed and clustered inputs, we calculated the spike probability over 50 random samples. Spike probability was defined as the ratio of the number of samples in which the neuron fired to the total number of samples. All 1150 samples were simulated simultaneously on our DeepDendrite platform, reducing the simulation time from days to minutes.

Performing AI tasks with the DeepDendrite platform

Conventional detailed-neuron simulators lack two functionalities important for modern AI tasks: (1) alternating between simulation and weight updates without heavy reinitialization, and (2) processing multiple stimulus samples simultaneously in a batch-like manner. Here we present the DeepDendrite platform, which supports both biophysical simulation and deep learning tasks with detailed dendritic models.

DeepDendrite consists of three modules (Supplementary Fig. 5): (1) an I/O module; (2) a DHS-based simulating module; (3) a learning module. When training a biophysically detailed model to perform learning tasks, users first define the learning rule, then feed all training samples to the detailed model. In each training step, the I/O module picks a specific stimulus and its corresponding teacher signal (if necessary) from the training samples and attaches the stimulus to the network model. The DHS-based simulating module then initializes the model and starts the simulation. After the simulation, the learning module updates all synaptic weights according to the difference between the model responses and the teacher signals. After training, the learned model can achieve performance comparable to an ANN. The testing phase is similar to training, except that all synaptic weights are fixed.

HPC-Net model

Image classification is a typical task in the field of AI. In this task, a model should learn to recognize the content in a given image and output the corresponding label. Here we present the HPC-Net, a network consisting of detailed human pyramidal neuron models that can learn to perform image classification tasks by utilizing the DeepDendrite platform.

HPC-Net has three layers, i.e., an input layer, a hidden layer, and an output layer. The neurons in the input layer receive spike trains converted from images as their input. Hidden layer neurons receive the output of input layer neurons and deliver responses to neurons in the output layer. The responses of the output layer neurons are taken as the final output of HPC-Net. Neurons between adjacent layers are fully connected.

For each image stimulus, we first convert each normalized pixel to a homogeneous spike train. For the pixel with coordinates (x, y) in the image, the corresponding spike train has a constant interspike interval τISI(x, y) (in ms) determined by the pixel value p(x, y), as shown in Eq. (1).
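Eq. (1) is not reproduced here, so as an illustration we assume the common inverse mapping τISI = τmin / p (brighter pixels spike faster), with τmin a hypothetical scale parameter; the 9 ms offset and 50 ms window are the ones described in the text.

```python
# Sketch (assumed mapping, see lead-in): convert one normalized pixel value
# p in [0, 1] into a regular spike train. First spike at 9 + tau_ISI ms,
# constant intervals thereafter, until the 50 ms simulation ends.

def pixel_to_spikes(p, tau_min=1.0, t_end=50.0):
    if p <= 0.0:
        return []                    # zero-valued pixels emit no spikes
    tau_isi = tau_min / p
    t, spikes = 9.0 + tau_isi, []
    while t < t_end:
        spikes.append(t)
        t += tau_isi
    return spikes

print(len(pixel_to_spikes(1.0)))  # 40 spikes at 10, 11, ..., 49 ms
print(pixel_to_spikes(0.0))       # []
```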

In our experiment, the simulation for each stimulus lasted 50 ms. All spike trains started at 9 + τISI ms and lasted until the end of the simulation. Then we attached all spike trains to the input layer neurons in a one-to-one manner. The synaptic current triggered by the spike arriving at time t0 is given by

where v is the post-synaptic voltage, the reversal potential Esyn = 1 mV, the maximum synaptic conductance gmax = 0.05 μS, and the time constant τ = 0.5 ms.

Neurons in the input layer were modeled with a passive single-compartment model. The specific parameters were set as follows: membrane capacitance cm = 1.0 μF cm-2, membrane resistance rm = 104 Ω cm2, axial resistivity ra = 100 Ω cm, reversal potential of passive compartment El = 0 mV.

The hidden layer contains a group of human pyramidal neuron models, receiving the somatic voltages of the input layer neurons. The morphology was from Eyal et al. [51], and all neurons were modeled with passive cables. The specific membrane capacitance cm = 1.5 μF cm-2, membrane resistance rm = 48,300 Ω cm2, axial resistivity ra = 261.97 Ω cm, and the reversal potential of all passive cables El = 0 mV. Input neurons could make multiple connections to randomly selected locations on the dendrites of hidden neurons. The synaptic current activated by the k-th synapse of the i-th input neuron on neuron j’s dendrite is defined as in Eq. (4), where gijk is the synaptic conductance, Wijk is the synaptic weight,  is the ReLU-like somatic activation function, and  is the somatic voltage of the i-th input neuron at time t.

Neurons in the output layer were also modeled with a passive single-compartment model, and each hidden neuron only made one synaptic connection to each output neuron. All specific parameters were set the same as those of the input neurons. Synaptic currents activated by hidden neurons are also in the form of Eq. (4).

Image classification with HPC-Net

For each input image stimulus, we first normalized all pixel values to 0.0-1.0. Then we converted the normalized pixels to spike trains and attached them to the input neurons. Somatic voltages of the output neurons are used to compute the predicted probability of each class, as shown in Eq. (6), where pi is the probability of the i-th class predicted by the HPC-Net,  is the average somatic voltage from 20 ms to 50 ms of the i-th output neuron, and C is the number of classes, which equals the number of output neurons. The class with the maximum predicted probability is the final classification result. In this paper, we built the HPC-Net with 784 input neurons, 64 hidden neurons, and 10 output neurons.
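Eq. (6) is not reproduced here; a softmax over the mean somatic voltages is one natural reading, consistent with the cross-entropy loss used for training. A sketch under that assumption:

```python
import math

# Sketch (assumed softmax read-out, see lead-in): class probabilities from
# the mean somatic voltages of the output neurons (20-50 ms window).

def predict(mean_voltages):
    exps = [math.exp(v) for v in mean_voltages]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, max(range(len(probs)), key=probs.__getitem__)

probs, label = predict([0.1, 1.2, 0.3])
print(label)                 # 1: the class with the highest mean voltage
print(round(sum(probs), 6))  # 1.0: probabilities are normalized
```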

Synaptic plasticity rules for HPC-Net

Inspired by previous work [36], we use a gradient-based learning rule to train our HPC-Net to perform the image classification task. The loss function we use is the cross-entropy, given in Eq. (7), where pi is the predicted probability for class i and yi indicates the actual class of the stimulus image: yi = 1 if the input image belongs to class i, and yi = 0 otherwise.
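With one-hot labels, the cross-entropy of Eq. (7) reduces to the negative log of the probability assigned to the true class, which the following one-liner makes explicit:

```python
import math

# Cross-entropy with a one-hot label: -sum_i y_i * log(p_i) collapses to
# -log(p_true) because only the true class has y_i = 1.

def cross_entropy(probs, true_class):
    return -math.log(probs[true_class])

print(round(cross_entropy([0.1, 0.7, 0.2], 1), 4))  # 0.3567, i.e. -ln(0.7)
```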

When training HPC-Net, we compute the update for weight Wijk (the synaptic weight of the k-th synapse connecting neuron i to neuron j) at each time step. After the simulation of each image stimulus, Wijk is updated as shown in Eq. (8):

Here  is the learning rate,  is the update value at time t, vj and vi are the somatic voltages of neurons j and i, respectively, Iijk is the k-th synaptic current activated by neuron i on neuron j, gijk is its synaptic conductance, rijk is the transfer resistance from the k-th compartment connected by neuron i on neuron j’s dendrite to neuron j’s soma, and ts = 30 ms and te = 50 ms are the start and end times for learning, respectively. For output neurons, the error term  can be computed as shown in Eq. (10). For hidden neurons, the error term  is calculated from the error terms of the output layer, as given in Eq. (11).

Since all output neurons are single-compartment,  equals the input resistance of the corresponding compartment, . Transfer and input resistances are computed by NEURON.

Mini-batch training is a typical method in deep learning for achieving higher prediction accuracy and accelerating convergence. DeepDendrite also supports mini-batch training. When training HPC-Net with mini-batch size Nbatch, we make Nbatch copies of HPC-Net. During training, each copy is fed with a different training sample from the batch. DeepDendrite first computes the weight update for each copy separately. After all copies in the current training batch are done, the average weight update is calculated and weights in all copies are updated by this same amount.
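The batch scheme can be sketched as follows (our illustration; the flat weight layout is hypothetical): per-copy updates are averaged, and the same averaged step is applied to every copy so that all copies stay synchronized.

```python
# Sketch: mini-batch update over N copies of the network. Each copy computes
# its own weight update from one sample; the average update is then applied
# identically to every copy.

def apply_minibatch(weights, per_copy_updates, lr=1.0):
    n = len(per_copy_updates)
    avg = [sum(us) / n for us in zip(*per_copy_updates)]
    return [[w + lr * d for w, d in zip(copy, avg)] for copy in weights]

# two copies of a 3-weight model, each with its own update from one sample
weights = [[0.5, -0.2, 0.1], [0.5, -0.2, 0.1]]
updates = [[0.2, 0.0, -0.1], [0.0, 0.2, -0.1]]
new = apply_minibatch(weights, updates)
print(new[0])            # [0.6, -0.1, 0.0]
print(new[0] == new[1])  # True: copies remain synchronized
```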

Robustness against adversarial attack with HPC-Net

To demonstrate the robustness of HPC-Net, we tested its prediction accuracy on adversarial samples and compared it with an analogous ANN (one with the same 784-64-10 structure and ReLU activation; for a fair comparison, in our HPC-Net each input neuron made only one synaptic connection to each hidden neuron). We first trained HPC-Net and the ANN on the original training set (clean images). We then added adversarial noise to the test set and measured their prediction accuracy on the noisy test set. We used Foolbox [98,99] to generate adversarial noise with the FGSM method [93]. The ANN was trained with PyTorch [100], and HPC-Net was trained with our DeepDendrite. For fairness, we generated the adversarial noise on a significantly different network model, a 20-layer ResNet [101]. The noise level ranged from 0.02 to 0.2. We experimented on two typical datasets, MNIST [95] and Fashion-MNIST [96]. The results show that the prediction accuracy of HPC-Net is 19% and 16.72% higher than that of the analogous ANN on the two datasets, respectively.
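For reference, FGSM perturbs each input feature by ε times the sign of the loss gradient with respect to that feature. A from-scratch toy sketch (the paper uses Foolbox; the gradient values here are made up):

```python
# Sketch of the fast gradient sign method (FGSM): x_adv = x + eps * sign(g),
# where g is the gradient of the loss with respect to the input x.

def fgsm(x, grad_wrt_x, epsilon):
    sign = [1.0 if g > 0 else (-1.0 if g < 0 else 0.0) for g in grad_wrt_x]
    return [xi + epsilon * s for xi, s in zip(x, sign)]

x = [0.2, 0.8, 0.5]
grad = [0.3, -1.2, 0.0]            # hypothetical loss gradient w.r.t. x
x_adv = fgsm(x, grad, epsilon=0.1)
print([round(v, 3) for v in x_adv])  # [0.3, 0.7, 0.5]
```

Note that every perturbed feature moves by exactly ε, which is why the "noise level" above is a single scalar.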

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The data that support the findings of this study are available within the paper, the Supplementary Information, and the Source Data files provided with this paper. The source code and data used to reproduce the results in Figs. 3–6 are available at https://github.com/pkuzyc/DeepDendrite. The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist. The Fashion-MNIST dataset is publicly available at https://github.com/zalandoresearch/fashion-mnist. Source data are provided with this paper.

Code availability

The source code of DeepDendrite, as well as the models and code used to reproduce Figs. 3–6 in this study, are available at https://github.com/pkuzyc/DeepDendrite.

References

  1. McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
  2. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  3. Poirazi, P., Brannon, T. & Mel, B. W. Arithmetic of subthreshold synaptic summation in a model CA1 pyramidal cell. Neuron 37, 977–987 (2003).
  4. London, M. & Häusser, M. Dendritic computation. Annu. Rev. Neurosci. 28, 503–532 (2005).
  5. Branco, T. & Häusser, M. The single dendritic branch as a fundamental functional unit in the nervous system. Curr. Opin. Neurobiol. 20, 494–502 (2010).
  6. Stuart, G. J. & Spruston, N. Dendritic integration: 60 years of progress. Nat. Neurosci. 18, 1713–1721 (2015).
  7. Poirazi, P. & Papoutsi, A. Illuminating dendritic function with computational models. Nat. Rev. Neurosci. 21, 303–321 (2020).
  8. Yuste, R. & Denk, W. Dendritic spines as basic functional units of neuronal integration. Nature 375, 682–684 (1995).
  9. Engert, F. & Bonhoeffer, T. Dendritic spine changes associated with hippocampal long-term synaptic plasticity. Nature 399, 66–70 (1999).
  10. Yuste, R. Dendritic spines and distributed circuits. Neuron 71, 772–781 (2011).
  11. Yuste, R. Electrical compartmentalization in dendritic spines. Annu. Rev. Neurosci. 36, 429–449 (2013).
  12. Rall, W. Branching dendritic trees and motoneuron membrane resistivity. Exp. Neurol. 1, 491–527 (1959).
  13. Segev, I. & Rall, W. Computational study of an excitable dendritic spine. J. Neurophysiol. 60, 499–523 (1988).
  14. Silver, D. et al. Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016).
  15. Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362, 1140–1144 (2018).
  16. McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: the sequential learning problem. Psychol. Learn. Motiv. 24, 109–165 (1989).
  17. French, R. M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3, 128–135 (1999).
  18. Naud, R. & Sprekeler, H. Sparse bursts optimize information transmission in a multiplexed neural code. Proc. Natl Acad. Sci. USA 115, E6329–E6338 (2018).
  19. Sacramento, J., Costa, R. P., Bengio, Y. & Senn, W. Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in Neural Information Processing Systems 31 (NeurIPS, 2018).
  20. Payeur, A., Guerguiev, J., Zenke, F., Richards, B. A. & Naud, R. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nat. Neurosci. 24, 1010–1019 (2021).
  21. Bicknell, B. A. & Häusser, M. A synaptic learning rule for exploiting nonlinear dendritic computation. Neuron 109, 4001–4017 (2021).
  22. Moldwin, T., Kalmenson, M. & Segev, I. The gradient clusteron: a model neuron that learns to solve classification tasks via dendritic nonlinearities, structural plasticity, and gradient descent. PLoS Comput. Biol. 17, e1009015 (2021).
  23. Hodgkin, A. L. & Huxley, A. F. A quantitative description of membrane current and Its application to conduction and excitation in nerve. J. Physiol. 117, 500–544 (1952).
  24. Rall, W. Theory of physiological properties of dendrites. Ann. N. Y. Acad. Sci. 96, 1071–1092 (1962).
  25. Hines, M. L. & Carnevale, N. T. The NEURON simulation environment. Neural Comput. 9, 1179–1209 (1997).
  26. Bower, J. M. & Beeman, D. in The Book of GENESIS: Exploring Realistic Neural Models with the GEneral NEural SImulation System (eds Bower, J.M. & Beeman, D.) 17–27 (Springer New York, 1998).
  27. Hines, M. L., Eichner, H. & Schürmann, F. Neuron splitting in compute-bound parallel network simulations enables runtime scaling with twice as many processors. J. Comput. Neurosci. 25, 203–210 (2008).
  28. Hines, M. L., Markram, H. & Schürmann, F. Fully implicit parallel simulation of single neurons. J. Comput. Neurosci. 25, 439–448 (2008).
  29. Ben-Shalom, R., Liberman, G. & Korngreen, A. Accelerating compartmental modeling on a graphical processing unit. Front. Neuroinform. 7, 4 (2013).
  30. Tsuyuki, T., Yamamoto, Y. & Yamazaki, T. Efficient numerical simulation of neuron models with spatial structure on graphics processing units. In Proc. 2016 International Conference on Neural Information Processing (eds Hirose, A. et al.) 279–285 (Springer International Publishing, 2016).
  31. Vooturi, D. T., Kothapalli, K. & Bhalla, U. S. Parallelizing Hines Matrix Solver in Neuron Simulations on GPU. In Proc. IEEE 24th International Conference on High Performance Computing (HiPC) 388–397 (IEEE, 2017).
  32. Huber, F. Efficient tree solver for hines matrices on the GPU. Preprint at https://arxiv.org/abs/1810.12742 (2018).
  33. Korte, B. & Vygen, J. Combinatorial Optimization Theory and Algorithms 6 edn (Springer, 2018).
  34. Gebali, F. Algorithms and Parallel Computing (Wiley, 2011).
  35. Kumbhar, P. et al. CoreNEURON: An optimized compute engine for the NEURON simulator. Front. Neuroinform. 13, 63 (2019).
  36. Urbanczik, R. & Senn, W. Learning by the dendritic prediction of somatic spiking. Neuron 81, 521–528 (2014).
  37. Ben-Shalom, R., Aviv, A., Razon, B. & Korngreen, A. Optimizing ion channel models using a parallel genetic algorithm on graphical processors. J. Neurosci. Methods 206, 183–194 (2012).
  38. Mascagni, M. A parallelizing algorithm for computing solutions to arbitrarily branched cable neuron models. J. Neurosci. Methods 36, 105–114 (1991).
  39. McDougal, R. A. et al. Twenty years of modelDB and beyond: building essential modeling tools for the future of neuroscience. J. Comput. Neurosci. 42, 1–10 (2017).
  40. Migliore, M., Messineo, L. & Ferrante, M. Dendritic Ih selectively blocks temporal summation of unsynchronized distal inputs in CA1 pyramidal neurons. J. Comput. Neurosci. 16, 5–13 (2004).
  41. Hemond, P. et al. Distinct classes of pyramidal cells exhibit mutually exclusive firing patterns in hippocampal area CA3b. Hippocampus 18, 411–424 (2008).
  42. Hay, E., Hill, S., Schürmann, F., Markram, H. & Segev, I. Models of neocortical layer 5b pyramidal cells capturing a wide range of dendritic and perisomatic active Properties. PLoS Comput. Biol. 7, e1002107 (2011).
  43. Masoli, S., Solinas, S. & D’Angelo, E. Action potential processing in a detailed purkinje cell model reveals a critical role for axonal compartmentalization. Front. Cell. Neurosci. 9, 47 (2015).
  44. Lindroos, R. et al. Basal ganglia neuromodulation over multiple temporal and structural scales—simulations of direct pathway MSNs investigate the fast onset of dopaminergic effects and predict the role of Kv4.2. Front. Neural Circuits 12, 3 (2018).
  45. Migliore, M. et al. Synaptic clusters function as odor operators in the olfactory bulb. Proc. Natl Acad. Sci. USA 112, 8499–8504 (2015).
  46. NVIDIA. CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html (2021).
  47. NVIDIA. CUDA C++ Best Practices Guide. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html (2021).
  48. Harnett, M. T., Makara, J. K., Spruston, N., Kath, W. L. & Magee, J. C. Synaptic amplification by dendritic spines enhances input cooperativity. Nature 491, 599–602 (2012).
  49. Chiu, C. Q. et al. Compartmentalization of GABAergic inhibition by dendritic spines. Science 340, 759–762 (2013).
  50. Tønnesen, J., Katona, G., Rózsa, B. & Nägerl, U. V. Spine neck plasticity regulates compartmentalization of synapses. Nat. Neurosci. 17, 678–685 (2014).
  51. Eyal, G. et al. Human cortical pyramidal neurons: from spines to spikes via models. Front. Cell. Neurosci. 12, 181 (2018).
  52. Koch, C. & Zador, A. The function of dendritic spines: devices subserving biochemical rather than electrical compartmentalization. J. Neurosci. 13, 413–422 (1993).
  53. Koch, C. Dendritic spines. In Biophysics of Computation (Oxford University Press, 1999).
  54. Rapp, M., Yarom, Y. & Segev, I. The impact of parallel fiber background activity on the cable properties of cerebellar purkinje cells. Neural Comput. 4, 518–533 (1992).
  55. Hines, M. Efficient computation of branched nerve equations. Int. J. Bio-Med. Comput. 15, 69–76 (1984).
  56. Nayebi, A. & Ganguli, S. Biologically inspired protection of deep networks from adversarial attacks. Preprint at https://arxiv.org/abs/1703.09202 (2017).
  57. Goddard, N. H. & Hood, G. Large-scale simulation using parallel GENESIS. In The Book of GENESIS: Exploring Realistic Neural Models with the GEneral NEural SImulation System (eds Bower, J. M. & Beeman, D.) 349–379 (Springer New York, 1998).
  58. Migliore, M., Cannia, C., Lytton, W. W., Markram, H. & Hines, M. L. Parallel network simulations with NEURON. J. Comput. Neurosci. 21, 119 (2006).
  59. Lytton, W. W. et al. Simulation neurotechnologies for advancing brain research: parallelizing large networks in NEURON. Neural Comput. 28, 2063–2090 (2016).
  60. Valero-Lara, P. et al. cuHinesBatch: Solving multiple Hines systems on GPUs human brain project. In Proc. 2017 International Conference on Computational Science 566–575 (IEEE, 2017).
  61. Akar, N. A. et al. Arbor—A morphologically-detailed neural network simulation library for contemporary high-performance computing architectures. In Proc. 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) 274–282 (IEEE, 2019).
  62. Ben-Shalom, R. et al. NeuroGPU: Accelerating multi-compartment, biophysically detailed neuron simulations on GPUs. J. Neurosci. Methods 366, 109400 (2022).
  63. Rempe, M. J. & Chopp, D. L. A predictor-corrector algorithm for reaction-diffusion equations associated with neural activity on branched structures. SIAM J. Sci. Comput. 28, 2139–2161 (2006).
  64. Kozloski, J. & Wagner, J. An ultrascalable solution to large-scale neural tissue simulation. Front. Neuroinform. 5, 15 (2011).
  65. Jayant, K. et al. Targeted intracellular voltage recordings from dendritic spines using quantum-dot-coated nanopipettes. Nat. Nanotechnol. 12, 335–342 (2017).
  66. Palmer, L. M. & Stuart, G. J. Membrane potential changes in dendritic spines during action potentials and synaptic input. J. Neurosci. 29, 6897–6903 (2009).
  67. Nishiyama, J. & Yasuda, R. Biochemical computation for spine structural plasticity. Neuron 87, 63–75 (2015).
  68. Yuste, R. & Bonhoeffer, T. Morphological changes in dendritic spines associated with long-term synaptic plasticity. Annu. Rev. Neurosci. 24, 1071–1089 (2001).
  69. Holtmaat, A. & Svoboda, K. Experience-dependent structural synaptic plasticity in the mammalian brain. Nat. Rev. Neurosci. 10, 647–658 (2009).
  70. Caroni, P., Donato, F. & Muller, D. Structural plasticity upon learning: regulation and functions. Nat. Rev. Neurosci. 13, 478–490 (2012).
  71. Keck, T. et al. Massive restructuring of neuronal circuits during functional reorganization of adult visual cortex. Nat. Neurosci. 11, 1162 (2008).
  72. Hofer, S. B., Mrsic-Flogel, T. D., Bonhoeffer, T. & Hübener, M. Experience leaves a lasting structural trace in cortical circuits. Nature 457, 313–317 (2009).
  73. Trachtenberg, J. T. et al. Long-term in vivo imaging of experience-dependent synaptic plasticity in adult cortex. Nature 420, 788–794 (2002).
  74. Marik, S. A., Yamahachi, H., McManus, J. N., Szabo, G. & Gilbert, C. D. Axonal dynamics of excitatory and inhibitory neurons in somatosensory cortex. PLoS Biol. 8, e1000395 (2010).
  75. Xu, T. et al. Rapid formation and selective stabilization of synapses for enduring motor memories. Nature 462, 915–919 (2009).
  76. Albarran, E., Raissi, A., Jáidar, O., Shatz, C. J. & Ding, J. B. Enhancing motor learning by increasing the stability of newly formed dendritic spines in the motor cortex. Neuron 109, 3298–3311 (2021).
  77. Branco, T. & Häusser, M. Synaptic integration gradients in single cortical pyramidal cell dendrites. Neuron 69, 885–892 (2011).
  78. Major, G., Larkum, M. E. & Schiller, J. Active properties of neocortical pyramidal neuron dendrites. Annu. Rev. Neurosci. 36, 1–24 (2013).
  79. Gidon, A. et al. Dendritic action potentials and computation in human layer 2/3 cortical neurons. Science 367, 83–87 (2020).
  80. Doron, M., Chindemi, G., Muller, E., Markram, H. & Segev, I. Timed synaptic inhibition shapes NMDA spikes, influencing local dendritic processing and global I/O properties of cortical neurons. Cell Rep. 21, 1550–1561 (2017).
  81. Du, K. et al. Cell-type-specific inhibition of the dendritic plateau potential in striatal spiny projection neurons. Proc. Natl Acad. Sci. USA 114, E7612–E7621 (2017).
  82. Smith, S. L., Smith, I. T., Branco, T. & Häusser, M. Dendritic spikes enhance stimulus selectivity in cortical neurons in vivo. Nature 503, 115–120 (2013).
  83. Xu, N.-l et al. Nonlinear dendritic integration of sensory and motor input during an active sensing task. Nature 492, 247–251 (2012).
  84. Takahashi, N., Oertner, T. G., Hegemann, P. & Larkum, M. E. Active cortical dendrites modulate perception. Science 354, 1587–1590 (2016).
  85. Sheffield, M. E. & Dombeck, D. A. Calcium transient prevalence across the dendritic arbour predicts place field properties. Nature 517, 200–204 (2015).
  86. Markram, H. et al. Reconstruction and simulation of neocortical microcircuitry. Cell 163, 456–492 (2015).
  87. Billeh, Y. N. et al. Systematic integration of structural and functional data into multi-scale models of mouse primary visual cortex. Neuron 106, 388–403 (2020).
  88. Hjorth, J. et al. The microcircuits of striatum in silico. Proc. Natl Acad. Sci. USA 117, 202000671 (2020).
  89. Guerguiev, J., Lillicrap, T. P. & Richards, B. A. Towards deep learning with segregated dendrites. elife 6, e22901 (2017).
  90. Iyer, A. et al. Avoiding catastrophe: active dendrites enable multi-task learning in dynamic environments. Front. Neurorobot. 16, 846219 (2022).
  91. Jones, I. S. & Kording, K. P. Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Comput. 33, 1554–1571 (2021).
  92. Bird, A. D., Jedlicka, P. & Cuntz, H. Dendritic normalisation improves learning in sparsely connected artificial neural networks. PLoS Comput. Biol. 17, e1009202 (2021).
  93. Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations (ICLR) (ICLR, 2015).
  94. Papernot, N., McDaniel, P. & Goodfellow, I. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. Preprint at https://arxiv.org/abs/1605.07277 (2016).
  95. Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
  96. Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. Preprint at http://arxiv.org/abs/1708.07747 (2017).
  97. Bartunov, S. et al. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (NeurIPS, 2018).
  98. Rauber, J., Brendel, W. & Bethge, M. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. In Reliable Machine Learning in the Wild Workshop, 34th International Conference on Machine Learning (2017).
  99. Rauber, J., Zimmermann, R., Bethge, M. & Brendel, W. Foolbox native: fast adversarial attacks to benchmark the robustness of machine learning models in PyTorch, TensorFlow, and JAX. J. Open Source Softw. 5, 2607 (2020).
  100. Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (NeurIPS, 2019).
  101. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).

\

Acknowledgements

The authors sincerely thank Dr. Rita Zhang, Daochen Shi and members at NVIDIA for the valuable technical support of GPU computing. This work was supported by the National Key R&D Program of China (No. 2020AAA0130400) to K.D. and T.H., National Natural Science Foundation of China (No. 61088102) to T.H., National Key R&D Program of China (No. 2022ZD01163005) to L.M., Key Area R&D Program of Guangdong Province (No. 2018B030338001) to T.H., National Natural Science Foundation of China (No. 61825101) to Y.T., Swedish Research Council (VR-M-2020-01652), Swedish e-Science Research Centre (SeRC), EU/Horizon 2020 No. 945539 (HBP SGA3), and KTH Digital Futures to J.H.K., J.H., and A.K., Swedish Research Council (VR-M-2021-01995) and EU/Horizon 2020 no. 945539 (HBP SGA3) to S.G. and A.K. Part of the simulations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at PDC KTH partially funded by the Swedish Research Council through grant agreement no. 2018-05973.

\ \

:::info This paper is available on Nature under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::

\

A New Privacy-First AI Predicts COVID Severity Using X-Rays and Medical Records

2026-03-16 23:21:43

:::info

Authors:

  1. Ittai Dayan
  2. Holger R. Roth
  3. Aoxiao Zhong
  4. Ahmed Harouni
  5. Amilcare Gentili
  6. Anas Z. Abidin
  7. Andrew Liu
  8. Anthony Beardsworth Costa
  9. Bradford J. Wood
  10. Chien-Sung Tsai
  11. Chih-Hung Wang
  12. Chun-Nan Hsu
  13. C. K. Lee
  14. Peiying Ruan
  15. Daguang Xu
  16. Dufan Wu
  17. Eddie Huang
  18. Felipe Campos Kitamura
  19. Griffin Lacey
  20. Gustavo César de Antônio Corradi
  21. Gustavo Nino
  22. Hao-Hsin Shin
  23. Hirofumi Obinata
  24. Hui Ren
  25. Jason C. Crane
  26. Jesse Tetreault
  27. Jiahui Guan
  28. John W. Garrett
  29. Joshua D. Kaggie
  30. Jung Gil Park
  31. Keith Dreyer
  32. Krishna Juluru
  33. Kristopher Kersten
  34. Marcio Aloisio Bezerra Cavalcanti Rockenbach
  35. Marius George Linguraru
  36. Masoom A. Haider
  37. Meena AbdelMaseeh
  38. Nicola Rieke
  39. Pablo F. Damasceno
  40. Pedro Mario Cruz e Silva
  41. Pochuan Wang
  42. Sheng Xu
  43. Shuichi Kawano
  44. Sira Sriswasdi
  45. Soo Young Park
  46. Thomas M. Grist
  47. Varun Buch
  48. Watsamon Jantarabenjakul
  49. Weichung Wang
  50. Won Young Tak
  51. Xiang Li
  52. Xihong Lin
  53. Young Joon Kwon
  54. Abood Quraini
  55. Andrew Feng
  56. Andrew N. Priest
  57. Baris Turkbey
  58. Benjamin Glicksberg
  59. Bernardo Bizzo
  60. Byung Seok Kim
  61. Carlos Tor-Díez
  62. Chia-Cheng Lee
  63. Chia-Jung Hsu
  64. Chin Lin
  65. Chiu-Ling Lai
  66. Christopher P. Hess
  67. Colin Compas
  68. Deepeksha Bhatia
  69. Eric K. Oermann
  70. Evan Leibovitz
  71. Hisashi Sasaki
  72. Hitoshi Mori
  73. Isaac Yang
  74. Jae Ho Sohn
  75. Krishna Nand Keshava Murthy
  76. Li-Chen Fu
  77. Matheus Ribeiro Furtado de Mendonça
  78. Mike Fralick
  79. Min Kyu Kang
  80. Mohammad Adil
  81. Natalie Gangai
  82. Peerapon Vateekul
  83. Pierre Elnajjar
  84. Sarah Hickman
  85. Sharmila Majumdar
  86. Shelley L. McLeod
  87. Sheridan Reed
  88. Stefan Gräf
  89. Stephanie Harmon
  90. Tatsuya Kodama
  91. Thanyawee Puthanakit
  92. Tony Mazzulli
  93. Vitor Lima de Lavor
  94. Yothin Rakvongthai
  95. Yu Rim Lee
  96. Yuhong Wen
  97. Fiona J. Gilbert
  98. Mona G. Flores
  99. Quanzheng Li

:::

\

Abstract

Federated learning (FL) is a method used for training artificial intelligence models with data from multiple sources while maintaining data anonymity, thus removing many barriers to data sharing. Here we used data from 20 institutes across the globe to train a FL model, called EXAM (electronic medical record (EMR) chest X-ray AI model), that predicts the future oxygen requirements of symptomatic patients with COVID-19 using inputs of vital signs, laboratory data and chest X-rays. EXAM achieved an average area under the curve (AUC) >0.92 for predicting outcomes at 24 and 72 h from the time of initial presentation to the emergency room, and it provided 16% improvement in average AUC measured across all participating sites and an average increase in generalizability of 38% when compared with models trained at a single site using that site’s data. For prediction of mechanical ventilation treatment or death at 24 h at the largest independent test site, EXAM achieved a sensitivity of 0.950 and specificity of 0.882. In this study, FL facilitated rapid data science collaboration without data exchange and generated a model that generalized across heterogeneous, unharmonized datasets for prediction of clinical outcomes in patients with COVID-19, setting the stage for the broader use of FL in healthcare.

\

Main

The scientific, academic, medical and data science communities have come together in the face of the COVID-19 pandemic crisis to rapidly assess novel paradigms in artificial intelligence (AI) that are rapid and secure, and potentially incentivize data sharing and model training and testing without the usual privacy and data ownership hurdles of conventional collaborations [1,2]. Healthcare providers, researchers and industry have pivoted their focus to address unmet and critical clinical needs created by the crisis, with remarkable results [3,4,5,6,7,8,9]. Clinical trial recruitment has been expedited and facilitated by national regulatory bodies and an international cooperative spirit [10,11,12]. The data analytics and AI disciplines have always fostered open and collaborative approaches, embracing concepts such as open-source software, reproducible research, data repositories and making anonymized datasets publicly available [13,14]. The pandemic has emphasized the need to expeditiously conduct data collaborations that empower the clinical and scientific communities when responding to rapidly evolving and widespread global challenges. Data sharing has ethical, regulatory and legal complexities that are underscored, and perhaps somewhat complicated, by the recent entrance of large technology companies into the healthcare data world [15,16,17].

A concrete example of these types of collaboration is our previous work on an AI-based SARS-COV-2 clinical decision support (CDS) model. This CDS model was developed at Mass General Brigham (MGB) and was validated across multiple health systems’ data. The inputs to the CDS model were chest X-ray (CXR) images, vital signs, demographic data and laboratory values that were shown in previous publications to be predictive of outcomes of patients with COVID-19 [18,19,20,21]. CXR was selected as the imaging input because it is widely available and commonly indicated by guidelines such as those provided by the ACR [22], the Fleischner Society [23], the WHO [24], national thoracic societies [25], national health ministry COVID handbooks and radiology societies across the world [26]. The output of the CDS model was a score, termed CORISK [27], that corresponds to oxygen support requirements and that could aid in triaging patients by frontline clinicians [28,29,30]. Healthcare providers have been known to prefer models that were validated on their own data [27]. To date most AI models, including the aforementioned CDS model, have been trained and validated on ‘narrow’ data that often lack diversity [31,32], potentially resulting in overfitting and lower generalizability. This can be mitigated by training with diverse data from multiple sites without centralization of data [33] using methods such as transfer learning [34,35] or FL. FL is a method used to train AI models on disparate data sources, without the data being transported or exposed outside their original location. While applicable to many industries, FL has recently been proposed for cross-institutional healthcare research [36].

Federated learning supports the rapid launch of centrally orchestrated experiments with improved traceability of data and assessment of algorithmic changes and impact [37]. One approach to FL, called client-server, sends an ‘untrained’ model to other servers (‘nodes’) that conduct partial training tasks, in turn sending the results back to be merged in the central (‘federated’) server. This is conducted as an iterative process until training is complete [36].
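The client-server loop described above can be sketched as follows. The aggregation rule is not specified in this passage, so a FedAvg-style size-weighted average of the returned client weights is assumed here, with a toy least-squares model standing in for EXAM:

```python
import numpy as np

def fedavg_round(global_w, clients, lr=0.1):
    """One client-server round: each client takes a local gradient step on
    its own data; the server merges results by a size-weighted average of
    the returned weights (FedAvg-style aggregation, assumed here)."""
    results, sizes = [], []
    for X, y in clients:
        # Local update: one gradient step of least-squares on private data
        grad = 2 * X.T @ (X @ global_w - y) / len(y)
        results.append(global_w - lr * grad)
        sizes.append(len(y))
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(results, sizes))

# Two clients with private data; only model weights travel to the server
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = []
for n in (50, 150):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ w_true))

w = np.zeros(2)
for _ in range(200):
    w = fedavg_round(w, clients)
```

After enough rounds the shared weights converge to the joint solution even though no client's raw data ever leaves its site, which is the property the study relies on.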

Governance of data for FL is maintained locally, alleviating privacy concerns, with only model weights or gradients communicated between client sites and the federated server [38,39]. FL has already shown promise in recent medical imaging applications [40,41,42,43], including in COVID-19 analysis [8,44,45]. A notable example is a mortality prediction model in patients infected with SARS-COV-2 that uses clinical features, albeit limited in terms of number of modalities and scale [46].

Our objective was to develop a robust, generalizable model that could assist in triaging patients. We theorized that the CDS model can be federated successfully, given its use of data inputs that are relatively common in clinical practice and that do not rely heavily on operator-dependent assessments of patient condition (such as clinical impressions or reported symptoms). Rather, laboratory results, vital signs, an imaging study and a commonly captured demographic (that is, age), were used. We therefore retrained the CDS model with diverse data using a client-server FL approach to develop a new global FL model, which was named EXAM, using CXR and EMR features as input. By leveraging FL, the participating institutions would not have to transfer data to a central repository, but rather leverage a distributed data framework.

Our hypothesis was that EXAM would perform better than local models and would generalize better across healthcare systems.

Results

The EXAM model architecture

The EXAM model is based on the CDS model mentioned above [27]. In total, 20 features (19 from the EMR and one CXR) were used as input to the model. The outcome (that is, ‘ground truth’) labels were assigned based on patient oxygen therapy after 24- and 72-hour periods from initial admission to the emergency department (ED). A detailed list of the requested features and outcomes can be seen in Table 1.

Table 1 EMR data used in the EXAM study

The outcome labels of patients were set to 0, 0.25, 0.50 and 0.75 depending on the most intensive oxygen therapy the patient received in the prediction window. The oxygen therapy categories were, respectively, room air (RA), low-flow oxygen (LFO), high-flow oxygen (HFO)/noninvasive ventilation (NIV) or mechanical ventilation (MV). If the patient died within the prediction window, the outcome label was set to 1. This resulted in each case being assigned two labels in the range 0–1, corresponding to each of the prediction windows (that is, 24 and 72 h).
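The labeling rule above can be written as a small lookup; the function name and category strings below are illustrative, not from the paper:

```python
def exam_label(most_intensive_therapy, died):
    """Outcome label for one prediction window, per the rule in the text:
    0 = room air (RA), 0.25 = low-flow oxygen (LFO), 0.5 = high-flow
    oxygen / noninvasive ventilation (HFO/NIV), 0.75 = mechanical
    ventilation (MV); death within the window overrides with 1."""
    if died:
        return 1.0
    return {"RA": 0.0, "LFO": 0.25, "HFO/NIV": 0.5, "MV": 0.75}[most_intensive_therapy]

exam_label("HFO/NIV", died=False)  # 0.5
```

Each case gets two such labels, one for the 24-h window and one for the 72-h window.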

For EMR features, only the first values captured in the ED were used and data preprocessing included deidentification, missing value imputation and normalization to zero-mean and unit variance. For CXR images, only the first obtained in the ED was used.
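A minimal sketch of the EMR preprocessing step, assuming mean imputation for missing values (the exact imputation rule is not given in this passage) and z-score statistics computed on the training split:

```python
import numpy as np

def preprocess_emr(X_train, X_new):
    """Impute missing values, then normalize to zero mean / unit variance.
    Statistics come from the training split only; mean imputation is an
    assumption, as the paper does not specify the rule here."""
    col_mean = np.nanmean(X_train, axis=0)
    X_train = np.where(np.isnan(X_train), col_mean, X_train)
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    X_new = np.where(np.isnan(X_new), col_mean, X_new)
    return (X_new - mu) / sd
```

Applied to the training data itself, this yields columns with zero mean and unit variance, matching the normalization described in the text.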

The model therefore fuses information from both EMR and CXR features, using a 34-layer convolutional neural network (ResNet34) to extract features from a CXR and a Deep & Cross network to concatenate the features together with the EMR features (for more expanded details, see Methods). The model output is a risk score, termed the EXAM score, which is a continuous value in the range 0–1 for each of the 24- and 72-hour predictions corresponding to the labels described above.
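Schematically, the fusion stage concatenates the CXR feature vector (512-dimensional for a pooled ResNet34, an assumption here) with the 19 EMR features and maps them to two scores in (0, 1). The sketch below replaces the paper's ResNet34 + Deep & Cross architecture with a single linear layer purely to show the data flow; all names are illustrative:

```python
import numpy as np

def exam_head(img_feat, emr_feat, W, b):
    """Fusion sketch: concatenate image and EMR features, then map to the
    two EXAM risk scores (24 h and 72 h), each squashed into (0, 1).
    A single linear layer stands in for the Deep & Cross network."""
    z = np.concatenate([img_feat, emr_feat])  # (512 + 19,)
    logits = W @ z + b                        # (2,)
    return 1.0 / (1.0 + np.exp(-logits))      # EXAM scores in (0, 1)
```

With a 531-dimensional fused vector, `W` has shape (2, 531) and the output directly matches the two continuous labels described above.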

Federating the model

The EXAM model was trained using a cohort of 16,148 cases, making it not only one of the first FL models for COVID-19 but also one of the largest multicontinental development projects in clinically relevant AI (Fig. 1a,b). Data between sites were not harmonized before extraction and, in light of real-life clinical informatics circumstances, a meticulous harmonization of the data input was not conducted by the authors (Fig. 1c,d).

Fig. 1: Data used in the EXAM FL study.

a, World map indicating the 20 different client sites contributing to the EXAM study. b, Number of cases contributed by each institution or site (client 1 represents the site contributing the largest number of cases). c, Chest X-ray intensity distribution at each client site. d, Age of patients at each client site, showing minimum and maximum ages (asterisks), mean age (triangles) and standard deviation (horizontal bars). The number of samples of each client site is shown in Supplementary Table 1.

We compared locally trained models with the global FL model on each client’s test data. Training the model through FL resulted in a significant performance improvement (P ≪ 1 × 10⁻³, Wilcoxon signed-rank test) of 16% (as defined by average AUC when running the model on respective local test sets: from 0.795 to 0.920, or 12.5 percentage points) (Fig. 2a). It also resulted in a 38% generalizability improvement (as defined by average AUC when running the model on all test sets: from 0.667 to 0.920, or 25.3 percentage points) of the best global model for prediction of 24-h oxygen treatment compared with models trained only on a site’s own data (Fig. 2b). For the prediction results of 72-h oxygen treatment, the best global model training resulted in an average performance improvement of 18% compared to locally trained models, while generalizability of the global model improved on average by 34% (Extended Data Fig. 1). The stability of our results was validated by repeating three runs of local and FL training on different randomized data splits.
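The quoted percentages are relative improvements in average AUC, distinct from the percentage-point differences given in parentheses; the arithmetic can be checked directly:

```python
def rel_improvement(auc_local, auc_fl):
    """Relative improvement in average AUC, as quoted in the text."""
    return (auc_fl - auc_local) / auc_local

# Local test performance 0.795 -> FL 0.920: ~16% (12.5 percentage points)
# Local generalizability 0.667 -> FL 0.920: ~38% (25.3 percentage points)
```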

Fig. 2: Performance of FL versus local models.

a, Performance on each client’s test set in prediction of 24-h oxygen treatment for models trained on local data only (Local) versus that of the best global model available on the server (FL (gl. best)). Av., average test performance across all sites. b, Generalizability (average performance on other sites’ test data, as represented by average AUC) as a function of a client’s dataset size (no. of cases). The green horizontal line denotes the generalizability performance of the best global model. The performance for 18 of 20 clients is shown, because client 12 had outcomes only for 72-h oxygen (Extended Data Fig. 1) and client 14 had cases only with RA treatment, such that the evaluation metric (av. AUC) was not applicable in either of these cases (Methods). Data for client 14 were also excluded from computation of average generalizability in local models.

Local models that were trained using unbalanced cohorts (for example, mostly mild cases of COVID-19) markedly benefited from the FL approach, with a substantial improvement in prediction average AUC performance for categories with only a few cases. This was evident at client site 16 (an unbalanced dataset), with most patients experiencing mild disease severity and with only a few severe cases. The FL model achieved a higher true-positive rate for the two positive (severe) cases and a markedly lower false-positive rate compared to the local model, both shown in the receiver operating characteristic (ROC) plots and confusion matrices (Fig. 3a and Extended Data Fig. 2). More important, the generalizability of the FL model was considerably increased over the locally trained model.

Fig. 3: Comparison of FL- and locally trained models.

a, ROC at client site 16, with unbalanced data and mostly mild cases. b, ROC of the local model at client site 12 (a small dataset), mean ROC of models trained on larger datasets corresponding to the five client sites in the Boston area (1, 4, 5, 6, 8) and ROC of the best global model in prediction of 72-h oxygen treatment for different thresholds of EXAM score (left, middle, right). The mean ROC is calculated based on five locally trained models while the gray area denotes the ROC standard deviation. ROCs for three different cutoff values (t) of the EXAM risk score are shown. Pos and neg denote the number of positive and negative cases, respectively, as defined by this range of EXAM score.

In the case of client sites with relatively small datasets, the best FL model markedly outperformed not only the local model but also those trained on larger datasets from five client sites in the Boston area of the USA (Fig. 3b).

The global model performed well in predicting oxygen needs at 24/72 h in both COVID-positive and COVID-negative patients (Extended Data Fig. 3).

Validation at independent sites

Following initial training, EXAM was subsequently tested at three independent validation sites: Cooley Dickinson Hospital (CDH), Martha’s Vineyard Hospital (MVH) and Nantucket Cottage Hospital (NCH), all in Massachusetts, USA. The model was not retrained at these sites and was used only for validation purposes. The cohort size and model inference results are summarized in Table 2, and the ROC curves and confusion matrices for the largest dataset (from CDH) are shown in Fig. 4. The operating point was set to discriminate between nonmechanical ventilation and mechanical ventilation (MV) treatment (or death). The FL global trained model, EXAM, achieved an average AUC of 0.944 and 0.924 for the 24- and 72-h prediction tasks, respectively (Table 2), which exceeded the average performance among sites used in training EXAM. For prediction of MV treatment (or death) at 24 h, EXAM achieved a sensitivity of 0.950 and specificity of 0.882 at CDH, and a sensitivity of 1.000 and specificity of 0.934 at MVH. NCH did not have any cases with MV/death at 24 h. For 72-h MV prediction, EXAM achieved a sensitivity of 0.929 and specificity of 0.880 at CDH, a sensitivity of 1.000 and specificity of 0.976 at MVH, and a sensitivity of 1.000 and specificity of 0.929 at NCH.
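Sensitivity and specificity at an operating point follow directly from the confusion matrix at that threshold. A minimal helper (illustrative, not the authors' evaluation code; positive = MV treatment or death):

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity of a risk score at a given operating
    point: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(pred, labels))
    fn = sum((not p) and y for p, y in zip(pred, labels))
    tn = sum((not p) and (not y) for p, y in zip(pred, labels))
    fp = sum(p and (not y) for p, y in zip(pred, labels))
    return tp / (tp + fn), tn / (tn + fp)
```

Sweeping the threshold over the EXAM score traces out the ROC curves shown for each cutoff value in Figs. 3 and 4.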

Table 2 Performance of EXAM on independent datasets. Top, breakdown of patients by level of oxygen required across independent datasets from CDH, MVH and NCH. Bottom, AUC for prediction of the level of oxygen required at 24 and 72 h for the three independent datasets (95% confidence intervals)

\ Fig. 4: Performance of the best global model on the largest independent dataset.

a,b, Performance (ROC) (top) and confusion matrices (bottom) of the EXAM FL model on the CDH dataset for prediction of oxygen requirement at 24 h (a) and 72 h (b). ROCs for three different cutoff values (t) of the EXAM risk score are shown.

For MV at CDH at 72 h, EXAM had a low false-negative rate of 7.1%. Representative failure cases are presented in Extended Data Fig. 4, showing two false-negative cases from CDH where one case had many missing EMR data features and the other had a CXR with a motion artifact and some missing EMR features.

Use of differential privacy

A primary motivation for healthcare institutes to use FL is to preserve the security and privacy of their data, as well as adherence to data compliance measures. For FL, there remains the potential risk of model ‘inversion’ [47] or even the reconstruction of training images from the model gradients themselves [48]. To counter these risks, security-enhancing measures were used to mitigate risk in the event of data ‘interception’ during site-server communication [49]. We experimented with techniques to avoid interception of FL data, and added a security feature that we believe could encourage more institutions to use FL. We thus validated previous findings showing that partial weight sharing, and other differential privacy techniques, can successfully be applied in FL [50]. Through investigation of a partial weight-sharing scheme [50,51,52], we showed that models can reach comparable performance even when only 25% of weight updates are shared (Extended Data Fig. 5).
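A sketch of partial weight sharing: only a fraction of each site's weight updates leave the site, with the rest zeroed out. Selecting the largest-magnitude 25% is an assumption for illustration; the details of the cited scheme are not given in this passage:

```python
import numpy as np

def partial_update(delta, fraction=0.25):
    """Share only the largest-magnitude `fraction` of weight updates;
    zero the rest before they leave the site (ties may keep slightly
    more than the requested fraction)."""
    flat = np.abs(delta).ravel()
    k = max(1, int(fraction * flat.size))
    thresh = np.partition(flat, -k)[-k]  # k-th largest magnitude
    return np.where(np.abs(delta) >= thresh, delta, 0.0)
```

The server then aggregates these sparsified updates as usual, trading a small amount of information per round for reduced exposure of any single site's gradients.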

Discussion

This study presents a large, real-world healthcare FL effort in terms of the number of sites and number of data points used. We believe that it provides a powerful proof-of-concept of the feasibility of using FL for fast and collaborative development of needed AI models in healthcare. Our study involved multiple sites across four continents and under the oversight of different regulatory bodies, and thus holds the promise of being provided to different regulated markets in an expedited way. The global FL model, EXAM, proved to be more robust and achieved better results at individual sites than any model trained on only local data. We believe that consistent improvement was achieved owing to a larger, but also a more diverse, dataset, the use of data inputs that can be standardized and avoidance of clinical impressions/reported symptoms. These factors played an important part in increasing the benefits from this FL approach and its impact on performance, generalizability and, ultimately, the model’s usability.

For a client site with a relatively small dataset, two typical approaches could be used for fitting a useful model: one is to train locally with its own data, the other is to apply a model trained on a larger dataset. For sites with small datasets, it would have been virtually impossible to build a performant deep learning model using only their local data. The finding that these two approaches were outperformed on all three prediction tasks by the global FL model indicates that the benefit for client sites with small datasets arising from participation in FL collaborations is substantial. This is probably a reflection of FL’s ability to capture more diversity than local training, and to mitigate the bias present in models trained on a homogenous population. An under-represented population or age group in one hospital/region might be highly represented in another region—such as children, who might be differentially affected by COVID-19, including disease manifestations in lung imaging [46].

The validation results confirmed that the global model is robust, supporting our hypothesis that FL-trained models are generalizable across healthcare systems. They provide a compelling case for the use of predictive algorithms in COVID-19 patient care, and for the use of FL in model creation and testing. By participating in this study the client sites received access to EXAM, to be further validated ahead of pursuing any regulatory approval or future introduction into clinical care. Plans are under way to validate EXAM prospectively in ‘production’ settings at MGB, leveraging COVID-19-targeted resources [53], as well as at different sites that were not a part of the EXAM training.

Over 200 prediction models to support decision-making in patients with COVID-19 have been published [19]. Unlike the majority of publications, which focus on diagnosis of COVID-19 or prediction of mortality, we predicted oxygen requirements, which have direct implications for patient management. We also used cases with unknown SARS-COV-2 status, so the model could provide input to the physician ahead of receiving a result for PCR with reverse transcription (RT–PCR), making it useful in a real-life clinical setting. The model’s imaging input is used in common practice, in contrast with models that use chest computed tomography, a nonconsensual diagnostic modality. The model’s design was constrained to objective predictors, unlike many published studies that leveraged subjective clinical impressions. The data collected reflect varied incidence rates, and thus the ‘population momentum’ we encountered is more diverse. This implies that the algorithm can be useful in populations with different incidence rates.

Patient cohort identification and data harmonization are not novel issues in research and data science [54], but they are further complicated when using FL, given the lack of visibility on other sites’ datasets. Improvements to clinical information systems are needed to streamline data preparation, leading to better leverage of a network of sites participating in FL. This, in conjunction with hyperparameter engineering, can allow algorithms to ‘learn’ more effectively from larger data batches and adapt model parameters to a particular site for further personalization—for example, through further fine-tuning on that site [39]. A system that would allow seamless, close-to-real-time model inference and results processing would also be of benefit and would ‘close the loop’ from training to model deployment.

Because data were not centralized they are not readily accessible. Given that, any future analysis of the results, beyond what was derived and collected, is limited.

Similar to other machine learning models, EXAM is limited by the quality of the training data. Institutions interested in deploying this algorithm for clinical care need to understand potential biases in the training. For example, the labels used as ground truth in the training of the EXAM model were derived from 24- and 72-h oxygen consumption in the patient; it is assumed that the oxygen delivered to the patient equates to the oxygen need. However, in the early phase of the COVID-19 pandemic, many patients were provided high-flow oxygen prophylactically regardless of their oxygen need. Such clinical practice could skew the predictions made by this model.

Since our data access was limited, we did not have sufficient available information for the generation of detailed statistics regarding failure causes, post hoc, at most sites. However, we did study failure cases from the largest independent test site, CDH, and were able to generate hypotheses that we can test in the future. For high-performing sites, it seems that most failure cases fall into one of two categories: (1) low quality of input data—for example, missing data or motion artifact in CXR; or (2) out-of-distribution data—for example a very young patient.

In future, we also intend to investigate the potential for a ‘population drift’ due to different phases of disease progression. We believe that, owing to the diversity across the 20 sites, this risk may have been mitigated.

A feature that would enhance these kinds of large-scale collaboration is the ability to predict the contribution of each client site towards improving the global FL model. This will help in client site selection, and in prioritization of data acquisition and annotation efforts. The latter is especially important given the high costs and difficult logistics of these large-consortia endeavors, and it will enable these endeavors to capture diversity rather than the sheer quantity of data samples.

Future approaches may incorporate automated hyperparameter searching55, neural architecture search56 and other automated machine learning57 approaches to find the optimal training parameters for each client site more efficiently.

Known issues of batch normalization (BN) in FL58 motivated us to fix our base model for image feature extraction49 to reduce the divergence between unbalanced client sites. Future work might explore different types of normalization techniques to allow AI models to be trained more effectively in FL when client data are nonindependent and identically distributed.

Recent works on privacy attacks within the FL setting have raised concerns about data leakage during model training59. Meanwhile, protection algorithms remain underexplored and constrained by multiple factors. While differential privacy algorithms36,48,49 offer good protection, they may weaken the model’s performance. Encryption algorithms, such as homomorphic encryption60, maintain performance but may substantially increase message size and training time. A quantifiable way to measure privacy would allow better-informed choices of the minimal privacy parameters necessary while maintaining clinically acceptable performance36,48,49.

Following further validation, we envision deployment of the EXAM model in the ED setting as a way to evaluate risk at both the per-patient and population level, and to provide clinicians with an additional reference point in the frequently difficult task of triaging patients. We also envision using the model as a more sensitive population-level metric to help balance resources between regions, hospitals and departments. Our hope is that similar FL efforts can break down data silos and allow for faster development of much-needed AI models in the near future.

Methods

Ethics approval

All procedures were conducted in accordance with the principles for human experimentation as defined in the Declaration of Helsinki and International Conference on Harmonization Good Clinical Practice guidelines, and were approved by the relevant institutional review boards at the following validation sites: CDH, MVH and NCH, and at the following training sites: MGB, Mass General Hospital (MGH), Brigham and Women’s Hospital, Newton-Wellesley Hospital, North Shore Medical Center and Faulkner Hospital (all eight of these hospitals were covered under MGB’s ethics board reference, no. 2020P002673, and informed consent was waived by the institutional review board (IRB)). Similarly, participation of the remaining sites was approved by their respective institutional review processes: Children’s National Hospital in Washington, DC (no. 00014310, IRB certified exempt); NIHR Cambridge Biomedical Research Centre (no. 20/SW/0140, informed consent waived); The Self-Defense Forces Central Hospital in Tokyo (no. 02-014, informed consent waived); National Taiwan University MeDA Lab and MAHC and Taiwan National Health Insurance Administration (no. 202108026 W, informed consent waived); Tri-Service General Hospital in Taiwan (no. B202105136, informed consent waived); Kyungpook National University Hospital in South Korea (no. KNUH 2020-05-022, informed consent waived); Faculty of Medicine, Chulalongkorn University in Thailand (nos. 490/63, 291/63, informed consent waived); Diagnosticos da America SA in Brazil (no. 26118819.3.0000.5505, informed consent waived); University of California, San Francisco (no. 20-30447, informed consent waived); VA San Diego (no. H200086, IRB certified exempt); University of Toronto (no. 20-0162-C, informed consent waived); National Institutes of Health in Bethesda, Maryland (no. 12-CC-0075, informed consent waived); University of Wisconsin-Madison School of Medicine and Public Health (no. 2016-0418, informed consent waived); Memorial Sloan Kettering Cancer Center in New York (no. 20-194, informed consent waived); and Mount Sinai Health System in New York (no. IRB-20-03271, informed consent waived).

MI-CLAIM guidelines for the reporting of clinical AI models were followed (Supplementary Note 2).

Study setting

The study included data from 20 institutions (Fig. 1a): MGB, MGH, Brigham and Women’s Hospital, Newton-Wellesley Hospital, North Shore Medical Center and Faulkner Hospital; Children’s National Hospital in Washington, DC; NIHR Cambridge Biomedical Research Centre; The Self-Defense Forces Central Hospital in Tokyo; National Taiwan University MeDA Lab and MAHC and Taiwan National Health Insurance Administration; Tri-Service General Hospital in Taiwan; Kyungpook National University Hospital in South Korea; Faculty of Medicine, Chulalongkorn University in Thailand; Diagnosticos da America SA in Brazil; University of California, San Francisco; VA San Diego; University of Toronto; National Institutes of Health in Bethesda, Maryland; University of Wisconsin-Madison School of Medicine and Public Health; Memorial Sloan Kettering Cancer Center in New York; and Mount Sinai Health System in New York. Institutions were recruited between March and May 2020. Dataset curation started in June 2020 and the final data cohort was added in September 2020. Between August and October 2020, 140 independent FL runs were conducted to develop the EXAM model and, by the end of October 2020, EXAM was made public on NVIDIA NGC61,62,63. Data from three independent sites were used for independent validation: CDH, MVH and NCH, all in Massachusetts, USA. These three hospitals had patient population characteristics different from the training sites. The data used for the algorithm validation consisted of patients admitted to the ED at these sites between March 2020 and February 2021, and that satisfied the same inclusion criteria of the data used to train the FL model.

Data collection

The 20 client sites prepared a total of 16,148 cases (both positive and negative) for the purposes of training, validation and testing of the model (Fig. 1b). Medical data were accessed for patients who satisfied the study inclusion criteria. Client sites strove to include all COVID-positive cases from the beginning of the pandemic in December 2019 up to the time they started local training for the EXAM study. All local training had started by 30 September 2020. The sites also included other patients in the same period with negative RT–PCR test results. Since most of the sites had more SARS-CoV-2-negative than -positive patients, we limited the number of negative patients included to, at most, 95% of the total cases at each client site.

A ‘case’ included a CXR and the requisite data inputs taken from the patient’s medical record. A breakdown of the cohort size of the dataset for each client site is shown in Fig. 1b. The distribution and patterns of CXR image intensity (pixel values) varied greatly among sites owing to a multitude of patient- and site-specific factors, such as different device manufacturers and imaging protocols, as shown in Fig. 1c,d. Patient age and EMR feature distribution varied greatly among sites, as expected owing to the differing demographics between globally distributed hospitals (Extended Data Fig. 6).

Patient inclusion criteria

Patient inclusion criteria were: (1) patient presented to the hospital’s ED or equivalent; (2) patient had an RT–PCR test performed at any time between presentation to the ED and discharge from the hospital; (3) patient had a CXR in the ED; and (4) patient’s record had at least five of the EMR values detailed in Table 1, all obtained in the ED, and the relevant outcomes captured during hospitalization. Of note, the CXR, laboratory results and vitals used were the first available during the visit to the ED. The model did not incorporate any CXR, laboratory results or vitals acquired after the patient left the ED.

Model input

In total, 21 EMR features were used as input to the model. The outcome (that is, ground truth) labels were assigned based on patient requirements after 24- and 72-h periods from initial admission to the ED. A detailed list of the requested EMR features and outcomes can be seen in Table 1.

The distribution of oxygen treatment using different devices at different client sites is shown in Extended Data Fig. 7, which details the device usage at admission to the ED and after 24- and 72-h periods. The difference in dataset distribution between the largest and smallest client sites can be seen in Extended Data Fig. 8.

The number of positive COVID-19 cases, as confirmed by a single RT–PCR test obtained at any time between presentation to the ED and discharge from the hospital, is listed in Supplementary Table 1. Each client site was asked to randomly split its dataset into three parts: 70% for training, 10% for validation and 20% for testing. For both 24- and 72-h outcome prediction models, random splits for each of the three repeated local and FL training and evaluation experiments were independently generated.
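The per-site random split can be sketched as follows; `split_cases` is a hypothetical helper name used for illustration, not part of the study’s released code.

```python
import random

def split_cases(case_ids, seed=0):
    """Randomly split one client site's cases into 70% training,
    10% validation and 20% testing, as each site was asked to do."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for this sketch
    n_train = int(0.7 * len(ids))
    n_val = int(0.1 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_cases(range(100))
```

In the study, this split was regenerated independently for each of the three repeated experiments, which a different `seed` value per repetition would reproduce.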

EXAM model development

There is wide variation in the clinical course of patients who present to hospital with symptoms of COVID-19, with some experiencing rapid deterioration in respiratory function requiring different interventions to prevent or mitigate hypoxemia62,63. A critical decision made during the evaluation of a patient at the initial point of care, or in the ED, is whether the patient is likely to require more invasive or resource-limited countermeasures or interventions (such as MV or monoclonal antibodies), and should therefore receive a scarce but effective therapy, a therapy with a narrow risk–benefit ratio due to side effects or a higher level of care, such as admittance to the intensive care unit64. In contrast, a patient who is at lower risk of requiring invasive oxygen therapy may be placed in a less intensive care setting such as a regular ward, or even released from the ED for continuing self-monitoring at home65. EXAM was developed to help triage such patients.

Of note, the model is not approved by any regulatory agency at this time and it should be used only for research purposes.

EXAM score

EXAM was trained using FL; it outputs a risk score (termed EXAM score) similar to CORISK27 (Extended Data Fig. 9a) and can be used in the same way to triage patients. It corresponds to a patient’s oxygen support requirements within two windows—24 and 72 h—after initial presentation to the ED. Extended Data Fig. 9b illustrates how CORISK and the EXAM score can be used for patient triage.

Chest X-ray images were preprocessed to select the anterior-position image and exclude lateral-view images, and then scaled to a resolution of 224 × 224 (ref. 27). As shown in Extended Data Fig. 9a, the model fuses information from both EMR features (normalized to zero mean and unit variance) and CXR features (based on a modified ResNet34 with spatial attention66 pretrained on the CheXpert dataset)67 using the Deep & Cross network68. To combine these different data types, a 512-dimensional feature vector was extracted from each CXR image using the pretrained ResNet34 with spatial attention, then concatenated with the EMR features as the input to the Deep & Cross network. The final output was a continuous value in the range 0–1 for both 24- and 72-h predictions, corresponding to the labels described above, as shown in Extended Data Fig. 9b. We used cross-entropy as the loss function and ‘Adam’ as the optimizer. The model was implemented in TensorFlow69 using the NVIDIA Clara Train SDK70. The average AUC over the classification tasks (≥LFO, ≥HFO/NIV or ≥MV) was calculated and used as the final evaluation metric.

Feature imputation and normalization

A MissForest algorithm71 was used to impute EMR features, based on the local training dataset. If an EMR feature was completely missing from a client site dataset, the mean value of that feature, calculated exclusively on data from MGB client sites, was used. Then, EMR features were rescaled to zero-mean and unit variance based on statistics calculated on data from the MGB client sites.
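A minimal sketch of the fallback path described above (function name hypothetical): a fully missing feature is replaced by the MGB-derived mean before z-scoring with MGB-derived statistics. MissForest, used in the study for partially missing features, is not reproduced here.

```python
def impute_and_normalize(values, ref_mean, ref_std):
    """Fill missing entries (None) with the reference mean, then rescale the
    feature to zero mean and unit variance using the reference statistics
    (computed on MGB client-site data in the study)."""
    filled = [ref_mean if v is None else v for v in values]
    return [(v - ref_mean) / ref_std for v in filled]

# Example: one EMR feature across three patients, one value missing.
scores = impute_and_normalize([None, 12.0, 8.0], ref_mean=10.0, ref_std=2.0)
```

Note that an imputed value lands exactly at the reference mean, so it contributes a z-score of zero rather than pulling predictions in either direction.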

Details of EMR–CXR data fusion using the Deep & Cross network

To model the interactions of features from EMR and CXR data at the case level, a deep-feature scheme was used based on a Deep & Cross network architecture68. Binary and categorical features for the EMR inputs, as well as 512-dimensional image features in the CXR, were transformed into fused dense vectors of real values by embedding and stacking layers. The transformed dense vectors served as input to the fusion framework, which specifically employed a crossing network to enforce fusion among input from different sources. The crossing network performed explicit feature crossing within its layers, by conducting inner products between the original input feature and output from the previous layer, thus increasing the degree of interaction across features. At the same time, two individual classic deep neural networks with several stacked, fully connected feed-forward layers were trained. The final output of our framework was then derived from the concatenation of both classic and crossing networks.

FL details

Arguably the most established form of FL is the implementation of the federated averaging algorithm proposed by McMahan et al.72, or variations thereof. This algorithm can be realized in a client–server setup in which each participating site acts as a client. One can think of FL as a method that minimizes a global loss function by reducing a set of local loss functions, each estimated at one site. By minimizing each client site’s local loss while periodically synchronizing the learned weights on a centralized aggregation server, one can minimize the global loss without ever accessing the entire dataset in a centralized location. Each client site learns locally and shares model weight updates with a central server, which aggregates contributions using secure sockets layer encryption and communication protocols. The server then sends an updated set of weights to each client site after aggregation, and the sites resume training locally. The server and client sites iterate back and forth until the model converges (Extended Data Fig. 9c).

A pseudoalgorithm of FL is shown in Supplementary Note 1. In our experiments, we set the number of federated rounds at T = 200, with one local training epoch per round t at each client. The number of clients, K, was up to 20 depending on the network connectivity of clients or available data for a specific targeted outcome period (24 or 72 h). The number of local training iterations, nk, depends on the dataset size at each client k and is used to weigh each client’s contributions when aggregating the model weights in federated averaging. During the FL training task, each client site selects its best local model by tracking the model’s performance on its local validation set. At the same time, the server determines the best global model based on the average validation scores sent from each client site to the server after each FL round. After FL training finishes, the best local models and the best global model are automatically shared with all client sites and evaluated on their local test data.
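The aggregation step, weighting each client k by its iteration count n_k, reduces to a weighted average of the model weights. This is a sketch of federated averaging with flattened weight vectors, not the Clara Train implementation.

```python
def federated_average(client_weights, client_iters):
    """Server-side aggregation: average each parameter across clients,
    weighting client k by its number of local training iterations n_k
    (proportional to local dataset size)."""
    total = sum(client_iters)
    aggregated = [0.0] * len(client_weights[0])
    for weights, n_k in zip(client_weights, client_iters):
        for i, w in enumerate(weights):
            aggregated[i] += (n_k / total) * w
    return aggregated

# Two clients, one-parameter model: the larger client dominates the average.
global_w = federated_average([[0.0], [4.0]], client_iters=[1, 3])
```

Weighting by n_k makes the global update equivalent to one pass over the pooled data, which is why larger sites pull the global model further toward their local optimum.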

When training on local data only (the baseline), we set the epoch number to 200. The Adam optimizer was used for both local training and FL with an initial learning rate of 5 × 10–5 and a stepwise learning rate decay with a factor 0.5 after every 40 epochs, which is important for the convergence of federated averaging73. Random affine transformations, including rotation, translations, shear, scaling and random intensity noise and shifts, were applied to the images for data augmentation during training.
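The stepwise schedule above can be written directly (helper name hypothetical):

```python
def learning_rate(epoch, base_lr=5e-5, decay=0.5, step=40):
    """Stepwise learning-rate decay used for both local and FL training:
    the initial rate of 5e-5 is halved after every 40 epochs."""
    return base_lr * decay ** (epoch // step)
```

Over the 200 training epochs this yields five plateaus, ending at 1/16 of the initial rate.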

Owing to the sensitivity of BN layers58 when dealing with different clients in a nonindependent and identically distributed setting, we found the best model performance occurred when keeping the pretrained ResNet34 with spatial attention47 parameters fixed during FL training (that is, using a learning rate of zero for those layers). The Deep & Cross network that combines image features with EMR features does not contain BN layers and hence was not affected by BN instability issues.

In this study we investigated a privacy-preserving scheme that shares only partial model updates between server and client sites. The weight updates were ranked during each iteration by magnitude of contribution, and only a certain percentage of the largest weight updates was shared with the server. To be exact, weight updates (also known as gradients) were shared only if their absolute value was above a certain percentile threshold, k(t) (Extended Data Fig. 5), which was computed from all non-zero gradients, ΔWk(t), and could be different for each client k in each FL round t. Variations of this scheme could include additional clipping of large gradients or differential privacy schemes49 that add random noise to the gradients, or even to the raw data, before feeding into the network51.
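The magnitude-based sharing scheme can be sketched as follows; the percentile handling is simplified relative to the per-round, per-client threshold k(t) described above, and the function name is hypothetical.

```python
def partial_update(gradients, percentile):
    """Share only large-magnitude weight updates: gradients whose absolute
    value falls below the given percentile of the non-zero magnitudes are
    zeroed out and not shared with the server."""
    magnitudes = sorted(abs(g) for g in gradients if g != 0)
    if not magnitudes:
        return list(gradients)
    idx = min(int(percentile / 100 * len(magnitudes)), len(magnitudes) - 1)
    threshold = magnitudes[idx]
    return [g if abs(g) >= threshold else 0.0 for g in gradients]

# With a 50th-percentile threshold, only the two largest updates survive.
shared = partial_update([0.1, -0.5, 0.9, 0.0], percentile=50)
```

The clipping and noise-addition variants mentioned in the text would apply on top of this filtering step, further limiting what an attacker could reconstruct from the shared updates.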

Statistical analysis

We conducted a Wilcoxon signed-rank test to confirm the significance of the observed improvement in performance between the locally trained models and the FL model for the 24- and 72-h time points (Fig. 2 and Extended Data Fig. 1). The null hypothesis was rejected with one-sided P ≪ 1 × 10–3 in both cases.

Pearson’s correlation was used to assess the generalizability (robustness of the average AUC value to other client sites’ test data) of locally trained models in relation to their respective local dataset sizes. Only a moderate correlation was observed (r = 0.43, P = 0.035, degrees of freedom (df) = 17 for the 24-h model and r = 0.62, P = 0.003, df = 16 for the 72-h model). This indicates that dataset size is not the only factor determining a model’s robustness to unseen data.

To compare ROC curves from the global FL model and local models trained at different sites (Extended Data Fig. 3), we bootstrapped 1,000 samples from the data and computed the resulting AUCs. We then calculated the difference between the two series and standardized using the formula D = (AUC1 – AUC2)/s, where D is the standardized difference, s is the standard deviation of the bootstrap differences and AUC1 and AUC2 are the corresponding bootstrapped AUC series. By comparing D with normal distribution, we obtained the P values illustrated in Supplementary Table 2. The results show that the null hypothesis was rejected with very low P values, indicating the statistical significance of the superiority of FL outcomes. The computation of P values was conducted in R with the pROC library74.
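The standardization step can be sketched in a few lines. The study used R’s pROC library; this Python sketch assumes paired bootstrap AUC series and a hypothetical function name.

```python
from math import erf, sqrt
from statistics import mean, pstdev

def standardized_auc_difference(auc1, auc2):
    """Compute D = mean(AUC1 - AUC2) / s, where s is the standard deviation
    of the paired bootstrap differences, and a one-sided P value for D under
    the standard normal distribution, P(Z > D)."""
    diffs = [a - b for a, b in zip(auc1, auc2)]
    d = mean(diffs) / pstdev(diffs)
    p = 1.0 - 0.5 * (1.0 + erf(d / sqrt(2.0)))  # upper-tail normal P value
    return d, p

# Toy bootstrap series in which model 1 consistently outperforms model 2.
d, p = standardized_auc_difference([0.90, 0.88, 0.92, 0.90],
                                   [0.80, 0.80, 0.80, 0.80])
```

A consistent difference with little bootstrap spread yields a large D and a vanishing P value, mirroring the pattern reported in Supplementary Table 2.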

Since the model predicts a discrete outcome via a continuous score from 0 to 1, a straightforward calibration evaluation such as a Q–Q plot is not possible. Hence, as a quantified estimate of calibration we quantified discrimination (Extended Data Fig. 10). We conducted one-way analysis of variance (ANOVA) tests to compare local and FL model scores among the four ground truth categories (RA, LFO, HFO, MV). The F-statistic, calculated as the variation between the sample means divided by the variation within the samples and representing the degree of dispersion among groups, was used to quantify the models. Our results show that the F-values of five different local sites are 245.7, 253.4, 342.3, 389.8 and 634.8, while that of the FL model is 843.5. Given that larger F-values mean that groups are more separable, the scores from our FL model clearly show greater dispersion among the four ground truth categories. Furthermore, the P value of the ANOVA test on the FL model is <2 × 10–16, indicating that the FL prediction scores differ statistically significantly among the prediction classes.
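The F-statistic reported here is the standard one-way ANOVA ratio; a minimal sketch (hypothetical helper name):

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean
    square, used to quantify how well prediction scores separate the
    ground-truth categories (RA, LFO, HFO, MV in the study)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, group_means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two well-separated score groups give a large F-value.
f = f_statistic([[0.0, 0.2], [1.0, 1.2]])
```

A larger F means the score distributions of the outcome classes overlap less, which is the sense in which the FL model’s F = 843.5 indicates better discrimination than the local models.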

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The datasets from the 20 institutions that participated in this study remain under their custody. These data were used for training at each of the local sites, were not shared with any of the other participating institutions or with the federated server, and are not publicly available. Data from the independent validation sites are maintained by CAMCA, and access can be requested by contacting Q.L. Based on a determination by CAMCA, a data-sharing review and IRB amendment for research purposes can be conducted by MGB research administration, in accordance with MGB IRB policy.

Code availability

All code and software used in this study are publicly available on NVIDIA NGC. To access, log in as a guest or create a profile, then visit one of the URLs below. The trained models, data preparation guidelines, code for training, validation and testing of the model, readme file, installation guideline and license files are publicly available at NVIDIA NGC61: https://ngc.nvidia.com/catalog/models/nvidia:med:clara_train_covid19_exam_ehr_xray. The federated learning software is available as part of the Clara Train SDK: https://ngc.nvidia.com/catalog/containers/nvidia:clara-train-sdk. Alternatively, the model can be downloaded with the command “wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/med/clara_train_covid19_exam_ehr_xray/versions/1/zip -O clara_train_covid19_exam_ehr_xray_1.zip”.

References

  1. Budd, J. et al. Digital technologies in the public-health response to COVID-19. Nat. Med. 26, 1183–1192 (2020).

  2. Moorthy, V., Henao Restrepo, A. M., Preziosi, M.-P. & Swaminathan, S. Data sharing for novel coronavirus (COVID-19). Bull. World Health Organ. 98, 150 (2020).

  3. Chen, Q., Allot, A. & Lu, Z. Keep up with the latest coronavirus research. Nature 579, 193 (2020).

  4. Fabbri, F., Bhatia, A., Mayer, A., Schlotter, B. & Kaiser, J. BCG IT spend pulse: how COVID-19 is shifting tech priorities. https://www.bcg.com/publications/2020/how-covid-19-is-shifting-big-it-spend (2020).

  5. Candelon, F., Reichert, T., Duranton, S., di Carlo, R. C. & De Bondt, M. The rise of the AI-powered company in the postcrisis world. https://www.bcg.com/en-gb/publications/2020/business-applications-artificial-intelligence-post-covid (2020).

  6. Chao, H. et al. Integrative analysis for COVID-19 patient outcome prediction. Med. Image Anal. 67, 101844 (2021).

  7. Zhu, X. et al. Joint prediction and time estimation of COVID-19 developing severe symptoms using chest CT scan. Med. Image Anal. 67, 101824 (2021).

  8. Yang, D. et al. Federated semi-supervised learning for Covid region segmentation in chest ct using multi-national data from China, Italy, Japan. Med. Image Anal. 70, 101992 (2021).

  9. Minaee, S., Kafieh, R., Sonka, M., Yazdani, S. & Jamalipour Soufi, G. Deep-COVID: predicting COVID-19 from chest X-ray images using deep transfer learning. Med. Image Anal. 65, 101794 (2020).

  10. COVID-19 Studies from the World Health Organization Database. https://clinicaltrials.gov/ct2/who_table (2020).

  11. ACTIV. https://www.nih.gov/research-training/medical-research-initiatives/activ (2020).

  12. Coronavirus Treatment Acceleration Program (CTAP). US Food and Drug Administration https://www.fda.gov/drugs/coronavirus-covid-19-drugs/coronavirus-treatment-acceleration-program-ctap (2020).

  13. Gleeson, P., Davison, A. P., Silver, R. A. & Ascoli, G. A. A commitment to open source in neuroscience. Neuron 96, 964–965 (2017).

  14. Piwowar, H. et al. The state of OA: a large-scale analysis of the prevalence and impact of open access articles. PeerJ. 6, e4375 (2018).

  15. European Society of Radiology (ESR). What the radiologist should know about artificial intelligence – an ESR white paper. Insights Imaging 10, 44 (2019).

  16. Pesapane, F., Codari, M. & Sardanelli, F. Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine. Eur. Radiol. Exp. 2, 35 (2018).

  17. Price, W. N. 2nd & Cohen, I. G. Privacy in the age of medical big data. Nat. Med. 25, 37–43 (2019).

  18. Liang, W. et al. Development and validation of a clinical risk score to predict the occurrence of critical illness in hospitalized patients with COVID-19. JAMA Intern. Med. 180, 1081–1089 (2020).

  19. Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. Brit. Med. J. 369, m1328 (2020).

  20. Zhang, L. et al. D-dimer levels on admission to predict in-hospital mortality in patients with Covid-19. J. Thromb. Haemost. 18, 1324–1329 (2020).

  21. Sands, K. E. et al. Patient characteristics and admitting vital signs associated with coronavirus disease 2019 (COVID-19)-related mortality among patients admitted with noncritical illness. https://doi.org/10.1017/ice.2020.461 (2020).

  22. American College of Radiology. CR recommendations for the use of chest radiography and computed tomography (CT) for suspected COVID-19 infection. https://www.acr.org/Advocacy-and-Economics/ACR-Position-Statements/Recommendations-for-Chest-Radiography-and-CT-for-Suspected-COVID19-Infection (2020).

  23. Rubin, G. D. et al. The role of chest imaging in patient management during the COVID-19 pandemic: a multinational consensus statement from the Fleischner Society. Radiology 296, 172–180 (2020).

  24. World Health Organization. Use of chest imaging in COVID-19. https://www.who.int/publications/i/item/use-of-chest-imaging-in-covid-19 (2020).

  25. Jamil, S. et al. Diagnosis and management of COVID-19 disease. Am. J. Respir. Crit. Care Med. 201, 10 (2020).

  26. Redmond, C. E., Nicolaou, S., Berger, F. H., Sheikh, A. M. & Patlas, M. N. Emergency radiology during the COVID-19 pandemic: The Canadian Association of Radiologists Recommendations for Practice. Can. Assoc. Radiologists J. 71, 425–430 (2020).

  27. Buch, V. et al. Development and validation of a deep learning model for prediction of severe outcomes in suspected COVID-19 Infection. Preprint at https://arxiv.org/abs/2103.11269 (2021).

  28. Lyons, C. & Callaghan, M. The use of high-flow nasal oxygen in COVID-19. Anaesthesia 75, 843–847 (2020).

  29. Whittle, J. S., Pavlov, I., Sacchetti, A. D., Atwood, C. & Rosenberg, M. S. Respiratory support for adult patients with COVID-19. J. Am. Coll. Emerg. Physicians Open 1, 95–101 (2020).

  30. Ai, J., Li, Y., Zhou, X. & Zhang, W. COVID-19: treating and managing severe cases. Cell Res. 30, 370–371 (2020).

  31. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).

  32. Cahan, E. M., Hernandez-Boussard, T., Thadaney-Israni, S. & Rubin, D. L. Putting the data before the algorithm in big data addressing personalized healthcare. NPJ Digit. Med. 2, 78 (2019).

  33. Thrall, J. H. et al. Artificial intelligence and machine learning in radiology: opportunities, challenges, pitfalls, and criteria for success. J. Am. Coll. Radiol. 15, 504–508 (2018).

  34. Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).

  35. Gao, Y. & Cui, Y. Deep transfer learning for reducing health care disparities arising from biomedical data inequality. Nat. Commun. 11, 5131 (2020).

  36. Rieke, N. et al. The future of digital health with federated learning. NPJ Dig. Med. 3, 119 (2020).

  37. Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. 10, 12 (2019).

  38. Ma, C. et al. On safeguarding privacy and security in the framework of federated learning. IEEE Netw. 34, 242–248 (2020).

  39. Brisimi, T. S. et al. Federated learning of predictive models from federated Electronic Health Records. Int. J. Med. Inform. 112, 59–67 (2018).

  40. Roth, H. R. et al. Federated learning for breast density classification: a real-world implementation. In Proc. Second MICCAI Workshop, DART 2020, and First MICCAI Workshop, DCL 2020: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning (eds. Albarqouni, S. et al.) Vol. 12444, 181–191 (Springer International Publishing, 2020).

  41. Sheller, M. J. et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci. Rep. 10, 12598 (2020).

  42. Remedios, S. W., Butman, J. A., Landman, B. A. & Pham, D. L. in Federated Gradient Averaging for Multi-Site Training with Momentum-Based Optimizers (eds Remedios, S. W. et al.) (Springer, 2020).

  43. Xu, Y. et al. A collaborative online AI engine for CT-based COVID-19 diagnosis. Preprint at https://www.medrxiv.org/content/10.1101/2020.05.10.20096073v2 (2020).

  44. Raisaro, J. L. et al. SCOR: A secure international informatics infrastructure to investigate COVID-19. J. Am. Med. Inform. Assoc. 27, 1721–1726 (2020).

  45. Vaid, A. et al. Federated learning of electronic health records to improve mortality prediction in hospitalized patients with COVID-19: machine learning approach. JMIR Med. Inform. 9, e24207 (2021).

  46. Nino, G. et al. Pediatric lung imaging features of COVID-19: a systematic review and meta-analysis. Pediatr. Pulmonol. 56, 252–263 (2021).

  47. Fredrikson, M., Jha, S. & Ristenpart, T. Model inversion attacks that exploit confidence information and basic countermeasures. In Proc. 22nd ACM SIGSAC Conference on Computer and Communications Security 1322–1333, https://doi.org/10.1145/2810103.2813677 (2015).

  48. Zhu, L., Liu, Z. & Han, S. in Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 14774–14784 (Curran Associates, Inc., 2019).

  49. Kaissis, G. A., Makowski, M. R., Rückert, D. & Braren, R. F. Secure, privacy-preserving and federated machine learning in medical imaging. Nat. Mach. Intell. 2, 305–311 (2020).

  50. Li, W. et al. in Privacy-Preserving Federated Brain Tumour Segmentation 133–141 (Springer, 2019).

  51. Shokri, R. & Shmatikov, V. Privacy-preserving deep learning. In Proc. 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton) https://doi.org/10.1109/allerton.2015.7447103 (2015).

  52. Li, X. et al. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. Med. Image Anal. 65, 101765 (2020).

  53. Estiri, H. et al. Predicting COVID-19 mortality with electronic medical records. NPJ Dig. Med. 4, 15 (2021).

  54. Jiang, G. et al. Harmonization of detailed clinical models with clinical study data standards. Methods Inf. Med. 54, 65–74 (2015).

  55. Yang, D. et al. Searching Learning Strategy with Reinforcement Learning for 3D Medical Image Segmentation. https://doi.org/10.1007/978-3-030-32245-8_1 (2019).

  56. Elsken, T., Metzen, J. H. & Hutter, F. Neural architecture search: a survey. J. Mach. Learning Res. 20, 1–21 (2019).

  57. Yao, Q. et al. Taking human out of learning applications: a survey on automated machine learning. Preprint at https://arxiv.org/abs/1810.13306 (2019).

  58. Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. 32nd International Conf. Machine Learning, PMLR 37, 448–456 (2015).

  59. Kaufman, S., Rosset, S. & Perlich, C. Leakage in data mining: formulation, detection, and avoidance. In Proc. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 556–563 (2011).

  60. Zhang, C. et al. BatchCrypt: efficient homomorphic encryption for cross-silo federated learning. In Proc. 2020 USENIX Annual Technical Conference, ATC 2020, 493–506 (2020).

  61. NVIDIA NGC Catalog: COVID-19 Related Models. https://ngc.nvidia.com/catalog/models?orderBy=scoreDESC&pageNumber=0&query=covid&quickFilter=models&filters (2020).

  62. Marini, J. J. & Gattinoni, L. Management of COVID-19 respiratory distress. JAMA 323, 2329–2330 (2020).

  63. Cook, T. M. et al. Consensus guidelines for managing the airway in patients with COVID-19: Guidelines from the Difficult Airway Society, the Association of Anaesthetists the Intensive Care Society, the Faculty of Intensive Care Medicine and the Royal College of Anaesthetist. Anaesthesia 75, 785–799 (2020).

  64. Galloway, J. B. et al. A clinical risk score to identify patients with COVID-19 at high risk of critical care admission or death: an observational cohort study. J. Infect. 81, 282–288 (2020).

  65. Kilaru, A. S. et al. Return hospital admissions among 1419 COVID-19 patients discharged from five U.S. emergency departments. Acad. Emerg. Med. 27, 1039–1042 (2020).

  66. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/cvpr.2016.90 (2016).

  67. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 33, 590–597 (2019).

  68. Wang, R., Fu, B., Fu, G. & Wang, M. Deep & Cross network for Ad Click predictions. In Proc. ADKDD’17 Article no. 12 (2017).

  69. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), USENIX Association 265–283 (2016).

  70. NVIDIA Clara Imaging. https://developer.nvidia.com/clara-medical-imaging (2020).

  71. Stekhoven, D. J. & Bühlmann, P. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).

  72. McMahan, H., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. http://proceedings.mlr.press/v54/mcmahan17a.html (2017).

  73. Hsieh, K., Phanishayee, A., Mutlu, O. & Gibbons, P. B. The non-IID data quagmire of decentralized machine learning. In Proc. 37th International Conf. Machine Learning PMLR 119 (2020).

  74. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).

    \

Acknowledgements

The views expressed in this study are those of the authors and not necessarily those of the NHS, the NIHR, the Department of Health and Social Care or any of the organizations associated with the authors. MGB thank the following individuals for their support: J. Brink, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; M. Kalra, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; N. Neumark, Center for Clinical Data Science, Massachusetts General Brigham, Boston, MA; T. Schultz, Department of Radiology, Massachusetts General Hospital, Boston, MA; N. Guo, Center for Advanced Medical Computing and Analysis, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; J. K. Cramer, Director, QTIM lab at the Athinoula A. Martinos Center for Biomedical Imaging at MGH; S. Pomerantz, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA; G. Boland, Department of Radiology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA; W. Mayo-Smith, Department of Radiology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA. UCSF thank P. B. Storey, J. Chan and J. Block for implementing the UCSF FL client infrastructure, and W. Tellis for providing the source imaging repository for this work. The UCSF EMR and clinical notes for this study were accessed via the COVID-19 Research Data Mart, https://data.ucsf.edu/covid19. The Faculty of Medicine, Chulalongkorn University thank the Ratchadapisek Sompoch Endowment Fund RA (PO) (no. 001/63) for the collection and management of COVID‐19-related clinical data and biological specimens for the Research Task Force, Faculty of Medicine, Chulalongkorn University. NIHR Cambridge Biomedical Research Centre thank A. Priest, who is supported by the NIHR (Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust). 
National Taiwan University MeDA Lab and the MAHC and Taiwan National Health Insurance Administration thank the MOST Joint Research Center for AI technology, the All Vista Healthcare National Health Insurance Administration, Taiwan, the Ministry of Science and Technology, Taiwan and the National Center for Theoretical Sciences Mathematics Division. National Institutes of Health (NIH) acknowledge that the NIH Medical Research Scholars Program is a public–private partnership supported jointly by the NIH and by generous contributions to the Foundation for the NIH from the Doris Duke Charitable Foundation, the American Association for Dental Research, the Colgate-Palmolive Company, Genentech, alumni of student research programs and other individual supporters via contributions to the Foundation for the NIH.

\ \

:::info This paper is available on nature under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

\

The AI Breakthrough That Lets Hospitals Train Algorithms Without Sharing Patient Data

2026-03-16 23:00:57

:::info

Authors:

  1. Nicola Rieke
  2. Jonny Hancox
  3. Wenqi Li
  4. Fausto Milletarì
  5. Holger R. Roth
  6. Shadi Albarqouni
  7. Spyridon Bakas
  8. Mathieu N. Galtier
  9. Bennett A. Landman
  10. Klaus Maier-Hein
  11. Sébastien Ourselin
  12. Micah Sheller
  13. Ronald M. Summers
  14. Andrew Trask
  15. Daguang Xu
  16. Maximilian Baust
  17. M. Jorge Cardoso

:::

\

Abstract

Data-driven machine learning (ML) has emerged as a promising approach for building accurate and robust statistical models from medical data, which is collected in huge volumes by modern healthcare systems. Existing medical data is not fully exploited by ML primarily because it sits in data silos and privacy concerns restrict access to this data. However, without access to sufficient data, ML will be prevented from reaching its full potential and, ultimately, from making the transition from research to clinical practice. This paper considers key factors contributing to this issue, explores how federated learning (FL) may provide a solution for the future of digital health and highlights the challenges and considerations that need to be addressed.

\

Introduction

Research on artificial intelligence (AI), and particularly the advances in machine learning (ML) and deep learning (DL)1 have led to disruptive innovations in radiology, pathology, genomics and other fields. Modern DL models feature millions of parameters that need to be learned from sufficiently large curated data sets in order to achieve clinical-grade accuracy, while being safe, fair, equitable and generalising well to unseen data2,3,4,5.

For example, training an AI-based tumour detector requires a large database encompassing the full spectrum of possible anatomies, pathologies, and input data types. Data like this is hard to obtain, because health data is highly sensitive and its usage is tightly regulated6. Even if data anonymisation could bypass these limitations, it is now well understood that removing metadata such as patient name or date of birth is often not enough to preserve privacy7. It is, for example, possible to reconstruct a patient’s face from computed tomography (CT) or magnetic resonance imaging (MRI) data8. Another reason why data sharing is not systematic in healthcare is that collecting, curating, and maintaining a high-quality data set takes considerable time, effort, and expense. Consequently such data sets may have significant business value, making it less likely that they will be freely shared. Instead, data collectors often retain fine-grained control over the data that they have gathered.

Federated learning (FL)9,10,11 is a learning paradigm seeking to address the problem of data governance and privacy by training algorithms collaboratively without exchanging the data itself. Originally developed for different domains, such as mobile and edge device use cases12, it recently gained traction for healthcare applications13,14,15,16,17,18,19,20. FL enables gaining insights collaboratively, e.g., in the form of a consensus model, without moving patient data beyond the firewalls of the institutions in which they reside. Instead, the ML process occurs locally at each participating institution and only model characteristics (e.g., parameters, gradients) are transferred as depicted in Fig. 1. Recent research has shown that models trained by FL can achieve performance levels comparable to ones trained on centrally hosted data sets and superior to models that only see isolated single-institutional data16,17.

\ Fig. 1: Example federated learning (FL) workflows and difference to learning on a Centralised Data Lake.

a FL aggregation server—the typical FL workflow in which a federation of training nodes receive the global model, resubmit their partially trained models to a central server intermittently for aggregation and then continue training on the consensus model that the server returns. b FL peer to peer—alternative formulation of FL in which each training node exchanges its partially trained models with some or all of its peers and each does its own aggregation. c Centralised training—the general non-FL training workflow in which data acquiring sites donate their data to a central Data Lake from which they and others are able to extract data for local, independent training.

\ A successful implementation of FL could thus hold a significant potential for enabling precision medicine at large-scale, leading to models that yield unbiased decisions, optimally reflect an individual’s physiology, and are sensitive to rare diseases while respecting governance and privacy concerns. However, FL still requires rigorous technical consideration to ensure that the algorithm is proceeding optimally without compromising safety or patient privacy. Nevertheless, it has the potential to overcome the limitations of approaches that require a single pool of centralised data.

We envision a federated future for digital health and with this perspective paper, we share our consensus view with the aim of providing context and detail for the community regarding the benefits and impact of FL for medical applications (section “Data-driven medicine requires federated efforts”), as well as highlighting key considerations and challenges of implementing FL for digital health (section “Technical considerations”).

Data-driven medicine requires federated efforts

ML and especially DL is becoming the de facto knowledge discovery approach in many industries, but successfully implementing data-driven applications requires large and diverse data sets. However, medical data sets are difficult to obtain (subsection “The reliance on data”). FL addresses this issue by enabling collaborative learning without centralising data (subsection “The promise of federated efforts”) and has already found its way to digital health applications (subsection “Current FL efforts for digital health”). This new learning paradigm requires consideration from, but also offers benefits to, various healthcare stakeholders (section “Impact on stakeholders”).

The reliance on data

Data-driven approaches rely on data that truly represent the underlying data distribution of the problem. While this is a well-known requirement, state-of-the-art algorithms are usually evaluated on carefully curated data sets, often originating from only a few sources. This can introduce biases where demographics (e.g., gender, age) or technical imbalances (e.g., acquisition protocol, equipment manufacturer) skew predictions and adversely affect the accuracy for certain groups or sites. However, to capture subtle relationships between disease patterns, socio-economic and genetic factors, as well as complex and rare cases, it is crucial to expose a model to diverse cases.

The need for large databases for AI training has spawned many initiatives seeking to pool data from multiple institutions. This data is often amassed into so-called Data Lakes. These have been built with the aim of leveraging either the commercial value of data, e.g., IBM’s Merge Healthcare acquisition21, or as a resource for economic growth and scientific progress, e.g., NHS Scotland’s National Safe Haven22, French Health Data Hub23, and Health Data Research UK24.

Substantial, albeit smaller, initiatives include the Human Connectome25, the UK Biobank26, the Cancer Imaging Archive (TCIA)27, NIH CXR828, NIH DeepLesion29, the Cancer Genome Atlas (TCGA)30, the Alzheimer’s Disease Neuroimaging Initiative (ADNI)31, as well as medical grand challenges32 such as the CAMELYON challenge33, the International multimodal Brain Tumor Segmentation (BraTS) challenge34,35,36 or the Medical Segmentation Decathlon37. Public medical data is usually task- or disease-specific and often released with varying degrees of license restrictions, sometimes limiting its exploitation.

Centralising or releasing data, however, poses not only regulatory, ethical and legal challenges related to privacy and data protection, but also technical ones. Anonymising, controlling access to and safely transferring healthcare data is a non-trivial, and sometimes impossible, task. Anonymised data from the electronic health record can appear innocuous and GDPR/PHI compliant, but just a few data elements may allow for patient reidentification7. The same applies to genomic data, and medical images are as unique as a fingerprint38. Therefore, unless the anonymisation process destroys the fidelity of the data, likely rendering it useless, patient reidentification or information leakage cannot be ruled out. Gated access for approved users is often proposed as a putative solution to this issue. However, besides limiting data availability, this is only practical for cases in which the consent granted by the data owners is unconditional, since recalling data from those who may have had access to the data is practically unenforceable.

The promise of federated efforts

The promise of FL is simple—to address privacy and data governance challenges by enabling ML from non-co-located data. In a FL setting, each data controller not only defines its own governance processes and associated privacy policies, but also controls data access and has the ability to revoke it. This includes both the training, as well as the validation phase. In this way, FL could create new opportunities, e.g., by allowing large-scale, in-institutional validation, or by enabling novel research on rare diseases, where the incident rates are low and data sets at each single institution are too small. Moving the model to the data and not vice versa has another major advantage: high-dimensional, storage-intense medical data does not have to be duplicated from local institutions in a centralised pool and duplicated again by every user that uses this data for local model training. As the model is transferred to the local institutions, it can scale naturally with a potentially growing global data set without disproportionately increasing data storage requirements.

As depicted in Fig. 2, a FL workflow can be realised with different topologies and compute plans. The two most common ones for healthcare applications are via an aggregation server16,17,18 and peer to peer approaches15,39. In all cases, FL implicitly offers a certain degree of privacy, as FL participants never directly access data from other institutions and only receive model parameters that are aggregated over several participants. In a FL workflow with aggregation server, the participating institutions can even remain unknown to each other. However, it has been shown that the models themselves can, under certain conditions, memorise information40,41,42,43. Therefore, mechanisms such as differential privacy44,45 or learning from encrypted data have been proposed to further enhance privacy in a FL setting (c.f. section “Technical considerations”). Overall, the potential of FL for healthcare applications has sparked interest in the community46 and FL techniques are a growing area of research12,20.

Fig. 2: Overview of different FL design choices.

FL topologies—communication architecture of a federation. a Centralised: the aggregation server coordinates the training iterations and collects, aggregates and distributes the models to and from the Training Nodes (Hub & Spoke). b Decentralised: each training node is connected to one or more peers and aggregation occurs on each node in parallel. c Hierarchical: federated networks can be composed from several sub-federations, which can be built from a mix of Peer to Peer and Aggregation Server federations (d)). FL compute plans—trajectory of a model across several partners. e Sequential training/cyclic transfer learning. f Aggregation server, g Peer to Peer.
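The decentralised topology in panel (b) can be made concrete with a toy gossip-averaging step, in which each node repeatedly averages its parameters with its directly connected peers and no central server exists. The federation graph, site names and parameter values below are invented for illustration:

```python
import numpy as np

# Toy federation: each node holds its own locally trained parameter vector
params = {
    "site_a": np.array([1.0, 4.0]),
    "site_b": np.array([3.0, 0.0]),
    "site_c": np.array([5.0, 2.0]),
}
# Connectivity of the peer-to-peer topology (site_b bridges a and c)
peers = {
    "site_a": ["site_b"],
    "site_b": ["site_a", "site_c"],
    "site_c": ["site_b"],
}

def gossip_round(params, peers):
    """Each node averages its parameters with its neighbours, in parallel."""
    return {
        node: np.mean([params[node]] + [params[p] for p in peers[node]], axis=0)
        for node in params
    }

# Repeated rounds drive all nodes toward a common consensus model
state = dict(params)
for _ in range(50):
    state = gossip_round(state, peers)
```

Note that with an asymmetric neighbourhood structure like this one, the nodes converge to a weighted consensus rather than the plain average of the initial models; better-connected sites carry more weight.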

\

Current FL efforts for digital health

Since FL is a general learning paradigm that removes the data pooling requirement for AI model development, the application range of FL spans the whole of AI for healthcare. By providing an opportunity to capture larger data variability and to analyse patients across different demographics, FL may enable disruptive innovations for the future but is also being employed right now.

In the context of electronic health records (EHR), for example, FL helps to represent and to find clinically similar patients13,47, as well as predicting hospitalisations due to cardiac events14, mortality and ICU stay time19. The applicability and advantages of FL have also been demonstrated in the field of medical imaging, for whole-brain segmentation in MRI15, as well as brain tumour segmentation16,17. Recently, the technique has been employed for fMRI classification to find reliable disease-related biomarkers18 and suggested as a promising approach in the context of COVID-1948.

It is worth noting that FL efforts require agreements to define the scope, aim and technologies used which, since it is still novel, can be difficult to pin down. In this context, today’s large-scale initiatives really are the pioneers of tomorrow’s standards for safe, fair and innovative collaboration in healthcare applications.

These include consortia that aim to advance academic research, such as the Trustworthy Federated Data Analytics (TFDA) project49 and the German Cancer Consortium’s Joint Imaging Platform50, which enable decentralised research across German medical imaging research institutions. Another example is an international research collaboration that uses FL for the development of AI models for the assessment of mammograms51. The study showed that the FL-generated models outperformed those trained on a single institute’s data and were more generalisable, so that they still performed well on other institutes’ data. However, FL is not limited just to academic environments.

By linking healthcare institutions, not restricted to research centres, FL can have direct clinical impact. The on-going HealthChain project52, for example, aims to develop and deploy a FL framework across four hospitals in France. This solution generates common models that can predict treatment response for breast cancer and melanoma patients. It helps oncologists to determine the most effective treatment for each patient from their histology slides or dermoscopy images. Another large-scale effort is the Federated Tumour Segmentation (FeTS) initiative53, which is an international federation of 30 committed healthcare institutions using an open-source FL framework with a graphical user interface. The aim is to improve tumour boundary detection, including brain glioma, breast tumours, liver tumours and bone lesions from multiple myeloma patients.

Another area of impact is within industrial research and translation. FL enables collaborative research for, even competing, companies. In this context, one of the largest initiatives is the Melloddy project54. It is a project aiming to deploy multi-task FL across the data sets of 10 pharmaceutical companies. By training a common predictive model, which infers how chemical compounds bind to proteins, partners intend to optimise the drug discovery process without revealing their highly valuable in-house data.

Impact on stakeholders

FL comprises a paradigm shift from centralised data lakes and it is important to understand its impact on the various stakeholders in a FL ecosystem.

Clinicians

Clinicians are usually exposed to a sub-group of the population based on their location and demographic environment, which may cause biased assumptions about the probability of certain diseases or their interconnection. By using ML-based systems, e.g., as a second reader, they can augment their own expertise with expert knowledge from other institutions, ensuring a consistency of diagnosis not attainable today. While this applies to ML-based systems in general, systems trained in a federated fashion are potentially able to yield even less biased decisions and higher sensitivity to rare cases, as they were likely exposed to a more complete data distribution. However, this demands some up-front effort, such as compliance with agreements, e.g., regarding the data structure, annotation and report protocol, which is necessary to ensure that the information is presented to collaborators in a commonly understood format.

Patients

Patients are usually treated locally. Establishing FL on a global scale could ensure high quality of clinical decisions regardless of the treatment location. In particular, patients requiring medical attention in remote areas could benefit from the same high-quality ML-aided diagnoses that are available in hospitals with a large number of cases. The same holds true for rare, or geographically uncommon, diseases, that are likely to have milder consequences if faster and more accurate diagnoses can be made. FL may also lower the hurdle for becoming a data donor, since patients can be reassured that the data remains with their own institution and data access can be revoked.

Hospitals and practices

Hospitals and practices can remain in full control and possession of their patient data with complete traceability of data access, limiting the risk of misuse by third parties. However, this will require investment in on-premise computing infrastructure or private-cloud service provision and adherence to standardised and synoptic data formats so that ML models can be trained and evaluated seamlessly. The amount of necessary compute capability depends of course on whether a site is only participating in evaluation and testing efforts or also in training efforts. Even relatively small institutions can participate and they will still benefit from collective models generated.

Researchers and AI developers

Researchers and AI developers stand to benefit from access to a potentially vast collection of real-world data, which will particularly impact smaller research labs and start-ups. Thus, resources can be directed towards solving clinical needs and associated technical problems rather than relying on the limited supply of open data sets. At the same time, it will be necessary to conduct research on algorithmic strategies for federated training, e.g., how to combine models or updates efficiently, how to be robust to distribution shifts11,12,20. FL-based development implies also that the researcher or AI developer cannot investigate or visualise all of the data on which the model is trained, e.g., it is not possible to look at an individual failure case to understand why the current model performs poorly on it.

Healthcare providers

Healthcare providers in many countries are affected by the on-going paradigm shift from volume-based, i.e., fee-for-service-based, to value-based healthcare, which is in turn strongly connected to the successful establishment of precision medicine. This is not about promoting more expensive individualised therapies but instead about achieving better outcomes sooner through more focused treatment, thereby reducing the cost. FL has the potential to increase the accuracy and robustness of healthcare AI, while reducing costs and improving patient outcomes, and may therefore be vital to precision medicine.

Manufacturers

Manufacturers of healthcare software and hardware could benefit from FL as well, since combining the learning from many devices and applications, without revealing patient-specific information, can facilitate the continuous validation or improvement of their ML-based systems. However, realising such a capability may require significant upgrades to local compute, data storage, networking capabilities and associated software.

Technical considerations

FL is perhaps best known from the work of Konečný et al.55, but various other definitions have been proposed in the literature9,11,12,20. A FL workflow (Fig. 1) can be realised via different topologies and compute plans (Fig. 2), but the goal remains the same, i.e., to combine knowledge learned from non-co-located data. In this section, we will discuss in more detail what FL is, as well as highlighting the key challenges and technical considerations that arise when applying FL in digital health.

Federated learning definition

FL is a learning paradigm in which multiple parties train collaboratively without the need to exchange or centralise data sets. A general formulation of FL reads as follows: let $\mathcal{L}(X; \phi)$ denote a global loss function obtained via a weighted combination of $K$ local losses $\mathcal{L}_k(X_k; \phi)$, computed from private data $X_k$, which resides at the individual involved parties and is never shared among them:

$$\min_{\phi} \; \mathcal{L}(X; \phi) \quad \text{with} \quad \mathcal{L}(X; \phi) = \sum_{k=1}^{K} w_k \, \mathcal{L}_k(X_k; \phi), \tag{1}$$

where $w_k > 0$ denote the respective weight coefficients.

In practice, each participant typically obtains and refines a global consensus model by conducting a few rounds of optimisation locally before sharing updates, either directly or via a parameter server. The more rounds of local training are performed, the less it is guaranteed that the overall procedure is minimising Eq. (1)9,12. The actual process for aggregating parameters depends on the network topology, as nodes might be segregated into sub-networks due to geographical or legal constraints (see Fig. 2). Aggregation strategies can rely on a single aggregating node (hub and spokes models), or on multiple nodes without any centralisation. An example is peer-to-peer FL, where connections exist between all or a subset of the participants and model updates are shared only between directly connected sites15,56, whereas an example of centralised FL aggregation is given in Algorithm 1. Note that aggregation strategies do not necessarily require information about the full model update; clients might choose to share only a subset of the model parameters to reduce communication overhead, ensure better privacy preservation10 or produce multi-task learning algorithms having only part of their parameters learned in a federated manner.
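A minimal FedAvg-style round of centralised aggregation, in the spirit described above, can be sketched in NumPy. The linear model, data and hyperparameters are invented for illustration, and the aggregation weights w_k are taken proportional to local data set size:

```python
import numpy as np

def local_train(global_weights, X, y, lr=0.1, steps=5):
    """One institution: refine the received global model on private data."""
    w = global_weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the local MSE loss
        w -= lr * grad
    return w

def aggregate(client_weights, client_sizes):
    """Server: average models with weights w_k proportional to data set size."""
    total = sum(client_sizes)
    return sum((n / total) * w for w, n in zip(client_weights, client_sizes))

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
# Two "institutions" whose raw data never leaves the site
datasets = []
for n in (50, 80):
    X = rng.normal(size=(n, 2))
    datasets.append((X, X @ true_w))

global_w = np.zeros(2)
for _ in range(20):  # each round: broadcast, local training, weighted aggregation
    updates = [local_train(global_w, X, y) for X, y in datasets]
    global_w = aggregate(updates, [len(y) for _, y in datasets])
```

Only `updates` (model parameters) cross institutional boundaries here; the arrays in `datasets` stay local, which is the core of the FL privacy argument.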

A unifying framework enabling various training schemes may disentangle compute resources (data and servers) from the compute plan, as depicted in Fig. 2. The latter defines the trajectory of a model across several partners, to be trained and evaluated on specific data sets.

Challenges and considerations

Despite the advantages of FL, it does not solve all issues that are inherent to learning on medical data. A successful model training still depends on factors like data quality, bias and standardisation2. These issues have to be solved for both federated and non-federated learning efforts via appropriate measures, such as careful study design, common protocols for data acquisition, structured reporting and sophisticated methodologies for discovering bias and hidden stratification. In the following, we touch upon the key aspects of FL that are of particular relevance when applied to digital health and need to be taken into account when establishing FL. For technical details and in-depth discussion, we refer the reader to recent surveys11,12,20.

Data heterogeneity

Medical data is particularly diverse—not only because of the variety of modalities, dimensionality and characteristics in general, but even within a specific protocol due to factors such as acquisition differences, brand of the medical device or local demographics. FL may help address certain sources of bias through potentially increased diversity of data sources, but inhomogeneous data distribution poses a challenge for FL algorithms and strategies, as many assume independently and identically distributed (IID) data across the participants. In general, strategies such as FedAvg9 are prone to fail under these conditions9,57,58, in part defeating the very purpose of collaborative learning strategies. Recent results, however, indicate that FL training is still feasible59, even if medical data is not uniformly distributed across the institutions16,17 or includes a local bias51. Research addressing this problem includes, for example, FedProx57, the part-data-sharing strategy58 and FL with domain adaptation18. Another challenge is that data heterogeneity may lead to a situation in which the globally optimal solution may not be optimal for an individual local participant. The definition of model training optimality should, therefore, be agreed by all participants before training.
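The FedProx idea mentioned above can be seen as adding a proximal penalty to each client's local objective, keeping the local model anchored to the current global model under non-IID data. This is a toy sketch of that modification, not the reference implementation; the model, data and the value of mu are invented:

```python
import numpy as np

def fedprox_local_loss(w_local, w_global, X, y, mu=0.1):
    """Local MSE plus a proximal term that discourages drifting away from
    the global model -- the core change FedProx makes to local training."""
    mse = np.mean((X @ w_local - y) ** 2)
    proximal = (mu / 2.0) * np.sum((w_local - w_global) ** 2)
    return mse + proximal

def fedprox_local_grad(w_local, w_global, X, y, mu=0.1):
    """Gradient of the proximal local objective above."""
    return 2 * X.T @ (X @ w_local - y) / len(y) + mu * (w_local - w_global)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0])
w_global = np.zeros(2)

# A few proximal local steps: the client improves its fit but stays anchored
w = w_global.copy()
for _ in range(30):
    w -= 0.05 * fedprox_local_grad(w, w_global, X, y, mu=0.1)
```

Larger mu pulls the local solution closer to the global model, trading local fit for stability of the federation under heterogeneous data.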

Privacy and security

Healthcare data is highly sensitive and must be protected accordingly, following appropriate confidentiality procedures. Therefore, some of the key considerations are the trade-offs, strategies and remaining risks regarding the privacy-preserving potential of FL.

Privacy vs. performance: It is important to note that FL does not solve all potential privacy issues and—similar to ML algorithms in general—will always carry some risks. Privacy-preserving techniques for FL offer levels of protection that exceed today’s current commercially available ML models12. However, there is a trade-off in terms of performance and these techniques may affect, for example, the accuracy of the final model10. Furthermore, future techniques and/or ancillary data could be used to compromise a model previously considered to be low-risk.

Level of trust: Broadly speaking, participating parties can enter two types of FL collaboration:

Trusted—for FL consortia in which all parties are considered trustworthy and are bound by an enforceable collaboration agreement, we can eliminate many of the more nefarious motivations, such as deliberate attempts to extract sensitive information or to intentionally corrupt the model. This reduces the need for sophisticated counter-measures, falling back to the principles of standard collaborative research.

Non-trusted—in FL systems that operate on larger scales, it might be impractical to establish an enforceable collaborative agreement. Some clients may deliberately try to degrade performance, bring the system down or extract information from other parties. Hence, security strategies will be required to mitigate these risks such as, advanced encryption of model submissions, secure authentication of all parties, traceability of actions, differential privacy, verification systems, execution integrity, model confidentiality and protections against adversarial attacks.

Information leakage: By definition, FL systems avoid sharing healthcare data among participating institutions. However, the shared information may still indirectly expose private data used for local training, e.g., by model inversion60 of the model updates, the gradients themselves61 or adversarial attacks62,63. FL is different from traditional training insofar as the training process is exposed to multiple parties, thereby increasing the risk of leakage via reverse-engineering if adversaries can observe model changes over time, observe specific model updates (i.e., a single institution’s update), or manipulate the model (e.g., induce additional memorisation by others through gradient-ascent-style attacks). Developing counter-measures, such as limiting the granularity of the updates and adding noise16,18 and ensuring adequate differential privacy44, may be needed and is still an active area of research12.
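One counter-measure named above -- limiting the granularity of the updates and adding noise -- can be sketched as clipping each institution's update and adding Gaussian noise before it leaves the firewall. The clip norm and noise scale here are arbitrary illustration values; calibrating them to a formal differential-privacy budget requires additional machinery not shown:

```python
import numpy as np

def sanitise_update(update, clip_norm=1.0, noise_std=0.05, rng=None):
    """Clip the update's L2 norm, then add Gaussian noise, so any single
    institution's contribution to the shared model is bounded and blurred."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_std, size=update.shape)

rng = np.random.default_rng(42)
raw_update = np.array([3.0, -4.0])  # L2 norm 5: exceeds the clip norm
safe_update = sanitise_update(raw_update, clip_norm=1.0, noise_std=0.05, rng=rng)
```

Only `safe_update` would be sent to the aggregator; the raw gradient, which may memorise patient data, never leaves the site.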

Traceability and accountability

As with all safety-critical applications, the reproducibility of a system is important for FL in healthcare. In contrast to centralised training, FL requires multi-party computations in environments that exhibit considerable variety in terms of hardware, software and networks. Traceability of all system assets, including data access history, training configurations and hyperparameter tuning throughout the training processes, is thus mandatory. In non-trusted federations in particular, traceability and accountability processes require execution integrity. After the training process reaches the mutually agreed model optimality criteria, it may also be helpful to measure the amount of contribution from each participant, such as computational resources consumed, quality of the data used for local training, etc. These measurements could then be used to determine relevant compensation and establish a revenue model among the participants64. One implication of FL is that researchers are not able to investigate the data upon which models are being trained in order to make sense of unexpected results. Moreover, taking statistical measurements of their training data as part of the model development workflow will need to be approved by the collaborating parties as not violating privacy. Although each site will have access to its own raw data, federations may decide to provide some sort of secure intra-node viewing facility to cater for this need, or may provide some other way to increase the explainability and interpretability of the global model.
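Traceability of training events can be approximated with an append-only, hash-chained log, in which each record commits to the hash of its predecessor so that later tampering is detectable. This is a simplified sketch with invented record fields; a real federation would also need authentication, secure storage and agreed record schemas:

```python
import hashlib
import json

def append_record(log, record):
    """Append a training event; each entry stores the hash of its predecessor."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    payload = {"record": record, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    log.append({**payload, "hash": digest})
    return log

def verify_chain(log):
    """Recompute every hash; any edited record breaks the chain."""
    prev = "genesis"
    for entry in log:
        payload = {"record": entry["record"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != digest:
            return False
        prev = entry["hash"]
    return True

log = []
append_record(log, {"round": 1, "site": "hospital_a", "update_norm": 0.83})
append_record(log, {"round": 1, "site": "hospital_b", "update_norm": 1.10})
```

Each participant can independently rerun `verify_chain` over the shared log, which gives the federation a cheap form of mutual accountability without a trusted auditor.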

System architecture

Unlike large-scale FL among consumer devices such as in McMahan et al.9, healthcare institutional participants are equipped with relatively powerful computational resources and reliable, higher-throughput networks, enabling the training of larger models with many more local training steps and the sharing of more model information between nodes. These unique characteristics of FL in healthcare also bring challenges, such as ensuring data integrity when communicating through redundant nodes, designing secure encryption methods to prevent data leakage, and designing appropriate node schedulers to make best use of the distributed computational devices and reduce idle time.
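For orientation, the server-side aggregation step of the FedAvg algorithm9 that underlies many such systems reduces to a weighted average of client updates. A minimal sketch in plain Python (variable names are illustrative):

```python
def fedavg(client_weights, client_sizes):
    """Aggregate client model weights, weighting each client by
    the number of local training samples it contributed."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    aggregated = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            aggregated[i] += w * (size / total)
    return aggregated

# Two institutions with unequal data volumes: the larger site
# pulls the global model further toward its local optimum.
global_model = fedavg([[1.0, 2.0], [3.0, 4.0]], [100, 300])
# → [2.5, 3.5]
```

Production systems operate on full tensors and add the integrity and privacy machinery discussed above, but the aggregation logic itself is this simple.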

The administration of such a federation can be realised in different ways. In situations requiring the most stringent data privacy between parties, training may operate via some sort of “honest broker” system, in which a trusted third party acts as the intermediary and facilitates access to data. This setup requires an independent entity controlling the overall system, which may not always be desirable, since it could involve additional cost and procedural viscosity. However, it has the advantage that the precise internal mechanisms can be abstracted away from the clients, making the system more agile and simpler to update. In a peer-to-peer system, each site interacts directly with some or all of the other participants. In other words, there is no gatekeeper function: all protocols must be agreed up-front, which requires significant coordination effort, and changes must be made in a synchronised fashion by all parties to avoid problems. Additionally, in a trustless architecture the platform operator may be cryptographically locked into being honest by means of a secure protocol, but this may introduce significant computational overheads.

Conclusion

ML, and particularly DL, has led to a wide range of innovations in the area of digital healthcare. As all ML methods benefit greatly from the ability to access data that approximates the true global distribution, FL is a promising approach to obtain powerful, accurate, safe, robust and unbiased models. By enabling multiple parties to train collaboratively without the need to exchange or centralise data sets, FL neatly addresses issues related to the egress of sensitive medical data. As a consequence, it may open novel research and business avenues and has the potential to improve patient care globally. Already today, FL has an impact on nearly all stakeholders and the entire treatment cycle: from improved medical image analysis providing clinicians with better diagnostic tools, through true precision medicine that helps to find similar patients, to collaborative and accelerated drug discovery that decreases cost and time-to-market for pharma companies. Not all technical questions have been answered yet, and FL will certainly remain an active research area throughout the next decade12. Despite this, we truly believe that its potential impact on precision medicine and, ultimately, on improving medical care is very promising.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

References

  1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015).

  2. Wang, F., Casalino, L. P. & Khullar, D. Deep learning in medicine—promise, progress, and challenges. JAMA Intern. Med. 179, 293–294 (2019).

  3. Chartrand, G. et al. Deep learning: a primer for radiologists. Radiographics 37, 2113–2131 (2017).

  4. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342 (2018).

  5. Sun, C., Shrivastava, A., Singh, S. & Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, 843–852 (IEEE, 2017).

  6. Van Panhuis, W. G. et al. A systematic review of barriers to data sharing in public health. BMC Public Health 14, 1144 (2014).

  7. Rocher, L., Hendrickx, J. M. & De Montjoye, Y.-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10, 1–9 (2019).

  8. Schwarz, C. G. et al. Identification of anonymous mri research participants with face-recognition software. N. Engl. J. Med. 381, 1684–1686 (2019).

  9. McMahan, B., Moore, E., Ramage, D., Hampson, S. & y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. https://scholar.google.de/scholar?hl=de&as_sdt=0%2C5&q=Communicationefficient+learning+of+deep+networks+from+decentralized+data&btnG= (2017).

  10. Li, T., Sahu, A. K., Talwalkar, A. & Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine 37, 50–60 (IEEE, 2020).

  11. Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. (TIST) 10, 12 (2019).

  12. Kairouz, P. et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977 (2019).

  13. Lee, J. et al. Privacy-preserving patient similarity learning in a federated environment: development and analysis. JMIR Med. Inform. 6, e20 (2018).

  14. Brisimi, T. S. et al. Federated learning of predictive models from federated electronic health records. Int. J. Med. Inform. 112, 59–67 (2018).

  15. Roy, A. G., Siddiqui, S., Pölsterl, S., Navab, N. & Wachinger, C. Braintorrent: a peer-to-peer environment for decentralized federated learning. arXiv preprint arXiv:1905.06731 (2019).

  16. Li, W. et al. Privacy-preserving federated brain tumour segmentation. In International Workshop on Machine Learning in Medical Imaging, 133–141 (Springer, 2019).

  17. Sheller, M. J., Reina, G. A., Edwards, B., Martin, J. & Bakas, S. Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation. In International MICCAI Brainlesion Workshop, 92–104 (Springer, 2018).

  18. Li, X. et al. Multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: abide results. arXiv preprint arXiv:2001.05647 (2020).

  19. Huang, L. et al. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. J. Biomed. Inform. 99, 103291 (2019).

  20. Xu, J. & Wang, F. Federated learning for healthcare informatics. arXiv preprint arXiv:1911.06270 (2019).

  21. Roy, A. & Banerjee, A. IBM’s Merge Healthcare acquisition. https://www.reuters.com/article/us-merge-healthcare-m-a-ibm/ibm-to-buy-merge-healthcare-in-1-billion-deal-idUSKCN0QB1ML20150806 (2015) (Accessed 10 February 2020).

  22. NHS Scotland’s national safe haven. https://www.gov.scot/publications/charter-safe-havens-scotland-handling-unconsented-data-national-health-service-patient-records-support-research-statistics/pages/4/ (2015) (Accessed 10 February 2020).

  23. Cuggia, M. & Combes, S. The french health data hub and the german medical informatics initiatives: Two national projects to promote data sharing in healthcare. Yearbook Med. Informat. 28, 195–202 (2019).

  24. Health Data Research UK. https://www.hdruk.ac.uk/ (Health Data Research UK, 2020) (Accessed 10 Feb 2020).

  25. Sporns, O., Tononi, G. & Kötter, R. The human connectome: a structural description of the human brain. PLoS Comput. Biol. 1, e42, https://doi.org/10.1371/journal.pcbi.0010042 (2005).

  26. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779. https://doi.org/10.1371/journal.pmed.1001779 (2015).

  27. Clark, K. et al. The cancer imaging archive (tcia): maintaining and operating a public information repository. J. Digit. Imaging. 26, 1045–1057 (2013).

  28. Wang, X. et al. Chestx-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2097–2106 (IEEE, 2017).

  29. Yan, K., Wang, X., Lu, L. & Summers, R. M. Deeplesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J Med. Imaging. 5, 036501 (2018).

  30. Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemp. Oncol. 19, A68 (2015).

  31. Jack Jr., C. R. et al. The alzheimer’s disease neuroimaging initiative (adni): Mri methods. J. Magn. Reson. Imaging 27, 685–691 (2008).

  32. Grand Challenge: a platform for end-to-end development of machine learning solutions in biomedical imaging. https://grand-challenge.org/ (2020) (Accessed 24 July 2020).

  33. Litjens, G. et al. 1399 h&e-stained sentinel lymph node sections of breast cancer patients: the camelyon dataset. GigaScience 7, giy065 (2018).

  34. Menze, B. H. et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE Trans. Med. Imaging 34, 1993–2024 (2014).

  35. Bakas, S. et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv preprint arXiv:1811.02629 (2018).

  36. Bakas, S. et al. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 170117 (2017).

  37. Simpson, A. L. et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063 (2019).

  38. Yeh, F.-C. et al. Quantifying differences and similarities in whole-brain white matter architecture using local connectome fingerprints. PLoS Comput. Biol. 12, e1005203 (2016).

  39. Chang, K. et al. Distributed deep learning networks among institutions for medical imaging. J. Am. Med. Inform. Assoc. 25, 945–954 (2018).

  40. Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), 3-18 (IEEE, 2017).

  41. Sablayrolles, A., Douze, M., Ollivier, Y., Schmid, C. & Jégou, H. White-box vs black-box: Bayes optimal strategies for membership inference. In Chaudhuri, K. & Salakhutdinov, R. (eds) Proceedings of the 36th International Conference on Machine Learning, ICML, 97, 5558–5567. http://proceedings.mlr.press/v97/sablayrolles19a.html (PMLR, 2019).

  42. Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR. https://openreview.net/forum?id=Sy8gdB9xx (OpenReview.net, 2017).

  43. Carlini, N., Liu, C., Erlingsson, Ú., Kos, J. & Song, D. The secret sharer: evaluating and testing unintended memorization in neural networks. In Heninger, N. & Traynor, P. (eds) 28th USENIX Security Symposium (USENIX Security 19), 267–284. https://www.usenix.org/conference/usenixsecurity19/presentation/carlini (USENIX Association, Santa Clara, CA, USA, 2019).

  44. Abadi, M. et al. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308–318 (ACM, 2016).

  45. Shokri, R. & Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, 1310–1321 (ACM, 2015).

  46. Langlotz, C. P. et al. A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 nih/rsna/acr/the academy workshop. Radiology 291, 781–791 (2019).

  47. Kim, Y., Sun, J., Yu, H. & Jiang, X. Federated tensor factorization for computational phenotyping. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 887–895. https://doi.org/10.1145/3097983.3098118 (ACM, Halifax, NS, Canada, 2017).

  48. He, C., Annavaram, M. & Avestimehr, S. Fednas: Federated deep learning via neural architecture search. https://sites.google.com/view/cvpr20-nas/ (2020).

  49. Trustworthy federated data analytics (tfda). https://tfda.hmsp.center/ (2020) (Accessed 28 May 2020).

  50. Joint Imaging Platform (Jip). https://jip.dktk.dkfz.de/jiphomepage/ (2020) (Accessed 28 May 2020).

  51. Medical institutions collaborate to improve mammogram assessment ai. https://blogs.nvidia.com/blog/2020/04/15/federated-learning-mammogram-assessment/ (2020) (Accessed 28 May 2020).

  52. Healthchain consortium. https://www.substra.ai/en/healthchain-project (2020) (Accessed 28 May 2020).

  53. The federated tumor segmentation (fets) initiative. https://www.fets.ai (2020) (Accessed 28 May 2020).

  54. Machine learning ledger orchestration for drug discovery. https://cordis.europa.eu/project/id/831472 (2020) (Accessed 28 May 2020).

  55. Konečny`, J., McMahan, H. B., Ramage, D. & Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527 (2016).

  56. Lalitha, A., Kilinc, O. C., Javidi, T. & Koushanfar, F. Peer-to-peer federated learning on graphs. arXiv preprint arXiv:1901.11173 (2019).

  57. Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A. & Smith, V. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127 (2018).

  58. Zhao, Y. et al. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582 (2018).

  59. Li, X., Huang, K., Yang, W., Wang, S. & Zhang, Z. On the convergence of fedavg on non-IID data. https://openreview.net/forum?id=HJxNAnVtDS (2020).

  60. Wu, B. et al. P3sgd: patient privacy preserving SGD for regularizing deep CNNs in pathological image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2099–2108) (2019).

  61. Zhu, L., Liu, Z. & Han, S. Deep leakage from gradients. In Wallach, H. M. et al. (eds) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, 14747–14756. http://papers.nips.cc/paper/9617-deep-leakage-from-gradients (2019).

  62. Wang, Z. et al. Beyond inferring class representatives: user-level privacy leakage from federated learning. In 2019 IEEE Conference on Computer Communications, INFOCOM, 2512–2520. https://doi.org/10.1109/INFOCOM.2019.8737416 (IEEE, Paris, France, 2019).

  63. Hitaj, B., Ateniese, G. & Perez-Cruz, F. Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS’17, 603–618 (Association for Computing Machinery, New York, NY, USA, 2017).

  64. Ghorbani, A. & Zou, J. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning (pp. 2242-2251) (2019).


Acknowledgements

This work was supported by the UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value-Based Healthcare, by the Wellcome/EPSRC Centre for Medical Engineering (WT203148/Z/16/Z), by the Wellcome Flagship Programme (WT213038/Z/18/Z), by the Intramural Research Programme of the National Institutes of Health (NIH) Clinical Center, by the National Cancer Institute of the NIH under award number U01CA242871, by the National Institute of Neurological Disorders and Stroke of the NIH under award number R01NS042645, as well as by the Helmholtz Initiative and Networking Fund (project “Trustworthy Federated Data Analytics”) and the PRIME programme of the German Academic Exchange Service (DAAD) with funds from the German Federal Ministry of Education and Research (BMBF). The content and opinions expressed in this publication is solely the responsibility of the authors and do not necessarily represent those of the institutions they are affiliated with, e.g., the U.S. Department of Health and Human Services or the National Institutes of Health. Open access funding provided by Projekt DEAL.


:::info This paper is available on Nature under the CC BY 4.0 Deed (Attribution 4.0 International) license.

:::


Why Docker Desktop is Still the Go-To for Local Development

2026-03-16 22:57:48

Over the last few years, many tools have tried to replace Docker Desktop. Podman, nerdctl, and small Kubernetes setups are common examples. Technically, they work well and can run containers just fine. But in real projects, most teams still use Docker Desktop.

The reason is simple. It works with less effort. You do not need to spend much time fixing networking, storage, or system-specific issues. Other tools often need extra setup. Someone has to configure them, maintain scripts, and fix problems when systems change. Docker Desktop avoids most of that work. It keeps the workflow simple and stable. That is why many developers continue to use it for local development.

Local Development Is a Human Problem

Most local development issues are not caused by bad tools. They happen because the setup is too sensitive. Everything works only when the configuration is perfect.

Then someone updates their laptop. Someone installs something new. A new teammate joins. Next thing you know, the project won’t run anymore.

Now someone has to figure out what broke. That usually means checking containers, ports, or system settings. Most people don’t want to deal with that.

Docker Desktop prevents a lot of this. It keeps things working even when machines change, which makes life easier for growing teams.

Setup Should Take Minutes, Not Days

In many teams, the first week of onboarding is wasted on setup. New developers often need to:

  • Install multiple tools
  • Configure networking
  • Fix permission issues
  • Debug container failures
  • Match existing environments

With Docker Desktop, most of this is already handled. A new developer usually only needs:

git clone project-repo
cd project-repo
docker compose up

Within minutes, the application is running. That fast start matters. It builds confidence and keeps people productive.
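The compose file behind that command can stay small. A minimal sketch for a web app with a database — the service names, ports and images below are illustrative, not from any particular project:

```yaml
services:
  web:
    build: .            # Build the app image from the local Dockerfile
    ports:
      - "8080:8080"     # Expose the app on localhost:8080
    depends_on:
      - db
  db:
    image: postgres:16  # Stock database image, no local install needed
    environment:
      POSTGRES_PASSWORD: example
```

Because the file travels with the repo, every developer gets the same stack from the same three commands.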

Fewer Things to Break

Complex environments have more failure points. When teams rely on:

  • Separate runtimes
  • Manual networking
  • Custom scripts
  • Local Kubernetes clusters
  • Multiple configuration files

Each of these is another thing that can break, fall out of date, or conflict with the rest. Docker Desktop bundles them into a single maintained application, so there are simply fewer moving parts to fail.

Developers Should Not Be Infrastructure Experts

Most developers just want to write code. They do not want to learn how containers work inside. With lighter tools, people often have to deal with low-level details such as:

  • CNI networking
  • Container permissions
  • Volume mounting rules
  • System-level settings

Docker Desktop avoids most of that.

For example, running Podman with Docker compatibility:

podman system service --time=0 &
export DOCKER_HOST=unix:///run/user/1000/podman/podman.sock

This works. But it adds mental overhead.

Docker Desktop avoids this. The standard Docker commands just work:

docker build -t myapp .
docker run -p 8080:8080 myapp

No extra setup. No environment variables. No special services.

Reducing Friction Improves Team Performance

Small frustrations add up. When developers constantly deal with:

  • Broken containers
  • Network conflicts
  • Missing volumes
  • Permission errors

They lose focus. Docker Desktop reduces these daily annoyances. Fewer interruptions mean better concentration and better output. Over time, this has a real impact on team performance.

Supporting Different Skill Levels

Not every team member has the same experience. Some are senior engineers. Some are new graduates. Some are switching roles. Docker Desktop works well for all of them because:

  • Beginners can start quickly
  • Experienced developers can customize
  • Everyone uses the same base setup

Keeping the Focus on Building Products

Most developers open their editor because they want to work on features or fix bugs. They are not trying to become experts in container networking or virtual machines.

When the local setup breaks, work stops. People start searching logs, restarting services, and asking in Slack why nothing runs anymore. Sometimes this takes minutes. Sometimes it takes half a day.

Docker Desktop usually prevents this. After installation, it just works in the background. Containers run, ports open, files mount, and developers can keep coding.

Integration Beats Modularity

On paper, modular setups look great. You choose your own runtime, your own networking layer, your own Kubernetes tool, and connect everything the way you like.

In real projects, this usually turns into extra work.

Most teams are not trying to build the perfect local platform. They just want an environment that works every morning when they open their laptop.

This is where Docker Desktop has a big advantage. It gives you a complete setup without asking you to assemble it yourself.

You Install Once and Start Working

With many alternatives, setting up a local environment means installing several tools.

A typical setup might include:

  • A container runtime
  • A Kubernetes tool
  • Extra networking configuration
  • Image management utilities
  • Security scanners

Each one has its own updates and quirks.

With Docker Desktop, you install one application, and most of this is already there. You open it, sign in, and start running containers. For most developers, simplicity matters more than flexibility.

Local Kubernetes Without the Headaches

Running Kubernetes locally is useful, but it is rarely smooth. If you use separate tools, you often need to:

  • Create clusters manually
  • Switch contexts
  • Fix networking issues
  • Debug strange errors

For example:

kind create cluster
kubectl config use-context kind-kind

It works, until something breaks.

With Docker Desktop, Kubernetes is built in. You enable it once in the settings, and it stays there. Your usual kubectl commands just work.
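Once enabled, pointing kubectl at the bundled cluster is a one-liner. The context name below is the default one Docker Desktop creates:

```shell
# Switch kubectl to Docker Desktop's built-in cluster
kubectl config use-context docker-desktop

# Confirm the single-node cluster is running
kubectl get nodes
```

No cluster creation, no manual networking — the context is just there after you flip the setting.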

Image Management That Does Not Get in the Way

Over time, local machines fill up with images. Old builds. Unused layers. Test containers that never got removed. With command-line tools, cleaning this up is easy to forget.

Docker Desktop shows everything in one place. You can see what is running, what is taking space, and what can be deleted.
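The same housekeeping is available from the standard Docker CLI, so it can also be scripted. A typical cleanup pass looks like this:

```shell
# Show disk usage for images, containers and volumes
docker system df

# Remove stopped containers, unused networks and dangling images
docker system prune -f

# Go further: also remove any image not used by a container
docker image prune -a -f
```

The dashboard makes the problem visible; these commands make the fix repeatable.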

Security Where Developers Actually See It

In many teams, security tools live somewhere else.

  • You push an image.
  • Someone runs a scan.
  • You get a report later.

By then, you are already working on something else. Docker Desktop brings security closer to daily work.

You can run checks directly:

docker scout quickview myapp:latest

No portals. No extra logins. No waiting. Because it is easy to use, people actually use it.

Tools That Fit Into Daily Work

Docker Desktop extensions are not about fancy features. They are about small conveniences.

Things like:

  • Viewing logs
  • Managing databases
  • Checking resource usage
  • Inspecting containers

Instead of opening five different apps, you stay in one place. It saves mental energy.

Custom Environments Become a Burden

Some teams build their own local platforms. At first, it feels powerful. Over time, it becomes a maintenance job.

  • New hires struggle to set things up.
  • Docs grow longer.
  • Small changes break old scripts.
  • Support requests increase.

Eventually, someone becomes “the person who knows how the setup works”. That is not a good sign. Docker Desktop avoids this by providing a consistent base environment for everyone.

Standard Setups Make Teams Faster

When everyone runs the same tools:

  • Bugs are easier to reproduce
  • Issues are easier to explain
  • Help is easier to give

No more “it works on my machine” discussions. Less time is spent fixing laptops. More time is spent building products.

Cross-Platform Consistency Matters

Most development teams work on more than one operating system. Some developers use macOS, some use Windows, and others use Linux. This is normal in modern teams.

The problem starts when the same project behaves differently on each system. Containers may run fine on one laptop and fail on another. File permissions, networking rules, and system paths often cause small but repeated issues.

Consistent Behavior Across Systems

Docker Desktop helps by giving developers a similar container environment on macOS, Windows, and Linux. This means teams can use the same tools and commands, no matter which system they work on.

Without a shared platform, container tools depend heavily on the host system. Linux runs containers directly. macOS and Windows use virtual machines. Because of this, storage, networking, and performance often behave differently.

These differences force teams to create separate setup guides for each system. This adds extra work and confusion. Docker Desktop handles these system-level differences internally. It creates a stable and predictable environment that works the same way across platforms.

When developers run:

docker compose up

They usually get the same result on every machine.

Fewer Environment-Related Bugs

Many development problems are caused by local setup differences, not by faulty code. A feature may work on one laptop but fail on another. A build may succeed for one developer and fail for someone else. These issues are difficult to track down because they are inconsistent and hard to reproduce.

By standardizing the local runtime, Docker Desktop reduces these kinds of problems. When environments are similar, bugs can be reproduced more easily and fixed faster. This leads to more stable development and fewer delays caused by local setup issues.

Faster and Simpler Onboarding

New developers are most productive when they can run projects quickly. In complex setups, onboarding often involves:

  • Installing multiple tools
  • Editing system files
  • Debugging permissions
  • Fixing network settings

This delays real work. Docker Desktop simplifies this process. Most projects can be started with a few commands:

git clone repo
cd repo
docker compose up

New team members can focus on learning the codebase instead of fixing their environment.

Lower Support and Maintenance Effort

Supporting many different local setups increases internal support work. Platform teams must troubleshoot problems that only appear on specific systems. This consumes time that could be spent improving infrastructure.

With Docker Desktop, most developers use the same base platform. Support teams can reproduce issues more easily and provide faster solutions. This reduces long-term maintenance costs.

Easier Collaboration Between Developers

When environments are consistent, developers can share commands, scripts, and workflows without special instructions. A build command or test script usually works for everyone:

docker build -t app .
docker run -p 3000:3000 app

This improves collaboration and reduces misunderstandings.

Stability During System Updates

Operating system updates often affect development tools. Custom setups may break after upgrades, forcing developers to reconfigure their machines.

Docker Desktop manages its own runtime environment. Updates are handled internally, and compatibility is maintained. This helps teams avoid repeated setup work after system changes.

The Productivity Argument

Most teams are not slowed down by container speed. They are slowed down by broken setups and time spent fixing local issues. When the environment does not work, even small changes take longer.

Docker Desktop avoids many of these problems. It works out of the box and usually keeps working. Developers can run and test their code without constantly fixing configs.

Conclusion

Most developers just want their project to run. They do not want to fight with networking, permissions, or containers after every update.

With Docker Desktop, you install it, start your containers, and move on. Other tools can work, but they often need more setup and maintenance. Someone always ends up fixing things. Docker Desktop removes most of that work. That is why many teams still use it. It saves time and keeps people focused on building software.
