2025-11-05 00:02:55
How are you, hacker?
🪐 What’s happening in tech today, November 4, 2025?
The HackerNoon Newsletter brings the HackerNoon homepage straight to your inbox. On this day, the first car-free Sunday happened in the Netherlands in 1973, Barack Obama was elected as America's first Black President in 2008, the tomb of Tutankhamen was discovered in the Valley of the Kings in 1922, and we present you with these top quality stories.

By @hacker-Antho [ 5 Min read ] Researchers used a technique called concept injection to test whether AI can notice its own internal states. Read More.
🧑💻 What happened in your world this week?
It's been said that writing can help consolidate technical knowledge, establish credibility, and contribute to emerging community standards. Feeling stuck? We got you covered ⬇️⬇️⬇️
ANSWER THESE GREATEST INTERVIEW QUESTIONS OF ALL TIME
We hope you enjoy this worth of free reading material. Feel free to forward this email to a nerdy friend who'll love you for it. See you on Planet Internet! With love, The HackerNoon Team ✌️

2025-11-05 00:00:06
Experimental results and 5.1. Experiment Setup
Conclusion and future work and References
\
Supplementary Material
Instance-incremental learning (IIL) focuses on learning continually with data of the same classes. Compared to class-incremental learning (CIL), IIL is seldom explored because it suffers less from catastrophic forgetting (CF). However, besides retaining knowledge, in real-world deployment scenarios where the class space is always predefined, continual and cost-effective model promotion with the potential unavailability of previous data is a more essential demand. Therefore, we first define a new and more practical IIL setting as promoting the model’s performance, besides resisting CF, with only new observations. Two issues have to be tackled in the new IIL setting: 1) the notorious catastrophic forgetting because of no access to old data, and 2) broadening the existing decision boundary to new observations because of concept drift. To tackle these problems, our key insight is to moderately broaden the decision boundary to failure cases while retaining the old boundary. Hence, we propose a novel decision boundary-aware distillation method that consolidates knowledge to the teacher to ease the student's learning of new knowledge. We also establish benchmarks on the existing datasets Cifar-100 and ImageNet. Notably, extensive experiments demonstrate that the teacher model can be a better incremental learner than the student model, which overturns previous knowledge distillation-based methods that treat the student as the main model.
In recent years, many excellent deep-learning-based networks have been proposed for a variety of tasks, such as image classification, segmentation, and detection. Although these networks perform well on the training data, they inevitably fail in real-world applications on some new data that they were not trained on. Continually and efficiently promoting a deployed model’s performance on these new data is an essential demand. The current solution of retraining the network using all accumulated data has two drawbacks: 1) with the increasing data size, the training cost gets higher each time, for example, more GPU hours and a larger carbon footprint [20], and 2) in some cases the old data is no longer accessible because of privacy policies or a limited budget for data storage. In the case where little or no old data is available or utilized, retraining the deep learning model with new data always causes performance degradation on the old data, i.e., the catastrophic forgetting (CF) problem. To address the CF problem, incremental learning [4, 5, 22, 29], also known as continual learning, has been proposed. Incremental learning significantly promotes the practical value of deep learning models and is attracting intense research interest.
\

\ According to whether the new data comes from seen classes, incremental learning can be divided into three scenarios [16, 17]: instance-incremental learning (IIL) [3, 16], where all new data belongs to the seen classes; class-incremental learning (CIL) [4, 12, 15, 22], where new data has different class labels; and hybrid-incremental learning [6, 30], where new data consists of new observations from both old and new classes. Compared to CIL, IIL is relatively unexplored because it is less susceptible to CF. Lomonaco and Maltoni [16] reported that fine-tuning a model with early stopping can tame the CF problem well in IIL. However, this conclusion does not always hold when there is no access to the old training data and the new data is much smaller than the old data, as depicted in Fig. 1. Fine-tuning often results in a shift in the decision boundary rather than an expansion of it to accommodate new observations. Besides retaining old knowledge, real deployment is more concerned with efficient model promotion in IIL. For instance, in the defect detection of industrial products, classes of defects are always limited to known categories, but the morphology of those defects varies from time to time. Failures on those unseen defects should be corrected promptly and efficiently to prevent defective products from flowing into the market. Unfortunately, existing research primarily focuses on retaining knowledge of old data rather than enriching the knowledge with new observations.
\ In this paper, to quickly and cost-effectively enhance a trained model with new observations of seen classes, we first define a new IIL setting as retaining the learned knowledge as well as promoting the model’s performance on new observations without access to old data. In simple words, we aim to promote the existing model by leveraging only the new data and attain a performance that is comparable to that of the model retrained with all accumulated data. The new IIL is challenging due to the concept drift [6] caused by the new observations, such as color or shape variation compared to the old data. Hence, two issues have to be tackled in the new IIL setting: 1) the notorious catastrophic forgetting because of no access to old data, and 2) broadening the existing decision boundary to new observations.
\ To address the above issues in the new IIL setting, we propose a novel IIL framework based on the teacher-student structure. The proposed framework consists of a decision boundary-aware distillation (DBD) process and a knowledge consolidation (KC) process. The DBD allows the student model to learn from new observations with awareness of the existing inter-class decision boundaries, which enables the model to determine where to strengthen its knowledge and where to retain it. However, the decision boundary is untraceable when there are insufficient samples located around the boundary because of no access to the old data in IIL. To overcome this, we draw inspiration from the practice of dusting the floor with flour to reveal hidden footprints. Similarly, we introduce random Gaussian noise to pollute the input space and manifest the learned decision boundary for distillation. While training the student model with boundary distillation, the updated knowledge is further consolidated back to the teacher model intermittently and repeatedly with the EMA mechanism [28]. Utilizing the teacher model as the target model is a pioneering attempt, and its feasibility is explained theoretically.
\ According to the new IIL setting, we reorganize the training set of some existing datasets commonly used in CIL, such as Cifar-100 [11] and ImageNet [24], to establish the benchmarks. The model is evaluated on the test data as well as the non-available base data in each incremental phase. Our main contributions can be summarized as follows: 1) We define a new IIL setting to seek fast and cost-effective model promotion on new observations and establish the benchmarks; 2) We propose a novel decision boundary-aware distillation method to retain the learned knowledge as well as enrich it with new data; 3) We creatively consolidate the learned knowledge from the student to the teacher model to attain better performance and generalizability, and prove the feasibility theoretically; and 4) Extensive experiments demonstrate that the proposed method accumulates knowledge well with only new data, while most existing incremental learning methods fail.
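\ As a rough illustration of the two ingredients named above (Gaussian noise on the inputs and EMA consolidation from student to teacher), the following is a minimal sketch and not the authors' implementation. The dictionary-of-arrays weight format, the `momentum` and `sigma` values, and the function names are all assumptions made for the example.

```python
import numpy as np

def add_gaussian_noise(x, sigma, rng):
    """Return a copy of the inputs polluted with zero-mean Gaussian noise (sigma is an assumed scale)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def ema_update(teacher, student, momentum=0.99):
    """Consolidate student weights into the teacher with an exponential moving average."""
    for name in teacher:
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student[name]
    return teacher

rng = np.random.default_rng(0)

# Hypothetical weights: one parameter tensor per model, stored by name.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}

# Repeated consolidation moves the teacher gradually toward the student.
for _ in range(10):
    ema_update(teacher, student, momentum=0.9)

# Noisy copies of a batch, as would be fed to both models for distillation.
noisy_batch = add_gaussian_noise(np.zeros((4, 3)), sigma=0.1, rng=rng)
```

A momentum close to 1 makes the teacher change slowly, which is the usual reason for using an EMA when the target model should stay stable.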
\
:::info This paper is available on arxiv under CC BY-NC-ND 4.0 Deed (Attribution-Noncommercial-Noderivs 4.0 International) license.
:::
:::info Authors:
(1) Qiang Nie, Hong Kong University of Science and Technology (Guangzhou);
(2) Weifu Fu, Tencent Youtu Lab;
(3) Yuhuan Lin, Tencent Youtu Lab;
(4) Jialin Li, Tencent Youtu Lab;
(5) Yifeng Zhou, Tencent Youtu Lab;
(6) Yong Liu, Tencent Youtu Lab;
(7) Qiang Nie, Hong Kong University of Science and Technology (Guangzhou);
(8) Chengjie Wang, Tencent Youtu Lab.
:::
\
2025-11-04 23:27:22
From Classical Results into Differential Machine Learning
4.1 Risk Neutral Valuation Approach
4.2 Differential Machine learning: building the loss function
Simulation-European Call Option
Consider the Black-Scholes model:
\

\
This experiment begins by shorting a European call option with maturity T. The derivative will be hedged by trading the underlying asset. A ∆ hedging strategy is considered, and the portfolio consisting of positions on the underlying is rebalanced weekly, according to the newly computed ∆ weights.
\
\

\ \ A simulation of the evolution of Z across n paths is conducted, producing n PnL values. A histogram is used to visualize the distribution of the PnL values across paths. The different methods are then subjected to this experiment, and the results are compared to the Black-Scholes case.
\ The PnL values are reported relative to the portfolio value at period 0, which is the premium of the sold European call option. The relative hedging error is measured by the standard deviation of the histogram produced. This metric is widely used in evaluating the performance of different models, as in Frandsen et al., 2022.
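\ The experiment described above can be sketched for the Black-Scholes reference case as follows. This is a minimal sketch under stated assumptions, not the paper's code: the parameter values (`S0`, `K`, `r`, `sigma`, `T`, 52 weekly steps, 5000 paths), the use of the risk-free drift in the simulation, and all function names are chosen only for illustration.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)  # element-wise error function without SciPy

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / np.sqrt(2.0)))

def bs_call(S, K, r, sigma, tau):
    """Black-Scholes price and delta of a European call with time to maturity tau > 0."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    price = S * norm_cdf(d1) - K * np.exp(-r * tau) * norm_cdf(d2)
    return price, norm_cdf(d1)

def relative_hedging_pnl(n_paths=5000, n_steps=52, S0=100.0, K=100.0,
                         r=0.0, sigma=0.2, T=1.0, seed=0):
    """PnL of a short call hedged at each step, relative to the premium at period 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0)
    premium, delta = bs_call(S, K, r, sigma, T)
    cash = premium - delta * S              # receive premium, buy delta units of the asset
    for k in range(1, n_steps + 1):
        z = rng.standard_normal(n_paths)
        S = S * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
        cash = cash * np.exp(r * dt)        # cash account accrues interest
        if k < n_steps:                     # rebalance to the newly computed delta
            _, new_delta = bs_call(S, K, r, sigma, T - k * dt)
            cash = cash - (new_delta - delta) * S
            delta = new_delta
    pnl = cash + delta * S - np.maximum(S - K, 0.0)
    return pnl / premium

rel_pnl = relative_hedging_pnl()
hedging_error = rel_pnl.std()               # relative hedging error
```

A histogram of `rel_pnl` gives the distribution discussed in the text, and replacing `bs_call` with a learned model's delta would give the corresponding comparison for that model.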
\
7.3.1 Monomial Basis
\
\

\ \ 7.3.2 Neural Network Basis
\ Alternatively, the regression can be conducted using a parametric basis, such as a neural network, where:
\
\

\ \ The architecture of the neural network is crucial, and a multi-layer neural network with l > 1 is preferred, as supported by Proposition 5.3 and empirical studies in various applications. In this example, the layer dimension is set to l = 4, which can be further fine-tuned for explanatory power. The back-propagation algorithm is used to update the weights and biases after each epoch, achieved by minimizing the loss function with respect to the inner parameters through stochastic gradient descent, as in Kingma and Ba, 2014.
\
From proposition 3.4, we can infer, given the sample ((x1, z1), . . . , (xn, zn)), that the loss function with respect to the training sample is:
\
\

\
\
\
\
\

\ \ By computing qi through the simulation of at least two different paths, applying the indicator function, and averaging the resulting quantities, the ∆ hedging estimate is obtained. This method allows for efficient and accurate hedging strategies, making it a valuable tool in the field of mathematical finance.
\ 7.4.1 Neural Network basis
\ All the details of this implementation can be found in Huge and Savine, 2020. Examining equation (30), the first part is the same loss function as in the LSMC case. Still, the second part constitutes the mean square difference between the differential labels and the derivative of the entire neural network with respect to price. So, we need to obtain the derivative of the feed-forward neural network. Feed-forward neural networks are efficiently differentiated by backpropagation.
\ Then recapitulating the feed-forward equations:
\
\

\ \ Recall that the inputs are states and the predictors are prices for the first part, hence, these differentials are predicted risk sensitivities, obtained by differentiation of the line above, in the reverse order:
\
\

\ \ Then the implementation can be divided into the following two steps:
\
The neural network for the standard feed-forward equations (35)-(37) is built, paying careful attention to using the software's functionality to store all intermediate values. The neural network architecture will comprise 4 hidden layers, that is, a multi-layer structure as prescribed in section 6.2.1. Note that the activation function needs to be differentiable in order for equations (38)-(40) to be applied, so, following Huge and Savine, 2020, a soft-plus function was chosen.
\
Implement as a standard function in Python the equations (35)-(37). Note that the intermediate values stored before are going to be the domain of this function.
\
Combine both functions in a single function, named the Twin Tower.
\
Train the Twin Tower with respect to the loss in equation (14).
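\ The steps above can be sketched in plain NumPy. Since the feed-forward and differentiation equations are not reproduced in this text, the sketch assumes the standard formulation in Huge and Savine, 2020: a linear output layer, soft-plus activations on the hidden layers, and differentials obtained by traversing the stored pre-activations in reverse order. The layer widths, the weight initialisation, the weight `lam` on the differential term, and all function names are illustrative assumptions. The training loop itself is omitted.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    """Derivative of the soft-plus activation."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(sizes, rng):
    """One (weights, bias) pair per layer; sizes lists the layer widths."""
    return [(rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Feed-forward pass that stores every pre-activation for later differentiation."""
    zs = [x]
    a = x                                   # the input layer has no activation
    for W, b in params:
        z = a @ W + b
        zs.append(z)
        a = softplus(z)                     # feeds the next layer; unused after the output layer
    return zs[-1], zs

def backward(params, zs):
    """Differentials of the output with respect to the inputs, in reverse order."""
    zbar = np.ones_like(zs[-1])
    for l in range(len(params) - 1, 0, -1):
        W, _ = params[l]
        zbar = (zbar @ W.T) * sigmoid(zs[l])
    W0, _ = params[0]
    return zbar @ W0.T

def twin_tower(params, x):
    """Combined function: predicted values and predicted differentials."""
    y, zs = forward(params, x)
    return y, backward(params, zs)

def differential_loss(params, x, y_label, dydx_label, lam=1.0):
    """Mean squared error on values plus lam times mean squared error on differentials."""
    y, dydx = twin_tower(params, x)
    return np.mean((y - y_label) ** 2) + lam * np.mean((dydx - dydx_label) ** 2)

rng = np.random.default_rng(0)
params = init_params([1, 8, 8, 8, 8, 1], rng)   # 4 hidden layers of assumed width 8
x = rng.uniform(0.5, 1.5, size=(5, 1))
y, dydx = twin_tower(params, x)
```

Storing the pre-activations in `forward` is what lets `backward` reuse them, which mirrors the remark that the stored intermediate values are the domain of the second function.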
\
:::info Author:
(1) Pedro Duarte Gomes, Department of Mathematics, University of Copenhagen.
:::
:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.
:::
\
2025-11-04 23:18:59
A parametric basis can be thought of as a set of functions made up of linear combinations of relatively few basis functions with a simple structure and depending non-linearly on a set of “inner” parameters e.g., feed-forward neural networks with one hidden layer and linear output activation units. In contrast, classical approximation schemes do not use inner parameters but employ fixed basis functions, and the corresponding approximators exhibit only a linear dependence on the external parameters.
\ However, experience has shown that optimization of functionals over a variable basis such as feed-forward neural networks often provides surprisingly good suboptimal solutions.
\ A well-known functional-analytic fact is that, by employing the Stone-Weierstrass theorem, it is possible to construct several examples of fixed bases, such as the monomial basis, that are dense in the space of continuous functions, whose completion is L2. The limitations of fixed bases are well studied and can be summarized as follows.
\

\ The variance-bias trade-off can be translated into two major problems:
\
Underfitting happens due to the fact that high bias can cause an algorithm to miss the relevant relations between features and target outputs. This happens with a small number of parameters. In the previous terminology, that corresponds to a low d value (see Equation 4).
\
The variance is an error of sensitivity to small fluctuations in the training set. It is a measure of spread or variation in our predictions. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs, which is known as overfitting. This, in turn, happens with a high number of parameters. In the previous terminology, that corresponds to a high d value (see Equation 4).
\ The following result summarizes the problem discussed. I will state it as in Barron, 1993, and the proof can be found in Barron, 1993 and Gnecco et al., 2012.
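\ The trade-off can be made concrete with a small experiment on a fixed monomial basis. The sketch below is an illustration only: the target function, noise level, sample sizes and degrees are assumptions, and the polynomial degree plays the role of d.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def fit_errors(degree, n_train=30, n_test=1000, noise=0.3, seed=0):
    """Fit a polynomial of the given degree and return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)
    target = lambda x: np.sin(2.0 * np.pi * x)      # assumed ground-truth function
    x_tr = rng.uniform(0.0, 1.0, n_train)
    y_tr = target(x_tr) + noise * rng.standard_normal(n_train)
    x_te = rng.uniform(0.0, 1.0, n_test)
    y_te = target(x_te) + noise * rng.standard_normal(n_test)
    coefs = P.polyfit(x_tr, y_tr, degree)           # least squares on the monomial basis
    train_mse = np.mean((P.polyval(x_tr, coefs) - y_tr) ** 2)
    test_mse = np.mean((P.polyval(x_te, coefs) - y_te) ** 2)
    return train_mse, test_mse

# Low degree: high bias. High degree: the training error keeps falling,
# while the test error is driven by variance.
results = {d: fit_errors(d) for d in (1, 5, 12)}
for d, (train_mse, test_mse) in results.items():
    print(f"degree={d:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The training error can only decrease as the degree grows, because the lower-degree fits are nested in the higher-degree ones. The test error is the quantity that exposes underfitting at small d and overfitting at large d.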
\

\

\ So, there is a need to study the class of bases that can adjust to the data. That is the case with the parametric basis.
\

\ From Hornik et al., 1989, we find the following relevant results:

\ The flexibility and approximation power of neural networks make them an excellent choice as the parametric basis.
\ 6.2.1 Depth
In practical applications, it has been noted that a multi-layer neural network outperforms a single-layer neural network. This is still a question under investigation, since state-of-the-art mathematical theories cannot account for the comparative success of multi-layer networks. However, it is possible to create some counter-examples where the single-layer neural network would not approach the target function, as in the following proposition:

\ Therefore, it is beneficial, or at least risk-averse, to select a multi-layer feed-forward neural network instead of a single-layer feed-forward neural network.
\ 6.2.2 Width
This section draws inspiration from the works of Barron, 1993 and Telgarsky, 2020. Its primary objective is to investigate the approximating capabilities of a neural network based on the number of nodes or neurons. I provide some elaboration on this result, since it is not so well known and, unlike in Barron, 1994, it does not require any assumption regarding the activation function.
\

\ This sampling procedure correctly represents the mean as:

\

\ As the number of nodes, d, increases, the approximation capability improves. This result, contrary to Proposition 5.1, establishes an upper bound that is independent of the dimension of the target function. By comparing both theorems, it can be argued that there is a clear advantage for feed-forward neural networks when d > 2 for d ∈ N.
\
:::info Author:
(1) Pedro Duarte Gomes, Department of Mathematics, University of Copenhagen.
:::
:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.
:::
\
2025-11-04 23:10:16
The following section discusses two methods: the Least Squares Monte Carlo and the Differential Machine Learning method. The aim is to construct a loss function under the assumption of risk neutrality. We will derive the corresponding loss function and establish two analogous propositions, illustrating the power of the theoretical construction presented earlier. This unique perspective allows for a comprehensive comparison of the two methods, which has not been explored in the existing literature.
\ As in Cox, 2000, the valuation of a derivative contract involves the functional g0,t representing the expected conditional value:
\

\ The objective is to compute the expectation of X conditional on information at time t. Note that Q is not necessarily unique, meaning there might not be market completeness.
\
The main interest lies in estimating equation (3) to obtain the price of the derivative. This result contributes to the construction of the Differential Machine Learning approach. The following result is widely known in the literature and can be revisited by the reader in Pelsser and Schweizer, 2016.
\

\ The proposition that characterizes the Differential Machine learning method is originally stated and proven in the current document. Its proof is essential to understand the implementation of this method, such as how to compute unbiased estimates of the labels for the differentials.
\

\ The ∆ would be computed as:

\ which measures the derivative price sensitivity in relation to oscillations of the underlying.
\ If the differentiation could go inside the conditional expectation, an estimator for ∆ can be created through discretization methods. That would imply that H would need to be restricted to the differentiable elements. An analogous proof to the one of the proposition would be achieved if the space with differentiable elements that have is a separable Hilbert space.
\ However, pay-off functions are non-differentiable or even discontinuous as in the case of the European standard options and digital options respectively. So a broader concept of derivative needs to be defined. Generalized function theory allows a less abstract approach to weak derivative definition compared to distribution theory as in Rudin, 1974.
\ To ensure the unbiasedness of the ∆ estimator, there is a need for a result that supports the passage of the derivative inside the expectation. The following result is not very well known and can be found in the seventh chapter of Jones, 1966.
\

\ The pay-off function is assumed to be locally integrable, which is to be expected in financial theory, since the conditional mean of the pay-offs across time is the price of the derivative product, given its idiosyncratic characteristics. This solves the main problem in Broadie and Glasserman, 1996, by only requiring the pay-off function to be locally integrable.
\ Applying proposition 3.4 to equation 8:

\ The aim now is to construct a space that accounts for the ∂g(Z, T), so that it defines a separable Hilbert space.
\ This leads to the theory of Sobolev spaces, since these account for weak derivatives (derivatives in the distribution sense):
\

\ It is well known that the space above is a Sobolev space and hence a complete space, as the reader can verify in Rudin, 1974 and Leoni, 2017.
\ Now we are in the position to prove Proposition 4.3.
\

\ In equation (7), the second term accounts for the training with respect to ∆, mathematically the shape of the function.
\

\ Before delving into the procedure of building a fixed basis and how it compares with a variable basis, an illustration of an application of Proposition 4.4 is contemplated in the following example.
\
Example 5.1 In this example, let’s verify proposition 4.4 in an application to digital options. The aim is to compute ∆. A digital option is a form of option that allows traders to manually set a strike price. The digital option provides traders with a fixed payout when the market price of the underlying asset exceeds the strike price.
\ So the pay-off function can be written as:

\ where K is the strike price, fixed when the contract is agreed. Without loss of generality, assume that r = 0, then:

\ Taking the derivative inside the expectation :

\ where the last step uses the fact that a weak derivative of the Heaviside function is the Dirac delta. Considering the distribution density to be f, whose cumulative distribution is expressed by F, then:

\ Going the other way around:

\ Then, it can be concluded that:

\ Now, a question could be posed about how the differential labels can be found. According to this example so far, we would need to know the density f a priori. However, in a scenario where we are presented with data, and not a simulation, an estimator of the expectation of the Dirac delta function on a sample would need to be found. Since the Dirac delta would be infinite at a single point, an approximation can be considered. See in Jones, 1966 that the integral of the Dirac delta against a test function or a good function is:
\

\ Choosing a large N
\

\ we finally get an integral we can discretize by the usual numerical methods. This procedure can be applied to any pay-off function, hence to any derivative product, solving the problem in Broadie and Glasserman, 1996.
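\ For the digital option of Example 5.1, the procedure can be checked numerically. The approximating sequence is not reproduced in this text, so the sketch below assumes a Gaussian sequence, sqrt(N/π) exp(−N x²), as the smooth stand-in for the Dirac delta. It also assumes a Black-Scholes underlying with r = 0 so that a closed-form ∆ is available for comparison. The parameter values and function names are illustrative.

```python
import numpy as np

def nascent_delta(x, N):
    """Assumed smooth approximation of the Dirac delta; it concentrates at 0 as N grows."""
    return np.sqrt(N / np.pi) * np.exp(-N * x**2)

def digital_delta_mc(S0=100.0, K=100.0, sigma=0.2, T=1.0, N=4.0,
                     n_paths=200_000, seed=0):
    """Average of pathwise differential labels for the pay-off 1{S_T > K}, with r = 0."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    S_T = S0 * np.exp(-0.5 * sigma**2 * T + sigma * np.sqrt(T) * z)
    # Weak derivative of the Heaviside pay-off times dS_T/dS0 = S_T / S0.
    labels = nascent_delta(S_T - K, N) * S_T / S0
    return labels.mean()

def digital_delta_exact(S0=100.0, K=100.0, sigma=0.2, T=1.0):
    """Closed-form Black-Scholes delta of the digital call with r = 0."""
    d2 = (np.log(S0 / K) - 0.5 * sigma**2 * T) / (sigma * np.sqrt(T))
    return np.exp(-0.5 * d2**2) / np.sqrt(2.0 * np.pi) / (S0 * sigma * np.sqrt(T))

estimate = digital_delta_mc()
exact = digital_delta_exact()
```

Increasing N reduces the smoothing bias but raises the variance of the labels, so in practice N is chosen relative to the number of paths.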
\
:::info Author:
(1) Pedro Duarte Gomes, Department of Mathematics, University of Copenhagen.
:::
:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.
:::
\
2025-11-04 23:00:32
:::info Author:
(1) Pedro Duarte Gomes, Department of Mathematics, University of Copenhagen.
:::
This article introduces the groundbreaking concept of the financial differential machine learning algorithm through a rigorous mathematical framework. Diverging from existing literature on financial machine learning, the work highlights the profound implications of theoretical assumptions within financial models on the construction of machine learning algorithms.
\ This endeavour is particularly timely as the finance landscape witnesses a surge in interest towards data-driven models for the valuation and hedging of derivative products. Notably, the predictive capabilities of neural networks have garnered substantial attention in both academic research and practical financial applications.
\ The approach offers a unified theoretical foundation that facilitates comprehensive comparisons, both at a theoretical level and in experimental outcomes. Importantly, this theoretical grounding lends substantial weight to the experimental results, affirming the differential machine learning method’s optimality within the prevailing context.
\ By anchoring the insights in rigorous mathematics, the article bridges the gap between abstract financial concepts and practical algorithmic implementations.
\
Differential Machine Learning, Risk Neutral valuation, Derivative Pricing, Hilbert Spaces Orthogonal Projection, Generalized Function Theory
\
Within the dynamic landscape of financial modelling, the quest for reliable pricing and hedging mechanisms persists as a pivotal challenge. This article aims to introduce an encompassing theory of pricing valuation uniquely rooted in the domain of machine learning. A primary focus lies in overcoming a prominent hurdle encountered in implementing the differential machine learning algorithm, specifically addressing the critical need for unbiased estimation of differential labels from data sources, as highlighted in studies by Huge (2020) and Broadie (1996). This breakthrough holds considerable importance for contemporary practitioners across diverse institutional settings, offering tangible solutions and charting a course toward refined methodologies. Furthermore, this endeavour not only caters to the immediate requirements of practitioners but also furnishes invaluable insights that can shape forthcoming research endeavours in this domain.
\ The article sets off from the premise that the pricing and hedging functions can be thought of as elements of a Hilbert space, in a similar way to Pelsser and Schweizer, 2016. A natural extension of these elements across time, originally attained in the current article, is accomplished by the Hahn-Banach extension theorem, an extension that translates into an improvement of the functional through the incorporation of the accumulating information. This functional-analytic approach conveys the necessary level of abstraction to justify and discuss the different possibilities for implementing the financial models contemplated in Huge and Savine, 2020 and Pelsser and Schweizer, 2016. So, a bridge will be built from the deepest theoretical considerations to the practicality of the implementations, keeping mathematical rigour as a goal in the exposition of the arguments. Modelling in Hilbert spaces allows the problem to be reduced to two main challenges: the choice of a loss function and the choice of an appropriate basis function. A discussion of the virtues and limitations of the two main classes of basis functions will follow, mainly supported by the results in Hornik et al., 1989, Barron, 1993 and Telgarsky, 2020. A rigorous mathematical derivation of the loss functions for the two different risk-neutral methods will be presented, where the result for the second method is stated and proven originally in the current document. The two methods are the Least Squares Monte Carlo and the Differential Machine Learning, inspired by Pelsser and Schweizer, 2016 and Huge and Savine, 2020, respectively. It is noted that the first exposition of the Least Squares Monte Carlo method was given in Longstaff and Schwartz, 2001.
The derivation of the differential machine learning loss function using generalized function theory allows us to relax the assumptions of almost sure differentiability and almost sure Lipschitz continuity of the pay-off function in Broadie and Glasserman, 1996. Instead, the unbiased estimate of the derivative labels only requires the assumption of local integrability of the pay-off function, which it must clearly satisfy, given the financial context. This allows the creation of a technique to obtain estimates of the labels for any derivative product, solving the biggest limitation in Huge and Savine, 2020. The differential machine learning algorithm efficiently computes differentials as unbiased estimates of ground truth risks, irrespective of the transaction or trading book, and regardless of the stochastic simulation model.
\ The implementations are going to be completely justified by the arguments developed in the theoretical sections. The implementation of the differential machine learning method relies on Huge and Savine, 2020. The objective of this simulation is to assess the effectiveness of various models in learning the Black-Scholes model within the context of a European option contract. Initially, a comparison will be drawn between the prices and delta weights across various spot prices. Subsequently, the distribution of Profit and Loss (PnL) across different paths will be examined, providing the relative hedging errors metric. These will serve the purpose of illustrating theoretical developments.
\

\ Since the dual of a Hilbert space is itself a Hilbert space, g can be considered a functional.[3]
\ Considering the sequence of conditional Hilbert spaces:
\

\ Now the pricing or hedging functional incorporates the accumulated information from period 0 to period l.
\ This allows us to see that the increasing information shapes the function, which is commonly seen in statistical learning with the use of increasing training sets defined across time. [4]
\ We will begin by dwelling upon the problem of how to find function g, developing the theoretical statistical objects that are necessary for that aim. The aim is to estimate the pricing or hedging functions. So, a criterion needs to be established in the theoretical framework.
\ Let Z and X be two respectively d and p dimensional real-valued random variables, following some unknown joint distribution p(z, x). The expectation of the loss function associated with a predictor g can be defined as:
\

\ The objective is to find the element g ∈ H which achieves the smallest possible expected loss. Assume a certain parameter vector θ ∈ Θ, where Θ is a compact set in the Euclidean space. As the analytical evaluation of the expected value is impossible, a training sample (zi , xi) for i = 1, …, n drawn from p(z, x) is collected. An approximate solution to the problem can then be found by minimising the empirical approximation of the expected loss:
\
\

\
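Empirical loss minimisation can be illustrated with a squared loss and a small monomial basis. The sketch below is only an example under assumptions: the joint distribution of (Z, X), the quadratic conditional mean, the basis degree and the function names are not taken from the text.

```python
import numpy as np

def monomial_features(z, degree):
    """Design matrix with columns 1, z, z^2, ..., z^degree."""
    return np.vander(z, degree + 1, increasing=True)

def empirical_loss(theta, z, x, degree):
    """Sample average of the squared loss for the predictor g_theta(z)."""
    return np.mean((x - monomial_features(z, degree) @ theta) ** 2)

rng = np.random.default_rng(0)
n = 20_000

# Assumed data-generating process: E[X | Z = z] = z^2, observed with noise.
z = rng.standard_normal(n)
x = z**2 + 0.5 * rng.standard_normal(n)

# Least squares minimises the empirical squared loss over theta.
theta_hat, *_ = np.linalg.lstsq(monomial_features(z, 2), x, rcond=None)
fitted_loss = empirical_loss(theta_hat, z, x, 2)
```

With a squared loss, the minimiser of the expected loss is the conditional expectation, so `theta_hat` should approach the coefficients of z² as the sample grows.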
:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.
:::
[1] H represents all prior knowledge; common constraints for option pricing are non-negativity and positivity of the second-order derivatives.
\ [2] Even when Z is expressed as a diffusion, T is finite so the different paths could never display infinite variance
\ [3] This property is easily verified by building the map ϕ : H′ → H, defined by ϕ(v) = fv, where fv(x) = ⟨x, v⟩ for x ∈ H, which is an antilinear bijective isometry.
\ [4] The functional analytical results can be revisited by the reader in Rudin, 1974