
Why We Misjudge Our Own Effectiveness at Finding Software Bugs

2025-12-15 05:09:25

Table Of Links

Abstract

1 Introduction

2 Original Study: Research Questions and Methodology

3 Original Study: Validity Threats

4 Original Study: Results

5 Replicated Study: Research Questions and Methodology

6 Replicated Study: Validity Threats

7 Replicated Study: Results

8 Discussion

9 Related Work

10 Conclusions And References


2 Original Study: Research Questions And Methodology

2.1 Research Questions

The main goal of the original study is to assess whether participants' perceptions of their testing effectiveness with different techniques are good predictors of their real testing effectiveness. This goal translates into the following research question:

RQ1: Should participants' perceptions be used as predictors of testing effectiveness?

This question is further decomposed into:

– RQ1.1: What are participants’ perceptions of their testing effectiveness?

We want to know whether participants perceive a certain technique as more effective than the others.

– RQ1.2: Do participants' perceptions predict their testing effectiveness?

We want to assess whether the technique each participant perceives as most effective is actually the most effective for him/her.

– RQ1.3: Do participants find a similar number of defects with all techniques?

Choosing the most effective technique can be difficult if participants find a similar number of defects with two or all three techniques.

– RQ1.4: What is the cost of any mismatch?

We want to know whether the cost of not correctly perceiving the most effective technique is negligible and whether it depends on the technique perceived as most effective.

– RQ1.5: What is the expected project loss?

Taking into consideration that some participants will correctly perceive their most effective technique (mismatch cost 0), and others will not (mismatch cost greater than 0), we calculate the overall cost of (mis)match for all participants in the empirical study and check if it depends on the technique perceived as most effective.

2.2 Study Context and Ethics

We conducted a controlled experiment in which each participant applies three defect detection techniques (two testing techniques and one code review technique) on three different programs. For the testing techniques, participants report the generated test cases, later run a set of test cases that we have generated (instead of the ones they created), and report the failures found.

For code reading, they report the identified faults. At the end of the controlled experiment, each participant completes a questionnaire containing a question related to his/her perceptions of the effectiveness of the techniques applied. The course is graded based on their technique application performance (this guarantees a thorough application of the techniques).

The study is embedded in an elective 6-credit Software Verification and Validation course. The regular assessment (when the experiment does not take place) is as follows: students are asked to write a specification for a program that can be coded in about 8 hours. Specifications are later interchanged so that each student codes a different program from the one (s)he proposed.

Later, students are asked to individually perform (in successive weeks) code reading and white-box testing on the code they wrote. At this point, each student delivers the code to the person who wrote the specification, so that each student performs black-box testing on the program (s)he proposed. Note that this scenario requires more effort from the student (as (s)he is asked to first write a specification and then code a program, and these tasks do not take place when the study is run).

In other words, the students' workload during the experiment is smaller than the workload of the regular course assessment. The only activity that takes place during the experiment that is not part of the regular course is answering the questionnaire, which can be done in less than 15 minutes. Although the study causes changes in the workflow of the course, its learning goals are not altered.

All tasks required by the study, with the exception of completing the questionnaire, take place during the slots assigned to the course. Therefore, students spend no additional effort beyond attending lectures (which is mandatory in any case). Note that students are allowed to withdraw from the controlled experiment, although this would affect their course score; the same would happen when the experiment is not run.

If a student misses one assignment, (s)he would score 0 in that assignment and his/her course score would be affected accordingly. However, students are allowed to withdraw from the study without any penalty in their score, as the submission of the questionnaire is completely voluntary. No incentives are given to the students who submit the questionnaire. Submitting the questionnaire implies giving consent to participate in the study.

Students are aware that this is a voluntary activity intended for research, but they also get feedback. Students who do not submit the questionnaire are not considered in the study in any way, as they have not given consent to use their data. For this reason, they are not included in the quantitative analysis of the controlled experiment (even though their data are available for scoring purposes). The study is performed in Spanish, as it is the participants' mother tongue. Its main characteristics are summarised in Table 1.

2.3 Constructs Operationalization

Code evaluation technique is an experiment factor, with three treatments (or levels): equivalence partitioning (EP)—see Myers et al. [41], branch testing (BT)—see Beizer [6], and code reading by stepwise abstraction (CR)—see Linger [36].

The response variables are technique effectiveness, perception of effectiveness and mismatch cost. Technique effectiveness is measured as follows:

– For EP and BT, it is the percentage of faults exercised by the set of test cases generated by each participant. In order to measure the response variable, experimenters execute the test cases generated by each participant.

– For CR, we calculate the percentage of faults correctly reported by each participant (false positives are discarded).

Table 1 Description of the Experiment

Note that dynamic and code review techniques are not directly comparable as they are different technique types (dynamic techniques find failures and code review techniques find faults). However, the comparison is fair, as:

– Application time is not taken into account, and participants are given enough time to complete the assigned task.

– All faults injected are detectable by all techniques. Further details about faults, failures and their correspondence are given in Section 2.5.

Perception of effectiveness is gathered by means of a questionnaire with one question that reads: Using which technique did you detect the most defects? Mismatch cost is measured, for each participant, as the difference between the effectiveness obtained by the participant with the technique (s)he perceives as most effective and with the technique that is actually most effective for him/her. Note that participants neither know the total number of seeded faults nor which techniques are best for their colleagues or themselves.

This operationalization imitates the reality of testers, who lack such knowledge in real projects. Therefore, the perception is fully subjective (and made in relation to the other two techniques). Table 2 shows three examples of how mismatch cost is measured. Cells with a grey background show the technique for which the highest effectiveness is observed for the given participant.

Table 2 Measuring Mismatch Cost

The first row shows a situation where the participant perceives CR as most effective, but the most effective technique for him/her is EP. In this situation, there is a mismatch (misperception) and the associated cost is calculated as the difference in effectiveness between CR and EP. The second row shows a situation where the participant correctly perceives EP as the most effective technique for him/her. In this situation there is a match (correct perception) and, therefore, the associated mismatch cost is 0pp. The third row shows a situation where the participant perceives BT as the most effective technique for him/her, and BT and EP are tied as his/her most effective technique. In this situation we consider that there is a match (correct perception) and, therefore, the associated mismatch cost is 0pp.
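As an illustrative sketch only (not the authors' code, and with made-up numbers rather than the actual Table 2 values), the mismatch cost defined above can be computed per participant as follows:

# Minimal sketch: mismatch cost as defined in Section 2.3, i.e. the difference in
# percentage points between a participant's actually most effective technique and
# the one (s)he perceived as most effective.

def mismatch_cost(effectiveness, perceived_best):
    """effectiveness: dict mapping 'EP'/'BT'/'CR' to % of faults found."""
    best = max(effectiveness.values())
    # A tie between the perceived technique and the real best yields cost 0 (a match).
    return best - effectiveness[perceived_best]

# Three situations analogous to the rows of Table 2 (illustrative numbers):
print(mismatch_cost({"EP": 80.0, "BT": 40.0, "CR": 55.0}, "CR"))  # mismatch: 25.0 pp
print(mismatch_cost({"EP": 80.0, "BT": 40.0, "CR": 55.0}, "EP"))  # match: 0.0 pp
print(mismatch_cost({"EP": 70.0, "BT": 70.0, "CR": 30.0}, "BT"))  # tie counts as match: 0.0 pp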

2.4 Study Design

Testing techniques are applied by human beings, and no two people are the same. Due to the dissimilarities between the participants already existing prior to the experiment (degree of competences achieved in previous courses, innate testing abilities, etc.), there may exist variability between different participants applying the same treatment. Therefore, we opted for a crossover design, as described by Kuehl [34] (a within-subjects design, where each participant applies all three techniques, but different participants apply the techniques in a different order) to prevent dissimilarities between participants and technique application order from having an impact on results. The design of the experiment is shown in Table 3.
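As a rough illustration of the order-balancing idea behind a crossover design (a sketch under assumptions: the actual design in Table 3 also balances programs and may use a different number of groups), a Latin square of technique orders plus simple randomisation could be generated as follows:

# Sketch only: build a cyclic Latin square of technique orders and randomly
# assign participants to groups. Group and participant names are illustrative.
import random

TECHNIQUES = ["CR", "BT", "EP"]

def latin_square(items):
    """Each row is an application order; each technique appears once per position."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

orders = latin_square(TECHNIQUES)                   # e.g. CR-BT-EP, BT-EP-CR, EP-CR-BT
participants = [f"P{i:02d}" for i in range(1, 33)]  # 32 participants, as in the study
random.shuffle(participants)                        # simple randomisation
groups = [participants[i::3] for i in range(3)]

for order, members in zip(orders, groups):
    print(" -> ".join(order), ":", members)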

The experimental procedure takes place over seven weeks and is summarised in Table 4. During the first three weeks there are training sessions in which participants learn how to apply the techniques and practice with them. Training sessions take place twice a week (Tuesdays and Thursdays) and each one lasts 2 hours. Therefore, training takes 12 hours (2 hours/session x 2 sessions/week x 3 weeks). Participants are first taught the code review technique, then white-box and finally black-box testing. This training order is not prescribed by the experiment; it is simply the one we have found best meets the learning objectives of the course.

Table 3 Experimental Design

Table 4 Experimental Procedure

The following week there are no lectures, and students are asked to practice with the techniques. For this purpose, they are given 3 small programs in C (that contain faults) and are asked to apply a given technique on each program (all students apply the same technique on the same training program). The performance on these exercises is used for grading purposes. The other three weeks are experiment execution weeks. Each experiment execution session takes place once a week (Fridays) and lasts four hours.

This is equivalent to there being no time limit, as participants can complete the task in less time. Therefore, experiment execution takes 12 hours (4 hours/session x 1 session/week x 3 weeks). Training sessions take place during lecture hours and experiment execution sessions take place during laboratory hours. In the weeks in which there are lectures, there is no laboratory, and vice versa. The time used for the controlled experiment is the time assigned to the course in which the study is embedded.

No extra time is used. In each session, participants apply the techniques and, for equivalence partitioning and branch testing, run test cases too. They report the application of the technique, the generated test cases, and the failures found (for the testing techniques) or faults found (for the code review technique). At the end of the last experiment execution session (after applying the last technique), participants are surveyed about their perceptions of the techniques that they applied. They must return their answers before the following Monday, to guarantee that they remember as much as possible about the tasks performed.

2.5 Experimental Objects

Program is a blocking variable. It is not a factor, because the goal of the experiment is not to study the programs but the code evaluation techniques. However, it is a blocking variable because we are aware that the programs could be influencing the results. The experiment has been designed to cancel out the influence of the programs: every participant applies each technique on a different program, and each technique is applied on different programs (by different participants). Additionally, the program-by-technique interaction is later analysed. The experiment uses three similar programs written in C (used in other empirical studies of testing techniques, such as those performed by Kamsties & Lott [29] or Roper et al. [46]):

cmdline: parser that reads the input line and outputs a summary of its contents. It has 239 executable LOC and a cyclomatic complexity of 37.

nametbl: implementation of the data structure and operations of a symbol table. It has 230 executable LOC and a cyclomatic complexity of 27.

ntree: implementation of the data structure and operations of an n-ary tree. It has 215 executable LOC and a cyclomatic complexity of 31.

Appendix A shows a complete listing of the metrics gathered by the PREST tool [32] on the correct programs (before faults were injected). Although the purpose of each program is different, most of the metrics obtained by PREST are quite similar, except the Halstead metrics, which are greater for ntree. At the same time, cmdline is slightly larger and more complex than the other two.

Each program has been seeded with seven faults (some, but not all, are the same faults used in previous experiments run on these programs), and there are two versions of each faulty program. All faults are conceptually the same in all programs (e.g., a variable initialisation is missing). Some faults occurred naturally when the programs were coded, whereas others are typical programming faults. All faults:

– Cause observable failures.

– Can be detected by all techniques.

– Are chosen so that the programs fail only on some inputs.

– Do not conceal one another.

– Have a one-to-one correspondence with failures.

Note, however, that it is possible for a participant to generate two (or more) test cases that exercise the same seeded fault and therefore produce the same failure. Participants have been advised to report such failures (the same failure exercised by two or more different test cases) as a single one. For example, there is a fault in program ntree in the function in charge of printing the tree. This causes the failure that the tree is printed incorrectly. Every time a participant generates a test case that prints the tree (which is quite often, as this function is useful for checking the contents of the tree at any time), the failure will be shown.

Some examples of the seeded faults and their corresponding failures are:

– Variable not initialised. The associated failure is that the number of input files is printed incorrectly in cmdline.

– Incorrect boolean expression in a decision. The associated failure is that the program does not output error if the second node of the “are siblings” function does not belong to the tree.

2.6 Participants

The 32 participants of the original study were fifth (final) year undergraduate computer science students taking the elective Software Verification and Validation course at the Universidad Politécnica de Madrid. The students have gone through two Software Engineering courses of 6 and 12 credits, respectively. They are trained in SE, have strong programming skills, have experience programming in C, have participated in small-size development projects, and have little or no professional experience. So they should not be considered inexperienced in programming, but rather good proxies for junior programmers.

They have no formal training in any code evaluation technique (including the ones involved in the study), as this is the course in which they are taught them. Since they have had previous coding assignments, they might have done testing before, but informally. As a consequence, they might have acquired some intuitive knowledge of how to test/review programs (developing their own techniques or procedures that could resemble the studied ones), but they have never learned the techniques formally. They have never been required to do peer reviews in coding assignments or to write test cases in the projects in which they have participated.

They could possibly have used assertions or informal input validation, but on their own (never on request, and they have not previously been taught how to do it). All participants have a homogeneous background. The only differences could be due to the level of achievement of learning goals in previous courses, or to innate ability for testing. The former could have been determined by means of scores in previous courses (which was not possible). The latter was not possible to measure. Therefore, we did not deem any kind of blocking necessary and just performed simple randomisation.

Therefore, the sample used represents developers with little or no previous experience with code evaluation techniques (novice testers). The use of our students is appropriate in this study on several grounds:

– We want to rule out any possible influence of previous experience on code evaluation techniques. Therefore, participants should not have any preconceived ideas or opinions about the techniques (including having a favourite one).

– Falessi et al. [21] suggest that it is easier to induce a particular behaviour among students; more specifically, it is easier to reinforce a high level of adherence to the treatment among the experimental subjects applying the techniques.

– Students are used to making predictions during development tasks, as they are continually undergoing assessment in courses related to programming, SE, networking, etc.

Having said that, since our participants are not practitioners, their opinions are not based on any previous work experience in testing, but on their experience of informally testing programs for some years (they are in the 5th year of a 5-year CS bachelor's degree). Additionally, as part of the V&V training, our participants are asked to practice the techniques used in the experiment on small programs. According to Falessi et al. [21], SE experiments tend to forget practitioners' heterogeneity.

Practitioners have different academic backgrounds, SE knowledge and professional experience. For example, a developer without a computer science academic background might not have knowledge of testing techniques. We assume that, for this exploratory study, the characteristics of the participants make them a valid sample of developers who have little or no experience with code evaluation techniques and are junior programmers.

2.7 Data Analysis

The analyses conducted in response to the research questions are explained below. Table 5 summarises the statistical tests used to answer each research question. First, we report the analyses (descriptive statistics and hypothesis testing) of the controlled experiment. To examine participants' perceptions (RQ1.1), we report the frequency of each technique (percentage of participants that perceive each technique as the most effective).

Additionally, we determine whether all three techniques are equally frequently perceived as being the most effective. We test the null hypothesis that the frequency distribution of the perceptions is consistent with a discrete uniform distribution, i.e., all outcomes are equally likely to occur. To do this, we use a chi-square (χ²) goodness-of-fit test. To examine whether participants' perceptions predict their testing effectiveness (RQ1.2), we use Cohen's kappa coefficient along with its 95% confidence interval, calculated using bootstrap.

Table 5 Statistical Tests Used to Answer Research Questions

Cohen's kappa coefficient (κ) is a statistic that measures agreement for qualitative (categorical) variables when 2 raters are classifying different objects (units). It is calculated on the corresponding contingency table. Table 6 shows an example of a contingency table. Cells contain the frequencies associated with each pair of classes.

Table 6 Example of Contingency Table. It is used to calculate kappa and to perform Stuart-Maxwell's and McNemar-Bowker's tests
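As an illustrative sketch (with made-up data, not the study's actual analysis scripts), the RQ1.1 and RQ1.2 tests described above can be run in Python with scipy and scikit-learn:

# RQ1.1: are the three techniques equally often perceived as most effective?
# RQ1.2: agreement between perceived and real most effective technique.
from scipy.stats import chisquare
from sklearn.metrics import cohen_kappa_score

perceived_counts = [9, 7, 7]                 # e.g. counts for BT, EP, CR (illustrative)
chi2, p_value = chisquare(perceived_counts)  # tests against a uniform distribution
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")

# One label per participant (illustrative values).
perceived = ["BT", "EP", "CR", "EP", "BT", "CR", "EP"]
actual    = ["EP", "EP", "CR", "BT", "BT", "EP", "EP"]
kappa = cohen_kappa_score(perceived, actual)
print(f"Cohen's kappa = {kappa:.2f}")        # >= 0.4 counts as agreement (Fleiss et al.)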

Kappa is generally thought to be a more robust measure than a simple percent agreement calculation, since it takes into account the agreement occurring by chance. It is not the only coefficient that can be used to measure agreement. There are others, like Krippendorff's alpha, which is more flexible, as it can be used in situations where there are more than 2 raters or the response variable is on an interval or ratio scale.

However, in our particular situation, with 2 raters, data on a nominal scale and no missing data, kappa behaves similarly to Krippendorff's alpha [3], [54]. Kappa is a number from -1 to 1. Positive values are interpreted as agreement, while negative values are interpreted as disagreement. There is still some debate about how to interpret kappa. Different authors have categorised detailed ranges of values for kappa that differ with respect to the degree of agreement that they suggest (see Table 7).

According to the scales by Altman [1] and Landis & Koch [35], 0.6 is the threshold at or above which there is considered to be agreement. Fleiss et al. [22] lower this value to 0.4. Each branch of science should establish its own kappa threshold. As there are no previous studies that specifically address which agreement scale and threshold are most appropriate for SE, and different studies in SE have used different scales, we use Fleiss et al.'s more generous scale as our baseline.

Table 7 Interpretation of Kappa Values. Negative values are interpreted like positive values, but meaning disagreement instead of agreement

For all participants, we measure the agreement between the technique a participant perceives as most effective and the technique that is actually most effective for that participant. Therefore, we have 2 raters (perceptions and reality), three classes (BT, EP and CR), and as many units to be classified as participants. Since there could be agreement for some but not all techniques, we also measure kappa for each technique separately (kappa per category), following the approach described in [20].

It consists of collapsing the corresponding contingency table. Table 8 shows the collapsed contingency table for Class A from Table 6. Note that a collapsed table is always a 2x2 table.

Table 8 Example of Collapsed Contingency Table. It is used to calculate partial kappa

In the event of disagreement, we also study the type of mismatch between perceptions and reality, that is, whether the disagreement leads to some sort of bias in favour of any of the techniques. To do this, we use the respective contingency table to run Stuart-Maxwell's test of marginal homogeneity (testing the null hypothesis that the distribution of preferences matches reality) and the McNemar-Bowker test for symmetry (testing the null hypothesis of symmetry), as explained in [20].

The hypothesis of marginal homogeneity corresponds to equality of row and column marginal probabilities in the corresponding contingency table. The test for symmetry determines whether observations in cells situated symmetrically about the main diagonal have the same probability of occurrence. In a 2x2 table, symmetry and marginal homogeneity are equivalent. In larger tables, symmetry implies marginal homogeneity, but the converse is not true.

Since we have injected only 7 defects in each program, there is a possibility that, if no agreement is found between perceptions and reality, it is because participants find a similar number of defects with all three (or pairs of) techniques (RQ1.3). If this is the case, it would be difficult for them to choose the most effective technique. To check this, we will run an agreement analysis on the effectiveness obtained by participants using the different techniques. Therefore, we have 3 raters (techniques) and as many units as participants.

This will be done with all participants and with the participants in each experiment group; for all techniques and for pairs of techniques. Note that kappa can no longer be used, as we are seeking agreement on interval data. For this reason, we will use Krippendorff's alpha [26] along with its 95% confidence interval (calculated using bootstrap) and the KALPHA macro for SPSS.
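For illustration only (the study itself uses the KALPHA macro for SPSS), a similar interval-scale agreement check can be run with the third-party krippendorff Python package; the effectiveness values below are made up:

# Treat the three techniques as "raters" and per-participant effectiveness (%) as
# interval data; one row per technique, one column per participant.
import numpy as np
import krippendorff

effectiveness = np.array([
    [57.1, 85.7, 42.9, 71.4],   # EP
    [42.9, 71.4, 57.1, 71.4],   # BT
    [28.6, 57.1, 42.9, 57.1],   # CR
])
alpha = krippendorff.alpha(reliability_data=effectiveness,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha = {alpha:.2f}")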

To examine the mismatch cost (RQ1.4) and project loss (RQ1.5), we report the cost of the mismatch (when it is greater than zero for RQ1.4, and in all cases for RQ1.5) associated with each technique, as explained in Section 2.3. To discover whether there is a relationship between the technique perceived as being the most effective and the mismatch cost and project loss, we apply a one-way ANOVA test or a Kruskal-Wallis medians test for normal and non-normal distributions, respectively, along with visual analyses (scatter plots).
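A minimal sketch of that last step (illustrative data, not the authors' scripts), choosing between one-way ANOVA and Kruskal-Wallis based on a rough normality check:

from scipy.stats import shapiro, f_oneway, kruskal

# Mismatch cost (pp) grouped by the technique perceived as most effective (made-up values).
cost_by_perceived = {
    "EP": [0.0, 14.3, 28.6, 0.0],
    "BT": [42.9, 0.0, 28.6],
    "CR": [57.1, 28.6, 0.0, 14.3],
}
samples = list(cost_by_perceived.values())
normal = all(shapiro(s).pvalue > 0.05 for s in samples)  # rough normality screen
stat, p = f_oneway(*samples) if normal else kruskal(*samples)
print(("ANOVA" if normal else "Kruskal-Wallis"), f"statistic = {stat:.2f}, p = {p:.3f}")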

:::info Authors:

  1. Sira Vegas
  2. Patricia Riofrío
  3. Esperanza Marcos
  4. Natalia Juristo

:::

:::info This paper is available on arxiv under CC BY-NC-ND 4.0 license.

:::


Why Developers Keep Picking the Wrong Testing Techniques

2025-12-15 05:09:16

:::info Authors:

  1. Sira Vegas
  2. Patricia Riofrío
  3. Esperanza Marcos
  4. Natalia Juristo

:::

Table Of Links

Abstract

1 Introduction

2 Original Study: Research Questions and Methodology

3 Original Study: Validity Threats

4 Original Study: Results

5 Replicated Study: Research Questions and Methodology

6 Replicated Study: Validity Threats

7 Replicated Study: Results

8 Discussion

9 Related Work

10 Conclusions And References

Abstract

A recurring problem in software development is incorrect decision making on the techniques, methods and tools to be used. Mostly, these decisions are based on developers’ perceptions about them. A factor influencing people’s perceptions is past experience, but it is not the only one. In this research, we aim to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience.

To do this, we conduct an empirical study plus a replication. During the original study, we conduct a controlled experiment with students applying two testing techniques and a code review technique. At the end of the experiment, they take a survey to find out which technique they perceive to be most effective. The results show that participants' perceptions are wrong and that this mismatch is costly in terms of quality.

In order to gain further insight into the results, we replicate the controlled experiment and extend the survey to include questions about participants' opinions on the techniques and programs. The results of the replicated study confirm the findings of the original study and suggest that participants' perceptions might be based not on their opinions about complexity or preferences for techniques but on how well they think that they have applied the techniques.

1. Introduction

An increasingly popular practice nowadays is for software development companies to let developers choose their own technological environment. This means that different developers may use different productivity tools (programming language, IDE, etc.). However, software engineering (SE) is a human-intensive discipline where wrong decisions can potentially compromise the quality of the resulting software. In SE, decisions on which methods, techniques and tools to use in software development are typically based on developers' perceptions and/or opinions rather than evidence, as suggested by Dybå et al. [19] and Zelkowitz et al. [55].

However, empirical evidence might not be available, as certain methods, techniques or tools may not have been studied within a particular setting or even at all. Alternatively, developers may simply not be acquainted with such studies, according to Vegas & Basili [49]. On this ground, it is important to discover how well developers' perceptions (beliefs) match reality and, if they do not, find out what is behind this mismatch, as noted by Devanbu et al. [14]. According to psychology, experience plays a role in people's perceptions. This has also been observed by Devanbu et al. [14] in SE.

However, this research sets out to discover how well matched perceptions are with reality in the absence of previous experience in the technology being used. This makes sense for several reasons: 1) experience is not the only factor affecting developers' perceptions; 2) development teams are usually composed of a mix of people with and without experience; and 3) it is not clear what type of experience influences perceptions. For example, Dieste et al. [17] conclude that academic rather than professional experience could be affecting the external quality of the code generated by developers when applying Test-Driven Development.

We aim to study whether perceptions about the effectiveness of three defect detection techniques match reality, and if not, what is behind these perceptions. To the best of our knowledge, this is the first paper to empirically assess this issue. To this end, we conducted an empirical study plus a replication with students. During the original study we measured (as part of a controlled experiment) the effectiveness of two testing techniques and one code review technique when applied by the participants. We then checked the perceived most effective technique (gathered by means of a survey) against the real one.

Additionally, we analysed the cost of the mismatch between perceptions and reality in terms of loss of effectiveness. Major findings include:

– Different people perceive different techniques to be more effective. No one technique is perceived as being more effective than the others.

– The perceptions of 50% of participants (11 out of 23) are wrong.

– Wrong perception of techniques can reduce effectiveness by 31pp (percentage points) on average.

These findings led us to extend the goal of the study in a replication to investigate what could be behind participants' perceptions. To do this, we examined their opinions on the techniques they applied and the programs they tested in a replication of the controlled experiment. Major findings include:

– The results of the replication confirm the findings of the original study.

– Participants think that technique effectiveness depends exclusively on their performance and not on possible weaknesses of the technique itself.

– The opinions about technique complexity and preferences for techniques do not seem to play a role in perceived effectiveness.

These results are useful for developers and researchers. They suggest:

– Developers should become aware of the limitations of their judgement.

– Tools should be designed that provide feedback to developers on how effective techniques are.

– The best combination of techniques, one that is both easily applicable and effective, should be determined.

– Instruments should be developed to make empirical results available to developers.

The material associated with the studies presented here can be found at https://github.com/GRISE-UPM/Misperceptions. The article is organised as follows. Section 2 describes the original study. Section 3 presents its validity threats. Section 4 discusses the results. Section 5 describes the replicated study, based on the modifications made to the original study. Section 6 presents its validity threats. Section 7 reports the results of this replicated study. Section 8 discusses our findings and their implications. Section 9 shows related work. Finally, Section 10 outlines the conclusions of this work.


:::info This paper is available on arxiv under CC BY-NC-ND 4.0 license.

:::


The Godot Editor Is Now Available in the Meta Horizon Store

2025-12-15 05:00:06

A year ago, I introduced the Android port of the Godot Editor. To date, it has had over 500K downloads on the Google Play store, and has enabled developers to create and develop Godot apps and games using Android tablets, foldables and phones. Since then we have been hard at work refining the experience, improving the development workflow via picture-in-picture (PiP) support, providing the ability to build and export Godot binaries, and improving the Editor performance and reliability.

Building on that foundation, and thanks to the Meta grants in support of that work and with help from W4 Games, I was able to complete the proof of concept started by Bastiaan Olij a couple of years ago to add support for using the Android editor in an XR context, using Godot's first-class OpenXR integration!

Today, I am proud to release the first mobile XR port of the Godot Editor on Meta Quest devices!

The Godot Editor is now available on the Horizon Store for Meta Quest 2, Meta Quest 3 & Meta Quest Pro devices running Horizon OS version 69 or higher.

This is an early access version of the Godot Editor running natively on Meta Quest devices, enabling the creation and development of 2D, 3D and immersive XR apps and games directly on device without the need for an external computer.

As usual, this work is entirely free and open source, and already merged in Godot 4.4's development branch (GH-96624). The version we publish on the Horizon Store can also be downloaded as an APK directly from the Godot website.

Features & Highlights

This version of the Godot Editor is a Hybrid App with the ability to open and transition back and forth between multiple panel (2D) and immersive (XR) windows. This is used to support the Editor features as described below.

Access to all Godot Engine capabilities

The Project Manager and the main Editor are rendered into panel windows, as done on desktop and Android platforms. This makes the Editor readily available and usable either in the Home environment or overlaid onto an XR experience.

This approach allows us to deliver on a core tenet of this port, which is to provide developers with a familiar development interface and access to the full set of capabilities and features that the Godot Editor provides on desktop and Android platforms. This includes access to the asset library, keyboard & mouse shortcuts, GDScript code editing / highlighting / completion support, access to the documentation, live scene editing, live script reloading support, live debugging, live profiling and many more!

Developing XR apps and games!

When developing an XR project, the immersive (XR) window is used for playtesting the project directly on the device, as if it were an already released app. In that mode, the Editor panel can be summoned as an interactive overlay, which allows the developer to iterate, debug or profile the XR project while it's running.

Support for exporting XR project binaries will be made available via a plugin.

Developing 2D and 3D apps and games!

Support for creating and developing 2D and 3D apps and games is available out of the box.

The experience is improved by leveraging the Android editor's multi-panel capability, which on Horizon OS allows playtesting the project in a new panel next to the Editor panel. This allows the Editor to remain accessible for iterating, debugging or profiling the project in real-time.

As with the Android editor, this version provides the ability to export 2D & 3D project binaries for all supported platforms.

Leveraging Horizon OS platform capabilities

Support for keyboard and mouse

External keyboard and mouse support allows developers to achieve the same levels of productivity as they do on desktop and laptop computers.

Virtual keyboard, touch controllers and direct touch are also supported for quick interactions, or when physical keyboard and mouse devices are not readily available.

Seamless multitasking

Introduced in Horizon OS v69, seamless multitasking enables the Editor panel to be visible and interactable while playtesting an XR project in virtual space.

This gives developers the ability to do live editing, debugging or profiling of XR projects in real-time, with the benefit of the depth cues and sense of scale unique to XR.

Panel Resizing & Theater View support

The Editor panel can be resized at will via drag and drop to fit the developer’s needs.

Using the Theater View button, developers can maximize the Editor panel and bring it front-and-center.

An important step for the XR & Game communities

Besides the technical achievements required to make this port feasible, we believe this is a significant milestone as it impacts the XR & Game community in a few but critical ways:

  • Turns the Meta Quest into a true Spatial Computer
  • The Meta Quest gains the ability to create (and distribute) its own native apps without the need for a PC or laptop computer!
  • Being able to run a full game engine on a mobile XR device should serve as inspiration for the type of apps that can be brought to the mobile XR ecosystem.
  • Grows the OpenXR ecosystem by providing a seed for building feature-rich apps
  • Godot Engine is a free and open-source software (FOSS) project which means that, in partnership with the Godot Foundation, OpenXR vendors can bring similar capabilities to their devices to grow the OpenXR ecosystem.
  • Reduces XR development friction
  • XR development on PC and laptop devices has significant friction due to the need to switch back and forth between the development device and the target XR device (i.e. taking the headset off for development, putting it back on for playtesting).
  • This is not an issue when using the Godot Editor natively on XR devices since the development and target device are now the same device!
  • Lowers the barrier of entry for XR and Game development
  • This version of the Godot Editor turns devices like the Meta Quest into an easily accessible development device with the ability to natively create, develop and export 2D, 3D or XR apps and games for all Godot-supported platforms.
  • Provides a more flexible development experience
  • Developers can leverage the virtual space to gain more screen real estate than a laptop could provide.
  • The virtual floating panels provide a more flexible layout than a traditional desktop + multi-monitors setup.
  • The ability to playtest and modify XR projects in-headset in real-time is a capability that can’t be replicated on PC and laptop computers.

Next Steps, Feedback & Contributions

This is only the beginning!

As mentioned in the previous section, we believe this is an important milestone for the XR, GameDev, and Open Source communities and we aim to build on this foundation to make Godot Engine a powerful, flexible and cross-platform tool for XR and Game development.

To that end, we welcome feedback and contributions from partners, members of the community and interested parties.


Fredia Huya-Kouadio

Also published here

Photo by Grant McIver on Unsplash


Meet the Writer: Ashton Chew, Founding Engineer at Theta

2025-12-15 04:25:35


Welcome to HackerNoon’s Meet the Writer Interview series, where we learn a bit more about the contributors that have written some of our favorite stories.


Let’s start! Tell us a bit about yourself. For example, name, profession, and personal interests.

Hey! My name is Ashton, and I’m a founding engineer at Theta where I work on RL infra, RL, and distributed systems. I specifically focus on computer-use and tool-use. In my past, I worked at Amazon AGI and tackled inference and tool-use infrastructure. In my free time, I love graphic design, side-projects, and bouldering.

Interesting! What was your latest Hackernoon Top Story about?

My latest story, “Can Your AI Actually Use a Computer? A 2025 Map of Computer‑Use Benchmarks,” touched on one of the hottest spaces in VC right now: RL environments and evals. I gave a comprehensive overview of the most-used computer-use benchmarks, plus practical advice on how to pick benchmarks for training and testing computer-use agents.

I kept running into the same gap: there aren’t many articles that review the benchmarks themselves. And as this field grows, it’s vital that we’re actually assessing quality instead of rewarding whatever happens to game the metric. We’ve been here before. In the early days of LLMs, benchmarks were random and disparate enough that they only weakly reflected the real winner.

Benchmarks became the de facto scoreboard for “best model,” and then people realized a lot of them weren’t measuring what they claimed.

One of the most revealing early-era failures was when “reading comprehension” quietly became “pattern matching on dataset structure.” Researchers ran intentionally provocative baselines (question-only, last-sentence-only), and the results were high enough to raise an uncomfortable possibility: the benchmark didn’t consistently force models to use the full passage. In a 2018 critique, the point wasn’t that reading never matters, but that some datasets accidentally made it optional by over-rewarding shortcuts like recency and stereotyped answer priors.


# Supposed task: answer the question given the passage and question

Passage (summary):
- Sentences 1–8: John’s day at school (mostly irrelevant detail)
- Sentence 9: "After school, John went to the kitchen."
- Sentence 10: "He ate a slice of pizza before starting his homework."

Question: "What did John eat?"
Answer: "pizza"

The benchmark accidentally rewards a shortcut where the model overweights the last sentence (because the answer is often near the end) and simply extracts the direct object of the most recent action (“ate ___”), which in this case yields “pizza.”

And then comes the even more damaging baseline: remove the passage entirely and see what happens. If a question-only model is competitive, it’s a sign the dataset is leaking signal through repetition and priors rather than testing passage-grounded comprehension.

Question: "What did John eat?"

This baseline is basically a sanity check: can the model still score well by leaning on high-frequency answer templates without grounding on the passage at all? In practice it just guesses a token the dataset disproportionately rewards (“pizza,” “sandwich”), and if that works more often than it should, you’re not measuring comprehension so much as you’re measuring the dataset’s priors.

Computer-use evals have already produced an even more literal shortcut: the agent has a browser, the benchmark is public, and the evaluation turns into an open-book exam with an answer key on the final page. In the Holistic Agent Leaderboard (HAL) paper, the authors report observing agents that searched for the benchmark on HuggingFace instead of solving the task, a behavior you only catch if you inspect logs.


# Supposed task: complete a workflow inside the web environment

Task: "Configure setting X in the app and verify it's enabled."

Failure mode:
1) Open a new tab
2) Search for: "benchmark X expected enabled state" / "HAL <benchmark> setting X"
3) Find: repo / leaderboard writeup / dataset card / issue thread
4) Reproduce the expected end state (answer)

At that point, the evaluation is measuring whether the agent can locate the answer key.

Task: "Find the correct page and extract Y."

Failure mode:
- Search: "<benchmark name> Y"
- Copy from a public artifact (docs, forum post, dataset card)
- Paste the value into the agent output as if it came from interaction

If an agent can pull the value from a dataset card or repo and still “pass,” the success check is grading plausibility, not interaction correctness. Public tasks plus shallow verification turn web search into an exploit.

These two examples are the warning shot: if we don’t hold computer-use benchmarks to higher standards early, we’ll repeat the LLM era just with better UIs and more elaborate ways to cheat.

Do you usually write on similar topics? If not, what do you usually write about?

Yes! Working on the RL environments and RL infra around computer-use, I’m constantly surrounded by the best computer-use models and the most realistic training environments. So I wrote another article, “The Screen Is the API,” which is the case for computer-use and why it’s the future of AI models.

This space is extremely underreported due to two reasons:

  1. Models aren’t as capable in computer-use as they are in other tasks (coding, math, etc.).
  2. Computer-use is fast-moving and extremely new.

I want to change that.

Great! What is your usual writing routine like (if you have one)

I usually read a bunch of research papers and speak to my peers in the industry about their thoughts on a topic. Other than that, I spend a lot of time reading articles by great bloggers like PG. So I usually take a lot of inspiration from other people in my writing.

Being a writer in tech can be a challenge. It’s not often our main role, but an addition to another one. What is the biggest challenge you have when it comes to writing?

Finding the time to sit down and put my lived experience into words.

What is the next thing you hope to achieve in your career?

To tackle harder problems with great people, to learn from those people, and share my experiences.

Wow, that’s admirable. Now, something more casual: What is your guilty pleasure of choice?

Watching movies! My favorite movie right now is Catch Me If You Can (2002).

Do you have a non-tech-related hobby? If yes, what is it?

I love bouldering because it makes me feel like I’m a human computer-use agent interacting with the climbing wall. I’m kidding. I think bouldering is a lot of fun because it allows me to take my mind off of work and consolidate my thinking.

What can the Hacker Noon community expect to read from you next?

I’m currently writing another piece on RL environment infrastructure!

What’s your opinion on HackerNoon as a platform for writers?

I think the review structure is awesome, and it was a great place for me to put my thoughts in front of technical readers.

Thanks for taking the time to join our “Meet the writer” series. It was a pleasure. Do you have any closing words?

I love writing. Thank you, HackerNoon!

Build a Real-Time AI Fraud Defense System with Python, XGBoost, and BERT

2025-12-15 04:04:09

Fraud isn't just a nuisance; it’s a $12.5 billion industry. According to 2024 FTC data, reported losses to fraud spiked massively, with investment scams alone accounting for nearly half that total.

For developers and system architects, the challenge is twofold:

  1. Transaction Fraud: Detecting anomalies in structured financial data (Who sent money? Where? How much?).
  2. Communication Fraud (Spam/Phishing): Detecting malicious intent in unstructured text (SMS links, Email phishing).

Traditional rule-based systems ("If amount > $10,000, flag it") are too brittle. They generate false positives and miss evolving attack vectors.

In this engineering guide, we will build a Dual-Layer Defense System. We will implement a high-speed XGBoost model for transaction monitoring and a BERT-based NLP engine for spam detection, wrapping it all in a cloud-native microservice architecture.

Let’s build.

The Architecture: Real-Time & Cloud-Native

We aren't building a batch job that runs overnight. Fraud happens in milliseconds. We need a real-time inference engine.

Our system consists of two distinct pipelines feeding into a central decision engine.

Pipeline Architecture Components

The Tech Stack

  • Language: Python 3.9+
  • Structured Learning: XGBoost (Extreme Gradient Boosting) & Random Forest.
  • NLP: Hugging Face Transformers (BERT) & Scikit-learn (Naïve Bayes).
  • Deployment: Docker, Kubernetes, FastAPI.

Part 1: The Transaction Defender (XGBoost)

When dealing with tabular financial data (Amount, Time, Location, Device ID), XGBoost is currently the king of the hill. In our benchmarks, it achieved 98.2% accuracy and 97.6% precision, outperforming Random Forest in both speed and reliability.

The Challenge: Imbalanced Data

Fraud is rare. If you have 100,000 transactions, maybe only 30 are fraudulent. If you train a model on this, it will just guess "Legitimate" every time and achieve 99.9% accuracy while missing every single fraud case.

The Fix: We use SMOTE (Synthetic Minority Over-sampling Technique) or class weighting during training.
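For the SMOTE option, a minimal sketch using the third-party imbalanced-learn package might look like this (the blueprint below uses class weighting via scale_pos_weight instead; X_train and y_train refer to the split created in that blueprint):

# Oversample the minority (fraud) class with synthetic examples before training.
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print("Before:", Counter(y_train))      # e.g. heavily skewed toward class 0
print("After: ", Counter(y_resampled))  # classes balanced by synthetic minority samples

# Train on the resampled data; keep the untouched test set for evaluation.
# model.fit(X_resampled, y_resampled)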

Implementation Blueprint

Here is how to set up the XGBoost classifier for transaction scoring.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd

# 1. Load Data (Anonymized Transaction Logs)
# Features: Amount, OldBalance, NewBalance, Location_ID, Device_ID, TimeDelta
df = pd.read_csv('transactions.csv')

X = df.drop(['isFraud'], axis=1)
y = df['isFraud']

# 2. Split Data (stratify to keep the rare fraud class represented in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Initialize XGBoost
# scale_pos_weight is crucial for imbalanced fraud data; a common starting point
# is the ratio of negative to positive samples in the training set.
model = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    scale_pos_weight=10,       # handling class imbalance
    eval_metric='logloss'      # use_label_encoder is deprecated in recent XGBoost versions
)

# 4. Train
print("Training Fraud Detection Model...")
model.fit(X_train, y_train)

# 5. Evaluate
preds = model.predict(X_test)
print(f"Precision: {precision_score(y_test, preds):.4f}")
print(f"Recall: {recall_score(y_test, preds):.4f}")
print(f"F1 Score: {f1_score(y_test, preds):.4f}")

Why XGBoost Wins:

  • Speed: It processes tabular data significantly faster than Deep Neural Networks.
  • Sparsity: It handles missing values gracefully (common in device fingerprinting).
  • Interpretability: Unlike a "Black Box" Neural Net, we can output feature importance to explain why a transaction was blocked.
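As a quick illustration of the interpretability point above (a sketch, not part of the original blueprint), you can inspect which features drive the trained model:

# Rank features by importance using the model and X from the blueprint above.
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# For per-transaction explanations, pairing XGBoost with SHAP is a common next step.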

Part 2: The Spam Hunter (NLP)

Fraud often starts with a link. "Click here to update your KYC." To detect this, we need Natural Language Processing (NLP).

We compared Naïve Bayes (lightweight, fast) against BERT (Deep Learning).

  • Naïve Bayes: 94.1% Accuracy. Good for simple keyword-stuffing spam.
  • BERT: 98.9% Accuracy. Necessary for "Contextual" phishing (e.g., socially engineered emails that don't look like spam).
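Before the BERT blueprint, here is a minimal sketch of the lightweight Naïve Bayes baseline (an assumed TF-IDF + MultinomialNB pipeline; the toy data below is illustrative, not the benchmark dataset):

# Lightweight keyword-style spam classifier: TF-IDF features + Multinomial Naive Bayes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["Win a free prize now!!!", "Are we still meeting at 5pm?"]
labels = [1, 0]  # 1 = spam, 0 = ham

nb_model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                         MultinomialNB())
nb_model.fit(messages, labels)
print(nb_model.predict(["Claim your free prize today"]))  # expected: [1] (spam)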

Implementation Blueprint (BERT)

For a production environment, we fine-tune a pre-trained Transformer model.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# 1. Load Pre-trained BERT
# Note: bert-base-uncased ships with an untrained classification head, so scores
# are only meaningful after fine-tuning; point model_name at your fine-tuned
# spam/phishing checkpoint in production.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.eval()  # inference mode

def classify_message(text):
    # 2. Tokenize Input
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        truncation=True, 
        padding=True, 
        max_length=512
    )

    # 3. Inference
    with torch.no_grad():
        outputs = model(**inputs)

    # 4. Convert Logits to Probability
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    spam_score = probabilities[0][1].item() # Score for 'Label 1' (Spam)

    return spam_score

# Usage
msg = "Urgent! Your account is locked. Click http://bad-link.com"
score = classify_message(msg)

if score > 0.9:
    print(f"BLOCKED: Phishing Detected (Confidence: {score:.2%})")

Part 3: The "Hard Stop" Workflow

Detection is useless without action. The most innovative part of this architecture is the Intervention Logic.

We don't just log the fraud; we intercept the user journey.

The Workflow:

  1. User receives SMS: "Update payment method."
  2. User Clicks: The click is routed through our Microservice.
  3. Real-Time Scan: The URL and message body are scored by the BERT model.
  4. Decision Point:
  • Safe: User is redirected to the actual payment gateway.
  • Fraud: A "Hard Stop" alert pops up.

Note: Unlike standard email filters that move items to a Junk folder, this system sits between the click and the destination, preventing the user from ever loading the malicious payload.

Key Metrics

When deploying this to production, "Accuracy" is a vanity metric. You need to watch Precision and Recall.

  • False Positives (Precision drops): You block a legitimate user from buying coffee. They get angry and stop using your app.
  • False Negatives (Recall drops): You let a hacker drain an account. You lose money and reputation.

In our research, XGBoost provided the best balance:

  • Accuracy: 98.2%
  • Recall: 95.3% (It caught 95% of all fraud).
  • Latency: Fast inference suitable for real-time blocking.

Conclusion

The era of manual fraud review is over. With transaction volumes exploding, the only scalable defense is AI.

By combining XGBoost for structured transaction data and BERT for unstructured communication data, we create a robust shield that protects users not just from financial loss, but from the social engineering that precedes it.

Next Steps for Developers:

  1. Containerize: Wrap the Python scripts above in Docker.
  2. Expose API: Use FastAPI to create a /predict endpoint.
  3. Deploy: Push to Kubernetes (EKS/GKE) for auto-scaling capabilities.
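A minimal sketch of step 2 (an assumed service layout, not a full implementation): exposing the two models from Parts 1 and 2 behind a FastAPI /predict endpoint, ready to be containerized and deployed as described above.

from typing import List, Optional
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Fraud Defense API")

class PredictRequest(BaseModel):
    features: List[float]            # transaction features, ordered as in training
    message: Optional[str] = None    # optional SMS/email body

@app.post("/predict")
def predict(req: PredictRequest):
    # `model` is the trained XGBoost classifier from Part 1,
    # `classify_message` the BERT scorer from Part 2.
    fraud_prob = float(model.predict_proba([req.features])[0][1])
    spam_score = classify_message(req.message) if req.message else None
    return {
        "fraud_probability": fraud_prob,
        "spam_score": spam_score,
        "action": "HARD_STOP" if fraud_prob > 0.9 or (spam_score or 0) > 0.9 else "ALLOW",
    }

# Run locally with: uvicorn app:app --reload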


3D Mapping Initialization: Using RGB-D Images and Camera Parameters

2025-12-15 04:00:04

Table of Links

Abstract and 1 Introduction

  2. Related Works

    2.1. Vision-and-Language Navigation

    2.2. Semantic Scene Understanding and Instance Segmentation

    2.3. 3D Scene Reconstruction

  3. Methodology

    3.1. Data Collection

    3.2. Open-set Semantic Information from Images

    3.3. Creating the Open-set 3D Representation

    3.4. Language-Guided Navigation

  4. Experiments

    4.1. Quantitative Evaluation

    4.2. Qualitative Results

  5. Conclusion and Future Work, Disclosure statement, and References

3.1. Data Collection

Creating the O3D-SIM begins by capturing a sequence of RGB-D images using a posed camera, together with an estimate of the extrinsic and intrinsic camera parameters for the environment to be mapped. The pose information associated with each image is used to transform the point clouds into a world coordinate frame. For simulations, we use the ground-truth pose associated with each image, whereas in the real world we leverage RTAB-Map [30] with G2O optimization [31] to generate these poses.

Figure 2. An overview of the proposed 3D mapping pipeline. Labels generated by the RAM model are input into Grounding DINO to generate bounding boxes for the detected labels. Subsequently, instance masks are created using the SAM model, while CLIP and DINOv2 embeddings are extracted in parallel. These masks, along with the semantic embeddings, are back-projected into 3D space to identify 3D instances. These instances are then refined using a density-based clustering algorithm to produce the O3D-SIM.
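A minimal sketch (not the authors' code) of the step described in Section 3.1: back-projecting an RGB-D frame with pinhole intrinsics and transforming the points into the world frame using the per-image camera pose. Variable names are illustrative.

import numpy as np

def rgbd_to_world(depth, K, T_world_cam):
    """depth: HxW metres; K: 3x3 intrinsics; T_world_cam: 4x4 camera-to-world pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0
    # Pixel -> camera frame via the pinhole model.
    x = (u.reshape(-1)[valid] - K[0, 2]) * z[valid] / K[0, 0]
    y = (v.reshape(-1)[valid] - K[1, 2]) * z[valid] / K[1, 1]
    pts_cam = np.stack([x, y, z[valid], np.ones_like(x)], axis=0)  # 4xN homogeneous
    # Camera -> world frame using the pose (ground truth in simulation, RTAB-Map in the real world).
    return (T_world_cam @ pts_cam)[:3].T                           # Nx3 world points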


:::info Authors:

(1) Laksh Nanwani, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;

(2) Kumaraditya Gupta, International Institute of Information Technology, Hyderabad, India;

(3) Aditya Mathur, International Institute of Information Technology, Hyderabad, India; this author contributed equally to this work;

(4) Swayam Agrawal, International Institute of Information Technology, Hyderabad, India;

(5) A.H. Abdul Hafez, Hasan Kalyoncu University, Sahinbey, Gaziantep, Turkey;

(6) K. Madhava Krishna, International Institute of Information Technology, Hyderabad, India.

:::


:::info This paper is available on arxiv under the CC BY-SA 4.0 Deed (Attribution-ShareAlike 4.0 International) license.

:::
