• BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes

    David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar Devanbu, Bogdan Vasilescu, Cindy Rubio-González
    In Proceedings of International Conference on Software Engineering (ICSE'19) Montreal, Canada, May 2019.
    Available as: PDF, BibTeX
    Abstract: Fault-detection, localization, and repair methods are vital to software quality; but it is difficult to evaluate their generality, applicability, and current effectiveness. Large, diverse, realistic datasets of durably-reproducible faults and fixes are vital to good experimental evaluation of approaches to software quality, but they are difficult and expensive to assemble and keep current. Modern continuous-integration (CI) approaches, like Travis-CI, which are widely used, fully configurable, and executed within custom-built containers, promise a path toward much larger defect datasets. If we can identify and archive failing and subsequent passing runs, the containers will provide a substantial assurance of durable future reproducibility of build and test. Several obstacles, however, must be overcome to make this a practical reality. We describe BugSwarm, a toolset that navigates these obstacles to enable the creation of a scalable, diverse, realistic, continuously growing set of durably reproducible failing and passing versions of real-world, open-source systems. The BugSwarm toolkit has already gathered 3,091 fail-pass pairs, in Java and Python, all packaged within fully reproducible containers. Furthermore, the toolkit can be run periodically to detect fail-pass activities, thus growing the dataset continually.

Related Papers

  • ActionsRemaker: Reproducing GitHub Actions

    Hao-Nan Zhu, Kevin Z. Guan, Robert M. Furth, Cindy Rubio-González
    Available as: PDF
    Abstract: Mining Continuous Integration and Continuous Delivery (CI/CD) has enabled new research opportunities for the software engineering (SE) research community. However, it remains a challenge to reproduce CI/CD build processes, which is crucial for several areas of research within SE such as fault localization and repair. In this paper, we present ActionsRemaker, a reproducer for GitHub Actions builds. We describe the challenges on reproducing GitHub Actions builds and the design of ActionsRemaker. Evaluation of ActionsRemaker demonstrates its ability to reproduce fail-pass pairs: of 180 pairs from 67 repositories, 130 (72.2%) from 43 repositories are reproducible. We also discuss reasons for unreproducibility. ActionsRemaker is publicly available at, and a demo of the tool can be found at
  • On the Reproducibility of Software Defect Datasets

    Hao-Nan Zhu, Cindy Rubio-González
    Available as: PDF
    Abstract: Software defect datasets are crucial to facilitating the evaluation and comparison of techniques in fields such as fault localization, test generation, and automated program repair. However, the reproducibility of software defect artifacts is not immune to breakage. In this paper, we conduct a study on the reproducibility of software defect artifacts. First, we study five state-of-the-art Java defect datasets. Despite the multiple strategies applied by dataset maintainers to ensure reproducibility, all datasets are prone to breakages. Second, we conduct a case study in which we systematically test the reproducibility of 1,795 software artifacts during a 13-month period. We find that 62.6% of the artifacts break at least once, and 15.3% artifacts break multiple times. We manually investigate the root causes of breakages and handcraft 10 patches, which are automatically applied to 1,055 distinct artifacts in 2,948 fixes. Based on the nature of the root causes, we propose automated dependency caching and artifact isolation to prevent further breakage. In particular, we show that isolating artifacts to eliminate external dependencies increases reproducibility to 95% or higher, which is on par with the level of reproducibility exhibited by the most reliable manually curated dataset.
  • On the Real-World Effectiveness of Static Bug Detectors at Finding Null Pointer Exceptions

    David Tomassi, Cindy Rubio-González
    In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)
    Available as: PDF, BibTeX
    Abstract: Static bug detectors aim at helping developers to automatically find and prevent bugs. In this experience paper, we study the effectiveness of static bug detectors at identifying Null Pointer Dereferences or Null Pointer Exceptions (NPEs). NPEs pervade all programming domains from systems to web development. Specifically, our study measures the effectiveness of five Java static bug detectors: CheckerFramework, ERADICATE, INFER, NULLAWAY, and SPOTBUGS. We conduct our study on 102 real-world and reproducible NPEs from 42 open-source projects found in the BUGSWARM and DEFECTS4J datasets. We apply two known methods to determine whether a bug is found by a given tool, and introduce two new methods that leverage stack trace and code coverage information. Additionally, we provide a categorization of the tool’s capabilities and the bug characteristics to better understand the strengths and weaknesses of the tools. Overall, the tools under study only find 30 out of 102 bugs (29.4%), with the majority found by ERADICATE. Based on our observations, we identify and discuss opportunities to make the tools more effective and useful.
  • Fixing Dependency Errors for Python Build Reproducibility

    Suchita Mukherjee, Abigail Almanza, Cindy Rubio-González
    In ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis
    Available as: PDF, BibTeX
    Abstract: Software reproducibility is important for re-usability and the cumulative progress of research. An important manifestation of unreproducible software is the changed outcome of software builds over time. While enhancing code reuse, the use of open-source dependency packages hosted on centralized repositories such as PyPI can have adverse effects on build reproducibility. Frequent updates to these packages often cause their latest versions to have breaking changes for applications using them. Large Python applications risk their historical builds becoming unreproducible due to the widespread usage of Python dependencies, and the lack of uniform practices for dependency version specification. Manually fixing dependency errors requires expensive developer time and effort, while automated approaches face challenges of parsing unstructured build logs, finding transitive dependencies, and exploring an exponential search space of dependency versions. In this paper, we investigate how open-source Python projects specify dependency versions, and how their reproducibility is impacted by dependency packages. We propose a tool PyDFix to detect and fix unreproducibility in Python builds caused by dependency errors. PyDFix is evaluated on two bug datasets BugSwarm and BugsInPy, both of which are built from real-world open-source projects. PyDFix analyzes a total of 2,702 builds, identifying 1,921 (71.1%) of them to be unreproducible due to dependency errors. From these, PyDFix provides a complete fix for 859 (44.7%) builds, and partial fixes for an additional 632 (32.9%) builds.
  • A Note About: Critical Review of BugSwarm for Fault Localization and Program Repair

    David A. Tomassi, Cindy Rubio-González
    Available as PDF and BibTeX which can be found: Here
    Abstract: Datasets play an important role in the advancement of software tools and facilitate their evaluation. BugSwarm is an infrastructure to automatically create a large dataset of real-world reproducible failures and fixes. In this paper, we respond to Durieux and Abreu's critical review of the BugSwarm dataset, referred to in this paper as CriticalReview. We replicate CriticalReview's study and find several incorrect claims and assumptions about the BugSwarm dataset. We discuss these incorrect claims and other contributions listed by CriticalReview. Finally, we discuss general misconceptions about BugSwarm, and our vision for the use of the infrastructure and dataset.
  • Bugs in the Wild: Examining the Effectiveness of Static Analyzers at Finding Real-World Bugs

    David A. Tomassi
    In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    Available as PDF and BibTeX which can be found: Here
    Abstract: Static analysis is a powerful technique to find software bugs. In past years, a few static analysis tools have become available for developers to find certain kinds of bugs in their programs. However, there is no evidence on how effective the tools are in finding bugs in real-world software. In this paper, we present a preliminary study on the popular static analyzers ErrorProne and SpotBugs. Specifically, we consider 320 real Java bugs from the BugSwarm dataset, and determine which of these bugs can potentially be found by the analyzers, and how many are indeed detected. We find that 30.3% and 40.3% of the bugs are candidates for detection by ErrorProne and SpotBugs, respectively. Our evaluation shows that the analyzers are relatively easy to incorporate into the tool chain of diverse projects that use the Maven build system. However, the analyzers are not as effective detecting the bugs under study, with only one bug successfully detected by SpotBugs.