Less Data, Better Results: How Active Learning Improves Workflow Anomaly Detection
Chameleon-Powered Research Shows the Path to Efficient Scientific Computing
- Feb. 26, 2025
Scientific workflows help orchestrate complex computations across distributed resources, but they're vulnerable to failures at large scales. Detecting these anomalies typically requires massive training datasets, which are expensive and time-consuming to collect. Researchers from USC, Argonne National Lab, UNC Chapel Hill, and Oak Ridge National Lab have developed an approach that learns more efficiently by using active learning to generate only the most valuable training examples.
Research Overview
The research team developed a novel framework that combines active learning with an experimental platform designed specifically for computational workflows. This approach addresses a critical challenge in machine learning: how to minimize the amount of training data needed while maximizing model accuracy.
The key insight is that not all data is equally valuable for training. Rather than blindly collecting massive datasets, the system actively identifies which types of data would be most useful for improving model performance, and then generates precisely those examples on demand.
The solution comes in two parts:
- Poseidon-X: An experimental framework that leverages the Pegasus workflow management system and two NSF-funded cloud testbeds (Chameleon and FABRIC) to provide reproducible, on-demand data collection from workflow executions. This framework supports controlled injection of different types of anomalies, from CPU slowdowns to network issues.
- Active Learning Module: A mathematical framework that guides the data generation process by measuring the model's confidence on training data and triggering new targeted experiments in areas of uncertainty.
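The idea of triggering experiments where the model is least confident can be illustrated with a small uncertainty-sampling sketch. This is not the Poseidon-X implementation: the function names, the anomaly labels, and the use of predictive entropy as the confidence measure are all assumptions made for illustration, and the probabilities are made-up values.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_experiments(
    predictions: Dict[str, List[Sequence[float]]], budget: int
) -> List[str]:
    """Return the `budget` experiment types with the highest mean entropy.

    `predictions` maps a (hypothetical) experiment type to the model's
    class probabilities on the samples already collected for that type.
    """
    scores = {
        config: sum(entropy(p) for p in preds) / len(preds)
        for config, preds in predictions.items()
        if preds
    }
    # Most uncertain first: these are the experiments worth running next.
    return sorted(scores, key=lambda c: scores[c], reverse=True)[:budget]


# Illustrative values only: [P(normal), P(anomalous)] per sample.
predictions = {
    "cpu_slowdown": [[0.95, 0.05], [0.90, 0.10]],
    "network_loss": [[0.55, 0.45], [0.50, 0.50]],
    "disk_io": [[0.80, 0.20], [0.70, 0.30]],
}

next_runs = select_experiments(predictions, budget=2)
print(next_runs)  # ['network_loss', 'disk_io']
```

In this toy setup the model is already confident about CPU slowdowns, so the experiment budget goes to the two anomaly types where its predictions are closest to a coin flip.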
The evaluation uses three computational workflows: 1000Genome, Montage, and Predict Future Sales. The findings demonstrate that active learning not only saves computing resources but also improves the accuracy of anomaly detection significantly.
Experimental Implementation on Chameleon
Chameleon's bare-metal infrastructure was essential to this research. The team provisioned three Cascade Lake nodes with 48 cores and 192 GB of RAM each and deployed Docker containers to create an HTCondor pool for workflow execution. They used cgroups to inject controlled performance anomalies and established a high-speed connection between Chameleon and FABRIC over a 25 Gbps VLAN. This setup allowed precise control over experimental conditions and produced reproducible results.
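The post does not say how the cgroups were configured. As one possible illustration, assuming the cgroup v2 CPU controller, a CPU slowdown can be expressed as a `cpu.max` value of the form `<quota> <period>` in microseconds. The helper name below is hypothetical, and the sketch only formats the value rather than writing it to a cgroup.

```python
def cpu_max_setting(cpu_fraction: float, period_us: int = 100_000) -> str:
    """Format a cgroup v2 `cpu.max` value that caps CPU time.

    `cpu_fraction` is the share of one CPU allowed per period
    (0.5 = half a core, 2.0 = two full cores). This helper is an
    illustrative assumption, not part of Poseidon-X.
    """
    if cpu_fraction <= 0:
        raise ValueError("cpu_fraction must be positive")
    if period_us <= 0:
        raise ValueError("period_us must be positive")
    quota_us = round(cpu_fraction * period_us)
    return f"{quota_us} {period_us}"


# Throttle a worker to half a core to emulate a CPU slowdown.
print(cpu_max_setting(0.5))  # 50000 100000
```

Writing such a value to a container's `cpu.max` file limits the CPU time its processes receive in each period, which is one way a repeatable slowdown could be imposed on selected workflow jobs.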
Experiment Artifacts
The research artifacts include the Poseidon-X framework (available on GitHub) and the Flow-Bench dataset containing pre-captured workflow traces. Jupyter notebook recipes provide step-by-step guidance for setup. You can find a link to the paper here.
About the User
Krishnan Raghavan, an Assistant Computational Mathematician at Argonne National Laboratory, led this research. He received his Ph.D. in computer engineering from Missouri University of Science and Technology in May 2019. He has coauthored several papers on deep neural networks with applications to big data.