November’s Chameleon User Experiments blog features Nanqinqin Li, a first-year PhD student at Princeton University. Learn more about Li and his summer research on reproducibility and solid-state drive (SSD) simulators, and find out where to replicate his experiment on Trovi!
Image: Nanqinqin Li
On his background: My name is Nanqinqin Li. I am a first-year PhD student at Princeton University. I received my master’s degree at the University of Chicago, working with Prof. Haryadi Gunawi. My major research interest is storage/file systems. I am always trying to train myself as an independent researcher and to leave a mark of my own on the evolution of computer science. In my spare time, I like playing some sports (soccer, swimming, hiking, etc.) and a lot of video gaming with my friends.
On his summer research: This summer, I worked on a reproducibility project with Prof. Gunawi and several other students. The purpose of this work is to get a picture of research reproducibility in the current literature and to reveal interesting findings and the potential reasons behind them. My role was to do a thorough survey of SSD (solid-state drive) simulators, find papers that published their experiments, and then try to reproduce the results on the Chameleon testbed. We chose SSD simulators mainly because they are general software tools for facilitating research on hardware, and reproducibility is central to such software. The major challenge was the literature review, where we manually searched for papers that covered relevant topics and also published their experiments. The second challenge was to actually reproduce those experiments using Chameleon.
On approaching the research challenge: The approach is simple. We first located the paper that presented each simulator, then traced back to all the papers that cited it. We looked into two popular SSD simulators, SSDsim and FlashSim, and this approach narrowed the pool down to 600 papers that actually used or extended the simulators. From this survey, we found 8 papers (about 1%!) that published their experiments (mostly on GitHub), but only one that we were able to reproduce on Chameleon. The major reason the other 7 failed was poor maintenance of their published code repositories (e.g., compilation problems, incompatibility with newer kernel versions, and runtime errors). These percentages match those of a reproducibility study published long ago, which suggests that, most of the time, researchers still do not focus much on reproducibility.
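The survey workflow described above can be pictured as a simple filtering pipeline over the citation pool. The following Python sketch is purely illustrative: the `Paper` records, field names, and counts are made-up examples, not the team’s actual data or tooling (the real survey of ~600 papers was done manually).

```python
# Illustrative sketch of the survey pipeline: citation pool ->
# papers that used/extended the simulator -> papers with published artifacts.
# All records below are hypothetical examples, not real survey data.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Paper:
    title: str
    uses_simulator: bool          # did it actually use/extend SSDsim or FlashSim?
    artifact_url: Optional[str]   # link to published experiments, if any

def survey(pool):
    """Narrow the citation pool to papers that published their experiments."""
    used = [p for p in pool if p.uses_simulator]
    published = [p for p in used if p.artifact_url]
    return used, published

# A tiny hypothetical citation pool for demonstration.
pool = [
    Paper("Paper A", True, "https://github.com/example/paper-a"),
    Paper("Paper B", True, None),
    Paper("Paper C", False, None),
]
used, published = survey(pool)
print(len(used), len(published))  # 2 papers used the simulators, 1 published artifacts
```

In the real survey, the final step — attempting to reproduce each published artifact on Chameleon — is the part no filter can automate, which is where 7 of the 8 candidate papers failed.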
On testbed needs: The need here mostly depends on the experiments in the papers. For simulator-related research, I don’t think there are special feature needs for the platform as long as it provides basic computing nodes, given that the goal of simulators is to empower research in the absence of adequate hardware. Thus, the Chameleon features I used most were the Jupyter integration for experiment orchestration and experiment packaging for publishing. The Jupyter integration provides a great interface for orchestrating the experiment flow, making complicated research results easy to reproduce in the form of Jupyter notebooks.
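As a rough illustration of what notebook-driven orchestration looks like, each stage of an experiment (build, run, collect results) can be a cell that a reader re-executes top to bottom. The helper and the command below are hypothetical placeholders, not the actual steps of the published package.

```python
# Minimal sketch of orchestrating an experiment step from a notebook cell.
# The command and its output here are placeholders for demonstration,
# not the real build/run steps of the tinyTailFlash package.
import subprocess

def run_step(cmd):
    """Run one experiment step and fail loudly if it errors."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"step failed: {' '.join(cmd)}\n{result.stderr}")
    return result.stdout

# In a notebook, successive cells would call run_step() for each stage
# (e.g., compile the simulator, run a workload, parse the output),
# so the whole flow is reproducible by re-running the notebook.
output = run_step(["echo", "simulated latency: 42 us"])
print(output.strip())
```

Packaging such a notebook (plus its environment) is what Chameleon’s experiment sharing then makes reproducible for others.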
On the role of reproducibility in science: I think the role is to greatly inspire new ideas. Aside from well-written papers, repeatable experiments are another (arguably the best) way to precisely convey the ideas, designs, and arguments in published research, thus making it much easier for new ideas to take hold.
On the greatest obstacles for repeatable research: Based on my experience with the 600 papers I surveyed, I would say the greatest obstacles are that 1) most papers do not publish their experiments along with the paper; and 2) those that do often maintain their GitHub repos poorly, which makes their experiments obsolete rather quickly.
On obstacles that stand in the way of systems experimentation: Given that my research interacts closely with various storage devices, this kind of situation arises when we don’t have a specific type of device we need (such as Intel Optane or open-channel SSDs). COVID also sometimes makes it hard for us to physically access the devices when needed, and I think this is where Chameleon can help. For example, one of my previous projects required constant physical access to the machine because we needed to swap SSD devices in and out due to limits on our hardware supply. On Chameleon, we could just configure a few nodes differently to meet our various needs, instantly and painlessly.
On his most powerful piece of advice for students beginning research: Failures can destroy your will, while success may blind your eyes. Keep your equanimity!
Where to find artifacts related to his experiment:
The tinyTailFlash package has been published to Chameleon (it’s a self-explanatory Jupyter notebook): https://www.chameleoncloud.org/experiment/share/4
Reproducing the results is expected to take about 30 minutes.
The original paper corresponding to the package is here: https://www.usenix.org/system/files/conference/fast17/fast17-yan.pdf