All About Traces

March 9, 2020
Zhuo Zhen

Tips and Tricks

Characterizing data center workloads is critical for resource management systems. To support resource management studies, several traces and cluster data logs have been released. The following table lists the major workloads that are publicly available today.

Source Type	Name	Owner Organization	Latest Version	First Release Year
Commercial Cloud	Alibaba Cluster Trace	Alibaba	v2018	2017
Commercial Cloud	Azure VM Workload	Microsoft	v2	2017
Commercial Cloud	Google Cluster Workload Traces	Google	--	2011
HPC	Parallel Workloads Archive (PWA)	The Hebrew University of Jerusalem (Israel)	--	2014
HPC	Grid Workloads Archive (GWA)	Delft University of Technology (TU Delft) (Netherlands)	--	2007
HPC	LANL Mustang Cluster	ATLAS project at Carnegie Mellon University and Los Alamos National Laboratory (LANL)	--	2018
HPC	LANL Trinity Cluster	ATLAS project at Carnegie Mellon University and Los Alamos National Laboratory (LANL)	--	2018
Private Cloud	Two Sigma	Two Sigma	--	TBD
Private Cloud	Statistical Workload Injector for MapReduce (SWIM)	The University of California Berkeley	--	2011

Even so, publicly available cluster workload traces remain scarce relative to the growing number of studies on scheduling and resource management.

Chameleon has been configured using an adaptation of a mainstream open source infrastructure cloud system called OpenStack. OpenStack is widely used by other testbeds, such as JetStream, as well as by academic, research, and government organizations around the world. A trace generator tool for OpenStack would encourage OpenStack users to share their traces and would therefore benefit the many research projects that need real-life workload traces.

There are two major concerns when sharing traces: one from the trace user and one from the trace provider. Researchers who use the traces usually ask “is this trace usable?” or “does it contain enough information for my experiment?”, while trace providers worry about whether confidential information has been excluded or hidden from the shared trace. To address the concerns of both users and providers, we developed a trace generator tool that extracts the appropriate data from the OpenStack databases and anonymizes certain fields. In addition, Chameleon has released both virtual machine and bare-metal traces generated with the trace generator and published them at sciencecloud.org.

What’s Included?

You can generate two tables with the trace generator: machine events and instance events.

The machine events table contains information about the physical hosts. The script captures five types of machine events: when the machine is created, enabled, updated, disabled, and deleted. Along with the event type and timestamp, the machine events table also contains the name of the machine and the machine properties.

The instance events table contains information about user-created instances (virtual machines or bare-metal nodes). The event column of the table is extracted directly from the event column of the instance_actions_events table of the OpenStack Nova database. The events indicate when an instance is started, terminated, rebuilt, paused, suspended, etc. For each instance event, we extract two timestamps, the event start time and the event end time, along with the result of the event (succeeded or failed). Other information includes the name/ID of the instance, the instance owner (user and project), and the instance properties.

The two tables can be cross-referenced through the physical host column. For the detailed column lists of the tables, see the cloud trace format page at sciencecloud.org.
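As a rough illustration of how the two tables can be cross-referenced through the physical host, the sketch below looks up the most recent machine event for the host of each instance event. The column names (host_name, event_time, start_time, and so on), the event values, and the helper latest_machine_event are all hypothetical and chosen for readability; the actual columns are listed on the cloud trace format page.

```python
# Hypothetical rows: column names and values are illustrative only,
# not the published trace format.
machine_events = [
    {"event_time": 0, "host_name": "host-1", "event_type": "CREATE"},
    {"event_time": 5, "host_name": "host-1", "event_type": "ENABLE"},
    {"event_time": 2, "host_name": "host-2", "event_type": "CREATE"},
]

instance_events = [
    {"instance_id": "inst-a", "event": "start", "start_time": 10,
     "finish_time": 12, "result": "Success", "host_name": "host-1"},
    {"instance_id": "inst-b", "event": "start", "start_time": 1,
     "finish_time": 3, "result": "Error", "host_name": "host-2"},
]


def latest_machine_event(machine_rows, host, at_time):
    """Return the most recent machine event for `host` at or before
    `at_time`, or None if the host has no event by then."""
    candidates = [
        row for row in machine_rows
        if row["host_name"] == host and row["event_time"] <= at_time
    ]
    return max(candidates, key=lambda row: row["event_time"], default=None)


# Attach the host's latest known event type to each instance event.
joined = []
for event in instance_events:
    host_event = latest_machine_event(
        machine_events, event["host_name"], event["start_time"]
    )
    host_state = host_event["event_type"] if host_event else None
    joined.append((event["instance_id"], event["host_name"], host_state))
```

With these sample rows, inst-a is matched to the ENABLE event of host-1, while inst-b gets no host state because host-2 has no machine event at or before time 1.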

What’s Hidden?

For confidentiality reasons, trace providers may not want to expose information such as user/project IDs or physical host names. Trace users, in turn, might not need to know the actual IDs, but they might want to know which instances belong to the same user when running their simulations. The trace generator lets you anonymize certain fields using different anonymization techniques. For fields such as instance ID, instance name, user ID, project ID, and physical host name, a keyed cryptographic hash is applied. Because the same input always produces the same hash under a given key, instances owned by the same user can still be grouped together. We also mask the rack information of a physical host using an ordering technique: the list of observed unique rack values is sorted, sequential numbers starting at 0 are assigned to the items of the list, and the observed values are mapped onto these numbers.
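The two anonymization techniques can be sketched in a few lines of Python. This is a minimal illustration and not the trace generator's own code: HMAC-SHA256 is assumed here as one example of a keyed cryptographic hash, and the function names, the key, and the sample values are all hypothetical.

```python
import hashlib
import hmac


def anonymize_id(value: str, key: bytes) -> str:
    """Keyed hash of an identifier (HMAC-SHA256 is an assumption here).
    The same value and key always yield the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_racks(observed_racks):
    """Ordering technique: sort the unique rack values and map each
    one to its position in the sorted list, starting at 0."""
    unique_sorted = sorted(set(observed_racks))
    return {rack: index for index, rack in enumerate(unique_sorted)}


key = b"example-secret-key"  # hypothetical key; keep the real one private

# The same user maps to the same token, so ownership can still be grouped.
token_a = anonymize_id("user-alice", key)
token_b = anonymize_id("user-bob", key)

# Rack names are replaced by sequential numbers.
racks = ["rack-c", "rack-a", "rack-b", "rack-a"]
rack_map = mask_racks(racks)
masked = [rack_map[rack] for rack in racks]
```

The key matters because an unkeyed hash of a short, guessable identifier could be reversed by hashing candidate values; with a secret key, the tokens stay consistent inside one trace without being reproducible by others.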

About Usage

How to use the Chameleon traces?

The Chameleon traces can be downloaded from sciencecloud.org. For a quick start on downloading and using our cloud traces, we provide an example Jupyter Notebook that you can upload to the Chameleon Jupyter server.

How to use the trace generator?

Generating traces with our trace generator tool is easy if you use the OpenStack compute service (Nova). Simply download the scripts from our GitHub repository and follow the instructions for setup, configuration, and execution.

It would be a huge encouragement to us to see your work using our cloud traces, and we also encourage research groups and cloud testbed providers to share their cloud traces with us. You can find more information about the Chameleon traces and how to share your traces with us at sciencecloud.org. We look forward to hearing from you!
