Quantifying I/O and Communication Traffic Interference on Dragonfly Networks Equipped with Burst Buffers

Misbah Mubarak, Philip Carns, Jonathan Jenkins, Jianping Kelvin Li, Nikhil Jain, Shane Snyder, Robert Ross, Christopher D. Carothers, Abhinav Bhatele, Kwan-Liu Ma

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

HPC systems have shifted to burst buffer storage and high-radix interconnect topologies in order to meet the challenges of large-scale, data-intensive scientific computing. Both of these technologies have been studied in detail independently, but the interaction between them is not well understood. I/O traffic and communication traffic from concurrently scheduled applications may interfere with each other in unexpected ways, and this behavior may vary considerably depending on resource allocation, scheduling, and routing policies.

In this work, we analyze I/O and network traffic interference on burst-buffer-equipped dragonfly-based systems using the high-resolution packet-level simulations provided by the CODES storage and interconnect simulation framework. The analysis is performed using realistic I/O workload sizes, a variety of resource allocation and network routing strategies employed in production environments, and a dragonfly network configuration modeled after current vendor options. We analyze the impact of interference on both I/O and communication traffic.

We observe that although average network packet latency is stable across a wide variety of configurations, the maximum network packet latency in the presence of concurrent I/O traffic is highly sensitive to subtle policy changes. Our simulations reveal a worst-case single packet latency of 4,700 times the average latency for sub-optimal configurations. While a topology-aware mapping of compute nodes to burst buffer storage nodes can minimize the variation in maximum packet latency, it can slow down the I/O traffic by creating contention on the burst buffer nodes. Overall, balancing I/O and network performance requires careful selection of routing, data placement, and job placement policies.

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 204-215
Number of pages: 12
Volume: 2017-September
ISBN (Electronic): 9781538623268
DOI: https://doi.org/10.1109/CLUSTER.2017.25
State: Published - Sep 22 2017
Event: 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017 - Honolulu, United States
Duration: Sep 5 2017 to Sep 8 2017

Fingerprint

Buffer storage
Telecommunication traffic
Packet networks
Resource allocation
Topology
Natural sciences computing
Network routing
Network performance
Scheduling

Keywords

  • Burst buffer
  • Checkpoint
  • Discrete-event simulation
  • Dragonfly networks
  • I/O and communication traffic

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing

Cite this

Mubarak, M., Carns, P., Jenkins, J., Li, J. K., Jain, N., Snyder, S., ... Ma, K-L. (2017). Quantifying I/O and Communication Traffic Interference on Dragonfly Networks Equipped with Burst Buffers. In Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017 (Vol. 2017-September, pp. 204-215). [8048932] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CLUSTER.2017.25

@inproceedings{1ab58729c21e4a1db45a010dcf7ac4d6,
title = "Quantifying I/O and Communication Traffic Interference on Dragonfly Networks Equipped with Burst Buffers",
abstract = "HPC systems have shifted to burst buffer storage and high-radix interconnect topologies in order to meet the challenges of large-scale, data-intensive scientific computing. Both of these technologies have been studied in detail independently, but the interaction between them is not well understood. I/O traffic and communication traffic from concurrently scheduled applications may interfere with each other in unexpected ways, and this behavior may vary considerably depending on resource allocation, scheduling, and routing policies. In this work, we analyze I/O and network traffic interference on burst-buffer-equipped dragonfly-based systems using the high-resolution packet-level simulations provided by the CODES storage and interconnect simulation framework. The analysis is performed using realistic I/O workload sizes, a variety of resource allocation and network routing strategies employed in production environments, and a dragonfly network configuration modeled after current vendor options. We analyze the impact of interference on both I/O and communication traffic. We observe that although average network packet latency is stable across a wide variety of configurations, the maximum network packet latency in the presence of concurrent I/O traffic is highly sensitive to subtle policy changes. Our simulations reveal a worst-case single packet latency of 4,700 times the average latency for sub-optimal configurations. While a topology-aware mapping of compute nodes to burst buffer storage nodes can minimize the variation in maximum packet latency, it can slow down the I/O traffic by creating contention on the burst buffer nodes. Overall, balancing I/O and network performance requires careful selection of routing, data placement, and job placement policies.",
keywords = "Burst buffer, Checkpoint, Discrete-event simulation, Dragonfly networks, I/O and communication traffic",
author = "Misbah Mubarak and Philip Carns and Jonathan Jenkins and Li, {Jianping Kelvin} and Nikhil Jain and Shane Snyder and Robert Ross and Carothers, {Christopher D.} and Abhinav Bhatele and Kwan-Liu Ma",
year = "2017",
month = "9",
day = "22",
doi = "10.1109/CLUSTER.2017.25",
language = "English (US)",
volume = "2017-September",
pages = "204--215",
booktitle = "Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Quantifying I/O and Communication Traffic Interference on Dragonfly Networks Equipped with Burst Buffers

AU - Mubarak, Misbah

AU - Carns, Philip

AU - Jenkins, Jonathan

AU - Li, Jianping Kelvin

AU - Jain, Nikhil

AU - Snyder, Shane

AU - Ross, Robert

AU - Carothers, Christopher D.

AU - Bhatele, Abhinav

AU - Ma, Kwan-Liu

PY - 2017/9/22

Y1 - 2017/9/22

N2 - HPC systems have shifted to burst buffer storage and high-radix interconnect topologies in order to meet the challenges of large-scale, data-intensive scientific computing. Both of these technologies have been studied in detail independently, but the interaction between them is not well understood. I/O traffic and communication traffic from concurrently scheduled applications may interfere with each other in unexpected ways, and this behavior may vary considerably depending on resource allocation, scheduling, and routing policies. In this work, we analyze I/O and network traffic interference on burst-buffer-equipped dragonfly-based systems using the high-resolution packet-level simulations provided by the CODES storage and interconnect simulation framework. The analysis is performed using realistic I/O workload sizes, a variety of resource allocation and network routing strategies employed in production environments, and a dragonfly network configuration modeled after current vendor options. We analyze the impact of interference on both I/O and communication traffic. We observe that although average network packet latency is stable across a wide variety of configurations, the maximum network packet latency in the presence of concurrent I/O traffic is highly sensitive to subtle policy changes. Our simulations reveal a worst-case single packet latency of 4,700 times the average latency for sub-optimal configurations. While a topology-aware mapping of compute nodes to burst buffer storage nodes can minimize the variation in maximum packet latency, it can slow down the I/O traffic by creating contention on the burst buffer nodes. Overall, balancing I/O and network performance requires careful selection of routing, data placement, and job placement policies.

KW - Burst buffer

KW - Checkpoint

KW - Discrete-event simulation

KW - Dragonfly networks

KW - I/O and communication traffic

UR - http://www.scopus.com/inward/record.url?scp=85032614983&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85032614983&partnerID=8YFLogxK

U2 - 10.1109/CLUSTER.2017.25

DO - 10.1109/CLUSTER.2017.25

M3 - Conference contribution

AN - SCOPUS:85032614983

VL - 2017-September

SP - 204

EP - 215

BT - Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -