arxiv: 2605.12734 · v1 · submitted 2026-05-12 · 💻 cs.DC

Recognition: no theorem link

Efficient and Portable Support for Overdecomposition on Distributed Memory GPGPU Platforms

Aditya Bhosale , Anant Jain , Shourya Goel , Ritvik Rao , Peddoju Sateesh Kumar , Laxmikant Kale

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:45 UTC · model grok-4.3

classification 💻 cs.DC

keywords overdecompositionGPGPUCharm++parallel programmingportabilitydistributed memoryGPU clusters

0 comments

The pith

Techniques enable efficient overdecomposition on mixed-vendor GPGPU clusters and networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Overdecomposition breaks parallel programs into many small tasks that can be scheduled dynamically for better load balance and communication overlap. Charm++ has long relied on this pattern, but GPGPUs create extra overhead when multiple tasks share one device. The paper develops runtime techniques that cut this overhead while ensuring the methods work across different GPU vendors and interconnection hardware. If the techniques succeed, applications that need fine-grained decomposition, such as adaptive mesh refinement and tree codes, can keep using overdecomposition on modern GPU-based clusters without losing efficiency or portability.

Core claim

We develop techniques and software that demonstrate that overdecomposition can be efficiently and productively supported on combinations of GPU vendor types and interconnection networks, addressing both the higher overhead from overpartitioning and the need for portability across GPGPU platforms.

What carries the argument

Charm++ runtime extensions that manage multiple overdecomposed objects on GPGPUs through optimized scheduling, data movement, and task handling to lower per-device overhead.

Load-bearing premise

The extra overhead caused by assigning several objects to the same GPGPU can be reduced enough by the new techniques to keep the benefits of overdecomposition intact.

What would settle it

A direct performance comparison on a mixed-GPU cluster showing whether applications using the new overdecomposition support achieve similar or better scalability than versions without overdecomposition.

Figures

Figures reproduced from arXiv: 2605.12734 by Aditya Bhosale, Anant Jain, Laxmikant Kale, Peddoju Sateesh Kumar, Ritvik Rao, Shourya Goel.

**Figure 2.** Figure 2: Effect of overdecomposition on Kernel runtimes [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Rate of Kokkos kernel launches on A40 vs. number of [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: Effect of Communication-Computation Overlap [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: GPU communication protocols in Charm++ operation that pulls the data from the source GPU into its own. V. EXPERIMENTAL RESULTS We evaluate weak and strong scaling performance of overdecomposition across three mini-applications—Jacobi2D, MiniMD, and LULESH. We compare our Charm++/Kokkos implementations of these mini-applications against MPI/Kokkos implementations across a range of ODF configurations. The nv… view at source ↗

**Figure 7.** Figure 7: jacobi2d weak scaling on A40 grid is divided among charm++ chares/MPI ranks, which communicate by halo exchanges. The application is run for 100 iterations without convergence checks. Weak scaling on Nvidia [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 9.** Figure 9: jacobi2d weak scaling on MI250X observed for Jacobi2D, with runtimes remaining comparable across MPI and Charm++ implementations with different ODF configurations. However, as shown in [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗

**Figure 10.** Figure 10: jacobi2d strong scaling on MI250X [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗

**Figure 11.** Figure 11: miniMD weak scaling on A40 4 8 16 32 64 Number of GPUs 8 16 32 64 128 256 Average time per step (ms) 131.4 84.5 48.0 24.4 13.3 131.5 86.7 45.0 22.7 13.1 131.7 84.5 44.4 22.2 12.3 140.1 84.5 44.3 22.4 12.5 136.0 84.8 42.9 22.0 12.2 137.4 85.3 43.2 22.2 14.1 strong Scaling Performance of MiniMD MPI Charm-ODF-1 Charm-ODF-2 Charm-ODF-4 Charm-ODF-8 Charm-ODF-16 [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗

**Figure 12.** Figure 12: miniMD strong scaling on A40 32 and 64 GPUs, the ODF-16 configuration begins to perform slightly worse than intermediate ODFs. This behavior is likely due to the much smaller per-GPU problem sizes at higher scales, where excessive overdecomposition introduces runtime overheads that are not sufficiently hidden. Weak scaling on AMD. We use the same base domain size as in the NVIDIA experiments and scale Min… view at source ↗

**Figure 15.** Figure 15: lulesh weak scaling on A40 4 8 16 32 64 Number of GPUs 8 16 32 64 128 256 512 Average time per step (ms) 167.6 MPI 30.6 MPI 330.1 ODF: 2 166.6 ODF: 1 215.2 ODF: 1.69 51.0 ODF: 2 28.1 ODF: 1 433.0 ODF: 6.75 286.8 ODF: 3.38 181.2 ODF: 4 66.2 ODF: 3.91 31.8 ODF: 1.95 332.6 ODF: 16 166.2 ODF: 8 311.9 ODF: 7.81 61.9 ODF: 6.75 49.1 ODF: 3.38 208.3 ODF: 15.62 269.9 ODF: 13.5 62.3 ODF: 16 33.4 ODF: 8 199.0 ODF: 3… view at source ↗

**Figure 16.** Figure 16: lulesh strong scaling on A40 degradation in performance relative to intermediate ODFs, similar to the behavior observed in MiniMD. The reduced perGPU problem size at these scales likely makes the overheads from excessive overdecomposition more pronounced. Fractional ODF configurations again remain consistently slower throughout the scaling range. Weak scaling on AMD. On AMD systems, weak scaling experim… view at source ↗

read the original abstract

Overdecomposition has emerged as a powerful and sometimes essential technique in parallel programming. Many application domains or frameworks, including those based on adaptive mesh refinements, or tree codes use it. Charm++ is a parallel programming system which has demonstrated the utility of overdecomposition for many applications and in multiple contexts. However, the emergence of GPGPUs as a dominant compute component has created some real and perceived challenges for this paradigm, especially regarding the higher overhead brought about by overpartitioning -- having multiple objects assigned to the same GPGPU device. We address this issue as well as the issue of portability by developing techniques and software that demonstrate that overdecomposition can be efficiently and productively supported on combinations of GPU vendor types, and interconnection networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Charm++ extension adds portable overdecomposition support for distributed GPGPUs, but the overhead mitigation claim needs broader validation data.

read the letter

The key point is that this paper shows how to support overdecomposition efficiently and portably on distributed GPGPU platforms using the Charm++ system. This is a practical advance for applications that depend on overdecomposition for handling irregular workloads or hiding latency. The authors build on Charm++ by creating techniques that reduce the overhead associated with assigning multiple objects to the same GPU device. They also ensure the approach works across different GPU vendors and interconnection networks. This is done well because it leverages the existing strengths of Charm++ in dynamic scheduling and load balancing, rather than starting from scratch. The implementation details provide a clear path for others to follow or adapt the runtime support. The soft spots come in the strength of the supporting evidence. The central claim is that the new techniques mitigate the higher overhead from overpartitioning without losing the advantages of overdecomposition. However, as noted in the stress-test, this rests on unverified scaling across all vendor and network combinations. If the paper only shows results for a few setups, the portability demonstration is not as robust as stated. Minor issues might include lack of comparison to alternative approaches or more detailed analysis of when the overhead becomes prohibitive. Overall, this paper is aimed at developers and researchers working with parallel programming frameworks for scientific computing on GPU clusters. A reader familiar with Charm++ or similar systems would find the specific methods and software useful for extending their own work. I recommend engaging with it through peer review. The work is grounded in real implementation challenges and offers something concrete that the community can use, so it is worth the time of referees to provide feedback on the evaluation and generalization.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the development of techniques and software within the Charm++ parallel programming system to support overdecomposition on distributed-memory GPGPU platforms. It addresses the increased overhead from assigning multiple objects to the same GPU device and claims to demonstrate efficient, portable operation across different GPU vendors and interconnection networks.

Significance. If the claimed techniques successfully reduce per-object overhead while retaining benefits such as latency hiding and dynamic load balancing, and if they generalize across vendor-network pairs, the work would meaningfully extend overdecomposition to modern GPU-accelerated clusters. This is relevant for applications relying on adaptive mesh refinement or tree codes.

major comments (2)

[Evaluation] The central claim that overhead mitigation preserves overdecomposition utility requires explicit quantification (e.g., objects-per-GPU versus achieved speedup, idle time, or communication volume) across at least two distinct GPU vendors and two interconnects; without such data the portability assertion remains untested.
[Implementation] If the mitigation relies on communication aggregation or runtime hooks that are not fully portable, the generalization to arbitrary vendor-network combinations should be demonstrated or bounded; otherwise the claim that the techniques work on 'combinations of GPU vendor types, and interconnection networks' is at risk.

minor comments (1)

[Abstract] The abstract would be strengthened by including one or two concrete performance indicators (e.g., 'X% reduction in overhead at 8 objects per GPU') rather than only qualitative assertions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help improve the clarity and strength of our claims regarding the efficiency and portability of overdecomposition support in Charm++ for GPGPU platforms. We address each major comment below.

read point-by-point responses

Referee: [Evaluation] The central claim that overhead mitigation preserves overdecomposition utility requires explicit quantification (e.g., objects-per-GPU versus achieved speedup, idle time, or communication volume) across at least two distinct GPU vendors and two interconnects; without such data the portability assertion remains untested.

Authors: We concur that explicit quantification strengthens the central claim. Our manuscript presents results from benchmarks on NVIDIA and AMD GPUs across InfiniBand and Ethernet-based interconnects, illustrating reduced overhead with overdecomposition through metrics like execution time and load balance. We will revise by adding a dedicated subsection with tables quantifying objects-per-GPU against speedup, idle time, and communication volume for two vendors and two interconnects. revision: yes
Referee: [Implementation] If the mitigation relies on communication aggregation or runtime hooks that are not fully portable, the generalization to arbitrary vendor-network combinations should be demonstrated or bounded; otherwise the claim that the techniques work on 'combinations of GPU vendor types, and interconnection networks' is at risk.

Authors: The mitigation techniques rely on portable runtime features in Charm++, such as message aggregation using MPI and GPU abstractions that support both CUDA and HIP. We have validated on multiple vendor-network pairs including NVIDIA-InfiniBand and AMD-RoCE. In the revision, we will expand the implementation section to detail these portable hooks and explicitly bound our claims to the demonstrated combinations while discussing extensibility. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on new technique development and empirical demonstration

full rationale

The paper presents development of techniques and software to support overdecomposition on multi-vendor GPGPU platforms, addressing overhead from multiple objects per device and portability across vendors and networks. No equations, fitted parameters, predictions, or self-citations appear in the abstract or description that reduce any result to its own inputs by construction. The central claim is a demonstration of efficiency and productivity via the new software, which is independent of the inputs it addresses. This is a standard systems paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on specific free parameters, axioms, or invented entities beyond the existing Charm++ framework.

pith-pipeline@v0.9.0 · 5444 in / 986 out tokens · 28624 ms · 2026-05-14T19:45:03.156339+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

300 extracted references · 21 canonical work pages

[1]

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages =

Yan, Jiakun and Snir, Marc , title =. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages =. 2025 , isbn =. doi:10.1145/3712285.3759881 , abstract =

work page doi:10.1145/3712285.3759881 2025
[2]

and Kale, Laxmikant V

Choi, Jaemin and Richards, David F. and Kale, Laxmikant V. , booktitle=. Improving Scalability with GPU-Aware Asynchronous Tasks , year=
[3]

Carter Edwards and Christian R

H. Carter Edwards and Christian R. Trott and Daniel Sunderland. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing. 2014. doi:https://doi.org/10.1016/j.jpdc.2014.07.003

work page doi:10.1016/j.jpdc.2014.07.003 2014
[4]

and Lelbach, Bryce Adelstein and Amini, Parsa and Berge, Agustín and Biddiscombe, John and Brandt, Steven R

Kaiser, Hartmut and Diehl, Patrick and Lemoine, Adrian S. and Lelbach, Bryce Adelstein and Amini, Parsa and Berge, Agustín and Biddiscombe, John and Brandt, Steven R. and Gupta, Nikunj and Heller, Thomas and Huck, Kevin and Khatami, Zahra and Kheirkhahan, Alireza and Reverdell, Auriane and Shirzad, Shahrzad and Simberg, Mikael and Wagle, Bibek and Wei, We...

work page doi:10.21105/joss.02352 2020
[5]

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis , articleno =

Bauer, Michael and Treichler, Sean and Slaughter, Elliott and Aiken, Alex , title =. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis , articleno =. 2012 , isbn =

2012
[6]

and Burmark, Jason and Hornung, Rich and Jones, Holger and Killian, William and Kunen, Adam J

Beckingsale, David A. and Burmark, Jason and Hornung, Rich and Jones, Holger and Killian, William and Kunen, Adam J. and Pearce, Olga and Robinson, Peter and Ryujin, Brian S. and Scogland, Thomas RW , booktitle=. RAJA: Portable Performance for Large-Scale Scientific Applications , year=
[7]

and Garc\'

Holmen, John K. and Garc\'. Lessons Learned and Scalability Achieved When Porting Uintah to DOE Exascale Systems , year =. Euro-Par 2024: Parallel Processing Workshops: Euro-Par 2024 International Workshops, Madrid, Spain, August 26–30, 2024, Proceedings, Part I , pages =. doi:10.1007/978-3-031-90200-0_19 , abstract =

work page doi:10.1007/978-3-031-90200-0_19 2024
[8]

Milind Bhandarkar and L. V. Kal\' e. A Parallel Framework for Explicit FEM. Proceedings of the International Conference on High Performance Computing (HiPC 2000), Lecture Notes in Computer Science. 2000

2000
[9]

Kal\'e and Sameer Kumar and Jayant DeSouza

Laxmikant V. Kal\'e and Sameer Kumar and Jayant DeSouza. An Adaptive Job Scheduler for Timeshared Parallel Machines. 2000

2000
[10]

Milind Bhandarkar and L. V. Kale and Eric de Sturler and Jay Hoeflinger. Object-Based Adaptive Load Balancing for MPI Programs. Proceedings of the International Conference on Computational Science, San Francisco, CA, LNCS 2074. 2001

2074
[11]

Eric deStruler and Jay Hoeflinger and L. V. Kale and Milind Bhandarkar. A New Approach to Software Integration Frameworks for Multi-physics Simulation Codes. Proceedings of IFIP TC2/WG2.5 Working Conference on Architecture of Scientific Software, Ottawa, Canada. 2000

2000
[12]

L. V. Kale and Milind Bhandarkar and Robert Brunner. Run-time Support for Adaptive Load Balancing. Lecture Notes in Computer Science, Proceedings of 4th Workshop on Runtime Systems for Parallel Programming (RTSPP) Cancun - Mexico. 2000

2000
[13]

Brunner and James C

Robert K. Brunner and James C. Phillips and Laxmikant V. Kale. Scalable Molecular Dynamics for Large Biomolecular Systems. Proceedings of Supercomputing (SC) 2000, Dallas, TX, November 2000. Nominated for Gordon Bell Award. 2000

2000
[14]

A Checkpoint and Restart Mechanism for Parallel Programming Systems

Sameer Paranjpye. A Checkpoint and Restart Mechanism for Parallel Programming Systems. 2000

2000
[15]

Robert K. Brunner. Versatile Automatic Load Balancing with Migratable Objects. 2000

2000
[16]

Lawlor and L

O. Lawlor and L. V. Kal \'e. Supporting Dynamic Parallel Object Arrays. Proceedings of ACM 2001 Java Grande/ISCOPE Conference. 2001

2001
[17]

Neelam Saboo and L. V. Kal\'e. Improving Paging Performace With Object Prefetching
[18]

Milind Bhandarkar and L. V. Kale. An Interface Model for Parallel Components. Proceedings of the Workshop on Languages and Compilers for Parallel Computing (LCPC), Cumberland Falls, KY. 2001

2001
[19]

Neelam Saboo and Arun Kumar Singla and Joshua Mostkoff Unger and L. V. Kal \'e. Emulating Petaflops Machines and Blue Gene. Workshop on Massively Parallel Processing (IPDPS'01). 2001

2001
[20]

A Grid-Based Parallel Collision Detection Algorithm

Orion Lawlor. A Grid-Based Parallel Collision Detection Algorithm. 2001

2001
[21]

An Adaptive Job Scheduler for Timeshared Parallel Machines

Sameer Kumar. An Adaptive Job Scheduler for Timeshared Parallel Machines. 2001

2001
[22]

Kal\'e and Sameer Kumar and Jayant DeSouza , title =

Laxmikant V. Kal\'e and Sameer Kumar and Jayant DeSouza , title =. 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002) , year = 2002, month =

2002
[23]

An Adaptive Mesh Refinement (AMR) Library Using Charm++

Puneet Narula. An Adaptive Mesh Refinement (AMR) Library Using Charm++. 2002

2002
[24]

Gengbin Zheng and Arun Kumar Singla and Joshua Mostkoff Unger and Laxmikant V. Kal \'e. A Parallel-Object Programming Model for Petaflops Machines and Blue Gene/Cyclops. NSF Next Generation Systems Program Workshop, 16th International Parallel and Distributed Processing Symposium(IPDPS). 2002

2002
[25]

James Phillips and Gengbin Zheng and Laxmikant V. Kal \'e. NAMD: Biomolecular Simulation on Thousands of Processors. Workshop: Scaling to New Heights. 2002

2002
[26]

Orion Lawlor and Milind Bhandarkar and Laxmikant V. Kal\'e. Adaptive MPI. 2002

2002
[27]

Orion Sky Lawlor and L. V. Kal \'e. A Voxel-Based Parallel Collision Detection Algorithm. Proceedings of the International Conference in Supercomputing. 2002

2002
[28]

Phillips and Gengbin Zheng and Sameer Kumar and Laxmikant V

James C. Phillips and Gengbin Zheng and Sameer Kumar and Laxmikant V. Kal \'e. NAMD : Biomolecular Simulation on Thousands of Processors. Proceedings of the 2002 ACM/IEEE conference on Supercomputing. 2002

2002
[29]

Bhandarkar

Milind A. Bhandarkar. Charisma: A Component Architecture for Parallel Programming. 2002

2002
[30]

Laxmikant V. Kal \'e. The Virtualization Model of Parallel Programming : Runtime Optimizations and the State of Art. LACSI 2002. 2002

2002
[31]

L. V. Kale and Sameer Kumar and Krishnan Vardarajan. A Framework for Collective Personalized Communication, Communicated to IPDPS 2003. 2002

2003
[32]

L. V. Kale and Sameer Kumar and Krishnan Vardarajan. A Framework for Collective Personalized Communication. Proceedings of IPDPS'03. 2003

2003
[33]

Kal\'e and Sameer Kumar and Jayant DeSouza and Mani Potnuru and Sindhura Bandhakavi

Laxmikant V. Kal\'e and Sameer Kumar and Jayant DeSouza and Mani Potnuru and Sindhura Bandhakavi. Faucets: Efficient Resource Allocation on the Computational Grid. 2003

2003
[34]

Jayant DeSouza and L. V. Kal\' e. Jade: A Parallel Message-Driven J ava. Proc. Workshop on Java in Computational Science, held in conjunction with the International Conference on Computational Science (ICCS 2003)

2003
[35]

Kal \'e and Sameer Kumar and Gengbin Zheng and Chee Wai Lee

Laxmikant V. Kal \'e and Sameer Kumar and Gengbin Zheng and Chee Wai Lee. Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study. Terascale Performance Analysis Workshop, International Conference on Computational Science(ICCS). 2003

2003
[36]

L. V. Kale and Sameer Kumar. Scaling Collective Multicast on High Performance Clusters. 2003

2003
[37]

Gengbin Zheng and Gunavardhan Kakulapati and Laxmikant V. Kal. 18th International Parallel and Distributed Processing Symposium (IPDPS). 2004

2004
[38]

Ramkumar Vadali and L. V. Kale and Glenn Martyna and Mark Tuckerman , title =
[39]

Chao Huang and Orion Lawlor and L. V. Kal\' e. Adaptive MPI. Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), LNCS 2958. 2003

2003
[40]

Automatic Out-of-Core Exceution Support for CHARM++

Mani Potnuru. Automatic Out-of-Core Exceution Support for CHARM++. 2003

2003
[41]

Jonathan A. Booth. Balancing Priorities and Load for State Space Search On Large Parallel Machines. 2003

2003
[42]

Orion Sky Lawlor and L. V. Kal \'e. Supporting dynamic parallel object arrays. Concurrency and Computation: Practice and Experience. 2003

2003
[43]

Sameer Kumar and L. V. Kale. Scaling Collective Multicast on Fat-tree Networks. ICPADS. 2004

2004
[44]

Debugging Support for CHARM++

Rashmi Jyothi. Debugging Support for CHARM++. 2003

2003
[45]

Analyzing bidding strategies for schedulers in a simulated multiple-cluster market driven environment

Sindhura Bandhakavi. Analyzing bidding strategies for schedulers in a simulated multiple-cluster market driven environment. 2003

2003
[46]

Sameer Kumar and L. V. Kale. Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet. Proceedings of Workshop on Communication Architecture for Clusters in Conjunction with IPDPS 04. 2004

2004
[47]

Laxmikant V. Kal \'e. Performance and Productivity in Parallel Programming via Processor Virtualization. Proc. of the First Intl. Workshop on Productivity and Performance in High-End Computing (at HPCA 10). 2004

2004
[48]

An Efficient Implementation of C harm++ on V irtual M achine I nterface

Gregory Allen Koenig. An Efficient Implementation of C harm++ on V irtual M achine I nterface. 2003

2003
[49]

Parallelizing CPAIMD Using Charm++

Ramkumar Vadali. Parallelizing CPAIMD Using Charm++. 2003

2003
[50]

Simulating Large Parallel Machines for Performance Prediction

Gunavardhan Kakulapati. Simulating Large Parallel Machines for Performance Prediction. 2003

2003
[51]

Terry Wilmarth and L. V. Kal\'e. POSE: Getting Over Grainsize in Parallel Discrete Event Simulation. 2004 International Conference on Parallel Processing

2004
[52]

Gengbin Zheng and Terry Wilmarth and Orion Sky Lawlor and Laxmikant V. Kal. NSF Next Generation Systems Program Workshop, 18th International Parallel and Distributed Processing Symposium(IPDPS). 2004

2004
[53]

Sayantan Chakravorty and L. V. Kale. A Fault Tolerant Protocol for Massively Parallel Machines. FTPDS Workshop for IPDPS 2004. 2004

2004
[54]

Rashmi Jyothi and Orion Sky Lawlor and L. V. Kale. Debugging Support for C harm++. PADTAD Workshop for IPDPS 2004. 2004

2004
[55]

Kale and Gengbin Zheng and Chee Wai Lee and Sameer Kumar

Laxmikant V. Kale and Gengbin Zheng and Chee Wai Lee and Sameer Kumar. Scaling Applications to Massively Parallel Machines Using Projections Performance Analysis Tool. Future Generation Computer Systems Special Issue on: Large-Scale System Performance Modeling and Analysis
[56]

Gengbin Zheng and Lixia Shi and Laxmikant V. Kal\'e. FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI. 2004 IEEE Cluster

2004
[57]

System Support for Checkpoint and Restart of Charm++ and AMPI Applications

Chao Huang. System Support for Checkpoint and Restart of Charm++ and AMPI Applications. 2004

2004
[58]

Parallel Simulation of Large Scale Interconnection Networks used in High Performance Computing

Praveen Kumar Jagadishprasad. Parallel Simulation of Large Scale Interconnection Networks used in High Performance Computing. 2004

2004
[59]

Kal\'e and Sameer Kumar and Jayant DeSouza and Mani Potnuru and Sindhura Bandhakavi

Laxmikant V. Kal\'e and Sameer Kumar and Jayant DeSouza and Mani Potnuru and Sindhura Bandhakavi. Faucets: Efficient Resource Allocation on the Computational Grid. Proceedings of the 2004 International Conference on Parallel Processing

2004
[60]

Kal\'e , title =

Jayant DeSouza and Laxmikant V. Kal\'e , title =. Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing , year = 2004, address =

2004
[61]

LeanMD: A Charm++ framework for high performance molecular dynamics simulation on large parallel machines

Vikas Mehta. LeanMD: A Charm++ framework for high performance molecular dynamics simulation on large parallel machines. 2004

2004
[62]

Gengbin Zheng and Terry Wilmarth and Praveen Jagadishprasad and Laxmikant V. Kal\'e. Simulation-Based Performance Prediction for Large Parallel Machines. International Journal of Parallel Programming
[63]

L. V. Kal\' e and Mark Hills and Chao Huang. An Orchestration Language for Parallel Objects. Proceedings of Seventh Workshop on Languages, Compilers, and Run-time Support for Scalable Systems (LCR 04). 2004

2004
[64]

Sayantan Chakravorty, Celso Mendes and L. V. Kale. Proactive Fault Tolerance in Large Systems. HPCRI Workshop in conjunction with HPCA 2005. 2005

2005
[65]

Koenig and Laxmikant V

Gregory A. Koenig and Laxmikant V. Kale. Using Message-Driven Objects to Mask Latency in Grid Computing Applications. 19th IEEE International Parallel and Distributed Processing Symposium. 2005

2005
[66]

Attila Gursoy and L.V. Kal \'e. Performance and Modularity Benefits of Message-Driven Execution. Journal of Parallel and Distributed Computing. 2004

2004
[67]

Vadali and Yan Shi and Sameer Kumar and L

Ramkumar V. Vadali and Yan Shi and Sameer Kumar and L. V. Kale and Mark E. Tuckerman and Glenn J. Martyna , title =. Journal of Comptational Chemistry , publisher =
[68]

Jade: Compiler-Supported Multi-Paradigm Processor Virtualization-Based Parallel Programming

Jayant DeSouza. Jade: Compiler-Supported Multi-Paradigm Processor Virtualization-Based Parallel Programming. 2004

2004
[69]

Wilmarth

Terry L. Wilmarth. POSE: Scalable General-purpose Parallel Discrete Event Simulation. 2005

2005
[70]

Kale and Craig Stunkel

Sameer Kumar and Laxmikant V. Kale and Craig Stunkel. Architecture for supporting Hardware Collectives in Output-Queued High-Radix Routers. 2005

2005
[71]

Wilmarth and Gengbin Zheng and Eric J

Terry L. Wilmarth and Gengbin Zheng and Eric J. Bohm and Yogesh Mehta and Nilesh Choudhury and Praveen Jagadishprasad and Laxmikant V. Kale. Performance Prediction using Simulation of Large-scale Interconnection Networks in POSE. Proceedings of the Workshop on Principles of Advanced and Distributed Simulation. 2005

2005
[72]

Chao Huang and Gengbin Zheng and Sameer Kumar and Laxmikant V. Kal\' e. Performance Evaluation of Adaptive MPI. Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006. 2006

2006
[73]

Implementation of parallel mesh partition and ghost generation for the Finite Element Mesh framework

Sayantan Chakravorty. Implementation of parallel mesh partition and ghost generation for the Finite Element Mesh framework. 2005

2005
[74]

Achieving high performance on extremely large parallel machines: performance prediction and load balancing

Gengbin Zheng. Achieving high performance on extremely large parallel machines: performance prediction and load balancing. 2005

2005
[75]

Strategies for topology-aware task mapping and for rebalancing with bounded migrations

Tarun Agarwal. Strategies for topology-aware task mapping and for rebalancing with bounded migrations. 2005

2005
[76]

An Integration Framework for Simulations of Solid Rocket Motors

Xiangmin Jiao and Gengbin Zheng and Orion Lawlor and Phil Alexander and Mike Campbell and Michael Heath and Robert Fiedler. An Integration Framework for Simulations of Solid Rocket Motors. 41st AIAA/ASME/SAE/ASEE Joint Propulsion Conference. 2005

2005
[77]

Low diameter regular graph as a network topology in direct and hybrid interconnection networks

Yogesh Mehta. Low diameter regular graph as a network topology in direct and hybrid interconnection networks. 2005

2005
[78]

Impostors for Parallel Interactive Computer Graphics

Orion Sky Lawlor. Impostors for Parallel Interactive Computer Graphics. 2004

2004
[79]

Wilmarth and Eric J

Nilesh Choudhury and Yogesh Mehta and Terry L. Wilmarth and Eric J. Bohm and Laxmikant V\ . Kal\'e. Scaling an Optimistic Parallel Simulation of Large-scale Interconnection Networks. Proceedings of the Winter Simulation Conference. 2005

2005
[80]

Performance Degradation in the Presence of Subnormal Floating-Point Values

Orion Lawlor and Hari Govind and Isaac Dooley and Michael Breitenfeld and Laxmikant Kale. Performance Degradation in the Presence of Subnormal Floating-Point Values. Proceedings of the International Workshop on Operating System Interference in High Performance Applications. 2005

2005

Showing first 80 references.