Research Area

Correctness & Verification

You can't fix what you can't measure. Chunhua's correctness research builds the standardized benchmark suites, analysis tools, and evaluation frameworks that the HPC community needs to reproducibly assess and improve data race detection, auto-parallelization, and parallel program correctness.

📖 Overview

Parallel programs are notoriously difficult to get right. Data races, where two threads access the same memory location concurrently and at least one access is a write, are subtle, non-deterministic bugs that can corrupt results, cause crashes, or silently produce wrong answers in scientific simulations. Detecting and eliminating these bugs requires both powerful analysis tools and rigorous benchmarks that can evaluate how well those tools work. Chunhua's correctness research addresses both sides of this challenge.

DataRaceBench is a community benchmark suite specifically designed to evaluate data race detection tools. It contains hundreds of microbenchmarks covering a comprehensive range of race patterns in OpenMP programs, from simple shared-variable races to complex patterns involving reductions, atomic operations, and nested parallelism. By providing ground-truth annotations (each microbenchmark is labeled as having or not having a race), DataRaceBench enables apples-to-apples comparison of different detection tools and serves as a regression suite for tool developers. It has been adopted by research groups worldwide and cited in dozens of publications.

AutoParBench extends this philosophy to automatic parallelization: a benchmark framework for evaluating how well compilers and tools can automatically identify and parallelize sequential loops. More recently, Chunhua's group has explored whether LLMs can detect data races, a fascinating intersection of AI and program correctness. The SC-W 2023 paper found that current LLMs show surprising capability on simple race patterns but struggle with the complex, context-dependent races that occur in real HPC codes, highlighting both the promise and the current limits of AI-based program correctness analysis.

📄 Key Publications

SC-W 2023

Data Race Detection Using Large Language Models

Chunhua Liao et al.

The first systematic evaluation of LLMs as data race detectors, using DataRaceBench as the evaluation harness. Finds that GPT-4 achieves reasonable accuracy on simple races but misses subtle ones, pointing toward hybrid approaches combining compiler analysis with LLM reasoning. Published at the SC23 Workshop on Correctness.

2017–2022 SC / PPoPP / ISC

DataRaceBench: A Benchmark Suite for Evaluating Data Race Detection Tools

Chunhua Liao, Pei-Hung Lin, Joshua Asplund, et al. (LLNL)

The foundational DataRaceBench paper introducing the benchmark design philosophy, microbenchmark taxonomy, and evaluation methodology. The suite has been continuously expanded and is now the standard evaluation benchmark for OpenMP race detection tools in the research community.

2020–2022 CC / IPDPS

AutoParBench: A Unified Test Framework for OpenMP-Based Automatic Parallelization Compilers

Chunhua Liao et al. (LLNL)

Introduces AutoParBench, a benchmark framework for evaluating automatic parallelization tools. Provides ground-truth parallelization annotations and a correctness-checking methodology to reliably compare different auto-parallelization compilers on realistic loop patterns.

💻 Software & Tools

AutoParBench

Auto-Parallelization Benchmark

A unified benchmark framework for evaluating automatic parallelization compilers and tools. Contains loops from real scientific codes alongside synthetic patterns, with ground-truth parallelizability annotations and an automated correctness-checking infrastructure.

OpenMP · Auto-parallelization · Evaluation

💡 Impact & Insights

You can't fix what you can't measure. Standardized benchmarks don't just evaluate tools: they define what correctness means and create shared ground truth for a research community.
  • DataRaceBench's systematic taxonomy of race patterns ensures that tool evaluations are comprehensive and comparable, preventing the cherry-picking of easy test cases that would make tools look better than they are.
  • The ground-truth annotation approach is critical for scientific benchmarking: unlike running programs and observing crashes, explicit annotations enable precision/recall analysis of detection tool quality.
  • Community adoption of DataRaceBench has created a virtuous cycle: tool builders use it to find weaknesses, improve their tools, and contribute new patterns back to the suite, continuously raising the bar for the field.
  • The LLM race detection study (SC-W 2023) is important precisely because it reveals limits: understanding where AI fails at correctness analysis is as valuable as celebrating where it succeeds.
  • AutoParBench fills a gap in evaluation infrastructure for automatic parallelization: a tool that generates incorrect parallel code is worse than no tool at all, and rigorous benchmarking catches these failures.