Date | Name | Domain | Focus | Keywords | Task Types | Metrics | Models | Citation | Specification Rating | Specification Reason | Dataset Rating | Dataset Reason | Metrics Rating | Metrics Reason | Reference Solution Rating | Reference Solution Reason | Documentation Rating | Documentation Reason |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2024-05-01 | Jet Classification | Particle Physics | Real-time classification of particle jets using HL-LHC simulation features | classification, real-time ML, jet tagging, QKeras | Classification | Accuracy, AUC | Keras DNN, QKeras quantized DNN | 1 | 8.0 | Classification is clearly defined for real-time inference on simulated LHC jets. Input features (HLFs) are documented, though exact latency and resource constraints are not numerically specified. | 9.0 | Two datasets (OpenML and Zenodo) are public, well-formatted, and documented; FAIR principles are followed, though richer metadata would raise confidence to a 10. | 9.0 | AUC and Accuracy are standard, quantitative, and well aligned with the goals of jet tagging and inference efficiency. | 8.0 | Float and quantized Keras/QKeras models are provided with results. Reproducibility is good, though full automation and documentation could be improved. | 8.0 | GitHub contains baseline code, data loaders, and references, but setup for deployment (e.g., FPGA pipeline) requires familiarity with the tooling. |
2024-05-01 | Irregular Sensor Data Compression | Particle Physics | Real-time compression of sparse sensor data with autoencoders | compression, autoencoder, sparse data, irregular sampling | Compression | MSE, Compression ratio | Autoencoder, Quantized autoencoder | 2 | 9.0 | Task is well defined (real-time compression of sparse, irregular sensor data using autoencoders); latency constraints are implied but not fully quantified. | 8.0 | Dataset is custom and synthetic but described well; FAIR compliance is partial (reusable and accessible, but not externally versioned with rich metadata). | 9.0 | Uses standard quantitative metrics (MSE, compression ratio) clearly aligned with compression and reconstruction goals. | 7.0 | Baseline (autoencoder and quantized variant) is provided, but the training/inference pipeline is minimally documented and needs user setup. | 8.0 | GitHub repo contains core components, but more structured setup instructions and pretrained weights would improve usability. |
2024-05-01 | Beam Control | Accelerators and Magnets | Reinforcement learning control of accelerator beam position | RL, beam stabilization, control systems, simulation | Control | Stability, Control loss | DDPG, PPO (planned) | 3, 4 | 8.0 | Task is clear (RL control of beam stability), with a BOOSTR-based simulator; control objectives are well motivated, but system constraints and reward structure are still under refinement. | 7.0 | BOOSTR dataset exists and is cited, but integration into the benchmark is in early stages; metadata and FAIR structure are limited. | 7.0 | Stability and control loss are mentioned, but metrics are not yet formalized with clear definitions or baselines. | 5.5 | DDPG baseline mentioned; PPO planned; implementation is still in progress with no reproducible results available yet. | 6.0 | GitHub has a defined structure but is incomplete; setup and execution instructions for training/evaluation are not fully established. |
2024-07-08 | Ultrafast jet classification at the HL-LHC | Particle Physics | FPGA-optimized real-time jet origin classification at the HL-LHC | jet classification, FPGA, quantization-aware training, Deep Sets, Interaction Networks | Classification | Accuracy, Latency, Resource utilization | MLP, Deep Sets, Interaction Network | 5 | 10.0 | Real-time jet origin classification under FPGA constraints is clearly defined, with explicit latency targets (~100 ns) and I/O formats. | 9.0 | Data available on Zenodo with a DOI, including constituent-level jets; accessible and well-documented, though not deeply versioned with full FAIR metadata. | 10.0 | Accuracy, latency, and hardware resource usage (LUTs, DSPs) are rigorously measured and aligned with real-time goals. | 9.0 | Includes models (MLP, Deep Sets, Interaction Networks) with quantization-aware training and synthesis results via hls4ml; reproducible but tightly coupled to specific toolchains. | 8.0 | Paper and code (via hls4ml) are sufficient, but a centralized, standalone repo for reproducing all models would enhance accessibility. |
2024-10-15 | Quench detection | Accelerators and Magnets | Real-time detection of superconducting magnet quenches using ML | quench detection, autoencoder, anomaly detection, real-time | Anomaly detection, Quench localization | ROC-AUC, Detection latency | Autoencoder, RL agents (in development) | | 8.0 | Task (quench detection via anomaly detection) is clearly described; multi-modal sensors, streaming rates, and objective are provided, but constraints (latency thresholds) are qualitative. | 7.0 | Custom dataset using real data from BNL; HDF5-formatted and structured, but access may be internal or limited, and it is not versioned for public FAIR use. | 8.0 | ROC-AUC and detection latency are defined; relevant and quantitative but not yet paired with benchmark baselines. | 6.0 | Autoencoder prototype exists; RL methods are in development; no fully reproducible pipeline is available yet. | 7.0 | Slides and GDocs outline results; implementation is in progress with limited setup/code release. |
2024-10-15 | DUNE | Particle Physics | Real-time ML for DUNE DAQ time-series data | DUNE, time-series, real-time, trigger | Trigger selection, Time-series anomaly detection | Detection efficiency, Latency | CNN, LSTM (planned) | 6 | 8.0 | Task (trigger-level anomaly detection) is clearly defined for low-latency streaming input, but the problem framing lacks complete architectural/system specs. | 6.0 | Internal DUNE SONIC data; not publicly released and no formal FAIR support; replicability is institutionally gated. | 7.0 | Metrics include detection efficiency and latency, which are relevant, but only lightly supported by baselines or formal eval scripts. | 5.0 | One CNN prototype demonstrated; LSTM planned. No public implementation or ready-to-run example yet. | 6.0 | Slides and some internal documentation exist, but no full pipeline or public GitHub repo yet. |
2025-01-08 | Intelligent experiments through real-time AI | Instrumentation and Detectors; Nuclear Physics; Particle Physics | Real-time FPGA-based triggering and detector control for sPHENIX and future EIC | FPGA, Graph Neural Network, hls4ml, real-time inference, detector control | Trigger classification, Detector control, Real-time inference | Accuracy (charm and beauty detection), Latency (µs), Resource utilization (LUT/FF/BRAM/DSP) | Bipartite Graph Network with Set Transformers (BGN-ST), GarNet (edge-classifier) | 7 | 10.0 | Task is clearly defined (triggering on rare events with sub-10 µs latency); architecture, constraints, and system context (FPGA, Alveo) are well detailed. | 7.0 | Simulated tracking data from sPHENIX and EIC; internally structured but not yet released in a public FAIR-compliant format. | 10.0 | Accuracy, latency, and hardware resource utilization (LUTs, DSPs) are clearly defined and used in evaluation. | 9.0 | Graph-based models (BGN-ST, GarNet) are implemented and tested on real hardware; reproducibility is possible with hls4ml, but full scripts are not bundled. | 8.0 | Paper is detailed and tool usage (FlowGNN, hls4ml) is described, but repo release and dataset access remain in progress. |
2025-01-09 | Neural Architecture Codesign for Fast Physics Applications | Physics; Materials Science; Particle Physics | Automated neural architecture search and hardware-efficient model codesign for fast physics applications | neural architecture search, FPGA deployment, quantization, pruning, hls4ml | Classification, Peak finding | Accuracy, Latency, Resource utilization | NAC-based BraggNN, NAC-optimized Deep Sets (jet) | 8 | 9.0 | Task (automated neural architecture search for real-time physics) is well formulated with clear latency, model compression, and deployment goals. | 6.0 | Internal Bragg and jet datasets used; not publicly hosted or FAIR-compliant, though mentioned in the paper. | 10.0 | BOP reduction, latency, and accuracy are all quantitatively evaluated. | 8.0 | NAC-generated models for Bragg peak and jet classification are described, but the pipeline requires integration of several tools and is not fully packaged. | 7.0 | NAC pipeline, hls4ml usage, and results are discussed; code (e.g., nac-opt) is referenced, but replication requires stitching together the toolchain and data. |
2024-06-24 | Smart Pixels for LHC | Particle Physics; Instrumentation and Detectors | On-sensor, in-pixel ML filtering for high-rate LHC pixel detectors | smart pixel, on-sensor inference, data reduction, trigger | Image Classification, Data filtering | Data rejection rate, Power per pixel | 2-layer pixel NN | 9 | 10.0 | Fully specified: describes the task (data filtering/classification), system design (on-sensor inference), latency (25 ns), and power constraints. | 8.0 | In-pixel charge cluster data used, but dataset release info is minimal; FAIR metadata/versioning is limited. | 9.0 | Data rejection rate and power per pixel are clearly defined and directly tied to hardware goals. | 9.0 | 2-layer NN implementation is evaluated in hardware; reproducible via the hls4ml flow with results in the paper. | 8.0 | Paper is clear; a Zenodo asset is referenced, but an additional GitHub or setup repo would improve reproducibility. |
2023-10-03 | HEDM (BraggNN) | Materials Science | Fast Bragg peak analysis using deep learning in diffraction microscopy | BraggNN, diffraction, peak finding, HEDM | Peak detection | Localization accuracy, Inference time | BraggNN | 10 | 9.0 | Peak localization task is well defined for diffraction images; input/output are described clearly, but no system constraints are given. | 8.0 | Simulated diffraction images provided; reusable and downloadable, but not externally versioned or FAIR-structured. | 9.0 | Inference speed and localization accuracy are standard and quantitatively reported. | 8.0 | BraggNN model and training pipeline exist, but need stitching together from separate repositories. | 8.0 | Paper and codebase are available and usable, though not fully turnkey. |
2023-12-03 | 4D-STEM | Materials Science | Real-time ML for scanning transmission electron microscopy | 4D-STEM, electron microscopy, real-time, image processing | Image Classification, Streamed data inference | Classification accuracy, Throughput | CNN models (prototype) | 11 | 7.0 | General task defined (real-time microscopy inference), but no standardized I/O format, latency constraint, or complete problem framing yet. | 0.0 | Dataset not provided or described in any formal way. | 6.0 | Mentions throughput and accuracy, but metrics are not formally defined or benchmarked. | 2.0 | Prototype CNNs described; no baseline or implementation released. | 5.0 | OpenReview paper and Gemini doc give some insight, but no working code, environment, or example. |
2023-12-05 | In-Situ High-Speed Computer Vision | Fusion/Plasma | Real-time image classification for in-situ plasma diagnostics | plasma, in-situ vision, real-time ML | Image Classification | Accuracy, FPS | CNN | 12 | 8.0 | Task (plasma diagnostic classification) and real-time deployment are described; system specs (FPS targets) are implied but not fully quantified. | 6.0 | Dataset is sensor stream-based but not shared or FAIR-documented. | 8.0 | FPS and classification accuracy are reported and relevant. | 7.0 | CNN model described and evaluated, but public implementation and benchmarks are not available yet. | 6.0 | Paper and Gemini doc exist, but full setup instructions and tools are still in progress. |
2020-01-01 | BenchCouncil AIBench | General | End-to-end AI benchmarking across micro, component, and application levels | benchmarking, AI systems, application-level evaluation | Training, Inference, End-to-end AI workloads | Throughput, Latency, Accuracy | ResNet, BERT, GANs, Recommendation systems | 13 | 9.0 | Evaluates AI at multiple levels (micro to end-to-end); tasks and workloads are clearly defined, though specific I/O formats and constraints vary. | 9.0 | Realistic datasets across diverse domains; FAIR structure for many components, but individual datasets may not all be versioned or richly annotated. | 9.0 | Latency, throughput, and accuracy clearly defined for end-to-end tasks; consistent across models and setups. | 8.0 | Reference implementations for several tasks exist, but setup across all tasks is complex and not fully streamlined. | 8.0 | Central documentation exists, with detailed component breakdowns; environment setup across platforms (e.g., hardware variations) can require manual adjustment. |
2020-01-01 | BenchCouncil BigDataBench | General | Big data and AI benchmarking across structured, semi-structured, and unstructured data workloads | big data, AI benchmarking, data analytics | Data preprocessing, Inference, End-to-end data pipelines | Data throughput, Latency, Accuracy | CNN, LSTM, SVM, XGBoost | 14 | 9.0 | Focused on structured/unstructured data pipelines; clearly defined tasks spanning analytics to AI; some scenarios lack hardware constraint modeling. | 9.0 | Built from 13 real-world sources; structured for realistic big data scenarios; partially FAIR-compliant with documented data motifs. | 9.0 | Covers data throughput, latency, and accuracy; quantitative and benchmark-ready. | 8.0 | Many pipeline and model examples provided using Hadoop/Spark/Flink; setup effort varies by task and platform. | 8.0 | Strong documentation with examples and task specifications; centralized support exists, but task-specific tuning may require domain expertise. |
2021-10-20 | MLPerf HPC | Cosmology, Climate, Protein Structure, Catalysis | Scientific ML training and inference on HPC systems | HPC, training, inference, scientific ML | Training, Inference | Training time, Accuracy, GPU utilization | CosmoFlow, DeepCAM, OpenCatalyst | 15 | 10.0 | Scientific ML tasks (e.g., CosmoFlow, DeepCAM) are clearly defined with HPC system-level constraints and targets. | 9.0 | Public scientific datasets (e.g., cosmology, weather) used consistently, though FAIR compliance of individual datasets varies slightly. | 10.0 | Training time, GPU utilization, and accuracy are all directly measured and benchmarked across HPC systems. | 9.0 | Reference implementations are available and actively maintained; HPC setup may require a domain-specific environment. | 9.0 | GitHub repo and papers provide detailed instructions; reproducibility supported across multiple institutions. |
2023-06-01 | MLCommons Science | Earthquake, Satellite Image, Drug Discovery, Electron Microscope, CFD | AI benchmarks for scientific applications including time-series, imaging, and simulation | science AI, benchmark, MLCommons, HPC | Time-series analysis, Image classification, Simulation surrogate modeling | MAE, Accuracy, Speedup vs simulation | CNN, GNN, Transformer | 16 | 9.0 | Diverse scientific tasks (earthquake, CFD, microscopy) with detailed problem statements and goals; system constraints are not uniformly applied. | 9.0 | Domain-specific datasets (e.g., microscopy, climate); mostly public and structured, but FAIR annotations are not always explicit. | 9.0 | Task-specific metrics (MAE, speedup, accuracy) are clear and reproducible. | 9.0 | Reference models (CNN, GNN, Transformer) provided with training/evaluation pipelines. | 9.0 | Well-documented, open-sourced, and maintained with examples; strong community support and reproducibility focus. |
2021-07-05 | LHC New Physics Dataset | Particle Physics; Real-time Triggering | Real-time LHC event filtering for anomaly detection using proton collision data | anomaly detection, proton collision, real-time inference, event filtering, unsupervised ML | Anomaly detection, Event classification | ROC-AUC, Detection efficiency | Autoencoder, Variational autoencoder, Isolation forest | 17 | 9.0 | Task is clearly defined: real-time anomaly detection from high-rate LHC collisions. Latency and bandwidth constraints are mentioned, though not numerically enforced. | 9.0 | Publicly available via Zenodo, with structured signal/background splits and rich metadata; nearly fully FAIR. | 9.0 | ROC-AUC and detection efficiency are clearly defined and appropriate for unsupervised anomaly detection. | 8.0 | Several baseline methods (autoencoder, VAE, isolation forest) are evaluated; runnable versions are available via community repos but not tightly bundled. | 8.0 | Paper and data documentation are clear, and the dataset is widely reused. Setup requires some manual effort to reproduce full pipelines. |
2023-07-17 | MLCommons Medical AI | Healthcare; Medical AI | Federated benchmarking and evaluation of medical AI models across diverse real-world clinical data | medical AI, federated evaluation, privacy-preserving, fairness, healthcare benchmarks | Federated evaluation, Model validation | ROC AUC, Accuracy, Fairness metrics | MedPerf-validated CNNs, GaNDLF workflows | 18 | 9.0 | Evaluation setting (federated clinical benchmarking) is well defined; I/O interfaces vary slightly by task but are standardized in the MedPerf platform. | 8.0 | Uses distributed, real-world clinical datasets across institutions; FAIR compliance varies across hospitals and data hosts. | 9.0 | ROC AUC, accuracy, and fairness metrics are explicitly defined and task-dependent; consistently tracked across institutions. | 8.0 | Validated CNNs and GaNDLF pipelines are used and shared via the MedPerf tool, but some implementations are abstracted behind the platform. | 9.0 | Excellent documentation across MedPerf, GaNDLF, and COFE; reproducibility handled via containerized flows and task templates. |
2024-10-28 | CaloChallenge 2022 | LHC Calorimeter; Particle Physics | Fast generative-model-based calorimeter shower simulation evaluation | calorimeter simulation, generative models, surrogate modeling, LHC, fast simulation | Surrogate modeling | Histogram similarity, Classifier AUC, Generation latency | VAE variants, GAN variants, Normalizing flows, Diffusion models | 19 | 10.0 | Simulation task (generative calorimeter showers) is clearly stated with multiple datasets, fidelity requirements, and performance constraints. | 9.5 | Public datasets available in multiple sizes and formats and well-documented, though not versioned. | 10.0 | Histogram similarity, classifier AUC, and generation latency are clearly defined and benchmarked across all submissions. | 9.0 | 31 model implementations submitted; some are public and reproducible, though others remain undocumented or private. | 9.0 | Paper, leaderboard, and Gemini doc are comprehensive; a unified repo or launchable baseline kit would push this to a 10. |
ongoing | Papers With Code (SOTA Platform) | General ML; All domains | Open platform tracking state-of-the-art results, benchmarks, and implementations across ML tasks and papers | leaderboard, benchmarking, reproducibility, open-source | Multiple (Classification, Detection, NLP, etc.) | Task-specific (Accuracy, F1, BLEU, etc.) | All published models with code | 20 | | | | | | | | | | |
2022-01-01 | Codabench | General ML; Multiple | Open-source platform for organizing reproducible AI benchmarks and competitions | benchmark platform, code submission, competitions, meta-benchmark | Multiple | Submission count, Leaderboard ranking, Task-specific metrics | Arbitrary code submissions | 21 | | | | | | | | | | |
2021-09-27 | Sabath (SBI-FAIR) | Systems; Metadata | FAIR metadata framework for ML-driven surrogate workflows in HPC systems | meta-benchmark, metadata, HPC, surrogate modeling | Systems benchmarking | Metadata completeness, FAIR compliance | N/A | 22 | 8.0 | The benchmark defines simulation-based inference (SBI) tasks clearly with FAIR principles applied to particle physics datasets. | 8.0 | Data is well-structured for SBI and publicly available with clear licensing. | 8.0 | Includes likelihood and posterior accuracy; metrics well-matched to SBI. | 7.0 | Baseline SBI models are implemented and reproducible. | 6.0 | GitHub repo includes code and instructions, but lacks full tutorials or walkthroughs. |
2022-10-13 | PDEBench | CFD; Weather Modeling | Benchmark suite for ML-based surrogates solving time-dependent PDEs | PDEs, CFD, scientific ML, surrogate modeling, NeurIPS | Supervised Learning | RMSE, boundary RMSE, Fourier RMSE | FNO, U-Net, PINN, Gradient-Based inverse methods | 23 | 9.0 | PDE tasks (forward/inverse) and I/O structures are clearly specified with detailed PDE context and constraints. | 10.0 | Hosted via DaRUS with a DOI; well-documented, versioned, and FAIR-compliant. | 9.0 | Uses RMSE variants and Fourier-based errors relevant to PDE solutions. | 10.0 | Baselines (FNO, U-Net, PINN) are implemented and ready to run; strong community adoption. | 9.0 | Clean GitHub with usage, dataset links, and tutorial notebooks. |
2024-12-03 | The Well | Biological systems; Fluid dynamics; Acoustic scattering; Astrophysical MHD | Foundation model + surrogate dataset spanning 16 physical simulation domains | surrogate modeling, foundation model, physics simulations, spatiotemporal dynamics | Supervised Learning | Dataset size, Domain breadth | FNO baselines, U-Net baselines | 24 | 8.0 | Clearly framed around surrogate learning across 16 domains, but not all tasks are formally posed or constrained in a unified benchmark protocol; the paper reports performance on NVIDIA H100. | 9.0 | FAIR-compliant physics simulation dataset, structured in HDF5 with unified metadata. | 7.0 | Metrics like dataset size and domain coverage are listed, but standardized quantitative model-evaluation metrics (e.g., RMSE, MAE) are not enforced. | 9.0 | FNO and U-Net baselines available; full benchmarking implementations pending the NeurIPS paper code release. | 10.0 | Site and GitHub offer a unified API, metadata standards, and dataset-loading tools; the NeurIPS paper adds detailed design context. |
2024-10-31 | LLM-Inference-Bench | LLM; HPC/inference | Hardware performance benchmarking of LLMs on AI accelerators | LLM, inference benchmarking, GPU, accelerator, throughput | Inference Benchmarking | Token throughput (tok/s), Latency, Framework-hardware mix performance | LLaMA-2-7B, LLaMA-2-70B, Mistral-7B, Qwen-7B | 25 | 9.0 | Benchmarks hardware performance of LLM inference across multiple platforms with well-defined input/output and platform constraints. | 7.0 | Uses structured log files and configs instead of conventional datasets; suitable for inference benchmarking. | 9.0 | Clear throughput, latency, and utilization metrics; a platform comparison dashboard enhances evaluation. | 8.0 | Includes reproducible scripts and example runs; models like LLaMA and Mistral are referenced with platform-specific configs. | 8.0 | GitHub contains clear instructions, platform details, and framework comparisons. |
2023-12-12 | SGLang Framework | LLM; Vision-Language | Fast serving framework for LLMs and vision-language models | LLM serving, vision-language, RadixAttention, performance, JSON decoding | Model serving framework | Tokens/sec, Time-to-first-token, Throughput gain vs baseline | LLaVA, DeepSeek, Llama | 26 | 8.0 | Framed as a model-serving tool rather than a benchmark, but includes benchmark configurations and real model tasks. | 6.0 | Mostly uses dummy configs or external model endpoints for evaluation; not designed around a formal dataset. | 8.0 | Well-defined serving metrics: tokens/sec, time-to-first-token, and gain over baselines. | 9.0 | Core framework includes full reproducible serving benchmarks and code; multiple deployment case studies. | 9.0 | High-quality usage guides, examples, and performance-tuning docs. |
2023-09-12 | vLLM Inference and Serving Engine | LLM; HPC/inference | High-throughput, memory-efficient inference and serving engine for LLMs | LLM inference, PagedAttention, CUDA graph, streaming API, quantization | Inference Benchmarking | Tokens/sec, Time to First Token (TTFT), Memory footprint | LLaMA, Mixtral, FlashAttention-based models | 27 | 9.0 | Targets high-throughput LLM inference via PagedAttention and memory-optimized serving; benchmarks cover many configs. | 7.0 | Focuses on model configs and streaming input/output pipelines rather than classical datasets. | 9.0 | Strong tokens/sec, memory usage, and TTFT metrics; comparative plots and logs included. | 9.0 | Benchmarks reproducible via script with support for multiple models and hardware types. | 9.0 | Excellent GitHub docs, CLI/API usage, and deployment walkthroughs. |
2022-06-22 | vLLM Performance Dashboard | LLM; HPC/inference | Interactive dashboard showing inference performance of vLLM | dashboard, throughput visualization, latency analysis, metric tracking | Performance visualization | Tokens/sec, TTFT, Memory usage | LLaMA-2, Mistral, Qwen | 28 | 7.0 | Primarily a visualization frontend; the underlying benchmark definitions come from the vLLM project. | 6.0 | No traditional dataset; displays live or logged benchmark metrics. | 9.0 | Live throughput, memory, latency, and TTFT displayed interactively; highly informative for performance analysis. | 7.0 | Dashboard built on vLLM benchmarks but not itself a complete experiment package. | 8.0 | Observable notebooks are intuitive; customization instructions are minimal but the UI is self-explanatory. |
2022-04-01 | Nixtla NeuralForecast | Time-series forecasting; General ML | High-performance neural forecasting library with >30 models | time-series, neural forecasting, NBEATS, NHITS, TFT, probabilistic forecasting, usability | Time-series forecasting | RMSE, MAPE, CRPS | NBEATS, NHITS, TFT, DeepAR | 29 | | | | | | | | | | |
2023-06-01 | Nixtla Neural Forecast NHITS | Time-series; General ML | Official NHITS implementation for long-horizon time series forecasting | NHITS, long-horizon forecasting, neural interpolation, time-series | Time-series forecasting | RMSE, MAPE | NHITS | 30 | | | | | | | | | | |
2023-10-03 | Nixtla Neural Forecast TimeLLM | Time-series; General ML | Reprogramming LLMs for time series forecasting | Time-LLM, language model, time-series, reprogramming | Time-series forecasting | RMSE, MAPE | Time-LLM | 31 | 8.0 | Novel approach treating forecasting as text generation is explained; framing is less conventional. | 9.0 | Compatible with standard forecasting datasets (e.g., M4, electricity). | 8.0 | RMSE and MAPE are included, but with less emphasis on interpretability or time-series domain constraints. | 9.0 | Open-source with reprogramming layers; LLM interface scripts provided. | 8.0 | Model and architecture overview present, though the usability guide is slightly lighter than others. |
2023-10-05 | Nixtla Neural Forecast TimeGPT | Time-series; General ML | Time-series foundation model “TimeGPT” for forecasting and anomaly detection | TimeGPT, foundation model, time-series, generative model | Time-series forecasting, Anomaly detection | RMSE, Anomaly detection metrics | TimeGPT | 32 | | | | | | | | | | |
2025-03-03 | HDR ML Anomaly Challenge (Gravitational Waves) | Astrophysics; Time-series | Detecting anomalous gravitational-wave signals from LIGO/Virgo datasets | anomaly detection, gravitational waves, astrophysics, time-series | Anomaly detection | ROC-AUC, Precision/Recall | Deep latent CNNs, Autoencoders | 33 | 9.0 | Clear anomaly detection objective framed for physical signal discovery (LIGO/Virgo). | 10.0 | Preprocessed waveform data from dual interferometers, public and well-structured. | 9.0 | ROC-AUC, Precision/Recall, and confusion-based metrics are standardized. | 1.0 | No starter model or baseline code linked. | 9.0 | Codabench page, GitHub starter kit, and related papers provide strong guidance. |
2025-03-03 | HDR ML Anomaly Challenge (Butterfly) | Genomics; Image/CV | Detecting hybrid butterflies via image anomaly detection in a genomics-informed dataset | anomaly detection, computer vision, genomics, butterfly hybrids | Anomaly detection | Classification accuracy, F1 score | CNN-based detectors | 34 | 8.0 | Task is clearly framed around detecting hybrid species via images, but exact labeling methods and hybrid definitions need elaboration. | 8.0 | Dataset hosted on Codabench; appears structured, but details on the image sourcing and labeling pipeline are limited. | 9.0 | Classification accuracy and F1 are standard and appropriate. | 1.0 | No starter model or baseline code linked. | 7.5 | Codabench task page describes the dataset and evaluation method but lacks full API/docs. |
2025-03-03 | HDR ML Anomaly Challenge (Sea Level Rise) | Climate Science; Time-series, Image/CV | Detecting anomalous sea-level rise and flooding events via time-series and satellite imagery | anomaly detection, climate science, sea-level rise, time-series, remote sensing | Anomaly detection | ROC-AUC, Precision/Recall | CNNs, RNNs, Transformers | 35 | 9.0 | Clear dual-modality task (image + time-series); the environmental focus is well described. | 9.0 | Time-series and satellite imagery data provided; sensor info and collection intervals are explained. | 9.0 | ROC-AUC and Precision/Recall are appropriate and robust. | 1.0 | No starter model or baseline code linked. | 6.5 | Moderate Codabench documentation with climate context; lacks a pipeline-level walkthrough. |
2025-01-24 | Single Qubit Readout on QICK System | Quantum Computing | Real-time single-qubit state classification using FPGA firmware | qubit readout, hls4ml, FPGA, QICK | Classification | Accuracy, Latency | hls4ml quantized NN | 36 | 9.0 | Real-time qubit classification task clearly defined in a quantum instrumentation context. | 9.0 | Dataset available on Zenodo with signal traces; compact and reproducible. | 9.0 | Accuracy and latency are well defined and crucial in this setting. | 9.0 | GitHub repo has reproducible code and HLS firmware targeting FPGA. | 8.0 | Good setup instructions, but no interactive visualization or starter notebook. |
2023-11-20 | GPQA: A Graduate-Level Google-Proof Question and Answer Benchmark | Science (Biology, Physics, Chemistry) | Graduate-level, expert-validated multiple-choice questions hard even with web access | Google-proof, multiple-choice, expert reasoning, science QA | Multiple choice | Accuracy | GPT-4 baseline | 37 | | | | | | | | | | |
2024-12-13 | SeafloorAI | Marine Science; Vision-Language | Large-scale vision-language dataset for seafloor mapping and geological classification | sonar imagery, vision-language, seafloor mapping, segmentation, QA | Image segmentation, Vision-language QA | Segmentation pixel accuracy, QA accuracy | SegFormer, ViLT-style multimodal models | 38 | 10.0 | Multimodal task (segmentation + natural-language QA pairs) is clearly specified. | 10.0 | Sonar imagery with masks and descriptions, georeferenced and labeled with QA pairs. | 9.0 | Pixel accuracy and QA metrics clearly defined; tasks split by modality. | 8.0 | Baseline models (SegFormer, ViLT) are cited; partial configs likely available. | 8.5 | Paper and GitHub metadata and processing details are comprehensive, though the full dataset is not yet available. |
2024-12-13 | SuperCon3D | Materials Science; Superconductivity | Dataset and models for predicting and generating high-Tc superconductors using 3D crystal structures | superconductivity, crystal structures, equivariant GNN, generative models | Regression (Tc prediction), Generative modeling | MAE (Tc), Validity of generated structures | SODNet, DiffCSP-SC | 39 | 9.0 | Well-defined problem (Tc prediction, generation) with strong scientific motivation (high-Tc materials), but no formal hardware constraints. | 9.0 | Includes curated 3D crystal structures and Tc data; readily downloadable and used in the paper's models. | 9.0 | MAE and structural validity are used, well established in materials modeling. | 8.0 | Provides two reference models (SODNet, DiffCSP-SC) with results; code likely available post-conference. | 8.0 | Paper and poster explain design choices well; software availability confirms reproducibility, but external documentation is limited. |
2024-12-13 | GeSS | Scientific ML; Geometric Deep Learning | Benchmark suite evaluating geometric deep learning models under real-world distribution shifts | geometric deep learning, distribution shift, OOD robustness, scientific applications | Classification, Regression | Accuracy, RMSE, OOD robustness delta | GCN, EGNN, DimeNet++ | 40 | 9.0 | Clear benchmark scenarios across GDL tasks under multiple real-world shift settings; OOD settings precisely categorized. | 8.0 | Scientific graph datasets provided in multiple shift regimes with standardized splits across domains; the exact data format is not specified. | 9.0 | Includes base metrics (accuracy, RMSE) plus an OOD robustness delta for evaluation under shifts. | 9.0 | Multiple baselines (11 algorithms x 3 backbones) evaluated; setup supports reproducible comparison. | 2.0 | Paper and poster outline the methodology, but setup instructions and accompanying code are not present. |
2024-12-13 | Vocal Call Locator (VCL) | Neuroscience; Bioacoustics | Benchmarking sound-source localization of rodent vocalizations from multi-channel audio | source localization, bioacoustics, time-series, SSL | Sound source localization | Localization error (cm), Recall/Precision | CNN-based SSL models | 41 | 9.0 | Focused on sound-source localization of rodent vocalizations in lab settings; well scoped. | 9.5 | 767,000 annotated audio segments across diverse conditions; minor deduction for no train/test/validation split. | 9.5 | Localization error and precision/recall are used. | 7.0 | CNN-based baselines referenced, but it is unclear whether pretrained models or training code are available. | 2.0 | Poster and paper outline benchmark intent and setup; a repo is expected but not confirmed in the dataset card. |
2024-12-13 | MassSpecGym | Cheminformatics; Molecular Discovery | Benchmark suite for discovery and identification of molecules via MS/MS | mass spectrometry, molecular structure, de novo generation, retrieval, dataset | De novo generation, Retrieval, Simulation | Structure accuracy, Retrieval precision, Simulation MSE | Graph-based generative models, Retrieval baselines | 42 | 9.0 | Three tasks (de novo generation, retrieval, simulation) are clearly defined for MS/MS molecule discovery. | 10.0 | Over 1 million spectra with structure annotations; the dataset is open-source and well-documented. | 9.0 | Task-appropriate metrics (structure accuracy, precision, MSE) are specified and used consistently. | 8.0 | Baseline models are available (graph-based and retrieval), though not exhaustive. | 9.0 | GitHub repo and poster provide code and reproducibility guidance. |
2024-12-13 | Urban Data Layer (UDL) | Urban Computing; Data Engineering | Unified data pipeline for multi-modal urban science research | data pipeline, urban science, multi-modal, benchmark | Prediction, Classification | Task-specific accuracy or RMSE | Baseline regression/classification pipelines | 43 | 8.0 | Clear goals around unifying urban data formats and tasks (e.g., air quality prediction), though some specifics could be more formal. | 9.0 | Multi-modal data is standardized and accessible; a GitHub repo is available. | 8.0 | Uses common task metrics like accuracy/RMSE, though these vary by task. | 7.0 | Baseline regression/classification models included. | 8.0 | Source code supports pipeline reuse, but formal evaluation splits may vary. |
2024-12-13 | Delta Squared-DFT | Computational Chemistry; Materials Science | Benchmarking machine-learning corrections to DFT using Delta Squared-trained models for reaction energies | density functional theory, Delta Squared-ML correction, reaction energetics, quantum chemistry | Regression | Mean Absolute Error (eV), Energy ranking accuracy | Delta Squared-ML correction networks, Kernel ridge regression | 44 | 9.0 | The task of ML correction to DFT energy predictions is well specified. | 9.0 | 10 public reaction datasets with DFT and coupled-cluster references; well-documented. | 8.0 | Uses MAE and ranking accuracy, suitable for this task. | 8.0 | Includes both Delta Squared and KRR baselines. | 9.0 | Public benchmarks and clear reproducibility via datasets and model code. |
2024-12-13 | LLMs for Crop Science | Agricultural Science; NLP | Evaluating LLMs on crop trait QA and textual inference tasks with domain-specific prompts | crop science, prompt engineering, domain adaptation, question answering | Question Answering, Inference | Accuracy, F1 score | GPT-4, LLaMA-2-13B, T5-XXL | 45 | | | | | | | | | | |
2024-12-13 | SPIQA (LLM) | Multimodal Scientific QA; Computer Vision | Evaluating LLMs on image-based scientific paper figure QA tasks (LLM adapter performance) | multimodal QA, scientific figures, image+text, chain-of-thought prompting | Multimodal QA | Accuracy, F1 score | LLaVA, MiniGPT-4, Owl-LLM adapter variants | 46 | 6.0 | Task of QA over scientific figures is interesting but not fully formalized in input/output terms. | 6.0 | Uses the SPIQA dataset with ~10 adapters; figures and questions are included, but not fully open. | 7.0 | Reports accuracy and F1; fair, but no visual-reasoning-specific metric. | 6.0 | 10 LLM adapter baselines with results included. | 5.0 | Poster paper and limited documentation; no reproducibility instructions. |
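
Most Metrics cells above reduce to a handful of recurring quantities: accuracy, ROC-AUC, and latency appear in the majority of rows. As a minimal sketch of how such headline numbers are typically produced, the Python example below scores a stand-in classifier on synthetic data; the model, data shapes, decision threshold, and latency-percentile reporting are illustrative assumptions, not the evaluation protocol of any benchmark listed in the table.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical stand-in task: binary classification on 16 high-level
# features, loosely shaped like a jet-tagging problem. All shapes and
# the 0.5 threshold are illustrative choices, not from any benchmark.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 16))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)

X_train, X_test = X[:8_000], X[8_000:]
y_train, y_test = y[:8_000], y[8_000:]

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Accuracy and ROC-AUC: the two most common quality metrics in the table.
scores = model.predict_proba(X_test)[:, 1]
print(f"accuracy: {accuracy_score(y_test, scores > 0.5):.3f}")
print(f"ROC-AUC:  {roc_auc_score(y_test, scores):.3f}")

# Per-event latency, reported as median and tail percentiles; real-time
# benchmarks usually quote distribution tails rather than means.
latencies = []
for row in X_test[:1_000]:
    t0 = time.perf_counter()
    model.predict_proba(row.reshape(1, -1))
    latencies.append(time.perf_counter() - t0)
p50, p99 = np.percentile(np.array(latencies) * 1e6, [50, 99])
print(f"latency p50/p99: {p50:.1f} / {p99:.1f} microseconds")
```

Hardware-oriented rows (the FPGA entries in particular) additionally report resource utilization (LUTs, FFs, BRAMs, DSPs) from synthesis flows such as hls4ml, which a host-side script like this cannot measure.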
Javier Duarte, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi, Shvetank Prakash, and Vijay Janapa Reddi. Fastml science benchmarks: accelerating real-time scientific edge machine learning. 2022. URL: https://arxiv.org/abs/2207.07958, arXiv:2207.07958. ↩
Javier Duarte, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi, Shvetank Prakash, and Vijay Janapa Reddi. Fastml science benchmarks: accelerating real-time scientific edge machine learning. 2022. URL: https://arxiv.org/abs/2207.07958, arXiv:2207.07958. ↩
Javier Duarte, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi, Shvetank Prakash, and Vijay Janapa Reddi. Fastml science benchmarks: accelerating real-time scientific edge machine learning. 2022. URL: https://arxiv.org/abs/2207.07958, arXiv:2207.07958. ↩
Diana Kafkes and Jason St. John. Boostr: a dataset for accelerator control systems. 2021. URL: https://arxiv.org/abs/2101.08359, arXiv:2101.08359. ↩
Patrick Odagiu, Zhiqiang Que, Javier Duarte, Johannes Haller, Gregor Kasieczka, Artur Lobanov, Vladimir Loncar, Wayne Luk, Jennifer Ngadiuba, Maurizio Pierini, Philipp Rincke, Arpita Seksaria, Sioni Summers, Andre Sznajder, Alexander Tapper, and Thea K. Aarrestad. Ultrafast jet classification on fpgas for the hl-lhc. 2024. URL: https://arxiv.org/abs/2402.01876, arXiv:2402.01876, doi:10.1088/2632-2153/ad5f10. ↩
A. Abed Abud and others (DUNE Collaboration). Deep underground neutrino experiment (dune) near detector conceptual design report. 2021. URL: https://arxiv.org/abs/2103.13910, arXiv:2103.13910. ↩
J. Kvapil, G. Borca-Tasciuc, H. Bossi, K. Chen, Y. Chen, Y. Corrales Morales, H. Da Costa, C. Da Silva, C. Dean, J. Durham, S. Fu, C. Hao, P. Harris, O. Hen, H. Jheng, Y. Lee, P. Li, X. Li, Y. Lin, M. X. Liu, V. Loncar, J. P. Mitrevski, A. Olvera, M. L. Purschke, J. S. Renck, G. Roland, J. Schambach, Z. Shi, N. Tran, N. Wuerfel, B. Xu, D. Yu, and H. Zhang. Intelligent experiments through real-time ai: fast data processing and autonomous detector control for sphenix and future eic detectors. 2025. URL: https://arxiv.org/abs/2501.04845, arXiv:2501.04845. ↩
Jason Weitz, Dmitri Demler, Luke McDermott, Nhan Tran, and Javier Duarte. Neural architecture codesign for fast physics applications. 2025. URL: https://arxiv.org/abs/2501.05515, arXiv:2501.05515. ↩
Benjamin Parpillon, Chinar Syal, Jieun Yoo, Jennet Dickinson, Morris Swartz, Giuseppe Di Guglielmo, Alice Bean, Douglas Berry, Manuel Blanco Valentin, Karri DiPetrillo, Anthony Badea, Lindsey Gray, Petar Maksimovic, Corrinne Mills, Mark S. Neubauer, Gauri Pradhan, Nhan Tran, Dahai Wen, and Farah Fahim. Smart pixels: in-pixel ai for on-sensor data filtering. 2024. URL: https://arxiv.org/abs/2406.14860, arXiv:2406.14860. ↩
Zhengchun Liu, Hemant Sharma, Jun-Sang Park, Peter Kenesei, Antonino Miceli, Jonathan Almer, Rajkumar Kettimuthu, and Ian Foster. Braggnn: fast x-ray bragg peak analysis using deep learning. 2021. URL: https://arxiv.org/abs/2008.08198, arXiv:2008.08198. ↩
Shuyu Qin, Joshua Agar, and Nhan Tran. Extremely noisy 4d-stem strain mapping using cycle consistent spatial transforming autoencoders. In AI for Accelerated Materials Design - NeurIPS 2023 Workshop. 2023. URL: https://openreview.net/forum?id=7yt3N0o0W9. ↩
Yumou Wei, Ryan F. Forelli, Chris Hansen, Jeffrey P. Levesque, Nhan Tran, Joshua C. Agar, Giuseppe Di Guglielmo, Michael E. Mauel, and Gerald A. Navratil. Low latency optical-based mode tracking with machine learning deployed on fpgas on a tokamak. 2024. URL: https://arxiv.org/abs/2312.00128, arXiv:2312.00128, doi:10.1063/5.0190354. ↩
Wanling Gao, Fei Tang, Lei Wang, Jianfeng Zhan, Chunxin Lan, Chunjie Luo, Yunyou Huang, Chen Zheng, Jiahui Dai, Zheng Cao, Daoyi Zheng, Haoning Tang, Kunlin Zhan, Biao Wang, Defei Kong, Tong Wu, Minghe Yu, Chongkang Tan, Huan Li, Xinhui Tian, Yatao Li, Junchao Shao, Zhenyu Wang, Xiaoyu Wang, and Hainan Ye. Aibench: an industry standard internet service ai benchmark suite. 2019. URL: https://arxiv.org/abs/1908.08998, arXiv:1908.08998. ↩
Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Xu Wen, Rui Ren, Chen Zheng, Xiwen He, Hainan Ye, Haoning Tang, Zheng Cao, Shujie Zhang, and Jiahui Dai. Bigdatabench: a scalable and unified big data and ai benchmark suite. 2018. URL: https://arxiv.org/abs/1802.08254, arXiv:1802.08254. ↩
Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda, Toshiyuki Imamura, Akihiko Kasagi, Kentaro Kawakami, Shuhei Kudo, Akiyoshi Kuroda, Maxime Martinasso, Satoshi Matsuoka, Henrique Mendonça, Kazuki Minami, Prabhat Ram, Takashi Sawada, Mallikarjun Shankar, Tom St. John, Akihiro Tabuchi, Venkatram Vishwanath, Mohamed Wahib, Masafumi Yamazaki, and Junqi Yin. Mlperf hpc: a holistic benchmark suite for scientific machine learning on hpc systems. 2021. URL: https://arxiv.org/abs/2110.11466, arXiv:2110.11466. ↩
Jeyan Thiyagalingam, Gregor von Laszewski, Junqi Yin, Murali Emani, Juri Papay, Gregg Barrett, Piotr Luszczek, Aristeidis Tsaris, Christine Kirkpatrick, Feiyi Wang, Tom Gibbs, Venkatram Vishwanath, Mallikarjun Shankar, Geoffrey Fox, and Tony Hey. Ai benchmarking for science: efforts from the mlcommons science working group. In Hartwig Anzt, Amanda Bienz, Piotr Luszczek, and Marc Baboulin, editors, High Performance Computing. ISC High Performance 2022 International Workshops, 47–64. Cham, 2022. Springer International Publishing. ↩
Thea Aarrestad, Ekaterina Govorkova, Jennifer Ngadiuba, Ema Puljak, Maurizio Pierini, and Kinga Anna Wozniak. Unsupervised new physics detection at 40 mhz: training dataset. 2021. URL: https://zenodo.org/record/5046389, doi:10.5281/ZENODO.5046389. ↩
Alexandros Karargyris, Renato Umeton, Micah J. Sheller, Alejandro Aristizabal, Johnu George, Anna Wuest, Sarthak Pati, Hasan Kassem, Maximilian Zenk, Ujjwal Baid, Prakash Narayana Moorthy, Alexander Chowdhury, Junyi Guo, Sahil Nalawade, Jacob Rosenthal, David Kanter, Maria Xenochristou, Daniel J. Beutel, Verena Chung, Timothy Bergquist, James Eddy, Abubakar Abid, Lewis Tunstall, Omar Sanseviero, Dimitrios Dimitriadis, Yiming Qian, Xinxing Xu, Yong Liu, Rick Siow Mong Goh, Srini Bala, Victor Bittorf, Sreekar Reddy Puchala, Biagio Ricciuti, Soujanya Samineni, Eshna Sengupta, Akshay Chaudhari, Cody Coleman, Bala Desinghu, Gregory Diamos, Debo Dutta, Diane Feddema, Grigori Fursin, Xinyuan Huang, Satyananda Kashyap, Nicholas Lane, Indranil Mallick, Pietro Mascagni, Virendra Mehta, Cassiano Ferro Moraes, Vivek Natarajan, Nikola Nikolov, Nicolas Padoy, Gennady Pekhimenko, Vijay Janapa Reddi, G. Anthony Reina, Pablo Ribalta, Abhishek Singh, Jayaraman J. Thiagarajan, Jacob Albrecht, Thomas Wolf, Geralyn Miller, Huazhu Fu, Prashant Shah, Daguang Xu, Poonam Yadav, David Talby, Mark M. Awad, Jeremy P. Howard, Michael Rosenthal, Luigi Marchionni, Massimo Loda, Jason M. Johnson, Spyridon Bakas, Peter Mattson, FeTS Consortium, BraTS-2020 Consortium, and AI4SafeChole Consortium. Federated benchmarking of medical artificial intelligence with medperf. Nature Machine Intelligence, 5(7):799–810, July 2023. URL: https://doi.org/10.1038/s42256-023-00652-2, doi:10.1038/s42256-023-00652-2. ↩
Claudius Krause, Michele Faucci Giannelli, Gregor Kasieczka, Benjamin Nachman, Dalila Salamani, David Shih, Anna Zaborowska, Oz Amram, Kerstin Borras, Matthew R. Buckley, Erik Buhmann, Thorsten Buss, Renato Paulo Da Costa Cardoso, Anthony L. Caterini, Nadezda Chernyavskaya, Federico A. G. Corchia, Jesse C. Cresswell, Sascha Diefenbacher, Etienne Dreyer, Vijay Ekambaram, Engin Eren, Florian Ernst, Luigi Favaro, Matteo Franchini, Frank Gaede, Eilam Gross, Shih-Chieh Hsu, Kristina Jaruskova, Benno Käch, Jayant Kalagnanam, Raghav Kansal, Taewoo Kim, Dmitrii Kobylianskii, Anatolii Korol, William Korcari, Dirk Krücker, Katja Krüger, Marco Letizia, Shu Li, Qibin Liu, Xiulong Liu, Gabriel Loaiza-Ganem, Thandikire Madula, Peter McKeown, Isabell-A. Melzer-Pellmann, Vinicius Mikuni, Nam Nguyen, Ayodele Ore, Sofia Palacios Schweitzer, Ian Pang, Kevin Pedro, Tilman Plehn, Witold Pokorski, Huilin Qu, Piyush Raikwar, John A. Raine, Humberto Reyes-Gonzalez, Lorenzo Rinaldi, Brendan Leigh Ross, Moritz A. W. Scham, Simon Schnake, Chase Shimmin, Eli Shlizerman, Nathalie Soybelman, Mudhakar Srivatsa, Kalliopi Tsolaki, Sofia Vallecorsa, Kyongmin Yeo, and Rui Zhang. Calochallenge 2022: a community challenge for fast calorimeter simulation. 2024. URL: https://arxiv.org/abs/2410.21611, arXiv:2410.21611. ↩
Avrim Blum and Moritz Hardt. The ladder: a reliable leaderboard for machine learning competitions. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 1006–1014. Lille, France, July 2015. PMLR. URL: https://proceedings.mlr.press/v37/blum15.html. ↩
Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, and Isabelle Guyon. Codabench: flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns, 3(7):100543, July 2022. URL: http://dx.doi.org/10.1016/j.patter.2022.100543, doi:10.1016/j.patter.2022.100543. ↩
Piotr Luszczek. Sabath: fair metadata technology for surrogate benchmarks. Technical Report, University of Tennessee, 2021. URL: https://github.com/icl-utk-edu/slip/tree/sabath. ↩
Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Dan MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. Pdebench: an extensive benchmark for scientific machine learning. 2024. URL: https://arxiv.org/abs/2210.07182, arXiv:2210.07182. ↩
Ruben Ohana, Michael McCabe, Lucas Meyer, Rudy Morel, Fruzsina J. Agocs, Miguel Beneitez, Marsha Berger, Blakesley Burkhart, Stuart B. Dalziel, Drummond B. Fielding, Daniel Fortunato, Jared A. Goldberg, Keiya Hirashima, Yan-Fei Jiang, Rich R. Kerswell, Suryanarayana Maddu, Jonah Miller, Payel Mukhopadhyay, Stefan S. Nixon, Jeff Shen, Romain Watteaux, Bruno Régaldo-Saint Blancard, François Rozet, Liam H. Parker, Miles Cranmer, and Shirley Ho. The well: a large-scale collection of diverse physics simulations for machine learning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 44989–45037. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/4f9a5acd91ac76569f2fe291b1f4772b-Paper-Datasets_and_Benchmarks_Track.pdf. ↩
Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, and Venkatram Vishwanath. Llm-inference-bench: inference benchmarking of large language models on ai accelerators. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1362–1379. 2024. doi:10.1109/SCW63240.2024.00178. ↩
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. 2024. URL: https://arxiv.org/abs/2312.07104, arXiv:2312.07104. ↩
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, 611–626. New York, NY, USA, 2023. Association for Computing Machinery. URL: https://doi.org/10.1145/3600006.3613165, doi:10.1145/3600006.3613165. ↩
Simon Mo. Vllm performance dashboard. 2024. URL: https://simon-mo-workspace.observablehq.cloud/vllm-dashboard-v0/. ↩
Kin G. Olivares, Cristian Challú, Federico Garza, Max Mergenthaler Canseco, and Artur Dubrawski. Neuralforecast: user friendly state-of-the-art neural forecasting models. PyCon Salt Lake City, Utah, US, 2022. URL: https://github.com/Nixtla/neuralforecast. ↩
Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza Ramirez, Max Mergenthaler Canseco, and Artur Dubrawski. Nhits: neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 37, 6989–6997. 2023. ↩
Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-llm: time series forecasting by reprogramming large language models. 2024. URL: https://arxiv.org/abs/2310.01728, arXiv:2310.01728. ↩
Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco. Timegpt-1. 2024. URL: https://arxiv.org/abs/2310.03589, arXiv:2310.03589. ↩
Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova, Wahid Bhimji, Wei-Lun Chao, Chris Harris, Shih-Chieh Hsu, Hilmar Lapp, Mark S. Neubauer, Josephine Namayanja, Aneesh Subramanian, Philip Harris, Advaith Anand, David E. Carlyn, Subhankar Ghosh, Christopher Lawrence, Eric Moreno, Ryan Raikman, Jiaman Wu, Ziheng Zhang, Bayu Adhi, Mohammad Ahmadi Gharehtoragh, Saúl Alonso Monsalve, Marta Babicz, Furqan Baig, Namrata Banerji, William Bardon, Tyler Barna, Tanya Berger-Wolf, Adji Bousso Dieng, Micah Brachman, Quentin Buat, David C. Y. Hui, Phuong Cao, Franco Cerino, Yi-Chun Chang, Shivaji Chaulagain, An-Kai Chen, Deming Chen, Eric Chen, Chia-Jui Chou, Zih-Chen Ciou, Miles Cochran-Branson, Artur Cordeiro Oudot Choi, Michael Coughlin, Matteo Cremonesi, Maria Dadarlat, Peter Darch, Malina Desai, Daniel Diaz, Steven Dillmann, Javier Duarte, Isla Duporge, Urbas Ekka, Saba Entezari Heravi, Hao Fang, Rian Flynn, Geoffrey Fox, Emily Freed, Hang Gao, Jing Gao, Julia Gonski, Matthew Graham, Abolfazl Hashemi, Scott Hauck, James Hazelden, Joshua Henry Peterson, Duc Hoang, Wei Hu, Mirco Huennefeld, David Hyde, Vandana Janeja, Nattapon Jaroenchai, Haoyi Jia, Yunfan Kang, Maksim Kholiavchenko, Elham E. Khoda, Sangin Kim, Aditya Kumar, Bo-Cheng Lai, Trung Le, Chi-Wei Lee, JangHyeon Lee, Shaocheng Lee, Suzan van der Lee, Charles Lewis, Haitong Li, Haoyang Li, Henry Liao, Mia Liu, Xiaolin Liu, Xiulong Liu, Vladimir Loncar, Fangzheng Lyu, Ilya Makarov, Abhishikth Mallampalli Chen-Yu Mao, Alexander Michels, Alexander Migala, Farouk Mokhtar, Mathieu Morlighem, Min Namgung, Andrzej Novak, Andrew Novick, Amy Orsborn, Anand Padmanabhan, Jia-Cheng Pan, Sneh Pandya, Zhiyuan Pei, Ana Peixoto, George Percivall, Alex Po Leung, Sanjay Purushotham, Zhiqiang Que, Melissa Quinnan, Arghya Ranjan, Dylan Rankin, Christina Reissel, Benedikt Riedel, Dan Rubenstein, Argyro Sasli, Eli Shlizerman, Arushi Singh, Kim Singh, Eric R. Sokol, Arturo Sorensen, Yu Su, Mitra Taheri, Vaibhav Thakkar, Ann Mariam Thomas, Eric Toberer, Chenghan Tsai, Rebecca Vandewalle, Arjun Verma, Ricco C. Venterea, He Wang, Jianwu Wang, Sam Wang, Shaowen Wang, Gordon Watts, Jason Weitz, Andrew Wildridge, Rebecca Williams, Scott Wolf, Yue Xu, Jianqi Yan, Jai Yu, Yulei Zhang, Haoran Zhao, Ying Zhao, and Yibo Zhong. Building machine learning challenges for anomaly detection in science. 2025. URL: https://arxiv.org/abs/2503.02112, arXiv:2503.02112. ↩
Giuseppe Di Guglielmo, Botao Du, Javier Campos, Alexandra Boltasseva, Akash V. Dixit, Farah Fahim, Zhaxylyk Kudyshev, Santiago Lopez, Ruichao Ma, Gabriel N. Perdue, Nhan Tran, Omer Yesilyurt, and Daniel Bowring. End-to-end workflow for machine learning-based qubit readout with qick and hls4ml. 2025. URL: https://arxiv.org/abs/2501.14663, arXiv:2501.14663. ↩
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: a graduate-level google-proof q&a benchmark. 2023. URL: https://arxiv.org/abs/2311.12022, arXiv:2311.12022. ↩
Kien X. Nguyen, Fengchun Qiao, Arthur Trembanis, and Xi Peng. Seafloorai: a large-scale vision-language dataset for seafloor geological survey. 2024. URL: https://arxiv.org/abs/2411.00172, arXiv:2411.00172. ↩
Pin Chen, Luoxuan Peng, Rui Jiao, Qing Mo, Zhen Wang, Wenbing Huang, Yang Liu, and Yutong Lu. Learning superconductivity from ordered and disordered material structures. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 108902–108928. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/c4e3b55ed4ac9ba52d7df11f8bddbbf4-Paper-Datasets_and_Benchmarks_Track.pdf. ↩
Deyu Zou, Shikun Liu, Siqi Miao, Victor Fung, Shiyu Chang, and Pan Li. Gess: benchmarking geometric deep learning under scientific applications with distribution shifts. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 92499–92528. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/a8063075b00168dc39bc81683619f1a8-Paper-Datasets_and_Benchmarks_Track.pdf. ↩
Ralph E Peterson, Aramis Tanelus, Christopher Ick, Bartul Mimica, Niegil Francis, Violet J Ivan, Aman Choudhri, Annegret L Falkner, Mala Murthy, David M Schneider, Dan H Sanes, and Alex H Williams. Vocal call locator benchmark (vcl) for localizing rodent vocalizations from multi-channel audio. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 106370–106382. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/c00d37d6b04d73b870b963a4d70051c1-Paper-Datasets_and_Benchmarks_Track.pdf. ↩
Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J.J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Josef Sivic, and Tomáš Pluskal. Massspecgym: a benchmark for the discovery and identification of molecules. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 110010–110027. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/c6c31413d5c53b7d1c343c1498734b0f-Paper-Datasets_and_Benchmarks_Track.pdf. ↩
Yiheng Wang, Tianyu Wang, Yuying Zhang, Hongji Zhang, Haoyu Zheng, Guanjie Zheng, and Linghe Kong. Urbandatalayer: a unified data pipeline for urban science. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 7296–7310. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/0db7f135f6991e8cec5e516ecc66bfba-Paper-Datasets_and_Benchmarks_Track.pdf. ↩
Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, and Artur Kadurin. ∇²dft: a universal quantum chemistry dataset of drug-like molecules and a benchmark for neural network potentials. 2024. URL: https://arxiv.org/abs/2406.14347, arXiv:2406.14347. ↩
Tingjia Shen, Hao Wang, Jiaqing Zhang, Sirui Zhao, Liangyue Li, Zulong Chen, Defu Lian, and Enhong Chen. Exploring user retrieval integration towards large language models for cross-domain sequential recommendation. 2024. URL: https://arxiv.org/abs/2406.03085, arXiv:2406.03085. ↩
Shraman Pramanick, Rama Chellappa, and Subhashini Venugopalan. Spiqa: a dataset for multimodal question answering on scientific papers. 2025. URL: https://arxiv.org/abs/2407.09413, arXiv:2407.09413. ↩