Benchmarks

Date Name Domain Focus Keywords Task Types Metrics Models Citation Specification Rating Specification Reason Dataset Rating Dataset Reason Metrics Rating Metrics Reason Reference Solution Rating Reference Solution Reason Documentation Rating Documentation Reason
2024-05-01 Jet Classification Particle Physics Real-time classification of particle jets using HL-LHC simulation features classification, real-time ML, jet tagging, QKeras Classification Accuracy, AUC Keras DNN, QKeras quantized DNN 1 9.0 Task and format (multiple-choice QA with 5 options) are clearly defined; grounded in ConceptNet with consistent structure, though no hardware/system constraints are specified. 9.0 Public, versioned, and FAIR-compliant; includes metadata, splits, and licensing; well-integrated with HuggingFace and other ML libraries. 9.0 Accuracy is a simple, reproducible metric aligned with task goals; no ambiguity in evaluation. 8.0 Several baseline models (e.g., BERT, RoBERTa) are reported with scores; implementations exist in public repos, but not bundled as an official starter kit. 7.0 Clear paper, GitHub repo, and integration with HuggingFace Datasets; full reproducibility requires manually connecting models to dataset.
2024-05-01 Irregular Sensor Data Compression Particle Physics Real-time compression of sparse sensor data with autoencoders compression, autoencoder, sparse data, irregular sampling Compression MSE, Compression ratio Autoencoder, Quantized autoencoder 2 8.0 Classification is clearly defined for real-time inference on simulated LHC jets. Input features (HLFs) are documented, though exact latency or resource constraints are not numerically specified. 9.0 Two datasets (OpenML and Zenodo) are public, well-formatted, and documented; FAIR principles are followed, though richer metadata would raise confidence to a 10. 9.0 AUC and Accuracy are standard, quantitative, and well-aligned with goals of jet tagging and inference efficiency. 8.0 Float and quantized Keras/QKeras models are provided with results. Reproducibility is good, though full automation and documentation could be improved. 8.0 GitHub contains baseline code, data loaders, and references, but setup for deployment (e.g., FPGA pipeline) requires familiarity with the tooling.
2024-05-01 Beam Control Accelerators and Magnets Reinforcement learning control of accelerator beam position RL, beam stabilization, control systems, simulation Control Stability, Control loss DDPG, PPO (planned) 3, 4 9.0 Task is well defined (real-time compression of sparse, irregular sensor data using autoencoders); latency constraints are implied but not fully quantified. 8.0 Dataset is custom and synthetic but described well; FAIR-compliance is partial (reusable and accessible, but not externally versioned with rich metadata). 9.0 Uses standard quantitative metrics (MSE, compression ratio) clearly aligned with compression and reconstruction goals. 7.0 Baseline (autoencoder and quantized variant) is provided, but training/inference pipeline is minimally documented and needs user setup. 8.0 GitHub repo contains core components, but more structured setup instructions and pretrained weights would improve usability.
2024-07-08 Ultrafast jet classification at the HL-LHC Particle Physics FPGA-optimized real-time jet origin classification at the HL-LHC jet classification, FPGA, quantization-aware training, Deep Sets, Interaction Networks Classification Accuracy, Latency, Resource utilization MLP, Deep Sets, Interaction Network 5 8.0 Task is clear (RL control of beam stability), with BOOSTR-based simulator; control objectives are well motivated, but system constraints and reward structure are still under refinement. 7.0 BOOSTR dataset exists and is cited, but integration into the benchmark is in early stages; metadata and FAIR structure are limited. 7.0 Stability and control loss are mentioned, but metrics are not yet formalized with clear definitions or baselines. 5.5 DDPG baseline mentioned; PPO planned; implementation is still in progress with no reproducible results available yet. 6.0 GitHub has a defined structure but is incomplete; setup and execution instructions for training/evaluation are not fully established.
2024-10-15 Quench detection Accelerators and Magnets Real-time detection of superconducting magnet quenches using ML quench detection, autoencoder, anomaly detection, real-time Anomaly detection, Quench localization ROC-AUC, Detection latency Autoencoder, RL agents (in development)   10.0 Real-time jet origin classification under FPGA constraints is clearly defined, with explicit latency targets (~100 ns) and I/O formats. 9.0 Data available on Zenodo with DOI, includes constituent-level jets; accessible and well-documented, though not deeply versioned with full FAIR metadata. 10.0 Accuracy, latency, and hardware resource usage (LUTs, DSPs) are rigorously measured and aligned with real-time goals. 9.0 Includes models (MLP, Deep Sets, Interaction Networks) with quantization-aware training and synthesis results via hls4ml; reproducible but tightly coupled with specific toolchains. 8.0 Paper and code (via hls4ml) are sufficient, but a centralized, standalone repo for reproducing all models would enhance accessibility.
2024-10-15 DUNE Particle Physics Real-time ML for DUNE DAQ time-series data DUNE, time-series, real-time, trigger Trigger selection, Time-series anomaly detection Detection efficiency, Latency CNN, LSTM (planned) 6 8.0 Task (quench detection via anomaly detection) is clearly described; multi-modal sensors, streaming rates, and objective are provided, but constraints (latency thresholds) are qualitative. 7.0 Custom dataset using real data from BNL; HDF5 formatted and structured, but access may be internal or limited, and not versioned for public FAIR use. 8.0 ROC-AUC and detection latency are defined; relevant and quantitative but not yet paired with benchmark baselines. 6.0 Autoencoder prototype exists; RL methods are in development; no fully reproducible pipeline is available yet. 7.0 Slides and GDocs outline results; implementation is in progress with limited setup/code release.
2025-01-08 Intelligent experiments through real-time AI Instrumentation and Detectors; Nuclear Physics; Particle Physics Real-time FPGA-based triggering and detector control for sPHENIX and future EIC FPGA, Graph Neural Network, hls4ml, real-time inference, detector control Trigger classification, Detector control, Real-time inference Accuracy (charm and beauty detection), Latency (micros), Resource utilization (LUT/FF/BRAM/DSP) Bipartite Graph Network with Set Transformers (BGN-ST), GarNet (edge-classifier) 7 8.0 Task (trigger-level anomaly detection) is clearly defined for low-latency streaming input, but the problem framing lacks complete architectural/system specs. 6.0 Internal DUNE SONIC data; not publicly released and no formal FAIR support; replicability is institutionally gated. 7.0 Metrics include detection efficiency and latency, which are relevant, but only lightly supported by baselines or formal eval scripts. 5.0 One CNN prototype demonstrated; LSTM planned. No public implementation or ready-to-run example yet. 6.0 Slides and some internal documentation exist, but no full pipeline or public GitHub repo yet.
2025-01-09 Neural Architecture Codesign for Fast Physics Applications Physics; Materials Science; Particle Physics Automated neural architecture search and hardware-efficient model codesign for fast physics applications neural architecture search, FPGA deployment, quantization, pruning, hls4ml Classification, Peak finding Accuracy, Latency, Resource utilization NAC-based BraggNN, NAC-optimized Deep Sets (jet) 8 10.0 Task is clearly defined (triggering on rare events with sub-10 micros latency); architecture, constraints, and system context (FPGA, Alveo) are well detailed. 7.0 Simulated tracking data from sPHENIX and EIC; internally structured but not yet released in a public FAIR-compliant format. 10.0 Accuracy, latency, and hardware resource utilization (LUTs, DSPs) are clearly defined and used in evaluation. 9.0 Graph-based models (BGN-ST, GarNet) are implemented and tested on real hardware; reproducibility possible with hls4ml but full scripts not bundled. 8.0 Paper is detailed and tool usage (FlowGNN, hls4ml) is described, but repo release and dataset access remain in progress.
2024-06-24 Smart Pixels for LHC Particle Physics; Instrumentation and Detectors On-sensor, in-pixel ML filtering for high-rate LHC pixel detectors smart pixel, on-sensor inference, data reduction, trigger Image Classification, Data filtering Data rejection rate, Power per pixel 2-layer pixel NN 9 9.0 Task (automated neural architecture search for real-time physics) is well formulated with clear latency, model compression, and deployment goals. 6.0 Internal Bragg and jet datasets used; not publicly hosted or FAIR-compliant, though mentioned in the paper. 10.0 BOP reduction, latency, and accuracy are all quantitatively evaluated. 8.0 NAC-generated models for Bragg peak and jet classification are described, but pipeline requires integration of several tools and is not fully packaged. 7.0 NAC pipeline, hls4ml usage, and results are discussed; code (e.g., nac-opt) referenced, but replication requires stitching together toolchain and data.
2023-10-03 HEDM (BraggNN) Material Science Fast Bragg peak analysis using deep learning in diffraction microscopy BraggNN, diffraction, peak finding, HEDM Peak detection Localization accuracy, Inference time BraggNN 10 10.0 Fully specified: describes task (data filtering/classification, system design (on-sensor inference), latency (25 ns), and power constraints. 8.0 In-pixel charge cluster data used, but dataset release info is minimal; FAIR metadata/versioning limited. 9.0 Data rejection rate and power per pixel are clearly defined and directly tied to hardware goals. 9.0 2-layer NN implementation is evaluated in hardware; reproducible via hls4ml flow with results in paper. 8.0 Paper is clear; Zenodo asset is referenced, but additional GitHub or setup repo would improve reproducibility.
2023-12-03 4D-STEM Material Science Real-time ML for scanning transmission electron microscopy 4D-STEM, electron microscopy, real-time, image processing Image Classification, Streamed data inference Classification accuracy, Throughput CNN models (prototype) 11 9.0 Peak localization task is well-defined for diffraction images; input/output described clearly, but no system constraints. 8.0 Simulated diffraction images provided; reusable and downloadable, but not externally versioned or FAIR-structured. 9.0 Inference speed and localization accuracy are standard and quantitatively reported. 8.0 BraggNN model and training pipeline exist, but need stitching from separate repositories. 8.0 Paper and codebase are available and usable, though not fully turnkey.
2023-12-05 In-Situ High-Speed Computer Vision Fusion/Plasma Real-time image classification for in-situ plasma diagnostics plasma, in-situ vision, real-time ML Image Classification Accuracy, FPS CNN 12 7.0 General task defined (real-time microscopy inference), but no standardized I/O format, latency constraint, or complete problem framing yet. 0.0 Dataset not provided or described in any formal way. 6.0 Mentions throughput and accuracy, but metrics are not formally defined or benchmarked. 2.0 Prototype CNNs described; no baseline or implementation released. 5.0 OpenReview paper and Gemini doc give some insight, but no working code, environment, or example.
2020-01-01 BenchCouncil AIBench General End-to-end AI benchmarking across micro, component, and application levels benchmarking, AI systems, application-level evaluation Training, Inference, End-to-end AI workloads Throughput, Latency, Accuracy ResNet, BERT, GANs, Recommendation systems 13 8.0 Task (plasma diagnostic classification) and real-time deployment described; system specs (FPS targets) implied but not fully quantified. 6.0 Dataset is sensor stream-based but not shared or FAIR-documented. 8.0 FPS and classification accuracy reported and relevant. 7.0 CNN model described and evaluated, but public implementation and benchmarks are not available yet. 6.0 Paper and Gemini doc exist, but full setup instructions and tools are still in progress.
2020-01-01 BenchCouncil BigDataBench General Big data and AI benchmarking across structured, semi-structured, and unstructured data workloads big data, AI benchmarking, data analytics Data preprocessing, Inference, End-to-end data pipelines Data throughput, Latency, Accuracy CNN, LSTM, SVM, XGBoost 14 9.0 Evaluates AI at multiple levels (micro to end-to-end); tasks and workloads are clearly defined, though specific I/O formats and constraints vary. 9.0 Realistic datasets across diverse domains; FAIR structure for many components, but individual datasets may not all be versioned or richly annotated. 9.0 Latency, throughput, and accuracy clearly defined for end-to-end tasks; consistent across models and setups. 8.0 Reference implementations for several tasks exist, but setup across all tasks is complex and not fully streamlined. 8.0 Central documentation exists, with detailed component breakdowns; environment setup across platforms (e.g., hardware variations) can require manual adjustment.
2021-10-20 MLPerf HPC Cosmology, Climate, Protein Structure, Catalysis Scientific ML training and inference on HPC systems HPC, training, inference, scientific ML Training, Inference Training time, Accuracy, GPU utilization CosmoFlow, DeepCAM, OpenCatalyst 15 9.0 Focused on structured/unstructured data pipelines; clearly defined tasks spanning analytics to AI; some scenarios lack hardware constraint modeling. 9.0 Built from 13 real-world sources; structured for realistic big data scenarios; partially FAIR-compliant with documented data motifs. 9.0 Covers data throughput, latency, and accuracy; quantitative and benchmark-ready. 8.0 Many pipeline and model examples provided using Hadoop/Spark/Flink; setup effort varies by task and platform. 8.0 Strong documentation with examples and task specifications; centralized support exists, but task-specific tuning may require domain expertise.
2023-06-01 MLCommons Science Earthquake, Satellite Image, Drug Discovery, Electron Microscope, CFD AI benchmarks for scientific applications including time-series, imaging, and simulation science AI, benchmark, MLCommons, HPC Time-series analysis, Image classification, Simulation surrogate modeling MAE, Accuracy, Speedup vs simulation CNN, GNN, Transformer 16 10.0 Scientific ML tasks (e.g., CosmoFlow, DeepCAM) are clearly defined with HPC system-level constraints and targets. 9.0 Public scientific datasets (e.g., cosmology, weather); used consistently, though FAIR-compliance of individual datasets varies slightly. 10.0 Training time, GPU utilization, and accuracy are all directly measured and benchmarked across HPC systems. 9.0 Reference implementations available and actively maintained; HPC setup may require domain-specific environment. 9.0 GitHub repo and papers provide detailed instructions; reproducibility supported across multiple institutions.
2021-07-05 LHC New Physics Dataset Particle Physics; Real-time Triggering Real-time LHC event filtering for anomaly detection using proton collision data anomaly detection, proton collision, real-time inference, event filtering, unsupervised ML Anomaly detection, Event classification ROC-AUC, Detection efficiency Autoencoder, Variational autoencoder, Isolation forest 17 7.0 The problem (anomaly detection for new physics at LHC) is clearly described with goals and background, but lacks a formal task specification or constraints. 8.0 Large-scale, public dataset derived from LHC simulations; well-documented and available via Zenodo. 7.0 Provides AUROC, accuracy, and anomaly detection metrics but lacks standardized evaluation script. 5.0 Baseline models (autoencoders, GANs) are described in associated papers, but implementations vary across papers. 6.0 Publicly available papers and datasets with descriptions, but no unified README or training setup.
2023-07-17 MLCommons Medical AI Healthcare; Medical AI Federated benchmarking and evaluation of medical AI models across diverse real-world clinical data medical AI, federated evaluation, privacy-preserving, fairness, healthcare benchmarks Federated evaluation, Model validation ROC AUC, Accuracy, Fairness metrics MedPerf-validated CNNs, GaNDLF workflows 18 9.0 Diverse scientific tasks (earthquake, CFD, microscopy) with detailed problem statements and goals; system constraints not uniformly applied. 9.0 Domain-specific datasets (e.g., microscopy, climate); mostly public and structured, but FAIR annotations are not always explicit. 9.0 Task-specific metrics (MAE, speedup, accuracy) are clear and reproducible. 9.0 Reference models (CNN, GNN, Transformer) provided with training/evaluation pipelines. 9.0 Well-documented, open-sourced, and maintained with examples; strong community support and reproducibility focus.
2024-10-28 CaloChallenge 2022 LHC Calorimeter; Particle Physics Fast generative-model-based calorimeter shower simulation evaluation calorimeter simulation, generative models, surrogate modeling, LHC, fast simulation Surrogate modeling Histogram similarity, Classifier AUC, Generation latency VAE variants, GAN variants, Normalizing flows, Diffusion models 19 9.0 Task is clearly defined: real-time anomaly detection from high-rate LHC collisions. Latency and bandwidth constraints are mentioned, though not numerically enforced. 9.0 Publicly available via Zenodo, with structured signal/background splits, and rich metadata; nearly fully FAIR. 9.0 ROC-AUC and detection efficiency are clearly defined and appropriate for unsupervised anomaly detection. 8.0 Several baseline methods (autoencoder, VAE, isolation forest) are evaluated; runnable versions available via community repos but not tightly bundled. 8.0 Paper and data documentation are clear, and the dataset is widely reused. Setup requires some manual effort to reproduce full pipelines.
ongoing Papers With Code (SOTA Platform) General ML; All domains Open platform tracking state-of-the-art results, benchmarks, and implementations across ML tasks and papers leaderboard, benchmarking, reproducibility, open-source Multiple (Classification, Detection, NLP, etc.) Task-specific (Accuracy, F1, BLEU, etc.) All published models with code 20 9.0 Evaluation setting (federated clinical benchmarking) is well-defined; I/O interfaces vary slightly by task but are standardized in MedPerf platform. 8.0 Uses distributed, real-world clinical datasets across institutions; FAIR compliance varies across hospitals and data hosts. 9.0 ROC AUC, accuracy, and fairness metrics are explicitly defined and task-dependent; consistently tracked across institutions. 8.0 Validated CNNs and GaNDLF pipelines are used and shared via the MedPerf tool, but some implementations are abstracted behind the platform. 9.0 Excellent documentation across MedPerf, GaNDLF, and COFE; reproducibility handled via containerized flows and task templates.
2022-01-01 Codabench General ML; Multiple Open-source platform for organizing reproducible AI benchmarks and competitions benchmark platform, code submission, competitions, meta-benchmark Multiple Submission count, Leaderboard ranking, Task-specific metrics Arbitrary code submissions 21 10.0 Simulation task (generative calorimeter showers) is clearly stated with multiple datasets, fidelity requirements, and performance constraints. 9.5 Public datasets available in multiple sizes and formats; well-documented; not versioned 10.0 Histogram similarity, classifier AUC, and generation latency are clearly defined and benchmarked across all submissions. 9.0 31 model implementations submitted; some made public and reproducible, though others remain undocumented or private. 9.0 Paper, leaderboard, and Gemini doc are comprehensive; unified repo or launchable baseline kit would push this to a 10.
2021-09-27 Sabath (SBI-FAIR) Systems; Metadata FAIR metadata framework for ML-driven surrogate workflows in HPC systems meta-benchmark, metadata, HPC, surrogate modeling Systems benchmarking Metadata completeness, FAIR compliance N/A 22 8.0 The benchmark defines simulation-based inference (SBI) tasks clearly with FAIR principles applied to particle physics datasets. 8.0 Data is well-structured for SBI and publicly available with clear licensing. 8.0 Includes likelihood and posterior accuracy; metrics well-matched to SBI. 7.0 Baseline SBI models are implemented and reproducible. 6.0 GitHub repo includes code and instructions, but lacks full tutorials or walkthroughs.
2022-10-13 PDEBench CFD; Weather Modeling Benchmark suite for ML-based surrogates solving time-dependent PDEs PDEs, CFD, scientific ML, surrogate modeling, NeurIPS Supervised Learning RMSE, boundary RMSE, Fourier RMSE FNO, U-Net, PINN, Gradient-Based inverse methods 23 9.0 Clearly defined PDE-solving tasks with well-specified constraints and solution formats. 9.0 Includes synthetic and real-world PDE datasets with detailed format descriptions. 8.0 Uses L2 error and other norms relevant to PDE solutions. 7.0 Includes baseline solvers and trained models across multiple PDE tasks. 8.0 Well-organized GitHub with examples, dataset loading scripts, and training configs.
2024-12-03 The Well biological systems, fluid dynamics, acoustic scattering, astrophysical MHD Foundation model + surrogate dataset spanning 16 physical simulation domains surrogate modeling, foundation model, physics simulations, spatiotemporal dynamics Supervised Learning Dataset size, Domain breadth FNO baselines, U-Net baselines 24 7.0 Explores LLM understanding of mental health scenarios; framing is creative but loosely defined. 6.0 Dataset is described in concept but not released; privacy limits public access though synthetic proxies are referenced. 7.0 Uses manual annotation and quality scores, but lacks standardized automatic metrics. 6.0 Provides few-shot prompt examples and human rating calibration details. 5.0 Paper gives use cases, but code and data are not yet public.
2024-10-31 LLM-Inference-Bench LLM; HPC/inference Hardware performance benchmarking of LLMs on AI accelerators LLM, inference benchmarking, GPU, accelerator, throughput Inference Benchmarking Token throughput (tok/s), Latency, Framework-hardware mix performance LLaMA-2-7B, LLaMA-2-70B, Mistral-7B, Qwen-7B 25 9.0 PDE tasks (forward/inverse) and I/O structures are clearly specified with detailed PDE context and constraints. 10.0 Hosted via DaRUS with a DOI, well-documented, versioned, and FAIR-compliant. 9.0 Uses RMSE variants and Fourier-based errors. 10.0 Baselines (FNO, U-Net, PINN) implemented and ready-to-run; strong community adoption. 9.0 Clean GitHub with usage, dataset links, and tutorial notebooks.
2023-12-12 SGLang Framework LLM Vision Fast serving framework for LLMs and vision-language models LLM serving, vision-language, RadixAttention, performance, JSON decoding Model serving framework Tokens/sec, Time-to-first-token, Throughput gain vs baseline LLaVA, DeepSeek, Llama 26 8.0 Clearly framed around surrogate learning across 16 domains, but not all tasks are formally posed or constrained in a unified benchmark protocol. Paper mentions performance on NVIDIA H100. 9.0 FAIR-compliant physics simulation dataset, structured in HDF5 with unified metadata. 7.0 Metrics like dataset size and domain coverage are listed, but standardized quantitative model evaluation metrics (e.g., RMSE, MAE) are not enforced. 9.0 FNO and U-Net baselines available; full benchmarking implementations pending NeurIPS paper code release. 10.0 Site and GitHub offer a unified API, metadata standards, and dataset loading tools; NeurIPS paper adds detailed design context.
2023-09-12 vLLM Inference and Serving Engine LLM; HPC/inference High-throughput, memory-efficient inference and serving engine for LLMs LLM inference, PagedAttention, CUDA graph, streaming API, quantization Inference Benchmarking Tokens/sec, Time to First Token (TTFT), Memory footprint LLaMA, Mixtral, FlashAttention-based models 27 9.0 Benchmarks hardware performance of LLM inference across multiple platforms with well-defined input/output and platform constraints. 7.0 Uses structured log files and configs instead of conventional datasets; suitable for inference benchmarking. 9.0 Clear throughput, latency, and utilization metrics; platform comparison dashboard enhances evaluation. 8.0 Includes reproducible scripts and example runs; models like LLaMA and Mistral are referenced with platform-specific configs. 8.0 GitHub contains clear instructions, platform details, and framework comparisons.
2022-06-22 vLLM Performance Dashboard LLM; HPC/inference Interactive dashboard showing inference performance of vLLM Dashboard, Throughput visualization, Latency analysis, Metric tracking Performance visualization Tokens/sec, TTFT, Memory usage LLaMA-2, Mistral, Qwen 28 8.0 Framed as a model-serving tool rather than a benchmark, but includes benchmark configurations and real model tasks. 6.0 Mostly uses dummy configs or external model endpoints for evaluation; not designed around a formal dataset. 8.0 Well-defined serving metrics: tokens/sec, time-to-first-token, and gain over baselines. 9.0 Core framework includes full reproducible serving benchmarks and code; multiple deployment case studies. 9.0 High-quality usage guides, examples, and performance tuning docs.
2022-04-01 Nixtla NeuralForecast Time-series forecasting; General ML High-performance neural forecasting library with >30 models time-series, neural forecasting, NBEATS, NHITS, TFT, probabilistic forecasting, usability Time-series forecasting RMSE, MAPE, CRPS NBEATS, NHITS, TFT, DeepAR 29 9.0 Targets high-throughput LLM inference via PagedAttention and memory-optimized serving; benchmarks cover many configs. 7.0 Focuses on model configs and streaming input/output pipelines rather than classical datasets. 9.0 Strong token/sec, memory usage, and TTFT metrics; comparative plots and logs included. 9.0 Benchmarks reproducible via script with support for multiple models and hardware types. 9.0 Excellent GitHub docs, CLI/API usage, and deployment walkthroughs.
2023-06-01 Nixtla Neural Forecast NHITS Time-series; General ML Official NHITS implementation for long-horizon time series forecasting NHITS, long-horizon forecasting, neural interpolation, time-series Time-series forecasting RMSE, MAPE NHITS 30 7.0 Primarily a visualization frontend; underlying benchmark definitions come from vLLM project. 6.0 No traditional dataset; displays live or logged benchmark metrics. 9.0 Live throughput, memory, latency, and TTFT displayed interactively; highly informative for performance analysis. 7.0 Dashboard built on vLLM benchmarks but not itself a complete experiment package. 8.0 Observable notebooks are intuitive; customization instructions are minimal but UI is self-explanatory.
2023-10-03 Nixtla Neural Forecast TimeLLM Time-series; General ML Reprogramming LLMs for time series forecasting Time-LLM, language model, time-series, reprogramming Time-series forecasting RMSE, MAPE Time-LLM 31 7.0 Describes forecasting with LLMs, but less formal on input/output or task framing. 6.0 Uses open time series datasets, but lacks a consolidated data release or splits. 7.0 Reports metrics like MASE and SMAPE, standard in forecasting. 6.0 Provides TimeLLM with open source, but no other baselines included. 6.0 GitHub readme with installation and example usage; lacks API or extensive tutorials.
2023-10-05 Nixtla Neural Forecast TimeGPT Time-series; General ML Time-series foundation model “TimeGPT” for forecasting and anomaly detection TimeGPT, foundation model, time-series, generative model Time-series forecasting, Anomaly detection RMSE, Anomaly detection metrics TimeGPT 32 7.0 Describes forecasting with LLMs, but less formal on input/output or task framing. 6.0 Uses open time series datasets, but lacks a consolidated data release or splits. 7.0 Reports metrics like MASE and SMAPE, standard in forecasting. 6.0 Provides TimeLLM with open source, but no other baselines included. 6.0 GitHub readme with installation and example usage; lacks API or extensive tutorials.
2025-03-03 HDR ML Anomaly Challenge (Gravitational Waves) Astrophysics; Time-series Detecting anomalous gravitational-wave signals from LIGO/Virgo datasets anomaly detection, gravitational waves, astrophysics, time-series Anomaly detection ROC-AUC, Precision/Recall Deep latent CNNs, Autoencoders 33 8.0 Novel approach treating forecasting as text generation is explained; framing is less conventional. 9.0 Compatible with standard forecasting datasets (e.g., M4, electricity). 8.0 RMSE and MAPE are included, but less emphasis on interpretability or time-series domain constraints. 9.0 Open-source with reprogramming layers, LLM interface scripts provided. 8.0 Model and architecture overview present, though usability guide is slightly lighter than others.
2025-03-03 HDR ML Anomaly Challenge (Butterfly) Genomics; Image/CV Detecting hybrid butterflies via image anomaly detection in genomic-informed dataset anomaly detection, computer vision, genomics, butterfly hybrids Anomaly detection Classification accuracy, F1 score CNN-based detectors 34 8.0 Task of detecting rare anomalies in butterfly physics is well-described with physics motivation. 7.0 Real detector data with injected anomalies is available, but requires NDA for full access. 7.0 Uses ROC, F1, and anomaly precision, standard in challenge evaluations. 4.0 Partial baselines described, but no codebase or reproducible runs. 6.0 Challenge site includes overview and metrics, but limited in walkthrough or examples.
2025-03-03 HDR ML Anomaly Challenge (Sea Level Rise) Climate Science; Time-series, Image/CV Detecting anomalous sea-level rise and flooding events via time-series and satellite imagery anomaly detection, climate science, sea-level rise, time-series, remote sensing Anomaly detection ROC-AUC, Precision/Recall CNNs, RNNs, Transformers 35 9.0 Clear anomaly detection objective framed for physical signal discovery (LIGO/Virgo). 10.0 Preprocessed waveform data from dual interferometers, public and well-structured. 9.0 ROC-AUC, Precision/Recall, and confusion-based metrics are standardized. 1.0 No starter model or baseline code linked 9.0 Codabench page, GitHub starter kit, and related papers provide strong guidance.
2025-01-24 Single Qubit Readout on QICK System Quantum Computing Real-time single-qubit state classification using FPGA firmware qubit readout, hls4ml, FPGA, QICK Classification Accuracy, Latency hls4ml quantized NN 36 8.0 Task clearly framed around detecting hybrid species via images, but exact labeling methods and hybrid definitions may need elaboration. 8.0 Dataset hosted on Codabench; appears structured but details on image sourcing and labeling pipeline are limited. 9.0 Classification accuracy and F1 are standard and appropriate. 1.0 No starter model or baseline code linked 7.5 Codabench task page describes dataset and evaluation method but lacks full API/docs.
2023-11-20 GPQA: A Graduate-Level Google-Proof Question and Answer Benchmark Science (Biology, Physics, Chemistry) Graduate-level, expert-validated multiple-choice questions hard even with web access Google-proof, multiple-choice, expert reasoning, science QA Multiple choice Accuracy GPT-4 baseline 37 9.0 Clear dual-modality task (image + time-series); environmental focus is well described. 9.0 Time-series and satellite imagery data provided; sensor info and collection intervals are explained. 9.0 ROC-AUC, Precision/Recall are appropriate and robust. 1.0 No starter model or baseline code linked 6.5 Moderate Codabench documentation with climate context; lacks pipeline-level walkthrough.
2024-12-13 SeafloorAI Marine Science; Vision-Language Large-scale vision-language dataset for seafloor mapping and geological classification sonar imagery, vision-language, seafloor mapping, segmentation, QA Image segmentation, Vision-language QA Segmentation pixel accuracy, QA accuracy SegFormer, ViLT-style multimodal models 38 9.0 Real-time qubit classification task clearly defined in quantum instrumentation context. 9.0 Dataset available on Zenodo with signal traces; compact and reproducible. 9.0 Accuracy and latency are well defined and crucial in this setting. 9.0 GitHub repo has reproducible code and HLS firmware targeting FPGA. 8.0 Good setup instructions, but no interactive visualization or starter notebook.
2024-12-13 SuperCon3D Materials Science; Superconductivity Dataset and models for predicting and generating high-Tc superconductors using 3D crystal structures superconductivity, crystal structures, equivariant GNN, generative models Regression (Tc prediction), Generative modeling MAE (Tc), Validity of generated structures SODNet, DiffCSP-SC 39 10.0 Multimodal task (segmentation + natural language QA pairs). 10.0 Sonar imagery + masks + descriptions, georeferenced and labeled with QA. 9.0 Pixel accuracy and QA metrics clearly defined; tasks split by modality. 8.0 Baseline models (SegFormer, ViLT) are cited; partial configs likely available. 8.5 Paper + GitHub metadata and processing details are comprehensive, though full dataset is not yet available.
2024-12-13 GeSS Scientific ML; Geometric Deep Learning Benchmark suite evaluating geometric deep learning models under real-world distribution shifts geometric deep learning, distribution shift, OOD robustness, scientific applications Classification, Regression Accuracy, RMSE, OOD robustness delta GCN, EGNN, DimeNet++ 40 9.0 Well-defined problem (Tc prediction, generation) with strong scientific motivation (high-Tc materials), but no formal hardware constraints. 9.0 Includes curated 3D crystal structures and Tc data; readily downloadable and used in paper models. 9.0 MAE and structural validity used, well-established in materials modeling. 8.0 Provides two reference models (SODNet, DiffCSP-SC) with results. Code likely available post-conference. 8.0 Paper and poster explain design choices well; software availability confirms reproducibility but limited external documentation.
2024-12-13 Vocal Call Locator (VCL) Neuroscience; Bioacoustics Benchmarking sound-source localization of rodent vocalizations from multi-channel audio source localization, bioacoustics, time-series, SSL Sound source localization Localization error (cm), Recall/Precision CNN-based SSL models 41 9.0 Clear benchmark scenarios across GDL tasks under multiple real-world shift settings; OOD settings precisely categorized. 8.0 Scientific graph datasets provided in multiple shift regimes; standardized splits across domains. Exact format of data not specified. 9.0 Includes base metrics (accuracy, RMSE) plus OOD delta robustness for evaluation under shifts. 9.0 Multiple baselines (11 algorithms x 3 backbones) evaluated; setup supports reproducible comparison. 2.0 Paper, poster, and source code provide thorough access to methodology and implementation. Setup instructions and accompanying code not present.
2024-12-13 MassSpecGym Cheminformatics; Molecular Discovery Benchmark suite for discovery and identification of molecules via MS/MS mass spectrometry, molecular structure, de novo generation, retrieval, dataset De novo generation, Retrieval, Simulation Structure accuracy, Retrieval precision, Simulation MSE Graph-based generative models, Retrieval baselines 42 9.0 Focused on sound source localization for rodent vocalizations in lab settings; well-scoped. 9.5 767,000 annotated audio segments across diverse conditions. Minor deduction for no train/validation/test split. 9.5 Localization error and precision/recall are used. 7.0 CNN-based baselines referenced but unclear whether pretrained models or training code are available. 2.0 Poster and paper outline benchmark intent and setup; repo expected but not confirmed in dataset card.
2024-12-13 Urban Data Layer (UDL) Urban Computing; Data Engineering Unified data pipeline for multi-modal urban science research data pipeline, urban science, multi-modal, benchmark Prediction, Classification Task-specific accuracy or RMSE Baseline regression/classification pipelines 43 9.0 Three tasks (de novo generation, retrieval, simulation) are clearly defined for MS/MS molecule discovery. 10.0 Over 1 million spectra with structure annotations; dataset is open-source and well-documented. 9.0 Task-appropriate metrics (structure accuracy, precision, MSE) are specified and used consistently. 8.0 Baseline models are available (graph-based and retrieval), though not exhaustive. 9.0 GitHub repo and poster provide code and reproducibility guidance.
2024-12-13 Delta Squared-DFT Computational Chemistry; Materials Science Benchmarking machine-learning corrections to DFT using Delta Squared-trained models for reaction energies density functional theory, Delta Squared-ML correction, reaction energetics, quantum chemistry Regression Mean Absolute Error (eV), Energy ranking accuracy Delta Squared-ML correction networks, Kernel ridge regression 44 8.0 Clear goals around unifying urban data formats and tasks (e.g., air quality prediction), though some specifics could be more formal. 9.0 Multi-modal data is standardized and accessible; GitHub repo available. 8.0 Uses common task metrics like accuracy/RMSE, though varies by task. 7.0 Baseline regression/classification models included. 8.0 Source code supports pipeline reuse, but formal evaluation splits may vary.
2024-12-13 LLMs for Crop Science Agricultural Science; NLP Evaluating LLMs on crop trait QA and textual inference tasks with domain-specific prompts crop science, prompt engineering, domain adaptation, question answering Question Answering, Inference Accuracy, F1 score GPT-4, LLaMA-2-13B, T5-XXL 45 9.0 The task of ML correction to DFT energy predictions is well-specified. 9.0 10 public reaction datasets with DFT and CC references; well-documented. 8.0 Uses MAE and ranking accuracy, suitable for this task. 8.0 Includes both Delta^2 and KRR baselines. 9.0 Public benchmarks and clear reproducibility via datasets and model code.
2024-12-13 SPIQA (LLM) Multimodal Scientific QA; Computer Vision Evaluating LLMs on image-based scientific paper figure QA tasks (LLM Adapter performance) multimodal QA, scientific figures, image+text, chain-of-thought prompting Multimodal QA Accuracy, F1 score LLaVA, MiniGPT-4, Owl-LLM adapter variants 46 6.0 Task of QA over scientific figures is interesting but not fully formalized in input/output terms. 6.0 Uses SPIQA dataset with ~10 adapters; figures and questions are included, but not fully open. 7.0 Reports accuracy and F1; fair but no visual reasoning-specific metric. 6.0 10 LLM adapter baselines; results included. 5.0 Poster paper and limited documentation; no reproducibility instructions.
  1. Javier Duarte, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi, Shvetank Prakash, and Vijay Janapa Reddi. Fastml science benchmarks: accelerating real-time scientific edge machine learning. 2022. URL: https://arxiv.org/abs/2207.07958, arXiv:2207.07958. 

  2. Javier Duarte, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi, Shvetank Prakash, and Vijay Janapa Reddi. Fastml science benchmarks: accelerating real-time scientific edge machine learning. 2022. URL: https://arxiv.org/abs/2207.07958, arXiv:2207.07958. 

  3. Javier Duarte, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi, Shvetank Prakash, and Vijay Janapa Reddi. Fastml science benchmarks: accelerating real-time scientific edge machine learning. 2022. URL: https://arxiv.org/abs/2207.07958, arXiv:2207.07958. 

  4. Diana Kafkes and Jason St. John. Boostr: a dataset for accelerator control systems. 2021. URL: https://arxiv.org/abs/2101.08359, arXiv:2101.08359. 

  5. Patrick Odagiu, Zhiqiang Que, Javier Duarte, Johannes Haller, Gregor Kasieczka, Artur Lobanov, Vladimir Loncar, Wayne Luk, Jennifer Ngadiuba, Maurizio Pierini, Philipp Rincke, Arpita Seksaria, Sioni Summers, Andre Sznajder, Alexander Tapper, and Thea K. Aarrestad. Ultrafast jet classification on fpgas for the hl-lhc. 2024. URL: https://arxiv.org/abs/2402.01876, arXiv:2402.01876, doi:10.1088/2632-2153/ad5f10.

  6. A. Abed Abud, B. Abi, R. Acciarri, M. A. Acero, G. Adamov, D. Adams, M. Adinolfi, A. Aduszkiewicz, Z. Ahmad, J. Ahmed, T. Alion, S. Alonso Monsalve, M. Alrashed, C. Alt, A. Alton, P. Amedo, J. Anderson, C. Andreopoulos, M. P. Andrews, F. Andrianala, S. Andringa, N. Anfimov, A. Ankowski, M. Antonova, S. Antusch, A. Aranda-Fernandez, A. Ariga, L. O. Arnold, M. A. Arroyave, J. Asaadi, A. Aurisano, V. Aushev, D. Autiero, M. Ayala-Torres, F. Azfar, H. Back, J. J. Back, C. Backhouse, P. Baesso, I. Bagaturia, L. Bagby, S. Balasubramanian, P. Baldi, B. Baller, B. Bambah, F. Barao, G. Barenboim, G. J. Barker, W. Barkhouse, C. Barnes, G. Barr, J. Barranco Monarca, N. Barros, J. L. Barrow, A. Basharina-Freshville, A. Bashyal, V. Basque, E. Belchior, J. B. R. Battat, F. Battisti, F. Bay, J. L. Bazo Alba, J. F. Beacom, E. Bechetoille, B. Behera, L. Bellantoni, G. Bellettini, V. Bellini, O. Beltramello, D. Belver, N. Benekos, F. Bento Neves, S. Berkman, P. Bernardini, R. M. Berner, H. Berns, S. Bertolucci, M. Betancourt, A. Betancur Rodríguez, M. Bhattacharjee, S. Bhuller, B. Bhuyan, S. Biagi, J. Bian, M. Biassoni, K. Biery, B. Bilki, M. Bishai, A. Bitadze, A. Blake, F. D. M. Blaszczyk, G. C. Blazey, E. Blucher, J. Boissevain, S. Bolognesi, T. Bolton, L. Bomben, M. Bonesini, M. Bongrand, F. Bonini, A. Booth, C. Booth, S. Bordoni, A. Borkum, T. Boschi, N. Bostan, P. Bour, C. Bourgeois, S. B. Boyd, D. Boyden, J. Bracinik, D. Braga, D. Brailsford, A. Brandt, J. Bremer, C. Brew, E. Brianne, S. J. Brice, C. Brizzolari, C. Bromberg, G. Brooijmans, J. Brooke, A. Bross, G. Brunetti, M. Brunetti, N. Buchanan, H. Budd, D. Caiulo, P. Calafiura, J. Calcutt, M. Calin, S. Calvez, E. Calvo, A. Caminata, M. Campanelli, K. Cankocak, D. Caratelli, G. Carini, B. Carlus, P. Carniti, I. Caro Terrazas, H. Carranza, T. Carroll, J. F. Castaño Forero, A. Castillo, C. Castromonte, E. Catano-Mur, C. Cattadori, F. Cavalier, F. Cavanna, S. Centro, G. Cerati, A. Cervelli, A. Cervera Villanueva, M. 
Chalifour, A. Chappell, E. Chardonnet, N. Charitonidis, A. Chatterjee, S. Chattopadhyay, H. Chen, M. Chen, Y. Chen, Z. Chen, D. Cherdack, C. Chi, S. Childress, A. Chiriacescu, G. Chisnall, K. Cho, S. Choate, D. Chokheli, S. Choubey, A. Christensen, D. Christian, G. Christodoulou, A. Chukanov, E. Church, P. Clarke, T. E. Coan, A. G. Cocco, J. A. B. Coelho, E. Conley, R. Conley, J. M. Conrad, M. Convery, S. Copello, L. Corwin, L. Cremaldi, L. Cremonesi, J. I. Crespo-Anadón, E. Cristaldo, R. Cross, A. Cudd, C. Cuesta, Y. Cui, D. Cussans, M. Dabrowski, O. Dalager, H. da Motta, L. Da Silva Peres, C. David, Q. David, G. S. Davies, S. Davini, J. Dawson, K. De, R. M. De Almeida, P. Debbins, I. De Bonis, M. P. Decowski, A. de Gouvêa, P. C. De Holanda, I. L. De Icaza Astiz, A. Deisting, P. De Jong, A. Delbart, D. Delepine, M. Delgado, A. Dell’Acqua, P. De Lurgio, J. R. T. de Mello Neto, D. M. DeMuth, S. Dennis, C. Densham, G. W. Deptuch, A. De Roeck, V. De Romeri, G. De Souza, R. Dharmapalan, F. Diaz, J. S. Díaz, S. Di Domizio, L. Di Giulio, P. Ding, L. Di Noto, C. Distefano, R. Diurba, M. Diwan, Z. Djurcic, N. Dokania, S. Dolan, M. J. Dolinski, L. Domine, D. Douglas, D. Douillet, G. Drake, F. Drielsma, D. Duchesneau, K. Duffy, P. Dunne, T. Durkin, H. Duyang, O. Dvornikov, D. A. Dwyer, A. S. Dyshkant, M. Eads, A. Earle, D. Edmunds, J. Eisch, L. Emberger, S. Emery, A. Ereditato, C. O. Escobar, G. Eurin, J. J. Evans, E. Ewart, A. C. Ezeribe, K. Fahey, A. Falcone, C. Farnese, Y. Farzan, J. Felix, M. Fernandes Carneiro da Silva, E. Fernandez-Martinez, P. Fernandez Menendez, F. Ferraro, L. Fields, F. Filthaut, A. Fiorentini, R. S. Fitzpatrick, W. Flanagan, B. Fleming, R. Flight, D. V. Forero, J. Fowler, W. Fox, J. Franc, K. Francis, D. Franco, J. Freeman, J. Freestone, J. Fried, A. Friedland, S. Fuess, I. Furic, A. P. Furmanski, A. Gago, H. Gallagher, A. Gallas, A. Gallego-Ros, N. Gallice, V. Galymov, E. Gamberini, T. Gamble, R. Gandhi, R. Gandrajula, F. Gao, S. Gao, D. 
Garcia-Gamez, M. Á García-Peris, S. Gardiner, D. Gastler, G. Ge, B. Gelli, A. Gendotti, S. Gent, Z. Ghorbani-Moghaddam, D. Gibin, I. Gil-Botella, S. Gilligan, C. Girerd, A. K. Giri, D. Gnani, O. Gogota, M. Gold, S. Gollapinni, K. Gollwitzer, R. A. Gomes, L. V. Gomez Bermeo, L. S. Gomez Fajardo, F. Gonnella, J. A. Gonzalez-Cuevas, D. Gonzalez-Diaz, M. Gonzalez-Lopez, M. C. Goodman, O. Goodwin, S. Goswami, C. Gotti, E. Goudzovski, C. Grace, M. Graham, R. Gran, E. Granados, P. Granger, A. Grant, C. Grant, D. Gratieri, P. Green, L. Greenler, J. Greer, W. C. Griffith, M. Groh, J. Grudzinski, K. Grzelak, W. Gu, V. Guarino, R. Guenette, E. Guerard, A. Guglielmi, B. Guo, K. K. Guthikonda, R. Gutierrez, P. Guzowski, M. M. Guzzo, S. Gwon, A. Habig, H. Hadavand, R. Haenni, A. Hahn, J. Haiston, P. Hamacher-Baumann, T. Hamernik, P. Hamilton, J. Han, D. A. Harris, J. Hartnell, J. Harton, T. Hasegawa, C. Hasnip, R. Hatcher, K. W. Hatfield, A. Hatzikoutelis, C. Hayes, E. Hazen, A. Heavey, K. M. Heeger, J. Heise, K. Hennessy, S. Henry, M. A. Hernandez Morquecho, K. Herner, L. Hertel, V Hewes, A. Higuera, T. Hill, S. J. Hillier, A. Himmel, J. Hoff, C. Hohl, A. Holin, E. Hoppe, G. A. Horton-Smith, M. Hostert, A. Hourlier, B. Howard, R. Howell, J. Huang, J. Huang, J. Hugon, G. Iles, N. Ilic, A. M. Iliescu, R. Illingworth, A. Ioannisian, L. Isenhower, R. Itay, A. Izmaylov, S. Jackson, V. Jain, E. James, B. Jargowsky, F. Jediny, D. Jena, Y. S. Jeong, C. Jesús-Valls, X. Ji, L. Jiang, S. Jiménez, A. Jipa, R. Johnson, B. Jones, S. B. Jones, M. Judah, C. K. Jung, T. Junk, Y. Jwa, M. Kabirnezhad, A. Kaboth, I. Kadenko, I. Kakorin, F. Kamiya, N. Kaneshige, G. Karagiorgi, G. Karaman, A. Karcher, M. Karolak, Y. Karyotakis, S. Kasai, S. P. Kasetti, L. Kashur, N. Kazaryan, E. Kearns, P. Keener, K. J. Kelly, E. Kemp, O. Kemularia, W. Ketchum, S. H. Kettell, M. Khabibullin, A. Khotjantsev, A. Khvedelidze, D. Kim, B. King, B. Kirby, M. Kirby, J. Klein, K. Koehler, L. W. Koerner, S. Kohn, P. P. 
Koller, L. Kolupaeva, M. Kordosky, T. Kosc, U. Kose, V. A. Kostelecký, K. Kothekar, F. Krennrich, I. Kreslo, Y. Kudenko, V. A. Kudryavtsev, S. Kulagin, J. Kumar, P. Kumar, P. Kunze, N. Kurita, C. Kuruppu, V. Kus, T. Kutter, A. Lambert, B. Land, K. Lande, C. E. Lane, K. Lang, T. Langford, J. Larkin, P. Lasorak, D. Last, C. Lastoria, A. Laundrie, A. Lawrence, I. Lazanu, R. LaZur, T. Le, S. Leardini, J. Learned, P. LeBrun, T. LeCompte, G. Lehmann Miotto, R. Lehnert, M. A. Leigui de Oliveira, M. Leitner, L. Li, S. W. Li, T. Li, Y. Li, H. Liao, C. S. Lin, Q. Lin, S. Lin, A. Lister, B. R. Littlejohn, J. Liu, S. Lockwitz, T. Loew, M. Lokajicek, I. Lomidze, K. Long, K. Loo, D. Lorca, T. Lord, J. M. LoSecco, W. C. Louis, X. -G. Lu, K. B. Luk, X. Luo, N. Lurkin, T. Lux, V. P. Luzio, D. MacFarlane, A. A. Machado, P. Machado, C. T. Macias, J. R. Macier, A. Maddalena, A. Madera, P. Madigan, S. Magill, K. Mahn, A. Maio, A. Major, J. A. Maloney, G. Mandrioli, R. C. Mandujano, J. Maneira, L. Manenti, S. Manly, A. Mann, K. Manolopoulos, M. Manrique Plata, V. N. Manyam, L. Manzanillas, M. Marchan, A. Marchionni, W. Marciano, D. Marfatia, C. Mariani, J. Maricic, R. Marie, F. Marinho, A. D. Marino, D. Marsden, M. Marshak, C. M. Marshall, J. Marshall, J. Marteau, J. Martin-Albo, N. Martinez, D. A. Martinez Caicedo, S. Martynenko, K. Mason, A. Mastbaum, M. Masud, S. Matsuno, J. Matthews, C. Mauger, N. Mauri, K. Mavrokoridis, I. Mawby, R. Mazza, A. Mazzacane, E. Mazzucato, T. McAskill, E. McCluskey, N. McConkey, K. S. McFarland, C. McGrew, A. McNab, A. Mefodiev, P. Mehta, P. Melas, O. Mena, S. Menary, H. Mendez, D. P. Méndez, A. Menegolli, G. Meng, M. D. Messier, W. Metcalf, T. Mettler, M. Mewes, H. Meyer, T. Miao, G. Michna, T. Miedema, J. Migenda, V. Mikola, R. Milincic, W. Miller, J. Mills, C. Milne, O. Mineev, O. G. Miranda, S. Miryala, C. S. Mishra, S. R. Mishra, A. Mislivec, D. Mladenov, I. Mocioiu, K. Moffat, N. Moggi, R. Mohanta, T. A. Mohayai, N. Mokhov, J. Molina, L. 
Molina Bueno, A. Montanari, C. Montanari, D. Montanari, L. M. Montano Zetina, J. Moon, M. Mooney, A. F. Moor, D. Moreno, C. Morris, C. Mossey, E. Motuk, C. A. Moura, J. Mousseau, W. Mu, L. Mualem, J. Mueller, M. Muether, S. Mufson, F. Muheim, A. Muir, M. Mulhearn, D. Munford, H. Muramatsu, S. Murphy, J. Musser, J. Nachtman, S. Nagu, M. Nalbandyan, R. Nandakumar, D. Naples, S. Narita, D. Navas-Nicolás, A. Navrer-Agasson, N. Nayak, M. Nebot-Guinot, K. Negishi, J. K. Nelson, J. Nesbit, M. Nessi, D. Newbold, M. Newcomer, D. Newhart, H. Newton, R. Nichol, F. Nicolas-Arnaldos, E. Niner, K. Nishimura, A. Norman, A. Norrick, R. Northrop, P. Novella, J. A. Nowak, M. Oberling, J. P. Ochoa-Ricoux, A. Olivares Del Campo, A. Olivier, A. Olshevskiy, Y. Onel, Y. Onishchuk, J. Ott, L. Pagani, S. Pakvasa, G. Palacio, O. Palamara, S. Palestini, J. M. Paley, M. Pallavicini, C. Palomares, J. L. Palomino-Gallo, E. Pantic, V. Paolone, V. Papadimitriou, R. Papaleo, A. Papanestis, S. Paramesvaran, S. Parke, Z. Parsa, M. Parvu, S. Pascoli, L. Pasqualini, J. Pasternak, J. Pater, C. Patrick, L. Patrizii, R. B. Patterson, S. J. Patton, T. Patzak, A. Paudel, B. Paulos, L. Paulucci, Z. Pavlovic, G. Pawloski, D. Payne, V. Pec, S. J. M. Peeters, E. Pennacchio, A. Penzo, O. L. G. Peres, J. Perry, D. Pershey, G. Pessina, G. Petrillo, C. Petta, R. Petti, F. Piastra, L. Pickering, F. Pietropaolo, R. Plunkett, R. Poling, X. Pons, N. Poonthottathil, S. Pordes, J. Porter, M. Potekhin, R. Potenza, B. V. K. S. Potukuchi, J. Pozimski, M. Pozzato, S. Prakash, T. Prakash, S. Prince, D. Pugnere, X. Qian, M. C. Queiroga Bazetto, J. L. Raaf, V. Radeka, J. Rademacker, B. Radics, A. Rafique, E. Raguzin, M. Rai, M. Rajaoalisoa, I. Rakhno, A. Rakotonandrasana, L. Rakotondravohitra, Y. A. Ramachers, R. Rameika, M. A. Ramirez Delgado, B. Ramson, A. Rappoldi, G. Raselli, P. Ratoff, S. Raut, R. F. Razakamiandra, J. S. Real, B. Rebel, M. Reggiani-Guzzo, T. Rehak, J. Reichenbacher, S. D. Reitzner, H. Rejeb Sfar, A. 
Renshaw, S. Rescia, F. Resnati, A. Reynolds, C. Riccio, G. Riccobene, L. C. J. Rice, J. Ricol, A. Rigamonti, Y. Rigaut, D. Rivera, L. Rochester, M. Roda, P. Rodrigues, M. J. Rodriguez Alonso, E. Rodriguez Bonilla, J. Rodriguez Rondon, S. Rosauro-Alcaraz, M. Rosenberg, P. Rosier, B. Roskovec, M. Rossella, J. Rout, P. Roy, S. Roy, A. Rubbia, C. Rubbia, F. C. Rubio, B. Russell, D. Ruterbories, R. Saakyan, S. Sacerdoti, T. Safford, R. Sahay, N. Sahu, P. Sala, N. Samios, O. Samoylov, M. C. Sanchez, D. A. Sanders, D. Sankey, S. Santana, M. Santos-Maldonado, N. Saoulidou, P. Sapienza, C. Sarasty, I. Sarcevic, G. Savage, V. Savinov, A. Scaramelli, A. Scarff, A. Scarpelli, T. Schaffer, H. Schellman, P. Schlabach, D. Schmitz, K. Scholberg, A. Schukraft, E. Segreto, J. Sensenig, I. Seong, A. Sergi, D. Sgalaberna, M. H. Shaevitz, S. Shafaq, M. Shamma, R. Sharankova, H. R. Sharma, R. Sharma, R. Kumar, T. Shaw, C. Shepherd-Themistocleous, S. Shin, D. Shooltz, R. Shrock, L. Simard, F. Simon, N. Simos, J. Sinclair, G. Sinev, J. Singh, J. Singh, V. Singh, R. Sipos, F. W. Sippach, G. Sirri, A. Sitraka, K. Siyeon, K. Skarpaas VIII, A. Smith, E. Smith, P. Smith, J. Smolik, M. Smy, E. L. Snider, P. Snopok, M. Soares Nunes, H. Sobel, M. Soderberg, C. J. Solano Salinas, S. Söldner-Rembold, N. Solomey, V. Solovov, W. E. Sondheim, M. Sorel, J. Soto-Oton, A. Sousa, K. Soustruznik, F. Spagliardi, M. Spanu, J. Spitz, N. J. C. Spooner, K. Spurgeon, R. Staley, M. Stancari, L. Stanco, R. Stanley, R. Stein, H. M. Steiner, J. Stewart, B. Stillwell, J. Stock, F. Stocker, T. Stokes, M. Strait, T. Strauss, S. Striganov, A. Stuart, J. G. Suarez, H. Sullivan, D. Summers, A. Surdo, V. Susic, L. Suter, C. M. Sutera, R. Svoboda, B. Szczerbinska, A. M. Szelc, R. Talaga, H. A. Tanaka, B. Tapia Oregui, A. Tapper, S. Tariq, E. Tatar, R. Tayloe, A. M. Teklu, M. Tenti, K. Terao, C. A. Ternes, F. Terranova, G. Testera, A. Thea, J. L. Thompson, C. Thorn, S. C. Timm, J. Todd, A. Tonazzo, D. Torbunov, M. Torti, M. 
Tortola, F. Tortorici, D. Totani, M. Toups, C. Touramanis, J. Trevor, S. Trilov, W. H. Trzaska, Y. T. Tsai, Z. Tsamalaidze, K. V. Tsang, N. Tsverava, S. Tufanli, C. Tull, E. Tyley, M. Tzanov, M. A. Uchida, J. Urheim, T. Usher, S. Uzunyan, M. R. Vagins, P. Vahle, G. A. Valdiviesso, E. Valencia, Z. Vallari, J. W. F. Valle, S. Vallecorsa, R. Van Berg, R. G. Van de Water, F. Varanini, D. Vargas, G. Varner, J. Vasel, S. Vasina, G. Vasseur, N. Vaughan, K. Vaziri, S. Ventura, A. Verdugo, S. Vergani, M. A. Vermeulen, M. Verzocchi, M. Vicenzi, H. Vieira de Souza, C. Vignoli, C. Vilela, B. Viren, T. Vrba, T. Wachala, A. V. Waldron, M. Wallbank, H. Wang, J. Wang, M. H. L. S. Wang, Y. Wang, Y. Wang, K. Warburton, D. Warner, M. Wascko, D. Waters, A. Watson, P. Weatherly, A. Weber, M. Weber, H. Wei, A. Weinstein, D. Wenman, M. Wetstein, A. White, L. H. Whitehead, D. Whittington, M. J. Wilking, C. Wilkinson, Z. Williams, F. Wilson, R. J. Wilson, J. Wolcott, T. Wongjirad, A. Wood, K. Wood, E. Worcester, M. Worcester, C. Wret, W. Wu, W. Wu, Y. Xiao, E. Yandel, G. Yang, K. Yang, S. Yang, T. Yang, A. Yankelevich, N. Yershov, K. Yonehara, T. Young, B. Yu, H. Yu, J. Yu, W. Yuan, R. Zaki, J. Zalesak, L. Zambelli, B. Zamorano, A. Zani, L. Zazueta, G. Zeit, G. P. Zeller, J. Zennamo, K. Zeug, C. Zhang, M. Zhao, E. Zhivun, G. Zhu, P. Zilberman, E. D. Zimmerman, M. Zito, S. Zucchelli, J. Zuklin, V. Zutshi, and R. Zwaska. Deep underground neutrino experiment (dune) near detector conceptual design report. 2021. URL: https://arxiv.org/abs/2103.13910, arXiv:2103.13910. 

  7. J. Kvapil, G. Borca-Tasciuc, H. Bossi, K. Chen, Y. Chen, Y. Corrales Morales, H. Da Costa, C. Da Silva, C. Dean, J. Durham, S. Fu, C. Hao, P. Harris, O. Hen, H. Jheng, Y. Lee, P. Li, X. Li, Y. Lin, M. X. Liu, V. Loncar, J. P. Mitrevski, A. Olvera, M. L. Purschke, J. S. Renck, G. Roland, J. Schambach, Z. Shi, N. Tran, N. Wuerfel, B. Xu, D. Yu, and H. Zhang. Intelligent experiments through real-time ai: fast data processing and autonomous detector control for sphenix and future eic detectors. 2025. URL: https://arxiv.org/abs/2501.04845, arXiv:2501.04845. 

  8. Jason Weitz, Dmitri Demler, Luke McDermott, Nhan Tran, and Javier Duarte. Neural architecture codesign for fast physics applications. 2025. URL: https://arxiv.org/abs/2501.05515, arXiv:2501.05515. 

  9. Benjamin Parpillon, Chinar Syal, Jieun Yoo, Jennet Dickinson, Morris Swartz, Giuseppe Di Guglielmo, Alice Bean, Douglas Berry, Manuel Blanco Valentin, Karri DiPetrillo, Anthony Badea, Lindsey Gray, Petar Maksimovic, Corrinne Mills, Mark S. Neubauer, Gauri Pradhan, Nhan Tran, Dahai Wen, and Farah Fahim. Smart pixels: in-pixel ai for on-sensor data filtering. 2024. URL: https://arxiv.org/abs/2406.14860, arXiv:2406.14860. 

  10. Zhengchun Liu, Hemant Sharma, Jun-Sang Park, Peter Kenesei, Antonino Miceli, Jonathan Almer, Rajkumar Kettimuthu, and Ian Foster. Braggnn: fast x-ray bragg peak analysis using deep learning. 2021. URL: https://arxiv.org/abs/2008.08198, arXiv:2008.08198. 

  11. Shuyu Qin, Joshua Agar, and Nhan Tran. Extremely noisy 4d-tem strain mapping using cycle consistent spatial transforming autoencoders. In AI for Accelerated Materials Design - NeurIPS 2023 Workshop. 2023. URL: https://openreview.net/forum?id=7yt3N0o0W9. 

  12. Yumou Wei, Ryan F. Forelli, Chris Hansen, Jeffrey P. Levesque, Nhan Tran, Joshua C. Agar, Giuseppe Di Guglielmo, Michael E. Mauel, and Gerald A. Navratil. Low latency optical-based mode tracking with machine learning deployed on fpgas on a tokamak. 2024. URL: https://arxiv.org/abs/2312.00128, arXiv:2312.00128, doi:10.1063/5.0190354.

  13. Wanling Gao, Fei Tang, Lei Wang, Jianfeng Zhan, Chunxin Lan, Chunjie Luo, Yunyou Huang, Chen Zheng, Jiahui Dai, Zheng Cao, Daoyi Zheng, Haoning Tang, Kunlin Zhan, Biao Wang, Defei Kong, Tong Wu, Minghe Yu, Chongkang Tan, Huan Li, Xinhui Tian, Yatao Li, Junchao Shao, Zhenyu Wang, Xiaoyu Wang, and Hainan Ye. Aibench: an industry standard internet service ai benchmark suite. 2019. URL: https://arxiv.org/abs/1908.08998, arXiv:1908.08998. 

  14. Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Xu Wen, Rui Ren, Chen Zheng, Xiwen He, Hainan Ye, Haoning Tang, Zheng Cao, Shujie Zhang, and Jiahui Dai. Bigdatabench: a scalable and unified big data and ai benchmark suite. 2018. URL: https://arxiv.org/abs/1802.08254, arXiv:1802.08254. 

  15. Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda, Toshiyuki Imamura, Akihiko Kasagi, Kentaro Kawakami, Shuhei Kudo, Akiyoshi Kuroda, Maxime Martinasso, Satoshi Matsuoka, Henrique Mendonça, Kazuki Minami, Prabhat Ram, Takashi Sawada, Mallikarjun Shankar, Tom St. John, Akihiro Tabuchi, Venkatram Vishwanath, Mohamed Wahib, Masafumi Yamazaki, and Junqi Yin. Mlperf hpc: a holistic benchmark suite for scientific machine learning on hpc systems. 2021. URL: https://arxiv.org/abs/2110.11466, arXiv:2110.11466. 

  16. Jeyan Thiyagalingam, Gregor von Laszewski, Junqi Yin, Murali Emani, Juri Papay, Gregg Barrett, Piotr Luszczek, Aristeidis Tsaris, Christine Kirkpatrick, Feiyi Wang, Tom Gibbs, Venkatram Vishwanath, Mallikarjun Shankar, Geoffrey Fox, and Tony Hey. Ai benchmarking for science: efforts from the mlcommons science working group. In Hartwig Anzt, Amanda Bienz, Piotr Luszczek, and Marc Baboulin, editors, High Performance Computing. ISC High Performance 2022 International Workshops, 47–64. Cham, 2022. Springer International Publishing. 

  17. Thea Aarrestad, Ekaterina Govorkova, Jennifer Ngadiuba, Ema Puljak, Maurizio Pierini, and Kinga Anna Wozniak. Unsupervised new physics detection at 40 mhz: training dataset. 2021. URL: https://zenodo.org/record/5046389, doi:10.5281/ZENODO.5046389. 

  18. Alexandros Karargyris, Renato Umeton, Micah J. Sheller, Alejandro Aristizabal, Johnu George, Anna Wuest, Sarthak Pati, Hasan Kassem, Maximilian Zenk, Ujjwal Baid, Prakash Narayana Moorthy, Alexander Chowdhury, Junyi Guo, Sahil Nalawade, Jacob Rosenthal, David Kanter, Maria Xenochristou, Daniel J. Beutel, Verena Chung, Timothy Bergquist, James Eddy, Abubakar Abid, Lewis Tunstall, Omar Sanseviero, Dimitrios Dimitriadis, Yiming Qian, Xinxing Xu, Yong Liu, Rick Siow Mong Goh, Srini Bala, Victor Bittorf, Sreekar Reddy Puchala, Biagio Ricciuti, Soujanya Samineni, Eshna Sengupta, Akshay Chaudhari, Cody Coleman, Bala Desinghu, Gregory Diamos, Debo Dutta, Diane Feddema, Grigori Fursin, Xinyuan Huang, Satyananda Kashyap, Nicholas Lane, Indranil Mallick, Pietro Mascagni, Virendra Mehta, Cassiano Ferro Moraes, Vivek Natarajan, Nikola Nikolov, Nicolas Padoy, Gennady Pekhimenko, Vijay Janapa Reddi, G. Anthony Reina, Pablo Ribalta, Abhishek Singh, Jayaraman J. Thiagarajan, Jacob Albrecht, Thomas Wolf, Geralyn Miller, Huazhu Fu, Prashant Shah, Daguang Xu, Poonam Yadav, David Talby, Mark M. Awad, Jeremy P. Howard, Michael Rosenthal, Luigi Marchionni, Massimo Loda, Jason M. Johnson, Spyridon Bakas, Peter Mattson, FeTS Consortium, BraTS-2020 Consortium, and AI4SafeChole Consortium. Federated benchmarking of medical artificial intelligence with medperf. Nature Machine Intelligence, 5(7):799–810, July 2023. URL: https://doi.org/10.1038/s42256-023-00652-2, doi:10.1038/s42256-023-00652-2. 

  19. Claudius Krause, Michele Faucci Giannelli, Gregor Kasieczka, Benjamin Nachman, Dalila Salamani, David Shih, Anna Zaborowska, Oz Amram, Kerstin Borras, Matthew R. Buckley, Erik Buhmann, Thorsten Buss, Renato Paulo Da Costa Cardoso, Anthony L. Caterini, Nadezda Chernyavskaya, Federico A. G. Corchia, Jesse C. Cresswell, Sascha Diefenbacher, Etienne Dreyer, Vijay Ekambaram, Engin Eren, Florian Ernst, Luigi Favaro, Matteo Franchini, Frank Gaede, Eilam Gross, Shih-Chieh Hsu, Kristina Jaruskova, Benno Käch, Jayant Kalagnanam, Raghav Kansal, Taewoo Kim, Dmitrii Kobylianskii, Anatolii Korol, William Korcari, Dirk Krücker, Katja Krüger, Marco Letizia, Shu Li, Qibin Liu, Xiulong Liu, Gabriel Loaiza-Ganem, Thandikire Madula, Peter McKeown, Isabell-A. Melzer-Pellmann, Vinicius Mikuni, Nam Nguyen, Ayodele Ore, Sofia Palacios Schweitzer, Ian Pang, Kevin Pedro, Tilman Plehn, Witold Pokorski, Huilin Qu, Piyush Raikwar, John A. Raine, Humberto Reyes-Gonzalez, Lorenzo Rinaldi, Brendan Leigh Ross, Moritz A. W. Scham, Simon Schnake, Chase Shimmin, Eli Shlizerman, Nathalie Soybelman, Mudhakar Srivatsa, Kalliopi Tsolaki, Sofia Vallecorsa, Kyongmin Yeo, and Rui Zhang. Calochallenge 2022: a community challenge for fast calorimeter simulation. 2024. URL: https://arxiv.org/abs/2410.21611, arXiv:2410.21611. 

  20. Avrim Blum and Moritz Hardt. The ladder: a reliable leaderboard for machine learning competitions. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 1006–1014. Lille, France, July 2015. PMLR. URL: https://proceedings.mlr.press/v37/blum15.html. 

  21. Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, and Isabelle Guyon. Codabench: flexible, easy-to-use, and reproducible meta-benchmark platform. Patterns, 3(7):100543, July 2022. URL: http://dx.doi.org/10.1016/j.patter.2022.100543, doi:10.1016/j.patter.2022.100543. 

  22. Piotr Luszczek. Sabath: fair metadata technology for surrogate benchmarks. Technical Report, University of Tennessee, 2021. URL: https://github.com/icl-utk-edu/slip/tree/sabath. 

  23. Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Dan MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. Pdebench: an extensive benchmark for scientific machine learning. 2024. URL: https://arxiv.org/abs/2210.07182, arXiv:2210.07182. 

  24. Ruben Ohana, Michael McCabe, Lucas Meyer, Rudy Morel, Fruzsina J. Agocs, Miguel Beneitez, Marsha Berger, Blakesley Burkhart, Stuart B. Dalziel, Drummond B. Fielding, Daniel Fortunato, Jared A. Goldberg, Keiya Hirashima, Yan-Fei Jiang, Rich R. Kerswell, Suryanarayana Maddu, Jonah Miller, Payel Mukhopadhyay, Stefan S. Nixon, Jeff Shen, Romain Watteaux, Bruno Régaldo-Saint Blancard, François Rozet, Liam H. Parker, Miles Cranmer, and Shirley Ho. The well: a large-scale collection of diverse physics simulations for machine learning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 44989–45037. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/4f9a5acd91ac76569f2fe291b1f4772b-Paper-Datasets_and_Benchmarks_Track.pdf. 

  25. Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, and Venkatram Vishwanath. Llm-inference-bench: inference benchmarking of large language models on ai accelerators. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1362–1379. 2024. doi:10.1109/SCW63240.2024.00178.

  26. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. 2024. URL: https://arxiv.org/abs/2312.07104, arXiv:2312.07104. 

  27. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, 611–626. New York, NY, USA, 2023. Association for Computing Machinery. URL: https://doi.org/10.1145/3600006.3613165, doi:10.1145/3600006.3613165.

  28. Simon Mo. Vllm performance dashboard. 2024. URL: https://simon-mo-workspace.observablehq.cloud/vllm-dashboard-v0/. 

  29. Kin G. Olivares, Cristian Challú, Federico Garza, Max Mergenthaler Canseco, and Artur Dubrawski. Neuralforecast: user friendly state-of-the-art neural forecasting models. PyCon Salt Lake City, Utah, US 2022, 2022. URL: https://github.com/Nixtla/neuralforecast. 

  30. Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza Ramirez, Max Mergenthaler Canseco, and Artur Dubrawski. NHITS: neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 6989–6997. 2023.

  31. Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-LLM: time series forecasting by reprogramming large language models. 2024. URL: https://arxiv.org/abs/2310.01728, arXiv:2310.01728.

  32. Azul Garza, Cristian Challu, and Max Mergenthaler-Canseco. TimeGPT-1. 2024. URL: https://arxiv.org/abs/2310.03589, arXiv:2310.03589.

  33. Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova, Wahid Bhimji, Wei-Lun Chao, Chris Harris, Shih-Chieh Hsu, Hilmar Lapp, Mark S. Neubauer, Josephine Namayanja, Aneesh Subramanian, Philip Harris, Advaith Anand, David E. Carlyn, Subhankar Ghosh, Christopher Lawrence, Eric Moreno, Ryan Raikman, Jiaman Wu, Ziheng Zhang, Bayu Adhi, Mohammad Ahmadi Gharehtoragh, Saúl Alonso Monsalve, Marta Babicz, Furqan Baig, Namrata Banerji, William Bardon, Tyler Barna, Tanya Berger-Wolf, Adji Bousso Dieng, Micah Brachman, Quentin Buat, David C. Y. Hui, Phuong Cao, Franco Cerino, Yi-Chun Chang, Shivaji Chaulagain, An-Kai Chen, Deming Chen, Eric Chen, Chia-Jui Chou, Zih-Chen Ciou, Miles Cochran-Branson, Artur Cordeiro Oudot Choi, Michael Coughlin, Matteo Cremonesi, Maria Dadarlat, Peter Darch, Malina Desai, Daniel Diaz, Steven Dillmann, Javier Duarte, Isla Duporge, Urbas Ekka, Saba Entezari Heravi, Hao Fang, Rian Flynn, Geoffrey Fox, Emily Freed, Hang Gao, Jing Gao, Julia Gonski, Matthew Graham, Abolfazl Hashemi, Scott Hauck, James Hazelden, Joshua Henry Peterson, Duc Hoang, Wei Hu, Mirco Huennefeld, David Hyde, Vandana Janeja, Nattapon Jaroenchai, Haoyi Jia, Yunfan Kang, Maksim Kholiavchenko, Elham E. Khoda, Sangin Kim, Aditya Kumar, Bo-Cheng Lai, Trung Le, Chi-Wei Lee, JangHyeon Lee, Shaocheng Lee, Suzan van der Lee, Charles Lewis, Haitong Li, Haoyang Li, Henry Liao, Mia Liu, Xiaolin Liu, Xiulong Liu, Vladimir Loncar, Fangzheng Lyu, Ilya Makarov, Abhishikth Mallampalli, Chen-Yu Mao, Alexander Michels, Alexander Migala, Farouk Mokhtar, Mathieu Morlighem, Min Namgung, Andrzej Novak, Andrew Novick, Amy Orsborn, Anand Padmanabhan, Jia-Cheng Pan, Sneh Pandya, Zhiyuan Pei, Ana Peixoto, George Percivall, Alex Po Leung, Sanjay Purushotham, Zhiqiang Que, Melissa Quinnan, Arghya Ranjan, Dylan Rankin, Christina Reissel, Benedikt Riedel, Dan Rubenstein, Argyro Sasli, Eli Shlizerman, Arushi Singh, Kim Singh, Eric R. Sokol, Arturo Sorensen, Yu Su, Mitra Taheri, Vaibhav Thakkar, Ann Mariam Thomas, Eric Toberer, Chenghan Tsai, Rebecca Vandewalle, Arjun Verma, Ricco C. Venterea, He Wang, Jianwu Wang, Sam Wang, Shaowen Wang, Gordon Watts, Jason Weitz, Andrew Wildridge, Rebecca Williams, Scott Wolf, Yue Xu, Jianqi Yan, Jai Yu, Yulei Zhang, Haoran Zhao, Ying Zhao, and Yibo Zhong. Building machine learning challenges for anomaly detection in science. 2025. URL: https://arxiv.org/abs/2503.02112, arXiv:2503.02112.

  34. Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova, Wahid Bhimji, Wei-Lun Chao, Chris Harris, Shih-Chieh Hsu, Hilmar Lapp, Mark S. Neubauer, Josephine Namayanja, Aneesh Subramanian, Philip Harris, Advaith Anand, David E. Carlyn, Subhankar Ghosh, Christopher Lawrence, Eric Moreno, Ryan Raikman, Jiaman Wu, Ziheng Zhang, Bayu Adhi, Mohammad Ahmadi Gharehtoragh, Saúl Alonso Monsalve, Marta Babicz, Furqan Baig, Namrata Banerji, William Bardon, Tyler Barna, Tanya Berger-Wolf, Adji Bousso Dieng, Micah Brachman, Quentin Buat, David C. Y. Hui, Phuong Cao, Franco Cerino, Yi-Chun Chang, Shivaji Chaulagain, An-Kai Chen, Deming Chen, Eric Chen, Chia-Jui Chou, Zih-Chen Ciou, Miles Cochran-Branson, Artur Cordeiro Oudot Choi, Michael Coughlin, Matteo Cremonesi, Maria Dadarlat, Peter Darch, Malina Desai, Daniel Diaz, Steven Dillmann, Javier Duarte, Isla Duporge, Urbas Ekka, Saba Entezari Heravi, Hao Fang, Rian Flynn, Geoffrey Fox, Emily Freed, Hang Gao, Jing Gao, Julia Gonski, Matthew Graham, Abolfazl Hashemi, Scott Hauck, James Hazelden, Joshua Henry Peterson, Duc Hoang, Wei Hu, Mirco Huennefeld, David Hyde, Vandana Janeja, Nattapon Jaroenchai, Haoyi Jia, Yunfan Kang, Maksim Kholiavchenko, Elham E. Khoda, Sangin Kim, Aditya Kumar, Bo-Cheng Lai, Trung Le, Chi-Wei Lee, JangHyeon Lee, Shaocheng Lee, Suzan van der Lee, Charles Lewis, Haitong Li, Haoyang Li, Henry Liao, Mia Liu, Xiaolin Liu, Xiulong Liu, Vladimir Loncar, Fangzheng Lyu, Ilya Makarov, Abhishikth Mallampalli, Chen-Yu Mao, Alexander Michels, Alexander Migala, Farouk Mokhtar, Mathieu Morlighem, Min Namgung, Andrzej Novak, Andrew Novick, Amy Orsborn, Anand Padmanabhan, Jia-Cheng Pan, Sneh Pandya, Zhiyuan Pei, Ana Peixoto, George Percivall, Alex Po Leung, Sanjay Purushotham, Zhiqiang Que, Melissa Quinnan, Arghya Ranjan, Dylan Rankin, Christina Reissel, Benedikt Riedel, Dan Rubenstein, Argyro Sasli, Eli Shlizerman, Arushi Singh, Kim Singh, Eric R. Sokol, Arturo Sorensen, Yu Su, Mitra Taheri, Vaibhav Thakkar, Ann Mariam Thomas, Eric Toberer, Chenghan Tsai, Rebecca Vandewalle, Arjun Verma, Ricco C. Venterea, He Wang, Jianwu Wang, Sam Wang, Shaowen Wang, Gordon Watts, Jason Weitz, Andrew Wildridge, Rebecca Williams, Scott Wolf, Yue Xu, Jianqi Yan, Jai Yu, Yulei Zhang, Haoran Zhao, Ying Zhao, and Yibo Zhong. Building machine learning challenges for anomaly detection in science. 2025. URL: https://arxiv.org/abs/2503.02112, arXiv:2503.02112.

  35. Elizabeth G. Campolongo, Yuan-Tang Chou, Ekaterina Govorkova, Wahid Bhimji, Wei-Lun Chao, Chris Harris, Shih-Chieh Hsu, Hilmar Lapp, Mark S. Neubauer, Josephine Namayanja, Aneesh Subramanian, Philip Harris, Advaith Anand, David E. Carlyn, Subhankar Ghosh, Christopher Lawrence, Eric Moreno, Ryan Raikman, Jiaman Wu, Ziheng Zhang, Bayu Adhi, Mohammad Ahmadi Gharehtoragh, Saúl Alonso Monsalve, Marta Babicz, Furqan Baig, Namrata Banerji, William Bardon, Tyler Barna, Tanya Berger-Wolf, Adji Bousso Dieng, Micah Brachman, Quentin Buat, David C. Y. Hui, Phuong Cao, Franco Cerino, Yi-Chun Chang, Shivaji Chaulagain, An-Kai Chen, Deming Chen, Eric Chen, Chia-Jui Chou, Zih-Chen Ciou, Miles Cochran-Branson, Artur Cordeiro Oudot Choi, Michael Coughlin, Matteo Cremonesi, Maria Dadarlat, Peter Darch, Malina Desai, Daniel Diaz, Steven Dillmann, Javier Duarte, Isla Duporge, Urbas Ekka, Saba Entezari Heravi, Hao Fang, Rian Flynn, Geoffrey Fox, Emily Freed, Hang Gao, Jing Gao, Julia Gonski, Matthew Graham, Abolfazl Hashemi, Scott Hauck, James Hazelden, Joshua Henry Peterson, Duc Hoang, Wei Hu, Mirco Huennefeld, David Hyde, Vandana Janeja, Nattapon Jaroenchai, Haoyi Jia, Yunfan Kang, Maksim Kholiavchenko, Elham E. Khoda, Sangin Kim, Aditya Kumar, Bo-Cheng Lai, Trung Le, Chi-Wei Lee, JangHyeon Lee, Shaocheng Lee, Suzan van der Lee, Charles Lewis, Haitong Li, Haoyang Li, Henry Liao, Mia Liu, Xiaolin Liu, Xiulong Liu, Vladimir Loncar, Fangzheng Lyu, Ilya Makarov, Abhishikth Mallampalli, Chen-Yu Mao, Alexander Michels, Alexander Migala, Farouk Mokhtar, Mathieu Morlighem, Min Namgung, Andrzej Novak, Andrew Novick, Amy Orsborn, Anand Padmanabhan, Jia-Cheng Pan, Sneh Pandya, Zhiyuan Pei, Ana Peixoto, George Percivall, Alex Po Leung, Sanjay Purushotham, Zhiqiang Que, Melissa Quinnan, Arghya Ranjan, Dylan Rankin, Christina Reissel, Benedikt Riedel, Dan Rubenstein, Argyro Sasli, Eli Shlizerman, Arushi Singh, Kim Singh, Eric R. Sokol, Arturo Sorensen, Yu Su, Mitra Taheri, Vaibhav Thakkar, Ann Mariam Thomas, Eric Toberer, Chenghan Tsai, Rebecca Vandewalle, Arjun Verma, Ricco C. Venterea, He Wang, Jianwu Wang, Sam Wang, Shaowen Wang, Gordon Watts, Jason Weitz, Andrew Wildridge, Rebecca Williams, Scott Wolf, Yue Xu, Jianqi Yan, Jai Yu, Yulei Zhang, Haoran Zhao, Ying Zhao, and Yibo Zhong. Building machine learning challenges for anomaly detection in science. 2025. URL: https://arxiv.org/abs/2503.02112, arXiv:2503.02112.

  36. Giuseppe Di Guglielmo, Botao Du, Javier Campos, Alexandra Boltasseva, Akash V. Dixit, Farah Fahim, Zhaxylyk Kudyshev, Santiago Lopez, Ruichao Ma, Gabriel N. Perdue, Nhan Tran, Omer Yesilyurt, and Daniel Bowring. End-to-end workflow for machine learning-based qubit readout with QICK and hls4ml. 2025. URL: https://arxiv.org/abs/2501.14663, arXiv:2501.14663.

  37. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: a graduate-level Google-proof Q&A benchmark. 2023. URL: https://arxiv.org/abs/2311.12022, arXiv:2311.12022.

  38. Kien X. Nguyen, Fengchun Qiao, Arthur Trembanis, and Xi Peng. SeafloorAI: a large-scale vision-language dataset for seafloor geological survey. 2024. URL: https://arxiv.org/abs/2411.00172, arXiv:2411.00172.

  39. Pin Chen, Luoxuan Peng, Rui Jiao, Qing Mo, Zhen Wang, Wenbing Huang, Yang Liu, and Yutong Lu. Learning superconductivity from ordered and disordered material structures. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 108902–108928. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/c4e3b55ed4ac9ba52d7df11f8bddbbf4-Paper-Datasets_and_Benchmarks_Track.pdf. 

  40. Deyu Zou, Shikun Liu, Siqi Miao, Victor Fung, Shiyu Chang, and Pan Li. GeSS: benchmarking geometric deep learning under scientific applications with distribution shifts. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 92499–92528. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/a8063075b00168dc39bc81683619f1a8-Paper-Datasets_and_Benchmarks_Track.pdf.

  41. Ralph E Peterson, Aramis Tanelus, Christopher Ick, Bartul Mimica, Niegil Francis, Violet J Ivan, Aman Choudhri, Annegret L Falkner, Mala Murthy, David M Schneider, Dan H Sanes, and Alex H Williams. Vocal call locator benchmark (VCL) for localizing rodent vocalizations from multi-channel audio. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 106370–106382. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/c00d37d6b04d73b870b963a4d70051c1-Paper-Datasets_and_Benchmarks_Track.pdf.

  42. Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J.J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Josef Sivic, and Tomáš Pluskal. MassSpecGym: a benchmark for the discovery and identification of molecules. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 110010–110027. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/c6c31413d5c53b7d1c343c1498734b0f-Paper-Datasets_and_Benchmarks_Track.pdf.

  43. Yiheng Wang, Tianyu Wang, Yuying Zhang, Hongji Zhang, Haoyu Zheng, Guanjie Zheng, and Linghe Kong. UrbanDataLayer: a unified data pipeline for urban science. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, 7296–7310. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/0db7f135f6991e8cec5e516ecc66bfba-Paper-Datasets_and_Benchmarks_Track.pdf.

  44. Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, and Artur Kadurin. $\nabla^2$DFT: a universal quantum chemistry dataset of drug-like molecules and a benchmark for neural network potentials. 2024. URL: https://arxiv.org/abs/2406.14347, arXiv:2406.14347.

  45. Tingjia Shen, Hao Wang, Jiaqing Zhang, Sirui Zhao, Liangyue Li, Zulong Chen, Defu Lian, and Enhong Chen. Exploring user retrieval integration towards large language models for cross-domain sequential recommendation. 2024. URL: https://arxiv.org/abs/2406.03085, arXiv:2406.03085. 

  46. Shraman Pramanick, Rama Chellappa, and Subhashini Venugopalan. SPIQA: a dataset for multimodal question answering on scientific papers. 2025. URL: https://arxiv.org/abs/2407.09413, arXiv:2407.09413.