
March 19, 2025

Present

Amit M, Armstrong Foundjem, Claus Weiland, Cyril Kondratenko, David Kanter, Gregg Barrett, Geoffrey Fox, Gregor von Laszewski, Howard Pritchard, Javier Toledo, Jeyan Thiyagalingam, Jihao Shi, Lee Sharma, Matt Sinclair, Murali Emani, Nhan Tran, Philip Harris, Piotr Luszczek, Sankar Avula, Satoshi Iwata, Shirley Moore, Victor Lu, Vijay Janapa Reddi

Apologies

Christine Kirkpatrick

Tentative Agenda

New Members

  • Jihao Shi from University of Michigan
  • Sankar Avula, working on LLMs for medicine.
  • Cyril Kondratenko reminded us that he first joined on Feb. 19 and is from Nebius.

Scientific benchmarks and challenges - follow up discussion

  • Follow-up on Nhan’s MLCommons Science presentation: he suggested classifying benchmarks by scientific domain, ML motif, and computational motif, where the latter is largely the criterion used by MLPerf (Tiny, Mobile, Client, Datacenter, …)
  • MLCommons categories are unclear; we need to identify what is missing
  • Jeyan discussed the difficulty of scaling to realistic problems and suggested reviewing what we did in the past. He mentioned Scientific machine learning benchmarks | Nature Reviews Physics and Scientific Machine Learning Benchmarks on arXiv
  • The MLPerf HPC benchmarks scale to large sizes but become very costly to run
  • The entry barrier is high
  • The focus on large scale is not suitable for students
  • Vijay noted that the Tiny working group has lightweight benchmarks
  • Philip Harris noted that Codabench makes benchmarks more attractive, and Gregor agreed
  • Kaggle is limited
  • Workflows need to follow FAIR principles

Catalog of Science benchmarks

  • We turned to the catalog MLCommons Science/HPC Benchmarks Overview
  • Geoffrey described his attempt to use “Deep Research LLMs” to automate the finding of benchmark details. These are recorded in columns called “Results from Gemini LLM Deep Research” and “Results from ChatGPT LLM”, as links to the LLM results stored in Google Docs
  • A typical query used with Gemini (see the scripting sketch after this list) was:
  • Google WeatherBench2 https://sites.research.google/weatherbench/ https://arxiv.org/abs/2308.15560 has artifacts: benchmarks, datasets, models and software. Present all benchmarks, datasets, models, or software from this project as an Excel Spreadsheet, with columns as
  • Artifact Name, Type of Artifact (benchmarks, datasets, models or software), Description, Application Field, Nature of Model, Is model Deep Learning or not?, Size of model, Nature of Data, Modality of Data, Size of Data with units, ?Nature of artifacts with machines used, Links to Papers, Links to data and software (GitHub)
  • The rows should be individual benchmarks, datasets, models, or software artifacts.
  • Call the report "Google WeatherBench2".
  • Please record the number of websites used in the analysis.
  • This gave WeatherBench2 Research and Analysis
  • The main flaw was incompleteness: current LLMs will not reliably delve into sublinks and often ignored the instruction to produce an “Excel Spreadsheet”
  • You can analyze MLCommons as long as you do it one working group at a time; analyzing MLCommons as a single entity gave inconsistent results
  • Typically, Gemini’s results were more complete than OpenAI’s
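
As a possible way to script this cataloging step, here is a minimal sketch. It assumes the public google-generativeai Python package and a GEMINI_API_KEY environment variable; Deep Research itself is a web feature, so a plain generate_content call stands in for it, and the model name is only illustrative.

import os
import google.generativeai as genai

# Columns requested from the model, copied from the query above.
COLUMNS = [
    "Artifact Name", "Type of Artifact (benchmarks, datasets, models or software)",
    "Description", "Application Field", "Nature of Model",
    "Is model Deep Learning or not?", "Size of model", "Nature of Data",
    "Modality of Data", "Size of Data with units",
    "Nature of artifacts with machines used", "Links to Papers",
    "Links to data and software (GitHub)",
]

PROMPT = (
    "{project} {links} has artifacts: benchmarks, datasets, models and software. "
    "Present all benchmarks, datasets, models, or software from this project as a "
    "table with columns: {columns}. The rows should be individual benchmarks, "
    "datasets, models, or software artifacts. Call the report \"{project}\". "
    "Please record the number of websites used in the analysis."
)

def catalog_project(project: str, links: str) -> str:
    """Send the cataloging query for one benchmark project and return the raw answer."""
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed to be set
    model = genai.GenerativeModel("gemini-1.5-pro")        # illustrative model name
    prompt = PROMPT.format(project=project, links=links, columns=", ".join(COLUMNS))
    return model.generate_content(prompt).text

if __name__ == "__main__":
    print(catalog_project(
        "Google WeatherBench2",
        "https://sites.research.google/weatherbench/ https://arxiv.org/abs/2308.15560"))

Looping such a call over every row of the catalog would make it easy to rerun the same query per project and compare completeness across models.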

White Papers

Any Other Business

  • Prompted by Jeyan, we discussed a presence at the ISC meeting https://isc-hpc.com/ (Hamburg, Germany, June 10-13, 2025), where we perhaps missed an opportunity; such activities are usually organized by the HPC working group (Tom St. John)
  • David Kanter confirmed that MLCommons will have no organizational presence at ISC
  • Lee noted that there is an ISC workshop on “The Future of Benchmarks in Supercomputing” https://isc-hpc.com/program/, but it has no call for papers and no proceedings
  • Jeyan and Piotr will be at ISC

Data Breakout at the MLC Community Meeting 2025-03-11

Motivating datasets

  1. What 3-5 datasets would most advance the field if they had rapidly evolving quality?
    a. Public time series datasets of production data (telemetry coming from complex systems such as manufacturing facilities and HPC systems).
    b. A scalable annotated corpus for evaluation of various knowledge extraction tasks (e.g., NER - Named Entity Recognition, Relation Extraction - RE, etc.) that also scales with various expanding context sizes.
    c. Dataset of datasets (700k Croissant datasets analyzed jointly via metadata)
    d. Multi-modal datasets: time series data (telemetry) + textual data (log data) + textual data (knowledge bases).
    e. LLM interaction/ReAct agent datasets, e.g., WildChat, AgentBench, etc.

      i. MLCommons runs a WildChat-esque site
  2. How would you screen out ~100 important datasets with the goal of ensuring and auditing high-quality metadata?
    a. Do they use ontology/machine readable semantic representations for domain specific elements?

Proprietary Datasets of Interest:

  1. The New York Times’ archive.
  2. The comments on YouTube videos.

Formats/Tools

  1. What are the most time consuming data tasks in actual AI development/deployment today?
  2. Semantic annotation/curation using controlled vocabularies, metadata harmonization
  3. Cleaning, data “quality”.
  4. What kind of tools could make those tasks more efficient, and do any, even primitive, examples exist in academia or industry?
  5. OpenRefine, CEDAR, Protégé, automated Croissant creation using Hugging Face/Kaggle, et al. (see the Croissant sketch after this list)
    1. Dataset preparation tools
    2. Dataset annotation tools
    3. Dataset discovery tools
  6. What is the number one thing we could do to accelerate Croissant adoption?
  7. Better tooling for (collaborative, iterative) dataset creation
  8. Is one RAI data format emerging as dominant? If not, which one should? :-)
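
On the Croissant tooling points above, here is a minimal sketch of a metadata-coverage check over Croissant descriptions. It assumes Hugging Face's per-dataset Croissant endpoint (https://huggingface.co/api/datasets/<id>/croissant); the field list and the example dataset id are illustrative.

import json
import urllib.request

# Core schema.org/Croissant-style fields we would like every dataset to fill in
# (illustrative choice of fields, not a normative list).
FIELDS = ["name", "description", "license", "url", "creator", "citeAs"]

def croissant_coverage(dataset_id: str) -> dict:
    """Fetch one dataset's Croissant JSON-LD and report which core fields are present."""
    url = f"https://huggingface.co/api/datasets/{dataset_id}/croissant"
    with urllib.request.urlopen(url) as resp:
        jsonld = json.load(resp)
    return {field: bool(jsonld.get(field)) for field in FIELDS}

if __name__ == "__main__":
    coverage = croissant_coverage("mnist")   # illustrative dataset id
    filled = sum(coverage.values())
    print(f"{filled}/{len(FIELDS)} core fields present: {coverage}")

Run over many datasets, a score like this could feed both the "dataset of datasets" idea and a benchmark for metadata coverage.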

Licenses

  1. What kind of licensing concerns most typically trip up data use?
  2. Non-machine-readable license descriptions https://ui.adsabs.harvard.edu/abs/2023arXiv231016787L/abstract (see the metadata sketch after this list)
  3. What is the most commonly used AI data license type you encounter?
  4. Solution: Can we find a way to encourage data owners to make their data usable for free?
  5. One way would be to educate data owners about algorithms which remove personally identifiable information (that’s one concern they might have).
  6. Internal (private) datasets are usually considered to be valuable assets. Can they become even more valuable once made publicly available (e.g., more researchers working on these problems, resulting in better models companies can use later)?
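
On the machine-readability concern in item 2, here is a minimal sketch of what a machine-readable license looks like in schema.org/Croissant-style dataset metadata; the dataset name and values are illustrative, and the CC-BY-4.0 URL is just one example of a canonical license identifier.

import json

dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "example-telemetry-dataset",                  # hypothetical dataset
    "description": "Production telemetry time series.",
    # A canonical license URL (or SPDX identifier) instead of free-text terms,
    # so downstream tools can filter on it automatically.
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(dataset_metadata, indent=2))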

Science

  1. Which are the most active Dynabench sub-communities and what do they have in common?
  2. What are the top 5 big challenge questions in AI data science, which, if answered, would radically advance the field?
  3. Evaluate which algorithms use data most efficiently (i.e., with minimal data) in order to overcome current data limitations (closer to how humans learn). Such benchmarks could control, evaluate, and score the amount of data used relative to performance, e.g., https://dynabench.org/tasks/tlic (in progress); a minimal scoring sketch follows this list.
  4. Benchmarking algorithms’ robustness to unbalanced datasets and to continual learning.
  5. Understanding what has been learned and quantifying how data affects the learning.
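
For the data-efficiency benchmarking idea in item 3, here is a minimal sketch of one possible scoring rule (not a Dynabench-defined metric): the normalized area under the accuracy-versus-training-set-size curve, with all sample counts and accuracies illustrative.

import numpy as np

def data_efficiency_score(sample_counts, accuracies):
    """Trapezoidal area under accuracy vs. log10(samples), normalized to [0, 1]."""
    x = np.log10(np.asarray(sample_counts, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    area = np.trapz(y, x)
    return area / (x[-1] - x[0])   # mean accuracy over the sampled data-budget range

if __name__ == "__main__":
    counts = [100, 1_000, 10_000, 100_000]   # training set sizes (illustrative)
    accs = [0.55, 0.70, 0.81, 0.86]          # model accuracy at each size (illustrative)
    print(f"data-efficiency score: {data_efficiency_score(counts, accs):.3f}")

A model that reaches high accuracy with little data scores higher than one that needs the full budget, which is the behavior such a benchmark would reward.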

General Points

  1. Specifically in the area of data valuation, what is the current SOTA?
  2. What would be the best way to get major AI conferences to add data tracks?
  3. What would be the best way to encourage new/existing tracks to focus on "better datasets" rather than "more datasets"?
  4. Knowledge graph of all datasets, benchmarks for metadata coverage

Challenges

  1. How do we make progress in an age of data pollution?
  2. What will plateau progress? How do we address that?
  3. The risk of developments in AI/ML being suppressed by governments because of the power of the technology.
  4. High-quality synthetic data. Why? Real data is not always available; synthetic data can offer better coverage and more robust ML models.