Attendees: Armstrong Foundjem, Ben Hawks, Christine Kirkpatrick, Gary Mazzaferro, Geoffrey Fox, Gregg Barrett, Gregor von Laszewski, Howard Pritchard, Marco Colombo, Matt Sinclair, Piotr Luszczek, Satoshi Iwata, Victor Lu
Tentative Agenda
Introduction of any new members
Continuing discussion of new benchmarks and the catalog of Science benchmarks
Ben Hawks provided updates on the benchmark cataloging project, highlighting significant progress on the website by Marco Colombo and Gregor von Laszewski: it now features a filterable, searchable, and sortable index of benchmarks and has been Dockerized to ease contributions.

Gregor von Laszewski emphasized the importance of the report format and proposed replicating the LaTeX table format on the website for better readability and sorting. He also discussed plans to fix outstanding table issues and to make columns interactive.

Matt Sinclair asked about the benchmark rating system. Ben Hawks explained that it is a foundational first iteration developed by students; Gregor von Laszewski suggested simplifying it to a three-point scale and questioned how the points are weighted (a sketch of such a three-point scheme follows this summary).

Ben Hawks and Matt Sinclair discussed merging their independently developed sections of the paper. Ben Hawks requested feedback on the accuracy of the benchmark table and the categorization of motifs, particularly for LLM-generated summaries that still require human verification.

Christine Kirkpatrick suggested adding features that make the benchmarks more machine actionable, such as persistent identifiers. Gregor von Laszewski responded that semi-unique IDs are already generated in the YAML records and that adding new arbitrary fields is straightforward (see the ID sketch below).

Armstrong Foundjem asked about the paper's scope; Matt Sinclair and Ben Hawks confirmed it is a separate, shorter paper targeting a September 15th release on arXiv to leverage recent funding calls.

Geoffrey Fox presented his group's work on time series, noting the absence of established benchmarks and a focus on cataloging data sets: roughly 80 sources have been identified, about half of them foundation models, and an LLM was used to extract 731 individual data sets (an extraction sketch appears below).

Geoffrey Fox also highlighted challenges in interpreting time-series results because data sets and models are inconsistent across papers, observing significant discrepancies between reported and re-evaluated model performance. Gary Mazzaferro expressed interest in Hugging Face's data portfolio, and Gregor von Laszewski noted that Hugging Face carries arXiv metadata that could augment the data set information (see the metadata sketch below).
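The rating discussion is easier to see with a concrete example. The sketch below is a minimal illustration, not the group's actual rubric: the criterion names, weights, and cut-offs are all hypothetical. It simply collapses a weighted score onto a three-point scale, in the spirit of Gregor von Laszewski's suggestion.

```python
# Hypothetical sketch of collapsing a weighted benchmark rating onto a
# three-point scale. Criterion names, weights, and thresholds are invented
# for illustration; the working group's actual rubric may differ.

# Per-criterion weights (hypothetical).
WEIGHTS = {"software_available": 2.0, "data_available": 2.0, "results_reproduced": 1.0}

def three_point_rating(scores: dict[str, float]) -> int:
    """Map per-criterion scores in [0, 1] to a rating in {1, 2, 3}."""
    total = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    fraction = total / sum(WEIGHTS.values())
    if fraction >= 0.75:
        return 3  # meets most criteria
    if fraction >= 0.40:
        return 2  # partially meets criteria
    return 1      # minimal information available

print(three_point_rating({"software_available": 1.0, "data_available": 0.5}))  # -> 2
```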
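To make the machine-actionability exchange concrete, here is a minimal sketch of how a semi-unique ID could be derived for a YAML benchmark record and how an arbitrary new field (such as a persistent identifier) can be added. The field names and the hashing scheme are assumptions for illustration; the catalog's actual YAML schema and ID generation were not detailed in the meeting.

```python
# Minimal sketch: derive a semi-unique ID for a benchmark record and add an
# arbitrary new field, then emit YAML. Field names and the hash-based ID
# scheme are hypothetical; the catalog's real schema may differ.
# Requires PyYAML.
import hashlib
import yaml

record = {
    "name": "Example Science Benchmark",      # hypothetical entry
    "url": "https://example.org/benchmark",
}

# Semi-unique ID: a short hash over the fields most likely to be stable.
basis = f"{record['name']}|{record['url']}".encode("utf-8")
record["id"] = hashlib.sha256(basis).hexdigest()[:12]

# Adding a new arbitrary field (e.g., a persistent identifier) is just
# another key in the mapping.
record["doi"] = "10.0000/placeholder"  # hypothetical PID

print(yaml.safe_dump(record, sort_keys=False))
```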
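The data set extraction step can be pictured as a loop over source descriptions. The query_llm helper below is a hypothetical stand-in for whatever model was actually used, and the prompt and newline-separated output format are assumptions; only the overall pattern (LLM call, then parse and deduplicate) reflects what was reported.

```python
# Hypothetical sketch of extracting individual data-set names from source
# descriptions with an LLM. query_llm is a stand-in for a real model call;
# the prompt and the one-name-per-line output format are assumptions.
def query_llm(prompt: str) -> str:
    """Placeholder for a call to an actual LLM; returns raw text."""
    raise NotImplementedError("wire this to a real model")

def extract_datasets(source_description: str) -> list[str]:
    prompt = (
        "List every individual time-series data set mentioned in the text "
        "below, one name per line, with no commentary.\n\n" + source_description
    )
    reply = query_llm(prompt)
    # Deduplicate while preserving order.
    seen: set[str] = set()
    names: list[str] = []
    for line in reply.splitlines():
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names
```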
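The Hugging Face metadata observation can be explored with the huggingface_hub client. The sketch below assumes arXiv links appear as "arxiv:<id>" tags on dataset repositories, which is the common Hub convention; the repository ID is a placeholder.

```python
# Minimal sketch: pull arXiv identifiers from a Hugging Face dataset's tags.
# Assumes arXiv links appear as "arxiv:<id>" tags (the common Hub
# convention); the repo_id is a placeholder. Requires huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()
info = api.dataset_info("some-org/some-dataset")  # placeholder repo_id

arxiv_ids = [tag.split(":", 1)[1] for tag in (info.tags or []) if tag.startswith("arxiv:")]
print(arxiv_ids)  # arXiv IDs that could be used to fetch paper metadata
```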
Discussion
Gregg Barrett and Christine Kirkpatrick praised the work reported by Gregor von Laszewski.