Croissant (metadata format)

From Wikipedia, the free encyclopedia

Croissant is a metadata format designed to support the sharing of datasets for machine learning applications. It is a platform-agnostic schema used to standardize metadata in data repositories such as Hugging Face, Kaggle, Dataverse and OpenML.[1][2]

Structure

Croissant builds upon schema.org, is expressed primarily in JSON-LD, and divides metadata into four "layers": Dataset Metadata, Resource, Structure and Semantic:[1][3]

  • The Dataset Metadata layer constrains which schema.org properties should be used and adds further properties, linking the resources (files) of the dataset to general metadata such as licensing and citation information.
  • The Resource layer describes individual files and sets of files using two new classes, FileObject and FileSet. A FileSet may be, for example, a collection of related images.
  • The Structure layer specifies how the files are organized in the dataset. A RecordSet class describes how the contents of the resources are arranged, a configuration that can vary considerably between modalities. This specification facilitates interoperability of the datasets.
  • Finally, the Semantic layer adds information for practical reuse of the dataset, such as splits for train, test and validation subsets.

It also provides a default extension for metadata related to responsible AI.[1][2]

The use of a standard machine-readable structure increases, for example, the discoverability of datasets in search engines such as Google Dataset Search.[4][5]
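
As a rough illustration of how the first three layers might fit together in a single JSON-LD document, the following Python sketch builds a toy description and checks that every field points at a declared resource. It is not taken from the cited sources: the dataset name `toy_images`, the `example.org` URLs, the citation string and the helper names `build_croissant_sketch` and `dangling_sources` are all hypothetical, the `@context` is heavily abridged, and the property names reflect one reading of the Croissant 1.0 vocabulary that should be checked against the specification before use.

```python
import json


def build_croissant_sketch():
    """Return a dict shaped like a minimal Croissant JSON-LD document.

    Everything here is a placeholder: the dataset, URLs and citation are
    invented, and the context is abridged (real files declare more terms).
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
            "citeAs": "cr:citeAs",
            "recordSet": "cr:recordSet",
            "field": "cr:field",
            "source": "cr:source",
            "fileObject": "cr:fileObject",
            "fileSet": "cr:fileSet",
            "extract": "cr:extract",
            "column": "cr:column",
            "includes": "cr:includes",
        },
        # Dataset Metadata layer: general, schema.org-based description.
        "@type": "sc:Dataset",
        "name": "toy_images",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "citeAs": "Placeholder citation for the toy dataset",
        # Resource layer: individual files (FileObject) and sets (FileSet).
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "labels.csv",
                "contentUrl": "https://example.org/toy/labels.csv",
                "encodingFormat": "text/csv",
            },
            {
                "@type": "cr:FileObject",
                "@id": "images.zip",
                "contentUrl": "https://example.org/toy/images.zip",
                "encodingFormat": "application/zip",
            },
            {
                "@type": "cr:FileSet",
                "@id": "images",
                "containedIn": {"@id": "images.zip"},
                "encodingFormat": "image/jpeg",
                "includes": "*.jpg",
            },
        ],
        # Structure layer: a RecordSet maps resources onto record fields.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "examples",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "examples/label",
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "labels.csv"},
                            "extract": {"column": "label"},
                        },
                    },
                    {
                        "@type": "cr:Field",
                        "@id": "examples/image",
                        "dataType": "sc:ImageObject",
                        "source": {"fileSet": {"@id": "images"}},
                    },
                ],
            }
        ],
    }


def dangling_sources(doc):
    """List the @id of every field whose source is not a declared resource."""
    known = {resource["@id"] for resource in doc["distribution"]}
    missing = []
    for record_set in doc["recordSet"]:
        for field in record_set["field"]:
            src = field["source"]
            ref = src.get("fileObject") or src.get("fileSet")
            if ref is None or ref["@id"] not in known:
                missing.append(field["@id"])
    return missing


if __name__ == "__main__":
    doc = build_croissant_sketch()
    print(json.dumps(doc, indent=2))
    print("dangling sources:", dangling_sources(doc))
```

For this sketch `dangling_sources` returns an empty list, since both fields refer to entries in `distribution`. The Semantic layer (for example, train, test and validation splits) is deliberately left out here to keep the example short.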

History

Croissant was shared on arXiv in March 2024 and published in the proceedings of NeurIPS 2024.[1][6][7] It began as a community-driven effort of the MLCommons Croissant Working Group, which brings together stakeholder organizations from academia and industry, including Google, the Open Data Institute, Sage Bionetworks and King's College London.[1][8]

Variations of Croissant have been developed to support datasets in different areas of research, such as Geo-Croissant for geospatial datasets.[9] Other technical extensions, such as support for RDF, soon followed.[10][11]

References

  1. ^ a b c d e Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Gijsbers, Pieter; Giner-Miguelez, Joan; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis; Krishna, Satyapriya; Kuchnik, Michael; Lesage, Sylvain; Lhoest, Quentin; Marcenac, Pierre; Maskey, Manil (2024-12-16). "Croissant: A Metadata Format for ML-Ready Datasets". Advances in Neural Information Processing Systems. 37: 82133–82148.
  2. ^ a b Bischl, Bernd; Casalicchio, Giuseppe; Das, Taniya; Feurer, Matthias; Fischer, Sebastian; Gijsbers, Pieter; Mukherjee, Subhaditya; Müller, Andreas C.; Németh, László; Oala, Luis; Purucker, Lennart; Ravi, Sahithya; Rijn, Jan N. van; Singh, Prabhant; Vanschoren, Joaquin (2025-07-11). "OpenML: Insights from 10 years and more than a thousand papers". Patterns. 6 (7) 101317. doi:10.1016/j.patter.2025.101317. ISSN 2666-3899. PMC 12416095. PMID 40926970.
  3. ^ Meroño-Peñuela, Albert; Simperl, Elena; Kurteva, Anelia; Reklos, Ioannis (2025-05-01). "KG.GOV: Knowledge graphs as the backbone of data governance in AI". Journal of Web Semantics. 85 100847. doi:10.1016/j.websem.2024.100847. ISSN 1570-8268.
  4. ^ Giner-Miguelez, Joan; Gómez, Abel; Cabot, Jordi (2025-01-13). "On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning". Scientific Data. 12 (1): 61. Bibcode:2025NatSD..12...61G. doi:10.1038/s41597-025-04402-4. ISSN 2052-4463. PMC 11730645. PMID 39805856.
  5. ^ Hulsebos, Madelon; Lin, Wenjing; Shankar, Shreya; Parameswaran, Aditya (2024-06-18). "It Took Longer than I was Expecting: Why is Dataset Search Still so Hard?". Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics. HILDA 24. New York, NY, USA: Association for Computing Machinery. pp. 1–4. doi:10.1145/3665939.3665959. ISBN 979-8-4007-0693-6.
  6. ^ "NeurIPS Poster Croissant: A Metadata Format for ML-Ready Datasets". neurips.cc. Retrieved 2025-10-14.
  7. ^ Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Giner-Miguelez, Joan; Gijsbers, Pieter; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis (2024-12-09), "Croissant: A Metadata Format for ML-Ready Datasets", Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, pp. 1–6, arXiv:2403.19546, doi:10.1145/3650203.3663326, ISBN 979-8-4007-0611-0
  8. ^ "Transforming AI data governance with Croissant: a new standard for ML metadata". The ODI. 2024-06-19. Retrieved 2025-10-14.
  9. ^ Shinde, Rajat; Koehl, Derek (NASA IMPACT) (2024-03-28). "Introducing Croissant: A Format for Machine Learning Datasets | NASA Earthdata". www.earthdata.nasa.gov. Retrieved 2025-10-14.
  10. ^ Bolleman, Jerven. "An assessment of Croissant ML metadata descriptors for AI-ready datasets". osf.io. doi:10.37044/osf.io/4sgdq_v1. Retrieved 2025-10-14.
  11. ^ Steinberg, David. "Bridging Machine Learning and Semantic Web: A Case Study on Converting Hugging Face Metadata to RDF". osf.io. Retrieved 2025-10-14.