Croissant (metadata format)

Croissant is a metadata format designed to support the sharing of datasets for machine learning applications. It is a platform-agnostic schema used to standardize metadata in data repositories such as Hugging Face, Kaggle, Dataverse and OpenML.^[1]^[2]

Structure

Croissant builds upon schema.org, is expressed primarily in JSON-LD, and divides metadata into four "layers": Dataset Metadata, Resource, Structure and Semantic:^[1]^[3]

The Dataset Metadata layer constrains which schema.org properties should be used, adds further properties, and links the resources (files) of the dataset to general metadata such as licensing and citation information.
The Resource layer describes individual files and sets of files using two new classes, FileObject and FileSet. A FileSet may, for example, be a collection of related images.
The Structure layer specifies how the files are organized in the dataset. A RecordSet class describes how the resources are presented, a configuration that can vary considerably between modalities. This specification facilitates interoperability between datasets.
Finally, the Semantic layer adds information for practical reuse of the dataset, such as splits for train, test and validation subsets.

Croissant also provides a default extension for metadata related to responsible AI.^[1]^[2]
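
The four layers can be illustrated with a small sketch. The description below is a hypothetical, heavily simplified Croissant-style document written as a Python dictionary. The dataset name, URLs, file names and field identifiers are invented for illustration, and the property names (distribution, recordSet, cr:FileObject, cr:FileSet, cr:RecordSet, cr:Field, cr:Split) follow the Croissant vocabulary as commonly documented, so they should be checked against the format specification. The helper function summarize_layers is also an assumption of this sketch and is not part of any Croissant library.

```python
import json

# Hypothetical, simplified Croissant-style description (not a validated document).
metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    # Dataset Metadata layer: general schema.org properties.
    "name": "toy-flowers",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Resource layer: individual files (FileObject) and sets of files (FileSet).
    "distribution": [
        {"@type": "cr:FileObject", "@id": "labels.csv",
         "contentUrl": "https://example.org/labels.csv",
         "encodingFormat": "text/csv"},
        {"@type": "cr:FileSet", "@id": "images",
         "includes": "images/*.jpg", "encodingFormat": "image/jpeg"},
    ],
    # Structure layer: RecordSets describe how resources are presented as records.
    "recordSet": [
        {"@type": "cr:RecordSet", "@id": "examples",
         "field": [
             {"@type": "cr:Field", "@id": "examples/image", "dataType": "sc:ImageObject"},
             {"@type": "cr:Field", "@id": "examples/label", "dataType": "sc:Text"},
             {"@type": "cr:Field", "@id": "examples/split", "dataType": "sc:Text"},
         ]},
        # Semantic layer: a record set typed as a split enumerates the subsets.
        {"@type": "cr:RecordSet", "@id": "splits", "dataType": "cr:Split",
         "field": [{"@type": "cr:Field", "@id": "splits/name", "dataType": "sc:Text"}],
         "data": [{"splits/name": "train"},
                  {"splits/name": "validation"},
                  {"splits/name": "test"}]},
    ],
}


def summarize_layers(doc):
    """Group the parts of a Croissant-style dict by the layer they belong to."""
    record_sets = doc.get("recordSet", [])
    return {
        "dataset": {"name": doc.get("name"), "license": doc.get("license")},
        "resources": {r["@id"]: r["@type"] for r in doc.get("distribution", [])},
        "record_sets": {rs["@id"]: [f["@id"] for f in rs.get("field", [])]
                        for rs in record_sets},
        "splits": [row["splits/name"]
                   for rs in record_sets if rs.get("dataType") == "cr:Split"
                   for row in rs.get("data", [])],
    }


summary = summarize_layers(metadata)
print(json.dumps(summary, indent=2))
```

Calling summarize_layers(metadata) returns one entry per layer, which makes it easy to see that the CSV file and the image set are declared once as resources and then referenced by the record sets that describe their contents.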

The use of a standard machine-readable structure increases, for example, the discoverability of datasets in search engines such as Google Dataset Search.^[4]^[5]

History

Croissant was shared on arXiv in March 2024 and published in the proceedings of NeurIPS 2024.^[1]^[6]^[7] It began as a community-driven effort of the MLCommons Croissant Working Group, with stakeholder organizations from academia and industry, including Google, the Open Data Institute, Sage Bionetworks and King's College London.^[1]^[8]

Variations of Croissant have been developed to support datasets in different areas of research, such as Geo-Croissant for geospatial datasets.^[9] Other technical extensions, such as support for RDF, soon followed.^[10]^[11]

References

[1] Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Gijsbers, Pieter; Giner-Miguelez, Joan; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis; Krishna, Satyapriya; Kuchnik, Michael; Lesage, Sylvain; Lhoest, Quentin; Marcenac, Pierre; Maskey, Manil (2024-12-16). "Croissant: A Metadata Format for ML-Ready Datasets". Advances in Neural Information Processing Systems. 37: 82133–82148.
[2] Bischl, Bernd; Casalicchio, Giuseppe; Das, Taniya; Feurer, Matthias; Fischer, Sebastian; Gijsbers, Pieter; Mukherjee, Subhaditya; Müller, Andreas C.; Németh, László; Oala, Luis; Purucker, Lennart; Ravi, Sahithya; Rijn, Jan N. van; Singh, Prabhant; Vanschoren, Joaquin (2025-07-11). "OpenML: Insights from 10 years and more than a thousand papers". Patterns. 6 (7) 101317. doi:10.1016/j.patter.2025.101317. ISSN 2666-3899. PMC 12416095. PMID 40926970.
[3] Meroño-Peñuela, Albert; Simperl, Elena; Kurteva, Anelia; Reklos, Ioannis (2025-05-01). "KG.GOV: Knowledge graphs as the backbone of data governance in AI". Journal of Web Semantics. 85 100847. doi:10.1016/j.websem.2024.100847. ISSN 1570-8268.
[4] Giner-Miguelez, Joan; Gómez, Abel; Cabot, Jordi (2025-01-13). "On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning". Scientific Data. 12 (1): 61. Bibcode:2025NatSD..12...61G. doi:10.1038/s41597-025-04402-4. ISSN 2052-4463. PMC 11730645. PMID 39805856.
[5] Hulsebos, Madelon; Lin, Wenjing; Shankar, Shreya; Parameswaran, Aditya (2024-06-18). "It Took Longer than I was Expecting: Why is Dataset Search Still so Hard?". Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics. HILDA 24. New York, NY, USA: Association for Computing Machinery. pp. 1–4. doi:10.1145/3665939.3665959. ISBN 979-8-4007-0693-6.
[6] "NeurIPS Poster Croissant: A Metadata Format for ML-Ready Datasets". neurips.cc. Retrieved 2025-10-14.
[7] Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Giner-Miguelez, Joan; Gijsbers, Pieter; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis (2024-12-09). "Croissant: A Metadata Format for ML-Ready Datasets". Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning. pp. 1–6. arXiv:2403.19546. doi:10.1145/3650203.3663326. ISBN 979-8-4007-0611-0.
[8] "Transforming AI data governance with Croissant: a new standard for ML metadata". The ODI. 2024-06-19. Retrieved 2025-10-14.
[9] Shinde, Rajat; Koehl, Derek (NASA IMPACT) (2024-03-28). "Introducing Croissant: A Format for Machine Learning Datasets | NASA Earthdata". www.earthdata.nasa.gov. Retrieved 2025-10-14.
[10] Bolleman, Jerven. "An assessment of Croissant ML metadata descriptors for AI-ready datasets". osf.io. doi:10.37044/osf.io/4sgdq_v1. Retrieved 2025-10-14.
[11] Steinberg, David. "Bridging Machine Learning and Semantic Web: A Case Study on Converting Hugging Face Metadata to RDF". osf.io. Retrieved 2025-10-14.

External links

Croissant format specification
