Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Is a
Academic paper

Academic Paper attributes

arXiv ID
2310.04292
arXiv Classification
Computer science
Publication URL
arxiv.org/pdf/2310.0...92.pdf
Publisher
ArXiv
DOI
doi.org/10.48550/ar...10.04292
Paid/Free
Free
Academic Discipline
Computer science
Machine learning
Submission Date
October 6, 2023
October 9, 2023
October 18, 2023
Author Names
Callum McLean
Cas Wognum
Chad Martin
Ali Parviz
Andrew Fitzgibbon
Błażej Banaszewski
Oleksandr Dymov
Prudencio Tossou
...
Paper abstract

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when the model is also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.

