Geoscience language models and their intrinsic evaluation

dc.contributor.author: Lawley C.J.M.
dc.contributor.author: Raimondo S.
dc.contributor.author: Chen T.
dc.contributor.author: Brin L.
dc.contributor.author: Zakharov A.
dc.contributor.author: Kur D.
dc.contributor.author: Hui J.
dc.contributor.author: Newton G.
dc.contributor.author: Burgoyne S.L.
dc.contributor.author: Marquis G.
dc.date.accessioned: 2023-07-24T05:42:53Z
dc.date.available: 2023-07-24T05:42:53Z
dc.date.issued: 2022
dc.description.abstract: Geoscientists use observations and descriptions of the rock record to study the origins and history of our planet, which has resulted in a vast volume of scientific literature. Recent progress in natural language processing (NLP) has the potential to parse through and extract knowledge from unstructured text, but there has, so far, been only limited work on the concepts and vocabularies that are specific to geoscience. Herein we harvest and process public geoscientific reports (i.e., Canadian federal and provincial geological survey publications databases) and a subset of open access and peer-reviewed publications to train new, geoscience-specific language models to address that knowledge gap. Language model performance is validated using a series of new geoscience-specific NLP tasks (i.e., analogies, clustering, relatedness, and nearest neighbour analysis) that were developed as part of the current study. The raw and processed national geological survey corpora, language models, and evaluation criteria are all made public for the first time. We demonstrate that non-contextual (i.e., Global Vectors for Word Representation, GloVe) and contextual (i.e., Bidirectional Encoder Representations from Transformers, BERT) language models updated using the geoscientific corpora outperform the generic versions of these models for each of the evaluation criteria. Principal component analysis further demonstrates that word embeddings trained on geoscientific text capture meaningful semantic relationships, including rock classifications, mineral properties and compositions, and the geochemical behaviour of elements. Semantic relationships that emerge from the vector space have the potential to unlock latent knowledge within unstructured text, and perhaps more importantly, also highlight the potential for other downstream geoscience-focused NLP tasks (e.g., keyword prediction, document similarity, recommender systems, rock and mineral classification).
dc.identifier.citation: Applied Computing and Geosciences, 2022, 14, 100084, p. 1-10
dc.identifier.doi: 10.1016/j.acags.2022.100084
dc.identifier.uri: https://repository.geologyscience.ru/handle/123456789/41600
dc.language.iso: en
dc.subject: Word embedding
dc.subject: Language model
dc.subject: Machine learning
dc.subject: Artificial intelligence
dc.subject: BERT
dc.subject: GloVe
dc.title: Geoscience language models and their intrinsic evaluation
dc.type: Article
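
The analogy and nearest neighbour tasks named in the abstract can be illustrated with a minimal sketch. It assumes the common vector-offset formulation of word analogies with cosine similarity; the record does not confirm that this is the authors' exact procedure. The three-dimensional vectors, the `TOY_VECTORS` table, and the function names are hypothetical and are not taken from the published models.

```python
import math

# Hypothetical toy embeddings with hand-picked axes (mafic, felsic, intrusive).
# These are illustrative values only, NOT vectors from the paper's GloVe or BERT models.
TOY_VECTORS = {
    "basalt":   [1.0, 0.0, 0.0],
    "gabbro":   [1.0, 0.0, 1.0],
    "rhyolite": [0.0, 1.0, 0.0],
    "granite":  [0.0, 1.0, 1.0],
    "andesite": [0.5, 0.5, 0.0],
    "diorite":  [0.5, 0.5, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(query_vec, vectors, exclude=(), k=1):
    """Return the k words whose vectors are most similar to query_vec."""
    scored = [
        (cosine(query_vec, vec), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


def solve_analogy(a, b, c, vectors):
    """Complete 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded so the answer is a new word.
    return nearest_neighbours(target, vectors, exclude={a, b, c}, k=1)[0]


# Extrusive-to-intrusive analogy: basalt is to gabbro as rhyolite is to ...
print(solve_analogy("basalt", "gabbro", "rhyolite", TOY_VECTORS))  # prints "granite"
```

With real embeddings, the same two functions would be applied to vectors loaded from a trained model, and an analogy test set would be scored by the fraction of questions answered correctly.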

Files

Original bundle

Name: Lawl1_22.pdf
Size: 1.14 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission