Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts

Padarian J.; Fuentes I.

doi:10.5194/soil-5-177-2019

Files

Para_19.pdf (1.47 MB)

Date

2019

Authors

Padarian J.

Fuentes I.

Abstract

A large amount of descriptive information is available in geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descriptive information and encode it as dense vectors. These word embeddings, which encode information about a word and its linguistic relationships with other words, lie in a multidimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. As this is the first attempt to evaluate word embeddings for tasks in the geosciences domain, we created a test suite specific to geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9%. We also presented an example where we successfully emulated part of a taxonomic analysis of soil profiles that was originally applied to soil numerical data, which would not be possible without the use of embeddings. The resulting embedding and test suite will be made available for other researchers to use and expand upon.

Keywords

Machine learning, geoscience, natural language processing, domain-specific language model, GeoVec, word embeddings

Citation

SOIL, 2019, 5, 177–187

URI

https://repository.geologyscience.ru/handle/123456789/41596

Collections

Articles, conference abstracts
