The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLIV-2/W1-2021, 21–26, 2021
https://doi.org/10.5194/isprs-archives-XLIV-2-W1-2021-21-2021

  15 Apr 2021


VISUAL ANALYSIS OF TEXT DATA COLLECTIONS BY FREQUENCIES OF JOINT USE OF WORDS

A. E. Bondarev¹, A. V. Bondarenko², and V. A. Galaktionov¹
  • ¹Keldysh Institute of Applied Mathematics RAS, Moscow, Russia
  • ²State Res. Institute of Aviation Systems (GosNIIAS), Moscow, Russia

Keywords: Multidimensional Data, Visual Analysis, Elastic Maps, Frequencies of Joint Use, Cluster Structures

Abstract. This paper studies the cluster structure of multidimensional data volumes. It presents the results of numerical experiments on data volumes consisting of frequencies of joint use of words from different parts of speech, for instance “noun + verb” or “adjective + noun”. The data volumes are obtained from samples of Russian-language text collections. The aim of the research is to analyze the cluster structure of the studied volume and the semantic proximity of words within clusters and subclusters. The working hypothesis is that words with similar meanings occur in approximately the same contexts. Consequently, in the feature space such words lie relatively close to each other, while words with different meanings lie farther apart. The research is carried out using elastic maps, which are effective tools for visual analysis of multidimensional data. Constructing elastic maps and their extensions in the space of the first three principal components makes it possible to determine the cluster structure of the studied multidimensional data volumes. Such analysis can be useful in countering negative verbal influences such as fake news, hidden propaganda, recruitment into sects, and verbal manipulation. The approach can also be applied to text collections of medical origin.
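The data layout described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy English "noun + verb" pairs, the function names, and the row normalization are all assumptions made for illustration, and the elastic map itself is not constructed here. The sketch only builds a joint-use frequency matrix and projects the words into the space of the first three principal components, which is the space where the elastic maps are then built.

```python
import numpy as np


def cooccurrence_matrix(pairs):
    """Count how often each noun (row) is used jointly with each verb (column)."""
    nouns = sorted({n for n, _ in pairs})
    verbs = sorted({v for _, v in pairs})
    row = {n: i for i, n in enumerate(nouns)}
    col = {v: j for j, v in enumerate(verbs)}
    counts = np.zeros((len(nouns), len(verbs)))
    for n, v in pairs:
        counts[row[n], col[v]] += 1.0
    return nouns, verbs, counts


def project_to_principal_components(features, n_components=3):
    """Project the rows of `features` onto the leading principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical toy data; the paper uses samples from Russian text collections.
toy_pairs = [
    ("dog", "bark"), ("dog", "run"), ("dog", "eat"),
    ("cat", "run"), ("cat", "eat"), ("cat", "sleep"),
    ("car", "drive"), ("car", "stop"),
    ("bus", "drive"), ("bus", "stop"), ("bus", "run"),
]

nouns, verbs, counts = cooccurrence_matrix(toy_pairs)
# Assumption: convert raw counts to relative frequencies per noun.
freqs = counts / counts.sum(axis=1, keepdims=True)
coords = project_to_principal_components(freqs, n_components=3)

for noun, xyz in zip(nouns, coords):
    print(noun, np.round(xyz, 3))
```

Under the hypothesis stated in the abstract, nouns used with similar verbs (here "dog" and "cat", or "car" and "bus") should end up closer to each other in the projected space than nouns used in different contexts.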