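Research on structuring text data has recently become active, driven by the need to apply diverse analysis techniques to unstructured text. Existing document embedding methods, including doc2Vec, generate a document vector from every term in the document, so the vector is influenced not only by the terms that indicate the document's core content but also by non-core terms. In addition, because these methods represent each document as a single vector, they have difficulty accurately representing complex documents that cover multiple topics. This paper proposes a new multi-vector document embedding method to overcome these two limitations. Specifically, the proposed method vectorizes a document using only its core words and semantically decomposes the topics within it, so that one document is represented as a set of multiple vectors. Experiments on a total of 3,147 papers collected from the Korean Studies Information (한국학술정보) site confirmed that representing a complex document as a single vector distorts the vector, and that the proposed method, which semantically decomposes topics and represents them as multiple vectors, corrects this distortion and embeds each document more accurately.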
As demand for text data analysis grows rapidly, research on and investment in text mining are being actively pursued not only in academia but also across industries. Text mining is generally conducted in two steps. In the first step, the text of each collected document is tokenized and structured to convert the original document into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are carried out according to the purpose of the analysis. Until recently, text mining studies focused on the second step. However, with the finding that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data are represented as vectors. Unlike structured data, which can be fed directly into a variety of operations and analysis techniques such as machine learning, unstructured text must be transformed into numeric values that a computer can process before analysis. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure the text data, is called "embedding." Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as the demand for document analysis increases rapidly, many methods have been studied to support it. Among them, doc2Vec, which embeds each document into a single vector, is the most widely used. However, traditional document embedding methods, represented by doc2Vec, generate a document vector using all the words in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words.
Moreover, previous document embedding schemes map each document into a single vector, which makes it difficult to represent the multiple subjects of a complex document accurately. In this paper, we propose a new Multi-Vector Document Embedding method to overcome these limitations of traditional document embedding methods. This study targets documents that explicitly separate body content and keywords. For a document without predefined keywords, the proposed method can be applied after extracting keywords through various analysis methods. However, since keyword extraction is not the core subject of the proposed method, we describe the process of applying it to documents with predefined keywords. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. Specifically, all text in a document is tokenized and each token is converted into an N-dimensional real-valued vector through a word embedding method. Then, to minimize the impact of miscellaneous words on the embedding process, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for that document. Next, clustering is performed on each document's set of keyword vectors to identify the multiple subjects included in the document. Finally, a representative vector is generated from each cluster, and each document is represented by this set of multiple vectors. Experiments on 3,147 academic papers revealed that the single-vector approach cannot properly map complex documents because of interference among subjects. However, we confirmed that the proposed method can vectorize complex documents more accurately by eliminating this interference.
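Steps (3) to (5) above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the clustering algorithm or how the representative vectors are computed, so the sketch assumes k-means with farthest-point initialisation and uses each cluster centroid as the representative vector. The toy 2-dimensional word vectors and the names `toy_word_vectors`, `kmeans`, and `multi_vector_embedding` are hypothetical, and steps (1) and (2) are assumed to have already produced the word vectors.

```python
import math

# Hypothetical toy word vectors standing in for the output of steps (1)-(2).
toy_word_vectors = {
    "embedding": [0.9, 0.1], "tokenization": [1.0, 0.0],  # text-mining topic
    "stock": [0.1, 0.9], "portfolio": [0.0, 1.0],         # finance topic
    "the": [0.5, 0.5], "result": [0.4, 0.6],              # miscellaneous words
}

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def kmeans(vectors, k, iterations=20):
    """Plain k-means (an assumed choice) with deterministic farthest-point init."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the vector farthest from its nearest already-chosen centroid.
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_vector(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def multi_vector_embedding(keywords, word_vectors, k):
    """Steps (3)-(5): keep keyword vectors only, cluster them, return centroids."""
    keyword_vectors = [word_vectors[w] for w in keywords if w in word_vectors]
    if not keyword_vectors:
        return []
    return kmeans(keyword_vectors, min(k, len(keyword_vectors)))

# A "complex" document whose keywords span two subjects.
doc_keywords = ["embedding", "tokenization", "stock", "portfolio"]
multi_vectors = multi_vector_embedding(doc_keywords, toy_word_vectors, k=2)

# Simplified single-vector baseline: the mean of the same keyword vectors
# (a stand-in for illustration, not doc2Vec).
single_vector = mean_vector([toy_word_vectors[w] for w in doc_keywords])

query = toy_word_vectors["tokenization"]
best_multi = max(cosine(query, v) for v in multi_vectors)
single = cosine(query, single_vector)
```

In this toy setting the single averaged vector lies between the two subjects, so its similarity to a text-mining query (`single`) is lower than that of the closest representative vector (`best_multi`). This mirrors, in a very simplified form, the interference among subjects described above.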
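Table of Contents
1. Introduction
2. Related Work
2.1. Text Analysis
2.2. Text Embedding
3. Proposed Methodology
3.1. Research Model
3.2. Word Vectorization
3.3. Multi-Vector Embedding
4. Experiments
4.1. Experiment Overview
4.2. Complex Document Generation
4.3. Document Vector Generation
4.4. Performance Evaluation
4.4.1. Performance Evaluation Metrics
4.4.2. Performance Analysis Results
4.4.3. Analysis of the Effect of Multi-Vector Representation
5. Conclusion
References
Abstract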