A Document Embedding Methodology Based on Semantic Decomposition of Text

박종인

Document type: Thesis

Author: 박종인 (Kookmin University, Graduate School of Business IT)

Advisor: 김남규

Year of publication: 2020

Copyright: Kookmin University theses are protected by copyright.

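Research history of this thesis (4)

2020

A Document Embedding Methodology Based on Semantic Decomposition of Text

박종인. Business IT, 2020.01. Thesis

2019

A Multi-Vector Document Embedding Methodology Based on Semantic Decomposition of Complex Documents

박종인, 김남규. 지능정보연구, 2019.09. Journal article

A Multi-Vector Document Embedding Methodology Based on Semantic Decomposition of Complex Documents

박종인, 김남규. 한국지능정보시스템학회 conference proceedings, 2019.06. Conference paper

2018

A Hybrid Document Embedding Methodology Using Domain Knowledge

박종인, 김남규. 한국지능정보시스템학회 conference proceedings, 2018.11. Conference paper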


Abstract and Keywords

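Recently, research on ways to structure text data has been actively pursued so that a variety of analysis techniques can be applied to unstructured text. Existing document embedding methods, including doc2Vec, generate a document vector from every term in the document, so the vector is influenced not only by terms that point to the core content of the document but also by non-core terms. In addition, the existing approach of representing one document as one vector has difficulty accurately representing complex documents that cover several topics. This thesis proposes a new multi-vector document embedding methodology to overcome these two limitations. Specifically, the proposed methodology vectorizes a document using only its core words, and semantically decomposes the topics within the document so that a single document is represented as a set of several vectors. Experiments on a total of 3,147 papers collected from the 한국학술정보 (Korean Studies Information) site confirmed that representing a complex document as a single vector distorts the vector, and that the proposed methodology, which semantically decomposes topics and represents them as multiple vectors, corrects this distortion and embeds each document more accurately.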

With the rapidly increasing demand for text data analysis, research and investment in text mining are being actively pursued not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of each collected document is tokenized and structured to convert the original document into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are carried out according to the purpose of the analysis. Until recently, text mining studies focused on the second step. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data is represented as vectors.
Unlike structured data, which can be used directly with a variety of operations and analysis techniques such as machine learning, unstructured text must be transformed into numeric values that a computer can process before it is analyzed. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure text data, is called "embedding." Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as the demand for document analysis increases rapidly, many methods have been studied to support it. Among them, doc2Vec, which embeds each document into a single vector, is the most widely used.
However, traditional document embedding methods, represented by doc2Vec, generate a document vector using all of the words in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Moreover, these document embedding schemes map each document to a single vector, which makes it difficult to represent the multiple subjects of a complex document accurately. In this paper, we propose a new Multi-Vector Document Embedding method to overcome these limitations of traditional document embedding methods.
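The single-vector limitation can be illustrated with a small numeric sketch. The two-dimensional topic vectors below are made up for illustration, and simple averaging stands in for a single-vector embedding; it is not doc2Vec and not the procedure used in the thesis.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical topic directions for a document that covers two unrelated subjects.
topic_a = np.array([1.0, 0.0])
topic_b = np.array([0.0, 1.0])

# Single-vector view: the two subjects are blended into one averaged vector.
single_vector = (topic_a + topic_b) / 2.0

# Multi-vector view: one vector is kept per subject.
multi_vectors = [topic_a, topic_b]

# Similarity of each representation to a query that is purely about topic A.
sim_single = cosine(single_vector, topic_a)
sim_multi = max(cosine(v, topic_a) for v in multi_vectors)

print(f"single vector: {sim_single:.3f}, multi-vector: {sim_multi:.3f}")
```

In this toy setting the averaged vector sits between the two subjects and matches neither of them fully, while the multi-vector representation keeps an exact match for each subject.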
This study targets documents that explicitly separate body content from keywords. For a document without predefined keywords, the proposed method can be applied after extracting keywords through various analysis methods. However, since keyword extraction is not the core subject of the proposed method, we describe the process of applying the method to documents with predefined keywords.
The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. Specifically, all text in each document is tokenized, and each token is converted into an N-dimensional real-valued vector through a word embedding method. Then, to minimize the impact of miscellaneous words on the embedding process, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for that document. Next, clustering is conducted on the keyword set of each document to identify the multiple subjects it contains. Finally, a representative vector is generated from each cluster, and each document is represented by this set of vectors. Experiments on 3,147 academic papers revealed that the single vector-based approach cannot properly map complex documents because of interference among subjects. In contrast, we ascertained that the proposed method can vectorize complex documents more accurately by eliminating this interference.
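Steps (3) to (5) can be sketched roughly as follows. This is a minimal illustration under several assumptions that the abstract does not confirm: the word vectors are a hypothetical hand-written lookup table standing in for steps (1) and (2), the clustering algorithm is plain k-means with a deterministic farthest-point initialization, the number of clusters `k` is supplied by the caller, and each representative vector is taken to be the cluster centroid.

```python
import numpy as np

# Hypothetical toy word embeddings (stand-in for parsing and word embedding).
WORD_VECTORS = {
    "embedding": np.array([0.9, 0.1]),
    "vector": np.array([1.0, 0.0]),
    "word2vec": np.array([0.8, 0.2]),
    "marketing": np.array([0.1, 0.9]),
    "customer": np.array([0.0, 1.0]),
}

def keyword_vectors(keywords, word_vectors):
    """Step 3: keep only the vectors of the document's keywords."""
    return np.array([word_vectors[k] for k in keywords if k in word_vectors])

def kmeans(points, k, n_iter=20):
    """Step 4: plain k-means (an assumed choice of clustering algorithm)."""
    # Deterministic init: start from the first point, then repeatedly add
    # the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then recompute the centers.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

def multi_vector_embedding(keywords, word_vectors, k):
    """Step 5: one representative vector (here, the centroid) per keyword cluster."""
    points = keyword_vectors(keywords, word_vectors)
    _, centers = kmeans(points, k)
    return centers

# Keywords of one hypothetical complex document covering two subjects.
doc_keywords = ["embedding", "vector", "word2vec", "marketing", "customer"]
doc_vectors = multi_vector_embedding(doc_keywords, WORD_VECTORS, k=2)
print(doc_vectors)
```

With these toy inputs the document ends up represented by two vectors, one per keyword group, instead of a single blended vector. How the thesis chooses the number of clusters and derives the representative vector is not stated in the abstract, so both choices here are placeholders.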

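1. Introduction
2. Related Work
2.1. Text Analysis
2.2. Text Embedding
3. Proposed Methodology
3.1. Research Model
3.2. Word Vectorization
3.3. Multi-Vector Embedding
4. Experiments
4.1. Experiment Overview
4.2. Complex Document Generation
4.3. Document Vector Generation
4.4. Performance Evaluation
4.4.1. Performance Evaluation Measures
4.4.2. Performance Analysis Results
4.4.3. Analysis of the Effect of Multi-Vector Representation
5. Conclusion
References
Abstract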
