Improving the Performance of Text Classification Models Using Keyword Extraction and Data Augmentation Techniques

English title: Improving the performance of text classification models using keyword extraction and data augmentation techniques
Publisher: The Korean Data Analysis Society (한국자료분석학회)
Authors: Kangchul Lee (이강철), Jeongyong Ahn (안정용)
Publication: Journal of The Korean Data Analysis Society (JKDAS), Vol. 24, No. 5, pp. 1719-1731 (13 pages)
Subject classification: Natural Sciences > Statistics
File format: PDF
Publication date: 2022-10-31

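Korean abstract

Topic modeling is a technique for discovering and classifying the topics latent in documents, and it is useful for identifying the core topic of each document and the characteristics of the topics. However, when the same word receives a high weight in several topics, it becomes difficult to extract keywords that discriminate between topics. In addition, the technique loses information when words that are semantically similar to a keyword are not themselves selected as keywords, and its classification performance varies with the size and quality of the data. To address these problems, this study proposes applying a relevance measure and word embedding when extracting keywords. Furthermore, to improve classification performance, the data are augmented with EDA (easy data augmentation) techniques and the KoBERT model is then applied. The analysis yielded keywords that discriminate between topics, which made it possible to identify the specific content of each topic. Moreover, with data augmentation the classifier reached 94% accuracy, an improvement of about 9 percentage points over the case without data augmentation.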
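
The abstract does not state the formula for the relevance measure. As a minimal sketch, assuming it is the relevance of Sievert and Shirley (2014) used in LDAvis, relevance(w, t) = λ log p(w|t) + (1 - λ) log(p(w|t) / p(w)), the example below shows how a word that is frequent in every topic drops in the ranking once λ < 1. The function names, the toy probabilities, and the choice λ = 0.6 are illustrative assumptions, not values from the paper.

```python
import math


def relevance(p_w_given_t, p_w, lam=0.6):
    """Relevance of a word for a topic (assumed Sievert-Shirley form).

    lam = 1 ranks purely by the topic-word probability p(w|t);
    lam = 0 ranks purely by lift, p(w|t) / p(w).
    """
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


def top_keywords(topic_word_probs, corpus_word_probs, lam=0.6, k=3):
    """Return the k words with the highest relevance for one topic."""
    scores = {
        word: relevance(p, corpus_word_probs[word], lam)
        for word, p in topic_word_probs.items()
        if p > 0
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy numbers (hypothetical): "said" is common in the whole corpus,
# while "game" and "team" are concentrated in this one topic.
topic = {"said": 0.10, "game": 0.08, "team": 0.06}
corpus = {"said": 0.10, "game": 0.01, "team": 0.008}

by_probability = top_keywords(topic, corpus, lam=1.0)
by_relevance = top_keywords(topic, corpus, lam=0.6)
print(by_probability)
print(by_relevance)
```

With λ = 1 the common word leads the list because only p(w|t) counts; with λ = 0.6 the lift term moves the topic-specific words ahead of it, which is the discriminating behaviour the abstract describes.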

English abstract

Topic modeling aims to identify and categorize the topics latent in documents, and is useful for exploring the core topics of each document and the characteristics of those topics. However, one problem in interpreting topics with this technique is that common terms often appear near the top of multiple topics, making it hard to extract keywords that identify the topics. Another weakness is that the technique can lose information when synonyms are excluded from the keywords, and high performance often depends on the size and quality of the data. To address these problems, we propose a method that uses relevance and word embedding techniques for extracting keywords. In addition, we use EDA (easy data augmentation) techniques to increase the size of the data, and then apply the KoBERT model to boost performance on text classification tasks. The data analysis showed that the specific characteristics of the topics could be grasped from the discriminating keywords. The results also showed that the text classifier achieved higher accuracy on the augmented data sets than on the original data sets, with scores of 0.94 and 0.85, respectively.
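
The abstract names EDA but does not describe its operations. As a rough sketch, assuming the four standard EDA operations of Wei and Zou (2019), namely synonym replacement, random insertion, random swap, and random deletion, the code below applies each one to a tokenized sentence. The toy synonym dictionary, the function signatures, and the English example sentence are hypothetical; the paper works on Korean text and its settings are not given here.

```python
import random


def synonym_replacement(words, synonyms, n, rng):
    """Replace up to n words that have an entry in the synonym dictionary."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


def random_insertion(words, synonyms, n, rng):
    """Insert n synonyms of randomly chosen words at random positions."""
    out = list(words)
    candidates = [w for w in words if w in synonyms]
    for _ in range(n):
        if not candidates:
            break
        synonym = rng.choice(synonyms[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "the movie was good".split()
toy_synonyms = {"good": ["great", "fine"], "movie": ["film"]}  # hypothetical

augmented = [
    synonym_replacement(sentence, toy_synonyms, 1, rng),
    random_insertion(sentence, toy_synonyms, 1, rng),
    random_swap(sentence, 1, rng),
    random_deletion(sentence, 0.1, rng),
]
for tokens in augmented:
    print(" ".join(tokens))
```

Each augmented sentence keeps the label of its source sentence, so the training set grows without new annotation; the enlarged set would then be passed to the classifier (KoBERT in the paper).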

Keywords

topic model, relevance, word embedding, data augmentation, text classification
