Document Classification using Distributed Representation of Words as Features

TANAKA, Masaaki

https://doi.org/10.15112/00014520

Item type

Departmental Bulletin Paper (ELS) / Departmental Bulletin Paper(1)

Publication date

2019-02-25

Title

Document Classification using Distributed Representation of Words as Features

Language

jpn

Keyword

Subject scheme

Other

Subject

document classification

Keyword

Subject scheme

Other

Subject

distributed representation

Keyword

Subject scheme

Other

Subject

Word2Vec

Keyword

Subject scheme

Other

Subject

skip-gram model

Keyword

Subject scheme

Other

Subject

natural language processing

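Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

departmental bulletin paper

ID registration

10.15112/00014520

ID registration type

JaLC

Page attribute

Description type

Other

Description

P (paper)

Alternative title (other language, etc.)

Alternative title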

Document Classification using Distributed Representation of Words as Features

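Author name (Japanese)

田中, 昌昭

Author alternative name

Full name

TANAKA, Masaaki

Author affiliation (Japanese)

Department of Health Informatics, Faculty of Health and Welfare Services Administration, Kawasaki University of Medical Welfare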

Abstract

Document classification is one of the representative research topics in natural language processing and has been applied to topic classification, reputation analysis, filtering, and other tasks. Conventionally, word frequency has been used as the feature of a document. However, it is difficult to compute the similarity or relatedness between words from the information carried by the words themselves. The author therefore used distributed representations of words as features, aiming to improve classification performance. First, 7,881 abstracts, excluding duplicates, were extracted from ICHUSHI Web, a Japanese database of medical literature, and used as the training corpus. Next, vector representations of words (distributed representations) were obtained with the skip-gram model. Experiments were then performed to classify the abstracts into five diseases, using the centroid and the composite vector of the obtained word vectors as features. For evaluation, the results were compared with those of the conventional method based on word frequency. The accuracy of the proposed method was 0.770, which did not exceed that of the conventional method (0.807) but was comparable to it. Possible reasons for the lower performance include the quality of the word vectors, word polysemy, and problems in feature selection. In particular, there is room for improvement in feature selection, which discards most of the acquired information without using it.
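The feature construction summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the toy three-dimensional word vectors, the function names (`composite_vector`, `centroid_vector`, `frequency_features`, `accuracy`), and the reading of "composite vector" as an element-wise sum are all hypothetical, since the record itself gives no such details. In practice the word vectors would come from a trained skip-gram model rather than a hand-written table.

```python
from collections import Counter

# Toy word vectors (hypothetical values, for illustration only).
# A real experiment would use vectors learned by a skip-gram model.
WORD_VECTORS = {
    "insulin": [1.0, 0.0, 0.5],
    "glucose": [0.5, 0.5, 0.0],
    "tumor": [0.0, 1.0, 1.0],
}


def composite_vector(tokens, vectors):
    """Element-wise sum of the vectors of all tokens that have a vector.

    Assumption: 'composite vector' is read here as a plain vector sum.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a word vector")
    return [sum(components) for components in zip(*known)]


def centroid_vector(tokens, vectors):
    """Mean of the vectors of all tokens that have a vector."""
    known_count = sum(1 for t in tokens if t in vectors)
    return [c / known_count for c in composite_vector(tokens, vectors)]


def frequency_features(tokens, vocabulary):
    """Baseline features: raw count of each vocabulary word in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def accuracy(predicted, actual):
    """Fraction of documents whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

Either feature vector could then be passed to any standard classifier. Note that the centroid and the sum differ only by a scale factor per document, so the choice mainly matters for classifiers that are sensitive to vector length.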

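Article type (Japanese)

Description type

Other

Description

Original article

Bibliographic information

Kawasaki medical welfare journal (川崎医療福祉学会誌)

Vol. 28, No. 1-2, p. 167-178, issued 2018

Publisher

Publisher

Kawasaki Medical Welfare Society (川崎医療福祉学会)

Journal title (other language)

Kawasaki medical welfare journal

Journal bibliographic ID

Source identifier type

NCID

Source identifier

AN10375470

ISSN

Source identifier type

ISSN

Source identifier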

0917-4605
