🌲자라나는청년

Trying Out FastText Classification

JihyunLee 2021. 1. 8. 13:12

fastText is a library for word embeddings and text classification built by Facebook's AI Research (FAIR) lab. It ships pretrained models for 294 languages, and Korean is one of them (hooray!).

I recently entered the MZ text classification competition and used fastText classification for it, so here are some notes on how to use it and how it went.

Differences from word2vec

Classic word2vec learns one vector per whole word.

Because of that, it cannot return a proper vector for a word that is missing from its vocabulary. fastText, on the other hand, also splits each word into character n-grams during training, so to some extent it can still handle typos and newly coined words.

Figure: how the two approaches differ when an unknown word appears
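To make the n-gram idea concrete, here is a minimal sketch of how a word can be cut into character n-grams. The function name `char_ngrams`, the n-gram length of 3, and the word pair "where" / "wheer" are my own illustration and not taken from the competition code. The real library additionally hashes these n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped fastText-style."""
    # The word is wrapped in boundary markers before slicing, so that
    # prefixes and suffixes produce their own distinct n-grams.
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A correctly spelled word and a typo still share some n-grams,
# which is why the typo does not end up with a completely unrelated vector.
known = set(char_ngrams("where", 3, 3))
typo = set(char_ngrams("wheer", 3, 3))
shared = known & typo
print(sorted(shared))  # ['<wh', 'whe']
```

One thing worth checking in your own setup: as far as I know, `train_supervised` defaults to `minn=0` and `maxn=0`, meaning character n-grams are switched off for classification unless you set those options yourself.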

There is also a difference in speed.

fastText is fast! Training a classification model on 60,000 sentences took less than 5 minutes.

The model is larger than a word2vec model, though, so loading it took a long time. In particular, applying the pretrained Korean model that fastText provides takes quite a while, sadly.

Training and testing with fastText

data:

First, the data has to be converted into a format that fastText can train on.

__label__1.차량제어    __label__운전제어    __label__일자따라가기    내 앞 에 있 는 차 계속 따라가 .
__label__1.차량제어    __label__운전제어    __label__일자따라가기    차 따라붙 기 가능 하 냐 ?
__label__1.차량제어    __label__운전제어    __label__일자따라가기    앞차 와 제 동거리 를 유지 하 면서 따라가 줘 .
__label__1.차량제어    __label__운전제어    __label__일자따라가기    앞차 따라 안 붙 어 ?
__label__1.차량제어    __label__운전제어    __label__일자따라가기    앞 에 차 뒤 에 붙 어 줄래 ?
__label__1.차량제어    __label__운전제어    __label__일자따라가기    나 의 앞차 를 따라 가 주 렴 .
__label__1.차량제어    __label__운전제어    __label__일자따라가기    앞 따라가 가 기 는 할 수 있 지 ?

In the txt file, each line starts with __label__[class name], followed by the sentence after morpheme segmentation, with the fields separated by '\t'. A single sentence can carry several classes.
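The conversion described above can be sketched roughly like this. The helper name `to_fasttext_line` and the sample labels and tokens are just for illustration. It also assumes the sentence has already been split into morphemes by whatever tokenizer you use, since that step is not shown in this post.

```python
def to_fasttext_line(labels, tokens):
    """Build one training line: tab-separated labels, then the tokenized sentence."""
    for label in labels:
        # fastText splits on whitespace, so a label containing a space or tab
        # would be read as a label followed by ordinary words.
        if any(ch.isspace() for ch in label):
            raise ValueError(f"label must not contain whitespace: {label!r}")
    label_part = "\t".join(f"__label__{label}" for label in labels)
    text_part = " ".join(tokens)
    return f"{label_part}\t{text_part}"


line = to_fasttext_line(["1.차량제어", "운전제어"], ["앞차", "따라", "가", "줘"])
print(line)
```

Writing one such line per sentence into a txt file gives a file in the same shape as the sample data above.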

Training the model is simple too.

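import fasttext

# Train a supervised classifier; replace the bracketed placeholders with real paths.
model1 = fasttext.train_supervised(
    input="[training file path]",
    epoch=100,       # number of passes over the training data
    bucket=20000,    # number of hash buckets for n-gram features
    lr=1,            # learning rate
    wordNgrams=2,    # use word bigrams in addition to single words
    dim=80,          # size of the word vectors
)

# test() returns (number of samples, precision at 1, recall at 1)
print(model1.test("[test file path]"))

model1.save_model(".[model save path].bin")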

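Give it the path of the training file and it takes care of training on its own. After that, all you have to do is save the trained model.

Performance can be checked with model.test. It returns the number of test samples together with precision and recall at 1, which serves as a quick accuracy check.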
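Since model.test hands back a bare tuple, a tiny helper makes the numbers easier to read. This is only a sketch: `summarize_test` is a name I made up, and the tuple `(1000, 0.9, 0.3)` is a made-up example, not a result from the competition.

```python
def summarize_test(result):
    """Turn the (N, precision, recall) tuple from model.test into a labeled dict."""
    n, precision, recall = result
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall; guard against 0/0.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"samples": n, "precision@1": precision, "recall@1": recall, "f1": f1}


# Hypothetical values, only to show the shape of the output.
summary = summarize_test((1000, 0.9, 0.3))
print(summary)
```

When every sentence has several labels, as in the data above, recall at 1 stays low even for a good model, because only one label is predicted per sentence. Calling `model.test(path, k=3)` would be one way to look at the top three predictions instead.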

Below is an example of running predict on a couple of sentences.

>>> model1.predict("앞 차를 따라 가 줘")  # "Follow the car in front"
(('__label__1.차량제어',), array([1.00001001]))
>>> model1.predict("응급실 로 가는 길 을 알려 줘")  # "Show me the way to the emergency room"
(('__label__5.의료',), array([1.00001001]))
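The raw output is a pair of (labels, probabilities), with the `__label__` prefix still attached and probabilities that can land slightly above 1, as in the output above. A small post-processing sketch, assuming you just want clean class names: `clean_prediction` is a hypothetical helper, and a plain list stands in for the NumPy array so the snippet runs without fastText installed.

```python
def clean_prediction(prediction, prefix="__label__"):
    """Strip the label prefix and clip probabilities to at most 1.0."""
    labels, probs = prediction
    cleaned = []
    for label, prob in zip(labels, probs):
        name = label[len(prefix):] if label.startswith(prefix) else label
        # Values such as 1.00001001 are clipped so they read as probabilities.
        cleaned.append((name, min(float(prob), 1.0)))
    return cleaned


# Same shape as the predict output shown above.
raw = (("__label__5.의료",), [1.00001001])
print(clean_prediction(raw))  # [('5.의료', 1.0)]
```

Because each training sentence has three labels, `model1.predict(text, k=3)` may be worth trying as well; it returns the top three labels instead of just one.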

 

Training is fast and the performance is reasonably good too, so overall I think it is a solid choice!
