Multi-stage automatic NE and PoS annotation using pattern-based and statistical-based techniques for Thai corpus construction

Organization: Thammasat University Research and Consultancy Institute

Details

Title	:	Multi-stage automatic NE and PoS annotation using pattern-based and statistical-based techniques for Thai corpus construction
Researchers	:	Nattapong Tongtep, Thanaruk Theeramunkong
Keywords	:	Corpus construction, Multi-stage annotation, Named entity, Part of speech, Syllabic alphabetic language
Organization	:	Thammasat University Research and Consultancy Institute
Collaborators	:	-
Year of publication	:	2556 BE (2013 CE)
Reference	:	IEICE Transactions on Information and Systems, E96-D, 10 (2013), pp. 2245-2256, ISSN 0916-8532, http://dspace.library.tu.ac.th/handle/3517/7045
Source	:	-
Expertise	:	-
Relation	:	-
Coverage	:	-
Abstract/Description	:	Automated or semi-automated annotation is a practical solution for large-scale corpus construction. However, special characteristics of the Thai language, such as the lack of word-boundary and sentence-boundary markers, raise several issues in automatic corpus annotation. This paper presents a multi-stage annotation framework consisting of two chunking stages and three tagging stages. The two chunking stages are pattern-matching-based named entity (NE) extraction and dictionary-based word segmentation, while the three succeeding tagging stages are dictionary-, pattern- and statistical-based tagging. Applying a heuristic of ambiguity priority, NE extraction is performed first on the original text using a set of patterns, applied in order of pattern ambiguity. Next, the remaining text is segmented into words with a dictionary. The obtained chunks are then tagged with named-entity types or parts of speech (PoS) using dictionaries, patterns and statistics. Focusing on reducing human intervention in corpus construction, our experimental results show that the dictionary-based tagging process assigns unique tags to 64.92% of the words, leaving 24.14% as unknown words and 10.94% as ambiguously tagged words. Subsequently, pattern-based tagging reduces unknown words to 13.34%, while statistical-based tagging reduces ambiguously tagged words to 3.01%. Copyright © 2013 The Institute of Electronics, Information and Communication Engineers.
Bibliography	:
APA	:	Nattapong Tongtep, Thanaruk Theeramunkong. (2556 [2013]). Multi-stage automatic NE and PoS annotation using pattern-based and statistical-based techniques for Thai corpus construction. Bangkok: Thammasat University Research and Consultancy Institute.
Chicago	:	Nattapong Tongtep, Thanaruk Theeramunkong. 2556 [2013]. "Multi-stage automatic NE and PoS annotation using pattern-based and statistical-based techniques for Thai corpus construction". Bangkok: Thammasat University Research and Consultancy Institute.
MLA	:	Nattapong Tongtep, Thanaruk Theeramunkong. "Multi-stage automatic NE and PoS annotation using pattern-based and statistical-based techniques for Thai corpus construction." Bangkok: Thammasat University Research and Consultancy Institute, 2556 [2013]. Print.
Vancouver	:	Nattapong Tongtep, Thanaruk Theeramunkong. Multi-stage automatic NE and PoS annotation using pattern-based and statistical-based techniques for Thai corpus construction. Bangkok: Thammasat University Research and Consultancy Institute; 2556 [2013].
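
The five stages named in the abstract can be pictured with a small toy sketch. Everything below is an illustrative assumption rather than the authors' implementation: the patterns, the lexicon, the bigram counts, the function names (extract_entities, segment, annotate) and the unspaced English sample text are all hypothetical. The sketch also assumes patterns are applied from least to most ambiguous, uses greedy longest matching for segmentation, and uses tag-bigram counts as a stand-in for the statistical tagger, none of which the abstract specifies. For brevity, NE chunks receive their type at extraction time instead of in a later tagging stage.

```python
import re

# Stage 1 patterns, assumed to be ordered from least to most ambiguous.
NE_PATTERNS = [
    ("DATE", re.compile(r"\d{1,2}/\d{1,2}/\d{4}")),
    ("NUM", re.compile(r"\d+")),
]

# Hypothetical lexicon: word -> set of possible PoS tags.
LEXICON = {
    "meeting": {"NOUN", "VERB"},
    "meet": {"VERB"},
    "in": {"PREP"},
    "room": {"NOUN"},
}

# Stage 4 patterns for words the lexicon does not cover (toy example).
UNKNOWN_PATTERNS = [("LOC", re.compile(r"[a-z]+ville$"))]

# Toy tag-bigram counts standing in for corpus statistics.
BIGRAM_COUNTS = {("<s>", "NOUN"): 8, ("<s>", "VERB"): 2}


def extract_entities(text):
    """Stage 1: split text into (chunk, tag) pairs; tag is None if unmatched."""
    chunks = [(text, None)]
    for tag, pattern in NE_PATTERNS:
        next_chunks = []
        for chunk, chunk_tag in chunks:
            if chunk_tag is not None:  # already claimed by an earlier pattern
                next_chunks.append((chunk, chunk_tag))
                continue
            pos = 0
            for m in pattern.finditer(chunk):
                if m.start() > pos:
                    next_chunks.append((chunk[pos:m.start()], None))
                next_chunks.append((m.group(), tag))
                pos = m.end()
            if pos < len(chunk):
                next_chunks.append((chunk[pos:], None))
        chunks = next_chunks
    return chunks


def segment(text):
    """Stage 2: greedy longest-match segmentation against the lexicon."""
    words, unknown, i = [], "", 0
    max_len = max(map(len, LEXICON))
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in LEXICON:
                if unknown:  # flush characters no dictionary word covered
                    words.append(unknown)
                    unknown = ""
                words.append(text[i:j])
                i = j
                break
        else:
            unknown += text[i]
            i += 1
    if unknown:
        words.append(unknown)
    return words


def annotate(text):
    """Run all stages and return a list of (chunk, tag) pairs."""
    tagged, prev = [], "<s>"
    for chunk, ne_tag in extract_entities(text):
        words = [(chunk, ne_tag)] if ne_tag else [(w, None) for w in segment(chunk)]
        for word, tag in words:
            if tag is None:
                candidates = LEXICON.get(word, set())  # Stage 3: dictionary
                if not candidates:  # Stage 4: patterns for unknown words
                    candidates = {t for t, p in UNKNOWN_PATTERNS if p.search(word)}
                if len(candidates) > 1:  # Stage 5: statistical disambiguation
                    tag = max(sorted(candidates),
                              key=lambda t: BIGRAM_COUNTS.get((prev, t), 0))
                elif candidates:
                    tag = next(iter(candidates))
                else:
                    tag = "UNK"
            tagged.append((word, tag))
            prev = tag
    return tagged


print(annotate("meeting12/10/2013inroomoakville"))
```

In this toy run, applying the DATE pattern before the more ambiguous NUM pattern keeps "12/10/2013" in one chunk, the lexicon resolves "in" and "room" uniquely, the pattern stage covers the unknown word "oakville", and the bigram counts choose between the two candidate tags of "meeting".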