A Rule-based method for Thai elementary discourse unit segmentation (TED-Seg)

Organization: Thammasat University Research and Consultancy Institute

Details

Title	:	A Rule-based method for Thai elementary discourse unit segmentation (TED-Seg)
Researchers	:	Nongnuch Ketui, Thanaruk Theeramunkong, Chutamanee Onsuwan
Keywords	:	Chart parser, Discourse unit segmentation, Thai elementary discourse unit
Organization	:	Thammasat University Research and Consultancy Institute
Collaborators	:	-
Year published	:	2555 BE (2012 CE)
Reference	:	Proceedings - 2012 7th International Conference on Knowledge, Information and Creativity Support Systems, KICSS 2012 (2012), Art. no. 6405529, pp. 195-202, ISBN 978-076954861-6, http://dspace.library.tu.ac.th/handle/3517/7003
Source	:	-
Expertise	:	-
Relation	:	-
Coverage	:	-
Abstract/Description	:	Discovering discourse units in Thai, a language without word and sentence boundaries, is not straightforward because of its high part-of-speech (POS) ambiguity and serial verb constituents. This paper introduces definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and a longest-matching-based chart parser. The T-EDU definitions are used to construct a set of context-free grammar (CFG) rules: 446 CFG rules are constructed from 1,340 T-EDUs extracted from the NE- and POS-tagged corpus Thai-NEST. These T-EDUs are evaluated by two linguists, with a kappa score of 0.68. Separately, a two-level evaluation is applied: one in an arranged setting where the text is pre-chunked, and the other in a normal setting where the original running text is used for testing. By specifying one grammar rule per T-EDU instance, perfect recall (100%) can be achieved in a closed test, where the testing corpus and the training corpus are the same, but recall of approximately 36.16% and 31.69% is obtained for the chunked and the running texts, respectively. For an open test with 3-fold cross-validation, the recall is around 67% while the precision is only 25-28%. To improve precision, two alternative strategies are applied: left-to-right longest matching (L2R-LM) and maximal longest matching (M-LM). The results show that L2R-LM and M-LM improve the precision to 93.97% and 94.03%, respectively, for the running text in the closed test, while the recall drops slightly to 94.18% and 92.91%. For the running text in the open test, the f-score improves to 57.70% and 54.14% for L2R-LM and M-LM, respectively. © 2012 IEEE.
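The left-to-right longest matching (L2R-LM) idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' chart parser: it assumes each grammar rule has been flattened into a plain sequence of POS tags, and the tag names, the rule set `RULES`, and the helper names `l2r_longest_match` and `precision_recall_f` are all hypothetical, chosen only for illustration. At each position the sketch picks the longest rule that matches, then continues after the matched span; the second helper shows how precision, recall, and f-score could be computed over exact-match spans.

```python
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, ...]   # a rule flattened to a POS-tag sequence (assumption)
Span = Tuple[int, int]   # (start, end) token offsets, end exclusive


def l2r_longest_match(tags: Sequence[str], rules: Set[Rule]) -> List[Span]:
    """Greedy left-to-right segmentation: at each position take the longest
    tag sequence that matches a rule, then continue after it."""
    max_len = max((len(r) for r in rules), default=0)
    spans: List[Span] = []
    i = 0
    while i < len(tags):
        matched = False
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(tags) - i), 0, -1):
            if tuple(tags[i:i + length]) in rules:
                spans.append((i, i + length))
                i += length
                matched = True
                break
        if not matched:
            i += 1  # no rule starts here; skip this token
    return spans


def precision_recall_f(pred: List[Span], gold: List[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and f-score over predicted spans."""
    tp = len(set(pred) & set(gold))
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical toy rules and input, not taken from Thai-NEST.
RULES: Set[Rule] = {
    ("NOUN", "VERB"),
    ("NOUN", "VERB", "NOUN"),
    ("CONJ", "VERB", "NOUN"),
}
TAGS = ["NOUN", "VERB", "NOUN", "CONJ", "VERB", "NOUN"]

spans = l2r_longest_match(TAGS, RULES)
print(spans)  # [(0, 3), (3, 6)]
```

In this toy input both `("NOUN", "VERB")` and `("NOUN", "VERB", "NOUN")` match at position 0; preferring the longer one is what prevents the shorter, overlapping candidate from being emitted as a separate unit, which is one plausible reading of why longest matching raises precision.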
Bibliography	:
APA: Nongnuch Ketui, Thanaruk Theeramunkong, Chutamanee Onsuwan. (2555 BE). A Rule-based method for Thai elementary discourse unit segmentation (TED-Seg). Bangkok: Thammasat University Research and Consultancy Institute.
Chicago: Nongnuch Ketui, Thanaruk Theeramunkong, Chutamanee Onsuwan. 2555 BE. "A Rule-based method for Thai elementary discourse unit segmentation (TED-Seg)". Bangkok: Thammasat University Research and Consultancy Institute.
MLA: Nongnuch Ketui, Thanaruk Theeramunkong, Chutamanee Onsuwan. "A Rule-based method for Thai elementary discourse unit segmentation (TED-Seg)." Bangkok: Thammasat University Research and Consultancy Institute, 2555 BE. Print.
Vancouver: Nongnuch Ketui, Thanaruk Theeramunkong, Chutamanee Onsuwan. A Rule-based method for Thai elementary discourse unit segmentation (TED-Seg). Bangkok: Thammasat University Research and Consultancy Institute; 2555 BE.