Development of Network-Based Resources for Natural Language Processing Research (Phase 3)


Agency: National Science and Technology Development Agency (NSTDA)

Details

Title	:	Development of Network-Based Resources for Natural Language Processing Research (Phase 3)
Researcher	:	อัศนีย์ ก่อตระกูล (Asanee Kawtrakul)
Keywords	:	natural language processing, corpus size, National Electronics and Computer Technology Center, statistical language processing model
Agency	:	National Science and Technology Development Agency (NSTDA)
Collaborators	:	-
Year of publication	:	2543 BE (2000 CE)
Reference	:	http://www.nstda.or.th/thairesearch/node/20086
Source	:	-
Expertise	:	-
Relation	:	-
Scope of content	:	-
Abstract/Description	:	"Research on natural language processing (NLP) is essential to enabling computers to process information expressed in natural, or human, language. Its results can be applied to software that benefits society and has commercial value, for example information retrieval from natural-language queries in text or speech, machine translation, text summarization, automatic document categorization (clustering), and automatic indexing.

There are two main approaches to building NLP systems: the knowledge-based (artificial intelligence) approach and the empirical, statistical-model approach. The former requires time and human resources, especially experts in computational linguistics, to build rule-based systems from carefully handcrafted rules and domain knowledge. Systems built this way are limited and brittle and do not scale, because their capability depends on the size of the knowledge base. The statistical approach is an alternative that replaces the knowledge base with statistics of linguistic phenomena computed from large corpora, and uses them to process language and resolve linguistic problems at every level. For example, n-gram models are used to resolve word boundary ambiguity, part-of-speech ambiguity, and word sense ambiguity. The statistical approach is therefore an important, high-potential approach that offers solutions to four key problems:
- coverage of the knowledge base (linguistic information acquisition)
- coverage of linguistic phenomena across different applications
- robustness when processing real-life text
- flexibility and extensibility, so that models and data can be applied to a new domain, a new problem, and a new set of texts

NLP research and development in Thailand uses both approaches, so it needs time and human resources to build knowledge bases, corpora, and the related software tools. This project therefore aims to study problems and research computational linguistics theory for Thai language processing, and to develop network-based resources that researchers in different faculties and institutions can share. The resources comprise large corpora, both annotated and unannotated with linguistic information, together with software tools for collecting, using, and computing statistics from the corpora. Sharing them is expected to reduce duplicated work and to let systems developed with different approaches or theories be evaluated and compared against the same reference data.

The results of the previous project, which make up the resources currently available on the network, are:
- a 153 MB text corpus, 20% of which is annotated
- a lexicon of 9,529 words with associated information: part of speech (46 categories), word meaning (817 semantic concepts), and co-occurrence information at the word, phrase, and sentence levels
- software tools for processing the corpus: a word-segmentation aid, an XML-based corpus annotation (tagging) program, and a program that computes statistics on linguistic features and phenomena

In addition, the project developed a prototype service for processing Thai documents, consisting of multi-level document indexing and automatic document categorization, which serves to evaluate the ideas and concepts of this research.

However, to reach the project goal and make evaluation possible, the work needs to continue as follows:
- enlarge the corpora in both size and genre so that they cover more, and more varied, real-life text
- study, research, and develop computational linguistics theories and further corpus management and processing tools, leading to research on automatic construction of dictionaries (thesauri) and word networks and on extraction of key information as event sequences
- improve, implement, and distribute the linguistic resources on the VLSHDS (Very Large Scale Hypermedia Delivery System) platform
- improve and develop the prototype network service for automatic indexing and categorization of Thai documents, using a hybrid of knowledge-based and statistical techniques"
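The abstract mentions n-gram models as a way to resolve word boundary ambiguity. The following is a minimal sketch of that idea, not the project's actual software: it enumerates every lexicon-consistent segmentation of an unsegmented Thai string and picks the one with the highest add-one-smoothed bigram score. The lexicon, the bigram counts, and the names `LEXICON`, `BIGRAM_COUNTS`, `CONTEXT_COUNTS`, `segmentations`, `score`, and `best_segmentation` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log

# Toy lexicon and bigram counts: illustrative assumptions, not project data.
LEXICON = {"ตา", "กลม", "ตาก", "ลม", "เธอ", "มี", "ผ้า"}
BIGRAM_COUNTS = Counter({
    ("<s>", "เธอ"): 5,
    ("เธอ", "มี"): 4,
    ("มี", "ตา"): 3,
    ("ตา", "กลม"): 3,
    ("<s>", "ตาก"): 2,
    ("ตาก", "ลม"): 1,
    ("ตาก", "ผ้า"): 2,
})

# How often each word (or the start symbol "<s>") occurs as a bigram context.
CONTEXT_COUNTS = Counter()
for (prev, _), n in BIGRAM_COUNTS.items():
    CONTEXT_COUNTS[prev] += n


def segmentations(text):
    """Yield every way to split text into a sequence of LEXICON words."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in LEXICON:
            for rest in segmentations(text[end:]):
                yield [head] + rest


def score(words):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    vocab = len(LEXICON)
    total = 0.0
    prev = "<s>"
    for word in words:
        numerator = BIGRAM_COUNTS[(prev, word)] + 1
        denominator = CONTEXT_COUNTS[prev] + vocab
        total += log(numerator / denominator)
        prev = word
    return total


def best_segmentation(text):
    """Return the highest-scoring segmentation, or None if the text cannot be split."""
    candidates = list(segmentations(text))
    return max(candidates, key=score) if candidates else None
```

The string "ตากลม" can be read as "ตา กลม" (round eyes) or "ตาก ลม" (exposed to the wind). With these made-up counts, `best_segmentation("เธอมีตากลม")` selects the reading "ตา กลม" because the preceding word "มี" favors "ตา", while the isolated string `"ตากลม"` is split as "ตาก ลม". Exhaustive enumeration is used here only for brevity; a real segmenter over a large corpus would more likely use dynamic programming.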
Bibliography	:	
APA: อัศนีย์ ก่อตระกูล. (2543 BE). Development of Network-Based Resources for Natural Language Processing Research (Phase 3). Pathum Thani: National Science and Technology Development Agency.
Chicago: อัศนีย์ ก่อตระกูล. 2543 BE. "Development of Network-Based Resources for Natural Language Processing Research (Phase 3)". Pathum Thani: National Science and Technology Development Agency.
MLA: อัศนีย์ ก่อตระกูล. "Development of Network-Based Resources for Natural Language Processing Research (Phase 3)." Pathum Thani: National Science and Technology Development Agency, 2543 BE. Print.
Vancouver: อัศนีย์ ก่อตระกูล. Development of Network-Based Resources for Natural Language Processing Research (Phase 3). Pathum Thani: National Science and Technology Development Agency; 2543 BE.