Job-site level fault tolerance for cluster and grid environments

Organization: Thammasat University Research and Consultancy Institute

Details

Title	:	Job-site level fault tolerance for cluster and grid environments
Researchers	:	Limaye, Kshitij; Leangsuksun, Box; Greenwood, Zeno D.; Scott, Stephen L.; Engelmann, Christian; Libby, Richard; Kasidit Chanchio
Keywords	:	fault tolerance, grid computing, job-site level, cluster-based grid environments
Organization	:	Thammasat University Research and Consultancy Institute
Collaborators	:	-
Publication year	:	2548 BE (2005 CE)
Reference	:	Proceedings - IEEE International Conference on Cluster Computing, ICCC, art. no. 4154086, 1530-1885, http://dspace.library.tu.ac.th/handle/3517/3332
Source	:	-
Expertise	:	-
Relation	:	-
Coverage	:	-
Abstract/Description	:	In order to adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. In distributed systems, fault tolerance is normally achieved with checkpoint-recovery or with job replication on alternative resources in case of a system outage. The former approach depends on the system's MTTR, while the latter depends on the availability of alternative sites to run replicas. There is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up on a failed system and seek alternative resources. Results from an implementation of our method in Globus-enabled HA-OSCAR suggest a sizable aggregate performance improvement. The technique, called "Smart Failover", provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically and on critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state.
Bibliography	:
APA: Limaye, Kshitij, Leangsuksun, Box, Greenwood, Zeno D., Scott, Stephen L., Engelmann, Christian, Libby, Richard, Kasidit Chanchio. (2548). Job-site level fault tolerance for cluster and grid environments. Bangkok: Thammasat University Research and Consultancy Institute.
Chicago: Limaye, Kshitij, Leangsuksun, Box, Greenwood, Zeno D., Scott, Stephen L., Engelmann, Christian, Libby, Richard, Kasidit Chanchio. 2548. "Job-site level fault tolerance for cluster and grid environments". Bangkok: Thammasat University Research and Consultancy Institute.
MLA: Limaye, Kshitij, Leangsuksun, Box, Greenwood, Zeno D., Scott, Stephen L., Engelmann, Christian, Libby, Richard, Kasidit Chanchio. "Job-site level fault tolerance for cluster and grid environments." Bangkok: Thammasat University Research and Consultancy Institute, 2548. Print.
Vancouver: Limaye, Kshitij, Leangsuksun, Box, Greenwood, Zeno D., Scott, Stephen L., Engelmann, Christian, Libby, Richard, Kasidit Chanchio. Job-site level fault tolerance for cluster and grid environments. Bangkok: Thammasat University Research and Consultancy Institute; 2548.
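
The abstract describes "Smart Failover" only at a high level: job states are kept in a local job-manager queue, copied to a backup server periodically and on critical system events, and used by the backup to restart jobs after a failover. The following is a minimal Python sketch of that idea, not the paper's implementation. Every name in it (`Job`, `PrimaryServer`, `BackupServer`, `sync_interval`, the `progress` counter) is hypothetical and is not taken from HA-OSCAR or Globus.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    state: str = "queued"   # hypothetical states: queued, running, done
    progress: int = 0       # hypothetical stand-in for saved job state


@dataclass
class BackupServer:
    """Holds the most recent snapshot of the primary's job queue."""
    saved: dict = field(default_factory=dict)

    def receive(self, snapshot):
        self.saved = snapshot

    def take_over(self):
        # On failover, restart every unfinished job from its last saved state.
        return [
            Job(job.job_id, "running", job.progress)
            for job in self.saved.values()
            if job.state != "done"
        ]


class PrimaryServer:
    """Keeps job states in a local queue and mirrors them to the backup."""

    def __init__(self, backup, sync_interval=3):
        self.queue = {}
        self.backup = backup
        self.sync_interval = sync_interval  # assumed: sync every N ticks
        self.ticks_since_sync = 0

    def submit(self, job_id):
        self.queue[job_id] = Job(job_id)

    def sync(self):
        # Send a deep copy so later local changes do not leak to the backup.
        self.backup.receive(copy.deepcopy(self.queue))
        self.ticks_since_sync = 0

    def tick(self):
        # Start queued jobs, advance running ones, then sync periodically.
        for job in self.queue.values():
            if job.state == "queued":
                job.state = "running"
            elif job.state == "running":
                job.progress += 1
        self.ticks_since_sync += 1
        if self.ticks_since_sync >= self.sync_interval:
            self.sync()

    def on_critical_event(self):
        # A critical system event forces an immediate transfer of job states.
        self.sync()
```

In this sketch, work done after the last transfer is lost on failover, so a shorter `sync_interval` trades extra transfer traffic for less repeated work. Calling `on_critical_event()` just before an anticipated outage narrows that gap further.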