During the workshop this group was led by Al Geist. Participants included Michel Jaunin, Martin Frey, Guy Cormier, and Juan Meza. Notes are courtesy of Al Geist.
1. Describe where we are
   resource allocation and user validation
   scheduling
   partitioning of the system
   checkpoint/restart at system level
   disk quotas, archiving, migration
   external media
   statistics accounting
   performance monitoring & tuning
   security
Which of these are: (critical, necessary, useful)?
Which are: (included in OS, add-on to OS, 3rd party, develop on our own)?
What is Distributed Computing? A heterogeneous environment that spans administrative domains
Which are unique to Distributed Computing?
2. Identify problems
All solutions need to span Unix and NT domains – heterogeneity in general
System Admin Tools that span the whole domain
add user – change quotas, modify resource pool, list and kill processes (ps, kill)
Meta-scheduling – coupling existing schedulers, locally developed scheduler run at all sites
Fault Tolerance – automatic detection, recovery/repair, notification
Common Program Development Environment – common set of tools and libraries. (Apps Group)
Heterogeneity between sites, e.g. C compilers, debuggers, …
Conferencing – commercial products exist
Notebooks – useful tool, just set it up, not a big issue
3. Strategies to eliminate problems
System Administration Tools
Web-based resource management/monitoring tools
- add / change quotas for user – for system-wide access
- modify resource pool – local resources only
- user-accessible features: status of resources, my quotas, status of job, list processes, kill job
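
The scoping rule in the bullets above (quota changes apply system-wide, resource pool changes stay local) can be sketched as follows. This is only an illustration under one reading of the notes: the `Domain` class, its method names, and the site and user names are all hypothetical, and a real tool would sit behind a web front end with authentication.

```python
class Domain:
    """In-memory stand-in for a multi-site administrative domain (hypothetical model)."""

    def __init__(self, sites):
        self.quotas = {site: {} for site in sites}    # site -> user -> quota
        self.pools = {site: set() for site in sites}  # site -> resource names

    def set_quota(self, user, quota):
        # Quota changes apply system-wide: every site receives the same value.
        for per_site in self.quotas.values():
            per_site[user] = quota

    def add_resource(self, admin_site, target_site, resource):
        # Assumed rule: a resource pool may only be modified from its own site.
        if admin_site != target_site:
            raise PermissionError(
                f"{admin_site} cannot modify the pool at {target_site}")
        self.pools[target_site].add(resource)

    def my_quotas(self, user):
        # User-accessible query: the caller's quota at each site that has one.
        return {site: q[user] for site, q in self.quotas.items() if user in q}


domain = Domain(["siteA", "siteB"])
domain.set_quota("alice", 100)
domain.add_resource("siteA", "siteA", "node7")
print(domain.my_quotas("alice"))
```

The point of the sketch is the asymmetry: one call fans out to all sites, while the other refuses any request that crosses a site boundary.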
Meta-scheduling
- existing tools – LSF, Condor, DQS, LoadLeveler (limited results from use with large MPP, NT)
- coupling existing schedulers – (like Condor “flock”) home-grown and vendor schedulers.
- Initially define an interface “file” of available-willing-to-share resources (GUSTO approach).
- Longer term, develop a distributed broker/negotiation tool – no one site can have control of the meta-scheduler. (Harness distributed symmetric control could be useful.)
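
A minimal sketch of the interface "file" idea above: each site publishes what it is willing to share, and a simple matcher reads the file. The JSON layout and the field names (`site`, `os`, `cpus_shared`) are assumptions made for illustration; they are not taken from GUSTO or from any of the schedulers listed.

```python
import json

# Hypothetical interface "file" content: one entry per site listing what it is
# willing to share. The layout and field names are assumptions for this sketch.
RESOURCE_FILE = """
[
  {"site": "siteA", "os": "unix", "cpus_shared": 64},
  {"site": "siteB", "os": "nt",   "cpus_shared": 8},
  {"site": "siteC", "os": "unix", "cpus_shared": 16}
]
"""


def find_sites(resource_text, os_name, cpus_needed):
    """Return sites offering at least cpus_needed CPUs on os_name, largest offer first."""
    entries = json.loads(resource_text)
    matches = [e for e in entries
               if e["os"] == os_name and e["cpus_shared"] >= cpus_needed]
    matches.sort(key=lambda e: e["cpus_shared"], reverse=True)
    return [e["site"] for e in matches]


print(find_sites(RESOURCE_FILE, "unix", 16))  # prints ['siteA', 'siteC']
```

A static file like this keeps each site in control of what it advertises, which is why it suits the initial phase; the longer-term broker would replace the file lookup with negotiation between sites.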
Fault Tolerance
- detection (the Lilith tool could be used to help build this); could be integrated with scheduling software
- What to detect? Network/CPU/Disk/Process/Host
- Heartbeat (GUSTO approach, but it's coarse-grained)
- Longer term, develop a fault monitor daemon hierarchy (machine-site-enterprise) that can survive failures and notify the hierarchy of all detections.
- recovery (run in degraded mode) local policies
- repair (recover to full operation) hot-swap hardware, restart application, replace monitor
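
The heartbeat and monitor-hierarchy ideas above can be sketched roughly as follows. The class names, the 30-second timeout, and the event format are hypothetical, and the sketch covers only detection and upward notification; it does not show how the hierarchy itself would survive the loss of a monitor.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per host and reports hosts that have gone silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the sketch can be exercised without waiting
        self.last_seen = {}     # host -> time of last heartbeat

    def beat(self, host):
        self.last_seen[host] = self.clock()

    def failed_hosts(self):
        now = self.clock()
        return sorted(h for h, t in self.last_seen.items()
                      if now - t > self.timeout_s)


class MonitorNode:
    """One level of a machine-site-enterprise hierarchy; detections propagate upward."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.events = []

    def report(self, event):
        self.events.append(event)
        if self.parent is not None:
            self.parent.report(event)


# Usage with a fake clock: node1 stops sending heartbeats, node2 keeps going.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=30, clock=lambda: now[0])
monitor.beat("node1")
monitor.beat("node2")
now[0] = 20.0
monitor.beat("node2")
now[0] = 40.0

enterprise = MonitorNode("enterprise")
site = MonitorNode("site", parent=enterprise)
machine = MonitorNode("machine", parent=site)
for host in monitor.failed_hosts():
    machine.report(("host-down", host))

print(enterprise.events)  # prints [('host-down', 'node1')]
```

The timeout sets the detection granularity: a long timeout gives the coarse-grained behaviour noted above, while a short one detects failures sooner at the cost of more heartbeat traffic and more false alarms.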