Nepal:Redundancy

From OLPC
Latest revision as of 21:06, 9 December 2008

This page is meant to lay out redundancy plans for Nepal's spring pilot of OLPC. See the Nepal page for more details on the pilot.

The Basic Heuristics for Redundancy

(a) Determine the minimum setup and identify all components that can fail individually.
(b) For each failure, determine the impact on the entire system.
(c) Determine an N+1 or N+2 configuration that might address the concern.
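
The three steps above can be sketched in a few lines of Python. This is only an illustration: the `Component` class and `provision` function are hypothetical names, and the impact strings paraphrase the failure cases listed further down this page.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    required: int  # N: units needed for the minimum setup
    impact: str    # what the whole system loses if one unit fails


def provision(component: Component, spares: int = 1) -> int:
    """Return the unit count for an N+spares configuration."""
    if spares < 0:
        raise ValueError("spares must be non-negative")
    return component.required + spares


# Step (a): components of the minimum setup that can fail individually.
# Step (b): the impact of each failure, paraphrased from this page.
MINIMUM_SETUP = [
    Component("XS", 1, "off the Internet; all activities local"),
    Component("Squid server", 1, "Internet offline; Moodle and local Etoys only"),
    Component("Wireless AP/router", 1, "Internet offline; Moodle and local Etoys only"),
]

# Step (c): an N+1 count for each component.
for c in MINIMUM_SETUP:
    print(f"{c.name}: N={c.required}, N+1={provision(c)}, impact: {c.impact}")
```

Passing `spares=2` gives the N+2 variant for any component judged critical enough to need it.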

One View of Failure Cases - GS

Minimum setup:
150 x XOs
XS w/3 x Active Antennas
Squid Server
Wireless AP/Router

Failure cases:

1 - XO fails
Recovery Actions
- Restore image via USB
- Restore user generated content from XS?

Impact During Downtime
Single student down for some time and may need admin help.

2 - XS fails completely or offline
Recovery Actions
- Manually bring online backup XS with identical image.
- See also: http://wiki.laptop.org/go/Nepal:Redundancy#High_Level_Server_Failure_Design_Suggestion

Impact During Downtime
- Mesh stays up?
- Off the Internet
- All activities local (Etoys only, no web sites)

3 - Squid box fails
Recovery Actions
- Update XS routing table to reach internet directly?
- Bring up backup Squid box?
- Shut down Internet access but leave Moodle online?
- See also: http://wiki.laptop.org/go/Nepal:Redundancy#High_Level_Server_Failure_Design_Suggestion

Impact During Downtime
- All internet offline
- Admin intervention needed
- Moodle and local Etoys only

4 - Wireless AP/Router fails
Recovery Actions
- Backup Wireless AP/Router?
- Connect XS or Squid box directly to DSL Modem?

Impact During Downtime
- All internet offline
- Admin intervention needed
- Moodle and local Etoys only

5 - Mesh overload until mesh offline
Recovery Actions
- Take down mesh (how?)
- Associate XOs directly with wireless AP/router (how? prebuilt script or kids click on something?)
-- If wireless/AP takes over, change wireless router gateway to go back to Squid and XS before going over WAN

Impact During Downtime
- XOs offline from each other and the internet
- Admin intervention needed
- Local Etoys only

Individual XOs

  • LiveCD+USB w/ correct image and settings
  • Possible to restore over the network?
  • Need a way to back up and restore individual student files (see XS_backup_restore)
  • Need extra XOs for teachers, at least N + 1 where N is the number of teachers
  • How many extra XOs for kids?

Active Antennas

  • Need 3 antennas
  • 1 active antenna per 100 students
  • 2 antennas in use

How many clients can an active antenna support?
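
The antenna numbers above are consistent with the N+1 heuristic, as this small worked example suggests. The ratio of 1 antenna per 100 students and the 150 XOs come from this page; reading the third antenna as a single spare is an assumption, and `antennas_needed` is a hypothetical helper name.

```python
import math
from typing import Tuple


def antennas_needed(students: int,
                    students_per_antenna: int = 100,
                    spares: int = 1) -> Tuple[int, int]:
    """Return (antennas in use, antennas to have on hand)."""
    in_use = math.ceil(students / students_per_antenna)
    return in_use, in_use + spares


in_use, total = antennas_needed(150)
print(in_use, total)  # 2 3
```

If the answer to the open question about clients per antenna turns out to be lower than 100, only `students_per_antenna` needs to change.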


School Server

There should be two School Servers, one for the 2nd grade class, and one for the 6th grade class. They should mirror each other.

  • Disk Failure
    • LiveCD + USB stick
    • Possibly use Fedora's LVM for disk mirroring
  • CPU failure
    • Have spare CPU fan on hand
    • Have spare School Server on hand
  • System Backups?

High Level Server Failure Design Suggestion

I have a radical idea, and it fits the N+1 redundancy model. For OLE-Nepal, have three machines:
1) The primary XS
2) The primary Squid
3) The backup machine

Now, here's the trick: we install both an XS and a Squid on the "backup machine", and use scripts to activate one or the other, so that the next time it starts up, that is the role it runs as. This simplifies the recovery to:
XS goes down --> tell backup machine to be "primary XS" and reboot machine (with correct IP addresses)
Squid goes down --> tell backup machine to be "primary Squid" and reboot machine (with correct IP addresses)
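
One way the role switch could work is sketched below: an admin tool records the next role in a small state file, and a boot-time script reads it to decide which address and services to bring up. Everything specific here is an assumption for illustration only: the `ROLES` table, the placeholder addresses, the service names, and the state file are hypothetical and not taken from the OLE-Nepal setup.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical role table: addresses and service names are placeholders.
ROLES = {
    "xs": {"ip": "192.0.2.10", "services": ["xs-services"]},
    "squid": {"ip": "192.0.2.20", "services": ["squid"]},
}


def set_next_role(role: str, state_file: Path) -> dict:
    """Record which primary the backup machine should replace on next boot."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    config = {"role": role, **ROLES[role]}
    state_file.write_text(json.dumps(config))
    return config


def read_role(state_file: Path) -> dict:
    """Called from a boot-time script to learn which role to assume."""
    return json.loads(state_file.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "next_role.json"
        set_next_role("squid", state)  # the Squid box went down
        print(read_role(state))
```

Keeping the decision in a single file means the recovery step for an admin is one command plus a reboot, which matches the procedure described above.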

The primary XS and Squid boxes can be scripted to send updates to the backup box, so that when it needs to play one or the other role, it will have all the latest data.
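
The update scripts could be as simple as a scheduled rsync from each primary to the backup box. The sketch below only builds the command line and does not run it; the host name and directory paths are hypothetical, and rsync is a swapped-in suggestion rather than something this page specifies.

```python
import shlex
from typing import List


def build_sync_command(source_dir: str, backup_host: str, dest_dir: str) -> List[str]:
    """Build an rsync command that mirrors source_dir onto the backup box.

    -a preserves permissions and timestamps, -z compresses in transit,
    --delete removes files on the backup that no longer exist on the primary.
    """
    source = source_dir.rstrip("/") + "/"  # trailing slash: copy contents, not the directory itself
    return ["rsync", "-az", "--delete", source, f"{backup_host}:{dest_dir}"]


# Hypothetical paths and host name, for illustration only.
cmd = build_sync_command("/srv/xs-data", "backup.example", "/srv/mirror/xs-data")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from a cron job on each primary once the real paths are known.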

In the event XS goes down and is replaced by the Backup box, then when the repaired XS comes back, it becomes the new backup box.

This also supports scheduled/planned maintenance: putting in a bigger drive, adding memory, etc. One machine can be upgraded while the other two are getting the job done.

While the Squid box may not need as much memory or processor as the XS box, perhaps it would be simpler just to have a standard HW config for all three boxes so that they are interchangeable. The alternative would be to use the weakest box as Squid, the strongest as the primary XS, and the third as the backup.

Library Server

  • Need backup Library server that mirrors the production Library Server

NOTE: The Library Server will be in a centralized location


Internet Connection

Need some kind of commitment from local ISP for both support and service levels

Power

Kathmandu only has power for 14 hours/day. The case of no power could be covered by the school server offline case above.

Monitoring

  • Nagios for remote monitoring of Internet connection?
  • Another tool to report system usage for the school server? Zenoss?
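
Until a monitoring tool is chosen, a basic reachability check can be scripted directly. The sketch below is not Nagios or Zenoss; it is a plain TCP connection test using Python's standard library, and `is_reachable` is a hypothetical helper name.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling it against the Squid box's proxy port, or against an outside host through the DSL line, would give a rough yes/no answer that a cron job could log.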

Tony Pearson has contributed extensively to this plan.