Scripts for testing multiple XOs

From OLPC
Revision as of 17:25, 14 November 2008 by Garycmartin (talk | contribs) (host tricks)

Introduction

Problem to solve: From a central machine, execute a bash script on a given list of IP addresses (of XOs). The output of the script on each XO should end up in text files on the central machine (one file per XO, named after the XO's IP address and the name of the script).

Ideal use scenario

Given the following files...

in xo-ip-list.txt
-------------------
12.34.56.01
12.34.56.02
12.34.56.03

in run-this-script.sh
-----------------------
#!/bin/bash
ps

I should be able to run a command like this, which will remotely run run-this-script.sh on the list of XOs at the IP addresses in xo-ip-list.txt, and save the results to a folder called testresults.

mchua@master-machine:~$ ./test-multiple-xos --iplist xo-ip-list.txt --script run-this-script.sh --folder testresults

...and then see something like this, where each textfile contains the output of run-this-script.sh on the XO whose IP is in its filename.

mchua@master-machine:~$ ls testresults
12.34.56.01-run-this-script.sh.txt
12.34.56.02-run-this-script.sh.txt
12.34.56.03-run-this-script.sh.txt
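
A minimal sketch of how such a helper could look, assuming the key-based login described further down this page is already working. The function name run_on_xos is made up for illustration, and it takes three positional arguments rather than the --iplist/--script/--folder flags shown above:

```shell
#!/bin/bash
# Hypothetical helper: run a local script on every XO in an IP list.
# Usage: run_on_xos <ip-list-file> <script-file> <results-folder>
run_on_xos() {
    local iplist=$1 script=$2 folder=$3 ip
    mkdir -p "$folder" || return 1
    for ip in $(cat "$iplist"); do
        # "bash -s" on the XO reads the script from stdin, so nothing
        # needs to be copied over first. Output is saved locally as
        # <ip>-<script name>.txt
        ssh "olpc@$ip" bash -s < "$script" \
            > "$folder/$ip-$(basename "$script").txt"
    done
}
```

After sourcing the file, "run_on_xos xo-ip-list.txt run-this-script.sh testresults" should leave one text file per XO in testresults.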

Initial set-up requirements

A secure method of authentication is the main initial hurdle for remote management:

  1. generate a public/private key pair on your master machine (e.g. with ssh-keygen)
  2. copy the public key (~/.ssh/id_dsa.pub) onto all the XOs (append it to /home/olpc/.ssh/authorized_keys)
  3. make sure authorized_keys is writable only by its owner (chmod g-w is usually all that's needed)
  4. then from your master you can remotely run commands as needed (ssh olpc@192.168.1.5 ps aux)

You can now access all XOs remotely from a machine with the private key.
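
The steps above can be sketched roughly as below. The helper name push_key and the PUBKEY variable are assumptions for illustration. The key pair is assumed to exist already (e.g. created with ssh-keygen -t dsa, the key type used on this page; newer OpenSSH releases may reject DSA keys, in which case point PUBKEY at another key type). This first connection cannot use the key yet, so it relies on whatever login method the XO currently accepts.

```shell
#!/bin/bash
# Assumption: the public key lives at the path used on this page.
PUBKEY=${PUBKEY:-$HOME/.ssh/id_dsa.pub}

# Hypothetical helper: append the master's public key to one XO's
# authorized_keys and remove group/other access to it.
push_key() {
    local ip=$1
    ssh "olpc@$ip" 'mkdir -p ~/.ssh && chmod 700 ~/.ssh &&
        cat >> ~/.ssh/authorized_keys &&
        chmod 600 ~/.ssh/authorized_keys' < "$PUBKEY"
}
```

Calling "push_key 192.168.1.5" once per XO should be enough; afterwards "ssh olpc@192.168.1.5 ps aux" should run without a password prompt.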

Note that the first time you try to connect to a remote XO host, you'll be asked to manually acknowledge (yes/no) the 'authenticity' of the host. This is fine if you just have a couple of XOs to remotely manage, but will be a real pain if you have 50 or 100 XOs to access and/or your DHCP lease addresses time out and your XOs get given new IP addresses (you'll need to clean up your ~/.ssh/known_hosts and then re-acknowledge each XO's authenticity). There are a couple of useful commands for scripting around this: ssh-keygen -R <ip_address> removes a stale host entry from your ~/.ssh/known_hosts file, and ssh-keyscan <ip_address> fetches the public host key currently presented by the given address. A script towards the end of this page demonstrates automatically cleaning and repopulating ~/.ssh/known_hosts from a given list of trusted IP addresses.

Example code

One thing you'll need to resolve is getting the list of all IP addresses. The three XOs here occasionally play tricks on me (DHCP timeout) and change addresses, so I need to clean out host keys in ~/.ssh/known_hosts from time to time. I guess you could also tell ssh to switch off its strict host checking. Here's a really quick stab at scanning a subnet and getting the list of active IP addresses to try some other script on. --Garycmartin

for (( i=1;i<=254;i+=1 )); do ping -q -c 1 -t 1 192.168.1.${i} > /dev/null && echo 192.168.1.${i}; done
192.168.1.1
192.168.1.3
192.168.1.4
192.168.1.5
192.168.1.6
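Notes: I reduced the ping time-out to 1sec before assuming no one is home, so the scan takes at most about 1sec per unanswered address (roughly 4 minutes for a full /24 if most addresses are empty). The -t 1 time-out flag is the BSD/Mac OS X form; on Linux, ping takes the reply time-out as -W (there -t sets the TTL instead). If you're just testing, trying to ctrl-c out of any for loop is a pain :-) use ctrl-z and then kill %, or wait until the script is done.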
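
If the serial scan is too slow, a parallel variant is one option. This is only a sketch: scan_subnet is a made-up name, and the choice of time-out flag by uname is an assumption (-W on Linux, -t elsewhere, as on BSD/Mac OS X).

```shell
#!/bin/bash
# Hypothetical helper: ping every host in a /24 in parallel and print
# the addresses that answer, sorted by last octet.
# Usage: scan_subnet 192.168.1
scan_subnet() {
    local prefix=$1 i timeout_flag
    # Assumption: Linux ping takes the reply time-out as -W,
    # BSD/Mac OS X ping takes it as -t.
    case "$(uname)" in
        Linux) timeout_flag=-W ;;
        *)     timeout_flag=-t ;;
    esac
    for i in $(seq 1 254); do
        # Each ping runs in its own background subshell; sort only
        # finishes once all of them have exited.
        ( ping -q -c 1 "$timeout_flag" 1 "$prefix.$i" > /dev/null 2>&1 \
            && echo "$prefix.$i" ) &
    done | sort -t . -k 4 -n
}
```

Because all 254 pings are in flight at once, the whole scan should finish in a little over the 1sec time-out, at the cost of a short burst of network traffic.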

Once you have a list of IP addresses, add the ones you want to a file. Let's call it xo-ip-list.txt to match the above spec. Here's my example file with 3 XO IP addresses:

cat xo-ip-list.txt
192.168.1.4
192.168.1.5
192.168.1.6

If you've set-up the public/private keys on all the XOs, you can now run a command on each like this:

for ip in $(cat xo-ip-list.txt); do echo -n "$ip is running build "; ssh olpc@${ip} cat /boot/olpc_build; done
192.168.1.4 is running build 767
192.168.1.5 is running build 767
192.168.1.6 is running build 767

To take this a step further and store the output of each command in a file, all that's needed is an output redirect to a suitable file name. In the example below I just use the IP address as the filename for each machine's results, but it may be more sensible to use an invariant name (perhaps the XO serial number or MAC address), as XO IP addresses will change over time:

for ip in $(cat xo-ip-list.txt); do ssh olpc@${ip} 'echo -n "Running build "; cat /boot/olpc_build' > $ip; done
Notes: Both the echo and the cat commands are run on each XO; this makes it easier to redirect both outputs at once.
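
As a sketch of the invariant-name idea, the loop below asks each XO for its serial number first and uses that as the filename. The helper name collect_by_serial is made up; /ofw/mfg-data/SN is the same path used by the stats example further down this page, and the tr step is a guess at cleaning up any stray bytes around the serial.

```shell
#!/bin/bash
# Hypothetical helper: store each XO's output under its serial number
# instead of its current IP address. Files go in the current directory.
collect_by_serial() {
    local iplist=$1 ip sn
    for ip in $(cat "$iplist"); do
        # Keep only letters and digits from the serial number
        sn=$(ssh "olpc@$ip" cat /ofw/mfg-data/SN | tr -cd 'A-Za-z0-9')
        if [ -z "$sn" ]; then
            echo "no serial from $ip, skipping" >&2
            continue
        fi
        ssh "olpc@$ip" 'echo -n "Running build "; cat /boot/olpc_build' > "$sn"
    done
}
```

This costs one extra ssh connection per XO, but the result files keep their names when DHCP hands out new addresses.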

The scripts so far all run in serial, from one machine to the next, waiting for each result, so if you have a lot of machines this can get slow. A simple change allows all XOs to be contacted at once and run in parallel, but be aware that this may load the network more than you want, especially if you are actually trying to measure network usage or problems related to network congestion. I'd need to try it on a testbed of 100 XOs to see; it's hard to test with just 3 XOs. If you find the network load is too much (it also depends on the script's output size), I'd recommend breaking your list of IP addresses up into several ranges, and running parallel commands on each block. Here's the parallel running example:

for ip in $(cat xo-ip-list.txt); do ssh olpc@${ip} 'echo -n "Running build "; cat /boot/olpc_build' > $ip& done
Notes: When you run the above, you'll see a job number displayed for each of the parallel jobs. Give them 5-10sec to complete and just tap return; bash should then output the completed job numbers. A simple modification would be to replace the output redirect > with an append >> so that you can collect data over multiple passes.
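
One way to break the list into blocks, as suggested above, is to start a fixed number of ssh jobs and wait for them before starting the next batch. This is an untested-at-scale sketch; run_in_blocks and its default block size of 10 are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: contact the XOs in blocks of $2 at a time
# (default 10), waiting for each block to finish before the next starts.
# Usage: run_in_blocks xo-ip-list.txt 10
run_in_blocks() {
    local iplist=$1 block=${2:-10} count=0 ip
    for ip in $(cat "$iplist"); do
        ssh "olpc@$ip" 'echo -n "Running build "; cat /boot/olpc_build' \
            >> "$ip" &
        count=$((count + 1))
        if [ $((count % block)) -eq 0 ]; then
            wait        # let the current block finish
        fi
    done
    wait                # wait for the final, possibly partial, block
}
```

Note that wait pauses for every background job started from the same shell, so it is best run from a shell with no other jobs in the background.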

Here's a more complicated example collecting and appending a range of stats for each XO:

for ip in $(cat xo-ip-list.txt); do ssh olpc@${ip} 'echo `date;
cat /ofw/mfg-data/SN; echo " ";
cat /ofw/mfg-data/WM; echo " ";
cat /boot/olpc_build;
cat /proc/loadavg | sed "s/^\([0-9.]* [0-9.]* [0-9.]*\).*/\1/";
free | grep "buffers/cache" | sed "s/.*: *\([0-9]*\) *\([0-9]*\).*/\1 \2/";
avahi-browse -t _presence._tcp | grep eth0 | wc -l;
avahi-browse -t _presence._tcp | grep msh0 | wc -l;
nm-tool | sed "1,/.*Wireless Networks/d; /^$/,//d" | wc -l`' >> $ip& done
Notes: This can all be entered on one line, but I've put it on separate lines to make the wiki easier to read; either way works fine when copy/pasting into a console. The echo `<command1>; <command2>` trick turns the new-line characters in the combined output into single spaces, so each file gets just one extra line of data added per cycle. This makes the files more parse-able (in case you want to load them into a spreadsheet and graph some of the values).

The fields collected are:

Day
|   Month
|   |   Date
|   |   | Time
|   |   | |        TZ
|   |   | |        |   Year
|   |   | |        |   |    XO_serial
|   |   | |        |   |    |           XO_mac_address
|   |   | |        |   |    |           |                 XO_build
|   |   | |        |   |    |           |                 |   1min_load_avg
|   |   | |        |   |    |           |                 |   |    5min_load_avg
|   |   | |        |   |    |           |                 |   |    |    15min_load_avg
|   |   | |        |   |    |           |                 |   |    |    |    used_kb
|   |   | |        |   |    |           |                 |   |    |    |    |      free_kb
|   |   | |        |   |    |           |                 |   |    |    |    |      |     AP_buddies
|   |   | |        |   |    |           |                 |   |    |    |    |      |     | Mesh_buddies
|   |   | |        |   |    |           |                 |   |    |    |    |      |     | | Visible_APs
|   |   | |        |   |    |           |                 |   |    |    |    |      |     | | |
Sat Nov 8 04:32:48 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.30 0.18 0.12 151568 84180 4 3 10
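
For loading into a spreadsheet, the whitespace-separated fields can be picked out by position. The sketch below assumes exactly the 17 fields shown in the diagram above; log_to_csv and the column names in the header row are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: turn one XO's log file into CSV with a header row.
# Field numbers follow the diagram above: 4=time, 7=serial, 10=1min load,
# 13=used_kb, 14=free_kb, 15=AP buddies, 16=mesh buddies, 17=visible APs.
log_to_csv() {
    echo "time,serial,load_1min,used_kb,free_kb,ap_buddies,mesh_buddies,visible_aps"
    # Lines without exactly 17 fields (e.g. from a failed run) are skipped
    awk 'NF == 17 { print $4 "," $7 "," $10 "," $13 "," $14 "," $15 "," $16 "," $17 }' "$1"
}
```

For example, "log_to_csv 192.168.1.4 > 192.168.1.4.csv" writes a file most spreadsheets can open directly.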

Example results gathered over several runs on 3 XOs are shown below:

cat 192.168.1.4
Sat Nov 8 04:32:48 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.30 0.18 0.12 151568 84180 4 3 10
Sat Nov 8 04:34:42 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.29 0.20 0.12 151588 84160 4 3 10
Sat Nov 8 04:36:41 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.33 0.26 0.15 151588 84160 4 3 10
Sat Nov 8 04:37:22 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.36 0.27 0.16 151624 84124 4 3 10
Sat Nov 8 04:40:18 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.13 0.17 0.14 147836 87912 4 3 10
Sat Nov 8 04:42:13 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.16 0.16 0.13 147896 87852 4 3 10

cat 192.168.1.5
Sat Nov 8 04:32:56 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.09 0.05 0.01 117976 117780 4 3 10
Sat Nov 8 04:34:50 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.08 0.06 0.01 117976 117780 4 3 10
Sat Nov 8 04:36:49 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.01 0.04 0.00 114124 121632 4 3 10
Sat Nov 8 04:37:30 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.00 0.03 0.00 114184 121572 4 3 10
Sat Nov 8 04:40:26 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.42 0.12 0.04 114244 121512 4 3 10
Sat Nov 8 04:42:21 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.10 0.10 0.04 114244 121512 4 3 10

cat 192.168.1.6
Sat Nov 8 04:33:00 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.10 0.10 0.03 122836 112916 4 3 10
Sat Nov 8 04:34:54 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.16 0.10 0.03 122836 112916 4 3 10
Sat Nov 8 04:36:53 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.02 0.06 0.02 122832 112920 4 3 10
Sat Nov 8 04:37:34 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.09 0.07 0.02 122832 112920 4 3 10
Sat Nov 8 04:40:30 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.00 0.03 0.00 122832 112920 4 3 10
Sat Nov 8 04:42:24 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.07 0.03 0.00 122832 112920 4 3 10
Note: A curiosity... you can see from the logs above (the third- and second-to-last numbers on each line, for eth0 buddies seen and msh0 buddies seen respectively) that the 767 build is NOT correctly disabling the msh0 network chatter while connected to an AP. The eth0 count of 4 is for the 3 XOs and a Mac running Bonjour; the msh0 count of 3 is just the 3 XOs seeing each other via the mesh (this should be 0, as the mesh should be off when connected to an AP). You can simulate the correct behaviour by running "sudo ifconfig msh0 down": the log will immediately show zero msh0 buddies, and the XO Neighbourhood view will slowly time out any mesh-only buddies over the next ~30min.
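
A quick way to spot this in the collected logs is to look at the msh0 buddy count on the most recent line of each file. This is a sketch only; mesh_still_active is a made-up name, and it assumes the 17-field log format described earlier.

```shell
#!/bin/bash
# Hypothetical helper: for each log file given, report it if the most
# recent entry still shows msh0 (mesh) buddies, i.e. field 16 is non-zero.
# Usage: mesh_still_active 192.168.1.4 192.168.1.5 192.168.1.6
mesh_still_active() {
    local f
    for f in "$@"; do
        tail -n 1 "$f" | awk -v name="$f" \
            'NF == 17 && $16 > 0 { print name ": " $16 " msh0 buddies" }'
    done
}
```

After running "sudo ifconfig msh0 down" on an XO and collecting another pass, that XO's log should no longer be listed.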

Host authentication tricks

If you have many XOs to initially authenticate with, or you have a DHCP network that frequently reassigns IP addresses (often an XO will get a new IP after a reboot), the example script below reads the addresses in xo-ip-list.txt, removes any existing host keys you may already have for them, and fetches the current host key for each. This automates the cleanup of the "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!" message you will get when IP addresses change, and is useful if you want to quickly accept the host keys of many XOs all at once when you first start to manage them.

for ip in $(cat xo-ip-list.txt); do ssh-keygen -R $ip; ssh-keyscan $ip >> ~/.ssh/known_hosts; done
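
The same idea as a reusable function, with a time-out so that an off-line XO does not stall the loop for long. This is a sketch: refresh_host_keys is a made-up name, the 5sec time-out is an arbitrary choice, and it assumes every address in the list belongs to a machine you trust.

```shell
#!/bin/bash
# Hypothetical helper: drop stale host keys for every listed address and
# record whatever key each address presents now.
# Usage: refresh_host_keys xo-ip-list.txt [known_hosts file]
refresh_host_keys() {
    local iplist=$1 known=${2:-$HOME/.ssh/known_hosts} ip
    for ip in $(cat "$iplist"); do
        # -R removes existing entries for this address from the file
        ssh-keygen -R "$ip" -f "$known" > /dev/null 2>&1
        # -T 5: give up on an unreachable address after 5 seconds
        ssh-keyscan -T 5 "$ip" >> "$known" 2> /dev/null
    done
}
```

Passing a second argument lets you try it against a scratch file before touching your real ~/.ssh/known_hosts.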

Todo

  • Get agreement on a useful set of data to collect (need input from at least Mel & Joe).
  • Implement collection of agreed data set.
  • Put final set of commands into a single batch file for ease of use.
  • Test and harden batch file against expected operational errors.
    • Missing/off-line XOs.
    • Network errors/time-outs.
  • Decide on use requirements for continuous monitoring.
    • Add as a cron job?
    • Could initially be a simple sleep loop, given a script called "xo_batch_test", this would work fine:
while :; do ./xo_batch_test; sleep 60; done
  • Test system for a few days at least to iron out any issues.
  • Explore some tools for analysis.
    • Basic graphing and numerical analysis via spread-sheet?
    • Auto-generate daily/weekly charts (Python/PIL type code)?
    • Occasional SOM maps of the interesting vectors?
    • Auto-publish useful core data/graphs to wiki page for public access?
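
As a starting point for the hardening items above, here is a sketch of what the body of an xo_batch_test script might look like. The function body, the errors.log file name, and the 5sec connect time-out are all assumptions, not an agreed design; only the build number is collected, to keep the example short.

```shell
#!/bin/bash
# Sketch of a hypothetical xo_batch_test body that tolerates off-line XOs.
# BatchMode=yes makes ssh fail instead of prompting for input;
# ConnectTimeout=5 bounds the wait for an unreachable address.
xo_batch_test() {
    local iplist=$1 ip
    for ip in $(cat "$iplist"); do
        if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "olpc@$ip" \
                cat /boot/olpc_build >> "$ip" 2>> errors.log; then
            # Record the failure and carry on with the next XO
            echo "$(date) $ip unreachable or command failed" >> errors.log
        fi
    done
}
```

Combined with the sleep loop from the list above, this becomes "while :; do xo_batch_test xo-ip-list.txt; sleep 60; done", with failures accumulating in errors.log rather than stopping the run.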