Scripts for testing multiple XOs

From OLPC
Revision as of 14:00, 8 November 2008 by Garycmartin (talk | contribs) (Todo)

Introduction

Problem to solve: From a central machine, execute a bash script on a given list of IP addresses (of XOs). The output of the script on each XO should end up in text documents on the central machine (one per XO, titled with the XO's IP address and the name of the script).

Ideal use scenario

Given the following files...

in xo-ip-list.txt
-------------------
12.34.56.01
12.34.56.02
12.34.56.03

in run-this-script.sh
-----------------------
#!/bin/bash
ps

I should be able to run a command like this, which will remotely run run-this-script.sh on the list of XOs at the IP addresses in xo-ip-list.txt, and save the results to a folder called testresults.

mchua@master-machine:~$ ./test-multiple-xos --iplist xo-ip-list.txt --script run-this-script.sh --folder testresults

...and then see something like this, where each textfile contains the output of run-this-script.sh on the XO whose IP is in its filename.

mchua@master-machine:~$ ls testresults
12.34.56.01-run-this-script.sh.txt
12.34.56.02-run-this-script.sh.txt
12.34.56.03-run-this-script.sh.txt

Procedure so far

  1. generate a public/private key pair on your master machine
  2. copy the public key (~/.ssh/id_dsa.pub) into /home/olpc/.ssh/authorized_keys on all the XOs
  3. make sure authorized_keys is writable only by its owner (chmod g-w is usually all that's needed)
  4. then from your master you can run commands remotely as needed (ssh olpc@192.168.1.5 ps aux)
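
Steps 2 and 3 can be sketched as a small helper, shown here as one possible approach rather than a tested procedure. The names PUBKEY and install_key are made up for this example. Until the key is installed, ssh will fall back to whatever other login method the XO accepts.

```shell
#!/bin/bash
# Illustrative sketch only: PUBKEY and install_key are hypothetical names.
# Step 1 (run once on the master, not done here): ssh-keygen -t dsa
PUBKEY="${PUBKEY:-$HOME/.ssh/id_dsa.pub}"

install_key() {
    # $1 is the XO's IP address.
    # Append the master's public key on the XO (step 2), then make sure
    # only the owner can write to the directory and the file (step 3).
    ssh "olpc@$1" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys &&
        chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys' < "$PUBKEY"
}

# Usage: install_key 192.168.1.5
```

Appending with >> rather than overwriting keeps any keys that are already in authorized_keys on the XO.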

Things left to do

  • Write a harness that saves the output of the script on each XO in text documents on the central machine (one per XO, titled with the XO's IP address, name of the script, and timestamp).
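
A minimal sketch of such a harness is below. It assumes the key set-up from the procedure above is already done, and that piping the local script into a remote "bash -s" is acceptable. The function name run_on_xos and the timestamp format are choices made for this example, not an agreed design.

```shell
#!/bin/bash
# Hypothetical harness sketch: run_on_xos IPLIST SCRIPT FOLDER
run_on_xos() {
    local iplist="$1" script="$2" folder="$3" stamp ip out
    stamp=$(date +%Y%m%d-%H%M%S)      # one timestamp per pass
    mkdir -p "$folder" || return 1
    while IFS= read -r ip || [ -n "$ip" ]; do
        [ -n "$ip" ] || continue      # skip blank lines in the list
        out="$folder/$ip-$(basename "$script")-$stamp.txt"
        # Feed the local script to a remote bash. ssh reads its stdin from
        # the script file, so it cannot swallow the rest of the IP list.
        ssh "olpc@$ip" bash -s < "$script" > "$out" 2>&1
    done < "$iplist"
}

# Usage: run_on_xos xo-ip-list.txt run-this-script.sh testresults
```

This runs the XOs one after another; error output from each XO ends up in the same text file as the normal output.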

Example code

One thing you'll need to resolve is getting the list of all IP addresses. The three XOs here occasionally play tricks on me (DHCP timeout) and change addresses, so I need to clean out host keys in ~/.ssh/known_hosts from time to time. I guess you could also tell ssh to switch off its strict host key checking. Here's a really quick stab at scanning a subnet and getting the list of active IP addresses to try some other script on. --Garycmartin

for (( i=1;i<=254;i+=1 )); do ping -q -c 1 -t 1 192.168.1.${i} > /dev/null && echo 192.168.1.${i}; done
192.168.1.1
192.168.1.3
192.168.1.4
192.168.1.5
192.168.1.6
Notes: I reduced the ping time-out to 1sec before assuming no one is home, so the scan takes at most about 1sec per IP address. The -t 1 flag is the time-out on Mac OS X/BSD ping; on Linux ping, -t sets the TTL instead and the reply time-out is -W 1. If you're just testing, trying to ctrl-c out of any for loop is a pain :-) Use ctrl-z and then kill %, or wait till the script is done.
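
For the stale host keys mentioned above, here are two possible approaches, as a sketch. ssh-keygen -R removes the cached keys for one host, and the two -o options are standard OpenSSH settings that stop ssh from checking or recording host keys at all. The names forget_xo_keys and XO_SSH_OPTS are invented for this example.

```shell
#!/bin/bash
# forget_xo_keys IPLIST (hypothetical helper): drop the cached host key
# for every address in the list from ~/.ssh/known_hosts.
forget_xo_keys() {
    local ip
    while IFS= read -r ip || [ -n "$ip" ]; do
        if [ -n "$ip" ]; then
            ssh-keygen -R "$ip" > /dev/null 2>&1
        fi
    done < "$1"
    return 0
}

# Alternative: never check or remember host keys for the XOs.
XO_SSH_OPTS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
# Usage: ssh $XO_SSH_OPTS olpc@192.168.1.5 ps aux
```

Switching off host key checking also removes ssh's protection against talking to the wrong machine, so it is probably only sensible on an isolated test network.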

Once you have a list of IP addresses, add the ones you want to a file. Let's call it xo-ip-list.txt to match the above spec. Here's my example file with 3 XO IP addresses:

192.168.1.4
192.168.1.5
192.168.1.6

If you've set up the public/private keys on all the XOs, you can now run a command on each like this:

for ip in $(cat xo-ip-list.txt); do echo -n "$ip is running build "; ssh olpc@${ip} cat /boot/olpc_build; done
192.168.1.4 is running build 767
192.168.1.5 is running build 767
192.168.1.6 is running build 767

To take this a step further and store the output of each command in a file, it just needs an output redirect to a suitable file name. In the example below I'll just use the IP address as the filename for each machine's results, but it may be more sensible to use an invariant name (perhaps the XO serial number or MAC address), as XO IP addresses will change over time:

for ip in $(cat xo-ip-list.txt); do ssh olpc@${ip} 'echo -n "Running build "; cat /boot/olpc_build' > $ip; done
Notes: Both the echo and the cat commands are run on each XO, which makes it easier to redirect both outputs at once.
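
As a sketch of the invariant-name idea, the helper below first asks each XO for its serial number (read from /ofw/mfg-data/SN, the same path used in the stats example on this page) and uses that as the filename. The name log_build_by_serial and the fall-back to the IP address are assumptions made for this example.

```shell
#!/bin/bash
# log_build_by_serial IPLIST OUTDIR (hypothetical helper)
log_build_by_serial() {
    local iplist="$1" outdir="$2" ip sn
    mkdir -p "$outdir" || return 1
    for ip in $(cat "$iplist"); do
        # Fetch the serial number and strip anything that is not a letter
        # or digit, so the result is always safe to use as a filename.
        sn=$(ssh "olpc@$ip" cat /ofw/mfg-data/SN | tr -cd 'A-Za-z0-9')
        [ -n "$sn" ] || sn="$ip"    # fall back to the IP if the read failed
        ssh "olpc@$ip" 'echo -n "Running build "; cat /boot/olpc_build' > "$outdir/$sn"
    done
}

# Usage: log_build_by_serial xo-ip-list.txt results
```

This costs one extra ssh connection per XO, which may matter on a large testbed.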

The scripts so far all run in serial, from one machine to the next, waiting for each result, so if you have a lot of machines this can get slow. A simple change will allow all XOs to be contacted at once and run in parallel, but be aware that this may busy the network more than you want, especially if you are actually trying to measure network usage or problems related to network congestion. I'd need to try it on a testbed of 100 XOs to see; it's hard to test with just 3 XOs. If you find the network load is too much (it also depends on the script's output size), I'd recommend breaking your list of IP addresses up into several ranges, and running parallel commands on each block. Here's the parallel running example:

for ip in $(cat xo-ip-list.txt); do ssh olpc@${ip} 'echo -n "Running build "; cat /boot/olpc_build' > $ip& done
Notes: When you run the above, you'll see a job number displayed for each of the parallel jobs. Give them 5-10sec to complete and just tap return; bash should then output the completed job numbers. A simple modification would be to replace the output redirect > with an append >> so that you can collect data over multiple passes.
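
The "blocks of addresses" idea can be sketched with the shell's built-in wait, which pauses until all background jobs have finished. This is an untested-at-scale sketch; the name run_in_batches and the batch size are arbitrary choices for the example.

```shell
#!/bin/bash
# run_in_batches IPLIST BATCH_SIZE (hypothetical helper)
# Results are appended to a file named after each IP in the current directory.
run_in_batches() {
    local iplist="$1" size="$2" n=0 ip
    for ip in $(cat "$iplist"); do
        ssh "olpc@$ip" 'echo -n "Running build "; cat /boot/olpc_build' >> "$ip" &
        n=$((n + 1))
        # After each full batch, block until every job in it has finished.
        if [ $((n % size)) -eq 0 ]; then
            wait
        fi
    done
    wait    # pick up the final, possibly partial, batch
}

# Usage: run_in_batches xo-ip-list.txt 10
```

Using wait also removes the need to guess how long the jobs take before looking at the results.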

Here's a more complicated example collecting and appending a range of stats for each XO:

for ip in $(cat xo-ip-list.txt); do ssh olpc@${ip} 'echo `date;
cat /ofw/mfg-data/SN; echo " ";
cat /ofw/mfg-data/WM; echo " ";
cat /boot/olpc_build;
cat /proc/loadavg | sed "s/^\([0-9.]* [0-9.]* [0-9.]*\).*/\1/";
free | grep "buffers/cache" | sed "s/.*: *\([0-9]*\) *\([0-9]*\).*/\1 \2/";
avahi-browse -t _presence._tcp | grep eth0 | wc -l;
avahi-browse -t _presence._tcp | grep msh0 | wc -l;
nm-tool | sed "1,/.*Wireless Networks/d; /^$/,//d" | wc -l`' >> $ip& done
Notes: This can all be entered on one line, but I've put it on separate lines to make the wiki easier to read; either way works fine when copy/pasting into a console. The echo `<command1>; <command2>` trick turns every new-line in the combined output into a single space, so each file gets just one extra line of data per cycle. That makes the files more parse-able (in case you want to load them into a spreadsheet and graph some of the values).
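
A tiny local demonstration of the echo trick, with no XO needed. The function multi_line is just a stand-in for the list of remote commands.

```shell
#!/bin/bash
# multi_line stands in for "<command1>; <command2>; ..." (hypothetical name)
multi_line() { echo one; echo "two  words"; echo three; }

multi_line             # prints three separate lines
echo `multi_line`      # unquoted: prints "one two words three" on one line
echo "`multi_line`"    # quoted: the new-lines are kept
```

The unquoted form works because the shell splits the output into words and echo joins them with single spaces. The same splitting also collapses runs of spaces and expands any wildcard characters in the output, so it is best kept to simple numeric or single-word fields.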

The fields collected are:

Day
|   Month
|   |   Date
|   |   | Time
|   |   | |        TZ
|   |   | |        |   Year
|   |   | |        |   |    XO_serial
|   |   | |        |   |    |           XO_mac_address
|   |   | |        |   |    |           |                 XO_build
|   |   | |        |   |    |           |                 |   1min_load_avg
|   |   | |        |   |    |           |                 |   |    5min_load_avg
|   |   | |        |   |    |           |                 |   |    |    15min_load_avg
|   |   | |        |   |    |           |                 |   |    |    |    used_kb
|   |   | |        |   |    |           |                 |   |    |    |    |      free_kb
|   |   | |        |   |    |           |                 |   |    |    |    |      |     AP_buddies
|   |   | |        |   |    |           |                 |   |    |    |    |      |     | Mesh_buddies
|   |   | |        |   |    |           |                 |   |    |    |    |      |     | | Visible_APs
|   |   | |        |   |    |           |                 |   |    |    |    |      |     | | |
Sat Nov 8 04:32:48 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.30 0.18 0.12 151568 84180 4 3 10
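
For loading into a spreadsheet, the field layout above can be turned into CSV with a little awk. This is a sketch that assumes every good sample has exactly the 17 space-separated fields shown; lines with any other field count (for example when one of the remote commands failed) are skipped. The name xo_log_to_csv and the column headings are made up for this example.

```shell
#!/bin/bash
# xo_log_to_csv FILE... (hypothetical helper): print the logs as CSV
xo_log_to_csv() {
    echo "timestamp,serial,mac,build,load1,load5,load15,used_kb,free_kb,ap_buddies,mesh_buddies,visible_aps"
    awk 'NF == 17 {
        # Fields 1-6 are the output of date; keep them as one column.
        printf "%s %s %s %s %s %s", $1, $2, $3, $4, $5, $6
        # The remaining fields each become their own column.
        for (i = 7; i <= NF; i++) printf ",%s", $i
        printf "\n"
    }' "$@"
}

# Usage: xo_log_to_csv 192.168.1.4 192.168.1.5 192.168.1.6 > all-xos.csv
```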

Example results gathered over several runs on 3 XOs are shown below:

cat 192.168.1.4
Sat Nov 8 04:32:48 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.30 0.18 0.12 151568 84180 4 3 10
Sat Nov 8 04:34:42 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.29 0.20 0.12 151588 84160 4 3 10
Sat Nov 8 04:36:41 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.33 0.26 0.15 151588 84160 4 3 10
Sat Nov 8 04:37:22 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.36 0.27 0.16 151624 84124 4 3 10
Sat Nov 8 04:40:18 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.13 0.17 0.14 147836 87912 4 3 10
Sat Nov 8 04:42:13 GMT 2008 CSN7480303E 00-17-C4-10-B0-DA 767 0.16 0.16 0.13 147896 87852 4 3 10

cat 192.168.1.5
Sat Nov 8 04:32:56 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.09 0.05 0.01 117976 117780 4 3 10
Sat Nov 8 04:34:50 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.08 0.06 0.01 117976 117780 4 3 10
Sat Nov 8 04:36:49 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.01 0.04 0.00 114124 121632 4 3 10
Sat Nov 8 04:37:30 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.00 0.03 0.00 114184 121572 4 3 10
Sat Nov 8 04:40:26 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.42 0.12 0.04 114244 121512 4 3 10
Sat Nov 8 04:42:21 GMT 2008 SHF72500672 00-17-C4-05-24-02 767 0.10 0.10 0.04 114244 121512 4 3 10

cat 192.168.1.6
Sat Nov 8 04:33:00 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.10 0.10 0.03 122836 112916 4 3 10
Sat Nov 8 04:34:54 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.16 0.10 0.03 122836 112916 4 3 10
Sat Nov 8 04:36:53 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.02 0.06 0.02 122832 112920 4 3 10
Sat Nov 8 04:37:34 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.09 0.07 0.02 122832 112920 4 3 10
Sat Nov 8 04:40:30 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.00 0.03 0.00 122832 112920 4 3 10
Sat Nov 8 04:42:24 GMT 2008 CSN7470154A 00-17-C4-0C-E6-BB 767 0.07 0.03 0.00 122832 112920 4 3 10
Note: A curiosity... you can see from the logs above (the third-to-last and second-to-last numbers, for eth0 buddies seen and msh0 buddies seen) that the 767 build is NOT correctly disabling the msh0 network chatter while connected to an AP. The eth0 count of 4 is for the 3 XOs and a Mac running Bonjour; the msh0 count of 3 is just the 3 XOs seeing each other via the mesh (this should be 0, as the mesh should be off when connected to an AP). You can simulate the correct behaviour by running "sudo ifconfig msh0 down"; the log will immediately show zero msh0 buddies, and the XO Neighbourhood view will slowly time out any mesh-only buddies over the next ~30min.
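
A quick way to spot this in the collected logs is to look at the msh0 buddy count (field 16 in the layout above) of the newest sample in each file. The helper name mesh_leaks is invented for this sketch.

```shell
#!/bin/bash
# mesh_leaks FILE... (hypothetical helper): for each log file, print
# "file serial msh0_buddies" if the newest sample still sees mesh buddies.
mesh_leaks() {
    local f
    for f in "$@"; do
        tail -n 1 "$f" | awk -v name="$f" 'NF == 17 && $16 > 0 { print name, $7, $16 }'
    done
}

# Usage: mesh_leaks 192.168.1.4 192.168.1.5 192.168.1.6
```

An XO that has had msh0 taken down should drop out of this list on the next collection pass.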

Todo

  • Get agreement on a useful set of data to collect (need input from at least Mel & Joe).
  • Implement collection of agreed data set.
  • Put final set of commands into a single batch file for ease of use.
  • Test and harden batch file against expected operational errors.
    • Missing/off-line XOs.
    • Network errors/time-outs.
  • Decide on use requirements for continuous monitoring.
    • Could initially be a simple sleep loop?
    • Add as a cron job?
  • Test system for a few days at least to iron out any issues.
  • Explore some tools for analysis.
    • Basic graphing and numerical analysis via spread-sheet?
    • Auto-generate daily/weekly charts (Python/PIL type code)?
    • Occasional SOM maps of the interesting vectors?
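
As a starting point for the hardening and sleep-loop items, here is a sketch of one collection pass that survives off-line XOs. BatchMode and ConnectTimeout are standard OpenSSH options; the function name poll_xos, the log name failed-xos.log, the 5 second time-out and the 120 second interval are all arbitrary choices for this example, and only the date and build number are collected to keep it short.

```shell
#!/bin/bash
# poll_xos IPLIST (hypothetical helper): one pass over all XOs.
poll_xos() {
    local iplist="$1" ip
    for ip in $(cat "$iplist"); do
        # BatchMode=yes: fail instead of prompting if the key is not accepted.
        # ConnectTimeout=5: give up on an off-line XO after 5 seconds.
        if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "olpc@$ip" \
                'echo `date; cat /boot/olpc_build`' >> "$ip" 2> /dev/null; then
            # Record the failure and carry on with the next XO.
            echo "$(date) $ip unreachable" >> failed-xos.log
        fi
    done
}

# Continuous monitoring as a simple sleep loop:
#   while true; do poll_xos xo-ip-list.txt; sleep 120; done
```

ConnectTimeout only covers setting up the connection, so a remote command that hangs after connecting would still stall the pass.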