User:Mstone/Commentaries/Releases 1
Chris Ball, Marco Pesenti Gritti, Michael Stone, Tomeu Vizoso, and several others conducted a 2-hour planning session this morning. I've created a transcript of that discussion, reproduced below. If you're interested, please review the questions that were raised and contribute your thoughts (preferably inline in this document).
The end goal of this effort is a convincing written statement of where we want to go in the next four months, why we want to go there, and how we intend to get there.
Questions
What are our goals for the next four months
- Some advocate trying to "fix the network."
- Some advocate trying to "stabilize the UI, particularly the Journal and the Datastore."
- Some think we should deliver working power management, even if we can only do so in limited circumstances.
Should we divide our effort among several goals or not?
- We could pour labor into our most critical goal until we reach the point of diminishing returns, then let the remainder spill over to the second goal, and so on.
- Alternately, we can try to make sure that our top three goals all receive enough effort to achieve limited gains.
- Alternately, we can try to make sure that all of our subsystems receive at least minimal maintenance.
How can we get better information with which to make our decision?
The team should go to a pilot site and see for themselves how the XOs and Sugar are being used. There is a limit to how much information the deployment teams can communicate back to the developers. Direct customer/developer interaction is an essential aspect of agile software development. And we are all agile, right? Berrybw 22:48, 10 April 2008 (EDT)
Annotated Transcript
Initial Unhappiness
m_stone> I take it that you guys are concerned that you'll have sugar in a releasable state say around mid-June or early July? and that you'll be unable to make big changes for the remainder of the 1-1.5 months until the release? tomeu> well, I think the issue is not releasing something like last releases; something that more or less works but having a better API and based on components that are more predictable tomeu> "something that more or less works" == "something that passes the minimal QA work that has been done but that shows critical problems after being deployed" - we are not happy with how new features have been rolled out
m_stone> tomeu: keep explaining, please. I really want to understand what things look like from your vantage point. (incidentally, that sounds more like a QA failure rather than a software team failure to me.)
tomeu> m_stone: building on an orphaned component like the DS is not very nice, either. well, project issues that needs to be solved - we cannot blame QA because it has been composed by interns cjb> and is currently composed of no-one :) tomeu> we already complained last summer about interns leaving just when they started being useful. Also, I'm not happy with the parts of the sugar API that I have contributed most, either
Present Unhappiness
m_stone> tomeu: help me understand how this affects the work you think we should undertake to do for August 1 or the next release beyond. tomeu> m_stone: we want to have a software architecture more resilient to bugs, more maintainable, and want to give activity authors a better API. nothing of that are new features from the POV of the project manager, I think. just dedicate time to it ;). Then discuss, discuss, discuss, then implement, deploy.
tomeu> so well, we have a problem deciding on what to spend time - every few months we are promised a new DS. the activity API depends on that so it's not evolving and really sucks because depends on important implementation things like storing entries with delta compression. so well, we would like to better plan things, and spend effort in invisible features
m_stone> tomeu: cscott, cjb, and I spent several minutes yesterday debating whether cscott should try to fix networking things or DS things. (and consequently whether cjb should fix networking things or power things). (and whether the kernel folks should fix power things or ???). (and whether I should concentrate on collaboration or stability/compatibility things). we have two plans and we haven't figured out yet which one is better.
tomeu> m_stone: right, that's also our problem. I was supposed to work in the journal. but had to choose between sugar, read, browse, ds, etc. at different points and performance
How should we allocate effort?
m_stone> marcopg_: plan a) is cscott, cjb, & kernel folks on networking, mstone & collabora on collaboration, and sugar on ??? where ??? is probably the as yet unspecified "UI features that are already in the bag + any accessible stability". the other plan is cscott on DS, cjb on limited cases of power, kernel folks on power, mstone on ???, and sugar on ??? where ??? is the same as above. cjb> I think there's a third plan of cscott, mstone and collabora on networking, cjb and kernel on power, sugar on ???. marcopg_> I'm not sure I understand why we are going through the two extremes about networking marcopg_> we certainly need someone on it, but I doubt putting the whole team on it will help
m_stone> I'll try to answer marcopg's question and I'll try to explain the other flaws here. first flaw: someone has to manage the release. at present, I'm the default candidate (though feel free to suggest others). that's not immediately a full time job. but it will turn into one and we can't forget it. also, effort will need to shift in the second half of the cycle toward making it happen. i.e. the release manager will need to impose on erikos, or cjb, or... :)
m_stone> second problem: there is lots of debate about how many people can profitably work on networking, on power, on collaboration, on UI, on "stability", or on "compatibility". in part, debate exists because there are different approaches to each of these problems.
Power Management
m_stone> i.e. one way to do power is to make it work in limited circumstances. I think of this as the "low-hanging fruit" approach. It mainly requires cjb + some kernel, we think. marcopg_> some kernel == not full time kernel person? cjb> Yeah. I'd be surprised if we can't come up with an acceptable limited/incremental use of power that takes less than a day to implement. I'll try to write something about this. For example, we don't do power management if you're currently connected to a jabber server, else we do.
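The limited policy cjb gives as an example can be written down as a single predicate. This is only a sketch: the function name and its boolean input are hypothetical, and how the connection state would actually be detected is not specified anywhere in this discussion.

```python
def should_suspend(connected_to_jabber: bool) -> bool:
    """Limited power-management policy from cjb's example.

    Hypothetical helper: power management is allowed only while the
    laptop is NOT connected to a jabber server.
    """
    return not connected_to_jabber


if __name__ == "__main__":
    # Show the decision for both possible connection states.
    for connected in (True, False):
        print("connected:", connected, "suspend:", should_suspend(connected))
```

The point of such a rule is that it is cheap to implement and easy to reason about, at the cost of saving no power in the connected case.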
m_stone> the other way to deal with power (over a longer time frame) is to comprehend the networking stack (so that we know what timeouts & wakeups are needed), to fix the kernel to make those timeouts & wakeups occur, then to do more userland power management. similarly for networking. cjb> (There are still the timeouts. It's hard to know how to tackle them, when no particular misbehavior has ever been reported due to them.)
Networking
m_stone> one way to approach networking is to write down a believable networking architecture, then to follow through. (where a believable networking architecture would contain a small number of protocols that "stretch" to cover all of our present use cases). the other way is to try to follow the critical path toward one use case (e.g. reliable read sharing on a school server wifi with K/N laptops) and to fix every damn bug you find along the way until it works or until you recognize defeat. the latter might get you a single working use case (or maybe 3), but it's not going to get you reliability, stability, or leverage. (they also take different amounts of time)
Focus or Divide and Conquer?
morgs> Are the approaches mutually exclusive from a resource point of view? cjb> morgs: my third approach does a fairly even split of three groups into three things. we can argue that this *sort* of approach has been the main problem with OLPC software development and has led to power management that doesn't really work, a mesh that doesn't really work, collaboration that doesn't really work, and a datastore that doesn't really work. m_stone> and, as cjb alluded, we presently feel like we have N demos rather than 1 solid product. some people believe this is because we sent N people in N+2 directions rather than either sending 10N people in N+2 directions or sending N people in, say, 3 directions
marcopg_> I don't think that's matter of people; it's matter of too many ambitious goals
tomeu> I don't think that having twice the number of people would have been a waste of resources - in many projects it is, but in our case... m_stone> marcopg_: the complete absence of a QA team definitely made itself felt here... (and the absence of software attacking the QA problem, e.g. test suites) tomeu> the absence of a DS maintainer, too
tomeu> the problem is that it depends on what we aim to do and, regarding sugar, we just don't know right now... perhaps later today
marcopg_> let me rephrase: I don't think moving everyone to work on the network will help the network a lot, at least on 4 months schedule. gradually adding people to the current team would be much more sensible, imho
m_stone> marcopg_: a) I think scott proposes to make sure that every aspect of networking has at least one person working on it; not to make everyone work on it regardless of whether they are doing something useful m_stone> marcopg_: b) why, precisely, do you anticipate that (a) would fail? (or would be too expensive to pursue?)
marcopg_> because it takes time for people to figure out how to work together in a productive way. if we have uncovered areas then we should cover them. which aspects of networking we are not covering right now?
tomeu> m_stone: do you think that many of our orphaned or under-resourced areas can only be tackled efficiently for people already in the team? (in this case, without wasting most of the new resources' potential) m_stone> depends what you mean by "efficiently"; specifically: I think that we have individuals who have the experience required to 'begin labor' on many areas but I don't feel that they'll have the ability to close noticeable numbers of bugs
tomeu> if we talk about how to distribute resources, we need to characterize every area as in how high is the entry point. some areas will require a longer effort to start being productive, some less
OLE Nepal: Journal & Datastore
BryanWB> hey guys, if you don't mind i have a few ideas of where OLPC should put its resources. I think OLPC should focus on the datastore and Sugar over the next several months, whether that means putting all or most of your human resources towards that. W/ the datastore/Journal being most important. the networking and power work well enough but it is fairly difficult to use the Journal to find stuff. Also, in my week's worth of QA, Sugar just does unexpected stuff. Over the last 2 weeks of working w/ os699, then 702, now 703, I have had the most problems w/ the Journal/datastore then Sugar weirdness. marcopg_> I tend to agree with BryanWB, simply because the Journal is more a base/fundamental use case than network sharing; i.e. losing your work is more critical than not being able to share an activity
cjb> I think my counter-argument would be that if you live somewhere with ample power, have a wireless AP and only one XO, of course you're going to say that networking and power work fine. :) BryanWB> cjb: we are using an AP not active antenna cjb> BryanWB: Yes. That's why I say it's natural that you would think it works fine. What *doesn't* work is the mesh (non-AP) layer. m_stone> cjb: rather, our choice of protocols and implementations is incompatible with the mesh. (I speak relationally because I suspect that we are more likely to change the protocols & implementations than we are to change the firmware)
BryanWB> cjb: I know, but that doesn't work anyway when you've got 70 XO's at one school, AFAIK. mesh is less important than basic usability of datastore tomeu> BryanWB: that's another problem we have, knowing the very different problems of every deployment location. I think peru's next deployments will use mainly mesh, but I'm not sure. BryanWB> tomeu: they can buy a small AP and power backup for Peruvian schools. Power backup is pretty ubiquitous anywhere they have electricity and load-shedding. ** at least in south and east asia ** don't know south america cjb> BryanWB: what about when the kids go home from school, though?
erikos> i think both are important use cases mesh and datastore and should both be addressed - i hope that for august we can do both - at least parts. we once started to provide mesh usability - and should continue to push for it; i think the statement that the datastore should work is out of question cjb> the problem is that no-one is confident we can get there by pushing, rather than by throwing out what we have and redesigning. I guess then we come to a question of whether the datastore can also be made to work properly incrementally.
BryanWB> cjb: it needs to work at school first, that's where the other kids are and they spend 8+ hours per day. Home should be a lower priority cjb> since cscott's time is one of the main variables here, and he'd be doing olpcfs. m_stone> cjb: well, we might be able to have two people work on it. for example, one person could start implementing the testing! erikos> cjb: sure - i mean if we feel we have to throw something out - we might have to keep a beta solution in the hands as well. in the case of the datastore - if we decide we rewrite (i think that is what people feel) we should start now cjb> I think the short answer is that we shouldn't have to be making these decisions ourselves. But if we write them up clearly enough, we can pass the decision up the chain. tomeu> having to choose between mesh and datastore is like choosing between drinking or eating - both are needed
Feedback from Peru?
m_stone> so two questions: 1) what have the peruvian teachers discovered about 703? carla was clearly upset about a mesh bug. marcopg_> m_stone: that's something that the deployment team should figure out - we just don't have enough information at the moment to say it m_stone> marcopg_: in which case the priority should be to get that information, yes? or to change the decision so that it's no longer necessary. marcopg_> m_stone: I think that should be the priority, yeah
What is "compatibility"?
m_stone> marcopg_: you say the "compatibility" requirement is "unknown"? marcopg_> m_stone: I don't know why it has been suddenly added to the list of top reqs. (I'm not against, I'd just like to understand better)
m_stone> marcopg_: first, let's make sure we mean the same thing. compatibility means two things: 1) our software running on other people's platforms and 2) other people's software running on our platform.
m_stone> we mean both of these, I think. marcopg_> but there are a lot of possible approaches with pretty different end results and that's the main reason I want to know why it was pushed - to figure out solutions that meets the exact requirements bemasc> cscott specifically referred to #2, not #1
Why do we want it?
m_stone> agreed that the reasoning behind making it a priority has not been explained - I'll try to articulate the justification. 1) we wish more people were supporting our platform. We think that there are two ways to do this - first, to bring our software to them, and second, to give them an immediate userbase running on our platform and 2) we're concerned, from time to time, that the days of funded support for our platform are limited; therefore, we wish it to be less expensive to maintain.
marcopg_> how much limited? 1-2 years or 1-2 months? ;) m_stone> marcopg_: we've been greatly reassured on this point in recent weeks. (specifically by the fact that kim's budget was accepted). but I think that anxiety is a driving factor in the desire to increase compatibility. as for your question: "both, with different likelihoods"
tomeu> we are having to guess too much, considering we have kids using the laptops in the thousands and adding anxiety to the guessing... m_stone> tomeu: I say the same thing all the time but I don't know what to do about it. the anxiety has greatly abated. we are now confident enough in survival to begin hiring.
Who wants it and how?
marcopg_> m_stone: do we expect countries to actually deploy machines with standard applications installed? and are we going to encourage or discourage it? marcopg_> the degree of compatibility we need is also an important factor for the datastore work even though I think we will be forced to do *something* about the datastore situation by august tomeu> m_stone: which legacy software needs to be run in the laptops and which problems it has? if it's flash apps, then the compatibility stuff gains a totally different meaning if it's java, we may have different problems than APIs
m_stone> marcopg_, tomeu: excellent questions which I do not have authoritative answers for. marcopg_> m_stone: ok, that's a question we need to answer I think
How does it affect the DS (and other APIs)?
tomeu> m_stone: I wonder why the compatibility thing is so tied to a new DS m_stone> tomeu: because of the desire to use the journal as the only navigation tool pointing to local persistent data tomeu> from my pov, the same mechanisms that olpcfs has for that, could be added on top of the current implementation. we could add a fuse thingy on top of the current DS m_stone> tomeu: I'm skeptical that the current DS would withstand that much stress... marcopg_> yeah what m_stone said tomeu> m_stone: ok, I'm only concerned of mixing so many different issues m_stone> tomeu: it's a good concern. I have no good answers for you :(
tomeu> we talk about compatibility, but do something because of performance marcopg_> there is value in the idea of reusing the API though and to start by fixing the implementation m_stone> marcopg_: except for the fact that it's an API that we know is terrible. cjb> I don't think we should be too afraid about breaking APIs, if we have a principled reason for doing it. It happens.
bemasc> I don't think there is very much legacy software worth using marcopg_> m_stone: it is, but we have tons of activities using it out there m_stone> marcopg_: not very many, actually. marcopg_> so in some way we will probably need to keep supporting it
tomeu> I'm perhaps the person who has suffered more because of the DS. I have been promised lots of DS replacements - let me count them... 4! Four times, I have been promised a DS, and have received only half!
Can we drop the current DS?
marcopg_> m_stone: do you think we can drop it and ask authors to port? (real question) m_stone> marcopg_: yes. cjb> And we can offer to update all activity code in our git repo to the new one. cjb> marcopg_: I think so too. marcopg_> ok, I tend to agree. (I'm sure there will be people that don't though). but that's fine
m_stone> marcopg_: I'm not sure that we should commit to doing much else. but I think we could be confident of achieving that thing.
tomeu> I think there are cheap alternatives to breaking existing activities, worth discussing, at least m_stone> tomeu: I don't think there are cheap alternatives to breaking activities that will noticeably improve stability. I don't really believe it. But I'm happy to hear your thoughts. marcopg_> yeah that's a good point
tomeu> m_stone: using python namespaces, having daemons exporting two different dbus interfaces, etc m_stone> tomeu: what does that have to do with the causes of failure in the present system? (which I believe are primarily bitrot and invalid assumptions) tomeu> m_stone: was referring to maintain API compatibility, while revamping all of it
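One way to read tomeu's suggestion of keeping API compatibility while revamping the implementation is a thin wrapper that exposes the old calls on top of a redesigned store. The sketch below is purely illustrative: NewStore, LegacyStoreShim, and all of their methods are hypothetical names, not part of the Sugar datastore or of olpcfs.

```python
class NewStore:
    """Hypothetical redesigned API: entries carry data plus metadata."""

    def __init__(self):
        self._entries = {}

    def put(self, entry_id, data, metadata):
        # Store private copies so later caller mutations do not leak in.
        self._entries[entry_id] = (bytes(data), dict(metadata))

    def get(self, entry_id):
        return self._entries[entry_id]


class LegacyStoreShim:
    """Hypothetical old-style API kept alive as a wrapper over NewStore."""

    def __init__(self, new_store):
        self._new = new_store

    def write(self, entry_id, data):
        # Old-style calls carried no metadata, so supply an empty dict.
        self._new.put(entry_id, data, {})

    def read(self, entry_id):
        data, _metadata = self._new.get(entry_id)
        return data


if __name__ == "__main__":
    store = NewStore()
    old_api = LegacyStoreShim(store)
    old_api.write("my-dog", b"woof")
    print(old_api.read("my-dog"))
```

A shim of this kind keeps old callers working, but as m_stone notes it does nothing by itself about bitrot or invalid assumptions in the underlying implementation.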
marcopg_> personally I think that if we are confident cscott can come up with something solid by august, then it's totally worth to have him focus on the ds m_stone> marcopg_: I'm very confident in him (modulo the risk that he gets sick, gets run over by a bus, gets "promoted", etc.) :) - recall olpc-update... once he understands a problem, he's quite difficult to stop. marcopg_> then let's do it ;) and let's figure out how to properly resource network without Scott, it doesn't seem impossible erikos> m_stone: i think as well that it is worth to have at least one person focusing on the ds reimplementation
bemasc> the high-level API (read(), save(), etc.) will presumably remain unchanged, so most activities won't require any modification tomeu> bemasc: we have some problems with that API, I'm not sure yet how to solve them... well, we overcome those by redesigning, and use those techniques for old activities not breaking afterwards
m_stone> marcopg_: let's see if we can get some better information from the deployment & sales folks first. even if it takes us a week to make a decision, it will be time well spent if it winds up being a better decision. many times over. marcopg_> m_stone: I agree
In more detail, what do we want?
bemasc> m_stone: it would be very nice to have a list of desired legacy-Linux programs. the only ones I know of are gnumeric and inkscape m_stone> bemasc: agreed. (though tomeu & marcopg rightly suggest that flash & java might also be good compatibility targets) bemasc> m_stone: well, flash doesn't write to disk, so it's immaterial. It's essentially out of our hands m_stone> bemasc: totally false. we can decide to ask developers to work on gnash or not. rsavoye has repeatedly said that gnash is shorthanded. we've just never tried to actively assist him.
tomeu> about gnumeric, embedding it like we do with abiword shouldn't be very complicated marcopg_> tomeu: it would require to refactor the code heavily tomeu> marcopg_: yeah, but how many months would take it to you? marcopg_> tomeu: that doesn't scale - we can't embed every possible application m_stone> tomeu: this would not improve compatibility though.
tomeu> inkscape, had some memory usage problems that I don't know if have been solved yet m_stone> tomeu: scott apparently looked at inkscape when he was writing his icon editor - he thinks it would be quite hard to sugarize. (I don't fully understand why)
bemasc> but for both gnumeric and inkscape, the datastore is still not the problem because both of them use the standard OS-provided open/save dialogs, which sugar can substitute for DS-based dialogs tomeu> marcopg_: I have heard gnumeric and inkscape, only bemasc> marcopg_: there really aren't very many applications marcopg_> it's not much a problem of existing applications
tomeu> marcopg_: we are getting back to the same point: we need to have someone from the ministries of education to tell us which legacy applications they feel the need for. we are struggling to provide features that maybe nobody will appreciate, while failing to provide the important ones
marcopg_> if I want to write a cool application for sugar, I know that it will be always confined to that platform and honestly I can see how it's not an appealing perspective. we have the possibility to change this situation without a lot of effort and so I think we should do it. bemasc> marcopg_: yes. It's just a question of priority. homunq> marcopg_++ on should, no opinion on can
tomeu> marcopg_: you, as a good software developer, will abstract the platform dependent code. marcopg_> tomeu: don't design your platform for very good software developers tomeu> marcopg_: my point is not if we should do something or not, what I mean is that I'm not convinced of the urgency of this code-on-xo run-everywhere requirement. the current DS API is insufficient no matter what we decide. and we have already devised a scheme for legacy applications to store its files in the journal. where are we differing?
What compatibility might we try to provide?
marcopg_> I think it would make for a much better experience for developers which start working on sugar. it would put us in a better position in the case of a end-of-funding and it will make more likely that our applications are reused outside sugar. I think those are pretty strong reasons and I argue the effort is minimal.
bemasc> tomeu: (I think we need to distinguish between legacy FD.o compliant applications and legacy non-FD.o-compliant)
tomeu> marcopg_: we can make files dropped in ~/Documents to appear in the journal as well as possible. but those apps won't be able to give a so good experience as an activity. no matter what we do, "You created the document My Dog with the application Gimp" is not as good as "You drawed My Dog with Tim and Silvia". marcopg_> tomeu: apps not specifically designed for sugar will never give you an as good experience as an activity, but that's fine tomeu> as I said, we agree on what needs to be done marcopg_> then let's just do it :P
How should we try to get there?
tomeu> and I think we agree in that the current DS API will need to be scrapped, right? well, the problem is that nobody is discussing what I think needs to be discussed marcopg_> tomeu: what do you think needs to be discussed... :) tomeu> marcopg_: what will be the new API(s) marcopg_> my suggestion about that would be to look into scott prototype and build a prototype UI on the top of it. that would give us a much better base to discuss on. Scott is in charge of designing the API - the more feedback he gets, better the API will be. I'm not really a fan of a long abstract discussion of an API, without experimenting with it. tomeu> marcopg_: well, activity authors may be able to provide something more close to their real needs - I don't think we are at the point where activity authors feedback is useful, yet marcopg_> when olpcfs is more mature, that will certainly be useful
Why is the current DS API unhappy?
Blaketh> tomeu: Can you point me towards some failings of the current DS API? bemasc> activity authors are pretty much happy with the current high-level API, which is what almost all of them use. the only significant issue is an inability to save multiple files as a single entry marcopg_> bemasc: well that's your opinion. I heard a bunch of complaints about the current API, and I've seen a bunch of confused developers tomeu> Blaketh: see http://wiki.laptop.org/go/DatastoreOpenIssues. being able to update a file without having to copy it around is one missing point. tomeu> bemasc: the issue here is API. right now, activity authors need to copy the whole file to a temp dir, modify it, and submit again to the DS. it's the same at the end, but would be good if activity authors weren't concerned about how the DS stores data. the DS passes a path where the activity can read or write to so the DS should make the copy. after the activity is happy with it, commits to the DS, no matter if the activity is Read, VideoEditor, etc
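The workflow tomeu describes, where the DS hands the activity a path, the activity reads or writes there, and then commits, can be sketched as a toy model. Everything here is an assumption made for illustration: SketchDatastore, checkout, commit, and read are hypothetical names, and the plain file copy stands in for whatever storage scheme the real DS would use.

```python
import os
import shutil
import tempfile


class SketchDatastore:
    """Toy model of a checkout/commit datastore; not the real Sugar API."""

    def __init__(self):
        self._root = tempfile.mkdtemp(prefix="ds-store-")  # committed entries
        self._work = tempfile.mkdtemp(prefix="ds-work-")   # working copies
        self._checked_out = {}

    def _stored_path(self, entry_id):
        return os.path.join(self._root, entry_id)

    def checkout(self, entry_id):
        """Return a working path the activity may read from or write to."""
        work_path = os.path.join(self._work, entry_id)
        stored = self._stored_path(entry_id)
        if os.path.exists(stored):
            # The DS makes the copy, not the activity.
            shutil.copyfile(stored, work_path)
        else:
            # New entry: start from an empty file.
            open(work_path, "wb").close()
        self._checked_out[entry_id] = work_path
        return work_path

    def commit(self, entry_id):
        """Store the working copy; the activity never sees how it is kept."""
        work_path = self._checked_out.pop(entry_id)
        shutil.copyfile(work_path, self._stored_path(entry_id))
        os.remove(work_path)

    def read(self, entry_id):
        """Return the committed bytes for an entry."""
        with open(self._stored_path(entry_id), "rb") as f:
            return f.read()


if __name__ == "__main__":
    ds = SketchDatastore()
    path = ds.checkout("my-dog")
    with open(path, "ab") as f:
        f.write(b"woof")
    ds.commit("my-dog")
    print(ds.read("my-dog"))
```

In this shape the activity code is the same whether it is Read or a video editor: it only ever sees a path and a commit call, so the storage details (delta compression or otherwise) can change without touching activities.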
bemasc> anyone thinking about legacy compatibility should know about http://www.pygtk.org/docs/pygtk/class-gtkfilechooser.html - Sugar can provide its own filechooser, thus integrating any GTK application with the datastore without fancy filesystem footwork, and without ever showing paths to the user tomeu> bemasc: well, using the fs as a means for getting things into the journal is better than that, right?