OLPC:Bot Interest Group/Tracking
I've looked at the patterns of vandalism while patrolling the wiki, analyzed the block logs, and researched how other wikis are handling the vandalism problem. I want to present a few observations.
Unlike Wikipedia, where very short blocks are used on named users to provide a "cooling off period" during edit wars or other infringements of community standards, nearly all blocks imposed on the OLPC wiki can be attributed to vandalism. This may make our block log an interesting resource for harvesting a dataset for a counter-vandalism bot based on artificial neural network techniques. Crispy is working on such a bot as a successor to Cluebot.
Have you ever wondered how it is that vandalbots pick the pages they choose to vandalize? Many incidents of single gibberish or page blanking vandalism seem to be using:
as their targeting mechanism; no special pattern emerges in which pages get hit by these one-off attacks.
Recently, there has been an increase in multipage strafing runs by vandals. Interestingly, it seems clear from reviewing the recent changes page for the period just prior to the vandalism that the vandals are selecting pages from:
One particularly disturbing variant of this trend is that, rather than targeting the recently edited page, the vandals are attacking the User and User Talk pages of recent editors.
This switch to multipage vandalism appears to be growing, suggesting an increase in sophistication in the automation of attacks. More detailed analysis of individual vandal edit counts will be needed to confirm this impression. Still, I believe that counting pages vandalized (and not just the number of vandals blocked) would present an even more disturbing picture of the attacks on the wiki, as multipage vandalism would essentially multiply the recent numbers several fold.
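The difference between counting vandals and counting pages vandalized can be sketched roughly as follows. This is only an illustration: the function name `pages_per_vandal`, the `(editor, page)` tuple layout, and the sample rows are all hypothetical and are not taken from the actual logs.

```python
from collections import defaultdict

def pages_per_vandal(vandal_edits):
    """Count the distinct pages hit by each editor.

    vandal_edits: iterable of (editor, page) tuples for edits already
    judged to be vandalism (a hypothetical input format).
    """
    pages = defaultdict(set)
    for editor, page in vandal_edits:
        pages[editor].add(page)  # repeat hits on one page count once
    return {editor: len(titles) for editor, titles in pages.items()}

# Made-up sample data using documentation-only IP addresses.
sample = [
    ("192.0.2.1", "Main Page"),
    ("192.0.2.2", "User:Alice"),
    ("192.0.2.2", "User talk:Alice"),
    ("192.0.2.2", "Sandbox"),
    ("192.0.2.2", "Sandbox"),
]

counts = pages_per_vandal(sample)
vandals_blocked = len(counts)           # what the block log shows
pages_vandalized = sum(counts.values())  # what the cleanup effort looks like
```

In this made-up sample two vandals account for four vandalized pages, which is the kind of gap between the two measures that the paragraph above anticipates.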
Vandals use many different IP addresses to make edits. On occasion, a very slightly suspicious but generally benign-looking edit is made by an anonymous IP (say, introducing an extra blank line) and is followed some time later (often days later) by a quick spurt of vandalism from a whole series of other IP addresses. It remains to be seen whether these "scouting missions" can provide a recognizable pattern for anticipating further attacks.
As an example, this is actually a highly suspicious pattern of edits:
One anonymous editor makes a little nonsense edit
which is later corrected by another anonymous editor
What makes this suspicious is that these editors have absolutely no other edits, and the edits occurred on a page that is otherwise fairly static. There may, however, be no purpose served by preemptively blocking these particular IP addresses, as it is unlikely that the same addresses will be used again.
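One possible way to look for this pattern in an edit history is sketched below. It is a rough heuristic under stated assumptions, not a tested detector: the function names, the `(timestamp, editor, page)` tuple layout, the 30-day "quiet" threshold, and the sample rows are all hypothetical.

```python
import ipaddress
from collections import Counter, defaultdict
from datetime import datetime, timedelta

def is_anonymous(editor):
    """Treat an editor name that parses as an IP address as anonymous."""
    try:
        ipaddress.ip_address(editor)
        return True
    except ValueError:
        return False

def find_scout_pairs(edits, quiet=timedelta(days=30)):
    """Return (page, first_editor, second_editor) for suspicious pairs.

    A pair is flagged when two consecutive edits to a page come from
    different anonymous editors who each have exactly one edit in the
    data, and the page had been untouched for at least `quiet` before.
    """
    edits = sorted(edits)
    edit_counts = Counter(editor for _, editor, _ in edits)
    history = defaultdict(list)
    for when, editor, page in edits:
        history[page].append((when, editor))

    pairs = []
    for page, revisions in history.items():
        for i in range(1, len(revisions) - 1):
            prev_time, _ = revisions[i - 1]
            first_time, first = revisions[i]
            _, second = revisions[i + 1]
            one_offs = (
                first != second
                and is_anonymous(first) and is_anonymous(second)
                and edit_counts[first] == 1 and edit_counts[second] == 1
            )
            if one_offs and first_time - prev_time >= quiet:
                pairs.append((page, first, second))
    return pairs

# Made-up sample history using documentation-only IP addresses.
sample = [
    (datetime(2008, 1, 5), "Alice", "Hardware specs"),
    (datetime(2008, 6, 1), "192.0.2.10", "Hardware specs"),
    (datetime(2008, 6, 3), "198.51.100.7", "Hardware specs"),
    (datetime(2008, 6, 2), "Alice", "News"),
]
suspicious = find_scout_pairs(sample)
```

Such a check could only flag candidates for a human patroller to review; as noted above, blocking the flagged addresses themselves is unlikely to help.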
The vast majority of vandalism is performed from anonymous IP addresses (unfortunately, so are many legitimate edits). Semi-protection is sometimes a useful technique for pages that attract repeated attention from vandals.
There are enough different patterns to suggest that multiple vandals are involved.
Recently there have been a number of multipage vandals who do not edit from anonymous IP addresses but instead use registered user names.
Individual IP blocks are only temporarily effective. There has been repeated vandalism from IP addresses once the initial block has expired. In addition, there are roughly 4.3 billion IPv4 addresses, so a simple IP blocking strategy must ultimately fail.
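The size of the address space can be checked with a few lines of arithmetic. This sketch uses Python's standard `ipaddress` module; the figure of 1,000 single-address blocks is a made-up number chosen only for illustration, not a count from the block log.

```python
import ipaddress

# The whole IPv4 space is 2**32 addresses.
total_ipv4 = ipaddress.ip_network("0.0.0.0/0").num_addresses

# Hypothetical figure: suppose 1,000 individual addresses were blocked.
blocked = 1000
fraction_blocked = blocked / total_ipv4

# Even a range block of a full /24 covers only 256 addresses.
range_block = ipaddress.ip_network("192.0.2.0/24").num_addresses
```

Under that assumption the blocked share of the address space is well under one millionth, which illustrates why blocking one address at a time cannot keep up.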
It is generally held that the "community of editors" can address the damage caused by vandalism and that the "community of sysops" will collectively fight vandalism with blocks and other tools. The charting of the block log data shows that this is simply not the case. The majority of vandalism blocks (in any given time period) are performed by one or two sysops. This imposes a significant burden on those who are willing to take on the task. It should be noted that these wiki "sheriffs" seem to have an unfortunately short term in office. It would be very disturbing if the vandalism fight is "burning out" sysops who a) could be making other contributions and b) may just give up when a successor steps up to the fight.
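The concentration of blocking work could be measured along the following lines. This is a minimal sketch: the function name `top_sysop_share`, the `(sysop, blocked_target)` tuple layout, and the sample log are hypothetical and do not reproduce the real block log.

```python
from collections import Counter

def top_sysop_share(block_log, top_n=2):
    """Fraction of blocks performed by the `top_n` most active sysops.

    block_log: iterable of (sysop, blocked_target) tuples for one
    time period (a hypothetical input format).
    """
    counts = Counter(sysop for sysop, _ in block_log)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(top_n))
    return top / total

# Made-up sample period: ten blocks spread over three sysops.
sample_log = (
    [("SysopA", f"192.0.2.{i}") for i in range(6)]
    + [("SysopB", f"198.51.100.{i}") for i in range(3)]
    + [("SysopC", "203.0.113.1")]
)
share = top_sysop_share(sample_log)
```

Computing this share per month, and noting which names sit in the top positions, would also show how quickly the "sheriffs" turn over.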
One slightly discouraging observation was the correction of a single vandalism edit (that was part of a series) and a block by a sysop editor (who was an interested editor on the page in question); however, no further investigation or rollback of the vandal's other edits was performed. This sort of "free-riding" is a bad sign for a community-maintained resource. It suggests a need to communicate more clearly to sysops (and users) what is expected of them when they find vandalism, possibly by publishing or referencing some wiki-patrolling guidelines.
Whereas one can assume that spam on the lang-en Wikipedia is mostly in lang-en, the OLPC wiki does show signs of multilingual spam attacks. This may place a high premium on employing techniques with more sophisticated heuristics than recognition based on lang-en pattern matching.
There was a significant spike in vandalism temporally associated with G1G1. Recent trends indicate an increasing number of vandalism incidents. I believe further analysis would very possibly reveal an increase in the number of pages vandalized per attack. That would suggest that the grand totals by month may underestimate the real scope of the vandalism problem in recent months.