OLPC:Bot Interest Group/Tracking

Revision as of 20:30, 3 August 2008 by Cjl

I've looked at the patterns of vandalism while patrolling the wiki,
analyzed the block logs, and researched how other wikis are handling
the vandalism problem. I wanted to present a few observations.

Observation 1:

Unlike Wikipedia, where very short blocks are used on named users to
provide a "cooling off period" during edit wars or other infringements
of community standards, nearly all blocks imposed on the OLPC wiki can
be attributed to vandalism. This may make our block log an interesting
resource for harvesting a dataset for a counter-vandalism bot based on
artificial neural network techniques. Crispy is working on such a bot
as a successor to Cluebot.


Observation 2:

Have you ever wondered how vandalbots pick the pages they choose to
vandalize? Many incidents of single-page gibberish or page-blanking
vandalism seem to be using:

http://wiki.laptop.org/go/Special:Random

as their targeting mechanism. No special pattern emerges in which pages
get hit by these one-off attacks.


Observation 3:

Recently, there has been an increase in multipage strafing runs by
vandals. Interestingly, it seems clear from reviewing the recent changes
page for the period just prior to the vandalism that the vandals are
selecting pages from:

http://wiki.laptop.org/go/Special:Recentchanges

One particularly disturbing variant of this trend is that rather than
targeting the recently edited page, the vandals are attacking the User
and User Talk pages of recent editors.

This switch to multipage vandalism appears to be growing, suggesting an
increase in sophistication in the automation of attacks. More detailed
analysis of individual vandal edit counts will be needed to confirm this
impression, but I believe that counting pages vandalized (and not just
the number of vandals blocked) would present an even more disturbing
picture of the attacks on the wiki, as multipage vandalism would
essentially multiply the recent numbers several fold.
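
The difference between counting vandals and counting pages vandalized can be sketched as follows. This is only a minimal illustration, assuming the reverted edits have already been extracted into (editor, page title) pairs. The function name `pages_per_vandal` and all of the sample records are hypothetical and are not taken from the actual logs.

```python
from collections import defaultdict


def pages_per_vandal(reverted_edits):
    """Map each editor to the set of distinct pages they vandalized.

    `reverted_edits` is an iterable of (editor, page_title) pairs.
    This input format is an assumption made for the sketch.
    """
    pages = defaultdict(set)
    for editor, page in reverted_edits:
        pages[editor].add(page)
    return dict(pages)


# Made-up sample data, for illustration only.
sample = [
    ("192.0.2.1", "Page A"),
    ("192.0.2.1", "Page B"),
    ("192.0.2.1", "Page B"),  # repeat hit on the same page
    ("ExampleVandal", "User:Someone"),
    ("ExampleVandal", "User talk:Someone"),
    ("ExampleVandal", "Page C"),
    ("198.51.100.7", "Page D"),
]

by_vandal = pages_per_vandal(sample)
vandals_blocked = len(by_vandal)
pages_vandalized = sum(len(p) for p in by_vandal.values())
print(vandals_blocked, pages_vandalized)
```

In this made-up sample the count of vandals is half the count of pages hit, which is the kind of gap a per-vandal tally would expose.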

Observation 4:

Vandals use many different IP addresses to make edits. On occasion,
there is a very slightly suspicious but generally benign-looking edit
made by an anonymous IP (say, introducing an extra blank line) that is
followed some time later (often days later) by a quick spurt of
vandalism by a whole series of other IP addresses. It remains to be
seen if these "scouting missions" can provide a recognizable pattern for
anticipating further attacks.


As an example, this is actually a highly suspicious pattern of edits:

http://wiki.laptop.org/index.php?title=OS_images_for_USB_disks&action=history

One anonymous editor makes a little nonsense edit

http://wiki.laptop.org/go/Special:Contributions/200.243.151.151

which is later corrected by another anonymous editor

http://wiki.laptop.org/go/Special:Contributions/205.209.91.210

What makes this suspicious is that these editors have absolutely no
other edits and these have occurred on a page that is otherwise fairly
static. There may however be no purpose served by preemptive blocking
of these particular IP addresses as it is unlikely that these same IP
numbers will be used again.
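
One way to look for this pattern is sketched below, purely as an illustration. It assumes a page's revision history is available as a list of editor names and that site-wide edit counts per editor are known. The function names, the `max_page_edits` threshold for calling a page "static", and the sample data (using reserved documentation IP ranges) are all hypothetical.

```python
import ipaddress


def is_anonymous(editor):
    """Treat an editor name that parses as an IP address as anonymous."""
    try:
        ipaddress.ip_address(editor)
        return True
    except ValueError:
        return False


def possible_scouts(page_history, site_edit_counts, max_page_edits=5):
    """Return anonymous editors of a quiet page who have no other edits.

    page_history: list of editor names, one per revision of the page.
    site_edit_counts: mapping of editor name to total edits on the wiki.
    max_page_edits: assumed cutoff below which a page counts as static.
    """
    if len(page_history) > max_page_edits:
        return []  # a busy page; single-edit visitors are unremarkable
    return sorted(
        editor
        for editor in set(page_history)
        if is_anonymous(editor) and site_edit_counts.get(editor, 0) == 1
    )


# Made-up sample data, for illustration only.
history = ["RegularEditor", "192.0.2.10", "203.0.113.5"]
edit_counts = {"RegularEditor": 240, "192.0.2.10": 1, "203.0.113.5": 1}

scouts = possible_scouts(history, edit_counts)
print(scouts)
```

A flag from a heuristic like this would only mark edits for human review. As noted above, blocking the flagged addresses themselves is unlikely to help.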

Observation 5:

The vast majority of vandalism is performed by anonymous IP addresses
(unfortunately, so are many legitimate edits). Semi-protection is
sometimes a useful technique for pages that attract repeated attention
from vandals.

Observation 6:

There are enough different patterns to suggest that multiple vandals are
involved.

Observation 7:

Recently there have been a number of multipage vandals that are not
anonymous IP addresses, but rather employ registered user names.


Observation 8:

Individual IP blocks are only temporarily effective. There has been
repeated vandalism by IP addresses after the initial block has expired.
In addition, there are some 4.3 billion IPv4 addresses, so a simple IP
blocking strategy must ultimately fail.
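
The size of the IPv4 address space, and the reason a per-address block list does not catch a vandal who switches addresses, can be checked with Python's standard `ipaddress` module. This is a small sketch; the addresses are reserved documentation addresses chosen for illustration and are not taken from the block log.

```python
import ipaddress

# Total number of addresses in the IPv4 space (2**32).
ipv4_space = ipaddress.ip_network("0.0.0.0/0").num_addresses
print(ipv4_space)

# A hypothetical block list of individual addresses.
blocked = {
    ipaddress.ip_address("192.0.2.1"),
    ipaddress.ip_address("198.51.100.7"),
}

# The same vandal returning from a fresh address is not covered.
returning = ipaddress.ip_address("203.0.113.9")
still_blocked = returning in blocked
print(still_blocked)
```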

Observation 9:

It is generally held that the "community of editors" can address the
damage caused by vandalism and that the "community of sysops" will
collectively fight vandalism with blocks and other tools. The charting
of the block log data shows that this is simply not the case. The
majority of vandalism blocks (in any given time period) are performed
by one or two sysops. This imposes a significant burden on those who
are willing to take on the task. It should be noted that these wiki
"sheriffs" seem to have an unfortunately short term in office. It
would be very disturbing if the vandalism fight is "burning out" sysops
who a) could be making other contributions and b) may just give up when
a successor steps up to the fight.
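
The concentration of blocks among one or two sysops per period can be measured along the following lines. This is a hedged sketch, assuming the block log has been reduced to (month, sysop) pairs. The function name `top_sysop_share_by_month` and the sample entries are hypothetical and do not reproduce the real log.

```python
from collections import Counter, defaultdict


def top_sysop_share_by_month(block_log, top_n=2):
    """Return, per month, the fraction of blocks made by the top_n sysops.

    block_log: iterable of (month, sysop) pairs, one per block.
    This input format is an assumption made for the sketch.
    """
    by_month = defaultdict(Counter)
    for month, sysop in block_log:
        by_month[month][sysop] += 1

    shares = {}
    for month, counts in by_month.items():
        total = sum(counts.values())
        top = sum(n for _, n in counts.most_common(top_n))
        shares[month] = top / total
    return shares


# Made-up sample data, for illustration only.
sample_log = (
    [("2008-06", "SysopA")] * 4
    + [("2008-06", "SysopB"), ("2008-06", "SysopC")]
    + [("2008-07", "SysopD")] * 3
    + [("2008-07", "SysopA")]
)

shares = top_sysop_share_by_month(sample_log)
for month in sorted(shares):
    print(month, round(shares[month], 2))
```

A share that stays near 1.0 month after month, with the names at the top changing, would match the short "term in office" described above.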

One discouraging observation was a case in which a sysop (who was an
interested editor on the page in question) corrected a single vandalism
edit that was part of a series and blocked the vandal, but performed no
further investigation or rollback of the vandal's other edits. This
sort of "free-riding" is a bad sign for a community-maintained resource.

Observation 10:

Whereas one can assume that spam on lang-en Wikipedia is mostly in
lang-en, the OLPC wiki does show signs of multilingual spam attacks.
This may place a high premium on employing techniques with more
sophisticated heuristics than recognition based on lang-en pattern
matching.


Observation 11:

There was a significant spike in vandalism temporally associated with
G1G1. Recent trends indicate an increasing number of vandalism
incidents. I believe further analysis would very possibly reveal an
increase in the number of pages vandalized per attack. That would
suggest that the grand totals by month may be an underestimate of the
real scope of the vandalism problem in recent months.