Python Style Guide

From OLPC
Revision as of 14:10, 19 December 2006 by 86.135.8.160 (talk) (Indentation)

Note: this document is still being discussed, and is not authoritative for the OLPC project. Also, no code has been updated to fit this style guide. No code should be updated until this is finalized.

Introduction

This document gives coding conventions for the Python code in the One Laptop Per Child project.

This document was adapted from Guido's original Python Style Guide essay, with some additions from Barry's style guide. This guide was then modified from PEP 8 by Ian for One Laptop Per Child, to cover additional issues that are present in that environment and to make some of the language stronger.


A Foolish Consistency is the Hobgoblin of Little Minds

One of Guido's key insights is that code is read much more often than it is written. The guidelines provided here are intended to improve the readability of code and make it consistent across the wide spectrum of Python code. As PEP 20 says, "Readability counts".

A style guide is about consistency. Consistency with this style guide is important. Consistency within a project is more important. Consistency within one module or function is most important.

But most importantly: know when to be inconsistent -- sometimes the style guide just doesn't apply. When in doubt, use your best judgment. Look at other examples and decide what looks best. And don't hesitate to ask!

Two good reasons to break a particular rule:

  1. When applying the rule would make the code less readable, even for someone who is used to reading code that follows the rules.
  2. To be consistent with surrounding code that also breaks it (maybe for historic reasons) -- although this is also an opportunity to clean up someone else's mess (in true XP style).


A Note On Consistency

When you are interfacing with another library and providing a Python wrapping for its functions, you should always adopt the naming style of that library.

If a library is maintained or authored outside of the OLPC project, you should respect the style guidelines of that library when making edits or additions.

If you are changing the style of a piece of code, this should be done all at once and no other changes should be made at the same time. Whitespace changes in particular should be done separate from even naming changes.


Code lay-out

Indentation

Use 4 spaces per indentation level. Do not use tabs.

The number of spaces used can be easily changed with a script. I think we should give serious consideration to reducing this to 2 spaces per indentation level to minimize the number of line breaks needed and also minimize the whitespace on a screenful of code. Admittedly, lots of people, using 19 and 21 inch monitors, currently use a 4-space standard, but that can be easily fixed with a simple script. Python has a builtin parser module that can be used to do this. If all the code lives in a repository such as SVN, then this can be done as part of the code check-in process without anyone needing to think about it. However, the end-users of the laptop, working on their small screens, will thank you for it.

Maximum Line Length

Limit all lines to a maximum of 79 characters.

There are still many devices around that are limited to 80 character lines; plus, limiting windows to 80 characters makes it possible to have several windows side-by-side. The default wrapping on such devices looks ugly. Therefore, please limit all lines to a maximum of 79 characters. For flowing long blocks of text (docstrings or comments), limiting the length to 72 characters is recommended.

The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. If necessary, you can add an extra pair of parentheses around an expression, but sometimes using a backslash looks better. Make sure to indent the continued line appropriately. Some examples:

   class Rectangle(Blob):
   
       def __init__(self, width, height,
                    color='black', emphasis=None, highlight=0):
           if width == 0 and height == 0 and \
              color == 'red' and emphasis == 'strong' or \
              highlight > 100:
               raise ValueError("sorry, you lose")
           if width == 0 and height == 0 and (color == 'red' or
                                              emphasis is None):
               raise ValueError("I don't think so")
           Blob.__init__(self, width, height,
                         color, emphasis, highlight)

Assert statements in particular tend to go over the line boundaries; so generally asserts should look like this:

   assert value is not None, (
       "value should not be None")

Blank Lines

Vertical whitespace (blank lines) is not that important to readability. For the most part this can be left to the developer's discretion. As a general guideline:

  • Separate top-level function and class definitions with two blank lines.
  • Method definitions inside a class are separated by a single blank line.
  • Extra blank lines may be used (sparingly) to separate groups of related functions. Blank lines may be omitted between a bunch of related one-liners (e.g. a set of dummy implementations).
  • Use blank lines in functions, sparingly, to indicate logical sections.

Encodings (PEP 263)

[note: this diverges from PEP 8]

Python source for the OLPC must contain a Unicode UTF-8 encoding declaration, which looks like:

   # coding: UTF8

Only UTF8 should be used even if you are not using non-ASCII characters in your code. The reason is to make it easy for others to take up any Python file, make modifications and add comments in their own language.

As a special case a file with the UTF8 signature '\xef\xbb\xbf' at the beginning of the file will be detected by Python as a UTF8 file. Do not use or rely on this signature since some editors will remove it. Always include the UTF-8 encoding declaration.

Note that you cannot use unicode in any identifiers in Python; the encoding only applies to Unicode strings like u"a string" and comments. Long strings of text (that are not English) should be in localization files, not in the code itself.

Files produced by OLPC should generally be UTF8-encoded Unicode. Even simple things like config files should be read and written as UTF-8.
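As a minimal sketch of reading and writing a small config file as UTF-8: the helper names write_config and read_config and the "key = value" file layout are hypothetical, not an OLPC API. The sketch assumes io.open, which needs Python 2.6 or later; on older interpreters codecs.open plays the same role.

```python
# coding: UTF8
import io
import os
import tempfile


def write_config(path, settings):
    """Write unicode keys and values as UTF-8 'key = value' lines."""
    f = io.open(path, 'w', encoding='utf-8')
    try:
        for key in sorted(settings):
            f.write(u'%s = %s\n' % (key, settings[key]))
    finally:
        f.close()


def read_config(path):
    """Read the file back, decoding UTF-8 into unicode strings."""
    settings = {}
    f = io.open(path, 'r', encoding='utf-8')
    try:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, value = line.split(u'=', 1)
            settings[key.strip()] = value.strip()
    finally:
        f.close()
    return settings


# Round-trip a value containing a non-ASCII character (a with tilde)
# through a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'example.conf')
write_config(path, {u'city': u'S\xe3o Paulo'})
loaded = read_config(path)
```

Naming the encoding on every open call keeps the result independent of the platform's default encoding.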

Imports

Imports should usually be on separate lines, e.g.:

   Yes: import os
        import sys
   
   No:  import sys, os

It's okay to say this, though:

   from subprocess import Popen, PIPE

[note: this is a soft requirement]

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

Imports should be grouped in the following order:

  1. standard library imports
  2. related third party imports
  3. local application/library specific imports

You should put a blank line between each group of imports. [note: I don't care about the blank line, and consider the ordering to be only a suggestion]

Put any relevant __all__ specification after the imports.

Relative imports for intra-package imports are highly discouraged.

Always use the absolute package path for all imports. Unless and until we settle on Python 2.5 we cannot use PEP 328, and so cannot do explicit relative imports.

"from x import *" is generally discouraged.

You should only import this way from packages that are intended to be used like this (the packages generally define __all__).

You should never use "import *" more than once in a file. If you use it more than once then there is no way to know (without leaving the file) exactly where a name comes from. So long as "import *" is used just once, one can assume when no other source can be found for a name that it must come from this import.

When importing a class from a class-containing module, it's usually okay to spell this:

   from myclass import MyClass
   from foo.bar.yourclass import YourClass

If this spelling causes local name clashes, then spell them

   import myclass
   import foo.bar.yourclass

and use "myclass.MyClass" and "foo.bar.yourclass.YourClass".

[note: I hate "import foo.bar.yourclass" and prefer just "from foo.bar import yourclass" or "from foo.bar.yourclass import YourClass"; this note should probably be changed.]

In summary

A file should generally look like this:

   # -*- coding: UTF8 -*-  (MUST always be used)
   """
   docstring: may also be a unicode or 'raw' string
   If you are using doctest then a raw string is recommended
   (prefix the string with an r)
   [are unicode strings generally preferred for docstrings?
   that would give a prefix of u or ur]
   """
   from __future__ ...
   import stdlib modules
   import external modules
   import internal modules
   __all__ = [...]   # If you use __all__
   constants...
   functions and classes...

__init__.py Files

__init__.py files should generally contain no substantive code. Instead they should import from other modules. Importing from other modules is done so that a package can provide a front-facing set of objects and functions it exports, without exposing each of the internal modules in the package. Note however that this causes the submodules to be eagerly imported; if this is likely to cause unnecessary overhead then the import in __init__.py should be reconsidered.

Whitespace in Expressions and Statements

Pet Peeves

Avoid extraneous whitespace in the following situations:

Immediately inside parentheses, brackets or braces.

   Yes: spam(ham[1], {eggs: 2})
   No:  spam( ham[ 1 ], { eggs: 2 } )

Immediately before a comma, semicolon, or colon:

   Yes: if x == 4: print x, y; x, y = y, x
   No:  if x == 4 : print x , y ; x , y = y , x

[note: if you do not put a space after a comma, it is harder to visually distinguish . from ,; e.g., foo(a,b) and foo(a.b). Please use spaces after commas!]

Immediately before the open parenthesis that starts the argument list of a function call:

   Yes: spam(1)
   No:  spam (1)

Immediately before the open parenthesis that starts an indexing or slicing:

   Yes: dict['key'] = list[index]
   No:  dict ['key'] = list [index]

More than one space around an assignment (or other) operator to align it with another.

   Yes:
   
       x = 1
       y = 2
       long_variable = 3
   
   No:
   
       x             = 1
       y             = 2
       long_variable = 3

[note: I'm soft on this one, though less soft on the others]


Other Recommendations

Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), Booleans (and, or, not).

Use spaces around arithmetic operators:

   Yes:
   
       i = i + 1
       submitted += 1
       x = x * 2 - 1
       hypot2 = x * x + y * y
       c = (a + b) * (a - b)
   
   No:
   
       i=i+1
       submitted +=1
       x = x*2 - 1
       hypot2 = x*x + y*y
       c = (a+b) * (a-b)

Don't use spaces around the '=' sign when used to indicate a keyword argument or a default parameter value.

   Yes:
   
       def complex(real, imag=0.0):
           return magic(r=real, i=imag)
   
   No:
   
       def complex(real, imag = 0.0):
           return magic(r = real, i = imag)

[note: this is really helpful to make the code more readable; please use this convention. Keyword arguments aren't assignments, and this makes that visually clear.]

Compound statements (multiple statements on the same line) are strongly discouraged.

   Yes:
   
       if foo == 'blah':
           do_blah_thing()
       do_one()
       do_two()
       do_three()
   
   Rather not:
   
       if foo == 'blah': do_blah_thing()
       do_one(); do_two(); do_three()

Don't be lazy, just hit enter!

if/else expressions and list comprehensions should not be deeply nested.

   [this needs some examples]
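One possible illustration, offered as a sketch rather than an agreed example: each function below shows, in a comment, a nested form that is hard to read, followed by a flatter equivalent. The function names and sample data are hypothetical.

```python
def size_label(size):
    """Classify a size with plain if/elif statements."""
    # Harder to read as a nested if/else expression:
    #   'small' if size < 10 else ('medium' if size < 100 else 'large')
    if size < 10:
        return 'small'
    elif size < 100:
        return 'medium'
    return 'large'


def even_cells(rows):
    """Collect the even numbers from a list of rows with explicit loops."""
    # Harder to read as a nested comprehension:
    #   [cell for row in rows if row for cell in row if cell % 2 == 0]
    result = []
    for row in rows:
        for cell in row:
            if cell % 2 == 0:
                result.append(cell)
    return result


labels = [size_label(size) for size in (3, 42, 500)]
evens = even_cells([[1, 2, 3], [], [4, 6]])
```

A single-level comprehension, like the one that builds labels, stays readable; the trouble starts when conditions and loops are stacked inside one expression.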


Comments

Comments that contradict the code are worse than no comments. Always make a priority of keeping the comments up-to-date when the code changes!

Comments should go before the thing they are commenting on, like:

   # match will be the regex match object:
   match = None

Or sometimes inside an if statement or other control structure:

   if match is None:
       # None of our attempts to match worked
       raise ValueError("Nothing matched!")

Comments should be complete grammatically correct sentences.

If a comment is short, the period at the end can be omitted. Block comments generally consist of one or more paragraphs built out of complete sentences, and each sentence should end in a period.

Regardless of the language you use, you should write clear and easily understandable sentences. If you use English, many readers will only understand basic English. If you use your native language, many readers will be children who are still learning their language.

When choosing the language for comments, think of who will have to read these comments. If you are writing code that will be used by people in many countries, then English is probably the best choice.

Block Comments

Block comments generally apply to some (or all) code that follows them, and are indented to the same level as that code. Each line of a block comment starts with a # and a single space (unless it is indented text inside the comment).

Paragraphs inside a block comment are separated by a line containing a single #.
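A small sketch of a two-paragraph block comment that follows these rules; the function border_width and its arguments are made up for illustration.

```python
def border_width(frame_width, content_width):
    # The frame is drawn symmetrically, so the space left over after
    # placing the content is split evenly between the two sides.
    #
    # Integer division is used on purpose: widths are whole pixels.
    leftover = frame_width - content_width
    return leftover // 2


width = border_width(100, 80)
```

The comment is indented to the level of the code it describes, each line starts with "# ", and the line holding a lone # separates the two paragraphs.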

Inline Comments

Use inline comments sparingly.

An inline comment is a comment on the same line as a statement. Inline comments should be separated by at least two spaces from the statement. They should start with a # and a single space.

Inline comments are unnecessary and in fact distracting if they state the obvious. Don't do this:

   x = x + 1                 # Increment x

But sometimes, this is useful:

   x = x + 1                 # Compensate for border

Generally comments on separate lines are easier to edit:

   # Compensate for border:
   x = x + 1

Documentation Strings

Conventions for writing good documentation strings (a.k.a. "docstrings") are immortalized in PEP 257.

Write docstrings for all public modules, functions, classes, and methods. Docstrings are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the "def" line.

PEP 257 describes good docstring conventions. Note that most importantly, the """ that ends a multiline docstring should be on a line by itself, and preferably preceded by a blank line, e.g.:

   """Return a foobang
   
   Optional plotz says to frobnicate the bizbaz first.
   
   """

For one liner docstrings, it's okay to keep the closing """ on the same line.

Avoid using ''' for docstrings.


Naming Conventions

Descriptive: Naming Styles

There are a lot of different naming styles. It helps to be able to recognize what naming style is being used, independently from what they are used for.

The following naming styles are commonly distinguished:

  • b (single lowercase letter)
  • B (single uppercase letter)
  • lowercase
  • lower_case_with_underscores
  • UPPERCASE
  • UPPER_CASE_WITH_UNDERSCORES
  • CapitalizedWords (or CapWords, or CamelCase -- so named because of the bumpy look of its letters). This is also sometimes known as StudlyCaps. (Note: When using abbreviations in CapWords, capitalize all the letters of the abbreviation. Thus HTTPServerError is better than HttpServerError.)
  • mixedCase (differs from CapitalizedWords by initial lowercase character!)
  • Capitalized_Words_With_Underscores (ugly!)

There's also the style of using a short unique prefix to group related names together. This is not used much in Python, but it is mentioned for completeness. For example, the os.stat() function returns a tuple whose items traditionally have names like st_mode, st_size, st_mtime and so on. (This is done to emphasize the correspondence with the fields of the POSIX system call struct, which helps programmers familiar with that.)

The X11 library uses a leading X for all its public functions. In Python, this style is generally deemed unnecessary because attribute and method names are prefixed with an object, and function names are prefixed with a module name.

In addition, the following special forms using leading or trailing underscores are recognized (these can generally be combined with any case convention):

_single_leading_underscore:

Weak "internal use" indicator. E.g. "from M import *" does not import objects whose name starts with an underscore.

single_trailing_underscore_:

Used by convention to avoid conflicts with Python keywords, e.g.

 Tkinter.Toplevel(master, class_='ClassName')

__double_leading_underscore:

When naming a class attribute, invokes name mangling (inside class FooBar, __boo becomes _FooBar__boo; see below).

__double_leading_and_trailing_underscore__:

"magic" objects or attributes that live in user-controlled namespaces. E.g. __init__, __import__ or __file__. Never invent such names; only use them as documented.

Prescriptive: Naming Conventions

Names to Avoid

Never use the characters `l' (lowercase letter el), `O' (uppercase letter oh), or `I' (uppercase letter eye) as single character variable names.

In some fonts, these characters are indistinguishable from the numerals one and zero. When tempted to use `l', use `L' instead.

Do not abbreviate names by removing vowels. Instead truncate the name.

   Yes:
   
     func
     decl
    
   No:
   
     fnctn
     dcln [note: these aren't very good examples, because they are just
       *too* ugly to be plausible...]

Module Names

Modules should have short, lowercase names, without underscores.

This naming convention distinguishes modules from both functions and classes. This is important; consider this example from Zope 2:

 from DateTime.DateTime import DateTime

In Zope 2 the DateTime package contained a DateTime module with a DateTime class. As a result when you see "DateTime" in the source you can't be sure if it's referring to the package, module, or class. If the module had been named datetime it would be obvious when you were referring to the module and when you were referring to the class. Similar confusion can exist with functions, which is the motivation for leaving underscores out of module names (but using them in function names).

When an extension module written in C or C++ has an accompanying Python module that provides a higher level (e.g. more object oriented) interface, the C/C++ module has a leading underscore (e.g. _socket).

Like modules, Python packages should have short, all-lowercase names, without underscores.

Class Names

Almost without exception, class names use the CapWords convention. Classes for internal use have a leading underscore in addition.

Exception Names

Because exceptions should be classes, the class naming convention applies here. However, you should use the suffix "Error" on your exception names (if the exception actually is an error). [note: I find the Error suffix to often be redundant, but maybe it is best to use]

Global Variable Names

(Let's hope that these variables are meant for use inside one module only.) The conventions are about the same as those for functions.

Modules that are designed for use via "from M import *" should use the __all__ mechanism to prevent exporting globals, or use the older convention of prefixing such globals with an underscore (which you might want to do to indicate these globals are "module non-public").

Many modules are not really intended to be used with "from M import *" and will export many unintended objects (like other modules). Generally you should not use "import *" unless a module is intended to be used like that, and the presence of __all__ is a good indication if a module is intended to be used that way.

Function Names

Function names should be lowercase, with words separated by underscores as necessary to improve readability.

mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py).

Function and method arguments

Always use 'self' for the first argument to instance methods.

Always use 'cls' for the first argument to class methods.

Always use 'metacls' for the first argument to metaclass methods. These are technically class methods of the metaclass, but if you don't distinguish metaclasses from classes you will confuse readers terribly.

If a function argument's name clashes with a reserved keyword, it is generally better to append a single trailing underscore rather than use an abbreviation or spelling corruption. Thus "print_" is better than "prnt". (Perhaps better is to avoid such clashes by using a synonym.)

Method Names and Instance Variables

Use the function naming rules: lowercase with words separated by underscores as necessary to improve readability.

Use one leading underscore only for non-public methods and instance variables.

Do *not* use two leading underscores. Python mangles these names with the class name: if class Foo has an attribute named __a, it cannot be accessed by Foo.__a. (An insistent user could still gain access by calling Foo._Foo__a.) If you have some reason to want to avoid name clashes in subclasses, you should use *explicit* name mangling by using an explicit prefix in front of your attributes or functions, like Foo._Foo_a.
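The following sketch shows what the mangling does in practice and what the explicit-prefix alternative looks like. Foo follows the text above; the subclass Bar and the attribute names are hypothetical.

```python
class Foo(object):

    def __init__(self):
        # Two leading underscores: Python stores this as _Foo__a.
        self.__a = 1
        # The suggested alternative: one leading underscore plus an
        # explicit, hand-written prefix. No mangling is involved.
        self._Foo_b = 2


class Bar(Foo):

    def __init__(self):
        Foo.__init__(self)
        # Stored as _Bar__a, so it does not overwrite Foo's __a.
        self.__a = 10


bar = Bar()
names = sorted(vars(bar))
```

Here names is ['_Bar__a', '_Foo__a', '_Foo_b']: the mangled names never appear in the source, which makes them harder to search for, while the explicit prefix is visible exactly as written.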

Designing for inheritance

[note: this is rather complex; generally I think designing for inheritance should be avoided except in specific cases where it provides real benefits. In many cases first class functions and other techniques are easier to understand and manage than subclassing.]

Always decide whether a class's methods and instance variables (collectively: "attributes") should be public or non-public. If in doubt, choose non-public; it's easier to make it public later than to make a public attribute non-public.

Public attributes are those that you expect unrelated clients of your class to use, with your commitment to avoid backward incompatible changes. Non-public attributes are those that are not intended to be used by third parties; you make no guarantees that non-public attributes won't change or even be removed.

We don't use the term "private" here, since no attribute is really private in Python (without a generally unnecessary amount of work).

Another category of attributes are those that are part of the "subclass API" (often called "protected" in other languages). Some classes are designed to be inherited from, either to extend or modify aspects of the class's behavior. When designing such a class, take care to make explicit decisions about which attributes are public, which are part of the subclass API, and which are truly only to be used by your base class.

With this in mind, here are the Pythonic guidelines:

  • Public attributes should have no leading underscores.
  • If your public attribute name collides with a reserved keyword, append a single trailing underscore to your attribute name. This is preferable to an abbreviation or corrupted spelling. (However, notwithstanding this rule, 'cls' is the preferred spelling for any variable or argument which is known to be a class, especially the first argument to a class method.) Note 1: See the argument name recommendation above for class methods.
  • For simple public data attributes, it is best to expose just the attribute name, without complicated accessor/mutator methods. Keep in mind that Python provides an easy path to future enhancement, should you find that a simple data attribute needs to grow functional behavior. In that case, use properties to hide functional implementation behind simple data attribute access syntax. Note 1: Properties only work on new-style classes. Note 2: Try to keep the functional behavior side-effect free, although side-effects such as caching are generally fine. Note 3: Avoid using properties for computationally expensive operations; the attribute notation makes the caller believe that access is (relatively) cheap.
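As a sketch of the last guideline, the hypothetical class below keeps radius as a plain data attribute and exposes area as a property, so callers use attribute syntax for both.

```python
import math


class Circle(object):
    """A circle with a plain attribute and a computed one."""

    def __init__(self, radius):
        # Simple public data attribute: no accessor methods needed.
        self.radius = radius

    @property
    def area(self):
        """Area computed from the radius."""
        # Cheap and side-effect free, so a property is appropriate.
        return math.pi * self.radius * self.radius


circle = Circle(2)
# Plain attribute access still works for simple data.
circle.radius = 3
# Looks like data access, but runs the method above.
area = circle.area
```

Because Circle inherits from object it is a new-style class, which properties require. No setter is defined, so assigning to circle.area raises AttributeError.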

Programming Recommendations

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Pyrex, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a+=b or a=a+b. Those statements run more slowly in Jython. In performance sensitive parts of the library, the .join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

[note: I think we can be softer about this, as we need to target more than just CPython but the performance characteristics of our particular software and hardware stack.]
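A minimal sketch of the .join() form; the function name build_report and the line format are made up for illustration.

```python
def build_report(lines):
    """Number each line and combine them with a single .join() call."""
    parts = []
    for number, line in enumerate(lines):
        parts.append('%d: %s' % (number + 1, line))
    # The pieces are combined once here, instead of copying a growing
    # result string on every pass through the loop.
    return '\n'.join(parts)


report = build_report(['alpha', 'beta', 'gamma'])
```

Collecting the pieces in a list and joining them at the end does not depend on any one implementation's optimization of +=.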

Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators.

Note that is and is not compare the identity of an object. == can be overridden and does more complex comparisons, so there is a small performance penalty. There is, and only ever will be, one None.

Also, beware of writing "if x" when you really mean "if x is not None" -- e.g. when testing whether a variable or argument that defaults to None was set to some other value. The other value might have a type (such as a container) that could be false in a boolean context!
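A short sketch of the pitfall, using a hypothetical function whose argument defaults to None:

```python
def describe(items=None):
    """Report how many items were passed, treating [] as a real value."""
    # "if items:" would be wrong here: an empty list is false, so it
    # would be confused with the argument not being passed at all.
    if items is not None:
        return '%d items given' % len(items)
    return 'no list given'


with_empty = describe([])
with_default = describe()
```

With the identity test, describe([]) reports '0 items given' and only the true default falls through to 'no list given'.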

Use class-based exceptions.

String exceptions in new code are strongly discouraged, as they will eventually (in Python 2.5) be deprecated and then (in Python 3000 or perhaps sooner) removed.

Modules or packages should define their own domain-specific base exception class, which should be subclassed from the built-in Exception class. Always include a class docstring. E.g.:

   class MessageError(Exception):
       """Base class for errors in the email package."""

Class naming conventions apply here, although you should add the suffix "Error" to your exception classes, if the exception is an error. Non-error exceptions need no special suffix.

[note: this is a strict requirement in OLPC]
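As a sketch of how a package-level base class is used, assume a hypothetical config package; ConfigError, MissingKeyError and lookup are invented names.

```python
class ConfigError(Exception):
    """Base class for errors in a hypothetical config package."""


class MissingKeyError(ConfigError):
    """Raised when a required key is absent."""


def lookup(settings, key):
    """Return settings[key], raising MissingKeyError if it is absent."""
    if key not in settings:
        raise MissingKeyError("missing required key: %r" % (key,))
    return settings[key]


# Callers can catch the package's base class without knowing
# which specific error was raised.
try:
    lookup({}, 'language')
    caught = None
except ConfigError:
    caught = 'ConfigError'
```

Catching the base class handles every error the package defines, while more specific handlers remain possible where they are needed.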

When raising an exception, use "raise ValueError('message')" instead of the older form "raise ValueError, 'message'".

The paren-using form is preferred because when the exception arguments are long or include string formatting, you don't need to use line continuation characters thanks to the containing parentheses. The older form will be removed in Python 3000.

[note: also a strict requirement for OLPC]
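A brief sketch of the benefit: with the parenthesized form, a long formatted message can wrap inside the call without backslashes. The function check_size is hypothetical.

```python
def check_size(width, height, limit):
    """Return the area, raising ValueError if it exceeds the limit."""
    if width * height > limit:
        # The parentheses of the call let the message span lines
        # without any backslash continuation.
        raise ValueError(
            "area %d (%d x %d) exceeds the limit of %d"
            % (width * height, width, height, limit))
    return width * height


small_area = check_size(2, 3, 100)
```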

Use string methods instead of the string module.

String methods are always much faster and share the same API with unicode strings. Override this rule if backward compatibility with Pythons older than 2.0 is required.

[note: we can be strict here. string.Template is an exception, which is the only reason the string module should be used at all.]

Use .startswith() and .endswith() instead of string slicing to check for prefixes or suffixes.

startswith() and endswith() are cleaner and less error prone. For example:

   Yes: if foo.startswith('bar'):
   
   No:  if foo[:3] == 'bar':

The exception is if your code must work with Python 1.5.2 (but let's hope not!). [note: clearly we don't]

Object type comparisons should always use isinstance() instead of comparing types directly.

   Yes: if isinstance(obj, int):
   
   No:  if type(obj) is type(1):

When checking if an object is a string, keep in mind that it might be a unicode string too! In Python 2.3, str and unicode have a common base class, basestring, so you can do:

   if isinstance(obj, basestring):

In Python 2.2, the types module has the StringTypes type defined for that purpose, e.g.:

   from types import StringTypes
   if isinstance(obj, StringTypes):

In Python 2.0 and 2.1, you should do:

   from types import StringType, UnicodeType
   if isinstance(obj, StringType) or \
      isinstance(obj, UnicodeType):

[note: obviously we can just use basestring; though we need to be careful about distinguishing str and unicode. It is valid and perhaps preferred for us to be careful in distinguishing these two values. assert isinstance(value, unicode) is probably an assert we should use liberally]

For sequences (strings, lists, tuples), use the fact that empty sequences are false.

   Yes: if not seq:
        if seq:
   
   No:  if len(seq):
        if not len(seq):

Don't write string literals that rely on significant trailing whitespace.

Such trailing whitespace is visually indistinguishable and some editors (or more recently, reindent.py) will trim it.

[note: this only applies to multi-line/triple-quoted strings]

Don't compare boolean values to True or False using ==:

   Yes:   if greeting:
   
   No:    if greeting == True:
   
   Worse: if greeting is True:

Strings and Unicode

Generally there are three types of strings:

  1. 8-bit strings ("str") that contain binary data
  2. Unicode strings that contain textual data
  3. Encoded strings, represented as 8-bit strs, that contain textual data

The third form can cause problems. Python is encoding agnostic; the only encoding it does automatically is ASCII. When using ASCII text, an encoded and unicode string look very similar; they compare as equal, they hash to the same value, and str() and unicode() will convert cleanly between the two. Once non-ASCII text is introduced this all breaks.

We should avoid encoded strings when possible. When we expect to receive unicode strings, it is acceptable and even encouraged to do "assert isinstance(value, unicode)".
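One way to keep encoded strings at the edges of a program is to decode on the way in, work on unicode, and encode on the way out. The sketch below uses hypothetical function names. As an assumption made so the block also runs on newer interpreters, it checks against type(u'') rather than the name unicode used in the text above, and it uses b'' literals, which need Python 2.6 or later.

```python
# coding: UTF8


def read_name(raw_bytes):
    """Decode incoming UTF-8 bytes into a unicode string."""
    return raw_bytes.decode('utf-8')


def shout(name):
    """Work on text only; reject encoded byte strings early."""
    # type(u'') is the unicode type on Python 2 and str on Python 3.
    assert isinstance(name, type(u'')), "name must be a unicode string"
    return name.upper() + u'!'


def write_name(name):
    """Encode to UTF-8 bytes only when the text leaves the program."""
    return name.encode('utf-8')


# UTF-8 bytes for "jos" followed by e with an acute accent.
incoming = b'jos\xc3\xa9'
text = read_name(incoming)
outgoing = write_name(shout(text))
```

Passing the undecoded bytes straight to shout trips the assertion, so a missing decode is caught at once and does not surface later as a confusing error on non-ASCII input.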


Internationalization and Localization

If you are writing code for use in many countries then all user-visible strings should be in English and should be translatable. You do this like so:

 from gettext import gettext as _
 import getpass
 
 print _("Hello %(name)s!") % {'name': getpass.getuser()}

Note that string substitutions should be done after the translation via _(). Also, named values should be used. You may find string.Template preferable to %-based substitution; you can use it like:

 import string
 print string.Template(_("Hello $name!")).substitute(name=getpass.getuser())

There's a long document on internationalizing Pylons, most of which applies to any Python i18n code.

[Q: are there translation domains? How does activity translation work? Are we going to monkeypatch gettext.gettext to make it work like we want for activities?]

[Q: what about dates and other localized values?]

[Q: Should we prefer string.Template over % substitution?]

Testing

The Testing Tool Taxonomy provides a long and comprehensive list of test systems available for Python.

There are three core packages that can be used for testing:

doctest

doctest is a standard library module, and a testing system. It's probably the simplest test system to use and read.

This is a common pattern for testing a package:

 if __name__ == '__main__':
     import doctest
     doctest.testmod()
     doctest.testfile('test_this_module.txt')

While this works, it's very easy to forget to run tests after making changes. It's also easy to forget to test for regressions. Because of this, you should provide a way to run all of your tests.

Note that you can put tests in your file's docstrings, or in an external text file. For tests that don't have documentation value an external text file is best (it won't clutter your source or the helpful information in your docstrings). For extended examples external files are also best; inline docstring doctests are mostly best to simply confirm those examples are correct, not to do extensive testing of your routines.
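As a sketch of the inline style, the hypothetical function clamp below carries a few examples in its docstring. The helper run_docstring_tests is also made up; it uses the standard DocTestParser and DocTestRunner classes to run one function's examples and report the counts.

```python
import doctest


def clamp(value, low, high):
    """Return value limited to the range low..high.

    >>> clamp(5, 0, 3)
    3
    >>> clamp(-2, 0, 3)
    0
    >>> clamp(2, 0, 3)
    2

    """
    return max(low, min(value, high))


def run_docstring_tests(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results[0], results[1]


failed, attempted = run_docstring_tests(clamp)
```

The three examples confirm the documented behaviour; more extensive cases would go in an external text file, as described above.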

unittest

unittest is the "standard" standard library testing module. It is modeled after SUnit, JUnit, etc. Tests using this tend to be somewhat long-winded, and not very readable (this is Ian's personal opinion, but he holds it very strongly).

When a project is already using unittest, you should use it for new tests to maintain consistency. Note that doctest can produce unittest-compatible tests. When creating new tests, seriously consider using doctest, as the resulting tests are usually much more readable. This is less true for tests that contain considerable logic (especially things like stress testing, or using fuzzed input).

If you are using unittest-based tests you should provide a test runner as part of your code; this is a script that will run all the tests in your code. While some people use the same kind of __name__ == '__main__' trick for unittest that they do for doctest, this is not desirable (for all the same reasons).

nose

nose is a (non-standard) library/script for finding and running tests. It is based on unittest, and provides the test collection that the other two modules are missing. It can also run doctests directly (without having to explicitly wrap them as unittests) and has some improved features over typical unittest test runners (like showing detail about failed assertions, and dropping into a debugger on failure). It has features very similar to py.test, but is easier to install and is more compatible with unittest-based tests than py.test.

Nose also lets you use simpler tests than unittest's class-based tests. Functions with names starting with test_ will be run.
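
A rough sketch of what such function-based tests might look like (add is a hypothetical function under test; nose itself is not imported, since it finds these functions by name):

```python
def add(a, b):
    """Hypothetical function under test."""
    return a + b


def test_add_integers():
    # A plain assert is enough; no TestCase class is needed.
    assert add(2, 3) == 5


def test_add_strings():
    assert add('ab', 'cd') == 'abcd'
```

Saved in a file such as test_add.py, these functions would be collected by the nosetests command.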

If you use this test runner, it is recommended that you include a shell script or Python script to run nose with your project; this will make it easy for other developers to see how you run your tests.


File names

Except for embedded doctests, tests should generally go in files separate from the module they are testing. This way importing the module will not load the tests and won't add any overhead unless you are actually running the tests.

Tests should be named test_modulename.... You can add more to the name if you have multiple files associated with one module. Use .py for Python-based files (of course), and .txt for external doctests. Tests are sometimes put in a subpackage called tests (note that test is unfortunately used by a very boring standard library module, and it can lead to confusing situations if you use that name). It's also fine to simply put tests right beside the modules they test.

External doctest files that have documentation value should be named the same as the module (with .txt), and should not have a test_ prefix; their primary value is not the testing they do, but the information they convey. Ideally all programmer documentation will use doctest, so that the accuracy of the documentation can be easily confirmed.

Documentation

All documentation should be provided in an e-book format readable by Evince, the OLPC e-book reader. In particular, avoid supplying a directory tree of HTML files. Consolidate these together in a single document.

One possibility is to use something based on TiddlyWiki. This is based on using Javascript to have a tree of many small documents inside a single HTML file. Since the OLPC uses xulrunner it supports Javascript in HTML.

File Layout

[The file layout for Python packages is pretty clear. However, where should other files go? For example, should images go beside Python code? And other declarations, like XML and the activity info file]


Distribution

[Something about distutils, setup.py, setuptools, etc. Should we have a single author, an author list? Should the author email point to a support address or developer discussion list?]


Deprecations and Warnings

When other people use your code, you will have to support them as you update it. Even if you mark your package as being "version 0.1", it doesn't matter -- if your code is useful, and someone uses it, then you'll need to start thinking about backward compatibility, or else make life difficult for your users.

Deprecations and warnings are specifically meant to deal with this. Warnings should seldom go in new code. For instance, you could do:

   import warnings
   def send_content(dest, data):
       if not isinstance(data, str):
           warnings.warn('You should only send str data')
           data = str(data)

But because there are no current users (if this is new code), this should simply be an error:

   def send_content(dest, data):
       assert isinstance(data, str), (
           "data should be a str, not %r" % data)

Then callers will see this error and call str(data) on their end, removing any potential ambiguity.

Warnings are appropriate when you have allowed non-str data in the past and now want to change that. There is no firm rule about when you should simply turn something into an error, and when you should provide warnings.

If you provide a warning, it should be in this form:

 import warnings
 def send_content(dest, data):
     if not isinstance(data, str):
         # Deprecated since 2005-05-01
         warnings.warn(
             'send_content(dest, data=%r) should only be passed a str value for data' % (data,),
             DeprecationWarning, stacklevel=2)
         data = str(data)

DeprecationWarning is a category of warnings. You can disable warnings by category, or turn them into errors. stacklevel=2 means that the bad behavior happened at stack level 2 (the immediate caller of this function). This will show the caller's filename and line number in the warning. You might have to increase this number if you are using more indirection in your code.
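
As a sketch of how a caller might turn this category into an error, or record it in a test, the following assumes a Python version that provides warnings.catch_warnings. The return value of send_content is added only for illustration.

```python
import warnings


def send_content(dest, data):
    if not isinstance(data, str):
        # Deprecated since 2005-05-01
        warnings.warn(
            'send_content(dest, data=%r) should only be passed a str value for data' % (data,),
            DeprecationWarning, stacklevel=2)
        data = str(data)
    return data  # returned only so the example has a visible result


# Turn DeprecationWarning into an exception while this block runs.
with warnings.catch_warnings():
    warnings.simplefilter('error', DeprecationWarning)
    try:
        send_content('somewhere', 42)
    except DeprecationWarning as exc:
        message = str(exc)

# Record warnings instead of printing them, e.g. inside a test.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    send_content('somewhere', 42)
```

catch_warnings restores the previous filters when the block exits, so the stricter setting does not leak into other code.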

Including the date of the deprecation in a comment makes it easier to determine when the deprecated usage should be turned into an error (after some time one can assume all callers have fixed their code).

When a function has been moved or removed, you should start with a warning and later turn it into an error, like this:

 def send_content(dest, data):
     # Moved on 2005-07-10
     raise NotImplementedError(
         'The send_content function has been moved to mypkg.content_sending.send_content')

You should not simply remove a public function; by putting in an error you tell callers exactly how they should update their code. Like warnings, these should eventually be removed. There should always be a stage where you make it an explicit error, as some users may ignore warnings entirely until it is turned into an error.
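
The earlier warning stage for a moved function might look like the sketch below. Here _new_send_content is a hypothetical stand-in for mypkg.content_sending.send_content, defined inline so the example is self-contained, and the date in the comment is only illustrative.

```python
import warnings


def _new_send_content(dest, data):
    """Hypothetical stand-in for mypkg.content_sending.send_content."""
    return (dest, data)


def send_content(dest, data):
    # Deprecated since 2005-05-01 (illustrative date); forwards to the new location.
    warnings.warn(
        'send_content has moved to mypkg.content_sending.send_content',
        DeprecationWarning, stacklevel=2)
    return _new_send_content(dest, data)
```

Old callers keep working during this stage, and the warning tells them where the function now lives before the error stage replaces the forwarding call.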

str, unicode, and repr

There are three ways to coerce an object to text: str(obj), unicode(obj) and repr(obj).

str coerces an object to its non-unicode textual representation. Though this is very commonly used, unicode(obj) should be preferred as it creates unicode text. As an example, here's code that can cause a problem:

 class User(object):
     def __init__(self, name):
         self.name = name
     def __str__(self):
         return 'User %s' % self.name
 u = User(u'Iñtërnâtiônàlizætiøn')
 # This works fine:
 print repr(unicode(u))
 # This won't work:
 print str(u)
 # This won't work either:
 print u
 # This won't work either:
 'Hi ' + str(u)

So what happened there? Well, self.name was a unicode string. When you do 'User %s' % self.name it returns a unicode string as well (str strings are turned into unicode when used with % -- this itself is a little scary). Then str() calls u.__str__(). It sees unicode, and it tries u'Iñtërnâtiônàlizætiøn'.encode('ascii') ('ascii' is sys.getdefaultencoding()).

What is scary here is that if you do your testing with this:

 u = User(u'Bob')

then everything will work, because u'Bob'.encode('ascii') succeeds.
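
The difference between the two names comes down to the implicit ASCII encoding step, which can be sketched directly (encodes_as_ascii is a hypothetical helper for this example):

```python
# -*- coding: utf-8 -*-
def encodes_as_ascii(text):
    """Return True if the unicode string survives an ASCII encode."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


plain_ok = encodes_as_ascii(u'Bob')                           # True
accented_ok = encodes_as_ascii(u'Iñtërnâtiônàlizætiøn')       # False
```

One practical consequence is that test data should include at least one non-ASCII value, so that this kind of failure shows up during testing.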

The moral of this story? If you had implemented __unicode__ there would have been no problem. (Well, calling code could still be broken, but your code would not be broken.)

So:

  • If you are dealing with textual data, use __unicode__ and unicode(obj).
  • If you are really just dealing with binary data, __str__ is okay to use, but it is usually preferable to just use another method that returns the string/binary form of the object. It is not necessary to overload every magic method just because you can.

repr()

The repr(obj) form (and its __repr__ magic method) is intended for programmer representations of objects. These are handy representations that you can use to see what kind of object it is. Sometimes it is true that eval(repr(obj)) == obj, but this should never be relied upon (in fact eval() should generally never be used). More to the point, the repr() of an object should show useful or interesting information about an object. It should never be confused for a textual description. It should never be shown to a user (unless that user is acting as a programmer). If you want to override the default repr() for an object (and that is encouraged!), the general form for this method is:

 class User(object):
     def __init__(self, name):
         self.name = name
     def __repr__(self):
         return '<%s %s name=%r>' % (
             self.__class__.__name__, hex(id(self)), self.name)

Note: we use self.__class__.__name__ so that subclasses won't lie about their class. We use hex(id(self)) so that two similar objects will look distinct (it is often important when there are two different objects with the same data). The id is not necessary for Value Objects. Lastly, any instance variables you want to expose are done through %r, which puts the repr() of those objects into the string. This is very helpful when, for instance, a newline is embedded in the value. Because repr() is most useful when there are bugs, you shouldn't assume all instance variables contain well formed data.
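
A short sketch of these points in action (Admin is a hypothetical subclass added for illustration):

```python
class User(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return '<%s %s name=%r>' % (
            self.__class__.__name__, hex(id(self)), self.name)


class Admin(User):
    """Hypothetical subclass; it inherits __repr__ unchanged."""


# The trailing newline is a deliberate piece of malformed data.
admin = Admin('Bob\n')
text = repr(admin)
```

The result starts with <Admin rather than <User, and because of %r the embedded newline appears as the two characters \n instead of breaking the line.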

It is also acceptable (especially for Value Objects) to use:

 def __repr__(self):
     return '%s(name=%r)' % (self.__class__.__name__, self.name)

You should be careful that repr() return values are not too long. If you sometimes have long data in a value, you might do this:

 def __repr__(self):
     bio_repr = repr(self.bio)
     if len(bio_repr) > 20:
         bio_repr = bio_repr[:15]+'...'+bio_repr[-5:]
     return '<%s bio=%s>' % (self.__class__.__name__, bio_repr)
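
Placed in a complete class, the truncation could be exercised as in this sketch (Profile is a hypothetical class name; the method body is the one shown above):

```python
class Profile(object):
    """Hypothetical class with a potentially long bio attribute."""

    def __init__(self, bio):
        self.bio = bio

    def __repr__(self):
        bio_repr = repr(self.bio)
        if len(bio_repr) > 20:
            # Keep the start and the end, including both quote characters.
            bio_repr = bio_repr[:15] + '...' + bio_repr[-5:]
        return '<%s bio=%s>' % (self.__class__.__name__, bio_repr)


short_text = repr(Profile('likes chess'))   # <Profile bio='likes chess'>
long_text = repr(Profile('x' * 100))        # bio is cut down to 23 characters
```

Truncating the repr() of the value, rather than the value itself, keeps the opening and closing quotes visible, so it stays clear that the data is a string.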

Copyright

Portions of this text are from PEP 8 and are in the public domain. Other portions are under a CC-Attribution license.