Books/Formats

From OLPC

Formats

Book formats should be compressed (to conserve space) and open. In particular, they must not be encumbered by patents, and they must be inclusive: they should not favor any particular vendor.

See wikipedia:Comparison of e-book formats for a more comprehensive list.

EPUB

EPUB is a free standard for reflowable content, meaning that the reading device determines how the content is displayed. It uses XHTML or DTBook to represent text and ZIP as its packaging format. It replaced the older Open eBook standard in 2007. Related subformats and standards include the Open Publication Structure (OPS), Open Packaging Format (OPF), and OEBPS Container Format (OCF).

This is the most recent format supported by the Read activity. It is also supported by FBReader and by many modern ebook publishers.
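Because the packaging is an ordinary zip archive, the layout of such a file can be inspected with general-purpose tools. The following is a minimal sketch in Python, assuming the OCF convention that META-INF/container.xml names the package (OPF) file. The sample archive and the file names inside it (OEBPS/content.opf, OEBPS/chapter1.xhtml) are hypothetical stand-ins, not a complete valid book.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the OCF container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def make_sample_epub():
    """Build a tiny EPUB-like archive in memory (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry is written first and stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/content.opf", "<package/>",
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/chapter1.xhtml",
                    "<html><body><p>Hello</p></body></html>",
                    compress_type=zipfile.ZIP_DEFLATED)
    buf.seek(0)
    return buf


def find_package_path(epub_file):
    """Return the path of the package (OPF) file named by container.xml."""
    with zipfile.ZipFile(epub_file) as zf:
        root = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return rootfile.get("full-path")


if __name__ == "__main__":
    print(find_package_path(make_sample_epub()))
```

A real reader would go on to parse the package file for the reading order; this sketch stops at locating it.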

HTML

Although it was not designed specifically as a book format, HTML is widely used for books. Most newer Project Gutenberg books are available as HTML. Both special-purpose ebook readers and web browsers can be used to read HTML ebooks.

HTML eBook packaging formats

HTML/XHTML has basically "won" as the presentation format for commercial and newly-published eBooks. Most eBook formats are just subsets or supersets of HTML or XHTML, using standard tags like <p>, <img>, <h1>, etc.

The debate has moved on to the organization and representation of such files for offline self-contained viewing.

  • OLPC's own .xol bundle format for Collections is a download format that the Journal unpacks and adds to the Library in Browse.
  • The International Digital Publishing Forum (IDPF) promotes the .epub XML file format; see ePub demystified.
  • There, Bill Janssen comments, "On the eBabel front, let’s examine the other web site packaging formats." Googling for “web page archive format” gives one an interesting list:
    • Microsoft’s MHTML, introduced in IE 5, documented in RFC 2557, and apparently also supported by the Opera browser.
    • Apple’s WebArchive, used by Safari.
    • The Library of Congress’ WARC, which they’re using to preserve Web site captures and which is a draft ISO standard.
    • WARC is based on ARC_IA, the format used by the Internet Archive.
    • There’s also MAF, the Mozilla Archive Format for Firefox 3, Christopher Ottley’s project.
  • The WikiBrowse activity is a web server that serves content from a compressed MediaWiki dump, forming a self-contained, browsable offline wiki reader.

XML

XML is not a directly usable book format, but rather a meta-format on which many modern formats that can be used for books are based, such as ODF and the XHTML variant of HTML. Other XML-based formats include DocBook, popular for computer manuals, and TEI, used in the humanities. Modern web browsers can render XML directly, but making such a display attractive may require a transform (expressed in CSS or XSLT).
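To illustrate what such a transform does, here is a minimal sketch that turns a small XML document into HTML. It uses Python's standard library rather than CSS or XSLT, and the element names book, title and para are hypothetical; they are not taken from DocBook, TEI or any other real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical book markup; the tag names are made up for illustration.
SAMPLE = (
    "<book><title>A Tale</title>"
    "<para>First paragraph.</para><para>Second.</para></book>"
)


def book_to_html(xml_text):
    """Map each source element to a presentational HTML element."""
    src = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for child in src:
        if child.tag == "title":
            ET.SubElement(body, "h1").text = child.text
        elif child.tag == "para":
            ET.SubElement(body, "p").text = child.text
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(book_to_html(SAMPLE))
```

An XSLT stylesheet expresses the same element-to-element mapping declaratively, which is why it is the usual choice when the transform has to run inside a browser.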

Text / ASCII / Unicode

Plain text is the most basic format. It can be compressed with simple, standard compression programs if needed. It is the original and default format for the Project Gutenberg e-texts.

Browse and the Read Etexts activity can render text files. Read can also open them, but it opens them for editing.
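As a rough illustration of how well plain text responds to standard compression, the sketch below compresses a string with gzip and restores it. The sample text is deliberately repetitive, so the ratio it prints is far better than a real book would achieve; treat it only as a demonstration of the round trip.

```python
import gzip


def compress_text(text):
    """Encode text as UTF-8 and compress it with gzip."""
    return gzip.compress(text.encode("utf-8"))


def decompress_text(data):
    """Reverse compress_text: gunzip, then decode UTF-8."""
    return gzip.decompress(data).decode("utf-8")


# Artificially repetitive sample text, used only for illustration.
sample = "It was the best of times, it was the worst of times. " * 200
packed = compress_text(sample)

if __name__ == "__main__":
    print(len(sample.encode("utf-8")), "bytes ->", len(packed), "bytes")
```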

DjVu

The DjVu format was developed to provide a much higher level of compression for scanned paper books than existing formats like JPEG and TIFF can provide.

PDF

The PDF format is a simplified form of the PostScript programming language that only includes the commands necessary to paint ink on the page. It is easy for end users to create PDFs with the Print function of a word processing or drawing application. There are extensive Free/Open Source libraries of functions for creating, editing, and otherwise modifying PDFs, and applications built from them, for example libpoppler and the Poppler PDF Utilities. There are also several Free PDF display programs, including xpdf, kpdf, evince, gv, and ViewPDF. The Read activity uses Evince and Poppler to render PDFs.

OpenDocument

OpenDocument (ODF) is a compressed format (zip-compressed XML) for documents, including books, presentations, and spreadsheets. A complex document with many images can be sent as a single file (unlike HTML), yet its text can reflow to fit the display (unlike PDF). It is also editable. The Write activity on the XO uses libAbiword and can open ODF files.
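The zip-compressed XML structure means that the text of a document can be pulled out with a zip reader and an XML parser. The sketch below assumes the usual ODF layout, in which the body lives in a content.xml entry and paragraphs are text:p elements. The sample archive is a hypothetical, stripped-down stand-in that a real office application would probably not accept.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

CONTENT_XML = (
    f'<office:document-content xmlns:office="{OFFICE_NS}" '
    f'xmlns:text="{TEXT_NS}">'
    "<office:body><office:text>"
    "<text:p>Chapter one.</text:p><text:p>Chapter two.</text:p>"
    "</office:text></office:body></office:document-content>"
)


def make_sample_odt():
    """Build a stripped-down ODF-like archive in memory (hypothetical)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", CONTENT_XML)
    buf.seek(0)
    return buf


def read_paragraphs(odt_file):
    """Return the text of every text:p element in content.xml."""
    with zipfile.ZipFile(odt_file) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


if __name__ == "__main__":
    for paragraph in read_paragraphs(make_sample_odt()):
        print(paragraph)
```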

Greenstone

Greenstone is a self-contained bespoke format for document collections. A Greenstone library allows quick full-text search access to large collections, and is typically smaller than the full text it contains, due to the compression scheme it uses. A Greenstone library can be accessed either via a web server or locally on a (read-only) disk. A complete Greenstone collection can be large, which makes it less useful given the storage constraints of the OLPC laptop.

FictionBook

"FictionBook is an XML format for storage of books where each element of the book is described by tags." Also known as fb2, it is supported by FBReader. In fact, the FBReader website lists other book formats that are not listed here.

DVI / TeX

DVI is the DeVice Independent format, the output of TeX, a typesetting system that is very widely used in academic and open source technical literature. See Wikipedia for more information.