Image file formats: Difference between revisions

From OLPC
(Undo revision 208436 by 122.163.48.182 (Talk))
 
(19 intermediate revisions by 7 users not shown)
Line 1: Line 1:
A collection of '''file formats'''.
Information on some '''image file formats'''.


== Primary image formats for OLPC ==
==DJVU==


Whatever format you use, check your file sizes! Even SVG and PNG images can end up being bloated if the right options are not used. Use applications which trim down image file size for the web.
The main site for information on [http://en.wikipedia.org/wiki/Djvu DJVU] compression format for ebooks is here http://www.djvuzone.org/

There are four primary image formats that will probably see much use on the OLPC. All are directly supported by [[XULRunner]] and thus the [[Browse]] activity:

=== SVG ===
[[SVG]] is a vector drawing format ideally used for drawings created with tools like Inkscape, CorelDraw, CAD software, etc. It is an XML text format and thus somewhat human readable when you "View Source".

=== JPEG ===
[[JPEG]] is a compression format that is ideally suited to photographs whether they are scanned or photographed on a digital camera. This is a very, very bad choice for scanned text. You can tell when someone has made this mistake because the text is blurry.

=== PNG ===
[[PNG]] is a ''lossless'' compression format. It is the best choice for icons and other hand-drawn imagery with few colors. It is also the best choice when you have to scan material like line drawings, cartoons, or text. The common factor in all these source materials is that they have only a few different shades of color, perhaps only black and white. Even when the scanned original is stained or has shadows on it, you can usually tell your image editor to convert it to a 2-color black and white image (or increase the contrast to maximum) and sharpen the image.

== Other image file formats ==

Some other formats can be useful as well but they have only limited support in applications and it's not clear if their advantages are worth the pain of non-standard format support.

See also [[Talk:{{PAGENAME}}]] for discussion of other deprecated formats such as TIFF, WMF, etc.

=== Animated PNG ===
[[wikipedia:APNG]] is an animated extension of the PNG format. It is supported by [[XULRunner]] and thus works in [[Browse]]. There are tools to create APNG files from sequences of images and from animated GIFs.

=== GIF ===
[[GIF]] is an old, [http://www.w3.org/Graphics/GIF/spec-gif89a.txt proprietary] but widely supported format. Properly compressed [[PNG]] with index (not true color) is almost always smaller than GIF, see the [[PNG]] page for more details. GIF's remaining niche is support for simple small animations (big ones take up a lot of space), but [[XULRunner]] supports the unofficial Animated PNG extension. GIF works fine in [[Browse]] and all patents related to GIF are expired, but in general prefer [[PNG]], [[JPEG]], and the APNG variant.

=== MNG ===
[[wikipedia:MNG]] is another animated image format. It is not supported by [[XULRunner]] and thus does not work in [[Browse]].

=== SWF ===
[[wikipedia:SWF]] is the proprietary Shockwave Flash file format for vector art and multimedia, as used by the [[Adobe Flash]] player. The XO includes the [[Gnash]] browser plug-in which has some ability to play .swf files, so this file format may be appropriate if you test carefully.

=== JPEG 2000 ===
[http://www.jpeg.org/jpeg2000/ JPEG 2000] is the most recent work of the [[JPEG]] standards committee. ''Unclear if this works in [[Browse]] without a plug-in.'' It is intended to improve upon several areas of the original JPEG standard such as lossy encoding, blocky appearance, and inefficient compression. Although this standard is intended to be implemented on a no-charge basis, the JPEG committee does allow for [http://www.jpeg.org/faq.phtml?action=show_answer&question_id=q3f042a68b1081 inclusion of patented or otherwise non-free technology] as per ITU and ISO policy on [http://en.wikipedia.org/wiki/Reasonable_and_non-discriminatory reasonable and non-discriminatory] terms. Standards that include such technology are problematic or impossible for open source/free software to implement.

== DjVu==
DjVu is not primarily an image format, although it is used to encode compressed page images from books. It is the best choice when you are [[scanning]] a multipage book intended to be used in an [[Ebooks|Ebook]] reader like [[Evince]]. However, it should not be used for images in your applications or in HTML documents. In the scanned-book scenario, it is possible to add OCR results behind the page images, which makes the text searchable without the labor-intensive proofreading that a text-only version would require.

The main site for information on [[wikipedia:DjVu]] compression format for ebooks is here http://www.djvuzone.org/
<br>Recently a [http://software.newsforge.com/article.pl?sid=06/03/08/2314247 good overview article] was published on News Forge.


In a nutshell, DJVU was invented to solve this problem:
In a nutshell, DjVu was invented to solve the following problem:
:Conventional web formats such as JPEG, GIF, and PNG produce prohibitively large image files at decent resolution. As a result, Web site content developers have been largely unable to leverage existing printed materials.


DJVU is intended to be used with scanned images of book pages, either black & white or full color. It then compresses those scanned pages to produce [http://www.djvuzone.org/wid/index.html very highly compressed files].
DjVu is intended to be used with scanned images of book pages, either black & white or full color. It then compresses those scanned pages to produce [http://www.djvuzone.org/wid/index.html very highly compressed files].


Given that the target countries for the OLPC have poorly developed computing infrastructures, [[scanning]] of existing printed documents into DJVU format may be the fastest way of making a wide variety of educational material and [[Ebooks]] available to the kids.
Given that the target countries for the OLPC have poorly developed computing infrastructures, [[scanning]] of existing printed documents into DjVu format may be the fastest way of making a wide variety of educational material and [[Ebooks]] available to the kids.


=== Status of DjVu ===
DJVU is supported by the [[Evince]] reader which is being used by the OLPC project.
DjVu is supported by the [[Evince]] document viewer which is used by the [[Read]] activity.
'''But''', it requires the DjVuLibre backend and/or evince-djvu support which is not packaged in [[Update.1]] or [[8.1.1]] (see {{Ticket|6223}}).


If you want to contribute to the DJVU project in any way, here is the site:
If you want to contribute to the DjVu project in any way, here is the site:
http://djvulibre.djvuzone.org/


===Why is DJVU important?===
====Why is DjVu important?====


In regions where computers are scarce and there is little support for native scripts, DJVU allows existing paper books to be scanned and distributed as ebooks. Even handwritten books can be distributed this way. Tie this together with the OLPC chat application's support for [[SVG]] input and the [[GECKO]] support for displaying [[SVG]] graphics and it is conceivable to distribute a computer with no font support and '''NO TEXT AT ALL''' in its user interface. Icons would substitute for text in the UI and handwriting would be a primary mode of input. Note that the OLPC has a wider than normal touchpad that can be used as a handwriting input device.
In regions where computers are scarce and there is little support for native scripts, DjVu allows existing paper books to be scanned and distributed as ebooks. Even handwritten books can be distributed this way. Tie this together with the OLPC chat application's support for [[SVG]] input and the [[GECKO]] support for displaying [[SVG]] graphics and it is conceivable to distribute a computer with no font support and '''NO TEXT AT ALL''' in its user interface. Icons would substitute for text in the UI and handwriting would be a primary mode of input.


Of course, this is a bootstrap scenario. Once the OLPC is deployed in this way, native language speakers will begin to work on fonts, and a keyboard layout to enable text use on the OLPC. This could take months or years to sort out, but in the meantime, the kids have an educational tool to use.


===Producing DJVU Documents===
===Producing DjVu documents===
[http://www.howtoforge.com/creating_djvu_documents_on_linux A simple guide] to create DJVU files (from JPEG and PDF) and to bind multiple DJVU files into one, is available at howtoforge.
[http://www.howtoforge.com/creating_djvu_documents_on_linux A simple guide] to create DjVu files (from JPEG and PDF) and to bind multiple DjVu files into one, is available at howtoforge.


====Workflow Planning====
====Workflow planning====


First, you need to think of this in terms of setting up a workflow. There are several steps, some of which require technical expertise and some which do not. In addition, the expertise required to set up and maintain the workflow is different from that required to make encoding decisions and check the quality of scans.
Line 33: Line 72:
====Scanning====


Some scanners can handle bound books but they cost a lot more money. However, if you can spare a copy, then you can take it apart and scan the pages on a flatbed scanner. Save the files in an uncompressed [[TIFF]] format because they will be processed further. Pages should be scanned in color because the DJVU compression software produces a better result that way.
Some scanners can handle bound books but they cost a lot more money. However, if you can spare a copy, then you can take it apart and scan the pages on a flatbed scanner. Save the files in an uncompressed [[TIFF]] format because they will be processed further. Pages should be scanned in color because the DjVu compression software produces a better result that way.


If your original scans are not perfect, you may need to use software such as [http://www.i2s-bookscanner.com/en/products_software.asp Book Restorer] or [http://unpaper.berlios.de/ Unpaper] to clean them up. This is especially important when you are scanning old, rare books that have been damaged in some way, for instance stains on the pages. In addition, when a book is rare you cannot cut out the pages to do perfectly flat scans. This means that the scans will be curved but software can repair these curves.
Line 41: Line 80:
====Encoding the Pages====


Next, you need to process the individual page scans with various tools to [http://djvulibre.djvuzone.org/doc/index.html encode the pages]. Different encoding tools may be used for different pages depending on the presence of illustrations, photos, colored text, etc. Pages can be segmented into a black and white layer and a color layer so that different encoders can be used on each. In addition, if you have an OCR program for the script that the book is written in, you can run the black and white segment through it. DJVU readers are capable of using the OCR to do text searches and then highlighting the words in the actual scanned text image.
Next, you need to process the individual page scans with various tools to [http://djvulibre.djvuzone.org/doc/index.html encode the pages]. Different encoding tools may be used for different pages depending on the presence of illustrations, photos, colored text, etc. Pages can be segmented into a black and white layer and a color layer so that different encoders can be used on each. In addition, if you have an OCR program for the script that the book is written in, you can run the black and white segment through it. DjVu readers are capable of using the OCR to do text searches and then highlighting the words in the actual scanned text image.


====Bundling and Postprocessing====
Line 49: Line 88:
'''Testing''':


Don't forget to test your book thoroughly using [[Evince]] to make sure that there are no problems with using it on the OLPC.
Don't forget to test your book thoroughly using the [[Read]] activity to make sure that there are no problems with using it on the OLPC.


====Tools====
Line 55: Line 94:
If you would rather have the scanning done by a [http://www.headway.co.uk/products/capture/lizardtech/lizardtech.htm company with expertise in the field], that is possible. Once the first pilot country is deployed, there will likely be other companies who can offer this service. But the tools needed are all [http://djvulibre.djvuzone.org/ open source] so you can also set up your own production line for scanning books.


===Articles and Papers===
===Articles and papers===

* [http://www.profsurv.com/archive.php?issue=48&article=671 this article from Professional Surveyor magazine] explains how the National Land Survey of Sweden went about converting their historical archive to DJVU format.

* [http://www.howtoforge.com/creating_djvu_documents_on_linux this article on howtoforge] details the entire workflow for producing DJVU books on Linux. It also includes some scripts useful in making the process run more smoothly.


== GIF ==
GIF is old, [http://www.w3.org/Graphics/GIF/spec-gif89a.txt proprietary] but widely supported format. Today it's mostly surpassed by [[PNG]] - in all regards except one: support for simple animations. If you need animated icon or something like that - use GIF, by all means (all patents related to GIF are expired by now), but for anything else... there are [[PNG]] and [[JPEG]].

There is a page on [[choosing image formats]] that will help you to understand the differences and how to know which format will be best for the intended use.


* [http://www.profsurv.com/archive.php?issue=48&article=671 this article from Professional Surveyor magazine] explains how the National Land Survey of Sweden went about converting their historical archive to DjVu format.
More info can be found on [http://en.wikipedia.org/wiki/GIF Wikipedia].


* [http://www.howtoforge.com/creating_djvu_documents_on_linux this article on howtoforge] details the entire workflow for producing DjVu books on Linux. It also includes some scripts useful in making the process run more smoothly.
== JPEG ==
[http://www.jpeg.org/jpeg2000/ JPEG 2000] is the most recent work of the JPEG standards committee. It is intended to improve upon several areas of the original JPEG standard such as lossy encoding, blocky appearance, and inefficient compression.


== See also ==
Although this standard is intended to be implemented on a no-charge basis, the JPEG committee does allow for [http://www.jpeg.org/faq.phtml?action=show_answer&question_id=q3f042a68b1081 inclusion of patented or otherwise non-free technology] as per ITU and ISO policy on [http://en.wikipedia.org/wiki/Reasonable_and_non-discriminatory reasonable and non-descriminatory] terms. It is problematic or impossible for open source/free software to implement such standards.
* [[Data file formats]]


[[Category:File formats]]

Latest revision as of 23:10, 30 June 2009

Information on some image file formats.

Primary image formats for OLPC

Whatever format you use, check your file sizes! Even SVG and PNG images can end up being bloated if the right options are not used. Use applications which trim down image file size for the web.
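One way to act on this advice is a small script that lists the largest images in a content directory. The following is a minimal Python sketch; the function name `oversized_images`, the suffix list, and the size limit are illustrative assumptions, not part of any OLPC tool.

```python
from pathlib import Path

# File suffixes treated as images here; adjust to match your content (assumption).
IMAGE_SUFFIXES = {".svg", ".png", ".jpg", ".jpeg", ".gif"}


def oversized_images(root, limit_bytes):
    """Return (path, size) pairs for image files larger than limit_bytes, largest first."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

For example, `oversized_images("content", 50 * 1024)` would report every image above 50 KB under a hypothetical `content` directory, so the worst offenders can be recompressed first.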

There are three primary image formats that will probably see much use on the OLPC. All are directly supported by XULRunner and thus the Browse activity:

SVG

SVG is a vector drawing format ideally used for drawings created with tools like Inkscape, CorelDraw, CAD software, etc. It is an XML text format and thus somewhat human readable when you "View Source".

JPEG

JPEG is a compression format that is ideally suited to photographs, whether they are scanned or taken with a digital camera. It is a very bad choice for scanned text: its lossy compression blurs sharp edges, so you can tell when someone has made this mistake because the text is blurry.

PNG

PNG is a lossless compression format. It is the best choice for icons and other hand-drawn imagery with few colors. It is also the best choice when you have to scan material like line drawings, cartoons, or text. The common factor in all these source materials is that they have only a few different shades of color, perhaps only black and white. Even when the scanned original is stained or has shadows on it, you can usually tell your image editor to convert it to a 2-color black and white image (or increase the contrast to maximum) and sharpen the image.
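The 2-color conversion described above is, at its simplest, a threshold on brightness. As a rough sketch in plain Python (no image library is assumed; the input is a hypothetical list of rows of 8-bit gray values, and the default threshold of 128 is only a starting point that stained pages may need to lower):

```python
def to_bilevel(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels to pure black (0) or pure white (255).

    Pixels at or above the threshold become white, everything else black.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


# A tiny made-up "scan": dark ink (40), a light stain (200), clean paper (250).
page = [
    [40, 200, 250],
    [250, 40, 200],
]
bilevel = to_bilevel(page)
```

In this made-up example the stain and the paper both end up white, while the ink ends up black, which is the effect an image editor's 2-color conversion aims for.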

Other image file formats

Some other formats can be useful as well but they have only limited support in applications and it's not clear if their advantages are worth the pain of non-standard format support.

See also Talk:Image file formats for discussion of other deprecated formats such as TIFF, WMF, etc.

Animated PNG

APNG (see wikipedia:APNG) is an animated extension of the PNG format. It is supported by XULRunner and thus works in Browse. There are tools to create APNG files from sequences of images and from animated GIFs.
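An APNG file keeps the normal PNG signature, so its first bytes do not reveal whether it is animated. Per the APNG specification, an animated file carries an `acTL` chunk before the first `IDAT` chunk, so a short walk over the chunks is enough to check. The sketch below assumes the whole file is already in memory and does not verify CRCs; `is_animated_png` is a name chosen here for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_animated_png(data):
    """Return True if the PNG bytes contain an acTL chunk before the first IDAT."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        if chunk_type == b"acTL":
            return True
        if chunk_type in (b"IDAT", b"IEND"):
            return False
        pos += 12 + length
    return False
```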

GIF

GIF is an old, proprietary but widely supported format. A properly compressed indexed-color PNG (not true color) is almost always smaller than the equivalent GIF; see the PNG page for more details. GIF's remaining niche is simple, small animations (big ones take up a lot of space), but XULRunner also supports the unofficial Animated PNG extension. GIF works fine in Browse and all patents related to GIF have expired, but in general prefer PNG, JPEG, or the APNG variant.

MNG

MNG (see wikipedia:MNG) is another animated image format. It is not supported by XULRunner and thus does not work in Browse.

SWF

SWF (see wikipedia:SWF) is the proprietary Shockwave Flash file format for vector art and multimedia, as used by the Adobe Flash player. The XO includes the Gnash browser plug-in, which has some ability to play .swf files, so this file format may be appropriate if you test carefully.

JPEG 2000

JPEG 2000 is the most recent work of the JPEG standards committee. It is unclear whether this works in Browse without a plug-in. It is intended to improve upon several areas of the original JPEG standard such as lossy encoding, blocky appearance, and inefficient compression. Although this standard is intended to be implemented on a no-charge basis, the JPEG committee does allow for inclusion of patented or otherwise non-free technology as per ITU and ISO policy on reasonable and non-discriminatory terms. Standards that include such technology are problematic or impossible for open source/free software to implement.
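File extensions are not always trustworthy, so when sorting a collection of images it can help to identify the format from the first bytes of each file. The sketch below checks the well-known signatures of the formats discussed on this page; the function name `sniff_format` and the returned labels are illustrative assumptions. It does not separate APNG from plain PNG, it only recognizes JPEG 2000 in its boxed `.jp2` form, and the SVG check is a loose test for an `<svg` tag in the bytes supplied.

```python
# (signature prefix, label) pairs for the binary formats discussed on this page.
MAGIC_PREFIXES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\x8aMNG\r\n\x1a\n", "MNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),
    (b"FWS", "SWF"),  # uncompressed Flash
    (b"CWS", "SWF"),  # zlib-compressed Flash
    (b"AT&TFORM", "DjVu"),
]


def sniff_format(header):
    """Guess an image format from the leading bytes of a file."""
    for prefix, label in MAGIC_PREFIXES:
        if header.startswith(prefix):
            return label
    # SVG is XML text, so look for the root element instead of a fixed prefix.
    if b"<svg" in header:
        return "SVG"
    return "unknown"
```

Reading the first few hundred bytes of each file (for example with `open(path, "rb").read(512)`) is normally enough input for this kind of check.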

DjVu

DjVu is not primarily an image format, although it is used to encode compressed page images from books. It is the best choice when you are scanning a multipage book intended to be used in an Ebook reader like Evince. However, it should not be used for images in your applications or in HTML documents. In the scanned-book scenario, it is possible to add OCR results behind the page images, which makes the text searchable without the labor-intensive proofreading that a text-only version would require.

The main site for information on the DjVu compression format for ebooks (see wikipedia:DjVu) is http://www.djvuzone.org/
Recently a good overview article (http://software.newsforge.com/article.pl?sid=06/03/08/2314247) was published on News Forge.

In a nutshell, DjVu was invented to solve the following problem:

Conventional web formats such as JPEG, GIF, and PNG produce prohibitively large image files at decent resolution. As a result, Web site content developers have been largely unable to leverage existing printed materials.

DjVu is intended to be used with scanned images of book pages, either black & white or full color. It then compresses those scanned pages to produce very highly compressed files.

Given that the target countries for the OLPC have poorly developed computing infrastructures, scanning of existing printed documents into DjVu format may be the fastest way of making a wide variety of educational material and Ebooks available to the kids.

Status of DjVu

DjVu is supported by the Evince document viewer, which is used by the Read activity. However, it requires the DjVuLibre backend and/or evince-djvu support, which is not packaged in Update.1 or 8.1.1 (see ticket #6223).

If you want to contribute to the DjVu project in any way, here is the site: http://djvulibre.djvuzone.org/

Why is DjVu important?

In regions where computers are scarce and there is little support for native scripts, DjVu allows existing paper books to be scanned and distributed as ebooks. Even handwritten books can be distributed this way. Tie this together with the OLPC chat application's support for SVG input and the GECKO support for displaying SVG graphics and it is conceivable to distribute a computer with no font support and NO TEXT AT ALL in its user interface. Icons would substitute for text in the UI and handwriting would be a primary mode of input.

Of course, this is a bootstrap scenario. Once the OLPC is deployed in this way, native language speakers will begin to work on fonts, and a keyboard layout to enable text use on the OLPC. This could take months or years to sort out, but in the meantime, the kids have an educational tool to use.

Producing DjVu documents

A simple guide to creating DjVu files (from JPEG and PDF) and binding multiple DjVu files into one is available at howtoforge.

Workflow planning

First, you need to think of this in terms of setting up a workflow. There are several steps, some of which require technical expertise and some of which do not. In addition, the expertise required to set up and maintain the workflow is different from that required to make encoding decisions and check the quality of scans.

Scanning

Some scanners can handle bound books but they cost a lot more money. However, if you can spare a copy, then you can take it apart and scan the pages on a flatbed scanner. Save the files in an uncompressed TIFF format because they will be processed further. Pages should be scanned in color because the DjVu compression software produces a better result that way.

If your original scans are not perfect, you may need to use software such as Book Restorer or Unpaper to clean them up. This is especially important when you are scanning old, rare books that have been damaged in some way, for instance stains on the pages. In addition, when a book is rare you cannot cut out the pages to do perfectly flat scans. This means that the scans will be curved but software can repair these curves.

Check this wiki for additional scanning advice.

Encoding the Pages

Next, you need to process the individual page scans with various tools to encode the pages. Different encoding tools may be used for different pages depending on the presence of illustrations, photos, colored text, etc. Pages can be segmented into a black and white layer and a color layer so that different encoders can be used on each. In addition, if you have an OCR program for the script that the book is written in, you can run the black and white segment through it. DjVu readers are capable of using the OCR to do text searches and then highlighting the words in the actual scanned text image.
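As a concrete illustration of choosing an encoder per page, the sketch below builds (but does not run) a command line for two of the DjVuLibre encoders: `cjb2` for black and white pages and `c44` for color or photographic pages. It assumes DjVuLibre is installed, that bitonal pages have been saved as PBM and color pages as PPM or JPEG, and that 300 dpi matches your scans; the function name and the file names are hypothetical.

```python
def encode_command(source, output, bitonal, dpi=300):
    """Build the argument list for encoding one scanned page to DjVu.

    bitonal=True selects cjb2 (black and white pages);
    bitonal=False selects c44 (color or photographic pages).
    """
    tool = "cjb2" if bitonal else "c44"
    return [tool, "-dpi", str(dpi), source, output]


# Hypothetical pages: a text page and a page with a color photograph.
text_cmd = encode_command("page001.pbm", "page001.djvu", bitonal=True)
photo_cmd = encode_command("page002.ppm", "page002.djvu", bitonal=False)
```

Keeping the decision in one function makes it easy for the person checking scan quality to mark each page as text or picture without touching the rest of the workflow. Pages that mix text and pictures need the layer segmentation described above, which this sketch does not attempt.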

Bundling and Postprocessing

After this you have various pieces which you need to bundle together into a multipage book file. Then, you may wish to further process the book to add text annotations, precompute thumbnail images of pages, etc. Perhaps the book is written in an archaic form of the language and you wish to annotate it with a glossary similar to what we do with Shakespeare's plays.
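For the bundling step, DjVuLibre's `djvm` tool can create a multipage document from single-page files with its `-c` option. The sketch below only assembles the command; it assumes DjVuLibre is installed and that the page files are given in reading order. The function name and file names are hypothetical.

```python
def bundle_command(book, pages):
    """Build the djvm command that bundles single-page DjVu files into one book."""
    if not pages:
        raise ValueError("at least one page file is required")
    return ["djvm", "-c", book, *pages]


cmd = bundle_command("book.djvu", ["page001.djvu", "page002.djvu"])
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where the tools are available.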

Testing:

Don't forget to test your book thoroughly using the Read activity to make sure that there are no problems with using it on the OLPC.

Tools

If you would rather have the scanning done by a company with expertise in the field, that is possible. Once the first pilot country is deployed, there will likely be other companies who can offer this service. But the tools needed are all open source so you can also set up your own production line for scanning books.

Articles and papers

  • This article on howtoforge details the entire workflow for producing DjVu books on Linux. It also includes some scripts useful in making the process run more smoothly.

See also