Internet Archive: Difference between revisions

From OLPC
(see also scripts)
 
(2 intermediate revisions by the same user not shown)
Latest revision as of 21:17, 18 July 2007

Description


From the Internet Archive, the online digital library. The Archive hosts a wide assortment of collections, including petabytes of texts, audio, moving images, and software, plus an extensive, if sporadic, archive of the public Web.


Film: Open source movies (Spanish, [Portuguese])

Links


Languages

  • English: A tremendous collection of books, art, music, animations, videos, games, and miscellany.
  • Spanish: many books, 250 open source videos
  • Portuguese: A few hundred books
  • Arabic:
  • Current list of languages and sizes:

Size

The Archive contains roughly 200,000 audio recordings, 230,000 texts, and 80,000 moving images, as well as many reviews and varieties of formats for each of the items.

Formats

  • Many books are currently available as RAW photo images, plain text, PDF, and DjVu. Movies tend to be in MPEG and QuickTime, some at very high resolution suitable for further video editing, and occasionally also in Ogg Theora. Audio tends to be in FLAC, with MP3 copies and often Ogg Vorbis as well. The Archive runs background processing scripts that transform the high-quality originals into smaller, lower-quality files for user convenience. If some works are not available in OLPC's desired format, the Archive can probably do a scripted conversion without too much trouble.
  • Scripting: there is no automated way to take a snapshot of a collection, or to download a dump or tarball; individual entries can be downloaded one by one. Curators at the Archive can help create or review a proposed collection. Trusted Archive users can get unrestricted access to the raw "stacks" to do their own scripted processing on the Archive's servers.
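Since entries have to be fetched one at a time, a snapshot can be approximated by a small script that walks a hand-made list of items. The Python sketch below only illustrates that loop. The item identifier, the file names, and the `https://archive.org/download/<identifier>/<file>` URL layout are assumptions for illustration and are not documented on this page; check them against the Archive before relying on them.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Assumed URL layout (not taken from this page); verify before use.
BASE_URL = "https://archive.org/download"


def build_download_url(identifier, filename):
    """Return the assumed download URL for one file of one item."""
    return f"{BASE_URL}/{quote(identifier)}/{quote(filename)}"


def plan_downloads(items, dest_dir):
    """Turn {identifier: [filenames]} into a list of (url, local_path) pairs."""
    dest = Path(dest_dir)
    plan = []
    for identifier, filenames in items.items():
        for name in filenames:
            url = build_download_url(identifier, name)
            plan.append((url, dest / identifier / name))
    return plan


def download_all(plan):
    """Fetch each planned file in turn, skipping files that already exist."""
    for url, path in plan:
        if path.exists():
            continue
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(path))


# Hypothetical item and file names, for illustration only.
example_items = {"example-book": ["example-book.pdf", "example-book.djvu"]}
for url, path in plan_downloads(example_items, "snapshot"):
    print(url, "->", path)
```

Calling `download_all(plan_downloads(example_items, "snapshot"))` would perform the actual transfers. The planning step is kept separate so that a curator can review the list before any download starts.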

Curator Info

Group: Literature, Music, Film

Group coordinator: A. Druin?

Contributing groups and their curators:

Allottable size: 10G on the school libraries

Assessment

Scope (subjects, ages, other)


Completeness (comprehensiveness for given topic and audience)


Multilingualism (specifically es, pt, en, ar)

Not perfect, but reasonable. The Archive would be happy to work with national or regional libraries in any country to scan public domain books. Such scanning costs about 10 cents per page and produces very high quality scans.


Quality (incl. suitability for audience)

High quality. Book scans are very high resolution (20 MB/page) and are also available processed down to lower resolutions, e.g. as PDFs. OCR results for the book scans are not very good. Audio files are generally in lossless FLAC at 44.1 kHz/16-bit, with a few at 96 kHz and/or 24-bit. Moving image quality varies widely, depending on the source material and how the donor transferred it to digital.
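These figures make it possible to estimate what fits in a school library allotment. The sketch below is a rough back-of-the-envelope calculation in Python, not an official sizing tool. It takes 20 MB per raw page scan and about 10 cents per page from this page. It assumes that "10G" means 10 GB in decimal units, that "44Khz" means the CD rate of 44,100 Hz, and that audio is stereo. The audio figures are for uncompressed PCM; FLAC files will typically be smaller.

```python
RAW_SCAN_MB_PER_PAGE = 20      # from the text: about 20 MB per raw page scan
SCAN_COST_CENTS_PER_PAGE = 10  # from the text: about 10 cents per page
ALLOTMENT_MB = 10 * 1000       # assumption: "10G" means 10 GB, decimal units


def raw_pages_that_fit(allotment_mb=ALLOTMENT_MB, mb_per_page=RAW_SCAN_MB_PER_PAGE):
    """Number of full-resolution page scans that fit in the allotment."""
    return allotment_mb // mb_per_page


def scanning_cost_usd(pages, cents_per_page=SCAN_COST_CENTS_PER_PAGE):
    """Approximate cost in US dollars of scanning the given number of pages."""
    return pages * cents_per_page / 100


def pcm_mb_per_minute(sample_rate_hz, bit_depth, channels=2):
    """Uncompressed PCM size in MB per minute (channels=2 is an assumption)."""
    bytes_per_second = sample_rate_hz * bit_depth * channels / 8
    return bytes_per_second * 60 / 1_000_000


print(raw_pages_that_fit())                      # 500
print(scanning_cost_usd(300))                    # 30.0 (hypothetical 300-page book)
print(f"{pcm_mb_per_minute(44_100, 16):.1f}")    # 10.6
print(f"{pcm_mb_per_minute(96_000, 24):.1f}")    # 34.6
```

Under these assumptions only about 500 raw page scans fit in the allotment, which is why the processed, lower-resolution formats matter for the school library.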

Cultural quality of the works varies widely -- the Archive strives for breadth of inclusion rather than selectivity. The music collection, being focused on touring artists who permit their audiences to record their live concerts, is generally professional. The vast majority of the scanned books come from libraries (such as university libraries), so they were professionally published and were judged by librarians to be worth retaining for the many years required to enter the public domain.

Freeness (license and format)

Copyright notices vary considerably, and are often vague. There is no clear way to search by copyright. There are collections of open source film; scanned texts are largely public domain.

Copyright licenses on the live audio files are controlled by the individual bands and documented per band; these are generally informal notices rather than legal terms. Many such notices restrict "commercial use".

The vast majority of the books available on the Archive are in the public domain. A few contributed by publishers (e.g. O'Reilly) come with licenses. Some such licenses may restrict "commercial use".

Comments, Tags, and Ratings

XO: subsets of the PDF and image collections; books with audiobooks.

School library: a Children's Library of 2000 works, an open video collection (250 Spanish works, plus a Brick Films collection which needs little narration), the audio books collection, and the open books collection (again, with good Spanish subcollections).

Uses: read straight through; design activities around visual elements -- have people write transcripts of audio or video, make up stories to go with them, draw their own illustrations for audio, match text or audio snippets with images, sounds, or video clips... all great ways to start a class. Have discussions about color, composition, tone, conversation, or audience.

How to Add

Sign into the archive.org site and follow the details there. (Login is not required to access the collection.)